Purely Technical: Normalizing Fullwidth Unicode Characters for Search

Over the past couple of weeks, I’ve been working on MediaWiki’s search functionality. First, I fixed a bug I found with Unicode normalization in the lucene-search extension used by Wikipedia itself. Later, I adapted the normalization that Chinese- and Japanese-language MediaWikis have been doing for fullwidth characters so that it applies to all languages.

If you’re like me when I started this, you don’t really have a clue what I mean by “fullwidth Latin characters,” even if you know what Unicode is. If you’re familiar with Unicode, you know that its first block is compatible with ASCII. The alphabetic characters there are called “halfwidth”: typographically, they don’t mix easily with “fullwidth” Japanese and Chinese characters, which take up roughly twice the horizontal space. To accommodate mixing Latin-script languages (like English) with these ideogram-based languages, Unicode provides fullwidth forms of the Latin letters. These fullwidth forms (ＡＢＣ vs. ABC) occupy a different area of the Unicode character set, and if your software doesn’t normalize them, then searching for “Mark” won’t match “Ｍａｒｋ”. (And if you look over the Wikipedia page titled “Latin characters in Unicode”, you’ll find several copies of the alphabet scattered across the Unicode space.)
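To make the idea concrete, here’s a minimal sketch in Python (not the actual MediaWiki or lucene-search code, which lives in PHP and Java): Unicode’s compatibility normalization form, NFKC, folds each fullwidth letter back to its ordinary ASCII equivalent, which is essentially the mapping a search stack needs to apply to both the indexed text and the query.

```python
import unicodedata

def fold_fullwidth(text: str) -> str:
    """Fold fullwidth compatibility characters to their halfwidth forms via NFKC."""
    return unicodedata.normalize("NFKC", text)

# "Ｍａｒｋ" is spelled with fullwidth letters (U+FF2D, U+FF41, U+FF52, U+FF4B).
print(fold_fullwidth("Ｍａｒｋ"))            # -> Mark
print(fold_fullwidth("Ｍａｒｋ") == "Mark")  # -> True
```

Note that NFKC folds more than just the fullwidth forms (ligatures, superscripts, and so on), so a search pipeline that only wants width folding might apply a narrower mapping; the sketch above is simply the shortest way to show the equivalence.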

One thought on “Purely Technical: Normalizing Fullwidth Unicode Characters for Search”

  1. So *that’s* why the Chinese and Japanese Wikipedias have this weird font even for interface elements written in English (I chose the English interface language in my preferences). Thanks for explaining this!
