Japanese Text Handling Problem in Unicode Collation Algorithm

2009/10/12
Satoshi Nakagawa /

Summary

In the current Unicode Collation Algorithm, (U+3063) is considered as a small form of (U+3064). Then, these characters are treated as the same just like a and A in the primary strength and in the secondary strength.

But in Japanese semantics, (U+3063) and (U+3064) should be always treated as different characters even in a case insensitive context.

Description

As you know in English, abc and ABC are treated as the same in a case insensitive context.

But in Japanese, for example, あった and あつた are different words in any contexts. Because in Japanese semantics, is not considered as a small form of . These characters are never treated as the same characters.

In the current Unicode Collation Algorithm, the character pairs in the following list are treated as the same in the primary strength and in the secondary strength. But these character pairs should be always treated as different characters.

あ (U+3042) ぁ (U+3041)
い (U+3044) ぃ (U+3043)
う (U+3046) ぅ (U+3045)
え (U+3048) ぇ (U+3047)
お (U+304A) ぉ (U+3049)
か (U+304B) ゕ (U+3095)
け (U+3051) ゖ (U+3096)
つ (U+3064) っ (U+3063)
や (U+3084) ゃ (U+3083)
ゆ (U+3086) ゅ (U+3085)
よ (U+3088) ょ (U+3087)
わ (U+308F) ゎ (U+308E)
ア (U+30A2) ァ (U+30A1)
イ (U+30A4) ィ (U+30A3)
ウ (U+30A6) ゥ (U+30A5)
エ (U+30A8) ェ (U+30A7)
オ (U+30AA) ォ (U+30A9)
カ (U+30AB) ヵ (U+30F5)
ケ (U+30B1) ヶ (U+30F6)
ツ (U+30C4) ッ (U+30C3)
ヤ (U+30E4) ャ (U+30E3)
ユ (U+30E6) ュ (U+30E5)
ヨ (U+30E8) ョ (U+30E7)
ワ (U+30EF) ヮ (U+30EE)
ア (U+FF71) ァ (U+FF67)
イ (U+FF72) ィ (U+FF68)
ウ (U+FF73) ゥ (U+FF69)
エ (U+FF74) ェ (U+FF6A)
オ (U+FF75) ォ (U+FF6B)
ヤ (U+FF94) ャ (U+FF6C)
ユ (U+FF95) ュ (U+FF6D)
ヨ (U+FF96) ョ (U+FF6E)

Reproduction

You can see this problem in the search filter in Safari.

Because WebKit uses ICU for searching. ICU has a very strict implementation of the Unicode Collation Algorithm. Therefore this problem is reproducible in Safari.