Problem in Unicode Line Breaking Algorithm

2008/2/24
Satoshi Nakagawa

Summary

、 (U+3001) and a subsequent ASCII token should be breakable into two lines.

。 (U+3002) has the same problem.

Description

For example, in Japanese writing, は、abc should be breakable into two lines like:

は、
abc

This is because Japanese writers use 、 and 。 just like the comma and period in English. We can break a line after a comma or period in English, but the current Unicode line breaking algorithm doesn't allow this behavior for 、 and 。.

I think this is a problem in the Unicode line breaking algorithm standard itself, not in a particular implementation.

See Unicode Standard Annex #14 Line Breaking Properties.

CL: Closing Punctuation (XB)

3001..3002 IDEOGRAPHIC COMMA..IDEOGRAPHIC FULL STOP

、 (U+3001) and 。 (U+3002) are specified as CL characters.

LB30: Do not break between letters, numbers, or ordinary symbols and opening or closing punctuation.
CL × (AL | NU)

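It says that there is no break opportunity between a CL character and a following alphabetic or numeric character. A break before 、 is already prohibited, because the algorithm never breaks before closing punctuation. As a result, we cannot break at any position in は、abc.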
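The following is a minimal sketch of that reasoning in Python, not an implementation of UAX #14. It models only the pair rules that matter for this example (LB13, LB28, LB30 and LB31, numbered as in the revision of the annex quoted above). The function names and the tiny class table are assumptions made for illustration.

```python
# Simplified model: only the UAX #14 pair rules relevant to this example.
# Line break classes used: ID (ideographic), CL (closing punctuation), AL (alphabetic).
CLASSES = {"は": "ID", "、": "CL", "。": "CL"}


def line_break_class(ch):
    # Assumption: every character not listed above is treated as AL
    # (which is the case for a, b and c).
    return CLASSES.get(ch, "AL")


def break_opportunities(text):
    """Return the indices i where a break is allowed between text[i-1] and text[i]."""
    allowed = []
    for i in range(1, len(text)):
        before = line_break_class(text[i - 1])
        after = line_break_class(text[i])
        if after == "CL":                     # LB13: do not break before CL
            continue
        if before == "CL" and after == "AL":  # LB30: CL × (AL | NU)
            continue
        if before == "AL" and after == "AL":  # LB28: AL × AL
            continue
        allowed.append(i)                     # LB31: break everywhere else
    return allowed


print(break_opportunities("は、abc"))  # prints []
```

The empty list means that, under these rules, the string has no break opportunity between any two of its characters.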

In my opinion 、 and 。 should not be treated as CL, because the LB30 rule should not apply to them. In conclusion, they should be assigned to a different class.
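To illustrate the effect of such a change, here is a small Python sketch that compares the two classifications. The class name HYPO is purely hypothetical and is not defined by UAX #14. The sketch also assumes that the new class would still forbid a break before 、 and 。, and would simply not be covered by LB30. The function name and the simplified class assignment are likewise assumptions.

```python
def break_opportunities(text, punct_class="CL"):
    """Indices where a break is allowed; punct_class is the class given to 、 and 。."""

    def cls(ch):
        if ch in "、。":
            return punct_class
        # Assumption: は is ID, and every other character is AL.
        return "ID" if ch == "は" else "AL"

    allowed = []
    for i in range(1, len(text)):
        before, after = cls(text[i - 1]), cls(text[i])
        # No break before 、 or 。 under either classification
        # (LB13 for CL; assumed to be kept for the hypothetical class).
        if after in ("CL", "HYPO"):
            continue
        if before == "CL" and after == "AL":  # LB30 applies to CL only
            continue
        if before == "AL" and after == "AL":  # LB28: AL × AL
            continue
        allowed.append(i)
    return allowed


print(break_opportunities("は、abc", "CL"))    # prints []
print(break_opportunities("は、abc", "HYPO"))  # prints [2]
```

With the hypothetical class, the only break opportunity is at index 2, between 、 and a, which is the desired line break.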

Related Problem

WebKit uses ICU for line breaking, and ICU implements the Unicode line breaking algorithm very strictly. Therefore this problem is reproducible in Safari.