Problem in Unicode Line Breaking Algorithm

It should be possible to break a line between 、 (U+3001) and an ASCII token that follows it.

。 (U+3002) has the same problem.

For example, in Japanese writing, は、abc should be breakable into two lines like this:

は、
abc

Japanese writers use 、 and 。 the same way English writers use the comma and the period. In English, a line can be broken after a comma or a period, but the current Unicode line breaking algorithm does not allow a break after 、 or 。 when an ASCII token follows.

I think this is a problem in the Unicode line breaking algorithm standard itself.

See Unicode Standard Annex #14 Line Breaking Properties.

CL: Closing Punctuation (XB)
3001..3002 IDEOGRAPHIC COMMA..IDEOGRAPHIC FULL STOP

、 (U+3001) and 。 (U+3002) are specified as CL characters.

LB30: Do not break between letters, numbers, or ordinary symbols and opening or closing punctuation.
CL × (AL | NU)

This rule says that no break is allowed between a CL character and an alphabetic or numeric token that follows it. As a result, there is no position in は、abc where the line can be broken.

In my opinion, 、 and 。 should not be treated as CL, because the LB30 rule should not apply to them. They should be assigned to a different class.
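
The argument above can be checked with a small sketch. This is a toy model, not a UAX #14 implementation: it knows only the classes CL, AL, NU, and ID, treats every character other than 、, 。, ASCII letters, and ASCII digits as ID, and applies only the CL × (AL | NU) rule quoted above plus two other pair rules as I understand them from the annex (no break before CL, no break inside a run of letters and digits). The function names and the exempt_ideographic_punct flag are hypothetical and exist only to illustrate the proposed change.

```python
# Toy model of the pair rules discussed above. NOT a full UAX #14 implementation.
IDEOGRAPHIC_PUNCT = "\u3001\u3002"  # 、 and 。


def line_break_class(ch):
    """Return a simplified line breaking class for one character."""
    if ch in IDEOGRAPHIC_PUNCT:
        return "CL"  # as specified in the annex excerpt above
    if ch.isascii() and ch.isalpha():
        return "AL"
    if ch.isascii() and ch.isdigit():
        return "NU"
    return "ID"  # assumption: everything else is treated as ideographic


def can_break(before, after, exempt_ideographic_punct=False):
    """Decide whether a break is allowed between two adjacent characters."""
    b, a = line_break_class(before), line_break_class(after)
    if a == "CL":
        return False  # no break before closing punctuation
    if b == "CL" and a in ("AL", "NU"):
        # CL x (AL | NU), the LB30 rule quoted above. In this toy model the
        # only CL characters are 、 and 。, so the hypothetical flag simply
        # switches the rule off for them.
        return exempt_ideographic_punct
    if b in ("AL", "NU") and a in ("AL", "NU"):
        return False  # no break inside a run of letters and digits
    return True  # otherwise a break is allowed


def break_opportunities(text, exempt_ideographic_punct=False):
    """Return the indexes at which a line break may be inserted."""
    return [
        i
        for i in range(1, len(text))
        if can_break(text[i - 1], text[i], exempt_ideographic_punct)
    ]


print(break_opportunities("は、abc"))        # [] : no break anywhere
print(break_opportunities("は、abc", True))  # [2] : break allowed after 、
```

With the rule as specified, the model finds no break opportunity in は、abc. Once 、 is exempted from the CL × (AL | NU) rule, the only opportunity is at index 2, directly after 、, which is the desired behavior.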

Line Breaking Problem of WebKit

WebKit uses ICU for line breaking, and ICU implements the Unicode line breaking algorithm very strictly. This problem is therefore reproducible in Safari.
