This is a static dump of issues in the old "Flyspray" bugtracker for DokuWiki. Bugs and feature requests
are now tracked in the issue tracker on GitHub.
Closed
Implemented
FS#2143 IDX_ASIAN2
UTF-8/Unicode
2011-01-20 danny0838
CJK Extension B, C, and D should be treated as Chinese characters (i.e. indexed and searched separately) but are not listed in the constant IDX_ASIAN2 in indexer.php. I suggest adding them.
(Since \x{#####} is not available for 4-byte UTF-8 characters, the added ranges use double-quoted strings containing the raw UTF-8 byte sequences instead.)
define('IDX_ASIAN2','['.
'\x{2E80}-\x{3040}'. // CJK -> Hangul
'\x{309D}-\x{30A0}'.
'\x{30FD}-\x{31EF}\x{3200}-\x{D7AF}'.
'\x{F900}-\x{FAFF}'. // CJK Compatibility Ideographs
'\x{FE30}-\x{FE4F}'. // CJK Compatibility Forms
"\xF0\xA0\x80\x80-\xF0\xAA\x9B\x9F". // CJK Extension B
"\xF0\xAA\x9C\x80-\xF0\xAB\x9C\xBF". // CJK Extension C
"\xF0\xAB\x9D\x80-\xF0\xAB\xA0\x9F". // CJK Extension D
"\xF0\xAF\xA0\x80-\xF0\xAF\xAB\xBF". // CJK Compatibility Supplement
']');
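To check byte sequences like the ones in the define() above, the sketch below encodes a Unicode code point as UTF-8 and prints it in the same "\xNN" notation. Both function names are hypothetical helpers written for this illustration and are not part of DokuWiki.

```php
<?php
// Hypothetical helper (not part of DokuWiki): encode one Unicode code point
// as a string of raw UTF-8 bytes.
function codepoint_to_utf8($cp) {
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // Code points above U+FFFF (e.g. CJK Extension B) need four bytes.
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: render the encoded bytes in "\xNN" notation.
function utf8_escape_notation($cp) {
    $out = '';
    foreach (str_split(codepoint_to_utf8($cp)) as $byte) {
        $out .= sprintf('\x%02X', ord($byte));
    }
    return $out;
}

echo utf8_escape_notation(0x20000), "\n"; // prints \xF0\xA0\x80\x80 (start of Extension B)
echo utf8_escape_notation(0x2A6DF), "\n"; // prints \xF0\xAA\x9B\x9F (end of Extension B)
```

The two printed values match the endpoints of the Extension B range in the proposed constant.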
It would also help if the following commonly used symbols were added, most of which are not among the characters listed and excluded in $UTF8_SPECIAL_CHARS2 in utf8.php:
'\x{2500}-\x{25FF}'. // Common Symbols used by CJK
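As a rough illustration of what "indexed and searched separately" could mean for these ranges, the sketch below puts every character matched by a class into its own token. The class here is a shortened, hypothetical stand-in for the proposed constant, and split_tokens() is an assumed helper, not DokuWiki's actual tokenizer.

```php
<?php
// Hypothetical character class for illustration only: one range from the
// proposed IDX_ASIAN2 plus the symbol range suggested above.
$class = '[' .
    "\xF0\xA0\x80\x80-\xF0\xAA\x9B\x9F" . // CJK Extension B (U+20000-U+2A6DF)
    '\x{2500}-\x{25FF}' .                 // symbols commonly used with CJK text
    ']';

// Hypothetical helper: surround each matched character with spaces, then
// split on whitespace so every matched character becomes its own token.
function split_tokens($class, $text) {
    $spaced = preg_replace('/(' . $class . ')/u', ' $1 ', $text);
    return preg_split('/\s+/', $spaced, -1, PREG_SPLIT_NO_EMPTY);
}

// "a", U+20000, U+20001, "b" written as raw UTF-8 bytes.
print_r(split_tokens($class, "a\xF0\xA0\x80\x80\xF0\xA0\x80\x81b"));
```

The /u modifier matters here: without it, PCRE would treat the four-byte sequences as individual bytes and the range would not mean what the comment says.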