This is a static dump of issues in the old "Flyspray" bugtracker for DokuWiki. Bugs and feature requests
are now tracked in the issue tracker on GitHub.
Closed
None
FS#2947 Syntax recognition patterns are not unicode aware
UTF-8/Unicode
2014-02-24 ameise
The pattern
$this->Lexer->addSpecialPattern('\b_?\p{Lu}\p{M}*\p{Lu}\p{M}*[-_\p{L}\p{M}]*\b', $mode, ...);
fails to recognize strings that contain letters like ÄÖÜÇÁÀ.
The reason is that _getPerlMatchingFlags (lexer.php, line 216) does not set the 'u' (Unicode) flag.
Here is the fix:
function _getPerlMatchingFlags() {
    return ($this->_case ? "umsS" : "umsSi");
}
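To illustrate the difference the 'u' flag makes, here is a minimal standalone sketch. It does not use the DokuWiki lexer; find_words() is a hypothetical helper, the "/" delimiters are an assumption, and the file is assumed to be saved as UTF-8.

```php
<?php
// Sketch only: find_words() is a hypothetical helper, not part of DokuWiki.
// The pattern is copied from the report; the "/" delimiters are an assumption.
function find_words(string $flags, string $subject): array
{
    $pattern = '\b_?\p{Lu}\p{M}*\p{Lu}\p{M}*[-_\p{L}\p{M}]*\b';
    preg_match_all('/' . $pattern . '/' . $flags, $subject, $matches);
    return $matches[0];
}

// This file is assumed to be saved as UTF-8.
$subject = 'ÄÖtest ABtest';

// Case-sensitive flags without "u", then the flags proposed in the fix.
echo 'msS:  ', implode(', ', find_words('msS', $subject)), "\n";
echo 'umsS: ', implode(', ', find_words('umsS', $subject)), "\n";
```

On a PHP build with PCRE Unicode support, the first line should list only ABtest, while the second should list both ÄÖtest and ABtest. Without 'u' the subject is scanned byte by byte, so the two-byte sequences for Ä and Ö are never seen as uppercase letters at a word boundary.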
2014-02-25 ChrisS
I believe (vaguely recall) we didn't want to use Unicode-aware patterns for efficiency reasons: they were much slower than non-Unicode-aware patterns. I also vaguely recall that in more recent versions of PHP the problem is less bad than it used to be. I guess we have a couple of choices beyond leaving things as they are:
1. Run some benchmarking to see what difference the 'u' flag makes; if there is no significant difference (and no problems), make the change.
2. Provide some sort of setting to allow an admin[a] and/or a plugin[b] to make the lexer Unicode-aware.
[a] if using locally configurable patterns with Unicode properties, e.g. interwiki, acronyms, entities or smileys (is this likely?)
[b] if the plugin is using Unicode properties.
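For choice 1, a rough micro-benchmark could look something like the sketch below. It times only a single pattern on synthetic text, not the full lexer; time_pattern() is a hypothetical helper, and the subject text and round count are arbitrary assumptions. A meaningful comparison would need real wiki pages and the complete set of syntax modes.

```php
<?php
// Sketch only: time_pattern() is a hypothetical helper, not part of DokuWiki.
// It returns the seconds spent running the reported pattern $rounds times.
function time_pattern(string $flags, string $subject, int $rounds): float
{
    $regex = '/\b_?\p{Lu}\p{M}*\p{Lu}\p{M}*[-_\p{L}\p{M}]*\b/' . $flags;
    $start = microtime(true);
    for ($i = 0; $i < $rounds; $i++) {
        preg_match_all($regex, $subject, $matches);
    }
    return microtime(true) - $start;
}

// Synthetic page text (assumption); this file is assumed to be saved as UTF-8.
$subject = str_repeat("Some text with a CamelCase word and an ÄÖtest link.\n", 2000);

// Compare the flags without "u" against the proposed ones.
foreach (['msS', 'umsS'] as $flags) {
    printf("%-5s %.4f s\n", $flags, time_pattern($flags, $subject, 50));
}
```

The timings depend heavily on the PHP and PCRE versions, so the numbers are only comparable when both runs happen on the same machine.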
2014-02-25 ameise
I haven't found any problems so far, and no significant time penalties while editing CMS pages that use a lot of plugins. But that's no benchmark.
[a] is not feasible, since patterns that match multibyte characters are properties of the plugin.