-
2005-09-15
matthias_k
It seems that acronym parsing has changed in the latest devel snapshot: acronyms are now matched much more aggressively. For example, if you write the word "specifically", the acronym "spec" is picked up from the first part of the word. Is that intentional? It doesn't really make sense.
-
2005-09-15
andi
Partly fixed. However, there is still a bug: the following line gets both occurrences of 'spec' marked as an acronym and I don't know why:
This is the specification also called spec.
Only the last occurrence of spec should be marked, but both are. If the latter is missing, the first one isn't marked either.
Any help appreciated.
-
2005-09-15
matthias_k
Where exactly are the expressions parsed into acronyms? It might just be a flawed regex. My PHP skills are rather lacking though, so I don't think I'm of much help here.
-
2005-09-15
ChrisS
It's not a flaw in the regex. The bug is in the way the lexer locates the text that matched the regex: it does a simple search for the position of the matching text using strpos(). If the same combination of letters occurs before the actual match, strpos() will return that earlier position. For most of the regexes this won't be a problem, but for any that use look-ahead or look-behind assertions there will be issues.
I'll see if I can work out a fix.
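To illustrate the mislocation (a minimal sketch in PHP; the pattern is illustrative, not DokuWiki's actual acronym regex):

  <?php
  $raw = 'This is the specification also called spec.';

  // Match "spec" only when preceded by whitespace and followed by a
  // non-word character. Look-around assertions are zero-width, so the
  // matched text is just "spec".
  preg_match('/(?<=\s)spec(?=\W)/', $raw, $matches);

  // The regex matched at offset 38, but strpos() finds the identical
  // letters inside "specification" first.
  $pos = strpos($raw, $matches[0]);   // 12 - the wrong occurrence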
-
2005-09-15
ChrisS
If you raise the PHP requirement for DokuWiki to 4.3.0, we can use the PREG_OFFSET_CAPTURE flag.
Other than that:
- could the preceding boundary character be simplified to ^ or a space? If so, you could return both boundary characters and let the acronym handler work out what to strip off before writing the render instruction.
-
2005-09-15
ChrisS
PREG_OFFSET_CAPTURE is no good. Even with the /u option it's not UTF-8 aware; it simply returns the byte offset of the match.
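A quick sketch of the problem (hypothetical string; mb_strpos() is only there to show the character offset the lexer would actually need):

  <?php
  $raw = "café spec";   // "é" is two bytes in UTF-8

  preg_match('/spec/u', $raw, $m, PREG_OFFSET_CAPTURE);
  echo $m[0][1];                             // 6 - byte offset, despite /u

  echo mb_strpos($raw, 'spec', 0, 'UTF-8');  // 5 - the character offset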
-
2005-09-16
ChrisS
I have the basics of a solution.
(1) PREG_OFFSET_CAPTURE looks like it will work, as long as the main "raw" string is split by byte-aware rather than character-aware functions.
(2) preg_split() can be used. Once the index of the matching pattern has been determined, that pattern can be used to split "raw"; the length of the string before the split then gives the match position, so Doku_LexerParallelRegex->match() can be made to behave as it would with PREG_OFFSET_CAPTURE. Since all Doku_Lexer->_reduce() does after the match is to split "raw", some rejigging - either letting Doku_LexerParallelRegex->match() return an already-split "raw", or letting Doku_Lexer->_reduce() access the successful pattern - would avoid a second (albeit much faster, substr-based) string split in Doku_Lexer->_reduce().
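A sketch of idea (2), with illustrative values rather than the actual Doku_Lexer code:

  <?php
  // Locate the match by splitting instead of searching. preg_split()
  // honours the look-around assertions, so the length of the text
  // before the split is the true match offset.
  $raw     = 'This is the specification also called spec.';
  $pattern = '/(?<=\s)spec(?=\W)/';

  $pieces = preg_split($pattern, $raw, 2);
  $offset = strlen($pieces[0]);   // 38 - the correct occurrence this time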
-
2005-09-16
matthias_k
I am still wondering what is actually causing this bug, since it was not present in the last stable release.
Also, though I'm not sure what kind of parser is used (maybe LL(1)?), why isn't it possible to simply check in the lexer whether the next character is whitespace when a sequence matches an acronym? That way, only complete, separate words would ever match the pattern.
-
2005-09-16
ChrisS
Andi will have to say where it comes from - my guess is that in solving another bug (Andi mentioned something on the mailing list about word boundaries not working 100% with UTF-8 characters) this one has come to light.
You can't check only against whitespace. Any word can be followed by any of a number of punctuation marks.
The lexer successfully pattern-matches against the acronym; it is what comes next that is the problem. preg_match() (prior to PHP 4.3) doesn't say where in the string the match occurred, so the lexer has to use some other technique to locate the match. The chosen technique, strpos(), is inadequate where the matching pattern uses look-ahead or look-behind assertions: in those circumstances, a series of characters identical to the match, but without the correct look-ahead/look-behind context, may be located instead of the actual match.
That defines the bug. We have only seen it recently, on acronyms. A workaround could probably be found for acronyms by removing the look-ahead/look-behind assertions from the pattern, but the underlying bug in the lexer would still exist. I think it's better to come up with a solution for that underlying bug than to introduce a workaround.
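For completeness, a sketch of what that workaround would look like (illustrative pattern, tying in with the boundary-character suggestion above):

  <?php
  // Fold the boundary characters into the match instead of asserting
  // them. The matched text then includes its surroundings, making it
  // unique enough for strpos() to find the right occurrence - but any
  // other pattern still using assertions stays broken.
  $raw = 'This is the specification also called spec.';

  preg_match('/\sspec[\s.]/', $raw, $m);
  $pos = strpos($raw, $m[0]);   // 37 - correct, boundaries included
  // The acronym handler would then have to strip the boundary
  // characters back off before writing the render instruction.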
-
2005-09-16
ChrisS
I have a solution worked out - it may even be faster than before :)
I'll do some further testing later on, plus some profiling to see if it's worth making use of PREG_OFFSET_CAPTURE where it's available, and send a patch as soon as I can.
-
2005-09-16
ChrisS
Using PREG_OFFSET_CAPTURE was, if anything, slower - the difference was less than 1%.
Patch sent.
I run Gentoo/Apache/PHP 4.4. In my installation, preg_match() only fills the matches array with information on the sub-patterns it has checked so far, i.e. the last entry in the matches array MUST correspond to the matching pattern. I took advantage of that behaviour. It probably needs to be tested to make sure it is universal.
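A sketch of that behaviour with a made-up combined pattern (as said, it should be verified across PHP versions):

  <?php
  // The parallel lexer combines the token patterns into a single
  // alternation. Sub-patterns after the one that matched take no part
  // in the match and are omitted from $matches, so the last entry
  // identifies which alternative matched.
  preg_match('/(foo)|(bar)|(baz)/', 'say bar please', $matches);

  // $matches is: [0] => 'bar', [1] => '', [2] => 'bar'
  $group = count($matches) - 1;   // 2 - the second alternative matched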
-
2005-09-16
andi
Works perfectly with my 4.3.10. Great work!