FS#2270 Indexer tokenizer fails to lowercase words, breaking fulltext search

Fulltext search does not work for uppercase or mixed-case words. The issue seems to come from the Indexer's tokenizer: the array returned by the tokenizer() function contains mixed-case words, whereas it should contain only lowercase words. These mixed-case words also end up in the word-index files (data/index/w*.idx).
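
To check whether an index is affected, a small diagnostic along these lines may help. It is only a sketch: find_mixed_case_words() is a hypothetical helper, not part of DokuWiki, it assumes the word-index files hold one word per line, and it only detects uppercase ASCII letters.

```php
<?php
// Hypothetical diagnostic helper, not part of DokuWiki.
// Returns every entry that changes when lowercased, i.e. every entry
// that contains at least one uppercase ASCII letter.
function find_mixed_case_words(array $words)
{
    $found = array();
    foreach ($words as $word) {
        $word = trim($word); // drop the trailing newline left by file()
        if ($word !== '' && strtolower($word) !== $word) {
            $found[] = $word;
        }
    }
    return $found;
}

// In-memory sample data. For a real check the lines could come from
// file() on one of the data/index/w*.idx files (assumed: one word per line).
$sample = array("dokuwiki\n", "DokuWiki\n", "fulltext\n");
var_export(find_mixed_case_words($sample)); // only 'DokuWiki' is reported
```

An empty result for every word-index file would suggest the tokenizer is lowercasing correctly on that installation.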

In inc/indexer.php, the function Doku_Indexer::tokenizer() ends with a loop that lowercases the words. The loop is a 'foreach' that takes the value by reference, and for some reason calling 'unset' inside the loop breaks the reference, so the array is not updated.

The bug may come from PHP itself (I'm using PHP 5.2.0-8+etch16 on Debian), but I can't upgrade to test that.

Here is the patch I'm using to work around the problem:

--- inc/indexer.php.orig 2011-05-30 11:27:32.786493000 +0200
+++ inc/indexer.php 2011-05-30 11:26:43.153392000 +0200
@@ -444,8 +444,8 @@
         $text = utf8_stripspecials($text, ' ', '\._\-:'.$wc);
 
         $wordlist = explode(' ', $text);
-        foreach ($wordlist as $i => &$word) {
-            $word = (preg_match('/[^0-9A-Za-z]/u', $word)) ?
+        foreach ($wordlist as $i => $word) {
+            $wordlist[$i] = (preg_match('/[^0-9A-Za-z]/u', $word)) ?
                 utf8_strtolower($word) : strtolower($word);
             if ((!is_numeric($word) && strlen($word) < IDX_MINWORDLENGTH)
               || array_search($word, $stopwords) !== false)
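
The index-based loop can also be tried outside DokuWiki. The following is a minimal standalone sketch, not the actual tokenizer: lowercase_tokens() and its $minLength parameter are hypothetical stand-ins (for the loop and for IDX_MINWORDLENGTH), and plain strtolower() replaces utf8_strtolower(), so only ASCII letters are handled. Unlike the patch, it lowercases the local $word before the length and stopword checks, so those checks also see the lowercase form.

```php
<?php
// Hypothetical helper, not part of DokuWiki: lowercases tokens and drops
// short words and stopwords without a by-reference foreach.
function lowercase_tokens(array $wordlist, array $stopwords, $minLength = 2)
{
    foreach ($wordlist as $i => $word) {
        // ASCII-only lowercasing; the real code uses utf8_strtolower()
        // for words containing non-ASCII characters.
        $word = strtolower($word);
        // Write back through the index, so no reference is involved.
        $wordlist[$i] = $word;
        if ((!is_numeric($word) && strlen($word) < $minLength)
            || in_array($word, $stopwords, true)) {
            unset($wordlist[$i]);
        }
    }
    // Reindex so the result has consecutive keys.
    return array_values($wordlist);
}

$tokens = lowercase_tokens(array('DokuWiki', 'Is', 'a', 'Wiki', '42'), array('is'));
var_export($tokens); // the remaining tokens are 'dokuwiki', 'wiki', '42'
```

Because the loop iterates by value and assigns via $wordlist[$i], the result does not depend on how a given PHP version treats references after unset().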


Backend