Bug #306
Use of Lucene StandardAnalyzer treats _ as word delimiter
| Status: | New | Start: | 04/16/2010 |
| Priority: | Normal | Due date: | |
| Assigned to: | - | % Done: | 0% |
| Category: | Indexer | | |
| Target version: | 0.7.1 Dione | | |
Description
I am indexing .c files from a codebase that uses _ to form function names. In Lucene's 'StandardAnalyzer', _ is treated as a whitespace character. As such, a search for, say, 'Is_variable_on' becomes 'is variable on'. Combined with a minimum word length, this means that one cannot search for 'is_variable_on'; instead one is only searching for 'variable', since 'is' and 'on' are below the minimum word length.
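The behaviour described above can be mimicked with a small sketch. It does not call Lucene; `analyze` and `MIN_WORD_LENGTH` are hypothetical names, and the minimum length of 3 is an assumption based on this report, not a confirmed Lucene or SupoSE setting.

```python
import re

# Assumed minimum token length; the report only says 'is' and 'on' fall below it.
MIN_WORD_LENGTH = 3


def analyze(text, min_len=MIN_WORD_LENGTH):
    """Mimic the reported behaviour: split on non-alphanumerics (so '_' acts
    as a delimiter), lowercase, and drop tokens shorter than min_len."""
    tokens = re.split(r"[^0-9A-Za-z]+", text)
    return [t.lower() for t in tokens if len(t) >= min_len]


print(analyze("Is_variable_on"))  # only 'variable' survives
```

Under these assumptions, a query for 'Is_variable_on' and a query for plain 'variable' reduce to the same single term.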
History
Updated by Karl Heinz Marbaise 656 days ago
- Target version set to 0.6.2 Calypso
Updated by Karl Heinz Marbaise 638 days ago
Hi Doug,
After enhancing my test repository with an example of your problem, I can't see that the queries fail to find this kind of name. Maybe I have overlooked something? Could you give more details so that I can make sure I am testing the correct things?
Thanks.
Updated by Doug Warren 633 days ago
Sorry I thought I was on notification for bugs, but didn't get an E-Mail. If you wish I can compress and throw on-line example repositories and output as to how I'm calling it that demonstrate the issue.
Updated by Karl Heinz Marbaise 632 days ago
Hi Doug,
Sorry I thought I was on notification for bugs, but didn't get an E-Mail.
No problem.
If you wish I can compress and throw on-line example repositories and output as to how I'm calling it that demonstrate the issue.
Great. That would be very helpful, because my unit test in the current branch cannot reproduce the problem.
Thanks in advance.
Kind regards
Karl Heinz Marbaise
Updated by Doug Warren 621 days ago
Karl Heinz Marbaise wrote:
Hi Doug,
Sorry I thought I was on notification for bugs, but didn't get an E-Mail.
No problem.
If you wish I can compress and throw on-line example repositories and output as to how I'm calling it that demonstrate the issue.
Great. That would be very helpful, because my unit test in the current branch cannot reproduce the problem. Thanks in advance. Kind regards Karl Heinz Marbaise
I haven't had the time to reduce the set, so I've taken our current live repository (the directory 'Tsunami-Search') and a sample search script that shows how we invoke it (search.php, which just calls popen as: '/home/thebigwavenet/supose-0.6.2RC1-447/bin/supose search --index /home/thebigwavenet/Tsunami-Search --query $searchString'). In the original bug report, as reported to me, the string is_aggressive_to gets false matches in, for example, players/wildcat/skull-/8.c, which from the Unix command line:
grep is 8.c
move_object(mon,this_object());
grep to 8.c
grep aggressive 8.c
mon->set_aggressive(1);
has only a call to set_aggressive and does not contain 'to' at all, which supports my two theories: 1) _ is acting as a word boundary, and 2) words under 3 letters are being dropped/ignored.
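A minimal sketch of how those two theories together would produce the false match, assuming a hit only requires every surviving query term to occur in the file. This is not SupoSE or Lucene code; `terms` and `matches` are hypothetical helpers, and the 3-character cut-off is an assumption.

```python
import re


def terms(text, min_len=3):
    # Theory 1: '_' (like any non-alphanumeric) separates words.
    # Theory 2: words shorter than min_len are dropped (cut-off is assumed).
    return {w.lower() for w in re.findall(r"[0-9A-Za-z]+", text) if len(w) >= min_len}


def matches(query, file_text):
    # Assumes a hit requires every remaining query term to occur in the file.
    return terms(query) <= terms(file_text)


# The two lines that grep printed from 8.c above.
file_text = "move_object(mon,this_object());\nmon->set_aggressive(1);"

print(terms("is_aggressive_to"))               # the query shrinks to one term
print(matches("is_aggressive_to", file_text))  # matched via set_aggressive
```

With these assumptions the query is reduced to the single term 'aggressive', which the call to set_aggressive also yields, so the file is reported even though it contains neither 'is_aggressive_to' nor 'to'.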
You can find the file at http://dougwarren.org/supose.tgz and I apologize ahead of time that it's 70+ megs :(
Updated by Doug Warren 621 days ago
Sorry to follow up on my own comment, but I forgot to mention how the scan is done:
There's a daily script run that basically goes as:
svn update
/home/thebigwavenet/supose-0.6.2RC1-447/bin/supose scan --url file:///home/svn/tsunami-repo --index /home/thebigwavenet/Tsunami-Search --fromrev `cat /tmp/lastsvnrevision`
svn info | grep Revision | awk '{print $2}' > /tmp/lastsvnrevision
The initial repository was also created with the same command line, only with --create and without --fromrev.
Also using the above files you can see my other issues. The /archives directory does not exist in HEAD, yet many searches will return results from it.
Updated by Karl Heinz Marbaise 603 days ago
Hi Doug,
You can find the file at http://dougwarren.org/supose.tgz and I apologize ahead of time that it's 70+ megs :(
Have you removed that file?
Kind regards
Karl Heinz Marbaise
Updated by Karl Heinz Marbaise 603 days ago
- Target version changed from 0.6.2 Calypso to 0.7.1 Dione