Solr integration
Added by Julian Hochstetter 590 days ago
Hello Kama,
I want to use SupoSE to scan and index our repositories, but instead of writing directly to the Lucene index we want to use our already running Solr server. (A frontend is in the pipeline... ;)
To contribute this feature I've rewritten some parts of SupoSE so that you can choose between Lucene and Solr as the storage backend at index and search time.
I think this is a welcome feature because the planned SupoSE web frontend could also benefit from great Solr extensions like faceting.
So let me give you a short overview of the things I've changed or done already:
- Use ResultEntry instead of a Lucene Document to hold the data during scanning. Solr can work directly with the now annotated ResultEntry, while LuceneIndexWriter converts it to a Lucene Document; the same applies when querying the index.
- Writer and Reader interfaces and an IndexerFactory are now used to store/read data using Lucene or Solr
- Created an annotation which marks a ResultEntry field as analyzed when it is indexed
- Added a solr parameter to the scan and search commands
- Updated Tika to a recent version and API, and use its capability to parse all known rich text documents
- ResultEntry and FieldNames cleanup: no distinction any more between PDF, XLS or other rich text metadata like author, subject etc.
- TikaParser maps Tika Metadata to ResultEntry fields
- Added an ANTLR PHP grammar because we have a lot of PHP projects in our repositories
What does not work as expected till now:
- Run the Lucene and Solr unit tests using a TestNG factory to create two separate test runs
- Dynamic fields in Solr which store the svn properties
- Tokenizing and filtering of path and filename in Solr to get all unit tests working
- Document ID creation to update an existing index and delete duplicates
- The Solr unit tests need a running Solr server; maybe fire up an embedded Solr only for the tests
- Scheduled jobs cannot use Solr yet; so far only the scan and search commands are implemented
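To make the ResultEntry, annotation and document ID items from the lists above a bit more concrete, here is a minimal sketch. It is not taken from the patch: the `@Analyzed` annotation, the simplified `ResultEntry` fields, the `documentId()` format and the `EntryMapper` helper are all hypothetical names chosen only for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical marker: the field should be analyzed (tokenized) when indexed.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Analyzed {}

// Simplified stand-in for SupoSE's ResultEntry; the fields are assumptions.
class ResultEntry {
    String repository;
    String path;
    long revision;
    @Analyzed String contents;

    // Deterministic ID (assumed format): indexing the same entry again can
    // then overwrite the old document instead of creating a duplicate.
    String documentId() {
        return repository + ":" + path + "@" + revision;
    }
}

class EntryMapper {
    // Reports for every field whether it carries @Analyzed. A Lucene writer
    // could use this to build its Document; a Solr writer could rely on the
    // annotated bean more directly.
    static Map<String, Boolean> analyzedFlags(Class<?> type) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        for (Field f : type.getDeclaredFields()) {
            if (f.isSynthetic()) {
                continue; // skip compiler- or tool-generated fields
            }
            flags.put(f.getName(), f.isAnnotationPresent(Analyzed.class));
        }
        return flags;
    }
}
```

The point of the design is that the scanner only fills a plain entry object, and each backend decides for itself how to turn the annotated fields into its own document format.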
Because the recent Tika version does the parsing much better and lets you set a composite parser which is used to parse package content, you need to have Tika-0.8-SNAPSHOT compiled from source in your local Maven repository.
The other library updates are org.apache.apache-solr-1.4.0 and org.slf4j.slf4j-log4j12-1.6.0, which are both in the Maven repository.
I know it's not completely finished, but I want to provide this work early because I want to hear your feedback and maybe some hints or questions...?
What do you think? Are these advantages which you want to have in SupoSE?
Maybe you can create a branch and give me write access to it so I can contribute the work continuously.
Kind regards from Gießen, Hessen ;)
Julian Hochstetter
SupoSE-SolrIntegration-525.diff - SupoSE SolrIntegration Patch against trunk rev 525 (190.7 KB)
Replies
RE: Solr integration - Added by Redmine Admin 588 days ago
Hi Julian,
first of all many thanks for that patch...
that looks great and gives me an idea... to refactor the app so that the storage can be injected as a kind of plugin...
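A rough sketch of what such an injectable storage could look like. All names here (`IndexStorage`, `LuceneStorage`, `SolrStorage`, `StorageFactory`) are hypothetical and not taken from the patch, and both backends are in-memory placeholders that do not talk to Lucene or Solr at all:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical storage abstraction; the real interfaces may look different.
interface IndexStorage {
    void add(String documentId, String contents);
    int size();
}

// Shared in-memory placeholder: adding the same ID twice replaces the entry,
// which mimics "update an existing index and delete duplicates".
abstract class InMemoryStorage implements IndexStorage {
    private final Map<String, String> documents = new HashMap<>();

    @Override
    public void add(String documentId, String contents) {
        documents.put(documentId, contents);
    }

    @Override
    public int size() {
        return documents.size();
    }
}

// Stand-ins only; a real version would wrap a Lucene index or a Solr client.
class LuceneStorage extends InMemoryStorage {}

class SolrStorage extends InMemoryStorage {}

class StorageFactory {
    // Chooses the backend by name, e.g. from a command line parameter.
    static IndexStorage create(String backend) {
        switch (backend.toLowerCase(Locale.ROOT)) {
            case "lucene":
                return new LuceneStorage();
            case "solr":
                return new SolrStorage();
            default:
                throw new IllegalArgumentException("Unknown storage backend: " + backend);
        }
    }
}
```

With something like `StorageFactory.create("solr")` the scan and search code would only ever see `IndexStorage`, so further storages could be added without touching the callers.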
> I want to use SupoSE to scan and index our repositories, but instead of writing directly to the Lucene index we want to use our already running Solr server. (A frontend is in the pipeline... ;)
> To contribute this feature I've rewritten some parts of SupoSE so that you can choose between Lucene and Solr as the storage backend at index and search time. I think this is a welcome feature because the planned SupoSE web frontend could also benefit from great Solr extensions like faceting.
The current web frontend is simple but works (a simple web app which uses an existing index and runs some queries against it)... it is on the 0.7.1 branch...
> - Writer and Reader interfaces and an IndexerFactory are now used to store/read data using Lucene or Solr
> - Created an annotation which marks a ResultEntry field as analyzed when it is indexed
> - Added a solr parameter to the scan and search commands
> - Updated Tika to a recent version and API, and use its capability to parse all known rich text documents
Updates planned... fine...
> - ResultEntry and FieldNames cleanup: no distinction any more between PDF, XLS or other rich text metadata like author, subject etc.
It's a good idea to change this...
> - TikaParser maps Tika Metadata to ResultEntry fields
> - Added an ANTLR PHP grammar because we have a lot of PHP projects in our repositories
Hey cool... am I allowed to use that PHP parser?
> What does not work as expected till now:
> - Run the Lucene and Solr unit tests using a TestNG factory to create two separate test runs
> - Dynamic fields in Solr which store the svn properties
> - Tokenizing and filtering of path and filename in Solr to get all unit tests working
I have to think about the path, filename and method fields...
> - Document ID creation to update an existing index and delete duplicates
> - The Solr unit tests need a running Solr server; maybe fire up an embedded Solr only for the tests
> - Scheduled jobs cannot use Solr yet; so far only the scan and search commands are implemented
> Because the recent Tika version does the parsing much better and lets you set a composite parser which is used to parse package content, you need to have Tika-0.8-SNAPSHOT compiled from source in your local Maven repository. The other library updates are org.apache.apache-solr-1.4.0 and org.slf4j.slf4j-log4j12-1.6.0, which are both in the Maven repository.
> I know it's not completely finished, but I want to provide this work early because I want to hear your feedback and maybe some hints or questions...?
Sounds great... and a really good idea... it makes me think about different storages...
> What do you think? Are these advantages which you want to have in SupoSE?
Yes...
> Maybe you can create a branch and give me write access to it so I can contribute the work continuously.
Good idea...
I have to change the setup, because currently I only have a read-only copy on supose.org and the real one on my private server...
> Kind regards from Gießen, Hessen ;)
Kind regards from Aachen, NRW...
> Julian Hochstetter
Kind regards
Karl Heinz Marbaise
RE: Solr integration - Added by Redmine Admin 587 days ago
Hi Julian,
could you be so kind as to create the patch file a second time, because it causes many failures (chunks not found etc.).
Thanks in advance.
Kind regards
Karl Heinz Marbaise
RE: Solr integration - Added by Julian Hochstetter 587 days ago
Hello Karl Heinz,
I've attached a new version of the patch. This time I used diffutils instead of the svn diff command to produce the patch; I don't know why svn diff is producing garbage....
Kind regards,
Julian Hochstetter
SupoSE-SolrIntegration.diff (201.5 KB)
RE: Solr integration - Added by Redmine Admin 586 days ago
Hi Julian,
the patch now works like a charm...
Kind regards
Karl Heinz Marbaise
RE: Solr integration - Added by Karl Heinz Marbaise 559 days ago
Hi Julian,
I've now started to refactor the code based on your ideas to make improvements like Solr integration simpler... and of course it makes the code simpler and easier to understand.
Kind regards
Karl Heinz Marbaise
RE: Solr integration - Added by Julian Hochstetter 533 days ago
Hello Karl Heinz,
as a result of getting more familiar with the Solr DataImportHandler for other document sources, I decided not to use SupoSE to index our SVN repositories. The reason is that we want to keep all the import stuff together as far as possible.
The DIH is such a great piece of software, and it wasn't a big deal to write an SVNImportDataSource and an associated SVNEntityProcessor.
So, sorry that I will not contribute to your project in the future. If you are interested in the DIH stuff, write me an email.
Regards,
Julian Hochstetter