The result of this milestone can be found at source:tags/RELEASE-0.2.0.1



The features implemented so far:

You can scan a repository from the command line and search the result of a scan. You can create
a configuration to let SupoSE take care of time-based scanning.

The exact source code state for:

Source can be accessed via source:branches/B_0.4.0


Test the integration of the Apache Tika framework to see if we can use it.


0.5.0.1 Ceres

Bug Fix Release to fix problem with commons collections (source:/repositories/browse/supose/branches/R_0.5.0.1)


050 (Ceres) Alpha 1

This is the first milestone which has been managed with Redmine instead of Trac.

0.5.0 Ceres

We would like to improve the search capabilities etc., as well as the command line behaviour and the usability on the command line.

0.5.1 Europa

Here we go on...


0.5.2 Bergelmir

Release 0.5.2 contains fixes for various things. It is now possible to search for files etc. with queries like this:
--query "+filename:*Repos*"
or
--query "+filename:*.java"

The first one matches any filename or directory name which contains the term.
If you are searching for files which contain SCM, you have to write it exactly this way, because
a query for scm would not find any occurrence of it.

0.5.3 Bebhionn


0.6.1 Bestla

New Release


Release 0.6.2 Calypso RC1


0.7.0 Tethys

This new milestone is intended to create a different project structure to represent planned features like SOAP, RESTlet interfaces etc. The development of this milestone is done on the following branch source:/branches/B_0.7.0


Release 0.7.1 Dione

The current source for the 0.7.1 release can be found at source:/trunk


Branches

The usual pattern for branches

------------------------------------------------------------------------
r242 | kama | 2009-04-02 21:02:01 +0200 (Do, 02 Apr 2009) | 1 line
Changed paths:
   A /branches/F184 (from /trunk:241)

- Branch to implemente Feature #184
------------------------------------------------------------------------


Command Line Description

Scanning

Start from scratch

You would like to start from scratch with SupoSE.

supose scan --url URLRepository --create --index index.Test

Scan only parts of a repository

supose scan --url URLRepository --create --index index.Test --torev 500
* The second time you call the scanner you have to be careful with the *--create* option. This option should be
given only once, because it destroys any previously existing index.
supose scan --url URLRepository --fromrev 501 --index index.Test
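
The two-step pattern above (one initial scan with --create, later scans without it) can be sketched as a small helper which assembles the argument list. This is only an illustration: the function name build_scan_command is made up for this sketch, and only flags that appear in this description are used.

```python
def build_scan_command(url, index, create=False, fromrev=None, torev=None):
    """Assemble the argument list for a `supose scan` call.

    Hypothetical helper, not part of SupoSE. `create` should be True for the
    very first scan only, because --create discards an existing index.
    """
    command = ["supose", "scan", "--url", url]
    if create:
        command.append("--create")
    if fromrev is not None:
        command += ["--fromrev", str(fromrev)]
    command += ["--index", index]
    if torev is not None:
        command += ["--torev", str(torev)]
    return command


# First run: create the index and stop at revision 500.
first = build_scan_command("URLRepository", "index.Test", create=True, torev=500)
# Later run: continue at revision 501 and do NOT pass --create again.
later = build_scan_command("URLRepository", "index.Test", fromrev=501)

print(" ".join(first))
print(" ".join(later))
```

The resulting lists could be handed to subprocess.run, after appending ".sh" or ".cmd" to the program name as needed for your platform.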

Searching

Searching is simple to define. You just give the index you would like to use and the query.

 supose search --index index.Test --query "QUERY" 

If you would like to define which fields are printed in the result, use the *--fields* option. There you can give one or more field names to be printed. The following command searches the index with the given query and prints the fields revision and message.

 supose search --index index.Test --fields revision message --query "QUERY" 

Scheduling

You can schedule the scanning process if you would like to get rid of scanning by hand. You may have thought about cron-based scanning of the repositories, but the more comfortable way is to let SupoSE do the work for you.

Command line parameters are

supose schedule --configuration ./repositories.ini --configbase ./

This starts a program which reads the configuration from the given repositories.ini file and runs cron-based scans until you stop it with CTRL-C on the command line or with a real kill command (Task Manager on Windows or kill on the Unix command line).

Merging

If you have multiple indexes and want to merge them into one, enter something like the following

supose merge --destination ./mergedindexfolder --index ./firstindexfolder ./secondindexfolder ./thirdindexfolder

Help

If you would like to see which other command line options are available, just type

supose --help

(1) For all examples you have to append ".sh" or ".cmd" to the command, depending on the platform you are working on: ".sh" for Linux/Unix and ".cmd" for Windows.

(2) I suggest using the file:// protocol for scanning to get maximum performance. The http:// protocol works too, but is a little slower.


Configuration

Overview

The configuration for the scheduler uses a single file. It defines the information needed to access the repositories (which URL and which
authorization information to use), when to run the scanning process and of course where to put the scanned information.

The configuration file

Let us start with the following simple configuration file for a single repository which should be scanned on a schedule

[SupoSE]
url = http://svn.soebes.de/supose
indexusername = anonymous
indexpassword = guest
fromrev = 1
torev = HEAD
resultindex = summary

The first part, given in square brackets, is a unique name for the repository. This name is stored in a field which can later be used to separate repositories in search queries. The name is also needed to distinguish the entries in
the configuration file from each other.
The url defines the URL which is used to access the repository. Here you can use whichever protocol you need: http, https, file, svn or svn+ssh.
The next parts, indexusername and indexpassword, define the authorization information to access the repository. The fromrev and torev parts define from which revision to which revision the scan process runs the first time.
The last part, resultindex, defines where to put the result of the scanning process.
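
To make the structure of the file more tangible, here is a minimal sketch which reads the example above with Python's standard configparser module. SupoSE itself is written in Java and reads the file with its own code, so this is only an illustration of the layout; the function name read_repositories is an assumption of this sketch.

```python
import configparser

EXAMPLE = """\
[SupoSE]
url = http://svn.soebes.de/supose
indexusername = anonymous
indexpassword = guest
fromrev = 1
torev = HEAD
resultindex = summary
"""


def read_repositories(text):
    """Return {repository name: {option: value}} for every section."""
    # Interpolation is switched off so values are taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


repositories = read_repositories(EXAMPLE)
for name, settings in repositories.items():
    print(name, "->", settings["url"], "into index", settings["resultindex"])
```

Every section name becomes one repository entry, which is why the names have to be unique within the file.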

Multiple Repositories

You can configure multiple repositories to be scanned by SupoSE. To do this you need to give the information for every repository in its own section
of the configuration file. This means you have to extend the above configuration file for every new repository.

[SupoSE]
url = http://svn.soebes.de/supose
indexusername = anonymous
indexpassword = guest
fromrev = 1
torev = HEAD
[AntSVK]
url = http://svn.soebes.de/antsvk
indexusername = anonymous
indexpassword = guest
fromrev = 1
torev = HEAD
[GForge]
url = http://svn.soebes.de/gforge
indexusername = anonymous
indexpassword = guest
fromrev = 1
torev = HEAD

When will the scan run?

With the above configuration file, the scan time defaults to every minute. In other words, every minute each repository is checked for changes. If you don't want to check your repositories every minute, you can add a cron entry which defines when to scan each repository.
[SupoSE]
url = http://svn.soebes.de/supose
indexusername = anonymous
indexpassword = guest
fromrev = 1
torev = HEAD
cron = 0 * * ? * *
[AntSVK]
url = http://svn.soebes.de/antsvk
indexusername = anonymous
indexpassword = guest
fromrev = 1
torev = HEAD
cron = 0 0 2 ? * *
[GForge]
url = http://svn.soebes.de/gforge
indexusername = anonymous
indexpassword = guest
fromrev = 1
torev = HEAD
cron = 0 0 18 ? * *

In the example above the expression cron = 0 * * ? * * (the default) means that the given repository is checked every minute of every day.
If you would like to change this behaviour you only need to change the expression.
It can be defined independently for every repository, so you can check
repository A only on Mondays and repository B every day at 5 o'clock in the morning.

Detailed explanations of the cron expression can be found here and here and further details can be seen here
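
The following sketch names the individual fields of such an expression. It assumes the expressions follow the Quartz-style six-field layout (seconds, minutes, hours, day of month, month, day of week), which the ? placeholder suggests; the optional seventh field (year) is not handled. The expressions for "every day at 5 o'clock" and "only on Mondays" are my own examples under that assumption and are not taken from the configuration above.

```python
QUARTZ_FIELDS = ("seconds", "minutes", "hours", "day_of_month", "month", "day_of_week")


def split_cron(expression):
    """Map each part of a six-field cron expression to its field name."""
    parts = expression.split()
    if len(parts) != len(QUARTZ_FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(QUARTZ_FIELDS), len(parts))
        )
    return dict(zip(QUARTZ_FIELDS, parts))


examples = {
    "every minute (the default)": "0 * * ? * *",
    "every day at 2 o'clock": "0 0 2 ? * *",
    "every day at 5 o'clock (assumed example)": "0 0 5 ? * *",
    "Mondays at midnight (assumed example)": "0 0 0 ? * MON",
}

for label, expression in examples.items():
    print(label, split_cron(expression))
```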

Index

The index is the result of the scanning process for a repository, updated every time the repository is scanned. SupoSE is designed to put the result of every scan process into a result index which is then used for the real search.

You can, for example, scan three repositories and put the results into a single index. You can also put the results into different indexes, e.g. the results of repositories 1 and 2 into an index result1 and the result of repository 3 into an index result2.
This is defined by giving different targets within the configuration file, as in the following example:
[SupoSE]
url = http://svn.soebes.de/supose
indexusername = anonymous
indexpassword = guest
fromrev = 1
torev = HEAD
resultindex = indexresult1
[AntSVK]
url = http://svn.soebes.de/antsvk
indexusername = anonymous
indexpassword = guest
fromrev = 1
torev = HEAD
resultindex = indexresult1
[GForge]
url = http://svn.soebes.de/gforge
indexusername = anonymous
indexpassword = guest
fromrev = 1
torev = HEAD
resultindex = indexresult2

Document Handler

During the scanning process all files are read from the Subversion repository, and for each one SupoSE checks whether a special document handler exists for its document type (decided by the extension).
The document handler gets the whole contents of the file and can do with it whatever it likes. The basic idea is to scan e.g. Word and Excel files using 3rd party libraries like POI to extract the text information from such files. The extracted information is stored in the index and can
be searched later.
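
The idea of choosing a handler by file extension can be sketched like this. All names here (HANDLERS, extract_text and the two handler functions) are hypothetical and are not SupoSE's real classes; returning an empty text for unknown types is likewise just an assumption of the sketch.

```python
import os


def handle_plain_text(data):
    """Treat the file contents as text and return it for indexing."""
    return data.decode("utf-8", errors="replace")


def handle_unknown(data):
    """No handler for this type: contribute no text to the index."""
    return ""


# Hypothetical registry: extension -> handler function.
HANDLERS = {
    ".txt": handle_plain_text,
    ".java": handle_plain_text,
}


def extract_text(filename, data):
    """Pick the handler by extension and let it process the whole file."""
    extension = os.path.splitext(filename)[1].lower()
    handler = HANDLERS.get(extension, handle_unknown)
    return handler(data)


print(repr(extract_text("README.TXT", b"hello")))
print(repr(extract_text("image.png", b"\x89PNG")))
```

A handler for Word or Excel files would be registered in the same way and would call a 3rd party library instead of decoding the bytes directly.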

If you have an idea for a particular document type, just give me a hint about it and about what kind of data would be of interest.


Download Area

Release (Mars) 0.4.0 RC2

Release (Mars) 0.4.0 RC1

Release (Earth) 0.3.0 RC2

Release (Earth) 0.3.0 RC1

Older Releases are available via the archive.


Download Archive of older Releases

Release (Venus) 0.2.0

Release (Mercury) 0.1.0


Examples

Just download the package and extract the contents.

After that you can just call the supose.sh (supose.cmd on Windows based systems) and give a command line like this:

  supose.sh scan --url http://svn.soebes.de/supose

The given Repository is just the Subversion repository of SupoSE itself.
The username is anonymous and the password for the repository is guest.

A performance hint: whenever you can, use the file:// access method, because it is faster than the http protocol. Please don't blame me with
questions like "It's too slow...".


This is an aggregation milestone which collects all feature requests that are not yet scheduled for a particular milestone, including feature requests made by users via the web site.


Fields

The following fields will be added during indexing of every document (file/directory):

The following supplemental fields will be added if an Excel file is indexed:

The following supplemental fields will be added if a PDF file is indexed:

The following supplemental fields will be added if a Word file is indexed:

The following supplemental fields will be added if a PowerPoint file is indexed:

The following supplemental fields will be added if a Java source code file is indexed:

Installation

Starting with the Release 0.6.0 the installation has been simplified. Just download the current release (0.6.0 or later) for Windows (*-bin.zip) or Unix (*-bin-unix.tar.gz) and unzip/untar the distribution package. Now you have two choices:

  1. You can put the bin directory with the absolute path into the PATH
    1. Windows
      set path=%path%;C:\supose-0.5.3\bin
    2. Unix
      export PATH=$PATH:/home/username/supose-0.5.3/bin
  2. You can simply call supose.bat (Windows) / supose (Unix) and start using the tool as described in the command line description.

The other option is to call SupoSE directly from the bin directory of the distribution.


Known Bugs

Release 0.1.0 / 0.2.0

Currently you will get into trouble if you try to scan large repositories (more than 5000-6000 revisions) with large files (100 MiB and above), because you will run out of Java heap space (OutOfMemoryError).
This bug is fixed with Release 0.2.0.1.


Maven Tags

Maven Tags are created by Maven as part of the release process (mvn release:prepare):
------------------------------------------------------------------------
r234 | kama | 2009-02-21 21:17:01 +0100 (Sa, 21 Feb 2009) | 1 line
Changed paths:
   A /tags/R_0.5.1 (from /trunk:232)
   R /tags/R_0.5.1/pom.xml (from /trunk/pom.xml:233)

[maven-release-plugin]  copy for tag R_0.5.1
------------------------------------------------------------------------

The first milestone has brought up the following features:

Scanning multiple Repositories

The basic idea for scanning multiple repositories is to have a scanning part which is configurable by parameters, so it can scan different repositories and store the resulting indexes in different directories.

The second question which comes up with such an approach is how to configure the scanning process for multiple repositories.

Support things like "ParentPath", as the Apache module does for multiple repositories.

Merge the indexes of different repositories together to get a single index which is searchable.

Maybe we make it possible to merge indexes by configuration, so a searchable result can be defined for different combinations of repositories.


OpenOffice uses a file format called "Open Document Format" or ODF (ISO/IEC 26300).

The format is based on XML and available for everyone.
For more details on ODF, see wikipedia.


Performance

The current release 0.6.1 can scan a replicated repository via the file protocol in a relatively short time.

The Subversion Repository

The SupoSE Repository

The following has been tested with SupoSE (0.6.2 RC1 - 447)

The "Apache Software Foundation Repository" (02.April 2010)

Here I have documented the circumstances of the test. I am currently working on improvements (#309).

After the improvements the new release needed only 23 hours to index the whole repository, incl.
merging the indexes together.

Revision indexing

Index merging (3877 seconds)

So after the performance improvements the whole indexing process for the ASF Repository took less than 23 hours (ok ok, 6 mins less ;-)).


Release Notes

Release 0.7.1

Release 0.6.2

Release 0.6.1

Release 0.6.0


Releases

The current release planning is given in the Roadmap. There you can see which releases are planned and which
features, bugs etc. will be addressed in the particular milestones.

The current state of the milestones is more or less beta. Any information about bugs, improvements, feature requests etc.
is welcome. You can find the releases in the download area. I have also started a list of known bugs.

Milestones


Release Management

(TODO: Finish this....)

This page will describe how releases will be named and which naming convention is behind that.

Basically the release number consists of the following parts:

major.minor.patch[.bugfix]-RevNumber
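
As a sketch, the pattern above can be expressed as a regular expression. The function name parse_release and the two example numbers are made up for illustration (the revision numbers in particular are not real release revisions), and the sketch assumes that every part is a plain decimal number.

```python
import re

# major.minor.patch[.bugfix]-RevNumber, every part a decimal number.
VERSION_PATTERN = re.compile(
    r"^(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"(?:\.(?P<bugfix>\d+))?"
    r"-(?P<revision>\d+)$"
)


def parse_release(text):
    """Split a release number into its parts; bugfix is None if absent."""
    match = VERSION_PATTERN.match(text)
    if match is None:
        raise ValueError("not a release number: %r" % text)
    return {
        key: (int(value) if value is not None else None)
        for key, value in match.groupdict().items()
    }


print(parse_release("0.5.0.1-226"))
print(parse_release("0.7.1-500"))
```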

Currently the major number is 0, which means everything can still change...

I hope to stabilize the first things in 0.6.0, maybe in 0.7.0.

The plan is to have Major Release (x.0.0)

new features etc. may be break backwards compatibility and

Minor releases can break backwards compatibility.
Changes allowed

This page will outline how releases will be handled
Major Release (x.0.0)

SupoSE currently doesn't have a major release (1.0.0).

Minor releases can break backwards compatibility.
Changes allowed

Changes not allowed

None at this moment.

Patch Release (0.0.x)

Patch releases shouldn't break backwards compatibility.
Changes allowed

Changes not allowed


Questions to my repository

Never answered 'til yet...

The queries consist of fields and their required values. A more detailed description of the query syntax can be found on the Lucene Query Syntax page. The available commands etc. can be found in the command description.

Simple kind of questions:


The Requirements

Requirement to run SupoSE

Requirement to Build SupoSE from Source

You have to checkout the source
If you like to build SupoSE from Source you need

If you like to see the result of a current Maven 2 build just take a look here.

What is used to build SupoSE?

SupoSE is based on the following 3rd Party libraries, which are NOT part of the source distribution:

For unit testing we use the TestNG framework.

The command line analyzer uses the Apache Commons CLI2 library, of which unfortunately no release package has been published so far. So if you would like to build SupoSE from source you have to download the package and create a .jar, which should be put into the local Maven repository by using the following

mvn install:install-file \
  -Dfile=commons-cli2-2.0-dev.jar \
  -DpomFile=commons-cli2-2.0.pom \
  -Dclassifier=dev

You can download those needed files here.


Scanning Java Files

Starting with Milestone 0.4.0 Mars, I have introduced a first Java parser which is able to extract all kinds of comments and the method names from a given Java source code file. This information is put into the index created during
the scanning process.

Take a look at the following example file:

package com.soebes.supose.parse.java;

/*
* Default comment.
 */
public class Test1 {

    private String value;

    /** This is a JavaDoc comment */
    public Test1 () {
    }

    /* Comment voidMethod1
     */
    public void voidMethod1() {
    }

    //Line Comment staticMethod1
    public static void staticMethod1() {
    }

    public String getValue() {
        return value;
    }
    public void setValue(String value) {
        this.value = value;
    }

    private void helperMethod() {
        setValue("test");
    }
}

The current parser extracts all the comments and all the method names, except the constructor. Maybe I will change this if more is needed.
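
To illustrate which names are meant, here is a rough sketch which pulls method names out of a shortened copy of the example with a regular expression. This is NOT how the SupoSE parser works internally (it is a real Java parser); the pattern is a simplification which only handles declarations of the simple form used in this example.

```python
import re

# Simplified pattern: modifier, optional "static", return type, name, "(".
# A constructor has no return type and is therefore not matched.
METHOD_PATTERN = re.compile(
    r"\b(?:public|protected|private)\s+"  # access modifier
    r"(?:static\s+)?"                     # optional static keyword
    r"[\w<>\[\]]+\s+"                     # return type
    r"(\w+)\s*\("                         # method name followed by "("
)

JAVA_SOURCE = """
public class Test1 {
    private String value;
    public Test1 () {
    }
    public void voidMethod1() {
    }
    public static void staticMethod1() {
    }
    public String getValue() {
        return value;
    }
}
"""


def method_names(source):
    """Return the method names found in the given Java source text."""
    return METHOD_PATTERN.findall(source)


print(method_names(JAVA_SOURCE))
```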


Searching for contents

If you would like to search the contents of files you can use the contents field, which contains the content of the files.

By using the contents field within the search query you define the pattern you would like to search for. The pattern can contain wildcards, which work as you might expect from the command line.

The following query will find all entries which contain the word Company in the contents field of all revisions in all paths.

+contents:Company

If you would like to search for multiple words you can simply use the following:

+contents:Word +contents:Word2

If you need to search for a particular phrase you can use this:

+contents:"This is the Phrase" 

Searching for files

If you would like to search for filenames you can use the filename field, which contains the filename.

By using the filename field within the search query you define the pattern you would like to search for.
The pattern can contain wildcards, which work as you might expect from the command line.

The following query will find all filenames which contain the given pattern in all revisions and all paths within the repository.

+filename:*.txt
A common search pattern to find all existing Word files (.doc):
+filename:*.doc

or, if you are a little bit more up-to-date, you might search for Office 2007 files (.docx):
+filename:*.docx

If you know only parts of the filename you are searching for, simply give those parts in the query. The following query searches for .doc files which contain scm in their name. It finds filenames written in uppercase or lowercase characters; in other words the search is
not case-sensitive.

+filename:*scm*.doc

Searching for Properties

The following will search for the svn:externals property which contains https.
In other words it will search for any entry which contains an svn:externals which uses https protocol.

+svn\:externals:*https\:*
The following will narrow down the above to entries which reference svn.apache as part of their externals reference.
+svn\:externals:*https\://svn\.apache*
The following will find any revision/path etc. where the property svk:merge has been used.
+svk\:merge:*
You can of course search with more information, as in the following query,
which will find any entry which contains /subversion/branches/1.5.x in the svn:mergeinfo property.
+svn\:mergeinfo:*/subversion/branches/1.5.x*
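
Writing the backslashes by hand is easy to get wrong, so here is a small sketch which builds such property queries. The helper names escape and property_query are made up for this sketch. It escapes the characters which are special in the classic Lucene query syntax, but leaves * and ? alone so they keep working as wildcards, and leaves / alone as an assumption which matches the examples above.

```python
# Characters escaped with a backslash in this sketch. "*" and "?" are
# deliberately missing (wildcards), and so is "/" (assumption, see above).
SPECIAL_CHARACTERS = set('+-!(){}[]^"~:\\')


def escape(text):
    """Put a backslash in front of every special character."""
    return "".join(
        ("\\" + ch) if ch in SPECIAL_CHARACTERS else ch for ch in text
    )


def property_query(name, pattern):
    """Build a mandatory (+) query for a property name and a pattern."""
    return "+" + escape(name) + ":" + escape(pattern)


print(property_query("svn:externals", "*https:*"))
print(property_query("svn:mergeinfo", "*/subversion/branches/1.5.x*"))
```

On a shell command line the resulting query still has to be put into quotes, as in the --query examples.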

Searching for Revisions

If you like to search for particular revisions in the repository you can use the revision field which contains the revision number of the repository.

The following query will find all entries which are related to revision 20 in the repository. This is more or less equivalent to svn log -r 20 URL -v.

+revision:20

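If you would like to use multiple revisions you can simply use:

+revision:20 +revision:30

Note that in the Lucene query syntax a leading + makes a clause mandatory. If every entry carries exactly one revision number, the form without the + (revision:20 revision:30) is the one which matches either revision.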

If you like to use a revision range you can use the following which is more or less equivalent to svn log -r1:200 -v URL.

+revision:[1 TO 200]

This will search in the revisions from 1 to 200.


Searching for Tags/Branches

Overview

If you like to search for tags or branches you can use the tag or branch field which contains the names of the tags or branches.

Searching for tags

By using the tag field within the search query you define the pattern you would like to search for. The pattern can contain wildcards, which work as you might expect from the command line.

The following query will find all tags which exist in all revisions within the repository.

+tag:*

The result list will contain all existing tags incl. Maven Tags which have a particular pattern.

If you like to see only the list of tags without the Maven Tags just extend the query as follows:

+tag:* -maventag:*

If you like to search for Subversion Tags (particular type used by the Subversion Team) you can use a search query as follows:

+subversiontag:*

Searching for branches

If you like to search for a branch you can use the branch field to define the pattern for the branch name you would like to search for.

+branch:*

Subversion Tags

A Subversion Tag is a particular kind of tag which is used by the Subversion development team to label their releases.
They use a complex tag to define a new release. This complex tag consists of the tag name and a modified file svn_version.h.
------------------------------------------------------------------------
r34864 | hwright | 2008-12-19 20:58:27 +0100 (Fr, 19 Dez 2008) | 1 line
Changed paths:
   A /tags/1.5.5 (from /branches/1.5.x:34862)
   M /tags/1.5.5/subversion/include/svn_version.h

Tagging 1.5.5 with svn_version.h matching tarball.
------------------------------------------------------------------------

SupoSE

SupoSE is an abbreviation for *Su*bversion Re*po*sitory *S*earch *E*ngine.


SupoSEWeFE

This is an abbreviation for SupoSE *We*b-*F*ront-*E*nd.


We will start a simple Web-Front-End for SupoSE which is called SupoSEWeFE


Tags

The usual pattern for Tags is:

------------------------------------------------------------------------
r226 | kama | 2009-02-20 16:09:02 +0100 (Fr, 20 Feb 2009) | 2 lines
Changed paths:
   A /tags/R_0.5.0.1 (from /branches/R_0.5.0.1:225)

- Release 0.5.0.1
  - Bugfix release for Issue #169
------------------------------------------------------------------------


Users guide

Under active development (If you find things which are not clear etc. just drop a ticket).

Installation

Features

Searching

Performance


Welcome to the Subversion Repository Search Engine (SupoSE)

Overview

This is a Java based approach to do real searching within a complete Subversion repository. For performance reasons, among others, I have decided not to scan the repository in real time at search time. Instead, SupoSE scans the whole content of a Subversion repository up front. The result, called the index, can be used to do the real searching. Another purpose of this approach is to be able to search through multiple repositories instead of one.

With the exception of binary files for which no particular document handler exists, all files will be indexed. This means we do index Word, Excel and PowerPoint files (2007 Office variants as well).

We do not index only the trunk or the HEAD revision; we index all revisions on all paths within a whole repository. Filename, path, log message, properties etc. are made searchable (see Fields for further details).

If you wonder what you can search for, take a look at
the questions you never tried to ask your repository.

The up-to-date builds can be looked at here: http://78.46.16.202:8080/jenkins/job/SupoSE-default/ or http://78.46.16.202:8080/jenkins/job/SupoSE-site/

Features / Usage

You can find a detailed description of the features etc. in the users guide.

Plan

Planned Features / Implemented features:

An overview of the release plan and of what currently exists can be found here.

Source Code

The requirements describe what is needed to build SupoSE from source code.

If you would like to check out the current state of development you can simply use the Subversion Repository.

License

The source code and the application which I have written are licensed under the GNU General Public License Version 2. Other parts of the application, in particular the 3rd party libraries, have different licenses.

Download

In the Download Area you find all current releases of the project.

Archive

In the archive you can find older releases if you would like to take a look at them.

Builds

Currently a nightly build can give you an up-to-date release before the real delivery date.
Information about the changes which have been made can be seen at the bulletin board.

An overview of the releases can be found in the Release Notes.

Release Management

The release management is described on the ReleaseManagement page in detail.

Authentication

If you have any problems during a scan with the authentication you can use the following java settings:
-Dsvnkit.http.methods=Basic,Digest,NTLM
A detailed explanation can be found here.

Searching / Scanning

Searching and scanning of repositories is done via a command line interface (at the moment). The description
of the Command Line describes what you can do and which options are available. You can of course use
Luke too, or take a look at the command line examples. For a detailed
description of the queries and features take a look at the users guide.

Requirements / Questions / Bugs

If you would like to post feature requests, questions, suggestions, bugs or anything else about SupoSE, please use the ticket system to check whether the bug has been reported already. If you would like to report a new one, just use the new issue area
or write an email to me.

It would be nice if you gave an email address so I can get in contact with you to ask questions etc.