net.jot.search.simpleindexer
Class JOTSimpleSearchEngine

java.lang.Object
  extended by net.jot.search.simpleindexer.JOTSimpleSearchEngine

public class JOTSimpleSearchEngine
extends java.lang.Object

Implements a simple search engine using a text/keyword index. Use (or extend) it to index and search plain text files. It is intended to be 'barebones' and decoupled from the UI part of presenting the results.

Author:
thibautc

Field Summary
protected static JOTSearchSorter defaultSorter
           
 java.io.File indexRoot
          index property file
protected static java.util.regex.Pattern pattern
          Pattern matching "words". A single word is any run of letters or numbers (Unicode, case insensitive) as well as - and _
protected  java.io.File propFile
           
 JOTPropertiesPreferences props
           
protected  int WORD_BATCH_SIZE
          Max words to process in memory before writing to file. Too low, and performance will be slower; too high, and it will use more memory.
 
Constructor Summary
JOTSimpleSearchEngine(java.io.File indexRoot)
           
 
Method Summary
protected  int commitFromMemory(java.lang.String id, java.util.Hashtable hash)
          Writes the temporary -in memory- hash to the index files.
 int indexFile(java.io.File textFile)
          Index the file using the file path as the unique key, reindexing only if the file timestamp has changed
 int indexFile(java.io.File textFile, boolean onlyIfModified)
          Index the file using the filepath as the unique key
 int indexFile(java.io.File textFile, java.lang.String uniqueId)
          Index the file, only if the timestamp changed since the last indexing.
 int indexFile(java.io.File textFile, java.lang.String uniqueId, boolean onlyIfModified)
          Index a file (update it if already indexed)
protected  int indexLineInMemory(java.util.Hashtable hash, java.lang.String lineNb, java.lang.String s)
          hash is the hashtable storing the keyword data.
static void main(java.lang.String[] args)
          for testing / Example
static java.lang.String[] parseQueryIntoKeywords(java.lang.String queryString)
          Utility method to parse a user-typed query (e.g. "a java server pAGes ") into keywords (e.g. [java,server,pages])
 JOTRawSearchResult[] performRawSearch(java.lang.String[] keywords)
          return an array of rawSearchResults (one rawsearchresult per keyword, in the same order as the keywords).
 JOTSearchResult[] performSearch(java.lang.String[] keywords, JOTSearchSorter sorter)
          Return a sorted list of files (uniqueIds) with their scores (1-5)
 int removeFile(java.io.File textFile, java.lang.String uniqueId)
          remove a file from the index
protected  void updateKeywordsCount(int nbNewKeywords)
           
static void whipeoutIndex(java.io.File indexRoot)
          Completely wipes out the index, so you can reindex from scratch. Simply deletes everything in the indexRoot folder!
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

indexRoot

public java.io.File indexRoot
index property file


props

public JOTPropertiesPreferences props

propFile

protected java.io.File propFile

pattern

protected static java.util.regex.Pattern pattern
Pattern matching "words". A single word is any run of letters or numbers (Unicode, case insensitive) as well as - and _
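
As an illustration of the word definition above, here is a minimal sketch of such a pattern. The regular expression, the class name WordPatternSketch and the words helper are assumptions made for this example; the actual value of the pattern field is not shown in this documentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the library's actual pattern.
final class WordPatternSketch {
    // Assumed pattern: one or more Unicode letters or digits, plus '-' and '_'.
    static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}_-]+");

    // Extracts the words of a line, lower-cased so matching is case insensitive.
    static List<String> words(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            out.add(m.group().toLowerCase(Locale.ROOT));
        }
        return out;
    }
}
```

With this sketch, punctuation and whitespace act as separators, while "wide_world-2" stays a single word.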


defaultSorter

protected static JOTSearchSorter defaultSorter

WORD_BATCH_SIZE

protected int WORD_BATCH_SIZE
Max words to process in memory before writing to file. Too low, and performance will be slower; too high, and it will use more memory.
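
The trade-off described above can be pictured with a small batching buffer. This is only a sketch under assumptions: the class BatchBufferSketch and its methods are hypothetical, and the real engine's flushing logic (see commitFromMemory) may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of batching words in memory before writing them out.
final class BatchBufferSketch {
    private final int batchSize;                 // plays the role of WORD_BATCH_SIZE
    private final List<String> pending = new ArrayList<>();
    private int flushCount = 0;

    BatchBufferSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    // Buffers one word and flushes once the batch is full.
    void add(String word) {
        pending.add(word);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // A real indexer would write the pending words to the index files here.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        pending.clear();
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }

    int pendingSize() {
        return pending.size();
    }
}
```

A small batch size means more frequent flushes (more file writes); a large one keeps more words in memory between flushes.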

Constructor Detail

JOTSimpleSearchEngine

public JOTSimpleSearchEngine(java.io.File indexRoot)
                      throws java.lang.Exception
Parameters:
indexRoot - root folder where the index data is/will go (empty folder)
Throws:
java.lang.Exception
Method Detail

indexFile

public int indexFile(java.io.File textFile)
              throws java.lang.Exception
Index the file using the file path as the unique key, reindexing only if the file timestamp has changed

Parameters:
textFile -
Returns:
Throws:
java.lang.Exception

indexFile

public int indexFile(java.io.File textFile,
                     boolean onlyIfModified)
              throws java.lang.Exception
Index the file using the filepath as the unique key

Parameters:
textFile -
onlyIfModified - if true, only update if the file timestamp changed since the last indexing
Returns:
Throws:
java.lang.Exception

indexFile

public int indexFile(java.io.File textFile,
                     java.lang.String uniqueId)
              throws java.lang.Exception
Index the file, only if the timestamp changed since the last indexing.

Parameters:
textFile -
uniqueId - a unique id for the file, e.g. absolute path, MD5, etc. If null, the absolute path will be used.
Returns:
Throws:
java.lang.Exception

indexFile

public int indexFile(java.io.File textFile,
                     java.lang.String uniqueId,
                     boolean onlyIfModified)
              throws java.lang.Exception
Index a file (update it if already indexed)

Parameters:
textFile -
onlyIfModified - if true, only update the file if its timestamp changed since the last indexing
uniqueId - a unique id for the file, e.g. absolute path, MD5, etc. If null, the absolute path will be used.
Returns:
number of new keywords added to Index
Throws:
java.lang.Exception
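
The onlyIfModified and uniqueId rules above can be sketched as follows. This is a hedged illustration, not the library's code: the class ModifiedCheckSketch, its shouldIndex methods and the in-memory Properties store are all assumptions. The null fallback to the absolute path follows the parameter description above.

```java
import java.io.File;
import java.util.Properties;

// Hypothetical sketch of the "reindex only if the timestamp changed" decision.
final class ModifiedCheckSketch {
    // Assumed store: uniqueId -> timestamp (ms) recorded at the last indexing.
    private final Properties indexedAt = new Properties();

    // Returns true if indexing should run, and records the timestamp when it does.
    boolean shouldIndex(String id, long lastModified, boolean onlyIfModified) {
        String current = Long.toString(lastModified);
        boolean unchanged = current.equals(indexedAt.getProperty(id));
        if (onlyIfModified && unchanged) {
            return false;
        }
        indexedAt.setProperty(id, current);
        return true;
    }

    // File variant: a null uniqueId falls back to the file's absolute path.
    boolean shouldIndex(File textFile, String uniqueId, boolean onlyIfModified) {
        String id = (uniqueId != null) ? uniqueId : textFile.getAbsolutePath();
        return shouldIndex(id, textFile.lastModified(), onlyIfModified);
    }
}
```

Passing onlyIfModified = false forces a reindex even when the recorded timestamp matches.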

commitFromMemory

protected int commitFromMemory(java.lang.String id,
                               java.util.Hashtable hash)
                        throws java.lang.Exception
Writes the temporary -in memory- hash to the index files.

Parameters:
hash -
id -
Returns:
numberOfNewKeywords
Throws:
java.lang.Exception

indexLineInMemory

protected int indexLineInMemory(java.util.Hashtable hash,
                                java.lang.String lineNb,
                                java.lang.String s)
hash is the hashtable storing the keyword data (keyword -> Vector(lineNumber(String))). Indexes one line of text and returns the number of words found in the line.
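
The keyword -> Vector(lineNumber) structure described above can be sketched like this. The class LineIndexSketch, the indexLine helper and the word pattern are assumptions for illustration; the real method may tokenize or filter words differently.

```java
import java.util.Hashtable;
import java.util.Locale;
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of indexing one line into an in-memory hashtable.
final class LineIndexSketch {
    // Assumed word pattern: Unicode letters or digits, plus '-' and '_'.
    static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}_-]+");

    // Records lineNb under every keyword found in s; returns the number of words found.
    static int indexLine(Hashtable<String, Vector<String>> hash, String lineNb, String s) {
        int count = 0;
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            String keyword = m.group().toLowerCase(Locale.ROOT);
            hash.computeIfAbsent(keyword, k -> new Vector<>()).add(lineNb);
            count++;
        }
        return count;
    }
}
```

Indexing "Java server pages" as line "1" and "java beans" as line "2" leaves the key "java" mapped to both line numbers.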


removeFile

public int removeFile(java.io.File textFile,
                      java.lang.String uniqueId)
               throws java.lang.Exception
remove a file from the index

Parameters:
textFile -
uniqueId - the unique id for the file (as used in indexFile), e.g. absolute path, MD5, etc. If null, the absolute path will be used.
Returns:
number of keywords removed from Index
Throws:
java.lang.Exception

updateKeywordsCount

protected void updateKeywordsCount(int nbNewKeywords)
                            throws java.lang.Exception
Throws:
java.lang.Exception

whipeoutIndex

public static void whipeoutIndex(java.io.File indexRoot)
Completely wipes out the index, so you can reindex from scratch. Simply deletes everything in the indexRoot folder!
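
For readers wondering what "deletes everything in the folder" can look like, here is a minimal sketch using java.nio.file. The class WipeDirectorySketch and the deleteContents helper are hypothetical, and this sketch keeps the root folder itself, which may or may not match the library's behavior.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: delete everything under a folder, keeping the folder itself.
final class WipeDirectorySketch {
    static void deleteContents(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order so children are deleted before their parent folders.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            if (!p.equals(root)) {
                Files.delete(p);
            }
        }
    }
}
```

Because the deletion is irreversible, point it only at a folder dedicated to index data.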


performSearch

public JOTSearchResult[] performSearch(java.lang.String[] keywords,
                                       JOTSearchSorter sorter)
                                throws java.lang.Exception
Return a sorted list of files (uniqueIds) with their scores (1-5)

Parameters:
keywords -
Returns:
Throws:
java.lang.Exception

parseQueryIntoKeywords

public static java.lang.String[] parseQueryIntoKeywords(java.lang.String queryString)
Utility method to parse a user-typed query (e.g. "a java server pAGes ") into keywords (e.g. [java,server,pages])

Parameters:
queryString -
Returns:
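
A minimal sketch that reproduces the documented example ("a java server pAGes " becoming [java,server,pages]) is shown below. The rule that drops "a" is not stated in this documentation; this sketch assumes single-character words are discarded, and the class QueryParserSketch, the MIN_LENGTH constant and the word pattern are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of turning a typed query into keywords.
final class QueryParserSketch {
    // Assumed word pattern: Unicode letters or digits, plus '-' and '_'.
    static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}_-]+");
    // Assumption: words shorter than this are ignored (explains why "a" is dropped).
    static final int MIN_LENGTH = 2;

    static String[] parse(String queryString) {
        List<String> keywords = new ArrayList<>();
        Matcher m = WORD.matcher(queryString);
        while (m.find()) {
            String word = m.group().toLowerCase(Locale.ROOT);
            if (word.length() >= MIN_LENGTH) {
                keywords.add(word);
            }
        }
        return keywords.toArray(new String[0]);
    }
}
```

The real method may instead use a stop-word list or a different minimum length; only the lower-casing and the dropped "a" are visible in the documented example.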

performRawSearch

public JOTRawSearchResult[] performRawSearch(java.lang.String[] keywords)
                                      throws java.lang.Exception
return an array of rawSearchResults (one rawsearchresult per keyword, in the same order as the keywords).

Parameters:
keywords - the keywords to search for, e.g. the words of the query "java server pages"
Returns:
Throws:
java.lang.Exception

main

public static void main(java.lang.String[] args)
for testing / Example

Parameters:
args -