pitt.search.semanticvectors
Class LuceneUtils

java.lang.Object
  extended by pitt.search.semanticvectors.LuceneUtils

public class LuceneUtils
extends java.lang.Object

Utility class for reading extra information from Lucene indexes, including term frequency and document frequency.


Constructor Summary
LuceneUtils(java.lang.String path)
           
 
Method Summary
 int getGlobalTermFreq(org.apache.lucene.index.Term term)
          Gets the global term frequency of a term, i.e. how many times it occurs in the whole corpus.
 float getGlobalTermWeight(org.apache.lucene.index.Term term)
          Gets the global term weight for a term, used in query weighting.
 float getGlobalTermWeightFromString(java.lang.String termString)
          This is a hacky wrapper to get an approximate term weight for a string.
protected  boolean termFilter(org.apache.lucene.index.Term term, java.lang.String[] desiredFields, int nonAlphabet, int minFreq)
          Filters out non-alphabetic terms and those of low frequency
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LuceneUtils

public LuceneUtils(java.lang.String path)
            throws java.io.IOException
Parameters:
path - path to the Lucene index
Throws:
java.io.IOException
Method Detail

getGlobalTermFreq

public int getGlobalTermFreq(org.apache.lucene.index.Term term)
Gets the global term frequency of a term, i.e. how many times it occurs in the whole corpus.

Parameters:
term - whose frequency you want
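
The difference between global term frequency and document frequency can be illustrated without Lucene. The following is a minimal sketch over an in-memory corpus; the class `GlobalTermFreqSketch`, its methods, and the sample documents are hypothetical and are not part of `LuceneUtils`, which reads these counts from the index instead.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: LuceneUtils obtains counts from a Lucene index.
final class GlobalTermFreqSketch {

    /** Global term frequency: every occurrence of the term, across all documents. */
    static int globalTermFreq(List<List<String>> docs, String term) {
        int total = 0;
        for (List<String> doc : docs) {
            for (String token : doc) {
                if (token.equals(term)) {
                    total++;
                }
            }
        }
        return total;
    }

    /** Document frequency: the number of documents containing the term at least once. */
    static int docFreq(List<List<String>> docs, String term) {
        int count = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("the", "cat", "sat"),
            Arrays.asList("the", "cat", "saw", "the", "cat"),
            Arrays.asList("a", "dog"));
        System.out.println(globalTermFreq(docs, "cat")); // prints 3
        System.out.println(docFreq(docs, "cat"));        // prints 2
    }
}
```

Here "cat" occurs three times in total but in only two documents, so the two statistics differ.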

getGlobalTermWeightFromString

public float getGlobalTermWeightFromString(java.lang.String termString)
This is a hacky wrapper to get an approximate term weight for a string.


getGlobalTermWeight

public float getGlobalTermWeight(org.apache.lucene.index.Term term)
Gets the global term weight for a term, used in query weighting. Currently returns some power of the inverse document frequency; you can experiment with the exponent.

Parameters:
term - whose weight you want
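
As a rough illustration of "some power of inverse document frequency", the sketch below computes `(1 / docFreq)^exponent` from a document frequency supplied by the caller. The class `IdfPowerWeightSketch`, the `exponent` parameter, and the handling of unseen terms are assumptions made for this example; the exponent actually used by `getGlobalTermWeight` is not stated in this documentation.

```java
// Hypothetical illustration only: the real exponent and the handling of
// unseen terms in LuceneUtils are not documented here.
final class IdfPowerWeightSketch {

    /**
     * Returns (1 / docFreq) raised to the given exponent.
     * Assumption: a term that occurs in no document gets weight 0.
     */
    static float idfPowerWeight(int docFreq, double exponent) {
        if (docFreq <= 0) {
            return 0f;
        }
        return (float) Math.pow(1.0 / docFreq, exponent);
    }

    public static void main(String[] args) {
        // Rarer terms receive larger weights; a small exponent flattens the difference.
        System.out.println(idfPowerWeight(10, 0.5));
        System.out.println(idfPowerWeight(1000, 0.5));
    }
}
```

With a positive exponent, the weight decreases as the document frequency grows, so common terms contribute less to a query than rare ones. An exponent close to zero makes all weights nearly equal.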

termFilter

protected boolean termFilter(org.apache.lucene.index.Term term,
                             java.lang.String[] desiredFields,
                             int nonAlphabet,
                             int minFreq)
                      throws java.io.IOException
Filters out non-alphabetic terms and those of low frequency

Parameters:
term - Term to be filtered.
desiredFields - only terms in these fields are kept
nonAlphabet - number of non-alphabetic characters allowed in the term; -1 for no character filtering
minFreq - minimum global frequency allowed

Thanks to Vidya Vasuki for refactoring and bug repair.
Throws:
java.io.IOException
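
The three checks described by the parameters above can be sketched in plain Java. This is an illustration under stated assumptions, not the library's implementation: the class `TermFilterSketch` is hypothetical, the field and text are passed as strings rather than as a Lucene `Term`, and the global frequency is supplied by the caller instead of being looked up in the index.

```java
import java.util.Arrays;

// Hypothetical illustration only: field, text and frequency are passed in
// directly instead of being read from a Lucene Term and index.
final class TermFilterSketch {

    /** Returns true if the term passes the field, character and frequency checks. */
    static boolean termFilter(String field, String text, int globalFreq,
                              String[] desiredFields, int nonAlphabet, int minFreq) {
        // Keep only terms from one of the desired fields.
        if (!Arrays.asList(desiredFields).contains(field)) {
            return false;
        }
        // A value of -1 disables character filtering altogether.
        if (nonAlphabet != -1) {
            int nonLetters = 0;
            for (int i = 0; i < text.length(); i++) {
                if (!Character.isLetter(text.charAt(i))) {
                    nonLetters++;
                }
            }
            if (nonLetters > nonAlphabet) {
                return false;
            }
        }
        // Reject terms that are too rare in the corpus.
        return globalFreq >= minFreq;
    }

    public static void main(String[] args) {
        String[] fields = {"contents"};
        System.out.println(termFilter("contents", "hello", 5, fields, 0, 2));  // prints true
        System.out.println(termFilter("contents", "h3llo", 5, fields, 0, 2));  // prints false
        System.out.println(termFilter("contents", "h3llo", 5, fields, -1, 2)); // prints true
    }
}
```

This sketch treats any character for which `Character.isLetter` returns false as non-alphabetic; the exact definition used by `termFilter` may differ.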