pitt.search.semanticvectors
Class TermTermVectorsFromLucene

java.lang.Object
  extended by pitt.search.semanticvectors.TermTermVectorsFromLucene
All Implemented Interfaces:
VectorStore

public class TermTermVectorsFromLucene
extends java.lang.Object
implements VectorStore

Implementation of vector store that creates term-by-term co-occurrence vectors by iterating through all the documents in a Lucene index. This class implements a sliding context window approach, as used by Burgess and Lund (HAL) and Schütze, amongst others. It uses a sparse representation for the basic document vectors, which saves considerable space for collections with many individual documents.
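
The sliding context window idea can be illustrated with a minimal sketch. It is not this class's actual implementation: the class name SlidingWindowSketch, the dimension parameter, and the choice to centre the window on the focus term (windowSize / 2 positions on each side) are all assumptions made for illustration. Each term's vector is formed by adding the basic vectors of the terms that appear near it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not the code of TermTermVectorsFromLucene.
class SlidingWindowSketch {

    /**
     * Builds co-occurrence vectors for a token sequence by adding, for each
     * focus term, the basic vectors of its neighbours inside the window.
     */
    static Map<String, float[]> build(String[] tokens,
                                      Map<String, float[]> basicVectors,
                                      int windowSize,
                                      int dimension) {
        // Assumption: the window is centred on the focus term.
        int radius = windowSize / 2;
        Map<String, float[]> termVectors = new HashMap<>();
        for (int focus = 0; focus < tokens.length; focus++) {
            float[] target =
                termVectors.computeIfAbsent(tokens[focus], t -> new float[dimension]);
            int start = Math.max(0, focus - radius);
            int end = Math.min(tokens.length - 1, focus + radius);
            for (int j = start; j <= end; j++) {
                if (j == focus) {
                    continue; // a term does not co-occur with itself at the same position
                }
                float[] neighbour = basicVectors.get(tokens[j]);
                if (neighbour == null) {
                    continue; // no basic vector for this term, so skip it
                }
                for (int d = 0; d < dimension; d++) {
                    target[d] += neighbour[d];
                }
            }
        }
        return termVectors;
    }

    public static void main(String[] args) {
        Map<String, float[]> basic = new HashMap<>();
        basic.put("a", new float[] {1, 0, 0});
        basic.put("b", new float[] {0, 1, 0});
        basic.put("c", new float[] {0, 0, 1});
        Map<String, float[]> vectors = build(new String[] {"a", "b", "c"}, basic, 3, 3);
        // "b" has neighbours "a" and "c", so its vector is the sum of their basic vectors.
        System.out.println(Arrays.toString(vectors.get("b")));
    }
}
```

In this sketch, terms near the start or end of the sequence simply see a truncated window.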

Author:
Trevor Cohen, Dominic Widdows.

Constructor Summary
TermTermVectorsFromLucene(java.lang.String indexDir, int seedLength, int minFreq, int nonAlphabet, int windowSize, VectorStore basicTermVectors, java.lang.String[] fieldsToIndex)
           
 
Method Summary
 java.util.Enumeration getAllVectors()
           
 VectorStore getBasicTermVectors()
           
 java.lang.String[] getFieldsToIndex()
           
 org.apache.lucene.index.IndexReader getIndexReader()
           
 int getNumVectors()
           
 float[] getVector(java.lang.Object term)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TermTermVectorsFromLucene

public TermTermVectorsFromLucene(java.lang.String indexDir,
                                 int seedLength,
                                 int minFreq,
                                 int nonAlphabet,
                                 int windowSize,
                                 VectorStore basicTermVectors,
                                 java.lang.String[] fieldsToIndex)
                          throws java.io.IOException,
                                 java.lang.RuntimeException
Parameters:
indexDir - Directory containing Lucene index.
seedLength - Number of +1 or -1 entries in basic vectors. Should be even to give same number of each.
minFreq - The minimum term frequency for a term to be indexed.
windowSize - The size of the sliding context window.
fieldsToIndex - These fields will be indexed. If null, all fields will be indexed.
Throws:
java.io.IOException
java.lang.RuntimeException
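
To make the seedLength parameter concrete, the following sketch generates a sparse random vector with seedLength non-zero entries, half of them +1 and half -1. It is an illustration under stated assumptions, not the representation this class uses internally: the class SparseSeedVectorSketch, its method names, and the dimension parameter are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only; names and storage layout are assumptions.
class SparseSeedVectorSketch {

    /** Picks seedLength distinct positions in [0, dimension) with a partial shuffle. */
    static int[] randomPositions(int dimension, int seedLength, Random random) {
        if (seedLength > dimension) {
            throw new IllegalArgumentException("seedLength must not exceed dimension");
        }
        int[] indices = new int[dimension];
        for (int i = 0; i < dimension; i++) {
            indices[i] = i;
        }
        // Partial Fisher-Yates shuffle: only the first seedLength slots are needed.
        for (int i = 0; i < seedLength; i++) {
            int j = i + random.nextInt(dimension - i);
            int tmp = indices[i];
            indices[i] = indices[j];
            indices[j] = tmp;
        }
        return Arrays.copyOf(indices, seedLength);
    }

    /** Expands the sparse form: the first half of the positions get +1, the rest -1. */
    static float[] toDense(int[] positions, int dimension) {
        float[] dense = new float[dimension];
        int half = positions.length / 2;
        for (int i = 0; i < positions.length; i++) {
            dense[positions[i]] = (i < half) ? 1f : -1f;
        }
        return dense;
    }

    public static void main(String[] args) {
        int dimension = 200;
        int seedLength = 20; // even, so there are ten +1 and ten -1 entries
        int[] positions = randomPositions(dimension, seedLength, new Random());
        float[] dense = toDense(positions, dimension);
        float sum = 0f;
        for (float value : dense) {
            sum += value;
        }
        System.out.println("non-zero entries: " + positions.length + ", sum: " + sum);
    }
}
```

Storing only the seedLength positions rather than every component is what makes a sparse form compact; with an odd seedLength this sketch would produce one more -1 than +1, which is why an even value is recommended.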
Method Detail

getIndexReader

public org.apache.lucene.index.IndexReader getIndexReader()
Returns:
The object's indexReader.

getBasicTermVectors

public VectorStore getBasicTermVectors()
Returns:
The object's basicTermVectors.

getFieldsToIndex

public java.lang.String[] getFieldsToIndex()

getVector

public float[] getVector(java.lang.Object term)
Specified by:
getVector in interface VectorStore
Parameters:
term - the object whose vector you want to look up
Returns:
a vector (of floats)

getAllVectors

public java.util.Enumeration getAllVectors()
Specified by:
getAllVectors in interface VectorStore
Returns:
an enumeration of all the object vectors in the store.

getNumVectors

public int getNumVectors()
Specified by:
getNumVectors in interface VectorStore
Returns:
a count of the number of vectors in the store.
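
The three VectorStore methods above are typically used together: look up one vector by term, or enumerate the whole store and compare against the count. The sketch below shows that pattern against a hypothetical stand-in interface, SimpleVectorStore, which mirrors the signatures listed here. The in-memory implementation and the assumption that the enumeration yields float[] elements are illustrative only; the element type returned by the real getAllVectors() is not stated on this page.

```java
import java.util.Collections;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; SimpleVectorStore is a stand-in, not the real VectorStore.
class VectorStoreSketch {

    /** Hypothetical interface mirroring the three method signatures documented above. */
    interface SimpleVectorStore {
        float[] getVector(Object term);
        Enumeration<float[]> getAllVectors(); // assumption: elements are float[]
        int getNumVectors();
    }

    /** Minimal in-memory store used to exercise the interface. */
    static class InMemoryStore implements SimpleVectorStore {
        private final Map<Object, float[]> vectors = new LinkedHashMap<>();

        void put(Object term, float[] vector) {
            vectors.put(term, vector);
        }

        public float[] getVector(Object term) {
            return vectors.get(term);
        }

        public Enumeration<float[]> getAllVectors() {
            return Collections.enumeration(vectors.values());
        }

        public int getNumVectors() {
            return vectors.size();
        }
    }

    /** Walks the enumeration and counts how many vectors it yields. */
    static int countEnumerated(SimpleVectorStore store) {
        int count = 0;
        Enumeration<float[]> all = store.getAllVectors();
        while (all.hasMoreElements()) {
            all.nextElement();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        store.put("lucene", new float[] {1f, -1f, 0f});
        store.put("vector", new float[] {0f, 1f, -1f});
        System.out.println("reported: " + store.getNumVectors()
            + ", enumerated: " + countEnumerated(store));
    }
}
```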