pitt.search.semanticvectors
Class TermVectorsFromLucene

java.lang.Object
  extended by pitt.search.semanticvectors.TermVectorsFromLucene
All Implemented Interfaces:
VectorStore

public class TermVectorsFromLucene
extends java.lang.Object
implements VectorStore

Implementation of VectorStore that creates term vectors by iterating through all the terms in a Lucene index. The basic document vectors are held in a sparse representation, which saves considerable space for collections with many individual documents.

Author:
Dominic Widdows, Trevor Cohen.

Constructor Summary
TermVectorsFromLucene(java.lang.String indexDir, int seedLength, int minFreq, int nonAlphabet, VectorStore basicDocVectors, java.lang.String[] fieldsToIndex)
           
 
Method Summary
 java.util.Enumeration getAllVectors()
           
 VectorStore getBasicDocVectors()
           
 java.lang.String[] getFieldsToIndex()
           
 org.apache.lucene.index.IndexReader getIndexReader()
           
 int getNumVectors()
           
 float[] getVector(java.lang.Object term)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TermVectorsFromLucene

public TermVectorsFromLucene(java.lang.String indexDir,
                             int seedLength,
                             int minFreq,
                             int nonAlphabet,
                             VectorStore basicDocVectors,
                             java.lang.String[] fieldsToIndex)
                      throws java.io.IOException,
                             java.lang.RuntimeException
Parameters:
indexDir - Directory containing Lucene index.
seedLength - Number of nonzero (+1 or -1) entries in each basic vector. Should be even so that each vector has the same number of +1 and -1 entries.
minFreq - The minimum term frequency for a term to be indexed.
basicDocVectors - The store of basic document vectors. May be null, in which case the constructor builds this table. If non-null, its identifiers must correspond to the Lucene document numbers.
fieldsToIndex - These fields will be indexed. If null, all fields will be indexed.
Throws:
java.io.IOException
java.lang.RuntimeException
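
The seedLength parameter describes basic vectors with a small, even number of +1 and -1 entries. The following is a minimal sketch of that idea, not the library's own code. It is included to show why a sparse form is compact: only the nonzero positions need to be stored. The class name, method names, the dimension argument, and the "first half +1, second half -1" convention are all assumptions made for illustration.

```java
import java.util.Random;

/** Hypothetical sketch of a sparse basic vector; not part of SemanticVectors. */
class SparseSeedVectorSketch {

  /**
   * Picks seedLength distinct positions in [0, dimension).
   * Convention assumed by this sketch: the first half of the returned
   * positions carry +1 and the second half carry -1.
   */
  static int[] randomSparseIndices(int dimension, int seedLength, Random random) {
    if (seedLength < 0 || seedLength % 2 != 0 || seedLength > dimension) {
      throw new IllegalArgumentException(
          "seedLength must be even, non-negative, and at most dimension");
    }
    boolean[] used = new boolean[dimension];
    int[] positions = new int[seedLength];
    int filled = 0;
    while (filled < seedLength) {
      int candidate = random.nextInt(dimension);
      if (!used[candidate]) {          // skip positions that are already taken
        used[candidate] = true;
        positions[filled++] = candidate;
      }
    }
    return positions;
  }

  /** Expands the sparse positions into a dense float array of the given dimension. */
  static float[] toDense(int[] positions, int dimension) {
    float[] dense = new float[dimension];
    int half = positions.length / 2;
    for (int i = 0; i < positions.length; i++) {
      dense[positions[i]] = (i < half) ? 1.0f : -1.0f;
    }
    return dense;
  }

  public static void main(String[] args) {
    int[] sparse = randomSparseIndices(200, 20, new Random(0));
    float[] dense = toDense(sparse, 200);
    System.out.println("sparse entries stored: " + sparse.length
        + ", dense entries stored: " + dense.length);
  }
}
```

With an even seedLength, the dense expansion contains exactly seedLength / 2 entries of each sign, and all other entries are zero.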
Method Detail

getBasicDocVectors

public VectorStore getBasicDocVectors()
Returns:
The object's basicDocVectors.

getIndexReader

public org.apache.lucene.index.IndexReader getIndexReader()
Returns:
The object's indexReader.

getFieldsToIndex

public java.lang.String[] getFieldsToIndex()
Returns:
The object's list of Lucene fields to index.

getVector

public float[] getVector(java.lang.Object term)
Specified by:
getVector in interface VectorStore
Parameters:
term - the object whose vector you want to look up
Returns:
the vector for the given term, as an array of floats

getAllVectors

public java.util.Enumeration getAllVectors()
Specified by:
getAllVectors in interface VectorStore
Returns:
an enumeration of all the object vectors in the store.

getNumVectors

public int getNumVectors()
Specified by:
getNumVectors in interface VectorStore
Returns:
the number of vectors in the store.
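
The three methods specified by VectorStore (getVector, getAllVectors, getNumVectors) can be exercised together as in the sketch below. It does not use the real library: SimpleVectorStore and InMemoryStore are hypothetical stand-ins that mirror only the method names documented on this page. The enumeration element type (float[] here) and the null result for an unknown term are assumptions, since this page does not specify either.

```java
import java.util.Collections;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-ins that mirror the three VectorStore methods listed above. */
class VectorStoreUsageSketch {

  /** Assumed shape of the interface; the element type of the enumeration is a guess. */
  interface SimpleVectorStore {
    float[] getVector(Object term);
    Enumeration<float[]> getAllVectors();
    int getNumVectors();
  }

  /** Minimal in-memory implementation used only to make the sketch runnable. */
  static class InMemoryStore implements SimpleVectorStore {
    private final Map<Object, float[]> vectors = new LinkedHashMap<>();

    void put(Object term, float[] vector) {
      vectors.put(term, vector);
    }

    @Override
    public float[] getVector(Object term) {
      return vectors.get(term);  // null for unknown terms (assumption of this sketch)
    }

    @Override
    public Enumeration<float[]> getAllVectors() {
      return Collections.enumeration(vectors.values());
    }

    @Override
    public int getNumVectors() {
      return vectors.size();
    }
  }

  /** Walks the enumeration and counts how many vectors it yields. */
  static int countEnumerated(SimpleVectorStore store) {
    int count = 0;
    Enumeration<float[]> all = store.getAllVectors();
    while (all.hasMoreElements()) {
      all.nextElement();
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    InMemoryStore store = new InMemoryStore();
    store.put("apple", new float[] {1f, 0f});
    store.put("pear", new float[] {0f, -1f});
    System.out.println("getNumVectors: " + store.getNumVectors());
    System.out.println("enumerated:    " + countEnumerated(store));
  }
}
```

In this sketch the count returned by getNumVectors matches the number of elements produced by getAllVectors, which is the relationship the two method descriptions suggest.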