KDE 4.2 API Reference
NepomukDaemons

Nepomuk::CLuceneTokenizer Class Reference

A grammar-based tokenizer constructed with JavaCC.

#include <clucenetokenizer.h>



Public Member Functions

 CLuceneTokenizer (CL_NS(util)::Reader *reader)
bool next (CL_NS(analysis)::Token *token)
bool ReadAlphaNum (const TCHAR prev, CL_NS(analysis)::Token *t)
bool ReadApostrophe (CL_NS(util)::StringBuffer *str, CL_NS(analysis)::Token *t)
bool ReadAt (CL_NS(util)::StringBuffer *str, CL_NS(analysis)::Token *t)
bool ReadCJK (const TCHAR prev, CL_NS(analysis)::Token *t)
bool ReadCompany (CL_NS(util)::StringBuffer *str, CL_NS(analysis)::Token *t)
bool ReadNumber (const TCHAR *previousNumber, const TCHAR prev, CL_NS(analysis)::Token *t)
 ~CLuceneTokenizer ()

Detailed Description

A grammar-based tokenizer constructed with JavaCC.

This should be a good tokenizer for most European-language documents:

  • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
  • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.
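The three rules above can be illustrated with a small standalone sketch. It is not the Nepomuk or CLucene implementation: the names `sketchTokenize`, `trimNonAlnum`, and `sketchTokenizeExample` are hypothetical, the input is assumed to be plain ASCII, and the heuristics (any `@` marks an email address, any hyphen plus a digit marks a product number) are simplifications chosen only to make the described behaviour concrete.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: true for ASCII letters and digits.
static bool isAlnumChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: strips leading and trailing non-alphanumeric characters,
// so a dot followed by whitespace (or end of input) is dropped.
static std::string trimNonAlnum(const std::string& s) {
    std::size_t begin = 0, end = s.size();
    while (begin < end && !isAlnumChar(s[begin])) ++begin;
    while (end > begin && !isAlnumChar(s[end - 1])) --end;
    return s.substr(begin, end - begin);
}

// Simplified illustration of the documented rules; not the real grammar.
std::vector<std::string> sketchTokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::istringstream in(text);
    std::string chunk;
    while (in >> chunk) {                      // split on whitespace first
        const std::string word = trimNonAlnum(chunk);
        if (word.empty()) continue;

        bool hasDigit = false, hasHyphen = false, hasAt = false;
        for (char c : word) {
            if (std::isdigit(static_cast<unsigned char>(c))) hasDigit = true;
            if (c == '-') hasHyphen = true;
            if (c == '@') hasAt = true;
        }
        // Email addresses and hyphenated product numbers stay whole.
        if (hasAt || (hasHyphen && hasDigit)) {
            tokens.push_back(word);
            continue;
        }
        // Otherwise split at punctuation, keeping interior dots in the token.
        std::string piece;
        for (char c : word) {
            if (isAlnumChar(c) || c == '.') {
                piece += c;
            } else {
                const std::string t = trimNonAlnum(piece);
                if (!t.empty()) tokens.push_back(t);
                piece.clear();
            }
        }
        const std::string last = trimNonAlnum(piece);
        if (!last.empty()) tokens.push_back(last);
    }
    return tokens;
}

// Usage example: each assertion mirrors one of the rules listed above.
void sketchTokenizeExample() {
    const std::vector<std::string> t =
        sketchTokenize("Mail me@example.org about the well-known X-200 at www.kde.org.");
    const std::vector<std::string> expected = {
        "Mail", "me@example.org", "about", "the",
        "well", "known", "X-200", "at", "www.kde.org"};
    assert(t == expected);
}
```

The real class works on a CLucene `Reader` with `TCHAR` input and assigns a token type as well; the sketch only models where token boundaries fall.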

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

Definition at line 54 of file clucenetokenizer.h.


Constructor & Destructor Documentation

Nepomuk::CLuceneTokenizer::CLuceneTokenizer ( CL_NS(util)::Reader *  reader  ) 

Nepomuk::CLuceneTokenizer::~CLuceneTokenizer (  ) 

Definition at line 132 of file clucenetokenizer.cpp.


Member Function Documentation

bool Nepomuk::CLuceneTokenizer::next ( CL_NS(analysis)::Token *  token  ) 

Stores the next token in the stream in token and returns true, or returns false at end-of-stream.

The returned token's type is set to an element of CLuceneTokenizerConstants::tokenImage.
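The calling pattern implied by this signature is a loop that reuses one token object until `next()` returns false. The sketch below shows that pattern with hypothetical stand-ins (`SketchToken`, `SketchTokenizer`, `collectTokens`) so that it compiles without CLucene; the real class takes a `CL_NS(util)::Reader*` and fills a `CL_NS(analysis)::Token`, and the type string used here is only a placeholder, not a confirmed entry of `tokenImage`.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for CL_NS(analysis)::Token.
struct SketchToken {
    std::string text;  // token text
    std::string type;  // token type label
};

// Hypothetical stand-in for the tokenizer; it splits on whitespace only.
class SketchTokenizer {
public:
    explicit SketchTokenizer(const std::string& input) : m_in(input) {}

    // Fills *token and returns true, or returns false at end-of-stream.
    bool next(SketchToken* token) {
        assert(token != nullptr);
        std::string word;
        if (!(m_in >> word)) {
            return false;              // nothing left to read
        }
        token->text = word;
        token->type = "<ALPHANUM>";    // placeholder label (assumption)
        return true;
    }

private:
    std::istringstream m_in;
};

// Drains the tokenizer with the usual while (next(&token)) loop.
std::vector<std::string> collectTokens(const std::string& input) {
    SketchTokenizer tokenizer(input);
    SketchToken token;                 // reused for every call to next()
    std::vector<std::string> texts;
    while (tokenizer.next(&token)) {
        texts.push_back(token.text);
    }
    return texts;
}
```

Reusing a single token object across iterations avoids an allocation per token, which is presumably why the function takes an output parameter instead of returning a new token.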

bool Nepomuk::CLuceneTokenizer::ReadAlphaNum ( const TCHAR  prev, CL_NS(analysis)::Token *  t )

bool Nepomuk::CLuceneTokenizer::ReadApostrophe ( CL_NS(util)::StringBuffer *  str, CL_NS(analysis)::Token *  t )

bool Nepomuk::CLuceneTokenizer::ReadAt ( CL_NS(util)::StringBuffer *  str, CL_NS(analysis)::Token *  t )

bool Nepomuk::CLuceneTokenizer::ReadCJK ( const TCHAR  prev, CL_NS(analysis)::Token *  t )

bool Nepomuk::CLuceneTokenizer::ReadCompany ( CL_NS(util)::StringBuffer *  str, CL_NS(analysis)::Token *  t )

bool Nepomuk::CLuceneTokenizer::ReadNumber ( const TCHAR *  previousNumber, const TCHAR  prev, CL_NS(analysis)::Token *  t )


The documentation for this class was generated from the following files:
  • clucenetokenizer.h
  • clucenetokenizer.cpp

Generated for API Reference by doxygen 1.5.7
This website is maintained by Adriaan de Groot and Allen Winter.