FTP archive directory /os/Linux/distr/salix/sbo/15.0/libraries/libexttextcat/

Contents of README:

Libtextcat is a library with functions that implement the
classification technique described in Cavnar & Trenkle, "N-Gram-Based
Text Categorization". It was primarily developed for language
guessing, a task on which it is known to perform with near-perfect
accuracy.

The central idea of the Cavnar & Trenkle technique is to calculate a
"fingerprint" of a document with an unknown category, and compare this
with the fingerprints of a number of documents of which the categories
are known. The categories of the closest matches are output as the
classification. A fingerprint is a list of the most frequent n-grams
occurring in a document, ordered by frequency. Fingerprints are
compared with a simple out-of-place metric. See the article for more
details.
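
The fingerprint and out-of-place comparison described above can be
sketched in a few lines of Python. This is an illustration of the
technique only, not the library's own C code or API: the function names,
the "_" word-boundary marker, the n-gram lengths 1 to 5, and the
fingerprint size of 400 are all assumptions made for this example.

```python
from collections import Counter


def fingerprint(text, max_n=5, top=400):
    """Return the `top` most frequent character n-grams, most frequent first.

    max_n and top are illustrative defaults, not values taken from the library.
    """
    counts = Counter()
    for word in text.lower().split():
        # Mark word boundaries with "_" (an assumption made for this sketch).
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_fp, cat_fp):
    """Sum, over the document's n-grams, of how far each rank is from its
    rank in the category fingerprint. Missing n-grams cost a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(cat_fp)}
    max_penalty = len(cat_fp)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_fp)
    )


def classify(text, category_fps):
    """Return the name of the category whose fingerprint is closest."""
    doc_fp = fingerprint(text)
    return min(category_fps, key=lambda name: out_of_place(doc_fp, category_fps[name]))
```

To use the sketch, build one fingerprint per known category from sample
text, collect them in a dictionary keyed by category name, and pass that
dictionary to classify() together with the unknown document. A lower
out-of-place score means a closer match.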

Considerable effort went into making this implementation fast and
efficient. The language guesser processes over 100 documents/second on
a simple PC, which makes it practical for many uses. It was developed
for use in our webcrawler and search engine software, in which it
handles millions of documents a day.

 Name                                      Last modified      Size  
 Parent Directory                                               -   
 README                                    14-Jun-2018 16:01  1.0K  
 libexttextcat.SlackBuild                  11-Mar-2022 06:34  3.2K  
 libexttextcat.info                        11-Mar-2022 06:34  335   
 slack-desc                                14-Jun-2018 16:01  1.0K
