ucsintro - A first introduction to UCS/Perl
UCS is a set of libraries and tools intended for the empirical study of cooccurrence statistics. Its major uses are to apply such statistics, called association measures, to cooccurrence data obtained from a corpus, and to evaluate the resulting association scores and rankings against (manually annotated) reference data.
The frequency data extracted from a given corpus for a given type of cooccurrences consists of a list of pair types with their frequency signatures (i.e. joint and marginal frequencies), and is referred to as a data set. See (Evert 2004) for a detailed explanation of these concepts, different types of cooccurrences, and correct methods for obtaining frequency data. Data sets, stored in a special .ds file format, are the fundamental objects of the UCS toolkit. Most UCS programs manipulate or display such data set files.
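As an illustration of what a frequency signature contains, the following minimal sketch derives joint and marginal frequencies from a handful of pair tokens in plain Perl. It does not use the UCS modules. The sample pairs and the helper name frequency_signatures are made up for this example, and the field names l1, l2, f, f1, f2 and N are assumptions chosen for readability.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of UCS): compute frequency signatures
# from a list of pair tokens, each given as [first, second].
sub frequency_signatures {
    my @tokens = @_;
    my (%joint, %first, %second);
    for my $token (@tokens) {
        my ($l1, $l2) = @$token;
        $joint{$l1}{$l2}++;    # joint frequency f of the pair type
        $first{$l1}++;         # marginal frequency f1 of the first component
        $second{$l2}++;        # marginal frequency f2 of the second component
    }
    my $N = scalar @tokens;    # sample size: total number of pair tokens
    my @rows;
    for my $l1 (sort keys %joint) {
        for my $l2 (sort keys %{ $joint{$l1} }) {
            push @rows, {
                l1 => $l1,
                l2 => $l2,
                f  => $joint{$l1}{$l2},
                f1 => $first{$l1},
                f2 => $second{$l2},
                N  => $N,
            };
        }
    }
    return @rows;
}

# Made-up sample of pair tokens.
my @tokens = (
    [qw(black coffee)], [qw(black coffee)], [qw(black box)],
    [qw(strong coffee)], [qw(strong tea)],
);

# One output row per pair type: components, then f, f1, f2, N.
my @signatures = frequency_signatures(@tokens);
printf "%-8s %-8s %2d %2d %2d %2d\n", @{$_}{qw(l1 l2 f f1 f2 N)}
    for @signatures;
```

Each row of the output corresponds to one pair type. A real data set file stores the same kind of information, but in the .ds format described in the ucsfile manpage.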
The UCS implementation relies heavily on the programming language Perl (http://www.perl.com/) and the free statistical environment R (http://www.r-project.org/) as a library of mathematical and statistical functions. The core of UCS is written in Perl (the UCS/Perl part), but there is also a small library of R functions for interactive work within R (the UCS/R part). UCS/Perl uses R as a back-end, making the most important statistical functions available through a Perl module.
UCS/Perl is mainly a collection of Perl modules, each of which covers a specific task (see the module list below).
Most UCS programs will be custom-built scripts, using the library of support functions provided by the UCS/Perl modules. Loading a data set, annotating it with association scores from one or more measures, and sorting it in various ways can be done with a few lines of Perl code. There are also some ready-made programs in UCS/Perl that perform such standard tasks, operating on data set files. A substantial part of the UCS/Perl functionality is thus accessible from the command line, at the cost of some additional overhead compared to a custom script (which operates on in-memory representations).
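The annotate-and-sort workflow can be sketched in plain Perl without the UCS modules. The example below is an assumption-laden illustration, not the UCS API: the data set is a made-up array of hashes (not a UCS::DS::Memory object), the helper mi_score is hypothetical, and pointwise mutual information is used only as a well-known example of an association measure.

```perl
use strict;
use warnings;

# Made-up in-memory data set: one hash per pair type with its
# frequency signature (f, f1, f2, N). This is NOT the UCS::DS::Memory API.
my @data_set = (
    { l1 => 'black',  l2 => 'coffee', f => 20, f1 => 100,  f2 => 50,  N => 10000 },
    { l1 => 'strong', l2 => 'tea',    f => 10, f1 => 80,   f2 => 20,  N => 10000 },
    { l1 => 'the',    l2 => 'house',  f => 30, f1 => 2000, f2 => 100, N => 10000 },
);

# Hypothetical helper: pointwise mutual information, i.e. log2 of the
# observed joint frequency over the frequency expected under independence
# (f1 * f2 / N).
sub mi_score {
    my ($row) = @_;
    my $expected = $row->{f1} * $row->{f2} / $row->{N};
    return log($row->{f} / $expected) / log(2);
}

# Annotate every pair type with its score, then sort by descending score.
$_->{mi} = mi_score($_) for @data_set;
my @ranked = sort { $b->{mi} <=> $a->{mi} } @data_set;

printf "%-8s %-8s %6.2f\n", @{$_}{qw(l1 l2 mi)} for @ranked;
```

A script built on the UCS/Perl modules would follow the same three steps, with the modules handling file access and the computation of association scores. See the manpages of UCS::DS::Memory and UCS::AM for the actual interfaces.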
Below, you will find a list of the general documentation files, Perl modules, and programs that are included in the UCS/Perl distribution. Manpages for all modules and programs (as well as the general documentation) are easily accessible with the ucsdoc program, and can also be formatted for printing.
ucsdoc ucsintro # this introduction
ucsdoc ucsfile # description of the UCS data set file format (.ds)
ucsdoc ucsexp # UCS expressions and wildcards
ucsdoc ucsam # overview of built-in association measures
use UCS; # core library
use UCS::File; # file access utilities
use UCS::R; # interface to UCS/R
use UCS::SFunc; # special functions and statistical distributions
use UCS::Expression; # Perl code interspersed with UCS variables
use UCS::Expression::Func; # utility functions available in UCS expressions
use UCS::AM; # implementations of various association measures
use UCS::AM::HTest; # add-on package: variants of hypothesis tests
use UCS::AM::Parametric; # add-on package: parametric association measures
use UCS::DS; # data sets ...
use UCS::DS::Stream; # i/o streams for data set files
use UCS::DS::Memory; # in-memory representation of data sets
use UCS::DS::Format; # ASCII formatter (+ other formats)
See the respective manpages (ucsdoc ModuleName) for more information.
ucsdoc # front-end to perldoc
ucs-config # automatic configuration of UCS/Perl scripts
ucs-tool # find and run user-contributed UCS/Perl scripts
ucs-list-am # list built-in association measures & add-on packages
ucs-make-tables # compute frequency signatures from list of pair tokens
ucs-summarize # print (statistical) summaries for selected variables
ucs-select # select rows and/or columns from a data set file
ucs-add # add variables to a data set file
ucs-join # combine rows and/or columns from two data sets
ucs-sort # sort data set file by specified attribute(s)
ucs-info # display information from header of data set file
ucs-print # format data set as ASCII table (for viewing and printing)
See the respective manpages (ucsdoc ProgramName) for more information.
UCS stands for Utilities for Cooccurrence Statistics.
Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, University of Stuttgart, Germany.
On-line repository of association measures: http://www.collocations.de/
Copyright (C) 2004 by Stefan Evert.
This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.