ucsam - Association measures in UCS/Perl
The statistical analysis of cooccurrence data is usually based on association measures, mathematical formulae that compute an association score from the joint and marginal frequencies of a pair type (which are called a frequency signature in UCS). This score is a single floating-point number indicating the amount of statistical association between the components of the pair type. Association measures can often be written conveniently in terms of a contingency table of observed frequencies and the corresponding expected frequencies under the null hypothesis that there is no association.
For instance, the word pair black box occurs 123 times in the British National Corpus (BNC), so its joint frequency is f = 123. The adjective black has a total of 13,168 occurrences, and the noun box has 1,810 occurrences, giving marginal frequencies of f1 = 13,168 and f2 = 1,810. From these data, the MI measure computes an association score of 1.4, while the log.likelihood measure computes a score of 567.72. Both scores indicate a clear positive association, but they cannot be compared directly: each measure has its own scale.
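The step from a frequency signature to a contingency table and an MI score can be sketched in plain Perl. This is an illustration only and does not use the UCS modules: the function names contingency_tables and mi_score are hypothetical, MI is assumed to be the base 10 logarithm of observed over expected joint frequency, and the signature is a toy one, because the sample size N needed for the black box example is not stated in the text.

```perl
use strict;
use warnings;

# Build the observed (O) and expected (E) 2x2 contingency tables from a
# frequency signature: joint frequency f, marginals f1 and f2, sample size N.
# (Hypothetical helper, not part of UCS.)
sub contingency_tables {
    my ($f, $f1, $f2, $N) = @_;
    my %O = (
        O11 => $f,
        O12 => $f1 - $f,
        O21 => $f2 - $f,
        O22 => $N - $f1 - $f2 + $f,
    );
    # Expected frequencies under independence: row total * column total / N
    my %E = (
        E11 => $f1 * $f2 / $N,
        E12 => $f1 * ($N - $f2) / $N,
        E21 => ($N - $f1) * $f2 / $N,
        E22 => ($N - $f1) * ($N - $f2) / $N,
    );
    return (\%O, \%E);
}

# MI sketch: log10(O11 / E11). Assumes f > 0.
sub mi_score {
    my ($f, $f1, $f2, $N) = @_;
    my (undef, $E) = contingency_tables($f, $f1, $f2, $N);
    return log($f / $E->{E11}) / log(10);
}

# Toy signature (made-up numbers, not taken from the BNC)
my ($O, $E) = contingency_tables(100, 1_000, 1_000, 1_000_000);
printf "O11=%d O12=%d O21=%d O22=%d\n", @{$O}{qw(O11 O12 O21 O22)};
printf "E11=%.2f MI=%.2f\n", $E->{E11}, mi_score(100, 1_000, 1_000, 1_000_000);
```

With the toy signature, the pair occurs 100 times where one occurrence would be expected by chance, so the base 10 MI score comes out as 2.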
A more detailed explanation of contingency tables and association scores as well as a comprehensive inventory of association measures with equations given in terms of observed and expected frequencies can be found on-line at http://www.collocations.de/AM/. Also see the ucsfile manpage to find out how frequency signatures, contingency tables and association scores are represented in UCS data set files.
UCS/Perl supports more than 40 different association measures and variants. In order to keep them manageable, the measures are organised in several packages: a core set of widely-used "standard" measures is complemented by add-on packages for advanced users. Each package is implemented by a separate Perl module. Consult the module's manpage for a full listing of measures in the package and detailed descriptions. Listings of add-on packages, association measures, and some additional information can also be printed with the ucs-list-am program (see the ucs-list-am manpage).
Currently, there are two add-on packages in addition to the standard measures.
In UCS/Perl scripts, both the standard measures and the add-on packages have to be loaded with use statements (e.g. use UCS::AM; for the core set). Association measures are implemented as UCS::Expression objects (see the UCS::Expression manpage). The UCS module maintains a registry of loaded measures with additional information and an evaluation function (see Section "ASSOCIATION MEASURE REGISTRY" in the UCS manpage). When one of the packages above is loaded, its measures are automatically added to this registry. Association scores can be computed more efficiently for in-memory data sets, using the add method in the UCS::DS::Memory module (see the UCS::DS::Memory manpage).
In the ucs-add program, the standard measures are pre-defined, and extension packages can be loaded with the -x option. Only the last part of the package name has to be specified here (e.g. HTest for the UCS::AM::HTest package). It is case-insensitive and may be abbreviated to a unique prefix (so both -x htest and -x ht work as well). See the ucs-add manpage for more information on how to compute association scores with the ucs-add program.
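The matching rule for package names (case-insensitive, unique prefix) can be illustrated with a short sketch. This is not the code used by ucs-add: the function resolve_package is hypothetical, and of the two names in the list only HTest comes from the text, while Hypothetical is a made-up placeholder used to show an ambiguous prefix.

```perl
use strict;
use warnings;

# Resolve a case-insensitive abbreviation to the single package name that
# starts with it. Returns undef if nothing matches or the prefix is ambiguous.
# (Hypothetical helper, not part of UCS.)
sub resolve_package {
    my ($abbrev, @packages) = @_;
    my @hits = grep { index(lc $_, lc $abbrev) == 0 } @packages;
    return @hits == 1 ? $hits[0] : undef;
}

# "HTest" is named in the text; "Hypothetical" is an invented placeholder.
my @known = qw(HTest Hypothetical);

for my $arg (qw(htest ht h)) {
    my $pkg = resolve_package($arg, @known);
    print "$arg -> ", (defined $pkg ? "UCS::AM::$pkg" : "(no unique match)"), "\n";
}
```

In this sketch htest and ht both resolve to UCS::AM::HTest, while h is rejected because it is a prefix of both names.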
This section briefly lists the best-known association measures available in UCS/Perl, all of which are defined in the "standard" package UCS::AM. See the on-line resource at http://www.collocations.de/AM/ for the full equations and the UCS::AM manpage for details.
Church, K. W. and Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics 16(1), 22-29.
Church, K. W.; Gale, W.; Hanks, P.; Hindle, D. (1991). Using statistics in lexical analysis. In: Lexical Acquisition: Using On-line Resources to Build a Lexicon, Lawrence Erlbaum, pages 115-164.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61-74.
Evert, S. and Krenn, B. (2001). Methods for the qualitative evaluation of lexical association measures. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, pages 188-195.
Pedersen, T. (1996). Fishing for exactness. In: Proceedings of the South-Central SAS Users Group Conference, Austin, TX.
Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics 19(1), 143-177.
UCS/Perl uses some conventions for the names of association measures and the computed association scores, which are described in this section. It is important to be aware of such conventions, especially when they deviate from those used by other software packages.
The names of association measures are taken from the on-line inventory at http://www.collocations.de/AM/. Hyphen characters (-) are replaced by periods (.) to conform with the UCS standards (see the ucsfile manpage). Capitalisation is preserved (MI and Fisher.pv, but log.likelihood) and subscripts are included in the name, separated by a period (chi.squared.corr, where corr is a subscript in the original name).
Association scores are always arranged so that higher scores indicate stronger (positive) association, applying a transformation to the original values if necessary. In the one-sided versions of two-sided tests (e.g. chi.squared and log.likelihood), negative scores indicate negative association (while positive scores indicate positive association). Scores close to zero are a sign of statistical independence. Some other measures such as MI also have this property, but many do not (e.g. Fisher.pv or Dice).
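One way to obtain such a signed score from a two-sided test statistic is sketched below for Pearson's chi-squared on a 2x2 table. The text does not say how UCS determines the sign; comparing the observed joint frequency with its expected value is an assumption made for this illustration, and the function name signed_chi_squared and the toy frequencies are hypothetical.

```perl
use strict;
use warnings;

# Pearson chi-squared statistic for the 2x2 table of a frequency signature
# (f, f1, f2, N), negated when the pair occurs less often than expected.
# (Hypothetical helper; the sign rule is an assumption, not taken from UCS.)
sub signed_chi_squared {
    my ($f, $f1, $f2, $N) = @_;
    my ($O11, $O12, $O21, $O22) =
        ($f, $f1 - $f, $f2 - $f, $N - $f1 - $f2 + $f);
    my $E11 = $f1 * $f2 / $N;

    # Product of the row and column totals
    my $denom = $f1 * ($N - $f1) * $f2 * ($N - $f2);
    return 0 if $denom == 0;    # degenerate table, no evidence either way

    my $x2 = $N * ($O11 * $O22 - $O12 * $O21) ** 2 / $denom;
    return $O11 >= $E11 ? $x2 : -$x2;
}

# Toy signatures with f1 = f2 = 100 and N = 1000, so that E11 = 10
printf "f = 30: %.2f\n", signed_chi_squared(30, 100, 100, 1000);
printf "f = 10: %.2f\n", signed_chi_squared(10, 100, 100, 1000);
printf "f =  2: %.2f\n", signed_chi_squared(2,  100, 100, 1000);
```

The three calls cover the cases described above: a joint frequency above the expected value gives a positive score, one equal to it gives zero, and one below it gives a negative score.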
"Explicit" logarithms in the equation of an association measure are usually taken to the base 10 (e.g.
in the MI measure). This does not apply when the association score is not interpreted as a logarithm and the natural logarithm is required for correct interpretation (e.g. the log.likelihood measure, which is a test statistic approximating a known limiting distribution). The use of base 10 logarithms is always pointed out in the documentation (see the UCS::AM manpage).
The logarithm of infinity is represented by a large floating-point value returned by the inf function (from the UCS::Expression::Func module). Comparison with +inf() and -inf() can be used to detect a positive or negative infinite value.
The scores of association measures with the extension .pv represent a p-value (from an exact test or the approximate p-value of an asymptotic test). Unlike most other scores, p-values can be compared directly between different measures. They are represented as negative base 10 logarithms, so the association score 3.0 corresponds to a p-value of 0.001 = 1e-3 (+inf() stands for zero probability, usually the result of an underflow error).
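The mapping between p-values and .pv scores can be written out as a small sketch. It does not call the UCS modules: the function names pv_to_score and score_to_pv are hypothetical, and the constant BIG is only a stand-in for whatever large value the inf function actually returns.

```perl
use strict;
use warnings;

# Stand-in for an "infinite" score. The value is an assumption made for this
# sketch and is not the one returned by the inf function in UCS.
use constant BIG => 1e300;

# Convert a p-value to a score on the negative base 10 logarithm scale.
sub pv_to_score {
    my ($p) = @_;
    return BIG if $p <= 0;    # zero probability, e.g. after an underflow
    return -log($p) / log(10);
}

# Convert a score back to the p-value it represents.
sub score_to_pv {
    my ($score) = @_;
    return 10 ** (-$score);
}

printf "p = 0.001 -> score %.1f\n", pv_to_score(0.001);
printf "score 3.0 -> p = %g\n", score_to_pv(3.0);
```

Because the transformation is monotonically decreasing in p, smaller p-values yield higher scores, which keeps these measures in line with the convention that higher scores indicate stronger association.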
Copyright (C) 2004 by Stefan Evert.
This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.