UCS::DS::Memory - In-memory representation of data sets
use UCS::DS::Memory;
$ds = new UCS::DS::Memory; # empty data set
$ds = new UCS::DS::Memory $filename; # read from file (using UCS::DS::Stream)
# access & edit variables, comments, and globals with UCS::DS methods
$pairs = $ds->size; # number of pair types
$ds->set_size($pairs); # truncate or extend data set
$value = $ds->cell($var, $n); # read entry from data set table
$ds->set_cell($var, $n, $value); # set entry in data set table
$rowdata = $ds->row($n); # returns hashref (varname => value)
$ds->set_row($n, $rowdata); # set row data (ignores missing vars)
$ds->set_row($n, "f1"=>$f1, "f2"=>$f2, ...);
$ds->append_row($rowdata); # append row to data set
$ds->delete_rows($from, $to); # delete a range of rows from the data set
$vector = $ds->column($var); # reference to data vector of $var
$vector->[$n] = $value; # fast direct access to cells
$ds->eval($var, $exp) # evaluate expression on data set & store in $var
unless $ds->missing($exp); # check first whether all reqd. variables are available
$ds->add($var); # auto-compute variable (derived variable or registered AM)
$stats = $ds->summary($var); # statistical summary of numerical variable
$ds->where($idx, $exp); # define index: rows matching UCS expression
$n = $ds->count($exp); # number of rows matching expression
$vector = $ds->index($idx); # returns reference to array of row numbers
$ds->make_index($idx, $row1, $row2, ...); # define index: explicit list of row numbers
$ds->make_index($idx, $vector); # or array reference (will be duplicated)
$ds->activate_index($idx); # activate index (will be used by most access methods)
$ds->activate_index(); # de-activate index
$ds->delete_index($idx); # delete index
$ds2 = $ds->copy; # make physical copy of data set (using index if activated)
$ds2 = $ds->copy("*", "am.%"); # copy selected variables only (in specified order)
$ds->renumber; # renumber/add ID values as increasing sequence 1 .. size
$ds->sort($idx, $var1, $var2, ...); # sort data set on $var1, breaking ties by $var2 etc.
$ds->sort($idx, "-$var1", "+$var2"); # - = descending, + = ascending (default depends on variable type)
$ds->rank($ranking, $key1, ...); # compute ranking (with ties) and store in data set variable $ranking
$ds->save($filename); # save data set to file (using index if activated)
$dict = $ds->dict($var1, $var2, ...); # lookup hash for variable(s) (UCS::DS::Memory::Dict object)
($max, $average) = $dict->multiplicity; # maximum / average number of rows for each key
if ($dict->unique) { ... } # whether every key identifies a unique row
@rows = $dict->lookup($x1, $x2, ...); # look up key in dictionary, returns all matching rows
$row = $dict->lookup($x1, $x2, ...); # in scalar context, returns first matching row
@rows = $dict->lookup($other_ds, $n); # look up row $n from other data set
$n_rows = $dict->multiplicity($x1, $x2, ...); # takes same arguments as lookup()
@keys = $dict->keys; # return unsorted list of keys entered in dictionary
This module implements an in-memory representation of UCS data sets. When a data set file has been loaded into a UCS::DS::Memory object (or a new empty data set has been created), then variable names, comments, and globals can be accessed and modified with the respective UCS::DS methods (see the UCS::DS manpage).
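Additional methods in the UCS::DS::Memory class allow the user to read and modify cells, rows, and columns of the data set; evaluate UCS expressions and compute derived variables; define, activate, and delete row indices; copy, sort, rank, and save the data set; and build lookup dictionaries.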
The individual methods are detailed in the following sections. In all methods, columns are identified by the respective variable names, whereas rows (corresponding to pair types) are identified by row numbers. NB: Row numbers start with 1 (like R vectors, but unlike Perl arrays)!
'-na' disables missing value support (which is enabled by default), so that NA values in the data set file will be replaced by 0 or the empty string, depending on the data type. Use '+na' to enable missing value support explicitly.
For set_row, the row data is given as a hash reference or as a list of variable name => $value pairs. Any variables that do not exist in the data set $ds are silently ignored. This method is faster than calling set_cell repeatedly, especially when a new row is added to the data set.
For add, the variable may be a derived variable such as E11 or an association score such as am.t.score (see the ucsfile manpage for details).
The summary method applies to numerical variables (INT or DOUBLE). $stats is a hash reference representing a data structure with the following fields:
MIN ... minimum value
MAX ... maximum value
ABSMIN ... smallest non-zero absolute value
ABSMAX ... largest absolute value
SUM ... sum of all values
MEAN ... mean (= average)
MEDIAN ... median (= 50% quantile)
VAR ... empirical variance
SD ... empirical standard deviation (sq. root of variance)
STEP ... smallest non-zero difference between any two values
NA ... number of missing values (undef's)
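To make the field definitions concrete, the following standalone sketch computes the same fields for a plain Perl list. It is not the implementation used by UCS: the helper summarise is hypothetical, the variance is assumed to use the n - 1 denominator, and the median of an even number of values is assumed to be the mean of the two middle values.

```perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Hypothetical helper (not part of UCS): summarise a list of numbers,
# where undef stands for a missing value (NA).
sub summarise {
    my @all = @_;
    my @x   = sort { $a <=> $b } grep { defined } @all;
    my $n   = scalar @x;
    return { NA => scalar @all } if $n == 0;    # nothing but missing values

    my $mean = sum(@x) / $n;
    # Assumption: unbiased estimator with n - 1 in the denominator.
    my $var = $n > 1 ? sum(map { ($_ - $mean)**2 } @x) / ($n - 1) : 0;

    my @abs     = map { abs } @x;
    my @nonzero = grep { $_ > 0 } @abs;
    # In sorted order, the smallest non-zero difference between any two
    # values is always found between neighbours.
    my @diffs = grep { $_ > 0 } map { $x[$_] - $x[$_ - 1] } 1 .. $#x;

    return {
        MIN    => $x[0],
        MAX    => $x[-1],
        ABSMIN => (@nonzero ? min(@nonzero) : undef),
        ABSMAX => max(@abs),
        SUM    => sum(@x),
        MEAN   => $mean,
        MEDIAN => ($n % 2 ? $x[($n - 1) / 2] : ($x[$n / 2 - 1] + $x[$n / 2]) / 2),
        VAR    => $var,
        SD     => sqrt($var),
        STEP   => (@diffs ? min(@diffs) : undef),
        NA     => scalar(@all) - $n,
    };
}

my $stats = summarise(3, -1, undef, 4, 4, 0);
printf "%s = %s\n", $_, $stats->{$_} // "undef" for sort keys %$stats;
```

Missing values are counted in NA and excluded from all other fields, which matches the description of NA as the number of undef's.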
The copy method makes a physical copy of the data set (the assignment $ds2 = $ds; would just give another handle on the same data set). Comments and globals are copied to $ds2 as well. Optionally, a list of variable names and/or wildcard patterns (see the ucsexp manpage) can be specified. In this case, only the selected columns will be copied. NB: If there is an active row index, the copy will only include the rows selected by the index, and they will be arranged in the corresponding order. However, no row indices are copied to $ds2. (See the section "ROW INDEX METHODS" below for more information on row indices.)
The values of the id variable are preserved (and can be used to match rows against the corresponding entries in the original data set). When an independent numbering is desired, the renumber method can be used to re-compute the id values so that they form an uninterrupted sequence starting from 1. NB: The renumbering ignores an activated row index.
A row index is an array reference containing a list of row numbers (starting from 1, unlike Perl arrays). Row indices are used to select rows from an in-memory data set, or to represent a re-ordering of the rows (or both). They are usually created by the where and sort methods, but can also be constructed explicitly. An arbitrary number of named row indices can be stored in a UCS::DS::Memory object.
A row index can be activated, creating a "virtual" data set containing only the rows selected by the index, arranged in the corresponding order. Most UCS::DS::Memory methods will then operate on this virtual data set. All exceptions are marked clearly in this manpage. In particular, the where method selects a subset of the activated index, and sort can be used to reorder it. There can only be one active row index at a time. There is no way of localising the activation (so that a previously active index is restored at the end of a block), so it is highly recommended to use active indices only locally and de-activate them afterwards.
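The idea of a "virtual" data set can be illustrated with plain Perl arrays. The sketch below is a standalone model rather than UCS code: the column data and the helper virtual_row are made up for illustration, and the actual internal representation in UCS::DS::Memory may differ.

```perl
use strict;
use warnings;

# Toy column store standing in for a data set (hypothetical data).
my %columns = (
    l1 => [qw(black strong heavy dark)],
    l2 => [qw(box tea rain night)],
    f  => [12, 45, 7, 30],
);
my $size = scalar @{ $columns{f} };

# A row index is a list of row numbers counted from 1.
# Select the rows with f >= 10 ...
my @high_freq = grep { $columns{f}[$_ - 1] >= 10 } 1 .. $size;
# ... and reorder the selection by descending frequency.
my @by_freq = sort { $columns{f}[$b - 1] <=> $columns{f}[$a - 1] } @high_freq;

# With an "active" index, row $n of the virtual data set is
# row $index->[$n - 1] of the underlying table.
sub virtual_row {
    my ($index, $n) = @_;
    my $real = $index->[$n - 1];
    return { map { $_ => $columns{$_}[$real - 1] } keys %columns };
}

for my $n (1 .. scalar @by_freq) {
    my $row = virtual_row(\@by_freq, $n);
    print "$n: $row->{l1} $row->{l2} ($row->{f})\n";
}
```

The underlying columns are never modified; the index alone determines which rows are visible and in which order.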
Index names must be valid UCS identifiers, i.e. they may only contain alphanumeric characters (A-Z a-z 0-9) and periods (.) (cf. "VARIABLES" in ucsfile). Note that index names beginning with a period are reserved for internal use.
$ds->where("high.freq", new UCS::Expression '%f% >= 10');
$ds->where("high.freq", '%f% >= 10');
Each sort key may be marked with a + or - character to select ascending or descending sort order, respectively. The default order is descending for Boolean variables and association scores, and ascending for all other variables. The sort keys 'l1' and 'l2' sort in alphabetical order, while 'f-' puts the most frequent pair types first.
# order pair types by frequency (descending), breaking ties randomly
if (not $ds->var("am.random")) {
    $ds->add("am.random");
    $ds->temporary("am.random", 1); # temporary, don't save to disk
}
$ds->sort("by.freq", "f-", "am.random");
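The effect of a secondary sort key can be sketched with Perl's built-in sort on plain arrays. This is only an illustration of tie-breaking, not the code behind the sort method; the arrays @f and @rnd are hypothetical, and the tie-breaker is sorted in ascending order here purely for the sake of the example.

```perl
use strict;
use warnings;

# Hypothetical columns; @rnd plays the role of a random tie-breaker,
# with fixed values so that the result is reproducible.
my @f   = (10,   25,   10,   25,   3);
my @rnd = (0.71, 0.12, 0.05, 0.64, 0.33);

# Row numbers (from 1) ordered by descending f; rows with equal f
# are ordered by ascending @rnd.
my @by_freq = sort {
    $f[$b - 1] <=> $f[$a - 1]            # primary key, descending
        or
    $rnd[$a - 1] <=> $rnd[$b - 1]        # secondary key, ascending
} 1 .. scalar @f;

print "@by_freq\n";
```

The second comparison is only evaluated when the first one returns 0, i.e. when two rows have the same frequency.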
A data set dictionary is a hash structure listing all the different values that a given variable assumes in the data set (or all the different value combinations of several variables). For each value (or value combination), which is called a key of the dictionary, the corresponding row numbers in the data set can be retrieved (called a lookup of the key). In the terminology of relational databases, such a dictionary is referred to as an index. Be careful not to confuse this notion with the row index described above, which is used for subsetting and/or reordering the rows of a data set.
A dictionary can be created for any variable (or combination of variables) with the dict method, and is returned in the form of a UCS::DS::Memory::Dict object. NB: This dictionary is only valid as long as the data set itself is not modified (which includes activation or deactivation of a row index). Unlike a database index, the dictionary is not updated automatically. It is therefore important to keep operations on the data set under strict control while a dictionary is in use. It is always possible to add, modify, and delete variables that are not included in the dictionary, though. For the same reason (as well as to save working memory), dictionaries should be deleted when they are no longer needed.
The main purpose of a dictionary is to look up keys and find the matching rows in the data set efficiently (the ucs-join program is an example of a typical application). It is often desirable to choose variables in such a way that every key identifies a unique row in the data set (for instance, the values of l1 and l2 identify a pair type, which should have only one entry in a data set). A dictionary with this property is called unique. Both unique and non-unique dictionaries are supported (unique dictionaries are represented in a memory-efficient fashion). Lookup and similar operations are implemented as methods of the UCS::DS::Memory::Dict object.
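Conceptually, such a dictionary can be modelled as a Perl hash that maps each key to the list of matching row numbers. The following standalone sketch shows the idea for keys formed from l1 and l2. It does not reflect the internals of UCS::DS::Memory::Dict: the data, the lookup helper, and the use of a tab character as key separator are assumptions made for illustration.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical columns of a small data set.
my @l1 = qw(strong strong heavy strong);
my @l2 = qw(tea coffee rain tea);

# Map each (l1, l2) key to the row numbers (from 1) where it occurs.
# Assumption: "\t" never occurs in the values, so it is a safe separator.
my %dict;
for my $n (1 .. scalar @l1) {
    push @{ $dict{ join "\t", $l1[$n - 1], $l2[$n - 1] } }, $n;
}

# Hypothetical helper: all rows matching a key (call it in list context).
sub lookup {
    my $rows = $dict{ join "\t", @_ };
    return $rows ? @$rows : ();
}

# The dictionary is unique if no key matches more than one row.
my $max_mult = max(map { scalar @$_ } values %dict);
my $unique   = $max_mult == 1;

my @rows = lookup("strong", "tea");
print "unique: ", ($unique ? "yes" : "no"), "\n";
print "rows for (strong, tea): @rows\n";
```

Because the hash is built once from a snapshot of the columns, it goes stale as soon as the underlying rows change, which is why a dictionary has to be rebuilt after the data set is modified.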
Although mainly intended for string values, dictionaries support all data types. Boolean variables will usually be of interest only in combination with other variables (possibly also Boolean ones), and dictionaries are rarely useful for floating-point values.
The keys method and the ability to use the returned internal representations in the lookup method provide an easy way to compute the (empirical) distribution of a data set variable, i.e. a list of different values and their multiplicities. (Note that calling lookup in scalar context cannot be used to determine the multiplicity of a key because it returns the first matching row in this case.)
# frequency table for variable $v on data set $ds
$dict = $ds->dict($v);
@distribution =
    # sort values by multiplicity
    sort { $b->[1] <=> $a->[1] or $a->[0] cmp $b->[0] }
    # compute multiplicity for each value
    map { [$_, $dict->multiplicity($_)] }
    # for a single variable $v, internal keys are simply the values
    $dict->keys;
undef $dict; # always erase dictionary after use
The following example is a bare-bones version of the ucs-join command, annotating the pair types of a data set $ds1 with a variable $var from another data set $ds2 (matching rows according to the pair types they represent, i.e. using the variables l1 and l2). Typically, $ds2 will be an annotation database.
$ds1->add_variables($var); # assuming $var does not already exist in $ds1
$dict = $ds2->dict("l1", "l2"); # keys are pair types (l1, l2) of $ds2
$dict->unique
    or die "Not unique -- can't look up pair types.";
foreach $n (1 .. $ds1->size) {
    $row = $dict->lookup($ds1, $n); # matching row in $ds2, if any
    $ds1->set_cell($var, $n, $ds2->cell($var, $row))
        if defined $row;
}
undef $dict;
The ucsfile manpage for general information about UCS data sets and the data set file format, the ucsexp manpage for an introduction to UCS expressions (which are used extensively in the UCS::DS::Memory module) and wildcard patterns, the UCS::Expression manpage for information on how to compile UCS expressions, and the UCS::DS manpage for methods that manipulate the layout of a data set and its header information.
Copyright 2004 Stefan Evert.
This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.