UCS::DS::Stream - I/O streams for data set files
use UCS::DS::Stream;
$ds = new UCS::DS::Stream::Read $filename;
die "format error" unless defined $ds;
# access variables, comments, and globals with UCS::DS methods
while ($ds->read) {
die "read/format error"
unless $ds->valid; # valid row data available?
$n = $ds->row; # row number
$idx = $ds->var_index("am.logl"); # see 'ucsdoc UCS::DS'
$logl = $ds->columns->[$idx]; # $ds->columns returns arrayref
$logl = $ds->value("am.logl"); # short and safe, but slower
$rowdata = $ds->data; # returns hashref (varname => value)
$logl = $rowdata->{"am.logl"}; # == $ds->value("am.logl")
}
$ds->close;
$ds = new UCS::DS::Stream::Write $filename;
# set up variables, comments, and globals with UCS::DS methods
$ds->open; # write data set header
foreach $i (1 .. $N) {
$ds->data("id"=>$i, "l1"=>$l1, ...); # takes hashref or list of pairs
$ds->data("am.logl"=>$logl, ...); # may be called repeatedly to add data
$ds->columns($i, $l1, $l2, ...); # complete list of column data
$ds->write; # write row and clear data cache
}
$ds->close;
UCS data set streams are used to read and write data set files one row at a time. When an input stream is created, the corresponding data set file is opened immediately and its header is read in. The header information can then be accessed through UCS::DS methods. Each read method call loads a single row from the data set file into an internal representation, from which it is available to the main program.
An output stream creates (or overwrites) its associated data set file only when the open method is called. This allows the main program to set up variables and header data with UCS::DS method calls beforehand. After the file has been opened, the data for each row are first stored in an internal representation, and then written to disk with the write method.
Note that there are no objects of class UCS::DS::Stream. Both input and output streams inherit directly from the UCS::DS class.
Input streams are implemented as UCS::DS::Stream::Read objects. When an input stream is created, the header of the associated data set file is read in. Header data and information about the variables in the data set can then be accessed using UCS::DS methods.
The actual data set table is then loaded one row (= pair type) at a time by calling the read method. The row data are extracted into an internal representation where they can be accessed with various methods (some of them being safe, others more efficient).
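The trade-off between safe and efficient access can be sketched as follows. This is a minimal illustration, not part of the UCS distribution: the helper name sum_column is made up, and the stream object is only assumed to provide the var_index, read, valid, and columns methods as they are used in the synopsis. The column index is looked up once, before the loop, and the values are then read by position.

```perl
use strict;
use warnings;

# Sum one numeric variable over all rows of an input stream.
# $ds is assumed to behave like a UCS::DS::Stream::Read object;
# sum_column is a hypothetical helper, not a UCS function.
sub sum_column {
    my ($ds, $varname) = @_;
    my $idx = $ds->var_index($varname);    # index lookup, done once outside the loop
    my ($sum, $rows) = (0, 0);
    while ($ds->read) {
        die "read/format error" unless $ds->valid;
        my $value = $ds->columns->[$idx];  # positional access into the row array
        next unless defined $value;        # skip missing values (stored as undef)
        $sum += $value;
        $rows++;
    }
    return ($sum, $rows);
}

# Usage, assuming the UCS toolkit is installed ("pairs.ds.gz" is a made-up file name):
#   use UCS::DS::Stream;
#   my $ds = new UCS::DS::Stream::Read "pairs.ds.gz";
#   die "format error" unless defined $ds;
#   my ($sum, $n) = sum_column($ds, "am.logl");
#   $ds->close;
```

Replacing the columns lookup with $ds->value($varname) would give the same result without the explicit index, at the cost of a name lookup for every row.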
The na method controls whether missing values (represented by the string NA in the data set file) are recognised and stored internally as undefs, or whether they are silently translated into 0 (BOOL, INT, and DOUBLE variables) and the empty string (STRING variables), respectively.
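As a small sketch of what this means for the calling program: with missing value support left at its default (enabled), a missing value shows up as undef and can be detected with defined. The helper name count_missing is hypothetical, and the stream object is only assumed to provide the read, valid, and value methods used in the synopsis.

```perl
use strict;
use warnings;

# Count the rows in which variable $varname has a missing value.
# $ds is assumed to behave like a UCS::DS::Stream::Read object with
# missing value support enabled; count_missing is a hypothetical helper.
sub count_missing {
    my ($ds, $varname) = @_;
    my $missing = 0;
    while ($ds->read) {
        die "read/format error" unless $ds->valid;
        my $value = $ds->value($varname);  # access by name (safe, but slower)
        $missing++ unless defined $value;  # NA in the file was stored as undef
    }
    return $missing;
}

# Usage, assuming the UCS toolkit is installed ("pairs.ds.gz" is a made-up file name):
#   use UCS::DS::Stream;
#   my $ds = new UCS::DS::Stream::Read "pairs.ds.gz";
#   die "format error" unless defined $ds;
#   print count_missing($ds, "am.logl"), " rows without am.logl\n";
#   $ds->close;
```

After $ds->na(0), this count would always be zero, because missing values are translated into 0 or the empty string and can no longer be told apart from real data.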
If the file name does not start with / or ./ (and is not a command pipe) and the file is not found in the current working directory, the standard UCS library is automatically searched for a data set with this name.
The na method enables or disables support for missing values, which are represented in the data set file by the string NA (as used by R). When enabled, missing values are represented by undefs. Otherwise, they will be silently translated into 0 (BOOL, INT, and DOUBLE variables) and the empty string (STRING variables), respectively. Use $ds->na(0); to disable missing value support, which is activated by default.
The read method loads the next row from the data set file. It is typically called in a loop of the form while ($ds->read) {...}.
The columns method returns a reference to an array holding the column data of the current row. Once the index of a variable has been looked up with the var_index method, its value can be accessed as $cols->[$idx]. Since index lookup can be moved out of the row processing loop, this access method is much more efficient than its alternatives. NB: the array @$cols is not reused for the next line of input and can safely be integrated into user-defined data structures.
The data method returns the row data as a hashref, so that individual values can be accessed as $rowdata->{$varname}, similar to using the value method. Access with the data method is convenient for copying row data to an output stream. It is relatively slow, though, and should not be used in tight loops.
Output streams are implemented as UCS::DS::Stream::Write objects. After creating an output stream object, variables and header data are set up with the UCS::DS methods. The data set header is written to disk when the open method is called.
After that, the actual data set table is generated one row at a time. Row data are first stored in the internal representation (using the data or the columns method), and then written to disk when the write method is called.
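The sequence described here (set up variables, open, then data and write for each row, and finally close) can be sketched as follows. The helper name write_rows is hypothetical, and the stream object is only assumed to provide the add_vars, open, data, write, and close methods as they appear elsewhere in this manpage. Whether a given set of variable names is accepted by UCS::DS is not checked here.

```perl
use strict;
use warnings;

# Write a list of rows to an output stream.
# $out is assumed to behave like a UCS::DS::Stream::Write object,
# $vars is an arrayref of variable names, and $rows is an arrayref of
# hashrefs (variable name => value). write_rows is a hypothetical helper.
sub write_rows {
    my ($out, $vars, $rows) = @_;
    $out->add_vars(@$vars);  # declare variables before the header is written
    $out->open;              # creates the file and writes the data set header
    for my $row (@$rows) {
        $out->data($row);    # store row data by variable name (hashref form)
        $out->write;         # write the row and clear the data cache
    }
    $out->close;
    return scalar @$rows;    # number of rows written
}

# Usage, assuming the UCS toolkit is installed ("example.ds.gz" is a made-up file name):
#   use UCS::DS::Stream;
#   my $out = new UCS::DS::Stream::Write "example.ds.gz";
#   write_rows($out, ["id", "l1", "l2", "am.logl"],
#              [ { "id" => 1, "l1" => "black", "l2" => "box", "am.logl" => 12.5 } ]);
```

Passing rows as hashrefs keeps the example independent of column order; the columns method would be the faster alternative when the order is known to match.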
The recommended way of copying rows from one data set file to another is to use the data methods of both streams, so that variables are copied by name rather than by column position. It would be more efficient to pass row data directly (using the columns methods), but this approach is error-prone when the order of the columns differs between the input and output data sets.
The following example makes a copy of a data set file, adding an (enumerative) id variable if it is not present in the source file.
$in = new UCS::DS::Stream::Read $input_file;
die "$input_file: format error"
unless defined $in;
@vars = $in->vars;
$add_id = not $in->var("id");
$out = new UCS::DS::Stream::Write $output_file;
$out->copy_comments($in); # copy comments and
$out->copy_globals($in); # global variables from input file
$out->add_vars("id") # conventionally, the "id" variable
if $add_id; # is in the first column
$out->add_vars(@vars);
$out->open; # writes header to $output_file
while ($in->read) {
die "read/format error"
unless $in->valid;
$out->data($in->data); # copy row data by field name
$out->data("id" => $in->row) # use row number as ID value
if $add_id;
$out->write;
}
$in->close;
$out->close;
Copyright 2004 Stefan Evert.
This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.