README
for Excel spreadsheet file r/w access scripts for octave (>
3.4.0)
Copyright (C) 2009 - 2011 Philip Nienhuis <prnienhuis
at users.sf.net>
This version February 16, 2011
EXCEL
.XLS SUPPORT FILES
doc/README-XLS.html
This file.
- xlsread.m
-
All-in-one function for
writing data from one specific worksheet in an Excel spreadsheet file.
This script has Matlab-compatible functionality.
xlswrite.m-
All-in-one function for
writing data to one specific worksheet in an Excel spreadsheet file.
This script has Matlab-compatible functionality.
xlsfinfo.m -
All-in-one function for
exploring basic properties of an Excel spreadsheet file. This script
has Matlab-compatible functionality.
xlsopen.m -
Function for "opening"
(= providing a handle to) an Excel spreadsheet file ("workbook").
This function sorts out which interface to use for .xls access
(i.e.,COM; Java & Apache POI; JexcelAPI; etc.), but it's choice
can be overridden.
xls2oct.m -
Function for reading
data from a specific worksheet pointed to in a struct created by
xlsopen.m. xls2oct can be called multiple times consecutively using
the same pointer struct, each time allowing to read data from
different ranges and/or worksheets. Data are returned in the form of
a 2D heterogeneous cell array that can be parsed by parsecell.m.
xls2oct is a mere wrapper for interface-dependent scripts that do
the actual low-level reading .
oct2xls.m -
Function for writing
data to a specific worksheet pointed to in a struct created by
xlsopen.m. octxls can be called multiple times consecutively using
the same pointer struct, each time allowing to write data to
different ranges and/or worksheets. oct2xls is a mere wrapper for
interface-dependent scripts that do the actual low-level
writing.
xlsclose.m -
Function for closing
(the handle to) an Excel workbook. When data have been written to
the workbook oct2xls will write the workbook to disk. Otherwise, the
file pointer is simply closed and possibly used interfaces for Excel
access (COM/ActiveX/Excel.exe) will be shut down
properly.
parsecell.m -
Function for separating
the data in raw arrays returned by xls2oct, into numerical/logical
and text (cell) arrays.
-
- spsh_chkrange.m
-
spsh_prstype.m
-
getusedrange.m
-
calccelladdress.m
-
parse_sp_range.m
-
Support files called by
the scripts and not meant for direct invocation by users.
REQUIRED
SUPPORT SOFTWARE
-
For the Excel/COM
interface:
-
A windows computer with
Excel installed
-
Octave-forge
Windows-1.0.8 or later package WITH LATEST SVN PATCHES APPLIED
For the Java /
Apache POI / JExcelAPI interfaces (general):
-
octave-forge java-1.2.8
package or later version on Linux
-
octave-forge java-1.2.8
with latest svn fixes on Windows/MingW
-
Java jre or jdk >
1.6.0 (hasn't been tested with earlier versions)
- Apache POI specific:
- JExcelAPI specific:
- Hint: simply put the
relevant javaaddpath statements in the .octaverc file.
USAGE
-
xlsread and
xlswrite are mere wrappers for
xlsopen-xls2oct-xlsclose-parsecell and
xlsopen-oct2xls-xlsclose sequences, resp. They
exist for the sake of Matlab compatibility.
xlsfinfo
can be used for finding out what worksheet names exist in the file.
For OOXML files you either need MS-Excel 2007 for Windows (or later
version) installed, and/or the input parameter REQINTF should be
specified with a value of 'poi' (case-insensitive) and -obviously-
the complete POI interface must have been installed.
Invoking
xlsopen/..../xlsclose directly provides for much more
flexibility, speed, and robustness than xlsread / xlswrite.
Indeed, using the same file handle (pointer struct) you can mix
reading & writing before writing the workbook out to disk using
xlsclose.
And: xlsopen / xlsclose hide the gory
interface details from the user.
Currently only .xls files
(BIFF8) can be read/written; using JExcelAPI BIFF5 can be read as
well. For OOXML files either Excel 2007 for Windows (or higher)
and/or the complete Apache POI interface must be installed (and
probably the REQINTF parameter specified with a value of
'poi').
When using xlsopen....xlsclose be sure
to keep track of the file handle struct.
A possible
scenario:
xlh = xlsopen (<excel_filename> , [rw],
[<requested interface>])
# Set rw to 1 if you want
to write to a workbook immediately.
# In that case the check for
file existence is skipped and
# -if needed- a new workbook
created.
# If you really want an other interface than
auto-selected
# by xlsopen you can request that. But xlsopen
still checks
# proper support for your choice.
#
Read some data
[ rawarr1, xlh ] = xls2oct (xlh,
<SomeWorksheet>, <Range>)
# Be sure to specify
xlh as output argument as xls2oct keeps
# track of changes and
the need to write the workbook to disk
# in the xlhstruct. And
the origin range is conveyed through
# the xlh pointer struct.
#
Separate data into numeric and text data
[ numarr1,
txtarr1, lim1 ] = parsecell (rawarr1)
# Get more
data from another worksheet in the same workbook
[ rawarr2,
xlh ] = xls2oct (xlh, <SomeOtherWorksheet>, <Range>)
[
numarr2, txtarr2, lim2 ] = parsecell (rawarr2)
# <...
Analysis and preparation of new data in cell array Newdata....>
#
Add new data to spreadsheet
xlh = oct2xls (Newdata, xlh,
<AnotherWorksheet>, <Range>)
# Close the
workbook and write it to disk; then clear the handle
xlh =
xlsclose (xlh)
clear xlh -
- When not using the
COM interface, specify a value of 'POI' for parameter REQINTF when
accessing OOXML files in xlsread, xlswrite, xlsopen, xlsfinfo (and
be sure the complete Apache POI interface is installed). If you
haven't got ActiveX installed (i.e., not having MS-Excel under
Windows) specifying 'POI' may not be needed as in such cases Apache
POI is the next default interface.
-
- When using JExcelAPI
(JXL), after writing into a worksheet you MUST save the file –
adding data to the same or another worksheet is no more possible
after the first call to oct2xls(). This is a limitation of
JExcelAPI.
-
SPREADSHEET
FORMULA SUPPORT
-
When using the POI
and JXL interfaces you can:
-
(When reading, xls2oct)
either read spreadsheet formula results (like in COM interface), or
the literal formula text strings;
-
(When writing, oct2xls)
either enter formulas in the worksheet as formulas, or enter them as
literal text strings. The former is also like in COM.
In short, you can
enter spreadsheet formulas and in a later stage read them back,
change them and re-enter them in the worksheet. - The behaviour is
controlled by an option structure options
which for now has only
one (logical) field:
-
options.formulas_as_text
= 0 (the default)
implies enter formulas as formulas and read back formula results
-
options.formulas_as_text
=1 (or
any positive integer) means enter formulas as text strings and read
them back as text strings.
- Be aware that
there's no formula evaluator in JExcelAPI (JXL). So if you create
formulas in your spreadsheet using oct2xls or xlswrite with 'JXL',
do not expect meaningful results when reading those files later on
unless
you open them in Excel
and write them back to disk.
- While both Apache
POI and JExcelAPI feature a formula validator, not all spreadsheet
functions present in Excel have been implemented (yet).
-
Worse, older Excel
versions feature less functions than newer versions. So be wary as
this may make for interesting confusion.
MATLAB
COMPATIBILITY
-
xlsread,
xlswrite and xlsfinfo are for the most part
Matlab-compatible. Some small differences are mentioned below. When
using the Java interfaces octave supplies some formula manipulation
support.
xlsread
Matlab's xlsread
supports invoking extra functions while reading ("passing
function handle"); octave not. But this can be simulated
outside xlsread.
Matlab's xlsread flags some
spreadsheet errors, octave-forge just returns blank
cells.
Octave-forge returns info about the actual (rather
than the requested) cell range where the data came from. Personally
I find it very useful to know from what part of a worksheet the data
originate so I've put quite some effort in it :-) -
Matlab can't, due to
Excel automatically trimming returned arrays from empty outer
columns and rows. Octave is more clever but the Visual Basic call
used for determining the actually used range has some limitations:
(1) it relies on cached range values and thus may be out-of-date,
and (2) it counts empty formatted cells too. When using ActiveX/COM,
if octave's xlsfinfo.m returns wrong data ranges it is most
often an overestimation.
-
- Matlab's xlsread
ignores all non-numeric data values outside the smallest rectangle
encompassing all numerical values. Octave's xlsread doesn't.
This means that Matlab ignores all row/column headers, not very
user-friendly IMO.
When using the Java interface, reading
and writing xls-files by octave-forge is platform-independent. On
systems w/o installed Excel, Matlab can only read Excel 95 formatted
.xls files (written using ML xlswrite's 'Basic” option)
– and then differently than under Windows..... -
- Matlab's xlsread
returns strings for cells containing date values. This makes for
endless if-then-elseif-else-end constructs to catch all expected
date formates. Octave returns numerical data (where 0 = 1/1/1900 –
you can easily transfer them into proper octave date values yourself
using e.g. datestr(), see bottom of this document for more
info).
Matlab's xlsread invokes csvread if no
Excel interface is present. Octave-forge's xlsread
doesn't.
xlswrite
Octave-forge's xlswrite
works on systems w/o Excel support, Matlab's doesn't (properly). -
- When specifying a
sheet number larger than the number of existing sheets in an .xls
file, Matlab's xlswrite adds empty sheets until the new sheet
number is created; Octave's xlswrite only adds one sheet
called “Sheet<number>” where <number> is the
specified sheet number.
- Even better (IMO)
while M's xlswrite always creates Sheet1/Sheet2/Sheet3 when
creating a new spreadsheet, octave's xlswrite only creates
the requested worksheet. (Did you know that you can instruct Excel
to create spreadsheets with just one, or any number of, worksheets?
Look in Tools | Options, General tab.)
-
Oh and octave doesn't
touch the “active sheet” - but that's not automatically
an advantage.
-
- If the specified
write range is larger than the actual data array, Matlab's xlswrite
adds #N/A cells to fill up the lowermost rows and rightmost columns;
octave-forge's xlswrite doesn't.
xlsfinfo
When
invoking Excel/COM interface, octave-forge's xlsfinfo also
echoes the type of sheet (worksheet, chart), not just the sheet
names. Using Java I haven't found similar functionality (yet). -
COMPARISON
OF INTERFACES & USAGE
-
- Using Excel itself
(through COM / ActiveX on Windows systems) is probably the most
robust and versatile and especially FAST option. There's one gotcha:
in case of some type of COM errors Excel will keep running
invisibly; you can only end it through Task Manager.
A tiny
problem is that one cannot find out easily through COM what file
types are supported; xls, wks, wk1, xlsx, etc.
Another -obvious-
limitation is that COM Excel access only works on Windows systems
where Excel is installed.
JExcelAPI (Java-based and therefore
platform-independent) is proven technology but switching between
reading and writing is quite involved and memory-hungry when
processing large spreadsheets. As the docs state, JExcelAPI is
optimized for reading and it does do that well - but still slower
than Excel/COM. The fact that upon a switch from reading to writing
the existing spreadsheet is overwritten in place by a blank
one and that you can only get the contents back wen writing out all
of the changes is worrying - and any change after the first write()
is lost as a next write() doesn't seem to work, worse yet, you may
completely loose the spreadsheet in question. The first is by
JExcelAPI design, the second is probably a bug (in octave-forge/Java
or JExcelAPI ? I don't know). Adding data to existing spreadsheets
does work, but IMO undue user confidence is needed. -
JExcelAPI supports BIFF5
(only reading) and BIFF8 (Excel 95 and Excel 97-2003, respectively).
Upon overwriting, BIFF5 spreadsheets are converted silently to
BIFF8.
-
JexcelAPI, unlike
ApachePOI, doesn't evaluate functions while reading but instead
relies on cached results (i.e. results computed by Excel itself).
Depending on Excel settings (“Automatic calculation” ON
or OFF) this may or may not yield incorrect (or expected)
results.
Apache POI (Java-based and platform-independent too)
is based on the OpenOffice.org I/O Excel r/w routines. It is a more
versatile than JExcelAPI, while it doesn't support BIFF5 it does
support BIFF8 (Excel 97 – 2003) and OOXML (Excel 2007). -
It is slower than native
JXL let alone Excel & COM but it features active formula
evaluation, although at the moment (v. 3.7) not all Excel functions
have been implemented. I've made the relevant subfunction
(xls2jpoi2oct) fall back to cached formula results (and yield a
suitable warning) for non-implemented Excel functions while reading
Excel files.
All in all, of the two Java options I'd prefer
Apache POI rather than JexcelAPI. But the latter is indispensable
for BIFF5 formats. -
- Some notes on the
choice for Java:
-
It saves a LOT of
development time to use ready-baked Java classes rather than
developing your own routines and thus effectively reinvent the
wheel.
-
A BIG advantage is that
a Java-based solution is platform-independent (“portable”).
-
But Java is known to be
not very conservative with resources, especially not when processing
XML-based formats.
- So Java is a
compromise between portability and rapid development time versus
capacity (and speed).
-
But IMO data sets larger
than 5.105 cells should not be kept in spreadsheets
anyway. Better use real databases for such data sets.
A
NOTE ON JAVA MEMORY USAGE
Java memory
pool allocation size
The Java virtual machine
(JVM), when initialized by octave, reserves one big chunk of your
computer's RAM in which all java classes and methods etc. are to be
loaded: the java memory pool. It does this because java has a very
sophisticated “garbage collection” system. At least on
Windows, the initial size is 2MB and the maximum size is 16 MB. On
Linux this allocated size might differ (e.g., my Mandriva box with
openJDK has a 512 MB default max setting). This part of memory is
where the Java-based XLS/ODS octave routines live and keep their
variables etc.
For transferring large
pieces of information to and from spreadsheets you might hit the
limits of this pool. E.g. to be able to handle I/O of an array of
around 500,000 cells I needed a memory pool size of 512 MB.
The memory size can be
increased by inserting a file called “java.opts”
(without quotes) in the directory
./share/octave/packages/java-<version> (where the script file
javaclasspath.m is located), containing just the following lines:
-Xms16m
-Xmx512m
(where 16 =
initial size, 512 = maximum size (in this example), m stands for
Megabyte. This maximum is system-dependent. E.g., I have a 1 GB
setting).
After processing
a large chunk of spreadsheet information you might notice that
octave's memory footprint does not shrink so it looks like Java's
memory pool does not shrink back; but rest assured, the memory
footprint is the allocated (reserved) memory size, not the
actual used size. After the JVM has done its garbage collection,
only the so-called “working set” of the memory
allocation is really in use and that is a trimmed-down part of the
memory allocation pool. On Windows systems it often suffices to
minimize the octave terminal for a few seconds to get a more
reasonable memory footprint.
TROUBLESHOOTING
Some hints for
troubleshooting Excel support are contained in this thread:
http://sourceforge.net/mailarchive/forum.php?thread_name=4C61B649.9090802%40hccnet.nl&forum_name=octave-dev
dated August 10,
2010.
A more structured
approach is below:
Check
if COM / ActiveXworks (only under Windows OS). Do a pkg
list and
see
a. If there's a windows
package mentioned (then it's installed). If not, install it.
b. If
there's an asterisk on the windows package line (then the package is
loaded). If not, do a pkg
load windows
Check
if the ActiveX server works. Do:
exl = actxserver
('Excel.Application') ##
Note the period between “Excel” and “Application”
If a COM object is
returned, ActiveX / COM / Excel works. Do: exl.Quit();
delete (exl) to
shut down the (hidden) Excel invocation.
If you get an error
message, your last resort is re-installing the windows package, or
trying the Java-based interfaces.
Check if java works. Do
a pkg
list and
see
a. If there's a java
package mentioned (then it's installed). If not, install it.
b. If there's an
asterisk on the java package line (then the package is loaded). If
not, do a pkg
rebuild -auto java
Check
Java
memory settings. Try javamem
a. If it works, check if
it reports sufficiently large max memory (had better be 200 MiB, the
bigger the better)
b. If it
doesn't work, do:
rt = java_invoke
('java.lang.Runtime', 'getRuntime')
rt.gc
rt.maxMemory
().doubleValue () / 1024 / 1024
The last command will
show MaxMemory in MiB.
c. In case
you have insufficient memory, see in “GOTCHAS”, “Java
memory pool allocation size”, how to increase java's memory
pre-reservation.
Check if all classes
(.jarfiles) are in class path. Do a 'javaclasspath'
(under unix/linux, do 'tmp
= javaclasspath; strsplit (tmp,”:”)'
(w/o quotes). See above under “REQUIRED SUPPORT SOFTWARE”
what classes should be mentioned.
If classes (.jar files)
are missing, download and put them somewhere and add them to the
javaclass path with their fully qualified pathname (in quotes) using
javaaddpath().
Once all classes
are present and in the javaclasspath, the xls interfaces should just
work. The only remaining showstoppers are insufficient write
privileges for the working directory, a wrecked up octave or some
other problem outside octave.
Try
opening an xls file:
xls1
= xlsopen ('test.xls', 1, 'poi').
If this works and xls1 is a struct with various fields containing
objects, the Apache POI interface (POI) works. Do an xls1
= xlsclose (xls1) to
close the file.
xls2 = xlsopen
('test.xls', 1, 'jxl').
If this works and xls2 is a struct with various fields containing
objects, the JExcelAPI interface (JXL) works as well. Don't forget
to do xls2
= xlsclose (xls2) to
close the file.
DEVELOPMENT
- xlsopen/xlsclose
and friends have been written so that adding other interfaces (Perl?
native octave? ...?) should be very easily accomplished. Xlsopen.m
merely needs two stanzas, xlsfinfo.m and getusedrange.m
each need an additional elseif stanza, and xlsclose.m needs a
small stanza for closing the pointer struct and writing to disk.
- The real work lies
in creating the relevant xls2<...>2oct &
oct2<...>2xls & <getusedrange_...>
subfunction scripts in xls2oct.m, oct2xls.m and
getusedrange.m, resp., but that shouldn't be really hard,
depending on the interface support libraries' quality and
documentation. Separating the file access functions and the actual
reading/writing from/to the workbook in memory has made developer's
life (I mean: my time developing this stuff) much easier.
Some
other options for development (who?):
-
Speeding up, especially
Java worksheet/cell access. For cracks, not me.
-
Automatic conversion of
Excel date/time values into octave ones and vice versa (adding or subtracting 636960). But then
again Excel's dates are 01-01-1900 based (octave's 0-0-0000) and
buggy (Excel thinks 1900 is a leap year), and I sometimes have to
use dates from before 1900. Maybe as an option?
-
Creating Excel graphs (a
significant enterprise to write from scratch).
-
Support for "passing
function handle" in xlsread.
Enjoy!
Philip
Nienhuis, February 16, 2011