The to_xml command is used to generate XML data from CSV inputs. This command can be used in two ways; the simplest is to use it to generate an XHTML table using the <table>, <tr> and <td> tags. To do this, run the command without specifying a configuration file:

csvfix to_xml somefile.csv

The more interesting mode uses a configuration file to produce customised, tree-structured XML data. For example, suppose we have the following CSV data (this is actually a shortened books.csv) which describes some books, their authors and some characters:

Dickens,Charles,Bleak House,Esther Sumerson,Drippy heroine
Dickens,Charles,Bleak House,Inspector Bucket,Prototype detective
Dickens,Charles,Great Expectations,Pip,Deluded ex-blacksmith
Dickens,Charles,Bleak House,Mr Vholes,Vampiric lawyer
Austen,Jane,Emma,Emma Woodhouse,Smug Surrey goddess
Austen,Jane,Pride & Prejudice,Elizabeth Bennet,Non-drippy heroine
Austen,Jane,Pride & Prejudice,Mr Darcy,"Proud, wet-shirted landowner"

We can transform this data to XML by writing a configuration file (books.xsp):

# create XML describing some fictional characters
tag characters
   tag author group 1,2 attrib forename 2 attrib surname 1
      tag book group 3 attrib title 3
         tag character group 4
            tag name
               text 4
            tag description
               text 5

and running the CSVfix command:

csvfix to_xml -xf books.xsp books.csv 

producing:

<characters>
    <author forename="Charles" surname="Dickens">
        <book title="Bleak House">
            <character>
                <name>
                    Esther Sumerson
                </name>
                <description>
                    Drippy heroine
                </description>
            </character>
            <character>
                <name>
                    Inspector Bucket
                </name>
                <description>
                    Prototype detective
                </description>
            </character>
        </book>
        <book title="Great Expectations">
            <character>
                <name>
                    Pip
                </name>
                <description>
                    Deluded ex-blacksmith
                </description>
            </character>
        </book>
        <book title="Bleak House">
            <character>
                <name>
                    Mr Vholes
                </name>
                <description>
                    Vampiric lawyer
                </description>
            </character>
        </book>
    </author>
    <author forename="Jane" surname="Austen">
        <book title="Emma">
            <character>
                <name>
                    Emma Woodhouse
                </name>
                <description>
                    Smug Surrey goddess
                </description>
            </character>
        </book>
        <book title="Pride &amp; Prejudice">
            <character>
                <name>
                    Elizabeth Bennet
                </name>
                <description>
                    Non-drippy heroine
                </description>
            </character>
            <character>
                <name>
                    Mr Darcy
                </name>
                <description>
                    Proud, wet-shirted landowner
                </description>
            </character>
        </book>
    </author>
</characters>

We'll now look at how the input data must be structured and how the configuration file is written. The CSV input data must be grouped in a way that reflects the final XML output. In this case, we have grouped the CSV by author names and book title. Note that the data does not have to be sorted (alphabetically or otherwise) but if all the same values are not grouped together in the input, they will be separated in the output - for example, "Mr Vholes" is separated from the other "Bleak House" characters because he is not grouped with them in the CSV input.
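The effect of grouping only adjacent rows can be illustrated with a short Python sketch. It uses itertools.groupby as a stand-in for the behaviour described above; this is an assumption made for illustration and is not CSVfix's actual implementation. The titles are the third field of the four Dickens rows in the sample data, in file order.

```python
from itertools import groupby

# Book titles (field 3) of the Dickens rows, in the order they appear in the CSV.
titles = [
    "Bleak House",
    "Bleak House",
    "Great Expectations",
    "Bleak House",
]

# groupby only merges *adjacent* equal values, so a value that reappears
# later in the input starts a new group instead of joining the earlier one.
runs = [(title, len(list(rows))) for title, rows in groupby(titles)]

for title, count in runs:
    print(f"{title}: {count} row(s)")
```

"Bleak House" appears as two separate groups, which corresponds to the two separate book tags for that title in the XML output.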

Now let's look at the individual lines of the configuration file. The first line:

# create XML describing some fictional characters

is a comment. Any lines where the first non-whitespace character is a '#', or which consist entirely of whitespace, are ignored by CSVfix.

tag characters

This line specifies the root tag of the XML output, using the tag keyword and giving it the name "characters". All configuration files must specify a single root (i.e. they must specify well-formed XML). The next line:

   tag author group 1,2 attrib forename 2 attrib surname 1

is indented using a single tab character. The difference in indentation means that it is a child of the root tag. It has the name "author". It also uses the group keyword to specify that this tag is used to group together CSV input data which share common values for the first two fields (the 1,2 values). It then specifies that the tag will have two attributes (using the attrib keyword) that will have the names "forename" and "surname" and take their values from the second and first fields respectively. Note there is no requirement that any attribute values are the same as the group values, though this will often be the case. The next line:

      tag book group 3 attrib title 3

specifies a tag which is the child of the author tag (because it is more deeply indented, using two tabs) and is grouped on the third field within the author tag's grouping - the title field. It also specifies a single attribute called "title".
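As a rough model of what the author and book lines describe, the following Python sketch nests one grouping inside another and attaches attributes taken from the fields. It is only an illustration under stated assumptions: the use of itertools.groupby and xml.etree.ElementTree is a choice made here, not how CSVfix works internally, and the data is a cut-down version of the sample with only the first three fields.

```python
import csv
import io
import xml.etree.ElementTree as ET
from itertools import groupby

# Fields 1-3 of the sample data: surname, forename, title.
data = """Dickens,Charles,Bleak House
Dickens,Charles,Great Expectations
Austen,Jane,Emma
Austen,Jane,Pride & Prejudice
"""
rows = list(csv.reader(io.StringIO(data)))

root = ET.Element("characters")

# Outer grouping on fields 1,2 (zero-based indexes 0 and 1 in Python).
for (surname, forename), author_rows in groupby(rows, key=lambda r: (r[0], r[1])):
    # Attributes take their values from fields 2 and 1 respectively.
    author = ET.SubElement(root, "author", forename=forename, surname=surname)
    # Inner grouping on field 3, applied only within this author's rows.
    for title, _book_rows in groupby(author_rows, key=lambda r: r[2]):
        ET.SubElement(author, "book", title=title)

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Each author element ends up containing only the books found within that author's run of rows, which is the sense in which the book grouping operates "within the author tag's grouping".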

         tag character group 4

This line specifies a tag called "character" grouped on the fourth field that has no attributes. As the fourth field is unique (within its parent), there will only be one input row that matches this grouping. That means that the next two lines:

            tag name
               text 4

do not require a group. If the group keyword is omitted, the tag is produced using the first row of the grouping provided by the parent tag. In this case, we output a single "name" tag with no attributes which contains an XML text element, specified by the text keyword. Text elements are always enclosed by their parent tag and cannot themselves have child tags. The next two rows therefore specify tags at the same level as the name:

            tag description
               text 5

Text fields have XML quoting applied to them - for example "Pride & Prejudice" becomes "Pride &amp; Prejudice". If you want to avoid this, you can use the cdata keyword instead, which wraps the output text in an XML CDATA section.
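The difference between the two forms can be sketched in Python. This only models the two kinds of output described above; the exact text CSVfix emits (whitespace, line breaks) is not reproduced here, and the naive CDATA wrapping assumes the field does not itself contain the sequence "]]>".

```python
from xml.sax.saxutils import escape

title = "Pride & Prejudice"

# Model of the text keyword: XML special characters are escaped.
escaped = escape(title)

# Model of the cdata keyword: the raw text is wrapped in a CDATA section.
# Assumes the text contains no "]]>", which would end the section early.
cdata = f"<![CDATA[{title}]]>"

print(escaped)
print(cdata)
```

An XML parser reading either form recovers the same string, "Pride & Prejudice"; the choice only affects how the text looks in the raw XML.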

Some final things to note about this command:


  • It cannot (and is not intended to) produce any arbitrary XML from any arbitrary CSV input. This would need a Turing-complete language. It is intended to produce simple tree structures that closely mirror the CSV input.
  • The command does not currently check that tag and attribute names adhere to XML standards.  
  • If your input records are not grouped in the way you need them, use the CSVfix sort command to group them appropriately.
  • There is no way of combining fields using to_xml - instead use the merge, edit and other similar CSVfix commands to get your data in the right format.
  • The algorithm this command uses makes it necessary to read all input data into memory - this may make it slow or even unusable for very large CSV input files.
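To see why sorting first helps with the grouping point above, here is a small Python sketch. Python's sorted() stands in for the CSVfix sort command and itertools.groupby for the adjacent-row grouping; both are assumptions made for illustration. The rows are a three-field extract (surname, title, character) of the Dickens rows in the sample data, and book_groups is a hypothetical helper.

```python
from itertools import groupby

# (surname, title, character) from the sample data, in file order.
rows = [
    ("Dickens", "Bleak House", "Esther Sumerson"),
    ("Dickens", "Bleak House", "Inspector Bucket"),
    ("Dickens", "Great Expectations", "Pip"),
    ("Dickens", "Bleak House", "Mr Vholes"),
]

def book_groups(records):
    """Return (title, [characters]) for each run of adjacent equal titles."""
    return [(title, [r[2] for r in run])
            for title, run in groupby(records, key=lambda r: r[1])]

# Unsorted input: "Bleak House" is split into two groups.
before = book_groups(rows)

# Sorting on the grouping fields makes equal values adjacent.
# sorted() is stable, so rows keep their original order within a book.
after = book_groups(sorted(rows, key=lambda r: (r[0], r[1])))

print(before)
print(after)
```

After sorting, all three "Bleak House" characters fall into a single group, so they would be emitted under a single book tag.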


The to_xml command understands the following flags:

Flag          Req'd?   Description
-xf file      No       Specifies a configuration file defining how to produce XML from CSV. If omitted, a generic XHTML table is output.
-in indent    No       Specify the number of spaces to use for each level of indent in the XML output. If the special value tabs is used, a single tab character will be used for each level of indent.
-et           No       Specifies that a separate XML end tag will be generated even if a tag has no content. Has no effect if a configuration file is not used.


