Pipelines and filters - tutorials

How to convert a raw text file to XML with XPath or with a regular expression, how to use XPath patterns for filtering a big SAX source, how to design complex pipelines with dynamic redirections, how to redirect SAX events to a DOM subtree.

Before trying these tutorials, you may also want to look at the DTD validation tutorial, which shows a simple pipeline that performs a validation and involves an XInclude filter.

Understanding XCL filters

For a complete understanding of XCL filters, please refer to the XCL specification.

Text to XML

In this example, a raw text file is turned into XML with the help of a few filters.

Batch script

This script simply converts a raw text file to XML:

[doc/tutorial/pipeline/poem/poem.xcl]

<?xml version="1.0" encoding="iso-8859-1"?>
<xcl:active-sheet xmlns:xcl="http://ns.inria.org/active-tags/xcl">
    <!--use a hard-coded filter-->
    <xcl:parse-filter source="res:org.inria.ns.reflex.xml.filter.helpers.LineReader"
        name="lineReader"/>
    <!--read poem.txt with the line reader-->
    <xcl:filter name="poem" source="poem.txt" filter="{ $lineReader }"/>
    <!--wrap text parts with XML elements-->
    <xcl:filter name="xmlPoem" source="{ $poem }">
        <xcl:rule pattern="/">
            <xcl:forward>
                <poem>
                    <xcl:apply-rules/>
                </poem>
            </xcl:forward>
        </xcl:rule>
        <xcl:rule pattern="text()[1]">
            <xcl:forward>
                <author><xcl:apply-rules/></author>
            </xcl:forward>
        </xcl:rule>
        <xcl:rule pattern="text()[2]">
            <xcl:forward>
                <title><xcl:apply-rules/></title>
            </xcl:forward>
        </xcl:rule>
        <xcl:rule pattern="text()">
            <xcl:forward xcl:if="{ . != '' }">
                <line><xcl:apply-rules/></line>
            </xcl:forward>
        </xcl:rule>
    </xcl:filter>
    <!--write XML to file-->
    <xcl:transform source="{ $xmlPoem }" output="poem.xml"/>
</xcl:active-sheet>

A hard-coded filter that can fire SAX events from a non-XML input is parsed (<xcl:parse-filter>) and connected to the text source (<xcl:filter>). The next filter is an inline filter that simply wraps the text parts with XML elements: as explained here, each line of text read fires a character event, which allows the rules to use a positional predicate within their XPath patterns. The last step writes the XML document to a file (<xcl:transform>).
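The positional logic of the rules can be illustrated outside RefleX. The following Python sketch is only an analogy, not the way XCL works internally: the function name poem_to_xml is hypothetical, and it assumes that every line, including an empty one, counts as one text event, which is what the { . != '' } test in the script suggests.

```python
import xml.etree.ElementTree as ET

def poem_to_xml(text: str) -> ET.Element:
    """Wrap each line of a raw poem in an element chosen by its position."""
    root = ET.Element("poem")
    # Positions count every line, empty ones included (assumption, see above).
    for position, line in enumerate(text.splitlines(), start=1):
        if position == 1:
            tag = "author"      # plays the role of text()[1]
        elif position == 2:
            tag = "title"       # plays the role of text()[2]
        elif line != "":
            tag = "line"        # any other non-empty line
        else:
            continue            # empty lines are not forwarded
        ET.SubElement(root, tag).text = line
    return root

sample = (
    "William Shakespeare\n"
    "Love Sonnet 1\n"
    "\n"
    "From fairest creatures we desire increase,\n"
)
print(ET.tostring(poem_to_xml(sample), encoding="unicode"))
```

The first matching branch wins here; this mirrors the fact that the first two lines end up as <author> and <title> rather than <line> in poem.xml.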

To run the script from the console prompt:

 $ java -jar reflex-0.4.0.jar (line cut)
     run doc/tutorial/pipeline/poem/poem.xcl

The input:

[doc/tutorial/pipeline/poem/poem.txt]

William Shakespeare
Love Sonnet 1

From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thy self thy foe, to thy sweet self too cruel:
Thou that art now the world's fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak'st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.

The output (pretty-printed here), which is written to a file rather than to the standard output:

[doc/tutorial/pipeline/poem/poem.xml]

<?xml version="1.0" encoding="UTF-8"?>
<poem>
    <author>William Shakespeare</author>
    <title>Love Sonnet 1</title>
    <line>From fairest creatures we desire increase,</line>
    <line>That thereby beauty's rose might never die,</line>
    <line>But as the riper should by time decease,</line>
    <line>His tender heir might bear his memory:</line>
    <line>But thou contracted to thine own bright eyes,</line>
    <line>Feed'st thy light's flame with self-substantial fuel,</line>
    <line>Making a famine where abundance lies,</line>
    <line>Thy self thy foe, to thy sweet self too cruel:</line>
    <line>Thou that art now the world's fresh ornament,</line>
    <line>And only herald to the gaudy spring,</line>
    <line>Within thine own bud buriest thy content,</line>
    <line>And tender churl mak'st waste in niggarding:</line>
    <line>Pity the world, or else this glutton be,</line>
    <line>To eat the world's due, by the grave and thee.</line>
</poem>

Parsing a multipart SOAP message with a regular expression

A SOAP message is embedded within a MIME file; a regular expression is used to split the message into its parts, and each part is written to its own XML file.

Batch script

This script simply converts a raw text file to several XML files:

[doc/tutorial/pipeline/mime/mime.xcl]

<?xml version="1.0" encoding="iso-8859-1"?>
<xcl:active-sheet xmlns:xcl="http://ns.inria.org/active-tags/xcl"
    xmlns:sys="http://ns.inria.org/active-tags/sys">
    <!--
        parse a SOAP message embedded within a MIME message
        split all XML parts in a single file
    -->
    <xcl:set name="i" value="{ number( 0 ) }" scope="global"/>
    <xcl:parse-filter source="res:org.inria.ns.reflex.xml.filter.helpers.Tokenizer" name="filter">
        <xcl:param name="pattern" value="--.[^\n]+(?:(?:--)|(?:\n(?:.[^\n]+\n)*\n))"/>
    </xcl:parse-filter>
    <xcl:filter name="mime-message" source="mime.txt" filter="{ $filter }"/>
    <xcl:filter name="soap" source="{ $mime-message }">
        <xcl:rule pattern="text()">
            <xcl:if test="{ . != '' }">
                <xcl:then>
                    <xcl:parse name="part" text-source="{ string( . ) }"/>
                    <xcl:transform source="{ $part }" output="soap-part-{ $i }.xml"/>
                    <xcl:set name="i" value="{ $i + 1 }" scope="global"/>
                </xcl:then>
            </xcl:if>
        </xcl:rule>
    </xcl:filter>
    <xcl:transform source="{ $soap }" output="{ $sys:null }"/>
</xcl:active-sheet>

A hard-coded filter that can tokenize a text stream according to a regular expression is parsed (<xcl:parse-filter>) and connected to the text source (<xcl:filter>). The next filter is an inline filter that simply parses each part of the text that is an XML document and saves it to a file. The last step launches the process; it doesn't save anything.
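To see what the regular expression treats as a delimiter, it can be tried outside RefleX. The sketch below is an analogy in Python, assuming that the Tokenizer uses the pattern as a separator and emits the text found between two matches; the function split_parts and the shortened sample are hypothetical, and whitespace-only chunks are dropped here, which is slightly stricter than the { . != '' } test of the script.

```python
import re
import xml.etree.ElementTree as ET

# The pattern from mime.xcl: a boundary line followed either by "--"
# (closing boundary) or by header lines and an empty line.
BOUNDARY = re.compile(r"--.[^\n]+(?:(?:--)|(?:\n(?:.[^\n]+\n)*\n))")

def split_parts(mime_text: str) -> list[str]:
    """Return the non-blank chunks found between the delimiters."""
    return [tok.strip() for tok in BOUNDARY.split(mime_text) if tok.strip()]

# Hypothetical, shortened input with the same layout as mime.txt.
sample = (
    "--4389012.48390\n"
    "Content-Type: text/xml\n"
    "\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<first>one</first>\n"
    "--4389012.48390\n"
    "Content-Type: text/xml\n"
    "Content-Id: RootNode\n"
    "\n"
    '<?xml version="1.0" encoding="UTF-8"?><RootNode>two</RootNode>\n'
    "--4389012.48390--\n"
)

for i, part in enumerate(split_parts(sample)):
    # Each chunk is parsed on its own, as <xcl:parse> does with text-source.
    root = ET.fromstring(part.encode("utf-8"))
    print(f"soap-part-{i}.xml", root.tag)
```

Note that the pattern consumes the MIME headers together with the boundary line, so each remaining chunk starts directly at its XML declaration.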

To run the script from the console prompt:

 $ java -jar reflex-0.4.0.jar (line cut)
     run doc/tutorial/pipeline/mime/mime.xcl

The input:

[doc/tutorial/pipeline/mime/mime.txt]

--4389012.48390
Content-Type: text/xml

<?xml version="1.0" encoding="UTF-8"?>
<soap-env:Envelope
xmlns:soap-env="http://schemas.xmlsoap.org/soap/envelope/">
...snip...
</soap-env:Envelope>
--4389012.48390
Content-Type: text/xml
Content-Id: RootNode

<?xml version="1.0" encoding="UTF-8"?><RootNode>
.. snip ...
</RootNode>
--4389012.48390--

The two outputs, each in its own file:

[doc/tutorial/pipeline/mime/soap-part-0.xml]

<?xml version="1.0" encoding="UTF-8"?>
<soap-env:Envelope xmlns:soap-env="http://schemas.xmlsoap.org/soap/envelope/"> ...snip... </soap-env:Envelope>

[doc/tutorial/pipeline/mime/soap-part-1.xml]

<?xml version="1.0" encoding="UTF-8"?>
<RootNode> .. snip ... </RootNode>

Splitting a big SAX source into multiple XML chunks with XPath-based filters

Huge XML sources cannot reasonably be processed with XSLT or other XPath-based tools that need to load the entire XML file into memory, as this leads to an "OutOfMemory" error.

Since v0.2.0 of RefleX, XCL filters based on XPath pattern rules can be designed quickly. Moreover, such filters work both on DOM trees (say, small files < 100MB) and on SAX events (big files > 100MB). You might only consider a document huge when its size exceeds 1GB, but the document used in this example was big enough to raise an error with a two-line Java program that DOM-parsed it with a standard parser. You can try much bigger documents in the same way if you like.

In this example, a 15MB file is processed with SAX. Each time a given XPath pattern is matched, the output is redirected to a small XML chunk serialized to an independent file.

Batch script

The XML source is a big list of tracks and playlists. It is a data structure made of typed data; the type of each piece of data depends on the name of the element that contains it.

A main SAX pipeline reads the entire XML source; each track and each playlist is plugged into another channel, to which its SAX events are forwarded for serialization. The name of the file to write is computed from the ID of the track or the playlist, but the ID doesn't appear in the same place in both cases:

  • The ID of a track is held by the <key> element BEFORE the track itself (a <dict> element); as we don't need to process the content of a track, the events are forwarded as-is to a SAX channel.
      <key>1234</key><!--the ID of the track-->
      <dict>
          <!--the content of the track-->
      </dict>
  • The ID of a playlist is held by the <integer> element INSIDE the playlist itself (which is also a <dict> element); we have to pour the SAX events into a DOM container in order to retrieve the ID of the playlist, which is used for the name of the file.
      <dict>
          <integer>4321</integer><!--the ID of the playlist-->
          <!--the content of the playlist-->
      </dict>

The tracks and the playlists are in separate subtrees of the document, reachable by different paths: /plist/dict/dict for tracks (pairs of <key>, <dict>), and /plist/dict/array for playlists (a list of <dict>).

While the following script runs, it counts the files written and prints a message to the standard output each time 500 more files have been written, so that progress can be followed.

[doc/tutorial/pipeline/big/big.xcl]

<?xml version="1.0" encoding="iso-8859-1"?>
<xcl:active-sheet xmlns:xcl="http://ns.inria.org/active-tags/xcl"
    xmlns:sys="http://ns.inria.org/active-tags/sys">
    <!--get the system property "file"-->
    <xcl:set name="file" value="{ string( $sys:env/file ) }"/>
    <xcl:set xcl:if="{ not( $file ) }" name="file" value="Bibliotheque32Kilos.xml"/>
    <xcl:echo value="Parsing { $file }"/>
    <xcl:set name="i" scope="global" value="0"/>
    <xcl:parse name="biblio" source="{ $file }" style="stream"/>
    <xcl:filter name="bib" source="{ $biblio }">
        <xcl:rule pattern="/plist/dict/dict/key" normalize="yes">
            <!--the name of the file for the tracks is just the <key> before a <dict> -->
            <xcl:set name="trackName" scope="global" value="{ string( text() ) }"/>
        </xcl:rule>
        <xcl:rule pattern="/plist/dict/dict/dict">
            <!--a TRACK-->
            <xcl:set name="i" scope="global" value="{ $i + 1 }"/>
            <xcl:echo xcl:if="{ $i mod 500 = 0 }" value="{ $i } files created..."/>
            <!--the name of the file is not taken within,
                we just need a derivation to another SAX channel-->
            <xcl:document name="track" style="stream">
                <xcl:forward channel="track">
                    <xcl:apply-rules/>
                </xcl:forward>
            </xcl:document>
            <!--SAX to file-->
            <xcl:transform source="{ $track }" output="split/tracks/{ $trackName }.xml"/>
        </xcl:rule>
        <xcl:rule pattern="/plist/dict/array/dict">
            <!--a PLAYLIST-->
            <xcl:set name="i" scope="global" value="{ $i + 1 }"/>
            <xcl:echo xcl:if="{ $i mod 500 = 0 }" value="{ $i } files created..."/>
            <!--the ID used to compute the file name is within,
                thus, we are pouring SAX events inside a DOM container-->
            <xcl:document name="playlist" style="tree">
                <xcl:forward channel="playlist">
                    <xcl:apply-rules/>
                </xcl:forward>
            </xcl:document>
            <!--now, we can compute the name of the file of the playlist from the DOM-->
            <xcl:transform source="{ $playlist }"
                output="split/playlists/{ string( $playlist/dict/integer ) }.xml"/>
        </xcl:rule>
    </xcl:filter>
    <xcl:transform source="{ $bib }" output="{ $sys:null }"/>
    <xcl:echo value="Number of files created : { $i }"/>
</xcl:active-sheet>

The main channel of the pipeline consists of:

  1. Parsing the XML input à la SAX with <xcl:parse>.
  2. Connecting the filter to the parsed source with <xcl:filter>, which contains 3 <xcl:rule>s (+1 invisible default rule that consists in forwarding everything read to the next step). Notice that this filter could be hosted in a separate file to make it reusable by several active sheets.
  3. Connecting the main channel of the filter to a serializer that writes to $sys:null ("/dev/null"), as we don't need to do anything with the nodes that are reaching that point. <xcl:transform> is the action that launches the pipeline process.

Inside the filter, there are:

  • a rule that is used to compute the file name of a track;
  • a rule that matches each track: a new SAX document is created to host the events that are forwarded to it; this SAX document is serialized to a file;
  • a rule that matches each playlist: this time, a new DOM document is created, which makes it possible to read the ID of the playlist inside it.
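For readers who want to compare with a general-purpose language, the three rules can be approximated with a streaming parser. The Python sketch below is an analogy only, not RefleX's implementation: split_library, the write callback and the tiny sample document are hypothetical, and unlike the track rule of the script, which forwards raw SAX events, it builds a small tree for every chunk, tracks included.

```python
import io
import xml.etree.ElementTree as ET

TRACK_KEY = ["plist", "dict", "dict", "key"]
TRACK = ["plist", "dict", "dict", "dict"]
PLAYLIST = ["plist", "dict", "array", "dict"]

def split_library(source, write):
    """Stream `source`, call write(name, xml_text) per chunk, return the count."""
    path = []          # element names from the root down to the current element
    track_name = None  # text of the last <key> seen before a track <dict>
    count = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue
        if path == TRACK_KEY:
            track_name = elem.text
        elif path in (TRACK, PLAYLIST):
            if path == TRACK:
                name = f"split/tracks/{track_name}.xml"
            else:
                # The ID is inside the chunk, so the chunk must be complete.
                name = f"split/playlists/{elem.findtext('integer')}.xml"
            write(name, ET.tostring(elem, encoding="unicode").strip())
            elem.clear()  # drop the chunk's content once it has been written
            count += 1
            if count % 500 == 0:
                print(f"{count} files created...")
        path.pop()
    return count

# Hypothetical miniature of the library file.
sample = (
    "<plist><dict>"
    "<key>Tracks</key>"
    "<dict><key>1234</key><dict><key>Name</key><string>A</string></dict></dict>"
    "<key>Playlists</key>"
    "<array><dict><key>Playlist ID</key><integer>4321</integer></dict></array>"
    "</dict></plist>"
)

written = {}
total = split_library(io.BytesIO(sample.encode("utf-8")), written.__setitem__)
print("Number of files created :", total)
```

In this sketch the write callback only stores the chunks in a dictionary; a real run would write each one to the named file instead.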

The <xcl:forward> element is used to forward some content to the next step of the pipeline; by default, this is the main channel, but if @channel is specified, the events can be redirected to an alternative channel. It accepts several channel names. The special name "#main" also refers to the main channel, so if you replace channel="track" with channel="track #main", the events will be duplicated. Notice that to make this useful, you should also replace $sys:null with $sys:out or any file reference in the last <xcl:transform> action.
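The effect of listing several channel names can be pictured with ordinary SAX handlers. This Python sketch is a loose analogy rather than a description of how RefleX dispatches events: the Forwarder class and its forward method are hypothetical, and each channel is simply a standard XMLGenerator writing to its own buffer.

```python
import io
from xml.sax.saxutils import XMLGenerator

class Forwarder:
    """Send an element to every channel named in a space-separated list."""

    def __init__(self, channels):
        self.channels = channels  # channel name -> SAX content handler

    def forward(self, names, tag, text):
        for name in names.split():
            handler = self.channels[name]
            handler.startElement(tag, {})
            handler.characters(text)
            handler.endElement(tag)

main_out, track_out = io.StringIO(), io.StringIO()
fwd = Forwarder({
    "#main": XMLGenerator(main_out),   # the main channel
    "track": XMLGenerator(track_out),  # an alternative channel
})

fwd.forward("track", "key", "1234")        # alternative channel only
fwd.forward("track #main", "key", "5678")  # duplicated on both channels

print("track:", track_out.getvalue())
print("#main:", main_out.getvalue())
```

Only the second element reaches the main output, which is why duplicating events is pointless while the main channel is still discarded.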

A 15MB XML document is compressed inside a 1MB zip file, and will cause the creation of about 13000 files:

 $ java -Dfile=zip:http://disc.inria.fr/perso/philippe.poulard/xml(line cut)
/tests-data/Bibliotheque15Megas.zip\!Bibliotheque15Megas.xml (line cut)
     -jar reflex-0.4.0.jar (line cut)
     run doc/tutorial/pipeline/big/big.xcl
Parsing zip:http://disc.inria.fr/perso/philippe.poulard/(line cut)
xml/tests-data/Bibliotheque15Megas.zip!Bibliotheque15Megas.xml
500 files created...
1000 files created...
1500 files created...
[.../...]
12500 files created...
13000 files created...
Number of files created : 13138

You can check the created files in the directory doc/tutorial/pipeline/big/split/.

Since the program takes a few minutes with this file, you can also run a quicker test with a much smaller XML file (32 KB) instead:

 $ java -Dfile=Bibliotheque32Kilos.xml (line cut)
      -jar reflex-0.4.0.jar (line cut)
     run doc/tutorial/pipeline/big/big.xcl

To learn more about how XPath patterns are processed with SAX in RefleX, read this article. To see more examples, have a look at the test suite that covers the XCL module.

To learn more about XCL filtering, please refer to the XCL specification.

If you have to perform the inverse operation, that is to say recompose a big XML file from thousands of independent sources, refer to that tip.