XPath filtering on SAX and non-XML text streams

The first usage of SAX that comes to mind is filtering, that is to say acting on the stream while parsing. A great effort has been made in RefleX to support smart XPath patterns on streams, which allows one to take advantage of streaming with a higher-level API than the low-level one provided by SAX.

This article discusses the strategy adopted in RefleX to filter both SAX and non-XML streams with XPath, and shows how filtering works in Active Tags.

The simplest data model

To keep the benefit of streaming, specifically with SAX, one must take care of memory usage. As XPath patterns à la XSLT essentially focus on the path to traverse to reach a node of the tree, it is obvious that the hierarchy of the current branch must be maintained. Other nodes should be released.

Thus, if nothing more were needed, the data model would be reduced to a single branch of the tree, from the document root to the current node. To save memory, when the next node is read from the input source, the current node is pruned from the branch in favor of the new one. Nodes are kept only for elements that have content, and removed when exiting that content.

With this very simple approach, one can apply basic patterns like those below:

/         the document root
a         a node
a/b/c     a relative strict hierarchy of nodes
a//b      a relative hierarchy of nodes
/a/b/c    an absolute strict hierarchy of nodes
/a//b     an absolute hierarchy of nodes

Of course, other node tests can be considered: text(), comment(), any element (*), any node (node()), any element within a namespace (prefix:*), any processing instruction (processing-instruction()), and a specific processing instruction (processing-instruction('target')).

Attributes and namespace nodes are not involved in node tests because they are not encountered while traversing the input tree, but they may appear in predicates:

a[@b]
a[not(@b)]
a[@b='c']
a[@b='c']/d[@e]

...which doesn't require much more effort, because attributes are supplied by SAX when an element is encountered.
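To make the branch model concrete, here is a minimal sketch in Python (not RefleX's actual Java implementation) that keeps only the current branch in memory and matches a strict relative pattern with an attribute predicate, such as a/b[@c]:

```python
import xml.sax
from io import BytesIO

class BranchFilter(xml.sax.ContentHandler):
    """Keeps only the current branch (root -> current node) in memory."""
    def __init__(self, steps):
        # steps: list of (name, required_attr_or_None);
        # e.g. the pattern a/b[@c] becomes [("a", None), ("b", "c")]
        self.steps = steps
        self.branch = []      # the single branch kept in memory
        self.matches = []

    def startElement(self, name, attrs):
        self.branch.append((name, dict(attrs)))
        # compare the tail of the branch with the steps of the pattern
        if len(self.branch) >= len(self.steps):
            tail = self.branch[-len(self.steps):]
            if all(n == sn and (sa is None or sa in a)
                   for (n, a), (sn, sa) in zip(tail, self.steps)):
                self.matches.append("/".join(n for n, _ in self.branch))

    def endElement(self, name):
        self.branch.pop()     # prune the node when exiting its content

doc = b'<a><b c="x"/><b/></a>'
f = BranchFilter([("a", None), ("b", "c")])   # pattern: a/b[@c]
xml.sax.parse(BytesIO(doc), f)
print(f.matches)   # -> ['a/b']
```

Only the second start tag fails the predicate: the first <b> carries the @c attribute, the second doesn't, and neither requires keeping any sibling in memory.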

Using position()

Other predicates could be considered, for example, it would be very convenient to match an entry that satisfies :

/a/b/c[1]
a/*[2]
a/comment()[3]
a/node()[position() < 4]

...which implies that the position of the node must be retrieved. If the whole tree were maintained, it would be easy to traverse the previous siblings and count the nodes, but with SAX, what has been read previously is lost, so we must anticipate and keep a count of the nodes read; thus, each node of the branch has one intrinsic position. One?

In fact, the position of a node may vary according to the node test involved in a step; for example, let's consider this document:

Note

For readability, indentation has been added; the reader should consider that there is only a single text node: "[some text]".

<?xml version="1.0" encoding="iso-8859-1"?>
<root>
  [some text]
  <a/>
  <!-- -->
  <b/>
  <?pi ip?>
  <b foo="bar"/>
</root>

Now, consider the following patterns :

b[2]
*[3]
node()[6]

They all match the last <b> element, which implies that there are up to 4 intrinsic positions for the same node (there is also the node test on any element within a namespace, prefix:*, which is not revealed by this example). Let's call these intrinsic positions:

  • the natural index
  • the type index
  • the family index (namespace)
  • the name index

Maintaining some additional counters on each node of the current branch is very acceptable regarding the memory usage.

The following table shows which index has to be used according to the XML construct encountered and the node test involved :

XML construct    natural index   type index                 family index   name index
<p:element>      node()          *                          p:*            element
some text        node()          text()                     N/A            N/A
<!--comment-->   node()          comment()                  N/A            N/A
<?target ?>      node()          processing-instruction()   N/A            processing-instruction('target')

For example, if a comment <!--comment--> is read in the input source, the "type index" will be used if the node test is comment(), but the "natural index" will be used if the node test is node().

The index of the document root is 1.

Thus, each time a node is encountered, its parent increments up to 4 counters. RefleX stores counters with keys that reflect the name and the type of the node considered. The following table shows how the counters of the <root> element evolve while reading the input:

key                            value   input
text()                         1       [some text]
node()                         1
text()                         1       <a/>
node()                         2
*                              1
a                              1
text()                         1       <!-- -->
node()                         3
*                              1
a                              1
comment()                      1
text()                         1       <b/>
node()                         4
*                              2
a                              1
comment()                      1
b                              1
text()                         1       <?pi ip?>
node()                         5
*                              2
a                              1
comment()                      1
b                              1
processing-instruction()       1
processing-instruction('pi')   1
text()                         1       <b foo="bar"/>
node()                         6
*                              3
a                              1
comment()                      1
b                              2
processing-instruction()       1
processing-instruction('pi')   1

You might have noticed that the counter bound to the node() key is incremented for each XML construct read. Other counters are incremented according to the type of the node.

After parsing the child nodes of the <root> element, 8 counters have been created for this element.
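The counter strategy can be sketched as follows (hypothetical Python, mirroring the table above; the family index for namespaced elements is left out for brevity): each parent holds a dictionary of counters, and each child increments the keys corresponding to its possible node tests.

```python
from collections import Counter

def child_keys(kind, name=None):
    """Counter keys incremented when a child node is read.
    kind: 'element', 'text', 'comment' or 'pi'; name: element name or PI target.
    The namespace (family) index is omitted in this sketch."""
    keys = ["node()"]                       # the natural index, always
    if kind == "element":
        keys += ["*", name]                 # type index and name index
    elif kind == "text":
        keys += ["text()"]
    elif kind == "comment":
        keys += ["comment()"]
    elif kind == "pi":
        keys += ["processing-instruction()",
                 "processing-instruction('%s')" % name]
    return keys

# the child nodes of <root> in the sample document
children = [("text", None), ("element", "a"), ("comment", None),
            ("element", "b"), ("pi", "pi"), ("element", "b")]

counters = Counter()
for kind, name in children:
    for key in child_keys(kind, name):
        counters[key] += 1

print(counters["node()"], counters["*"], counters["b"], len(counters))
# -> 6 3 2 8
```

The final state reproduces the last group of the table: node()=6, *=3, b=2, and 8 distinct counters in total.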

Reading forward

/a/b/c[last()]
a/*[count() > 3]
a/node()[last()]

Unlike the position() function, the last() and count() functions require knowing in advance the total number of nodes, which is not yet available when the current node is read. For the same reasons as previously, 4 values of size have to be considered:

  • the natural size
  • the type size
  • the family size (namespace)
  • the name size

However, unlike previously, the values are left unknown unless one of these sizes is expected explicitly or implicitly in a predicate. To compute the value, one must read further input until the end of the parent element, without losing the current node and its following siblings. A cache is necessary to achieve this. RefleX simply connects the nodes read in advance to the current one, so that a subtree is kept in memory. Once the predicate is evaluable, the events are processed from the tree instead of the input, and the tree is pruned as one goes along.

This strategy also works when forward and descendant axes are involved in the pattern, but it can cost a lot of memory. Users should be aware that using such patterns is inefficient if almost all the tree has to be read in advance; in that case a strategy based on DOM instead of SAX would be more suitable. Conversely, near the bottom of a branch, the ability to use such patterns is a great advantage that does not lose the benefit of streaming.

Here are some other patterns that require reading forward :

a[following-sibling::b]
a[b]
a[*[not(self::b)]]
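As an illustration of the caching strategy, here is a hypothetical Python sketch for the pattern a[b]: the events inside each <a> element are cached until its end tag is read, at which point the predicate becomes evaluable and the cache can be released.

```python
def matches_a_with_b_child(events):
    """Caches the events inside each <a> element until </a> is read,
    then evaluates the predicate [b] on the cached subtree.
    events: a flat stream of ("start", name) / ("end", name) pairs."""
    results = []
    cache, depth = None, 0
    for kind, name in events:
        if cache is None:
            if kind == "start" and name == "a":
                cache, depth = [], 1          # reading forward begins
            continue
        if kind == "start":
            depth += 1
            cache.append((name, depth))       # depth 2 == direct child of <a>
        elif kind == "end":
            depth -= 1
            if depth == 0:                    # predicate is now evaluable
                results.append(any(n == "b" and d == 2 for n, d in cache))
                cache = None                  # release the cached subtree
    return results

events = [("start", "a"), ("start", "b"), ("end", "b"), ("end", "a"),
          ("start", "a"), ("start", "c"), ("end", "c"), ("end", "a")]
print(matches_a_with_b_child(events))   # -> [True, False]
```

The memory cost is visible here: each candidate <a> element must be held entirely in the cache before the match can be decided.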

Side effects

This feature requires some caution :

  • The strategy regarding the counters becomes slightly different: if, when reading forward, the indexes were incremented each time a node is encountered, they would be wrong once the current node is evaluated. Thus, when the cache is turned on, only the sizes are incremented; when it is turned off again and the data from the cache are browsed, the indexes are incremented as usual. We must take care that the index values always reflect the position of the current node.
  • When reading forward is expected, the call stack from the point that starts to evaluate the XPath patterns must end without losing the evaluation context, like coroutines do. As the Java language doesn't support coroutines, 2 threads must share the job: one that handles SAX events (read and push), the other that evaluates XPath patterns (pull). Once the former thread launches the latter, it waits before exiting its method:
    • until the latter ends normally, because it doesn't need further reading
    • until the latter expects some further reading; each time a new SAX event is received, the former thread must check whether the conditions to stop reading forward are met.
    These threads are not designed for parallel execution, although they could run that way, but rather to keep 2 independent call stacks. When one is running, the other is suspended. They are not run in parallel to avoid too much memory consumption (too many SAX events would have to be cached).
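The alternation of the two call stacks can be sketched in Python with a bounded queue (an assumption of this sketch, not RefleX's actual mechanism): with a capacity of 1, the pushing thread is suspended until the pulling thread has consumed the event, so the two threads effectively take turns rather than run in parallel.

```python
import threading, queue

events_q = queue.Queue(maxsize=1)   # capacity 1: the producer blocks until
                                    # the consumer has pulled the event

def sax_producer():
    """Stands for the thread that handles SAX events (read and push)."""
    for ev in ["start:root", "text:hello", "end:root", None]:
        events_q.put(ev)            # suspends here until the puller is ready

received = []
def xpath_puller():
    """Stands for the thread that evaluates XPath patterns (pull)."""
    while True:
        ev = events_q.get()
        if ev is None:              # end of input: exit normally
            break
        received.append(ev)

t = threading.Thread(target=xpath_puller)
t.start()
sax_producer()                      # runs in the main (producer) thread
t.join()
print(received)   # -> ['start:root', 'text:hello', 'end:root']
```

A larger queue capacity would let the producer run ahead, which is exactly the extra SAX-event caching the article says the design avoids.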

Implementing a strategy for reading forward is not neutral in terms of complexity, but the feature is worth it.

ID tracking

id("foo")
id("foo")/child::para[position()=5]/a/b/c[last()]

In order to supply the expected identified element, a simple ID tracking is performed while SAX parsing.

This feature is also useful when resolving XInclusions with a shorthand XPointer or with the XPointer element() scheme.

All elements that have an attribute declared in the DTD as type ID are kept in memory. Their child nodes are released after the underlying events have been fired, unless they themselves have a unique ID: when an element that has a unique ID is a child of an element that also has a unique ID, the child remains attached to its parent.
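A minimal Python sketch of such ID tracking follows; for simplicity it treats any attribute named "id" as an ID (an assumption of this sketch: the real implementation checks the DTD declaration), and records, for each identified element, its nearest identified ancestor so that identified children stay attached to identified parents.

```python
import xml.sax
from io import BytesIO

class IdTracker(xml.sax.ContentHandler):
    """Tracks identified elements while SAX parsing. This sketch treats
    any attribute named 'id' as an ID; a real implementation would rely
    on the DTD declaring the attribute as type ID."""
    def __init__(self):
        self.by_id = {}        # id value -> (element name, parent id or None)
        self.id_stack = [None] # nearest identified ancestor at each depth

    def startElement(self, name, attrs):
        ident = attrs.get("id")
        if ident is not None:
            # keep the element, attached to its nearest identified ancestor
            self.by_id[ident] = (name, self.id_stack[-1])
        self.id_stack.append(ident if ident is not None else self.id_stack[-1])

    def endElement(self, name):
        self.id_stack.pop()

doc = b'<root id="r"><a><b id="foo"/></a><c id="bar"/></root>'
t = IdTracker()
xml.sax.parse(BytesIO(doc), t)
print(t.by_id)
# -> {'r': ('root', None), 'foo': ('b', 'r'), 'bar': ('c', 'r')}
```

Note that <b id="foo"> is attached to "r" even though its direct parent <a> has no ID: only identified ancestors are retained.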

Filtering with Active Tags

In Active Tags and RefleX, rule-based filters can be defined with the <xcl:filter> element; within a filter definition, rules are described with the <xcl:rule> element. A filter can be defined alone in an external active sheet, and connected to a pipeline by another one:

<?xml version="1.0" encoding="iso-8859-1"?>
<xcl:filter xmlns:xcl="http://ns.inria.org/active-tags/xcl">
  <xcl:rule pattern="/">
    <xcl:forward>
      <wrapper>
        <xcl:apply-rules/>
      </wrapper>
    </xcl:forward>
  </xcl:rule>
  <xcl:rule pattern="bar[@delete]"/>
  <xcl:rule pattern="foo[1]/bar">
    <xcl:forward>
      <newBar>
        <xcl:apply-rules/>
      </newBar>
    </xcl:forward>
  </xcl:rule>
</xcl:filter>
  <!--use a filter-->
  <xcl:parse-filter name="myFilter" source="file:///path/to/filter.xcl"/>
  <!--connect a source to the filter and the filter to an output-->
  <xcl:parse name="input" source="file:///path/to/input.xml" style="stream"/>
  <xcl:filter filter="{ $myFilter }" name="filtered" source="{ $input }"/>
  <xcl:transform output="file:///path/to/output.xml" source="{ $filtered }"/>

By default, any node that hasn't been matched by a pattern is forwarded to the next step of the pipeline. It is then easy to alter the input stream by designing rules only with patterns that match specific nodes.

It is worth noting that a filter can be designed indifferently for a DOM input or a SAX input; the output should differ only when side effects occur due to releasing the tree while reading with SAX, or when deferred actions occur.

Please refer to the XCL documentation.

Matching text nodes

In the XPath data model, there are no adjacent text nodes: consecutive syntax constructs related to text appear as a single text node:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE foo [ <!ENTITY merge "a single text node"> ]>
<foo>The "foo" element <![CDATA[(from <foo> to </foo>)]]> contains &merge; (un seul n&#339;ud)</foo>

The root element of the above document contains a single text node. This standard approach, called "normalization", is nevertheless somewhat restrictive:

  • in order to process chunks of text more efficiently, one might want to explicitly forward independent text nodes; in this way, XPath patterns such as text()[1] or text()[last()] could be considered.
  • when reading a non-XML input with SAX, the whole content of the file would otherwise be served as one big text node.

These reasons are sufficient to avoid merging adjacent text nodes automatically and systematically:

  • SAX might fire several character events, at most 1 for each syntax construct
  • DOM will keep separate the text contained in CDATA sections and in entity references

The strategy adopted in XCL filters is to leave text nodes as they are in the tree, unless normalization is explicitly requested; thus, if a text is split intentionally, rules based on the position of the text can be considered. This also allows the design of non rule-based filters that can split a text file into independent chunks of data.

  <xcl:rule normalize="yes" pattern="/foo">
      <!--adjacent text nodes within are merged-->
  </xcl:rule>

The @normalize attribute can also be specified at the filter level, to define a global default behaviour regarding adjacent text nodes (which is set to "no" by default).

An example where XPath patterns match consecutive but independent text nodes is available in the tutorial section.

Reading non-XML input

The XML Control Language defines some built-in filters designed to read text streams; in RefleX, they are hard-coded filters (available by their class name). Either the XCL URI or the "class" URI can be used indifferently, as RefleX uses an Active Catalog to map the XCL URI to the "class" URI, which will be instantiated to a compiled filter. These filters are non rule-based filters that read a text stream and apply their own strategy while reading, with the aim of keeping the benefits of streaming. Huge text files can be split into a sequence of texts that fire character events to the next step.

The built-in parsing filters are:

  • http://ns.inria.org/active-tags/xcl/filters#LineReader : reads the text input line by line
  • http://ns.inria.org/active-tags/xcl/filters#Tokenizer : splits the text input around the strings that match a regular expression

In the latter filter, the regular expression used to tokenize the input text is a parameter of the filter. Another important parameter to be aware of is the buffer size, that is to say the maximum number of characters that can be read at once and against which the regular expression is matched. If the regexp doesn't match the buffered string, it will be forwarded as a single character event; the user should size the buffer so that this case never occurs, but it is an upper limit to keep in mind.

  <xcl:parse-filter name="myFilter" source="http://ns.inria.org/active-tags/xcl/filters#Tokenizer">
      <xcl:param name="pattern" value="--.[^\n]+(?:(?:--)|(?:\n(?:.[^\n]+\n)*\n))"/>
      <xcl:param name="buffer" value="1024"/>
  </xcl:parse-filter>
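The buffering strategy can be sketched in Python as follows (a simplified sketch, not RefleX code: a separator straddling a buffer boundary is not handled here, which is precisely why the buffer must be sized generously): read at most buffer_size characters, split the chunk around the regexp matches, and if nothing matches, forward the whole buffer as a single event.

```python
import re
from io import StringIO

def tokenize(reader, pattern, buffer_size):
    """Splits a character stream around regexp matches, reading at most
    buffer_size characters at once."""
    regex = re.compile(pattern)
    while True:
        chunk = reader.read(buffer_size)
        if not chunk:
            break
        last = 0
        for m in regex.finditer(chunk):
            if m.start() > last:
                yield chunk[last:m.start()]   # text before the separator
            last = m.end()
        if last == 0:
            yield chunk        # no match: the buffer is forwarded as one event
        elif last < len(chunk):
            yield chunk[last:] # text after the last separator

text = StringIO("one;two;three")
print(list(tokenize(text, ";", 1024)))   # -> ['one', 'two', 'three']
```

With a buffer smaller than the distance between two separators, some chunks would be forwarded unsplit; this is the upper limit the user has to keep in mind.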

A full example is available in the tutorial section.

Pouring SAX to DOM

With SAX, memory consumption is one of the main concerns even when reading forward is expected, as explained previously: when the cached data are browsed, the memory is freed as the nodes are consumed. However, it is impossible to guess in advance whether a given XPath expression will select a node before or after the current node. For example:

following-sibling::foo[1]/preceding-sibling::bar[1]

Depending on the data, this expression can return either a node after the context node (which implies reading forward), or a node before the context node (which has been discarded, implying that an empty set is returned). Be aware of the "loose" behaviour of the engine: the second case won't fail, it will just return nothing. In fact, such an XPath expression is unpredictable with SAX when it is used in a pattern.

The best way to avoid such imprecision and mismatch, and more generally to act on a node that would be discarded with SAX, is to anticipate by creating a subtree that works in both cases, that is to say a DOM container that hosts all the nodes involved in the expression, and that won't be discarded automatically by the filter. In other words, to pour a SAX (volatile) subtree into a DOM (persistent) subtree that acts as a "container".

  <xcl:rule pattern="here">
      <xcl:document name="container" style="tree">
          <xcl:forward channel="container">
              <xcl:apply-rules/>
          </xcl:forward>
      </xcl:document>
      <xcl:forward>{
          $container/there/following-sibling::foo[1]/preceding-sibling::bar[1]
      }</xcl:forward>
  </xcl:rule>

This example shows that there is almost always a workaround to achieve elegantly simple things that would be very hard to implement directly with SAX. Dealing locally with a DOM tree instead of SAX events is a smart solution. The <xcl:forward> element can redirect the filtered input to an alternative channel: a DOM document that hosts the content. The nodes selected from this subtree with an arbitrarily complex XPath expression can then be forwarded to the main channel.

In the tutorial section, an example shows, when splitting a SAX input into several output XML files, how to extract a value from a subtree without its preceding nodes being discarded.

Splitting a SAX input to multiple outputs

This strategy is also efficient for large input documents that can't be stored entirely in memory as a DOM tree. SAX is more suitable for reading such a document, and DOM can be used locally for loading subtrees.

As each subtree that matches a given pattern can be considered a standalone document, it is also possible to serialize it to a file; if accessing the content is not needed, the created document can also be a SAX document.

Below, a database has been flushed to a single huge XML file; each "purchase order" is saved as a separate file.

  <xcl:parse name="allPo" source="file:///path/to/purchase-orders.xml" style="stream"/>
  <xcl:filter name="poSplitter" source="{ $allPo }">
      <xcl:rule pattern="/purchase-orders/order">
          <xcl:document name="order" style="stream">
              <xcl:forward channel="order">
                  <xcl:apply-rules/>
              </xcl:forward>
          </xcl:document>
          <xcl:transform output="file:///path/to/purchase-orders/order-{ @id }.xml" source="{ $order }"/>
      </xcl:rule>
  </xcl:filter>
  <xcl:transform output="{ $sys:null }" source="{ $poSplitter }"/>

In this example, the main stream is forwarded to $sys:null, which stands for "/dev/null". This is a trick to start the pipeline, as it is required to connect a consumer to the last step of the chain. While parsing, the filter redirects each "purchase order" to a new file saved on the file system.

In the tutorial section, an example shows how to split with SAX a 15 MB XML file into about 13000 output XML files.

Limitations

As suggested by Michael Kay, it would be better to use a technique that stores indexes by location paths rather than by steps as described in this article. Indeed, some edge cases can't be handled correctly by the SAX filter:

  • composed predicates such as child::p[@a=17][5]
  • grouping such as (/foo/*/*)[last()]

Conclusion

The ability to use XPath patterns on SAX and non-XML stream inputs is a great advantage:

  • we are able to read very large files
  • we are able to process them with a higher level API than SAX's
  • we are able to apply regular expressions on raw text

For this purpose, RefleX uses an efficient on-demand caching strategy. XPath becomes usable on streaming inputs in a very acceptable way that makes designing filters much simpler.

Summary

The XML Control Language offers very convenient facilities for filtering XML and raw text sources.

When filtering SAX or non-XML stream inputs with RefleX, XPath patterns are restricted: forward and descendant axes may only be used in their predicates; the contextual position/size of a node is also allowed.

Those limitations do not apply to DOM inputs, and the same filter definitions work with both SAX and DOM.