Xtract: a query language for XML documents

Malcolm Wallace, Colin Runciman
University of York
December 1998; updated June 1999, August 1999, and February 2000

Introduction

Xtract is a query language based originally on XQL, which was a W3C proposal that eventually mutated into XPath and XQuery. The syntax of Xtract is very similar to XPath, although not completely conformant.

The idea of Xtract is that it can be used as a kind of XML-grep at the command-line, but it could also be used within a scripting language (such as the Haskell XML combinator library) as a shorthand for a complicated selection filter.

All queries return a sequence of XML document fragments (either whole tagged elements or text contained inside an element): for our purposes, we also treat attribute values as document fragments.

This document describes the expression language for queries.

Queries

Just as in XPath, a query looks rather like a Unix file path, where the ``directories'' are tags of parent nodes, and the / separator indicates a parent/child relationship. Hence,

    matches/match/player

selects the player elements inside match elements inside a matches element. The star * can be used as a wildcard meaning any tagged element, thus:

    matches/*/player

means the player elements inside any element within a matches element. The star can also be used as a suffix or prefix to match a range of tags. (Note that this is not a full regular expression language: we just provide for the common cases of wildcards.) For example,

    html/h*

means all the headings (<H1> to <H6>) within an HTML document (and HR too, since any tag beginning with H matches). A double slash indicates a recursive search for an element tag, so

    matches//player

means all player elements found at any depth within a matches element. The plain text enclosed within a tagged element is expressed with a dash symbol:

    matches/location/-

means the plain text of the location, without any surrounding <location> tags. Likewise,

    *//-

simply means to flatten the text of the document at all levels, removing all tagged element structure. The union of two queries is expressed with the + operator and parentheses if required:

    matches/match/(player + spectators)

gives both the players and spectators at a match. Finally,

    matches//player/@goals

returns the value of the attribute `goals' on the selected player elements, if the attribute appears.
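
As a rough illustration of what the path queries above select, the following sketch mimics a few of them in Python using the standard library's xml.etree.ElementTree. It is not the Xtract implementation: the sample document and the helper name tag_matches are assumptions invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
DOC = """
<matches>
  <match>
    <location>York</location>
    <player goals="2">Colin</player>
    <player>Malcolm</player>
  </match>
  <match>
    <location>Leeds</location>
    <squad><player goals="1">Colin</player></squad>
  </match>
</matches>
"""
root = ET.fromstring(DOC)  # root is the <matches> element

# matches/match/player : players that are direct children of a match
direct = root.findall("match/player")

# matches/*/player : players directly inside any child element of matches
via_any = root.findall("*/player")

# matches//player : players at any depth inside matches
deep = list(root.iter("player"))

# matches/match/location/- : the plain text of each location
locations = [loc.text for loc in root.findall("match/location")]

# matches//player/@goals : the attribute value, wherever it appears
goals = [p.get("goals") for p in root.iter("player")
         if p.get("goals") is not None]


def tag_matches(pattern: str, tag: str) -> bool:
    """Sketch of the tag wildcards: '*', 'prefix*' and '*suffix'."""
    if pattern == "*":
        return True
    if pattern.endswith("*"):
        return tag.startswith(pattern[:-1])
    if pattern.startswith("*"):
        return tag.endswith(pattern[1:])
    return pattern == tag


print(len(direct), len(via_any), len(deep))  # 2 2 3
print(locations)                             # ['York', 'Leeds']
print(goals)                                 # ['2', '1']
print(tag_matches("h*", "hr"))               # True
```

Note how the player nested inside squad is found only by the recursive (double slash) query.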

Predicates

There is a notion of a predicate on an element. The square bracket notation is used:

    matches/match[player]

means all match elements which contain at least one player element. It is the match elements that are returned, not the players they contain. One can also ask for the presence of a particular attribute:

    *//player[@goals]

means those players (found anywhere within the tree) that carry a goals attribute, that is, those who scored any goals. You can compare attribute values using any of the operators = , != , < , <= , > , >= all of which use lexicographical ordering. In this example:

    */match[player/@surname!=referee/@surname]

we only want those matches in which the referee does not have the same surname as any of the players. A comparison may be either against another attribute value, or against a literal string; however a literal string may only appear to the right of the operator symbol. For instance,

    */match[player/@name='colin']

asks for only those matches in which the player named ``colin'' participated. If lexicographical comparison is inappropriate, numeric comparisons are also possible: these comparison operators are surrounded by dots: .=. , .!=. , .<. , .<=. , .>. , .>=. Again, either two attribute values are compared, or one attribute value is compared with a literal integer. For instance

    */match[@ourgoals .>. @theirgoals]

asks for the matches we won, while

    */match[@ourgoals .<=. 3]

asks for the matches in which we scored three or fewer goals. (Note that the literal integer is not surrounded by quote marks.)
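
The difference between the two operator families matters as soon as the values have different numbers of digits. The sketch below, written in Python rather than Xtract, contrasts string ordering with integer ordering. The function names lexical_gt and numeric_gt are hypothetical, and treating a non-integer value as a failed comparison is an assumption of this sketch, not documented Xtract behaviour.

```python
def lexical_gt(a: str, b: str) -> bool:
    """Roughly what '>' does: compare the two values as strings."""
    return a > b


def numeric_gt(a: str, b: str) -> bool:
    """Roughly what '.>.' does: compare the two values as integers.

    Assumption: a value that is not an integer makes the comparison false.
    """
    try:
        return int(a) > int(b)
    except ValueError:
        return False


# A match won 10-9: the attribute values are the strings "10" and "9".
ourgoals, theirgoals = "10", "9"
print(lexical_gt(ourgoals, theirgoals))  # False, since "1" sorts before "9"
print(numeric_gt(ourgoals, theirgoals))  # True, since 10 > 9
```

With the plain > operator this match would not count as a win, which is why the dotted operators exist.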

In addition to comparing attribute values, you can also compare the textual content of elements. For instance,

    */match[player/- = 'Colin']

asks for the matches in which ``Colin'' participated, where the name is recorded between the player tags, rather than as an attribute. All the same conditions and operations apply as for attribute value comparisons. Note however that you can only compare texts, not whole structures.

Combining predicates

Predicates can be combined using the common Boolean operations &, |, and ~, with parentheses for disambiguation where required:

	        
    */match[@ourgoals .=. @theirgoals | (player/@name='colin' & ~(@opposition='city'))]

means the matches which either ended in a draw, or in which ``colin'' played but the opposition was not ``city''.
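
To make the reading of the combined predicate concrete, here is a sketch of the same test written as a Python function over xml.etree.ElementTree elements. The sample data, the id attributes and the function name keep are invented for illustration. The sketch also assumes that a path comparison such as player/@name='colin' holds if any one player satisfies it, which is how the prose above reads.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the id attributes only label the results.
DOC = """
<season>
  <match id="a" ourgoals="1" theirgoals="1" opposition="city">
    <player name="malcolm"/>
  </match>
  <match id="b" ourgoals="2" theirgoals="0" opposition="united">
    <player name="colin"/>
  </match>
  <match id="c" ourgoals="3" theirgoals="1" opposition="city">
    <player name="colin"/>
  </match>
  <match id="d" ourgoals="0" theirgoals="2" opposition="rovers">
    <player name="malcolm"/>
  </match>
</season>
"""


def keep(match: ET.Element) -> bool:
    # @ourgoals .=. @theirgoals (numeric equality; the sample always
    # supplies both attributes)
    draw = int(match.get("ourgoals")) == int(match.get("theirgoals"))
    # player/@name='colin' (assumed true if any player has that name)
    colin = any(p.get("name") == "colin" for p in match.findall("player"))
    # ~(@opposition='city')
    not_city = match.get("opposition") != "city"
    # draw | (colin & ~city)
    return draw or (colin and not_city)


root = ET.fromstring(DOC)
selected = [m.get("id") for m in root.findall("match") if keep(m)]
print(selected)  # ['a', 'b']
```

Match a is kept because it was a draw, even though the opposition was city; match c is dropped because colin played but the opposition was city.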

Positional selection

The final feature of Xtract is that the square bracket notation is overloaded to allow the selection of elements by position:

    */match[3]

means the fourth match in the sequence (numbering starts at zero). You can have a series of indexes, separated by commas, and ranges are indicated with a dash. The dollar symbol means the last in the sequence. For example:

    */match[0,2-$,1]

reorders the matches to place the second one last.
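
The indexing rules can be sketched as a small Python function that applies an index list to an ordinary list. The name select_positions is hypothetical, and silently ignoring out-of-range indexes is an assumption of this sketch rather than documented Xtract behaviour.

```python
def select_positions(spec: str, items: list) -> list:
    """Apply an index list such as "0,2-$,1" to items (zero-based)."""
    last = len(items) - 1

    def index(token: str) -> int:
        token = token.strip()
        return last if token == "$" else int(token)

    result = []
    for part in spec.split(","):
        if "-" in part:
            # A range: both ends are inclusive.
            lo, hi = part.split("-")
            result.extend(items[index(lo): index(hi) + 1])
        else:
            i = index(part)
            if 0 <= i <= last:  # assumption: out-of-range selects nothing
                result.append(items[i])
    return result


matches = ["m0", "m1", "m2", "m3", "m4"]
print(select_positions("0,2-$,1", matches))  # ['m0', 'm2', 'm3', 'm4', 'm1']
print(select_positions("3", matches))        # ['m3']
```

The first call reproduces the reordering described above: the second item (index 1) ends up last.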

Complex queries

The full expression language is highly recursive, permitting you to build arbitrarily complex queries. For instance:

    */match[player/@name='colin'][5-$]/referee[@age.>=.34]

means: from the sixth onwards, of those matches in which ``colin'' was a player, select those referees who are aged 34 or older.
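
Reading such a query left to right, each bracket filters the result of everything before it. The following Python sketch spells out the four stages for the query above using xml.etree.ElementTree. The generated sample data, the id attributes and the helper names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: eight matches; colin misses match 1; the referee
# of match i is aged 40 - i.
root = ET.Element("season")
for i in range(8):
    m = ET.SubElement(root, "match", id=str(i))
    ET.SubElement(m, "player", name="malcolm" if i == 1 else "colin")
    ET.SubElement(m, "referee", age=str(40 - i))


def age_at_least(elem: ET.Element, n: int) -> bool:
    """Sketch of [@age .>=. n]; a missing or non-numeric age fails."""
    value = elem.get("age")
    return value is not None and value.isdigit() and int(value) >= n


# */match
step1 = root.findall("match")
# [player/@name='colin']
step2 = [m for m in step1
         if any(p.get("name") == "colin" for p in m.findall("player"))]
# [5-$] : from the sixth onwards (zero-based index 5)
step3 = step2[5:]
# /referee[@age .>=. 34]
step4 = [r for m in step3 for r in m.findall("referee")
         if age_at_least(r, 34)]

print([m.get("id") for m in step3])  # ['6', '7']
print([r.get("age") for r in step4])  # ['34']
```

Note that the positional selection counts only the matches that survived the first predicate, so index 5 refers to match 6 of the original sequence.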

Grammar

We give a full grammar for Xtract.

textquery  =  query                      elements
           |  `-`                        plain text
query      =  string                     tagged element
           |  string`*`                  prefix of tag
           |  `*`string                  suffix of tag
           |  `*`                        any element
           |  `(` textquery `)`          grouping
           |  query`/`textquery          direct descendant
           |  query`//`textquery         deep descendant
           |  query`/@`string            value of attribute
           |  query `+` textquery        union
           |  query`[`predicate`]`       predicates
           |  query`[`positions`]`       indexing
qa         =  textquery                  has tagged element
           |  attribute                  has attribute
predicate  =  qa                         has tagged element or attribute
           |  qa op qa                   lexical comparison of attribute values or element texts
           |  qa op `'`string`'`         lexical comparison of attribute value or element text
           |  qa op `"`string`"`         lexical comparison of attribute value or element text
           |  qa nop qa                  numeric comparison of attribute values or element texts
           |  qa nop integer             numeric comparison of attribute value or element text
           |  `(` predicate `)`          grouping
           |  predicate `&` predicate    logical and
           |  predicate `|` predicate    logical or
           |  `~` predicate              logical not
attribute  =  `@`string                  attribute of this element
           |  query`/@`string            attribute of descendant
positions  =  position { `,` positions}  comma-separated sequence
           |  position `-` position      range
position   =  integer                    positions start at zero
           |  `$`                        last element
op         =  `=`                        lexical equality
           |  `!=`                       lexical inequality
           |  `<`                        lexically less than
           |  `<=`                       lexically less than or equal
           |  `>`                        lexically greater than
           |  `>=`                       lexically greater than or equal
nop        =  `.=.`                      numeric equality
           |  `.!=.`                     numeric inequality
           |  `.<.`                      numeric less than
           |  `.<=.`                     numeric less than or equal
           |  `.>.`                      numeric greater than
           |  `.>=.`                     numeric greater than or equal