;; -*- mode: memo -*-
			Tutorial: XML toolkit
					V1.02 September 5, 2002
					Makoto Onizuka (oni@acm.org)

1. What is XML toolkit?
XML toolkit is a framework for a light-weight and high performance 
XML data stream processing. It is comprised from two fundamental 
libraries (XML TSAX parser, XPath processor) and a collection of 
simple XML processing commands (that correspond to plain text 
processing commands like cat, head, tail, sort, grep and so forth) 
built upon those libraries.

Section 2 explains the XML toolkit commands
and Section 3 explains the XML toolkit libraries and an example.


2. Let's look into the some details of XML toolkit commands.
2.1 What is merit of XML toolkit commands?
There may be many types of data operation demand for XML data 
streams, like concatenation, filtering (selection), 
re-structuring, sorting, duplicate elimination, calculating 
aggregation, and so forth. XML toolkit commands satisfy 
these kinds of demands. You can also combine any of the XML 
toolkit commands using pipe.
For example, 

  cat book.txt article.txt | grep title | sort;

can be replaced as 

  xcat book.xml article.xml | xsort -c "/bib" -e "book" -k "title/text()";

Of course, as the data is written in XML format, one can devise
more complex data operations like 'choose every first
author in every book or article'.

  xcat book.xml article.xml | xhead -c "/bib/book" -e "author" -n 1 | xrun /bib/*;

If the data is written in plain text, we need to do much more work
than this using perl or combination of sed, awk, etc.

How do we then compare XML toolkit commands with XSL syntax or 
XQuery? Although each toolkit command has only subset functions of 
XSL and XQuery, as described above, we can combine any of the toolkit 
commands, existing unix commands, and user-implemented data operation
commands using UNIX pipe to realize complex data operations. 


2.2 XML toolkit commands details
2.2.1 xcat
a)usage: xcat [-b] [-t type-expr] xmlfile1 xmlfile2 ...;
         The -h option shows the above usage message.
b)description:
  This command simply concatenates a given XML files (either file
  or stdin, either tokenized XML or plain XML) and outputs it to 
  stdout as plain XML (default) or as tokenized XML (-b option). 
  The tokenized XML (or we call "binary XML") is such a format 
  that all tags (element and attribute) is tokenized as integer.
  The output XML can be a forest, meaning that there may be several 
  root elements.
  The -t option enables an advanced tokenizer for an integer value 
  with some constant strings like "$30" or "30 dollar". 
  For example, xcat -b -t 'xpathExpr:PREFIX%iSUFFIX' tokenizes all 
  text value whose prefix is "PREFIX" and suffix is "SUFFIX" that
  are matched with the given xpathExpr. %i indicates that the value 
  except PREFIX and SUFFIX is a integer type.
c)example: 
  xcat article.xml book.xml > bib.xml;
  xcat -b article.xml book.xml > bib.xml;
  xcat -b -t '//price:$%i' article.xml book.xml > bib.xml;


2.2.2 xrun
a)usage: xrun xpathExpr xmlfile;
         "xrun" without any parameters shows the above usage message.
b)description
   This command inputs one XPath expression and one XML file (either 
   tokenized XML or plain XML) and outputs a sub-part of the input 
   XML that satisfies the given XPath expression. The result can be 
   a forest. The restriction of the XPath expression is described in 
   Section 4.
c)example:
  xrun "/" xmlfile;
  xrun "/bib/book/title" xmlfile;
  xrun "//title/text()"  xmlfile;
  xrun "//book/*/@year"  xmlfile;


2.2.3 xsort
More details are written in the PLANX paper.
http://www.cs.washington.edu/homes/suciu/XMLTK/planx.ps
a)usage: xsort [-binary | -b ] 
               [(-memory | -m) mem-size] 
               [(-type | -t) type-expr] 
               [(-context | -c) xpath-expr] 
               [(-element | -e) xpath-expr]
               [(-key | -k) xpath-expr]]] 
               [xmlfile];
         The -help, -h or -? shows the above usage message.
b)description
  This command sorts target element (specified by an element XPath
  expression) under the context element (specified by a context XPath
  expression) using the key element (specified by a key XPath expression)
  for an XML input (stdin or file). All elements under the context
  are emitted except the elements satisfied by the -e XPath expression. 
  The -b and -t options play the same role as that of the xcat. The -m 
  option determines the memory size for in-memory sorting (beyond which 
  sorting is conducted externally, i.e. on disk).
c)example:
  xsort -c "/bib" -e "book" -k "title/text()" xmlfile;
    This sorts all "book" element under "bib" element using 
    "title/text()" as key value.
  xsort -c "/bib" -e "book" -k "title" xmlfile;
    When you specify the key parameter as an element (not text()), 
    then the concatenation of all text() under the key element is
    interpreted as a key value.
  xsort -c "/bib" -e "*" -k "@year" xmlfile;
    This sorts all elements under "bib" element using "@year" as 
    a key value. The key value is always treated as string value 
    (not as integer value).
  xsort -c "/bib/book" -e "author" -k "lastname/text()"
                       -e "*" xmlfile;
    This sorts all "author" element under "book" element using 
    "lastname/text()" as key value. In addition, all elements except
    "author" under the "book" element also are outputed. In this case,
    there are two -e XPath expressions for one context element. The xsort
    tries to check the first XPath expression, and if it fails then it 
    tries to check the second XPath expression. So, the order of the -e 
    XPath expression means a precedence.
  xsort -c "/bib" -e "book"   -k "title/text()"
                  -e "article -k "@year"        xmlfile;
    This sorts all "book" element under "bib" element using 
    "title/text()" as key value, and sorts "article" element under 
    "bib" element using "@year" as key value.
  xsort -c "/bib/book"    -e "author" -k "lastname/text()"
        -c "/bib/article" -e "author" -k "firstname/text()" xmlfile;
    This sorts all "author" element under "bib/book" element using 
    "lastname/text()" as key value, and sorts "author" element under 
    "/bib/article" element using "firstname/text()" as key value.
    In this case, there are two context XPath expressions. The xsort 
    tries to check the first XPath expression, and if it fails then it 
    tries to check the second XPath expression. So, the order of the -c 
    XPath expression also means a precedence.

2.2.4 xtail
a)usage: xtail [-help | -h | -? ] 
               [-binary | -b ] 
               [(-context | -c) xpath-expr 
               [(-element | -e) xpath-expr 
               [(-number | -n) number]]] [xmlfile];
         The -help, -h or -? shows the above usage message.
b)description
  This command filters the last specified number of target element 
  (specified by an element XPath expression) under the context element 
  (specified by a context XPath expression) for an XML input (stdin or 
  file). All elements under the context are emitted except the elements 
  satisfied by the -e XPath expression. The -b option plays the same role 
  as that of the xcat. 
c)example:
  xtail -c "/bib" -e "//author" -n 2 xmlfile;
  xtail -c "/bib" -e "*" -n +1 xmlfile;
  xtail -c "/bib" -e "*" -n -1 xmlfile;

2.2.5 xstat
a)usage: xstat xmlfile;
b)description
  This command reports the number of elements, the maximum depth
  and the average depth of the input XML file. For example, if the 
  input XML contains only one root element (like <root/>), then 
  the xstat outputs:
  <stat fileName="xmlfile">
    <element count="1"/>
    <depth max="1" avg="1"/>
  </stat>
c)example:
  xstat bib.xml;

2.2.6 file2xml
a)usage: file2xml [-s start directory name]  [-f file to direct output];
b)description
  This command collects file and directory information (name, path, 
  size, permission, type, userid, gourpid, last access time, last 
  modified time) under a specified start directory and packs it as 
  XML format. For example, if "CVS" directory contains one "Root"
  file, file2xml outputs:
  <directory>
    <name>CVS</name>
    <file>
      <name>Root</name>
      <filelink xlink:type="simple"
                xlink:href="file:/localhost/homes/june/makoto/work/xmlbase/xmltoolkit/xmltk/file2xml/CVS/Root">
      </filelink>
      <path>/homes/june/makoto/work/xmlbase/xmltoolkit/xmltk/file2xml/CVS/Root</path>
      <size>33</size>
      <permissions>-rwrr----</permissions>
      <type>regular file</type>
      <userid>13750</userid>
      <groupid>330</groupid>
      <lastAccess>Wed Nov 21 11:22:33 2001</lastAccess>
      <lastModification>Wed Nov 21 11:22:23 2001</lastModification>
    </file>
  </directory>
c)example:
  file2xml -s . -f currentFile.xml;

2.2.7 createSindex
a)usage: createSindex [-t threshHold] xmlfile;
b)description
  This command creates a streaming index (SIX) for the input XML file.
  If the XML file's name is "xmlfile.xml", the index is stored at
  "xmlfile.six" in the same directory as the XML file. 
  The "-t" option specifies the threshHold (byte) for the index 
  entry deletion. 
  Whenever there is a SIX file for its XML file, the XPath processor 
  automatically uses the index to skip parsing certain part of XML. 
  There are two type skips:
[fail skip]
  When we receive a startElement SAX event and it is not what we need, 
  we want to skip the whole element and go to the nest sibling 
  element. For example, if we want to find a book element (/bib/book) 
  and receive an article startElement event, then we skip the article 
  element (Element/attribute based fail skip).
  Another example is that we want to find a book that is published at 
  1999 (/bib/book[@year='1999']) but receive a book start element 
  event that is published at 2001 (Value based fail skip).
[succeed skip]
  When we receive an endElement SAX event and we don't need to look 
  into the following siblings, we can skip them all. For example, if 
  we want to find a second book's first author  (/bib/book[2]/author[1]) 
  and receive an end element SAX event for the author element, then we 
  can skip 2 depth to go to bib endElement (Element/attribute based 
  succeed skip). 
  Another 1 depth succeed skip example is that we have a key constraints 
  on isbn of books and we found a book whose isbn matches our condition 
  (/bib/book[isbn='XX123'], value based succeed skip). 

  Some future work on the SIX is to enable skip parsing using 
  key constraints, a schema specifying a order of sibling elements 
  (e.g. DTD content model like (A, B*, C)), sorted elements, and 
  existential filter condition.

c)example:
  createSindex bib.xml;
    This creates a full streaming index (bib.six) for the bib.xml. 
  createSindex -t 100 bib.xml;
    This creates a partial streaming index (bib.six) for the bib.xml
    whose entry is only for elements larger than 100 byte.

3. Let's look into the some details of XML toolkit libraries.
3.1 What is functionality?
There are two libraries: one is a tokenized XML parser, and
the other is XPath processor. 

The tokenized XML parser converts every element tag or attribute 
into an unique integer value (we implement as XTOKEN) and calls
tokenized SAX events. For example, startElement('bib') can
be tokenized as startElement(5) and the value '5' corresponds to 
'bib'. We extends the Xmill's XML parser to implement our tokenized 
XML parser, as it parses XML data stream around ten times faster 
than the xerces XML parser. But it just supports ascii character code 
(it doesn't support unicode) and does not have DTD validation.

The XPath processor combined with the above tokenized XML parser
calls only such SAX events that satisfy a given collection of 
XPath expressions. In other words, it filters out any SAX events 
that does not satisfy any given XPath expressions. This is effective
for such a case that you want to operate some particular part of the 
input XML data stream (not whole of the XML data).

For example, if you want to make a title list of bibliography, then
you need to register two XPath expression '/bib/book/title' as
$x1 and '/bib/article/title' as $x2 (or simply one XPath expression
'//title' or '/bib/*/title'). If the input XML data stream is 

<bib>
  <book>
    <title>Data on the web</title>
    <author>Serge Abiteboul</title>
    <author>Peter Buneman</title>
    <author>Dan Suciu</title>
  </book>
  <article>
    <title>Processing XML Streams with Deterministic ...</title>
    <author>Todd J Green</title>
    <author>Makoto Onizuka</title>
    <author>Dan Suciu</title>
  </article>
</bib>

then the XPath processor calls only the following call-back events:
  startContext($1)               // for '/bib/book/title'
    startElement('title')
    character('Data on the web') 
    endElement('title')
  endContext($1)
  startContext($2)               // for '/bib/article/title'
    startElement('title')
    character('Processing XML Streams with Deterministic ...') 
    endElement('title')
  endContext($2)

You can find there are additional events 'startContext' and 
'endContext'. These enable the applications to know which XPath
expression satisfies which SAX events.

3.2 How does one develop a new application using XML toolkit library?
If you would like to build a new application or command using XML 
toolkit library, you need to understand:
1) com-lite interface
2) main functions
3) handler functions (including SAX events)
4) library functions

3.2.1 com-lite interface
TJ wrote this part already. See 
URL http://www.cs.washington.edu/homes/suciu/XMLTK/comlite.html
The concept is that it utilizes the factory class design pattern
and smart pointer.

3.2.2 main functions
Here, I explain library functions using the xrun example.

#include "xmltk.h"

int main(int argc, char* argv[]){
  if (argc != 3){
	cout << "Usage: " << argv[0] << " xpathExpr xmlfile" << endl;
	return 1;
  }
    // initialize token table that convert string
    //  (element, attribute, extented integer) to XTOKEN
  InitGlobalTokenTable();
    //
    // before the parsing
    //
  try {
	IDfaFilter *g_pfilter = NULL;
    // create XML parser and XPath processor object
    // 1st parameter: Interface ID
    // 2nd parameter: result
	CreateDfaFilter(&IID_IDfaFilter, (void**)&g_pfilter);
    // register a given XPath to the XPath processor.
    // 1st parameter: Variable *, a variable for a parent XPath expression
    // 2nd parameter: char *, XPath expression
    // 3rd parameter: book, echo toggle (described at Section 3.4.1)
    // The XPath processor manages the construction/destruction of
    // the returned Variable *.
	Variable * g_pvRoot = g_pfilter->RegisterQuery(NULL, argv[1], true);

    // create an input file stream object
    // 1st parameter: char *, file path
        IFileStream *pstm = _CreateFileStream(argv[2]);
    // create an output file stream object
    // 1st parameter: bool, binary format or plain format
	ITSAXContentHandler *pchOut = CreateStdoutStream(false);
    // create a handler that implements call-back functions
    // 1st parameter: ITSAXContentHandler *
	myHandler *handler = new myHandler(pchOut);
    // register hander to XPath processor
    // 1st parameter: IDfaFilter *
    // 2nd parameter: user class derived from IFilteredTSAXHandler
	IUnknownCL_SetHandler(g_pfilter, handler);
    //
    // parse (call-back functions are invoked from the parse)
    // 1st parameter: IFileStream *
    // 2nd parameter: IDfaFilter *
	ParseUnknownStream(pstm, g_pfilter);
    //
    // after the parsing
    // ATOMICRELEASE deletes the parameter and set it as NULL;
	ATOMICRELEASE(pchOut);
	ATOMICRELEASE(pstm);
	ATOMICRELEASE(g_pfilter);
  }
  catch (_Error &err) {
    // print an error message
	err.perror();
  }
    // cleanup the token table
  CleanupGlobalTokenTable();
  return 0;
}

3.2.3 Hanlder functions
Similar to handling the SAX events, you need to implement your 
handler to receive events from the XPath processor (or XML parser)
using ITSAXContentHandler or IFilteredTSAXHandler interface.
ITSAXContentHandler interface defines SAX events and 
IFilteredTSAXHandler interface defines both SAX events and context 
events (startContext and endContext). 

class myHandler : public IFilteredTSAXHandler
{
public:
    //
    // *** IUnknownCL methods ***
    // CL_STDMETHOD is a macro to define a bool method 
    // CL_STDMETHOD_ is a macro to define a method whose return 
    // type is specified by the first parameter
    //
    CL_STDMETHOD(QueryInterface) (RCLIID riid, void **ppvObj);
    CL_STDMETHOD_(ULONG,AddRef) ();
    CL_STDMETHOD_(ULONG,Release) ();

    //    
    // *** ITSAXContentHandler methods ***
    //
    CL_STDMETHOD(startDocument) ();
    CL_STDMETHOD(endDocument) ();
    CL_STDMETHOD(startElement) (XTOKEN xtName);
    CL_STDMETHOD(endElement) (XTOKEN xtName);
    CL_STDMETHOD(attribute) (XTOKEN xtName, char *pszChars, int cchChars);
    CL_STDMETHOD(characters) (char *pszChars, int cchChars);
    CL_STDMETHOD(cdata) (char *pszChars, int cchChars);
    CL_STDMETHOD(extendedint) (XTOKEN xt, int iInt);

    //
    // *** IFilteredTSAXHandler methods ***
    //
    CL_STDMETHOD(startContext) (Variable *pv);
    CL_STDMETHOD(endContext) (Variable *pv);

    myHandler(ITSAXContentHandler *pchOut) {
	  m_pchOut = pchOut;
	  m_pchOut->AddRef();
	}
    virtual ~myHandler(){
	  ATOMICRELEASE(m_pchOut); 
	}

    ITSAXContentHandler *m_pchOut;
    ULONG m_cRef;
};

//
// *** IUnknown methods ***
//
// RCLIID is a static value that indicates a class ID
bool myHandler::QueryInterface(RCLIID riid, void **ppvObj)
{
    if (IsEqualCLIID(riid, &IID_IUnknownCL) ||
        IsEqualCLIID(riid, &IID_ITSAXContentHandler) ||
        IsEqualCLIID(riid, &IID_IFilteredTSAXHandler)){
        *ppvObj = static_cast<IFilteredTSAXHandler*>(this);
    }
    else {
        *ppvObj = NULL;
        return false;
    }
    AddRef();
    return true;
}

ULONG myHandler::AddRef(){
    return ++m_cRef;
}

ULONG myHandler::Release(){
    if (--m_cRef > 0)
    {
        return m_cRef;
    }
    delete this;
    return 0;
}

You need follow the com-lite style because the XML Toolkit class
library is defined in com-lite style and you need to define your
own class as a sub-class of those library classes. Thus, you 
need to implement all IUnknownCL methods in the way like above. 
You can implement anything you like for every SAX call-back function, 
ITSAXContentHandler and IFilteredTSAXHandler methods.
You can also implement anything you like on the constructer and
destructor of your handler. You may want to initialize 
your own data member, or may need to free your data member memory 
space.

3.2.4 library functions
a) g_ptt: g_ptt is a global variable points to a global token table.
   char * StrFromXTOKEN(XTOKEN token);
     returns the original string of the "token".
   XTOKEN XTOKENFromStr(char * s, XST_TYPE);
     XST_TYPE=: (XST_ELEMENT | XST_ATTRIBUTE | XST_EXTENTEDINT)
     returns a token of the "s" as type "XST_TYPE".
     The XST_TYPE is eigther XST_ELEMENT, XST_ATTRIBUTE, or XST_EXTENTEDINT.
b) Variable class: The Variable instance is returned by
   the IDfaFilter::RegisterQuery().
   Variable * getParent(void);
     returns a variable for the parent XPath expression.
   void	      setUserSpace(void * s);
     You can set any pointer to your own memory space.
   void*      getUserSpace(void);
     returns the pointer to the memory space.

3.3 Advanced feature of the tokenized XML processor
3.3.1 Input
The input XML data can be either tokenized XML or ordinary plain 
text XML. The tokenized XML (or we call "binary XML") is such a 
format that all tags (element and attribute) is tokenized as integer.

3.3.2 Call-back event
In addition to the same as SAX events (startDocument, endDocument, 
startElement, endElement, characters, cdata), there are additional
and modified events. 

a) characters(char *pszChars, int cchChars);
   cdata(char *pszChars, int cchChars);
   Not like ordinary XML SAX parser, the pszChars is a pointer to 
   the character data WITHOUT STRING TERMINATOR. So you need to 
   know the length of the string. The ccChars tells you the length
   of the string.
b) attribute(XTOKEN xtName, char *pszChars, int cchChars);
   Not like the ordinally SAX event interface, an attribute is
   independent from its element and "attribute" event is invoked for 
   each attribute. 
c) extendedint(XTOKEN xt, int iInt);
   This event is a special case for character SAX event. If the user
   registers some extended integer using "RegisterType" method with
   a parameter in 'xpathExpr:PREFIX%iSUFFIX' format, then an 
   extendedint event is invoked instead of a character event.
d) context events: startContext(Variable *pv), endContext(Variable *pv)
   As decribed in Section 3.1, these enable the applications to know 
   which XPath expression satisfies which SAX events. The Variable *
   is returned when the user register an XPath expression. 

3.4 Advanced feature of the XPath processor
3.4.1 Echo toggle (ON/OFF)
   When the user register XPath expression, he/she needs to specify 
   an echo toggle (ON/OFF) by the third parameter of "RegisterQuery"
   method. 
     RegisterQuery(Variable * parentVariable,
                   const char * xpath,
		   bool echoToggle);
   If the toggle is ON (true), then both the context events and all 
   the SAX events during the given XPath expression is satisfied are 
   call-backed (as described in Section 3.1). 
   If the toggle is OFF (false), then the only context events are 
   invoked. None of SAX events are invoked.
3.4.2 Echo overwrite
   The echo toggle of an XPath expression can be overwritten by 
   its child XPath expression. For example, if the user want to 
   output all element except "title" element, then he/she realizes
   it as follows:
     Variable * v1 = g_pfilter->RegisterQuery(NULL, "/", true);
     Variable * v2 = g_pfilter->RegisterQuery(v1, "//title", false);
   This directs all SAX events invoked under "/" XPath expression 
   unless "//title" XPath expression is satisfied. 
3.4.3 Precedence
   Like xsort examples, the user sometimes need to use some
   conditional procedure:
     IF ... THEN ... 
     ELSE IF ... THEN ...
     ....
     ELSE ...
   The XPath processor supports this functionality by introducing
   a precedence in XPath expressions. 
     RegisterQuery(Variable * parentVariable,
                   const char * xpath,
		   bool echoToggle
		   float precedence);
   The forth parameter of the "RegisterQuery" method indicates the
   precedence. When several XPath expressions become satisfied, the
   XPath processor only invokes startContext events with the highest
   precedence. Precisely to say, the XPath processor calculates the
   minimum precedence in the satisfied XPath expressions, and locates
   and invokes startContext event for variables with the minimum 
   precedence. 


4. Restrictions
4.1 The predicate expressions
== supported ==
1. The predicate needs to be written with an attribute or value.
   If the predicate is binary, it has to be [attribute operator value]
   order. 
   e.g. //book[@year]/title, //book[@year=1999]/title
2. The position predicate can be used at any location step.
   e.g. /bib/book[2]/author[1]
3. Predicates can be concatinated.
   e.g. //book[2][@year=1999]/title[contains(text(),'XML')/text()

== not supported (restrictions) ==
4. The tail location step can not have the predicate written in 
   the above, because the lazyDFA does not support output XML 
   buffering.
   n.g. //book[@year]
   o.k. //book[@year]/title
5. DoubleSlash with position predicate is not supported.
   n.g. /bib/book//name[2]
   "book" has to control the increment/reset the
   number of "name".
6. Position predicate following another predicate is not supported.
   n.g. //book[@year=1998][2]
   o.k. //book[2][@year=1998]
   "@year" has to control the increment and
   "book" has to control the reset.
7. Element or "." is not supported in the predicate.
   n.g. //book[contains(.,'XML')]/title
8. Such XPath expressions that uses both "//" and any predicate does 
   not work correctly. The lazyDFA does not output error message in
   this case.
   Precisely saying, such XPath expressions does not work correctly if 
   it contains some predicate and it matches with prural XML fragment 
   simaltaniously. For example, //a/a[1] matches two XML fragment in 
   the following XML. 
<a>                  <-- 1st context starts here
  <a>                <-- 1st context completed here.
		              simaltaniously, 2nd context starts here.
    <c>123</c> 
    <a> 
      <c>123</c> 
      <d>456</d> 
    </a> 
    <d>456</d> 
  </a> 
</a> 
