Calligra/Libs/KoXmlReader

KoXmlReader is a collection of classes for working with DOM: KoXmlNode, KoXmlElement, KoXmlDocument, etc. These classes are designed to have an API similar to that of QDom, e.g. KoXmlNode closely mirrors QDomNode.

The main differences between KoXmlReader and QDom are:

KoXml is read-only
KoXml has a much lower memory footprint

KoXmlReader was created because QDom is very inefficient when handling large XML documents. For example, loading a test OpenDocument spreadsheet (a 627 KB file containing a 19 MB content.xml) using QDom causes up to 250 MB of heap allocation.

KoXmlReader can be set to just use QDom, i.e. KoXmlNode then becomes just a typedef for QDomNode. This is useful only to ease porting of existing code.
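As a rough illustration of this switch, the sketch below shows how a single macro can select between an alias and a separate class. FakeQDomNode and the class in the #else branch are hypothetical stand-ins so that the snippet builds without Qt; the real declarations in KoXmlReader.h will differ.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for QDomNode, so the sketch builds without Qt.
struct FakeQDomNode {
    std::string name;
    std::string nodeName() const { return name; }
};

// While this is defined, the KoXml name is only an alias.
// Comment it out to select the other branch.
#define KOXML_USE_QDOM

#ifdef KOXML_USE_QDOM
typedef FakeQDomNode KoXmlNode;
#else
// Hypothetical placeholder for a separate, read-only implementation.
class KoXmlNode {
public:
    std::string nodeName() const { return name_; }
private:
    std::string name_;
};
#endif

// Code written against the shared subset of the API builds in both modes.
inline std::string describe(const KoXmlNode& node) {
    return "node:" + node.nodeName();
}

// True when KoXmlNode is merely another name for the QDom stand-in.
inline bool usesQDomAlias() {
    return std::is_same<KoXmlNode, FakeQDomNode>::value;
}

// Usage: the same call site works whichever branch is active.
inline void example() {
    KoXmlNode node = KoXmlNode();
    assert(describe(node) == "node:");
}
```

Because describe() only touches the part of the API that both branches share, it needs no changes when the macro is toggled. That is the property the porting steps rely on.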

You can find KoXmlReader in libs/store/store/KoXmlReader.{h,cpp}. The automated test is in libs/store/tests/koxmlreadertest.cpp.

Porting to KoXml

Existing code which uses QDom can be ported to use KoXml. Here are the typical steps:

First, make sure KOXML_USE_QDOM is defined (not commented out) in KoXmlReader.h

Replace QDomNode with KoXmlNode, QDomElement with KoXmlElement, QDomDocument with KoXmlDocument, and so on

Include the header file KoXmlReader.h wherever it is needed

Rebuild your code

Now, comment out KOXML_USE_QDOM in KoXmlReader.h.

Compile again

The last step is not always smooth. Here are some tricks:

For QDom features not supported in KoXmlReader, use KOXML_USE_QDOM to isolate the code.

To get the number of child nodes, use KoXml::childNodesCount().

If you need to look up or iterate over the attributes of a node, use KoXml::attributeNames() to get the list of attribute names, then fetch the values one by one using KoXmlElement::attribute().

Minimizing Heap Allocation

To take full advantage of KoXml's memory efficiency, you can use the KoXml::unload function. This function unloads the specified node and its children from memory. They are not lost or removed; they are kept in a compact form which minimizes memory consumption. KoXmlReader will automatically load the node and/or its children again whenever necessary.

Typically, unload is used at the end of node iteration. For example (in KSpread), there is no need to access the nodes/elements associated with the first sheet once we are loading the second sheet.

Be careful, though. Do not call unload too aggressively, because repeatedly unloading and reloading nodes causes too much overhead. When in doubt, use Valgrind with Massif to profile memory usage.

Here is an example:

  KoXmlElement bodyElement;
  forEachElement( bodyElement, contentElement )
  {
    // <office:spreadsheet>
    KoXmlElement spreadsheetElement = bodyElement.firstChild().toElement();

    // now we visit every sheet, stopping when there is no further sibling
    KoXmlElement tableElement = spreadsheetElement.firstChild().toElement();
    while( !tableElement.isNull() )
    {
      // <table:table>
      if( tableElement.localName() == "table" )
      {
        KoXmlElement rowElement = tableElement.firstChild().toElement();
        while( !rowElement.isNull() )
        {
          // <table:table-row>
          if( rowElement.localName() == "table-row" )
          {
            KoXmlElement cellElement = rowElement.firstChild().toElement();
            while( !cellElement.isNull() )
            {
              // do something with cellElement

              cellElement = cellElement.nextSibling().toElement();
            }
          }

          // this row is finished, so its children can be unloaded
          KoXml::unload( rowElement );
          rowElement = rowElement.nextSibling().toElement();
        }
      }

      // this sheet is finished, so its rows can be unloaded
      KoXml::unload( tableElement );
      tableElement = tableElement.nextSibling().toElement();
    }

    KoXml::unload( spreadsheetElement );
  }
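
To make the unload-and-reload behavior described in this section more concrete, here is a toy model in plain C++. It is not KoXmlReader code: LazyNode and its members are hypothetical names, the children are plain strings, and the "compact form" is simply the child names joined by newlines (so names are assumed not to contain a newline).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy model only: a node whose children can be packed away and restored.
class LazyNode {
public:
    explicit LazyNode(std::vector<std::string> childNames)
        : children_(std::move(childNames)), loaded_(true) {}

    // Drop the expanded children but keep a compact copy of them.
    void unload() {
        if (!loaded_) return;
        packed_.clear();
        for (const std::string& name : children_) {
            packed_ += name;
            packed_ += '\n';
        }
        std::vector<std::string>().swap(children_); // release the storage
        loaded_ = false;
    }

    // Accessing the children reloads them transparently when needed.
    const std::vector<std::string>& children() {
        if (!loaded_) reload();
        return children_;
    }

    bool isLoaded() const { return loaded_; }

private:
    // Rebuild the child list from the packed form.
    void reload() {
        std::size_t start = 0;
        for (std::size_t end = packed_.find('\n'); end != std::string::npos;
             end = packed_.find('\n', start)) {
            children_.push_back(packed_.substr(start, end - start));
            start = end + 1;
        }
        std::string().swap(packed_);
        assert(children_.size() <= start); // every child consumed at least one byte
        loaded_ = true;
    }

    std::vector<std::string> children_;
    std::string packed_;
    bool loaded_;
};
```

In this model every unload() followed by a later access pays for one pack and one unpack, which is why unloading inside a tight loop over nodes that are still in use costs more than it saves.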

Caveats

KoXml is a read-only DOM. This means you cannot modify the DOM tree; you can only load a DOM from an XML source. Normally this is not a problem, since there is no need to modify anything while loading application data from the XML source.

KoXml classes are not a fully compliant DOM implementation. They are designed only for certain use cases, among others in KOffice applications. Thus, do not expect the same functionality as in QDom, e.g. not all node types are implemented.

KoXml is slower than QDom if the XML document is very small, so there is nothing to gain by using it in that case.

Implementation

Some tricks used to implement KoXmlReader:

Common strings, e.g. namespaces, are cached. Since every element in the OpenDocument format has an associated namespace, this saves memory by avoiding duplicated namespace strings.
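
The caching idea can be sketched as a small string table. This is only an illustration of the technique under assumed names (StringCache, intern, lookup), not the data structure KoXmlReader actually uses. Each node would keep a small integer id instead of its own copy of the namespace string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: maps each distinct string to a small integer id.
class StringCache {
public:
    // Return the id of the string, adding it to the table on first use.
    std::size_t intern(const std::string& text) {
        auto found = ids_.find(text);
        if (found != ids_.end()) return found->second;
        const std::size_t id = strings_.size();
        strings_.push_back(text);
        ids_.emplace(text, id);
        return id;
    }

    // Recover the string that an id stands for.
    const std::string& lookup(std::size_t id) const {
        assert(id < strings_.size());
        return strings_[id];
    }

    // Number of distinct strings seen so far.
    std::size_t uniqueCount() const { return strings_.size(); }

private:
    std::vector<std::string> strings_;                  // id -> string
    std::unordered_map<std::string, std::size_t> ids_;  // string -> id
};
```

With such a table, a sheet containing thousands of cells in the same namespace stores that namespace string in the table once, and each element carries only the id.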

Nodes are packed efficiently. Nodes are compacted when not needed and recreated again only when necessary.

Node compaction uses very fast compression and decompression. When loading a large document, the overhead of compressing and decompressing is offset by the much smaller amount of heap that has to be allocated.
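
To show the kind of round trip involved, the sketch below uses byte-oriented run-length encoding as a stand-in codec. This choice is an assumption made purely for brevity; it is not the compression scheme KoXmlReader uses, and the function names are hypothetical. The point is only that packed data is smaller for repetitive input and can be restored exactly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in codec: the output is a sequence of (count, byte) pairs,
// where count is in the range 1..255.
inline std::string rleCompress(const std::string& input) {
    std::string out;
    std::size_t i = 0;
    while (i < input.size()) {
        std::size_t run = 1;
        while (i + run < input.size() && input[i + run] == input[i] && run < 255) {
            ++run;
        }
        out += static_cast<char>(static_cast<unsigned char>(run));
        out += input[i];
        i += run;
    }
    return out;
}

// Inverse of rleCompress: expand each (count, byte) pair.
inline std::string rleDecompress(const std::string& packed) {
    assert(packed.size() % 2 == 0);
    std::string out;
    for (std::size_t i = 0; i + 1 < packed.size(); i += 2) {
        const std::size_t run = static_cast<unsigned char>(packed[i]);
        out.append(run, packed[i + 1]);
    }
    return out;
}
```

Run-length encoding only pays off on input with long runs of identical bytes, so a real implementation would use a general-purpose fast codec instead. The time spent in the codec is the overhead that the memory savings have to justify.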

Miscellaneous

For some colorful graphs and other explanations, see: