Calligra/KOffice and ODF

From KDE Community Wiki

This article is taken from a talk given by David Faure at the KOffice ODF meeting in Berlin. We start with a basic introduction to the OpenDocument Format, ODF, and then take a look at the existing implementation of ODF support in KOffice. With KOffice 2 in rapid development, some items are likely to change, but most of the implementation described here should remain the same. Look out for the TODOs!

What is ODF?

So, without further ado, let's get started. OpenDocument Format is an open, XML-based format for general office documents. It supports text documents, presentations, spreadsheets, and other common types of documents. The format was defined by the OASIS organization, with input from OpenOffice.org, KOffice, and other interested parties; ODF even became an ISO standard (ISO/IEC 26300)! The format grew out of the OpenOffice.org 1.x format, with cleanups to the style, and additions to allow compatibility between different applications.

ODF concepts

Let's take a look at the basics of an ODF document. ODF is a detailed and fairly complicated format - the full specification runs to over 700 pages - but we'll just cover the most important aspects here. For more details, take a look at the OpenDocument Essentials book by J. David Eisenberg at http://books.evc-cit.info/ or, if you're feeling particularly keen, the full specification at http://www.oasis-open.org/specs/index.php#opendocumentv1.1 .

Before we start with the details, there are a few principles behind the ODF format. Two important principles are the use of XML, and content/presentation separation. ODF uses the XML format throughout, which means that reading and writing ODF files, as well as checking their validity, can be done with standard tools and libraries. If you don't know any XML, it's worth reading up on the basics on one of the many free primers available on the web.

Separation of presentation and content is another important principle for ODF documents - although it increases the complexity of working with ODF slightly, there's a net win with the variety of features it opens up. For example, the outline of this article was produced straight from an ODF presentation using a simple KOffice filter.

A document is a ZIP

To support the wide range of features needed by office applications and documents, an ODF file is actually a collection of files, tied together in a ZIP archive. So you can unpack an ODF document using the standard unzip command, like this:

# unzip -l test_file.odt

 Length     Date   Name
--------    ----   ----
      39  12-05-07 mimetype
    3465  12-05-07 content.xml
   18755  12-05-07 styles.xml
    1174  12-05-07 meta.xml
    8125  12-05-07 settings.xml
     490  12-05-07 Thumbnails/thumbnail.png
    1866  12-05-07 META-INF/manifest.xml

We'll discuss the contents of some of these files in more detail later, but if you're interested in more details, see the OpenDocument Essentials book.
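
Because the container is an ordinary ZIP archive, a program can often recognise an ODF file from its first few bytes, without a ZIP library. The sketch below is only an illustration: it assumes the package follows the ODF packaging convention that mimetype is the first entry, stored uncompressed and without an extra field surprise, and the helper names readLe and odfMimeType are made up for this example (they are not part of KOffice).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Read a little-endian unsigned value of 'bytes' bytes (at most 4) at 'offset'.
static std::uint32_t readLe(const std::string &data, std::size_t offset, int bytes)
{
    assert(bytes >= 1 && bytes <= 4);
    std::uint32_t value = 0;
    for (int i = bytes - 1; i >= 0; --i)
        value = (value << 8) | static_cast<unsigned char>(data[offset + i]);
    return value;
}

// Hypothetical helper: return the content of a leading, uncompressed
// "mimetype" entry, or an empty string if the data does not look like that.
std::string odfMimeType(const std::string &data)
{
    const std::size_t headerSize = 30; // fixed part of a ZIP local file header
    if (data.size() < headerSize || data.compare(0, 4, "PK\x03\x04") != 0)
        return std::string();

    const std::uint32_t method = readLe(data, 8, 2);       // 0 means "stored"
    const std::uint32_t storedSize = readLe(data, 18, 4);  // compressed size
    const std::uint32_t nameLength = readLe(data, 26, 2);
    const std::uint32_t extraLength = readLe(data, 28, 2);

    if (method != 0 || data.compare(headerSize, nameLength, "mimetype") != 0)
        return std::string();

    const std::size_t start = headerSize + nameLength + extraLength;
    if (data.size() < start + storedSize)
        return std::string();
    return data.substr(start, storedSize);
}
```

For the test_file.odt listed above, the 39-byte mimetype entry would be expected to hold application/vnd.oasis.opendocument.text, which is exactly 39 characters long.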

A handy tip: since most ODF files are created and read by computers, not people, they often lack newlines, indentation, etc. You can use David Faure's oasis_unpack script (http://www.koffice.org/developer/fileformat/oasis_unpack) to unzip an ODF document and reformat the XML to make it more readable.

Content vs. style

Here's a simple excerpt from an ODF document, showing how content and style are separated:

<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
  <style:paragraph-properties fo:text-align="center"/>
</style:style>
<office:text>
  <text:p text:style-name="P1">Hello</text:p>
</office:text>

The snippet simply produces a centered paragraph of text with the word "Hello". Easy!

Style families and properties

In the previous example, we applied a style to a paragraph. But of course, in an office document, there are other things we can apply styles to: text (i.e. individual words, characters, etc.), graphics, tables, and so on. Each of these is represented in ODF by a style family. Each style family is associated with a set of properties: for example, style:family="text" is associated with a <style:text-properties> child element. Beyond this, some style families can contain several sets of properties - a table cell has its own table cell properties (border, ...), as well as properties for the paragraph it contains (alignment, ...) and properties for the text in the paragraph (font, size, ...).

For the full list of which properties go with which style families, the easiest place to look is in the RelaxNG schema (on which, more later).

Named styles

ODF supports two types of style: named styles and automatic styles, to deal with the way office applications are used. Named styles are:

  • Visible in the application
  • Created by the user (or by a template file)
  • Stored in styles.xml in the ODF file

So named styles are the explicit styles you, as a user, set for standard headings, paragraphs, etc.

Here's an example of a named style:

<style:style style:name="Head1"
    style:display-name="Heading"
    style:family="paragraph"
    style:parent-style-name="Standard">
  <style:paragraph-properties fo:margin-bottom="0.1cm"/>
  <style:text-properties
    style:font-name="Bitstream Vera Sans"
    fo:font-size="14pt"/>
</style:style>

Let's notice a few things:

  • style:name is the internal name of the style. It's simple, with no spaces, and so on. When the style is applied to any item in the document, this internal name is used.
  • style:display-name, on the other hand, is the name shown to the user. It can contain spaces, etc.
  • The style inherits the "Standard" style using style:parent-style-name . This means we don't have to set every possible attribute of the style here - any we don't set are found from the "Standard" style.
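
To make the inheritance rule concrete, here is a minimal sketch of how a property could be resolved through style:parent-style-name. The Style struct, StyleTable and resolveProperty are invented for this illustration and are not KOffice classes; real ODF styles also group properties by type, which this sketch ignores.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative model of a named style (not a KOffice class).
struct Style {
    std::string parentName;                         // style:parent-style-name, empty if none
    std::map<std::string, std::string> properties;  // e.g. "fo:font-size" -> "14pt"
};

// Internal style name (style:name) -> style definition.
using StyleTable = std::map<std::string, Style>;

// Look a property up on a style, falling back to its parents.
// Returns an empty string if no style in the chain sets it.
std::string resolveProperty(const StyleTable &styles, std::string styleName,
                            const std::string &property)
{
    // Bound the walk by the table size so an accidental parent cycle terminates.
    for (std::size_t depth = 0; depth <= styles.size() && !styleName.empty(); ++depth) {
        const auto it = styles.find(styleName);
        if (it == styles.end())
            break;
        const auto prop = it->second.properties.find(property);
        if (prop != it->second.properties.end())
            return prop->second;
        styleName = it->second.parentName;  // not set here: ask the parent
    }
    return std::string();
}
```

With a "Head1" style that only sets fo:font-size and has "Standard" as its parent, asking for fo:font-size is answered by "Head1" itself, while any other property is answered by "Standard".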

Automatic styles

Automatic styles, on the other hand, store formatting information that is used for individual items in the document - styles for which no external name is needed, because they aren't presented to the user. Automatic styles:

  • Save formatting information
  • Are stored in content.xml except when needed in styles.xml (e.g. for headers and footers).

Automatic styles live in the <office:automatic-styles> element, as shown in this example, which also features a page-layout style, as an example of a style that is not defined by a style:style element:

<office:automatic-styles>
 <style:style style:name="P1">...</style:style>
 <style:page-layout style:name="pagelay1">
  <style:page-layout-properties fo:page-width="8.5in".../>
 </style:page-layout>
</office:automatic-styles>
<office:master-styles>
  <style:master-page style:name="Standard" style:page-layout-name="pagelay1"/>
</office:master-styles>

RelaxNG

The ODF format is defined in a machine-readable way by a RelaxNG schema. This defines which elements can appear, and where; how many elements can appear; their content types, and so on. If you're familiar with DTDs, you'll understand the idea, although RelaxNG is much more advanced.

Here's a snippet from the ODF RelaxNG schema:

<define name="text-content">
   <choice>
       <ref name="text-h"/>
       <ref name="text-p"/> [...]
   </choice>
</define>
<define name="text-p">
   <element name="text:p">
       <ref name="paragraph-attrs"/>
       <zeroOrMore>
           <ref name="paragraph-content"/>
       </zeroOrMore>
   </element>
</define>

Validation

In order to validate ODF documents, we use a tool called jing, and a few scripts around it. See http://www.koffice.org/developer/fileformat/validate.php

  • oasislint checks a single XML file previously extracted from a document. For instance:

$ oasislint content.xml
content.xml:28:50: error: attribute "foo" not allowed at this point; ignored

  • oasisfilecheck checks an entire OpenDocument file, by extracting it into a temporary directory and then calling oasislint on each relevant XML file.

$ oasisfilecheck test.odt
Extracting...
Checking content.xml...
Checking styles.xml...
[....]

The script currently (Feb 2012) used is validateODF.py, in the folder calligra/tools/scripts.

$ validateODF.py test.odt
content.xml:28:50: error: attribute "foo" not allowed at this point; ignored

You can also validate ODF documents using the online validator at http://opendocumentfellowship.com/validator.

Embedding

Embedded documents in an ODF file are saved in a subdirectory inside the ZIP file. Including an embedded document requires two things. First, a link to the document in the content.xml file, like so:

<draw:frame draw:name="obj1" draw:style-name="fr1"
   svg:width="362pt" svg:height="346pt">
   <draw:object [...] xlink:href="./Object 1"/>
</draw:frame>

Secondly, an entry in the META-INF/manifest.xml file, like this:

<manifest:file-entry manifest:media-type="text/xml"
   manifest:full-path="content.xml"/> 
   [...]
<manifest:file-entry
   manifest:media-type="application/vnd.oasis.opendocument.spreadsheet"
   manifest:full-path="Object 1/"/>

ODF in KOffice

Now that we've looked at some basics of how ODF works, we'll discuss how ODF support is implemented in KOffice. We'll look at the low-level classes that deal with the ZIP file itself and at the issues that arise when writing XML, and take a detour into XML namespaces. We'll see how loading and saving ODF is done, and how styles are handled.

KoStore API

The KoStore class talks to the backend file - for ODF, that means a ZIP file, but tar.gz and directory backends are also supported for other formats. In KoStore terminology, a 'store' is the file on disk that the user sees (e.g. somefile.odt), while a 'file' is a file taken out of that archive (e.g. content.xml).

  • Currently the static KoStore::createStore method creates a KoStore object and returns a pointer to it, although this is subject to change in KOffice 2.
  • The open() and close() methods read one file at a time from the store, while the read, write, pos, seek and atEnd methods provide low level access to the store.
  • device() returns the QIODevice used internally for reading. The open() call may be extended to also return this device.

While KoStore is all you need for reading, a further class is currently required for writing: KoStoreDevice wraps around KoStore, and provides the relevant methods for writing to an ODF file. Initialize it with, for example: KoStoreDevice dev(store);

Writing XML

Writing the ODF XML to file is done by KoXmlWriter: it's simple and stupid, and therefore fast :). KoXmlWriter uses QIODevice internally. All parameters are passed as const char *, which means no memory copies, so it remains fast. The methods are mostly self-explanatory.

writer.startElement("draw:frame");
writer.addAttribute("draw:id", "foo");
writer.addAttributePt("svg:x", x); // Write the value x, which is a measure in points
// add nested elements here
writer.endElement(); // draw:frame

Notice that writer.endElement knows what element it's ending, so you don't need to tell it. However, it's common practice to put the element name in a comment to help you (and others reading your code!) to keep track.
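
The reason endElement() needs no argument is that a writer of this kind can keep a stack of the elements it has opened. The following toy class sketches that idea only; TinyXmlWriter is a made-up name, it is not KoXmlWriter, and it does no escaping or indentation.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Toy streaming XML writer that remembers which elements are open.
class TinyXmlWriter {
public:
    void startElement(const std::string &name) {
        closeStartTagIfOpen();       // the parent now has a child, so finish its start tag
        out_ << '<' << name;
        open_.push_back(name);
        startTagOpen_ = true;
    }

    // Only valid directly after startElement(); values are written unescaped.
    void addAttribute(const std::string &name, const std::string &value) {
        assert(startTagOpen_);
        out_ << ' ' << name << "=\"" << value << '"';
    }

    // Closes the most recently opened element, whatever it was.
    void endElement() {
        assert(!open_.empty());
        if (startTagOpen_) {
            out_ << "/>";            // no children were written: use the empty-element form
            startTagOpen_ = false;
        } else {
            out_ << "</" << open_.back() << '>';
        }
        open_.pop_back();
    }

    std::string str() const { return out_.str(); }

private:
    void closeStartTagIfOpen() {
        if (startTagOpen_) {
            out_ << '>';
            startTagOpen_ = false;
        }
    }

    std::ostringstream out_;
    std::vector<std::string> open_;  // names of the currently open elements
    bool startTagOpen_ = false;      // true while attributes may still be added
};
```

Because the writer only ever appends to its output and never builds a tree in memory, this style of API stays cheap even for large documents.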

XML namespaces

At this point, let's take a minute out to look at XML namespaces. ODF uses XML namespaces widely, to allow it to use and extend existing XML dialects without them stepping on each other's toes. To make this clear, let's deconstruct an example from an ODF document: <draw:frame> is an element you'll often see in ODF. Here, 'draw' is called the prefix, and is chosen arbitrarily for the document. 'frame' is called the localname, and is essentially the name of the element within the namespace.

Namespaces are identified by URIs (in ODF, mostly URNs), which can be long and complicated, so we define short prefixes (like 'draw' above) to use in documents. In ODF documents, this is done in the top-level element (office:document-content), like so:

<office:document-content
xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
[...]
xmlns:koffice="http://www.koffice.org/2005/">

In this example, we're defining (amongst other things) draw to mean urn:oasis:names:tc:opendocument:xmlns:drawing:1.0 . The long string beginning "urn:" is the namespace, which is part of the ODF specification.

The upshot for KOffice's ODF support is that, when using namespaces, writing is easy: we just define prefixes and write out elements with those prefixes. On the other hand, loading is more complex, since the prefix must be resolved to the relevant namespace.
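
As a rough illustration of what "resolving the prefix" means on the loading side, the sketch below expands a qualified name such as draw:frame into a (namespace URI, local name) pair. PrefixMap, ExpandedName and expandName are names invented for this example; a real XML parser does this for you and also handles default namespaces and scoping, which are left out here.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// prefix -> namespace URI, as declared by xmlns:prefix="..." attributes.
using PrefixMap = std::map<std::string, std::string>;

// An expanded name: (namespace URI, local name).
using ExpandedName = std::pair<std::string, std::string>;

// Split "prefix:local" and replace the prefix by the URI it is bound to.
// Unknown or missing prefixes give an empty URI in this simplified sketch.
ExpandedName expandName(const PrefixMap &prefixes, const std::string &qualifiedName)
{
    const std::size_t colon = qualifiedName.find(':');
    if (colon == std::string::npos)
        return ExpandedName(std::string(), qualifiedName);

    const auto it = prefixes.find(qualifiedName.substr(0, colon));
    const std::string uri = (it != prefixes.end()) ? it->second : std::string();
    return ExpandedName(uri, qualifiedName.substr(colon + 1));
}
```

Two documents that bind different prefixes to the same drawing namespace produce equal expanded names for their frame elements, which is why loading code should compare URIs and local names rather than prefixes.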

So, how do I save?

Now that we've looked at some backend issues, how do we actually go about saving in a KOffice app? There are two cases. The first is saving from a KOffice application: you need to reimplement the abstract method KoDocument::saveOasis( KoStore* store, KoXmlWriter* manifestWriter ). In the implementation, for each file you need to write, open() the file in the store and create a KoStoreDevice to write to it. Next, create a KoXmlWriter on that KoStoreDevice using createOasisXmlWriter(&dev, "..."). At this point you're ready to write the data from your application to the document using the methods of the KoXmlWriter object.

You also need to record each file in the manifest when you've finished with it, using the manifestWriter object, like this: manifestWriter->addManifestEntry( "styles.xml", "text/xml" );

The other case you might come across is saving from a KOffice filter, i.e. a class derived from KoFilter. In this case, create the KoStoreDevice using the KoFilter::m_chain member for each file like this:

KoStoreDevice* out =
   m_chain->storageFile( "content.xml", KoStore::Write );

Once you have the KoStoreDevice, the other steps are the same as before.

Automatic styles

Automatic styles present a few issues of implementation in KOffice. The first is in the creation of the automatic styles. Imagine an ODF text document with two words in bold, one at the beginning of the document, and one at the end. We only need one automatic style for these two, so we need to keep track of the already-created automatic styles. This is done by the KoGenStyles collection. It holds a QMap with objects of class KoGenStyle (no trailing 's'!) as the keys, and QStrings as the values. This way, when a new style is encountered in the document, we can look up whether it's been seen before, and if so, what its name (the value in the QMap) is, in order to reference it. We can also iterate over the KoGenStyles elements when saving, to ensure that all the necessary automatic styles are saved in the document.

Here's an example showing some of the API of the KoGenStyles and KoGenStyle classes:

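KoGenStyle s1(STYLE_AUTO, "paragraph");  // an automatic paragraph style
s1.addAttribute("a","b");                // attribute on the style element itself
s1.addProperty("c","d",TextType);        // property in the style's text properties
// Returns the name of an identical existing style, or registers s1 under a new name
QString name = genStyles.lookup(s1);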
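
The deduplication idea can be sketched without any KOffice code: a map keyed by the style's contents, with the generated name as the value. GenStyle and GenStyleCollection below are simplified stand-ins invented for this example, and the naming scheme (base name plus a counter) is an assumption rather than a description of what KoGenStyles does.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for a generated style: a family plus a set of properties.
struct GenStyle {
    std::string family;
    std::map<std::string, std::string> properties;

    // Ordering by content, so that identical styles are the same map key.
    bool operator<(const GenStyle &other) const {
        if (family != other.family)
            return family < other.family;
        return properties < other.properties;
    }
};

class GenStyleCollection {
public:
    // Return the name of an identical style if one was seen before;
    // otherwise register the style under a freshly generated name.
    std::string lookup(const GenStyle &style, const std::string &baseName = "P") {
        const auto it = names_.find(style);
        if (it != names_.end())
            return it->second;
        const std::string name = baseName + std::to_string(names_.size() + 1);
        names_.emplace(style, name);
        return name;
    }

    // Number of distinct styles that have to be written when saving.
    std::size_t count() const { return names_.size(); }

private:
    std::map<GenStyle, std::string> names_;  // style contents -> generated name
};
```

In the two-bold-words example, both words ask for the same bold text style, so the second lookup returns the name created by the first and only one automatic style ends up in the collection.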

Automatic styles also present a tricky problem because they're generated at saving time from the contents of the document, so we need to look at the whole document before we know what automatic styles should be created. The problem comes when we want to write the XML file: automatic styles are required to appear in the XML file before the body of the document. We either have to loop over the document twice, or use a more complex solution. Enter KoOasisStore (name subject to change!). This object maintains two output streams: contentWriter(), which writes the actual output document, and bodyWriter(), a temporary output stream held in memory. This allows the collection of automatic styles and the writing of the content to happen in one pass: during this pass, the body content is written to the temporary bodyWriter() stream while the automatic styles are collected, so that the styles can be written to the output document first. Once the whole document has been traversed, a call to closeContentWriter() writes all of the buffered body content to the main output stream.
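
The buffering trick can be shown in a few lines. This is a deliberately tiny sketch and not the KoOasisStore code: Run and saveContent are invented names, the "style collection" is reduced to a single flag for one bold style called T1, text is not escaped, and the output is a fragment rather than schema-valid ODF.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// A run of text that is either bold or plain (illustrative only).
struct Run {
    std::string text;
    bool bold;
};

// Single pass over the document: the body goes to a temporary buffer while
// we find out which automatic styles are needed; the styles are then written
// first and the buffered body is appended after them.
std::string saveContent(const std::vector<Run> &runs)
{
    std::ostringstream body;     // temporary stream for the document body
    bool needBoldStyle = false;  // stands in for a real style collection

    for (const Run &run : runs) {
        if (run.bold) {
            needBoldStyle = true;
            body << "<text:span text:style-name=\"T1\">" << run.text << "</text:span>";
        } else {
            body << run.text;
        }
    }

    std::ostringstream out;      // the "real" output document
    out << "<office:automatic-styles>";
    if (needBoldStyle) {
        out << "<style:style style:name=\"T1\" style:family=\"text\">"
            << "<style:text-properties fo:font-weight=\"bold\"/>"
            << "</style:style>";
    }
    out << "</office:automatic-styles>";
    out << "<office:body>" << body.str() << "</office:body>";
    return out.str();
}
```

The cost of this approach is that the whole body is held in memory until the end, which is the trade-off for not walking the document twice.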

Saving context

Saving a document means passing a lot of state around between methods. To avoid methods with 10 or so arguments each, KOffice uses a simple solution: a state object whose data members hold the relevant parameters. This object is the KoSavingContext.

Loading XML

Now that we've covered saving, what about loading ODF documents in KOffice? This is currently done internally with the QDom* classes, but work is underway to move to a more efficient KoXml* solution (from Calligra/Libs/KoXmlReader). Use of the KoXml* classes looks something like this:

KoXmlElement docElem = doc.documentElement();
KoXmlElement body(KoXml::namedItemNS(docElem, KoXmlNS::office, "body"));

It's fairly standard DOM-like access, but with some extras. The most important is that it's namespace-aware: the KoXml::namedItemNS function returns the body element in the office namespace. But recall that the 'office' prefix is an arbitrary name, which could be different in the particular document we're dealing with. What we really should be using is the official namespace URI: urn:oasis:names:tc:opendocument:xmlns:office:1.0 . KoXmlNS::office is just a shortcut for this long string.

Note: do not use getElementsByTagNameNS (it's recursive!). TODO: move to KoXmlElement::namedItemNS

To get attributes in a namespace-aware way, use QDomElement::attributeNS.

Iterating over child elements is easy, but again we have to be careful. The forEachElement macro helps here:

forEachElement(KoXmlElement e, parent) {
   if (e.localName() == "p"
       && e.namespaceURI() == KoXmlNS::text) {
       // Found a text:p element
   }
}

IMPORTANT: Do not use tagName() or nodeName() or prefix(), since these use the (arbitrary) prefix, rather than the namespace URI.

The style stack

The nested nature of ODF styles means it can be tricky to answer the question "Is this text italic?": the text might be in a span inside a paragraph inside a text object, each of which has its own associated text and paragraph style. To handle this, KOffice provides KoStyleStack, which maintains a stack of styles, and knows how to retrieve the current value of any given property. So in the example above, of a span in a paragraph in a text object, KoStyleStack might be used like this:

  1. push default styles
  2. push text object styles [+parents]
  3. push paragraph styles [+parents]
  4. push span styles
  5. setTypeProperties("text") // to ask for text-properties
  6. property(KoXmlNS::fo, "font-style") // to actually retrieve the value of the style

When walking through a document, we need another set of features of the KoStyleStack: save() and restore(). These two functions implement a second "history stack" within the KoStyleStack. Each time you call save(), the current state of the style stack is pushed onto the "history stack". Then when you call restore(), the style stack returns to the most recently saved state, which is removed from the "history stack".

This is very similar to the way QPainter works, but to make it more concrete, let's take an example: imagine a document with a frame containing a paragraph and a table. Walking through the document, we enter the frame, and save() the KoStyleStack state. Now we enter the paragraph, and push the appropriate styles onto the style stack. Once we're done with the paragraph, we issue a restore() command, and are back to the state corresponding to the frame, ready to enter the table.
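
Both behaviours, the innermost-style-wins lookup and the save()/restore() history, fit in a small class. The StyleStack below is a from-scratch sketch written for this article, not the KoStyleStack implementation: it stores plain property maps, ignores property types and parent styles, and records saved states simply as stack depths.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Properties = std::map<std::string, std::string>;

// Toy style stack: later pushes override earlier ones.
class StyleStack {
public:
    void push(const Properties &style) { stack_.push_back(style); }

    // Remember the current depth on the history stack.
    void save() { marks_.push_back(stack_.size()); }

    // Drop everything pushed since the matching save().
    void restore() {
        assert(!marks_.empty());
        stack_.resize(marks_.back());
        marks_.pop_back();
    }

    // Search from the innermost style outwards; empty string if unset.
    std::string property(const std::string &name) const {
        for (auto it = stack_.rbegin(); it != stack_.rend(); ++it) {
            const auto found = it->find(name);
            if (found != it->end())
                return found->second;
        }
        return std::string();
    }

private:
    std::vector<Properties> stack_;   // styles, outermost first
    std::vector<std::size_t> marks_;  // history stack of saved depths
};
```

In the frame example, the walker would call save() on entering the frame, push the paragraph's styles, and call restore() on leaving the paragraph, after which lookups see only the frame-level styles again.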

The style repository

The KoOasisStyles class provides access to the available styles, via a hash. KoOasisStyles deals with styles at the level of KoXmlElements.

Examples of the functionality provided by KoOasisStyles are:

  • KoXmlElement* findStyle(name, family): finds the style with the given name and family.
  • KoXmlElement* defaultStyle(family): finds the default style for the given style family.
  • listStyles, masterPages, drawStyles...: return lists of the available list styles, master page styles, drawing styles, and so on.
  • customStyles: returns the user styles (those that should be presented to the user - may be renamed).

Loading context

KoOasisLoadingContext provides access to the KoOasisStyles instance, as well as the current KoStore. It also loads the manifest file, and provides the interface to the style stack via the functions:

  • fillStyleStack(element, nsURI, attribute, family): reads the style name and loads the style for this element.
  • addStyles(KoXmlElement* style, const char* family, ...): adds the style, its parent styles, and the default style.

Remember to save/restore!

So, how do I load?

Loading ODF from a KOffice application is much like saving. Reimplement the abstract method KoDocument::loadOasis( const KoXmlDocument& doc, KoOasisStyles& oasisStyles, const KoXmlDocument& settings, KoStore* store ). The oasisStyles object contains the styles, already parsed, while the store object can be used to load embedded documents and images.

Loading ODF from a KOffice filter goes as you would expect too:

KoStoreDevice* in =
   m_chain->storageFile( "content.xml", KoStore::Read );
KoXmlDocument doc;
if (KoOasisStore::loadAndParse(in, doc)) ...

TODO: move to KoXmlDocument?

Settings

ODF allows you to save document-specific configuration in the ODF file. For example, KSpread can show a formula indicator on each cell that contains a formula, and this setting is saved per-document. Settings like these are saved in settings.xml in this rather verbose format:

<config:config-item-set config:name="configuration-variable-settings">
  <config:config-item config:name="displaylink"
       config:type="boolean">true</config:config-item>
  <config:config-item config:name="modificationDate"
       config:type="string">2007-05-10T12:36:14</config:config-item>
 </config:config-item-set>

To save config items like these, use one of the overloaded KoXmlWriter::addConfigItem methods, while to load them, use KoOasisSettings (yet another class!) like this:

KoOasisSettings settings( doc );
KoOasisSettings::Items varSettings = settings.itemSet( "configuration-variable-settings" );
bool dl = varSettings.parseConfigItemBool("displaylink", false);
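
The second argument in the call above is a default value. As a rough sketch of what such a typed lookup with a fallback amounts to, here is a standalone version working on a plain name-to-text map; ConfigItems and configBool are hypothetical names, and treating unparsable values as "use the default" is an assumption of this sketch, not documented KoOasisSettings behaviour.

```cpp
#include <cassert>
#include <map>
#include <string>

// config:name -> textual value, as found in <config:config-item> elements.
using ConfigItems = std::map<std::string, std::string>;

// Return the named boolean setting, or defaultValue if the item is
// missing or its text is neither "true" nor "false".
bool configBool(const ConfigItems &items, const std::string &name, bool defaultValue)
{
    const auto it = items.find(name);
    if (it == items.end())
        return defaultValue;
    if (it->second == "true")
        return true;
    if (it->second == "false")
        return false;
    return defaultValue;
}
```

Passing a sensible default matters because documents written by other applications, or by older versions, will often not contain the setting at all.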
