Note: This is only a copy of the document "libMSOOXML Usage HOWTO" which can be found in KDE's svn repository under
svn.kde.org/home/kde/trunk/koffice/filters/libmsooxml/doc/MSOOXML_Filters_HOWTO.odt

Goal and scope

In this document we cover creating MSOOXML import filters only. Implementation of export filters has not started yet, so the documentation for them is a TODO.

The filters implementation refers to the ECMA-376 specification. The proper version of the specification document is "ECMA-376 2nd edition Part 1" (December 2008), available at http://www.ecma-international.org/publications/standards/Ecma-376.htm. It may not reflect the behavior of MS Office 2007 but is of slightly higher quality than the 1st edition.

Modules

MSOOXML filters have been split into the following modules:

Name	Type	Location	Description
libmsooxml	Shared library	koffice/filters/libmsooxml	Base classes, interfaces and utilities for MSOOXML filter development
docximport	Plugin (filter)	koffice/filters/kword/docx	MS Word 2007 document import filter (docx, docm, dotx, dotm)
pptximport	Plugin (filter)	koffice/filters/kpresenter/pptx	MS PowerPoint 2007 document import filter (pptx, pptm, potx, potm, ppsx, ppsm)
xlsximport	Plugin (filter)	koffice/filters/kspread/xlsx	MS Excel 2007 document import filter (xlsx, xlsm, xltx, xltm, xlsb)

The process of document import

Any type of loading and saving in KOffice is based on so-called filter chains. The KoFilterManager object builds a filter chain consisting of filters that, if used in a specific order, are able to convert from the format that is read to the desired format. What a KOffice application finally loads should always be in the OpenDocument Format.

Example chain:

[.docx file] ---DOCX import filter---> [.odt file] ---loading of the native format---> [KWord]
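
The idea of building a chain can be illustrated with a small, self-contained sketch. It does not show how KoFilterManager is implemented: the Filter structure, the findChain() function and the short format names are hypothetical, and the search is a plain breadth-first search over the registered conversions.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Hypothetical description of one filter: converts format `from` into `to`.
struct Filter {
    std::string name;
    std::string from;
    std::string to;
};

// Breadth-first search for the shortest sequence of filters converting
// `source` into `target`. Returns the filter names in order of application;
// the result is empty when source == target or when no chain exists.
std::vector<std::string> findChain(const std::vector<Filter>& filters,
                                   const std::string& source,
                                   const std::string& target)
{
    std::map<std::string, const Filter*> cameBy;  // format -> filter that produced it
    std::queue<std::string> pending;
    pending.push(source);
    cameBy[source] = nullptr;
    while (!pending.empty()) {
        const std::string current = pending.front();
        pending.pop();
        if (current == target)
            break;
        for (const Filter& f : filters) {
            if (f.from == current && cameBy.find(f.to) == cameBy.end()) {
                cameBy[f.to] = &f;
                pending.push(f.to);
            }
        }
    }
    std::vector<std::string> chain;
    if (cameBy.find(target) == cameBy.end())
        return chain;
    // Walk back from the target to the source, prepending each filter.
    for (const Filter* f = cameBy[target]; f != nullptr; f = cameBy[f->from])
        chain.insert(chain.begin(), f->name);
    return chain;
}
```

With a single registered conversion from "docx" to "odt", the sketch returns a one-element chain, which corresponds to the example chain shown above.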

Creating an MSOOXML import filter

Any MSOOXML import filter is created using the recipe presented below. We focus on the DOCX filter as an example unless explicitly noted otherwise.

Inheriting MSOOXML::MsooXmlImport

A filter is created by deriving a class, e.g. DocxImport, from MSOOXML::MsooXmlImport.

MSOOXML::MsooXmlImport in turn inherits KoOdfExporter. Its description is “The base class for filters exporting to ODF”, which is true since from the OOXML filter's perspective this class exports the format to ODF.
KoOdfExporter accepts (const QString& bodyContentElement, QObject* parent = 0) args. bodyContentElement will be used to automatically create the document element of the ODF body; it is "text" when ODT is the destination format, "presentation" is for ODP, "spreadsheet" is for ODS.
Implement virtual bool acceptsSourceMimeType(const QByteArray& mime) const, which is automatically called by the MsooXmlImport. For example DocxImport accepts these types:

    - "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    - "application/vnd.openxmlformats-officedocument.wordprocessingml.template"
    - "application/vnd.ms-word.document.macroEnabled.12"
    - "application/vnd.ms-word.template.macroEnabled.12"

These mime types are defined (with the names appearing in the “Open” dialogs and with extensions) in the koffice/filters/libmsooxml/msooxml-all.xml file used by the shared mime info database of freedesktop.org (http://www.freedesktop.org/wiki/Software/shared-mime-info); in the future the types defined here would be merged with the freedesktop.org standard database.
The msooxml-all.xml file is installed by the build system into $XDG_MIME_INSTALL_DIR (usually /usr/share/mime/packages/; OpenOffice.org installs a similar file into the same directory).
Implement bool acceptsDestinationMimeType(const QByteArray& mime) const, which is automatically called by MsooXmlImport. The DOCX import only accepts "application/vnd.oasis.opendocument.text" here.
Implement:
KoFilter::ConversionStatus parseParts(KoOdfWriters *writers, MSOOXML::MsooXmlRelationships *relationships, QString& errorMessage);
This is the central place where all the parts of the input file are parsed, and where information is collected and transformed into ODF-compatible output, which is also split into parts.
In the case of the DOCX import filter, the following steps are performed; each uses a separate part of the MSOOXML ZIP container. (Actually, as the 0th step, temporarily hardcoded styles are inserted into the output document using writers->mainStyles->addRawOdfDocumentStyles(), until functional styles import is implemented.)

    - Font table parsing (word/fontTable.xml part)
    - Document styles parsing (word/styles.xml part)
    - Document parsing (word/document.xml part)

KoOdfWriters *writers is a convenience structure encapsulating XML writers (based on KoXmlWriter). Currently, the content, body, meta and manifest writers are accessible through it. Document styles are accessible via KoGenStyles *mainStyles.

MsooXmlRelationships *relationships provides relationships parsed (in a delayed manner) from .xml.rels files of the MSOOXML container. MSOOXML relationships files define mappings from XML namespace URIs to relative paths within the container where the part files are stored.

For example, the DOCX import filter uses the relationships object for supporting hyperlinks and for finding paths to embedded images and other objects.
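
The recipe above can be summarized in a standalone sketch. It deliberately avoids the real KOffice classes: DocxImportSketch, the Status enumeration and the parsePart callback are hypothetical stand-ins for DocxImport, KoFilter::ConversionStatus and the individual part parsers, and std::string replaces QByteArray and QString. Only the accepted mime types and the order of the parts are taken from the text.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Stand-in for KoFilter::ConversionStatus; only two values are modelled.
enum class Status { OK, WrongFormat };

// Hypothetical, heavily simplified stand-in for a DocxImport-like class.
class DocxImportSketch {
public:
    bool acceptsSourceMimeType(const std::string& mime) const {
        static const std::vector<std::string> accepted = {
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.template",
            "application/vnd.ms-word.document.macroEnabled.12",
            "application/vnd.ms-word.template.macroEnabled.12",
        };
        for (const std::string& m : accepted)
            if (m == mime)
                return true;
        return false;
    }

    bool acceptsDestinationMimeType(const std::string& mime) const {
        return mime == "application/vnd.oasis.opendocument.text";
    }

    // Parses the parts in the documented order and stops at the first failure.
    // `parsePart` is a hypothetical callback standing in for the part parsers.
    Status parseParts(const std::function<Status(const std::string&)>& parsePart,
                      std::string& errorMessage) const {
        const char* parts[] = { "word/fontTable.xml", "word/styles.xml",
                                "word/document.xml" };
        for (const char* part : parts) {
            if (parsePart(part) != Status::OK) {
                errorMessage = std::string("cannot parse ") + part;
                return Status::WrongFormat;
            }
        }
        return Status::OK;
    }
};
```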

Parsing individual parts

As mentioned above, parts like the font table, styles, and the document itself are separate files within the ZIP container. Therefore these parts are parsed separately using dedicated parsers. Execution of the parsers happens at the top level of the filter, i.e. in the {Docx|Pptx|Xlsx}Import::parseParts() method.

The general pattern for parsing individual parts is as follows:

Create and initialize *Context data structure.
Create *Reader object initialized with the KoOdfWriters *writers argument.
Call loadAndParseDocument() of MsooXmlImport. It takes arguments:
1. const QByteArray& contentType,
2. MsooXmlReader *reader, created before
3. KoOdfWriters *writers passed to parseParts()
4. QString& errorMessage
5. MsooXmlReaderContext* context created before

Example of parsing the DOCX's word/document.xml:

DocxImport::parseParts(...) {
    //...
    DocxXmlDocumentReaderContext context(
        *this, // DocxImport
        QLatin1String("word"), // path within the container
        QLatin1String("document.xml"), // file
        *relationships); // relationships provided for parseParts()
    DocxXmlDocumentReader documentReader(writers);
    RETURN_IF_ERROR( loadAndParseDocument(
        d->mainDocumentContentType(), &documentReader, writers, errorMessage, &context) )
    //...
}

Each *Context structure is related to a given reader, and accepts a specific set of arguments. *Context structures were introduced in order to make *Reader objects stateless, so one object can be reused for many inputs without destroying anything.
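
The separation between a reader and its context can be sketched without any KOffice types. ReaderContextSketch and ReaderSketch below are hypothetical names, and counting opening tags is only a toy replacement for real parsing; the point is that the reader object keeps no per-document members, so one instance can serve several inputs.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-document state; mirrors the idea of a *Context structure.
struct ReaderContextSketch {
    ReaderContextSketch(const std::string& p, const std::string& f)
        : path(p), file(f), elementsSeen(0) {}
    std::string path;  // path within the container, e.g. "word"
    std::string file;  // file name, e.g. "document.xml"
    int elementsSeen;  // result collected while parsing
};

// Hypothetical reader without per-document members: everything it learns
// about the input is stored in the context passed to read().
class ReaderSketch {
public:
    // Counts '<' characters that are not followed by '/', i.e. opening tags
    // in this toy input (a stand-in for real XML parsing).
    void read(const std::string& xml, ReaderContextSketch& context) const {
        for (std::string::size_type i = 0; i + 1 < xml.size(); ++i) {
            if (xml[i] == '<' && xml[i + 1] != '/')
                ++context.elementsSeen;
        }
    }
};
```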

Anatomy of *Reader classes

Reader classes inherit from MsooXmlReader, which in turn inherits from QXmlStreamReader (a fast XML recursive pull-parser) and KoOdfWriters (a structure with XML push writers and the main style object).

Typical *Reader class:

Defines typedef KoFilter::ConversionStatus(???Reader::*ReadMethod)()
Provides QStack<ReadMethod> m_calls attribute
Provides ???Context* m_context attribute for keeping current context
Has constructor taking one argument: KoOdfWriters *writers
Implements KoFilter::ConversionStatus MsooXmlReader::read(MSOOXML::MsooXmlReaderContext* context), which:
1. sets m_context to dynamic_cast<???Context*>(context)
2. checks if namespaceDeclarations() contains reference to main schema, e.g. QXmlStreamNamespaceDeclaration("w", MSOOXML::Schemas::wordprocessingml)
3. reads child elements of the document top level tag, e.g. “body” for the DOCX filter; this way the recursive XML parsing starts
Implements recursive read_{name}() methods where {name} is the respective MSOOXML's tag name without the namespace part. The use of the ”read_” prefix is a must; READ_* macros depend on it.
At the top of the *Reader.cpp file, the MSOOXML_CURRENT_NS macro should be defined; for example, in the DOCX document reader it is:
#define MSOOXML_CURRENT_NS "w"
The MSOOXML_CURRENT_CLASS macro should be defined at the top of the *Reader.cpp file too, e.g.:
#define MSOOXML_CURRENT_CLASS DocxXmlDocumentReader

The goal of using macros is readability of the readers' code. Similar functionality could be provided by C++ templates, but too often at the cost of readability.
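
The ReadMethod typedef and the m_calls stack can be pictured with the following sketch. It is only an illustration under simplifying assumptions: ReaderSketch, the call() helper and maxDepth are hypothetical, std::vector stands in for QStack, and no XML is read at all.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Status { OK, WrongFormat };  // stand-in for KoFilter::ConversionStatus

class ReaderSketch {
public:
    ReaderSketch() : maxDepth(0) {}

    // Pointer to a read_*() member function, as in the typedef described above.
    typedef Status (ReaderSketch::*ReadMethod)();

    std::vector<ReadMethod> m_calls;  // stand-in for QStack<ReadMethod>
    std::size_t maxDepth;             // deepest nesting observed, for illustration

    // read_body() "contains" one paragraph, so it calls read_p().
    Status read_body() { return call(&ReaderSketch::read_body, &ReaderSketch::read_p); }
    Status read_p()    { return call(&ReaderSketch::read_p, nullptr); }

private:
    // Records `self` on the call stack, optionally runs `child`, then pops.
    Status call(ReadMethod self, ReadMethod child) {
        m_calls.push_back(self);
        if (m_calls.size() > maxDepth)
            maxDepth = m_calls.size();
        Status result = Status::OK;
        if (child)
            result = (this->*child)();
        m_calls.pop_back();
        return result;
    }
};
```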

Inside the read_*() method

Any read_*() method handles a single XML tag, where * is the tag name.
Before definition of the method, add these lines:

#undef CURRENT_EL
#define CURRENT_EL {tag-name}

where {tag-name} is the XML tag handled by the read_*() method.

CURRENT_EL is used in various macros, so adding the tag name attribute again and again is not needed inside the method's body. This reduces the risk of mistakes (e.g. use of invalid names).

As explained in the section “Anatomy of *Reader classes”, MSOOXML_CURRENT_NS defines the current namespace, so CURRENT_EL is specified without any explicit namespace.

Example method for <w:sectPr> element in the DocxXmlDocumentReader:

#undef CURRENT_EL
#define CURRENT_EL sectPr
KoFilter::ConversionStatus DocxXmlDocumentReader::read_sectPr()
{
    READ_PROLOGUE
    // store page style in the KoGenStyle object's attribute
    m_currentPageStyle = KoGenStyle(KoGenStyle::StylePageLayout);
    m_currentPageStyle.setAutoStyleInStylesDotXml(true);
    m_currentPageStyle.addProperty("style:writing-mode", "lr-tb");
    m_currentPageStyle.addProperty("style:print-orientation", "portrait");
    while (!atEnd()) { // reading subelements
        readNext();
        if (isStartElement()) {
            TRY_READ_IF(pgSz)
            ELSE_TRY_READ_IF(pgMar)
            ELSE_TRY_READ_IF(pgBorders)
        }
        BREAK_IF_END_OF(CURRENT_EL);
    }
    QString pageLayoutStyleName = mainStyles->lookup(
        m_currentPageStyle, "Mpm", KoGenStyles::DontForceNumbering);
    KoGenStyle masterStyle(KoGenStyle::StyleMaster);
    masterStyle.addAttribute("style:page-layout-name", pageLayoutStyleName);
    mainStyles->lookup(masterStyle, "Standard", KoGenStyles::DontForceNumbering);
    READ_EPILOGUE
}

READ_PROLOGUE, defined in MsooXmlReader_p.h, adds sanity-check code that reports an error if the read_*() method is called when the stream reader does not point to the expected element. For the example above, calling read_sectPr() when the current element is not <w:sectPr> will immediately return KoFilter::WrongFormat.
Call stack kept in m_calls is also updated by pushing information about the new call.
Similarly, the programmer should most typically return KoFilter::WrongFormat when an unexpected value or tag has been encountered.
READ_EPILOGUE should be the last bit of code in read_*() methods. It pops the information about the current call from the m_calls stack. It also checks if the current element is a closing element for the tag specified as CURRENT_EL.
There exists a macro READ_EPILOGUE_WITHOUT_RETURN too, which is like READ_EPILOGUE except it lacks the final return. This lets the programmer add additional code before returning from the read_*() method.
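
The behavior described for the prologue and epilogue can be approximated with a small standalone sketch. The SKETCH_* macros, the Token structure and ReaderSketch are hypothetical; the real macros in MsooXmlReader_p.h work on a QXmlStreamReader and are certainly more elaborate. Here m_calls stores element names instead of method pointers to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Status { OK, WrongFormat };
struct Token { bool isStart; std::string name; };  // toy replacement for XML tokens

// Two-level stringification so that CURRENT_EL is expanded before quoting.
#define SKETCH_STR_(x) #x
#define SKETCH_STR(x) SKETCH_STR_(x)

// Hypothetical equivalents of READ_PROLOGUE / READ_EPILOGUE.
#define SKETCH_PROLOGUE \
    if (!isAt(true, SKETCH_STR(CURRENT_EL))) \
        return Status::WrongFormat; \
    m_calls.push_back(SKETCH_STR(CURRENT_EL));
#define SKETCH_EPILOGUE \
    m_calls.pop_back(); \
    if (!isAt(false, SKETCH_STR(CURRENT_EL))) \
        return Status::WrongFormat; \
    return Status::OK;

class ReaderSketch {
public:
    explicit ReaderSketch(const std::vector<Token>& tokens)
        : m_tokens(tokens), m_pos(0) {}
    Status read_sectPr();
    std::vector<std::string> m_calls;  // names of elements currently being read
private:
    // True if the current token is a start (or end) tag with the given name.
    bool isAt(bool start, const std::string& name) const {
        return m_pos < m_tokens.size() && m_tokens[m_pos].isStart == start
               && m_tokens[m_pos].name == name;
    }
    std::vector<Token> m_tokens;
    std::size_t m_pos;
};

#undef CURRENT_EL
#define CURRENT_EL sectPr
Status ReaderSketch::read_sectPr()
{
    SKETCH_PROLOGUE
    // Move forward until the closing tag of CURRENT_EL (children are ignored here).
    while (m_pos + 1 < m_tokens.size() && !isAt(false, SKETCH_STR(CURRENT_EL)))
        ++m_pos;
    SKETCH_EPILOGUE
}
```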

The while loop

The central part of any read_*() method is the while loop that reads child elements. The loop is used when child elements are expected. However, there may be elements not handled by the current implementation of the filter or elements coming from the future versions of the MSOOXML format, or elements extending the format (for whatever reason). If this is the case, code that skips these child elements is mandatory. To make this task easy, the SKIP_EVERYTHING macro is provided. It reads child elements until a closing element equal to CURRENT_EL is encountered.

Example:

KoFilter::ConversionStatus ???::read_???()
{
    READ_PROLOGUE
    // ...
    SKIP_EVERYTHING
    // ...
    READ_EPILOGUE
}

If there is nothing to do in the method, it is even easier to write:

KoFilter::ConversionStatus ???::read_???()
{
    SKIP_EVERYTHING_AND_RETURN
}

This pattern is useful for adding methods that should behave properly but have no functionality implemented yet.
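
What skipping amounts to can be shown with a toy token list. The skipEverything() function and the Token structure are hypothetical and do not reproduce the SKIP_EVERYTHING macro; following the description above, the match is by name only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Token { bool isStart; std::string name; };  // toy replacement for XML tokens

// Advances `pos` from the start tag of `element` to its closing tag, ignoring
// every child in between. Returns false if the closing tag is never found,
// in which case `pos` ends up at tokens.size().
// The match is by name only, so an element nested inside another element of
// the same name is not handled by this sketch.
bool skipEverything(const std::vector<Token>& tokens, std::size_t& pos,
                    const std::string& element)
{
    while (pos < tokens.size()) {
        if (!tokens[pos].isStart && tokens[pos].name == element)
            return true;
        ++pos;
    }
    return false;
}
```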

Inside of the loop

TokenType MsooXmlReader::readNext() is executed at the beginning of the while loop. Then, the typical code is:

if (isStartElement()) {
    TRY_READ_IF({tag1})
    ELSE_TRY_READ_IF({tag2})
    // ...
    ELSE_TRY_READ_IF({tagn})
    ELSE_WRONG_FORMAT
}

TRY_READ_IF(tag) executes read_tag() method if the current element is “tag” (the namespace should match MSOOXML_CURRENT_NS). Note that for simplicity the tag is not enclosed in quotation marks: the macro does this automatically.

If the current element does not match “tag”, ELSE_TRY_READ_IF is tried if present, which is just else + TRY_READ_IF. Finally, the programmer can append ELSE_WRONG_FORMAT to the list of conditions. In this case KoFilter::WrongFormat will be returned when no condition was true. Use ELSE_WRONG_FORMAT only if every possible element is already checked and if unexpected elements should cause errors. Omitting ELSE_WRONG_FORMAT will just ignore unhandled elements.

Finally, after the conditions, it is common to add BREAK_IF_END_OF(CURRENT_EL), which breaks out of the loop as soon as a closing tag equal to CURRENT_EL is encountered.

Reading attributes

It is common to read XML attributes of the XML element before the while loop because after starting the loop, readNext() is executed so the attributes are not accessible any more.

The libmsooxml library provides a set of macros for reading attributes. To start using them, add this line just after the READ_PROLOGUE:

const QXmlStreamAttributes attrs(attributes());

Then, to read the value of an attribute, it is enough to write:

TRY_READ_ATTR(a)

It is important that a is the attribute name without quotation marks. This code reads attribute a with namespace MSOOXML_CURRENT_NS. So if MSOOXML_CURRENT_NS is “w”, TRY_READ_ATTR(a) in fact reads the value of the w:a attribute. The target variable where the value is stored is created by the macro in the local code block and has the same name as the attribute (for the above example it is QString a).

For the set of TRY_READ* macros, TRY means that the code only tries to read the attribute values; if the attribute is missing, no error is returned and an empty string is used as the value.

TRY_READ_ATTR_INTO(atrname, destination) macro is like TRY_READ_ATTR but it does not create a new variable, instead the variable name provided as destination is reused.
TRY_READ_ATTR_WITH_NS(ns, atrname) is like TRY_READ_ATTR() but offers the argument ns which lets the programmer bypass the current namespace provided by MSOOXML_CURRENT_NS. The QString variable created is of the form {ns}_{atrname}. Some attribute names in MSOOXML have namespaces, some do not, so TRY_READ_ATTR_WITH_NS() is useful.
TRY_READ_ATTR_WITH_NS_INTO(ns, atrname, destination) is a combination of TRY_READ_ATTR_WITH_NS() with TRY_READ_ATTR_INTO().
TRY_READ_ATTR_WITHOUT_NS(atrname) is like TRY_READ_ATTR() but no namespace is expected for the name at all.

Value checking. All attributes returned by READ_ATTR* macros are of type QString. Sometimes conversion of the read value is needed, and validity checking is beneficial.

STRING_TO_INT(string, destination, debugElement) converts string into destination of type int; displays warning and returns KoFilter::WrongFormat on failure.
STRING_TO_QREAL(string, destination, debugElement) works like STRING_TO_INT() but for qreal type.

Reading mandatory attributes. For this purpose there are: READ_ATTR(), READ_ATTR_INTO(), READ_ATTR_WITH_NS_INTO(), READ_ATTR_WITH_NS(), READ_ATTR_WITHOUT_NS(), READ_ATTR_WITHOUT_NS_INTO(), modeled exactly after TRY_READ_* macros; they warn and return KoFilter::WrongFormat when QXmlStreamAttributes::hasAttribute() is false for the given attribute.
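
The difference between optional and mandatory attribute reading, followed by a checked conversion, can be illustrated as follows. This is a sketch under stated assumptions: the SKETCH_* macros, the Attributes typedef and readPageSize() are hypothetical, std::map and std::string replace QXmlStreamAttributes and QString, and the conversion macro takes two arguments instead of three.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>

enum class Status { OK, WrongFormat };
typedef std::map<std::string, std::string> Attributes;  // qualified name -> value

#define SKETCH_CURRENT_NS "w"
// Optional attribute: declares a std::string named like the attribute;
// it stays empty when the attribute is missing.
#define SKETCH_TRY_READ_ATTR(name) \
    std::string name; \
    { Attributes::const_iterator it = attrs.find(SKETCH_CURRENT_NS ":" #name); \
      if (it != attrs.end()) name = it->second; }
// Mandatory attribute: as above, but a missing attribute is an error.
#define SKETCH_READ_ATTR(name) \
    if (attrs.find(SKETCH_CURRENT_NS ":" #name) == attrs.end()) \
        return Status::WrongFormat; \
    SKETCH_TRY_READ_ATTR(name)
// Converts a string to int; empty or partly non-numeric input is an error.
#define SKETCH_STRING_TO_INT(str, destination) \
    { char* end = 0; \
      const long v = std::strtol((str).c_str(), &end, 10); \
      if ((str).empty() || *end != '\0') return Status::WrongFormat; \
      destination = static_cast<int>(v); }

// Reads the width (mandatory) and orientation (optional) of a pgSz-like element.
Status readPageSize(const Attributes& attrs, int& width, std::string& orientation)
{
    SKETCH_READ_ATTR(w)          // declares std::string w, read from "w:w"
    SKETCH_TRY_READ_ATTR(orient) // declares std::string orient, read from "w:orient"
    SKETCH_STRING_TO_INT(w, width)
    orientation = orient;
    return Status::OK;
}
```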

Implemented *Reader classes

As of this writing, MSOOXML import filters implement (class hierarchy preserved):

MSOOXML::MsooXmlReader (filters/libmsooxml)
- MSOOXML::MsooXmlRelationshipsReader (filters/libmsooxml)
- MSOOXML::MsooXmlThemesReader (filters/libmsooxml)
- MSOOXML::MsooXmlCommonReader (filters/libmsooxml)
    - DocxXmlDocumentReader (filters/kword/docx)
    - PptxXmlSlideReader (filters/kpresenter/pptx)
    - XlsxXmlWorksheetReader (filters/kspread/xlsx)
- DocxXmlCommentsReader (filters/kword/docx)
- DocxXmlFontTableReader (filters/kword/docx)
- DocxXmlStylesReader (filters/kword/docx)
- XlsxXmlCommonReader (filters/kspread/xlsx)
    - XlsxXmlSharedStringsReader (filters/kspread/xlsx)
- XlsxXmlDocumentReader (filters/kspread/xlsx)
- XlsxXmlStylesReader (filters/kspread/xlsx)

Code reused by readers

In addition to inheritance, there are implementation files included by some readers. This effectively copies the code, with one main difference: namespaces differ between DOCX, PPTX and XLSX, and since the READ_* and READ_PROLOGUE/READ_EPILOGUE macros heavily depend on the value of MSOOXML_CURRENT_NS, the resulting code differs from filter to filter. This mechanism can be treated like C++ templates, but it works implicitly: no namespace arguments pollute the code.

Readers that copy code from MsooXmlCommonReaderImpl.h (they also inherit MsooXmlCommonReader) are:

DocxXmlDocumentReader
PptxXmlSlideReader

Readers that copy code from MsooXmlCommonReaderDrawingMLImpl.h (DrawingML support) are:

DocxXmlDocumentReader
PptxXmlSlideReader

Readers that copy code from MsooXmlVmlReaderImpl.h (VML support) are:

DocxXmlDocumentReader
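
The effect of including the same implementation file under different macro settings can be imitated in a single file. In the sketch below a macro, SKETCH_SHARED_IMPL, stands in for the contents of a shared *Impl.h file, because one self-contained example cannot include a second file; all names are hypothetical, and the "p" prefix for the PPTX reader is an assumption made for illustration.

```cpp
#include <cassert>
#include <string>

class DocxReaderSketch {
public:
    std::string qualified(const std::string& tag) const;
};
class PptxReaderSketch {
public:
    std::string qualified(const std::string& tag) const;
};

// Stands in for the contents of a shared implementation file. The macros it
// mentions are resolved each time the body is expanded, not when it is defined.
#define SKETCH_SHARED_IMPL \
    std::string SKETCH_CURRENT_CLASS::qualified(const std::string& tag) const \
    { return std::string(SKETCH_CURRENT_NS) + ":" + tag; }

// "DOCX reader .cpp file": set namespace and class, then pull in the shared code.
#define SKETCH_CURRENT_NS "w"
#define SKETCH_CURRENT_CLASS DocxReaderSketch
SKETCH_SHARED_IMPL
#undef SKETCH_CURRENT_NS
#undef SKETCH_CURRENT_CLASS

// "PPTX reader .cpp file": the same shared code, different namespace and class.
#define SKETCH_CURRENT_NS "p"
#define SKETCH_CURRENT_CLASS PptxReaderSketch
SKETCH_SHARED_IMPL
#undef SKETCH_CURRENT_NS
#undef SKETCH_CURRENT_CLASS
```

The shared body is written once, yet each class gets its own copy compiled against its own namespace prefix.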

Using XmlWriteBuffer

XmlWriteBuffer is a helper that buffers XML streams and writes them back later. This class is useful when information that has to be written in advance is based on XML elements parsed later. In such a case the information cannot be saved in one pass. An example of this is a paragraph's style name: it should be written to the style:name attribute, but the relevant XML elements (that we use for building the style) appear later. So we first output the created XML to a buffer, then save the parent element with the style name and use KoXmlWriter::addCompleteElement() to redirect the buffer contents as a subelement.

See MsooXmlUtils.h for example use of the class.
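
The buffering idea itself can be shown without the real classes. The following sketch does not reproduce the XmlWriteBuffer or KoXmlWriter interfaces: writeParagraph(), the Child structure and the style names are hypothetical, std::ostringstream plays the role of the buffer, and no XML escaping is done.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical child of a paragraph: kind is "run" (text) or "bold" (a property
// that influences the paragraph's style).
struct Child {
    std::string kind;
    std::string text;
};

// Writes a paragraph element whose style name is known only after all
// children have been processed.
std::string writeParagraph(const std::vector<Child>& children)
{
    // 1. Children are written to a temporary buffer first.
    std::ostringstream buffer;
    bool bold = false;
    for (const Child& c : children) {
        if (c.kind == "bold")
            bold = true;  // style information discovered late
        else
            buffer << "<text:span>" << c.text << "</text:span>";
    }

    // 2. The style name is decided based on what was seen.
    const std::string styleName = bold ? "P_bold" : "P_default";

    // 3. Only now is the parent start tag written, followed by the buffered
    //    children and the end tag.
    std::ostringstream out;
    out << "<text:p text:style-name=\"" << styleName << "\">"
        << buffer.str() << "</text:p>";
    return out.str();
}
```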

Documenting the filters code

The MSOOXML specification documents take many thousands of pages, and even so are considered incomplete. Therefore it is very important to maintain adequate documentation of the filters code and refer to the specification.

Any read_*() method should be documented. The proper place is the .cpp file, because programmers may want to quickly update the documentation while working on the code.

Example of a documented method:

#undef CURRENT_EL
#define CURRENT_EL pgBorders
//! pgBorders handler (Page Borders)
/*! ECMA-376, 17.6.10, p. 646.
 Parent elements:
 - [done] sectPr (§17.6.17)
 - sectPr (§17.6.18)
 - sectPr (§17.6.19)
 Child elements:
 - [done] bottom (Bottom Border) §17.6.2
 - [done] left (Left Border) §17.6.7
 - [done] right (Right Border) §17.6.15
 - [done] top (Top Border) §17.6.21
*/
//! @todo support all elements
KoFilter::ConversionStatus DocxXmlDocumentReader::read_pgBorders() /*..*/

Use Doxygen tags //! or /*!
The first line lists the element name; the namespace can be given if the method is only used with this single namespace.
Add the MSOOXML specs section number and page number.

From koffice/filters/libmsooxml/README:

[..] References to the ECMA-376 specification in the source code are made by page number or chapter number. Page numbers are the physical PDF document's page numbers as displayed by the viewer.

Optionally, copy the sentence describing purpose of element from the MSOOXML specs.
Add a “Parent elements” section and list all the parent elements, one per line. For parent elements that have calls to the read_*() method, e.g. using TRY_READ_IF(*), prepend the [done] marker as an indication of which parent element depends on this method.
Carefully distinguish between multiple MSOOXML elements sharing the same tag name but having different section number.
Add “Child elements” section and list all the child elements, one per line, or write “No child elements.” For child elements that are already handled in the read_*() method, e.g. using TRY_READ_IF(*), prepend [done] marker as an indication of the progress made in the implementation.
Finally, add Doxygen todo note: //! @todo support all elements
Do not list attribute names in the documentation above the method's signature. Instead add extra explanation to the READ_ATTR* lines for the respective attributes if needed. For (yet) unhandled attributes, add line like //! @todo support attribute {name}
If there is no sense in or possibility of supporting a certain element, attribute or attribute value, add a note in a Doxygen comment in the function body or above it. It is very valuable to have the comments in the context of the source code instead of having them floating in random README or TODO files.
Act similarly if there is insufficient knowledge about an element or attribute, or if its meaning is unclear.
Add similar notes when a case has been identified that makes round-trip conversion of documents hard, inaccurate or impossible.

Command line tests for filters

The koconverter command-line program is available within KOffice for testing document opening, importing and exporting without allocating any GUI:

koconverter <input-document-file> <output-document-file>

Formats of the input and output files are detected based on the mime type and/or the content, and an appropriate filter chain is set up.

Examples:

% koconverter document.docx document.odt

% koconverter sheet.xlsx sheet.ods

% koconverter presentation.pptx presentation.odp

The resulting ODF file (silently overwritten) will look the same as if the user had opened the MSOOXML file in the respective KOffice application and then saved it in the default ODF format.

The (usually rich) debug output of the filter execution can be examined. E.g.

% koconverter sheet.xlsx sheet.ods 2>&1 | less