Calligra/Architecture/AuthorRDF

From KDE Community Wiki

Materials to read about RDF in general

Note

Replace this with a link to the Calligra + RDF documentation once it is ready, and move the links below into it.

Concept

Why do we use RDF to store Author's data? When Outliner was designed, one goal was to save its data alongside an ordinary OpenDocument file. We did not want to introduce a new file format, and an Author document had to remain editable in ordinary word processors such as OpenOffice Writer. We could add extra "Author-data-only" files to the package, but within the OpenDocument standard that can only be treated as attaching a file to the document, which distorts the meaning of "attached file". Another disadvantage is that such a file becomes outdated as soon as the document is modified outside Author. RDF therefore looks like an ideal mechanism for storing various data along with a document: it also allows in-text links to that data, and the data itself can live in a separate file inside the package.

OpenDocument allows us to mark specific elements in the document content as related to an RDF subject. For example, to record that the actor Vasya takes part in a scene, we would store the information in RDF with the following sequence of steps:

Note

Prefixes in predicates and objects are omitted below for simplicity.


  1. Create Vasya subject of type Actor:
    RDF: (subject - predicate - object)
         {ActorId} - type - Actor
         {ActorId} - name - Vasya
    
  2. Create scene subject:
    RDF: {SceneId} - type - Scene
    

    Note

    {ActorId} and {SceneId} are URIs, as RDF requires, or some other unique identification strings.
  3. Store "Vasya is participating this scene" info:
    RDF: {SceneId} - hasActor - {ActorId}
    

    Note

    At this point all information at the Author data level is complete; next we link the scene to the text.
  4. Assign a unique xml:id to the <text:section> element that corresponds to the scene.
    content.xml:
        <text:section xml:id="{xmlId}"> {scene contents here} </text:section>
    
    1. Ensure that manifest.rdf contains the following set of triples:
          RDF: {PackageId} - type - Package
               {PackageId} - hasPart - {ContentFileId}
               {ContentFileId} - type - ContentFile
               {ContentFileId} - path - "content.xml"
      

      Note

      The Package and hasPart names are described in the OpenDocument v1.2 specification, part 3, chapter 6.
    2. Then add the following triples to record that {SceneId} refers to the specific {xmlId}:
          RDF: {ContentFileId} - hasPart - {SceneId}
               {SceneId} - type - Element
               {SceneId} - idref - {xmlId}
      

      Note

      The Element class and the idref property are described in the OpenDocument v1.2 specification, part 3, chapter 6.
      Steps 4 and 5 are implemented with the help of the KoTextInlineRdf class.
  5. Profit! Now simple SPARQL queries to Soprano return all the needed information as RDF triples. A convenient function for updating this information is provided by the KoRdfBasicSemanticItem class (see KoRdfBasicSemanticItem::updateTriple()).

As the example shows, we can store arbitrary types of data and link to that data from any content element that supports the xml:id attribute.
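
The steps above can be sketched as a small in-memory list of triples. This is only an illustration in Python, not Calligra code: the identifier values ("actor-1", "scene-1" and so on) and the helper functions match() and actor_names_for_xml_id() are hypothetical, and prefixes are omitted as in the steps. The match() helper stands in for the kind of pattern a SPARQL query would express.

```python
# Hypothetical identifiers; in a real document these would be URIs
# or other unique identification strings.
ACTOR_ID = "actor-1"
SCENE_ID = "scene-1"
PACKAGE_ID = "package"
CONTENT_FILE_ID = "content-file"
XML_ID = "id-scene-1"

triples = [
    # Step 1: the actor subject
    (ACTOR_ID, "type", "Actor"),
    (ACTOR_ID, "name", "Vasya"),
    # Step 2: the scene subject
    (SCENE_ID, "type", "Scene"),
    # Step 3: "Vasya is participating this scene"
    (SCENE_ID, "hasActor", ACTOR_ID),
    # Step 4.1: package structure expected in manifest.rdf
    (PACKAGE_ID, "type", "Package"),
    (PACKAGE_ID, "hasPart", CONTENT_FILE_ID),
    (CONTENT_FILE_ID, "type", "ContentFile"),
    (CONTENT_FILE_ID, "path", "content.xml"),
    # Step 4.2: link the scene subject to the xml:id of <text:section>
    (CONTENT_FILE_ID, "hasPart", SCENE_ID),
    (SCENE_ID, "type", "Element"),
    (SCENE_ID, "idref", XML_ID),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def actor_names_for_xml_id(triples, xml_id):
    """Follow xml:id -> scene subject -> actor subject -> actor name."""
    names = []
    for scene, _, _ in match(triples, p="idref", o=xml_id):
        for _, _, actor in match(triples, s=scene, p="hasActor"):
            names += [name for _, _, name in match(triples, s=actor, p="name")]
    return names


print(actor_names_for_xml_id(triples, XML_ID))  # ['Vasya']
```

Starting from the xml:id of a section, the lookup walks the same chain the steps build: idref to the scene subject, hasActor to the actor subject, and name to the literal value.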

From here on, and throughout the code, I use the word "section" instead of "scene", because "section" seems more neutral and indicates that the stored data is linked to a <text:section> element.

Implementation foreword

First, I (deniskup) changed some behavior of the default RDF implementation in Calligra Words. Previously there was KoRdfSemanticItem, which included handling of the "in-document text representation". Since we don't need that for our metadata (section metadata, for example), I moved all base functions of KoRdfSemanticItem into KoRdfBasicSemanticItem. So:

Old KoRdfSemanticItem ==
    KoRdfBasicSemanticItem (base functions)
  + new KoRdfSemanticItem (additional handling that we don't need for every metadata)

Note

The "in-document text representation" allows text in the document to be modified based on the data of the object that the text links to. For example, you can mention an actor Vasya in the text, then change his name to John in some "Actor information editor", and the "in-document text representation" functionality will change Vasya to John everywhere he is mentioned in the text.


Author specific classes for RDF handling

CAuMetaDataManager was introduced to add an author.rdf file to the package (.odt) and to create RDF contexts for writing the needed RDF information to this file. It also registers Author RDF elements with the system and provides some helper functions.

To make it easy to create new semantic items for Author, I created the CAuSemanticItemBase class. Most elements share a common code base for updating values of different types in RDF and for generating queries to Soprano; all of this was extracted into the base class. When subclassing, you only need to specify lists of integer and string properties, and the base class handles updates and the rest automatically. I think element factories could also be given a common base class this way.
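
The subclassing idea can be sketched roughly as follows. This is a Python illustration of the pattern only, not the real C++ class: SemanticItemBase, set_property() and triples() are hypothetical names, and the assumption that status is an integer property while badge and synop are string properties is made for the example.

```python
class SemanticItemBase:
    """Hypothetical stand-in for the role the text assigns to CAuSemanticItemBase."""

    namespace = ""
    int_properties = ()      # subclasses list their integer properties here
    string_properties = ()   # subclasses list their string properties here

    def __init__(self, subject):
        self.subject = subject
        self._values = {}

    def set_property(self, name, value):
        """Store a value, converting it according to the declared property type."""
        if name in self.int_properties:
            self._values[name] = int(value)
        elif name in self.string_properties:
            self._values[name] = str(value)
        else:
            raise KeyError(f"unknown property: {name}")

    def triples(self):
        """Return (subject, predicate, object) tuples for all stored values."""
        return [
            (self.subject, self.namespace + name, value)
            for name, value in sorted(self._values.items())
        ]


class Section(SemanticItemBase):
    # A subclass only declares its namespace and property lists.
    namespace = "http://www.calligra.org/author/Section#"
    int_properties = ("status",)
    string_properties = ("badge", "synop")


section = Section("{c18429f8-ce63-4af8-a989-0fff163579f9}")
section.set_property("badge", "Sec1Badge")
section.set_property("status", "1")
for triple in section.triples():
    print(triple)
```

The subclass carries no update logic of its own, which is the point of the design: adding a new kind of item means declaring its properties, nothing more.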

Sample of integration of section info to document on real document contents

Let's look at an example of how all of this looks at the XML level. Download this file: Media:author-rdf-sample.odt. Let's unpack it:

author-rdf-sample.odt -> {unpacking}
    META-INF <DIR>
    Thumbnails <DIR>
    author.rdf <--- this is where Author stores its data
    content.xml <--- this is where document contents are stored
    manifest.rdf <--- this is where aliases from content.xml to author.rdf are placed
    meta.xml
    mimetype
    settings.xml
    styles.xml
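
Since an .odt file is a ZIP package, the listing above can be reproduced with a few lines of Python. The sketch below does not read the real sample file; it builds a tiny stand-in package in memory with placeholder contents, so the helper names list_package() and build_demo_package() and the placeholder file bodies are assumptions for illustration.

```python
import io
import zipfile


def list_package(odt_bytes):
    """Return the sorted member names of an ODF package given as bytes."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as package:
        return sorted(package.namelist())


def build_demo_package():
    """Build a minimal stand-in package in memory (placeholder contents only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        package.writestr("content.xml", "<placeholder/>")
        package.writestr("manifest.rdf", "<placeholder/>")
        package.writestr("author.rdf", "<placeholder/>")
    return buffer.getvalue()


names = list_package(build_demo_package())
print(names)  # ['author.rdf', 'content.xml', 'manifest.rdf', 'mimetype']
```

To inspect a file on disk instead, pass its path straight to zipfile.ZipFile.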

If you open author-rdf-sample.odt, you will see a header and two sections with data assigned in Outliner (badge, status, synopsis). This is how it looks in content.xml:

<?xml version="1.0" encoding="UTF-8"?>
...
<office:body> <office:text>
    ...
    <text:section
	text:name="New section 1"
	xml:id="id-9fa48e52-48e1-49cc-83ae-cf0a55c79759">
	<text:p text:style-name="P2">
	    Section 1 text
	</text:p>
    </text:section>
    ...
</office:text> </office:body>

Here we see a section named "New section 1" with the specified id. Remember this id; we will see it again later.

These are the contents of manifest.rdf:
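
As a rough illustration, the section name and its xml:id can be read back with Python's standard XML parser. The snippet below uses a trimmed copy of the fragment above wrapped in a root element with the standard ODF namespace declarations; the wrapper and the helper name section_ids() are assumptions made so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace of the xml: prefix

# Trimmed stand-in for content.xml; the elided parts are left out.
CONTENT_XML = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body><office:text>
    <text:section text:name="New section 1"
        xml:id="id-9fa48e52-48e1-49cc-83ae-cf0a55c79759">
      <text:p text:style-name="P2">Section 1 text</text:p>
    </text:section>
  </office:text></office:body>
</office:document-content>"""


def section_ids(content_xml):
    """Map each section name to its xml:id, skipping sections without one."""
    root = ET.fromstring(content_xml)
    return {
        section.get(f"{{{TEXT_NS}}}name"): section.get(f"{{{XML_NS}}}id")
        for section in root.iter(f"{{{TEXT_NS}}}section")
        if section.get(f"{{{XML_NS}}}id") is not None
    }


print(section_ids(CONTENT_XML))
```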

 
Remember
Replace the code below once the implementation corresponds to the newest specification. The example below is outdated and does not follow the OpenDocument v1.2 specification. See this for details.

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    ...
    <rdf:Description rdf:about="%7Bc18429f8-ce63-4af8-a989-0fff163579f9%7D">
	<ns2:idref xmlns:ns2="http://docs.oasis-open.org/ns/office/1.2/meta/pkg#">
	    id-9fa48e52-48e1-49cc-83ae-cf0a55c79759
	</ns2:idref>
    </rdf:Description>
    ...
    <ns27:MetaDataFile xmlns:ns27="http://docs.oasis-open.org/ns/office/1.2/meta/odf#">
	<ns28:path
	    xmlns:ns28="http://docs.oasis-open.org/ns/office/1.2/meta/pkg#"
	    rdf:datatype="http://www.w3.org/2001/XMLSchema#string">
	    author.rdf
	</ns28:path>
    </ns27:MetaDataFile>
    ...
</rdf:RDF>

The first block says that the metadata with the id from rdf:about is associated with the specified idref (remember the xml:id). The second block says that the package contains an additional metadata file, author.rdf, which contains:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    ...
    <ns6:Section
        xmlns:ns6="http://www.calligra.org/author/"
        rdf:about="%7Bc18429f8-ce63-4af8-a989-0fff163579f9%7D">
	<ns7:badge xmlns:ns7="http://www.calligra.org/author/Section#"
	    rdf:datatype="http://www.w3.org/2001/XMLSchema#string">
	    Sec1Badge
	</ns7:badge>
	<ns8:magicid xmlns:ns8="http://www.calligra.org/author/Section#"
	    rdf:datatype="http://www.w3.org/2001/XMLSchema#string">
	    {c18429f8-ce63-4af8-a989-0fff163579f9}
	</ns8:magicid>
	<ns9:status xmlns:ns9="http://www.calligra.org/author/Section#"
	    rdf:datatype="http://www.w3.org/2001/XMLSchema#string">
	    1
	</ns9:status>
	<ns10:synop xmlns:ns10="http://www.calligra.org/author/Section#"
	    rdf:datatype="http://www.w3.org/2001/XMLSchema#string">
	    Sec1Synop
	</ns10:synop>
    </ns6:Section>
    ...
</rdf:RDF>

Here we see a Section object that describes our section (see rdf:about -> idref -> xml:id). It holds the badge, status and synopsis information.

Let's lay out how all of this looks as RDF triples (simplified version):

SUBJECT PREDICATE OBJECT
%7Bc18429f8-ce63-4af8-a989-0fff163579f9%7D http://www.calligra.org/author/Section#badge Sec1Badge
%7Bc18429f8-ce63-4af8-a989-0fff163579f9%7D http://www.calligra.org/author/Section#status 1
%7Bc18429f8-ce63-4af8-a989-0fff163579f9%7D http://www.calligra.org/author/Section#synop Sec1Synop
%7Bc18429f8-ce63-4af8-a989-0fff163579f9%7D http://docs.oasis-open.org/ns/office/1.2/meta/pkg#idref id-9fa48e52-48e1-49cc-83ae-cf0a55c79759
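
The chain xml:id -> idref -> rdf:about -> Section properties can be followed with a short Python sketch. It parses trimmed copies of the two RDF/XML samples with the standard library instead of a real RDF store such as Soprano; the helper names subject_for_xml_id() and section_properties() are hypothetical, and the magicid property is left out for brevity.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PKG_NS = "http://docs.oasis-open.org/ns/office/1.2/meta/pkg#"

# Trimmed stand-ins for manifest.rdf and author.rdf from the sample above.
MANIFEST_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="%7Bc18429f8-ce63-4af8-a989-0fff163579f9%7D">
    <ns2:idref xmlns:ns2="http://docs.oasis-open.org/ns/office/1.2/meta/pkg#">
      id-9fa48e52-48e1-49cc-83ae-cf0a55c79759
    </ns2:idref>
  </rdf:Description>
</rdf:RDF>"""

AUTHOR_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <ns6:Section xmlns:ns6="http://www.calligra.org/author/"
      xmlns:s="http://www.calligra.org/author/Section#"
      rdf:about="%7Bc18429f8-ce63-4af8-a989-0fff163579f9%7D">
    <s:badge>Sec1Badge</s:badge>
    <s:status>1</s:status>
    <s:synop>Sec1Synop</s:synop>
  </ns6:Section>
</rdf:RDF>"""


def subject_for_xml_id(manifest_xml, xml_id):
    """Find the rdf:about subject whose pkg:idref equals the given xml:id."""
    root = ET.fromstring(manifest_xml)
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        idref = description.find(f"{{{PKG_NS}}}idref")
        if idref is not None and (idref.text or "").strip() == xml_id:
            return description.get(f"{{{RDF_NS}}}about")
    return None


def section_properties(author_xml, subject):
    """Collect the property values of the node with the given rdf:about."""
    root = ET.fromstring(author_xml)
    for node in root:
        if node.get(f"{{{RDF_NS}}}about") == subject:
            # Tags look like "{namespace}name"; keep only the local name.
            return {
                child.tag.split("}", 1)[1]: (child.text or "").strip()
                for child in node
            }
    return {}


subject = subject_for_xml_id(MANIFEST_RDF, "id-9fa48e52-48e1-49cc-83ae-cf0a55c79759")
print(subject)
print(section_properties(AUTHOR_RDF, subject))
```

The whitespace around the literal values in the sample files is stripped before comparison, since the serialized XML places each value on its own indented line.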

The RDF context (mentioned above) is the root element of the author.rdf file:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    ...
</rdf:RDF>

Materials about Calligra and RDF