Generalized schema.org Extractor

Notes from an Akademy 2020 discussion between Dan, Kai and Volker.

Goal

The extractor engine and data models in kitinerary are currently restricted to travel documents. The goal is to remove that restriction, as two users already want to do more with them:

  • PBI: extracting schema.org annotations from websites (business information, locations, etc)
  • KMail: Order and invoice data extracted from business email

Plan

Split what is now kitinerary into 3+ parts:

  • a generic core library containing basic types, JSON-LD de/serialization and the extractor engine
  • a domain-specific library with the travel data types, and travel-specific extensions for the extractor (pkpass, vdv, uic918.3, iata, etc)
  • a domain-specific library for the order/invoice data model
  • additional domain-specific libraries can be added for whatever else PBI needs and isn't covered yet
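One way to picture the split: the core knows no concrete types at all, and each domain library registers its own types with it. The following is only a rough sketch in plain standard C++ to illustrate the layering. `Thing`, `TypeRegistry`, `registerTravelTypes` and `registerOrderTypes` are hypothetical names invented for this example, not existing kitinerary API, and the real data types would presumably stay Qt value types rather than a virtual class hierarchy.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// --- generic core library (all names here are hypothetical) ---
struct Thing {
    virtual ~Thing() = default;
    virtual std::string typeName() const = 0;
};

class TypeRegistry {
public:
    using Factory = std::function<std::unique_ptr<Thing>()>;

    // Called by domain libraries to make a schema.org @type known to the core.
    void registerType(const std::string &type, Factory factory) {
        m_factories[type] = std::move(factory);
    }

    // Returns nullptr for an @type that no domain library has registered.
    std::unique_ptr<Thing> create(const std::string &type) const {
        const auto it = m_factories.find(type);
        if (it == m_factories.end()) {
            return nullptr;
        }
        return it->second();
    }

private:
    std::map<std::string, Factory> m_factories;
};

// --- travel domain library ---
struct FlightReservation : Thing {
    std::string typeName() const override { return "FlightReservation"; }
};

void registerTravelTypes(TypeRegistry &registry) {
    registry.registerType("FlightReservation",
                          [] { return std::make_unique<FlightReservation>(); });
}

// --- order/invoice domain library ---
struct Invoice : Thing {
    std::string typeName() const override { return "Invoice"; }
};

void registerOrderTypes(TypeRegistry &registry) {
    registry.registerType("Invoice", [] { return std::make_unique<Invoice>(); });
}
```

An application that only links the travel library would call `registerTravelTypes()` and get `nullptr` back when asking the registry for an `Invoice`. That is the behaviour wanted from a core that carries no domain knowledge of its own.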

The idea is to do this in-place in the current repo for now, as we do not yet have a proper name for the generalized library. Possibly aim for the core part to become a low/medium tier framework.

Refactoring

We will need to add extension points for the domain-specific extensions in the following places:

  • the JSON-LD import filter: probably some kind of @type-to-function registration, plus an @type-to-@type mapping
  • the JSON-LD deserializer: a simple type registration mechanism should cover this
  • the extractor engine: remove the hardcoded types and make the engine generic over mimetypes, so that additional extractor functions can be registered per mimetype. While at it, consider making the interface async.
  • the JS API for the custom extractors: this can probably loosely follow the API that QJSEngine uses for exporting this
  • the post-processor: probably some kind of QMetaObject-to-function registration mechanism, for both filters and validators
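As a rough illustration of two of these extension points, here is a minimal sketch of a mimetype-keyed extractor registration combined with an @type to @type mapping. It is plain standard C++ and synchronous to keep it short. `Result`, `ExtractorFn` and `ExtractorRegistry` are hypothetical names made up for this example, and the actual engine would work on real document nodes and JSON-LD objects rather than strings.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an extracted JSON-LD object:
// just its @type plus one text payload.
struct Result {
    std::string type;
    std::string text;
};

// An extractor takes raw document data and returns zero or more results.
using ExtractorFn = std::function<std::vector<Result>(const std::string &data)>;

class ExtractorRegistry {
public:
    // Domain libraries register additional extractors per mimetype.
    void registerExtractor(const std::string &mimeType, ExtractorFn fn) {
        m_extractors[mimeType].push_back(std::move(fn));
    }

    // Import-filter style mapping from one @type to another.
    void registerTypeMapping(const std::string &from, const std::string &to) {
        m_typeMap[from] = to;
    }

    // Runs every extractor registered for the mimetype, in registration
    // order, and applies the @type mapping to each result.
    std::vector<Result> extract(const std::string &mimeType,
                                const std::string &data) const {
        std::vector<Result> out;
        const auto it = m_extractors.find(mimeType);
        if (it == m_extractors.end()) {
            return out; // nothing registered for this mimetype
        }
        for (const auto &fn : it->second) {
            for (auto &result : fn(data)) {
                const auto mapped = m_typeMap.find(result.type);
                if (mapped != m_typeMap.end()) {
                    result.type = mapped->second;
                }
                out.push_back(std::move(result));
            }
        }
        return out;
    }

private:
    std::map<std::string, std::vector<ExtractorFn>> m_extractors;
    std::map<std::string, std::string> m_typeMap;
};
```

With this shape the engine itself contains no hardcoded document types: a mimetype without registered extractors simply yields an empty result. An async variant would return some kind of future or emit results via a signal instead of returning the vector directly.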

Related to this are attempts to move some of the basic knowledge db features into a low-tier framework.


This page was last edited on 12 September 2020, at 10:23. Content is available under Creative Commons License SA 4.0 unless otherwise noted.