KDE PIM/KItinerary/Generalized schema.org Extractor
Latest revision as of 10:23, 12 September 2020
Generalized schema.org Extractor
Notes from an Akademy 2020 discussion between Dan, Kai and Volker.
Goal
Lift the restriction of kitinerary's extractor engine and data models to travel documents, as two users already want to do more with them:
- PBI: extracting schema.org annotations from websites (business information, locations, etc)
- KMail: extracting order and invoice data from business email
Plan
Split what is now kitinerary into 3+ parts:
- a generic core library containing basic types, JSON-LD de/serialization and the extractor engine
- a domain-specific library with the travel data types, and travel-specific extensions for the extractor (pkpass, vdv, uic918.3, iata, etc)
- a domain-specific library for the order/invoice data model
- additional domain-specific libraries can be added for whatever else PBI needs and isn't covered yet
The idea is to do this in place in the current repo for now, as we do not yet have a proper name for the generalized library. Possibly aim for the core part to become a low/medium tier framework.
Refactoring
We will need to add extension points for the domain-specific extensions in the following places:
- the JSON-LD import filter: probably some kind of @type-to-function registration, and a @type-to-@type mapping
- the JSON-LD deserializer, a simple type registering mechanism should cover this
- the extractor engine. Idea: remove the hardcoded types in there, make it generic on mimetypes, and allow registering additional extractor functions by mimetype. While at it, consider making the interface async.
- the JS API for the custom extractors: this can probably follow, at least roughly, the API QJSEngine uses for exporting things to scripts
- the post-processor: probably some kind of QMetaObject-to-function registration mechanism, for filters and for validators
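To make the mimetype idea for the extractor engine more concrete, here is a minimal sketch of what registering extractor functions by mimetype could look like. It uses plain standard C++ rather than Qt types so that it stands alone. All names in it (ExtractorRegistry, ExtractorFunction, ExtractorResult, registerExtractor, extract) are hypothetical and not existing kitinerary API. It is also synchronous, so it does not address the async question.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: each entry stands in for one serialized JSON-LD object.
using ExtractorResult = std::vector<std::string>;
// Hypothetical extractor signature: raw document data in, extracted objects out.
using ExtractorFunction = std::function<ExtractorResult(const std::string &data)>;

// Sketch of an engine-side registry that knows no document types itself.
class ExtractorRegistry
{
public:
    // Domain-specific libraries would call this to plug in their extractors.
    void registerExtractor(const std::string &mimeType, ExtractorFunction func)
    {
        m_extractors[mimeType].push_back(std::move(func));
    }

    // Runs every extractor registered for the given mimetype and merges the results.
    // An unknown mimetype yields an empty result.
    ExtractorResult extract(const std::string &mimeType, const std::string &data) const
    {
        ExtractorResult results;
        const auto it = m_extractors.find(mimeType);
        if (it == m_extractors.end()) {
            return results;
        }
        for (const auto &func : it->second) {
            const ExtractorResult partial = func(data);
            results.insert(results.end(), partial.begin(), partial.end());
        }
        return results;
    }

private:
    std::map<std::string, std::vector<ExtractorFunction>> m_extractors;
};
```

With something like this, the travel library would register its pkpass or uic918.3 extractors under their mimetypes at startup, and the core engine would only dispatch on the mimetype of the input.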
Related to this are attempts to get some of the basic knowledge db features to a low-tier framework.
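For the JSON-LD import filter extension point listed under Refactoring, the combination of a @type-to-@type mapping and a @type-to-function registration could be sketched roughly as follows. This is only an illustration under simplifying assumptions: a JSON-LD object is modelled as a flat string map, and ImportFilterRegistry, registerTypeMapping, registerFilter and apply are hypothetical names, not existing API.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical flat stand-in for a JSON-LD object: property name -> value.
using JsonLdObject = std::map<std::string, std::string>;
// Hypothetical filter signature: normalizes one object in place.
using FilterFunction = std::function<void(JsonLdObject &)>;

class ImportFilterRegistry
{
public:
    // @type to @type mapping: rewrite one type name to another on import.
    void registerTypeMapping(const std::string &from, const std::string &to)
    {
        m_typeMap[from] = to;
    }

    // @type to function registration: one filter per (already mapped) type.
    void registerFilter(const std::string &type, FilterFunction filter)
    {
        m_filters[type] = std::move(filter);
    }

    // Applies the type mapping first, then the filter registered for the resulting type.
    // Objects without @type, or with an unregistered type, are left unchanged.
    void apply(JsonLdObject &obj) const
    {
        const auto typeIt = obj.find("@type");
        if (typeIt == obj.end()) {
            return;
        }
        const auto mapIt = m_typeMap.find(typeIt->second);
        if (mapIt != m_typeMap.end()) {
            typeIt->second = mapIt->second;
        }
        const auto filterIt = m_filters.find(typeIt->second);
        if (filterIt != m_filters.end()) {
            filterIt->second(obj);
        }
    }

private:
    std::map<std::string, std::string> m_typeMap;
    std::map<std::string, FilterFunction> m_filters;
};
```

A domain-specific library could then register, for example, a mapping from Hotel to LodgingBusiness plus a filter for LodgingBusiness, without the core library knowing either type.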