KDE PIM/Akonadi Next/Design

== Axioms ==
# Personal information is stored in multiple sources (address books, email stores, calendar files, ...)
# These sources may be local, remote, or a mix of local and remote
== Requirements ==
# Local mirrors of these sources must be available to 1..N local clients simultaneously
# Local clients must be able to make (or at least request) changes to the data in the local mirrors
# Local mirrors must be usable without network, even if the source is remote
# Local mirrors must be able to synchronize local changes to their sources (local or remote)
# Local mirrors must be able to synchronize remote changes and propagate those to local clients
# Content must be searchable by a number of terms (dates, identities, body text ...)
# This must all run with acceptable performance on a moderate consumer-grade desktop system
Nice to haves:
# As-close-to-zero-copy-as-possible for data
# Simple change notification semantics
# Source-specific synchronization techniques
# Data agnostic storage
Immediate goals:
# Ease development of new features in existing sources
# Ease maintenance of existing sources
# Make adding new sources easy
# Make adding new types of data or data relations easy
# Improve performance relative to existing Akonadi implementation
Long-term goals:
# Project view: given a query, show all items in all stores that match that query easily and quickly
Implications of the above:
* Local mirrors must support multiple readers, but are probably best served with single-writer semantics: this simplifies both local change recording and remote synchronization by keeping all writes in one process, which can handle write requests (local or remote) sequentially.
* There is no requirement for a central server if the readers can concurrently access the local mirror directly
* A storage system which requires a schema (e.g. relational databases) is a poor fit given the desire for data agnosticism and low memory copying


== A Proposed System Design ==
=== Glossary ===
* Source: the canonical data set, which may be a remote IMAP server, a local iCal file, a local maildir, etc.
* Store: the local mirror for a given data set
* Synchronizer: the component responsible for modifying and synchronizing the store
* Resource: a set of configuration describing which synchronizer to use and with what settings (e.g. server settings, local paths, etc.)
* Message: the atomic unit of a store (an email, a calendar item, a contact, a note, a tag...)
* Filter: a component that takes a message and performs some modification of it (e.g. changes the folder an email is in) or processes it in some way (e.g. indexes it)
* Pipeline: a run-time definable set of filters which are applied to a message after a resource has performed a specific kind of function on it (add, update, remove...)
* Query: a well-defined structure for requesting messages from one or more sources that match a given set of constraints


Each synchronizer will know how to work with a specific source. For each configured instance of that kind of source, a store will be created which will be managed by an instance of the synchronizer. Synchronizers will run in their own process and be responsible for writing changes to the store, synchronizing those changes with the source, and synchronizing the store with changes made in the source. Synchronizers will be plugins implementing a small but well-defined API which allows for command input and notification of changes in the store.

A shell application will load the configuration for a resource, dynamically load the correct synchronizer, and instantiate the correct pipelines. This shell will also open a local socket that follows a convention allowing clients to address a given resource's shell. The shell will be responsible for communication between clients and the resource, queueing changes for later synchronization, and running the pipelines for each class of command.
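
To make the shell's role concrete, here is a minimal startup sketch. It assumes Qt (natural for KDE); the plugin lookup and the "akonadi2-<resourceId>" socket naming convention are invented for illustration and are not fixed by this design.

<syntaxhighlight lang="cpp">
#include <QCoreApplication>
#include <QLocalServer>
#include <QPluginLoader>

// Hypothetical: map a resource id to the synchronizer plugin that its
// configuration names. The filesystem layout here is illustrative only.
static QString lookupPluginPath(const QString &resourceId)
{
    return QStringLiteral("/usr/lib/akonadi2/") + resourceId + QStringLiteral(".so");
}

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);
    const QString resourceId = app.arguments().value(1); // e.g. "imap.work"

    // Load the synchronizer plugin named by the resource configuration.
    QPluginLoader loader(lookupPluginPath(resourceId));
    QObject *synchronizer = loader.instance();
    Q_UNUSED(synchronizer); // command handling and pipelines omitted here

    // Open the per-resource socket that clients use to address this shell.
    QLocalServer server;
    server.listen(QStringLiteral("akonadi2-") + resourceId);

    return app.exec();
}
</syntaxhighlight>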

Stores will

  • be simple key/value stores
  • support multi-reader, single-writer
  • values representing messages will be binary buffers that are directly readable using flatbuffers (provisional choice for buffer implementation)
  • values of secondary indexes will be keys that map to the message entry
  • support tree objects making folder trees or threaded message list structures readable in single, non-recursive requests
  • support lazy loading: entries can be marked as needing to be loaded (e.g. an email folder name but no messages)
  • provide an incrementing revision ID on every message record and tree object; the current maximum revision ID is the store's current revision
  • keep track of the unsynchronized revision IDs (a set of revision ranges, e.g.: 100-150, 160-170 would imply revisions below 100 and between 151 and 159 are synchronized; the empty set means everything is synchronized); this is the store's sync state (see the sketch below)
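
As a sketch of that sync state, the set of unsynchronized revision ranges could be kept as an ordered map from range start to range end; the types and in-memory encoding below are illustrative only, not a fixed on-disk format.

<syntaxhighlight lang="cpp">
#include <cstdint>
#include <map>

class SyncState
{
public:
    // True if the given revision is outside every unsynchronized range.
    bool isSynchronized(uint64_t revision) const
    {
        auto it = m_unsynced.upper_bound(revision);
        if (it == m_unsynced.begin())
            return true; // below every recorded range
        --it;            // the range starting at or before this revision
        return revision > it->second;
    }

    // Record that [from, to] still needs to be pushed to the source.
    void markUnsynchronized(uint64_t from, uint64_t to) { m_unsynced[from] = to; }

    // Simplest case: a previously recorded range has been fully synchronized.
    void markRangeSynchronized(uint64_t from) { m_unsynced.erase(from); }

    // The empty set means everything is synchronized.
    bool allSynchronized() const { return m_unsynced.empty(); }

private:
    // Maps range start -> range end (inclusive), e.g. {100: 150, 160: 170}.
    std::map<uint64_t, uint64_t> m_unsynced;
};
</syntaxhighlight>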

Clients will read directly from the stores using a client library that implements:

  • resource enumeration (e.g. from local configuration data)
  • query of stores
  • discovery of shell sockets
  • command delivery to shells
  • record keeping of the last revision ID it has seen for each store

When reading from a store, the client library will:

  • locate the store and open it for reading
  • check if the shell socket is accepting connections
    • if so, connect to it for change notification
    • if not, watch for the local socket to become available
  • if the local socket becomes available (possibly after having been closed), connect automatically
  • send all modification commands to the shell
    • commands are queued and sent one at a time to the shell (see the sketch after this list)
      • the shell may request more commands before it completes the last one, to drain client-side queues
      • the client-side queue should be considered transient: if the client exits before the queue is drained, the queue is simply discarded
    • if the shell socket is not available, the client library will start the shell (this must allow for multiple clients attempting to do so simultaneously)
  • release all resources when the client is finished with a resource
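
A sketch of the client-side command queueing described above, using Qt's QLocalSocket. The "akonadi2-" socket name and the class shape are assumptions, and starting the shell when the socket is unavailable is elided.

<syntaxhighlight lang="cpp">
#include <QByteArray>
#include <QLocalSocket>
#include <QObject>
#include <QQueue>
#include <QString>

class ResourceAccess
{
public:
    explicit ResourceAccess(const QString &resourceId)
    {
        QObject::connect(&m_socket, &QLocalSocket::connected,
                         [this] { flush(); });
        m_socket.connectToServer(QStringLiteral("akonadi2-") + resourceId);
    }

    void sendCommand(const QByteArray &serializedCommand)
    {
        // Transient queue: simply discarded if the client exits first.
        m_pending.enqueue(serializedCommand);
        if (m_socket.state() == QLocalSocket::ConnectedState)
            flush();
    }

private:
    void flush()
    {
        // One command at a time; a real shell may request more commands
        // before completing the last one, to drain this queue faster.
        while (!m_pending.isEmpty())
            m_socket.write(m_pending.dequeue());
    }

    QLocalSocket m_socket;
    QQueue<QByteArray> m_pending;
};
</syntaxhighlight>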

Synchronizers and filters may update the store via a helper API in the shell application. This helper API will ensure that the revision ID on message records is updated accordingly.
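
For illustration, that helper could look like the following; the names are hypothetical and the actual persistence call is elided.

<syntaxhighlight lang="cpp">
#include <cstdint>
#include <string>

// Hypothetical write helper exposed by the shell to synchronizers and
// filters; every successful write bumps the store's revision.
class StoreWriter
{
public:
    // Returns the revision assigned to this write.
    uint64_t write(const std::string &key, const std::string &value)
    {
        const uint64_t revision = ++m_currentRevision;
        // ... persist key -> (value, revision) in the key/value store ...
        (void)key; (void)value;
        return revision;
    }

private:
    uint64_t m_currentRevision = 0;
};
</syntaxhighlight>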

When a client comes across a to-be-loaded data set in the store it will:

  • ensure the shell is started
  • when a connection has been established, request that the resource load the data set in question

This allows each synchronizer to define what gets lazy-loaded and how, without making lazy loading mandatory.

When a message is requested:

  • the message buffer will be loaded from the store and handed to a flatbuffer object, giving direct access to the data
  • each buffer will contain an identifying tag noting which synchronizer is responsible for it
  • the message read interface of the synchronizer will be loaded into the client at this point (if not already done)
  • the flatbuffer will be passed through this interface for adaptation to an appropriate application-visible API, such as a C++ calendar item class
    • this also means locally stored data (e.g. in a maildir) need not be replicated wholesale in the store: the synchronizer can load the relevant mail from disk at this point, with the message buffer in the store merely providing the correct local path to the mail

This also provides an inexpensive way to get at only the data necessary, such as for mail header lists, as flatbuffers can read from a memory-mapped buffer without parsing the entire contents.
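
A sketch of that zero-copy path, assuming a hypothetical mail.fbs schema and the header flatc would generate from it; only flatbuffers::GetRoot and the POSIX mmap calls are real APIs here.

<syntaxhighlight lang="cpp">
// Hypothetical schema (mail.fbs), compiled with flatc to mail_generated.h:
//
//   table Mail { synchronizer: string; subject: string; localPath: string; }
//   root_type Mail;

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include "mail_generated.h" // hypothetical, produced by flatc

// Map a stored message buffer and access it without any deserialization:
// flatbuffer accessors read fields straight out of the mapped memory, so
// e.g. a header list only touches the bytes it actually needs.
const Mail *mapMessage(const char *path)
{
    const int fd = open(path, O_RDONLY);
    if (fd < 0)
        return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return nullptr;
    }
    void *buf = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping remains valid after closing the descriptor
    if (buf == MAP_FAILED)
        return nullptr;
    return flatbuffers::GetRoot<Mail>(buf); // no parsing step
}

// Usage: const Mail *mail = mapMessage("/path/to/value");
//        if (mail) { /* mail->subject()->c_str(), mail->localPath(), ... */ }
</syntaxhighlight>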

Commands include the following classes:

  • Synchronize: this triggers a synchronization of the store with the source; this may result in the store's revision increasing
  • Modify: this replaces the existing data in the store with the new data; this causes the store's revision to increase
  • Delete: removes a message entry from the local store and adds a removal entry; this causes the store's revision to increase
  • Create: this adds a new message to the store; this causes the store's revision to increase
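
The command classes above might be represented on the wire roughly as follows; the enumeration and the struct layout are illustrative, not a fixed protocol (the real commands would presumably be serialized as flatbuffers).

<syntaxhighlight lang="cpp">
#include <cstdint>
#include <string>

enum class CommandType : uint8_t {
    Synchronize, // sync store with source; may raise the store revision
    Modify,      // replace existing data; raises the store revision
    Delete,      // add a removal entry; raises the store revision
    Create       // add a new message; raises the store revision
};

struct Command {
    CommandType type;
    std::string messageKey; // empty for Synchronize
    std::string payload;    // serialized message buffer for Modify/Create
};
</syntaxhighlight>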

On syncronization:

  • The synchronizer replicates changes in the source to the store
  • The synchronizer replicates changes in the store to the source
    • This may involve conflict resolution
    • This may spawn sets of change commands (modify, delete, create) for the store (which puts source and client mutations through the same pipelines)
    • The synchronized ID is updated in the store sync state to reflect progress and prevent duplicate replays

Modify/delete/create commands result in:

  • The change being queued in the store by the synchronizer
  • The relevant pipeline being run, which may create further changes to the message
  • The resulting message is recorded in the store
  • If not the result of synchronization, the synchronizer may then elect to push the changes immediately to the source (tracked in the store's sync state)

Pipelines include:

  • Message mutation (e.g. local mail filter, spam/scam detection)
  • Content indexing (e.g. full text indexing)
  • Update indexes (add to index on new, change indexes on modify, remove from indexes on delete)
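
A run-time definable pipeline could be as simple as an ordered list of filter callables, as in this sketch (the types and the filter names in the final comment are hypothetical):

<syntaxhighlight lang="cpp">
#include <functional>
#include <string>
#include <vector>

using Message = std::string; // stand-in for the real message buffer type
using Filter = std::function<void(Message &)>;

// A pipeline is just the ordered set of filters configured for one command
// class (add, update, remove, ...).
struct Pipeline {
    std::vector<Filter> filters;

    void process(Message &msg) const
    {
        for (const Filter &filter : filters)
            filter(msg); // e.g. spam detection, then full-text indexing
    }
};

// Hypothetical assembly for the "create" command class:
//   Pipeline createPipeline{{spamCheck, fullTextIndex, updateIndexes}};
</syntaxhighlight>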

On store revision increase:

  • The shell notifies every client connected to the local socket of the new revision
  • Clients request, at their leisure, all changes since their last revision
    • This is the point at which a client will show the user the modifications they requested (e.g. marking a mail as read) ... which implies these roundtrips have to be damn fast, but also that the client always reflects the current real state of the store!
  • Once all clients have received removal entries (satisfied also by all clients disconnecting) and the removal entry has been synchronized, it is removed from the store
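
The client side of this could look like the sketch below; fetchChangesSince stands in for reading the store directly, and all names are hypothetical.

<syntaxhighlight lang="cpp">
#include <cstdint>

// Tracks the last store revision this client has processed and catches up
// whenever the shell announces a higher one.
class ChangeTracker
{
public:
    void onRevisionChanged(uint64_t newRevision)
    {
        if (newRevision <= m_lastSeenRevision)
            return; // nothing new
        // Read all entries with a revision in (m_lastSeenRevision, newRevision]
        // directly from the store; this is the moment user-requested changes
        // become visible, hence the emphasis on fast roundtrips.
        fetchChangesSince(m_lastSeenRevision, newRevision);
        m_lastSeenRevision = newRevision;
    }

private:
    void fetchChangesSince(uint64_t from, uint64_t to)
    {
        (void)from; (void)to; // store access elided in this sketch
    }

    uint64_t m_lastSeenRevision = 0;
};
</syntaxhighlight>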

Shell lifecycle management:

  • If a shell crashes, the client library is responsible for restarting it on the next command request (or repeating the last command?)
  • Each client that connects and sends a command is flagged as such in the shell's connection bookkeeping
  • When the last client that sent a command disconnects, an automatic shutdown algorithm begins:
    • all pipelines are drained (if in progress) and no more actions are allowed to be started
    • if a command from one of the remaining clients (or a newly connected client) arrives, the algorithm is aborted and the shell goes back to normal operation
    • new client connections are refused
    • any write handle to the store is closed
    • all remaining clients (which by definition have not sent a command, so are only listening) are disconnected
    • the shell process exit()s
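
In code, the shutdown algorithm might look like this sketch; all member names are hypothetical and the connection bookkeeping is elided.

<syntaxhighlight lang="cpp">
#include <cstdlib>

struct Shell {
    int commandClientCount = 0;             // clients flagged as command senders
    bool commandArrivedDuringDrain = false; // set by the command handler

    void drainPipelines()        { /* finish in-flight filters, start none */ }
    void refuseNewConnections()  { /* stop accepting on the local socket */ }
    void closeStoreWriteHandle() { /* release the single-writer handle */ }
    void disconnectListeners()   { /* remaining clients only listen; drop them */ }

    // Called when the last command-sending client disconnects.
    void maybeShutdown()
    {
        if (commandClientCount > 0)
            return; // a command-sending client is still connected
        drainPipelines();
        if (commandArrivedDuringDrain)
            return; // abort: back to normal operation
        refuseNewConnections();
        closeStoreWriteHandle();
        disconnectListeners(); // they will reconnect on the next autostart
        std::exit(0);
    }
};
</syntaxhighlight>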

Autostart on command queueing on the client side will restart the shell automatically and cause all non-command-issuing clients to reconnect as well. This allows resources to come and go on an as-needed basis, as they are only required for processing commands and updating clients on revision ID changes.

With the above, all that is required is the shell application, a resource plugin API, a filter plugin API and a client-side library. All reads will happen in the client process (though potentially in a separate thread in that process), all writes will happen in the auto-started shell process. No central server is required. No database process is required.

This means all data retrieval and collation is done on the client side. For instance:

  • Collating results between different resources will be a client-side task. (Project view...)
  • Requesting all emails in a given folder will be done via a query processed client-side (note: query!)
  • Requesting all calendar events from a given set of calendars for a given set of dates will be done through a similar query (yes: just one query)

In turn, this opens the door for a low-latency, low memory-usage, high-performance, declarative approach to data retrieval.
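
To make the declarative flavor concrete, a query structure and its use could look like this sketch; every name here (Query, Constraint, runQuery, the resource ids) is hypothetical.

<syntaxhighlight lang="cpp">
#include <string>
#include <vector>

// Hypothetical declarative query structure.
struct Constraint {
    std::string property; // e.g. "type", "dateRange", "folder"
    std::string value;    // value or pattern to match
};

struct Query {
    std::vector<std::string> resources;  // stores to consult; empty = all
    std::vector<Constraint> constraints; // ANDed together in this sketch
    bool liveQuery = false;              // keep results updated on revision changes
};

int main()
{
    // "All calendar events from a given set of calendars for a given set of
    // dates" expressed as one query, collated client-side across both stores:
    Query q;
    q.resources = {"caldav.work", "ical.personal"};
    q.constraints = {{"type", "event"},
                     {"dateRange", "2014-12-01..2014-12-31"}};
    q.liveQuery = true;
    // for (const auto &event : runQuery(q)) { ... } // runQuery: hypothetical
    return 0;
}
</syntaxhighlight>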