Result of our requirements brainstorming session on Friday

Flat Access
Categorization by Attributes
Job Priorities
Virtual Folders
Filtering
Access to parts of objects (mimetypes)
Change Notification
Shared Cache
Asynchronous Access
Out-of-process service plugins
Online/Offline state management
PIM Object Handle
No hard locks
Conflict Handling
Referencing objects (local or on server)
Capacities and Capabilities of Storage Backends
URI scheme to identify resources
Lazy Loading
Copy-on-write implementation for PIM objects (using snapshots)
Using Changesets
Syncing with groupware servers
Undo
Resources (storage units)
Non-global resource activation profiles

Till's mail about the requirements from the mail side

From: Till Adam <[email protected]>
To: KDE PIM <[email protected]>, [email protected]
Date: Thu, 6 Oct 2005 09:18:58 +0200

On Thursday 06 October 2005 08:15, Cornelius Schumacher wrote:
> On Thursday 06 October 2005 05:06, Mark Bucciarelli wrote:
> > With this approach, I imagine we would see gains of two orders of
> > magnitude in memory usage for large (year-long) files. If korg only
> > loads event headers for the current month, then startup would be a
> > constant speed no matter how large the data set.
>
> That's what I called "proxy objects" in my reply to the PIM daemon
> proposal. The drawback of this would be that you have a delay when loading
> the missing data. When you for example navigate through several months in
> KOrganizer then you would see an empty month at first and the events would
> pop up later. Not a very user-friendly solution. It would also mean that if
> you open an editor there would be a delay until all the data is loaded, so
> that you would start with an empty, disabled editor and the content of the
> fields would be filled in later until you are finally able to use the
> editor. Not pretty.

I've been thinking about the mail side of things a bit and come to the
conclusion that something like a proxy or facade object is definitely needed
for mail. We currently have three sorts of pointers to messages, and then two 
flags per message that signal the state of their "completeness". This is not 
sufficient, and the fact that pointers to messages go away and are replaced by
something else is a major problem and our number one source of crashes. Yet
the reason for this design was the need to have something extremely
lightweight to represent a message until more information is needed, because
otherwise a folder with 10000 or 100000 mails would become completely 
unusable. Keeping the lazily loaded information available readily enough
that the user perceives little or no delay when requesting it is of course a
challenge, and retrieval across the network also has physical limitations,
but online IMAP in KMail, which already works like that to an extent, proves
that it can be done. Caching could be a lot better, but more on that later.
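
A minimal sketch of such a proxy object, in Python purely for illustration.
The names MessageProxy and loader are hypothetical and not part of KMail; the
point is only that the handle handed out to callers stays the same object
while its contents are filled in on demand, so nothing has to be replaced
behind the caller's back.

```python
from typing import Callable, Optional


class MessageProxy:
    """Lightweight stand-in for a message that fills itself in on demand."""

    # __slots__ keeps each skeleton small, which matters for huge folders.
    __slots__ = ("sernum", "_loader", "_envelope", "_body")

    def __init__(self, sernum: int, loader: Callable[[int, str], str]) -> None:
        self.sernum = sernum
        self._loader = loader  # hypothetical callback into the storage backend
        self._envelope: Optional[str] = None
        self._body: Optional[str] = None

    def envelope(self) -> str:
        # Loaded on first access, then kept on the same object.
        if self._envelope is None:
            self._envelope = self._loader(self.sernum, "envelope")
        return self._envelope

    def body(self) -> str:
        if self._body is None:
            self._body = self._loader(self.sernum, "body")
        return self._body

    def is_complete(self) -> bool:
        # Replaces separate "completeness" flags: derived from what is loaded.
        return self._envelope is not None and self._body is not None
```

A caller can hold on to one MessageProxy for the lifetime of the message; the
loader is only invoked the first time a given part is requested.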

Before I go into details a couple of general comments:

I think a design meeting is a great idea, I would welcome it.

I agree that this design is crucial, should be very well thought out, and not 
rushed in any way. We need to get this right.

I agree that we should look at EDS and also other solutions; they must have
solved many of the same problems. If compatibility with EDS seems achievable,
I would consider that a worthy goal, but not if it hurts our power or 
flexibility.

Braindump of my musings on mail storage thus far, in no particular order:

- mails are identified by a globally unique serial number (whether to expose
that outside of KMail or use a URI scheme for that - possibly including the
sernum - is a separate discussion)

- there is a one to one mapping between the serial number and a ref-counted 
pointer to a Message object, which is initially an empty skeleton, containing 
no information beyond the serial number

- internally, the mail store holds mappings of serial number, storage URL and
cache URL

- the Message API allows retrieval of those parts of the mail that are needed, 
such as Envelope (what is needed for display in the headers list), Headers, 
body parts, etc. If they are in the cache, they come from there, otherwise 
from the storage location (server)

- access to all of these parts is asynchronous, with possibly synchronous
convenience wrappers where access needs to be immediate for performance
reasons and can reasonably be expected to be immediate, such as envelope
requests

- caching policies, which can apply to accounts, folders, even messages,
govern how much information of a mail is locally present, and how much of the
lazily loaded information that is not present initially is kept around. This
allows scenarios such as "in this folder, don't download anything from the
IMAP server beyond the envelope, but if I look at the mail, keep the bodies
around in the cache", or "sync everything for this account, but not
attachments, and not mails over 5 MB on mailcheck or mails in my SPAM folder"

- messages (sernums) can have an arbitrary set of category flags associated 
with them, a la GMail labels, references to other PIM data, via URIs maybe

- storage folder location can be used as one (but not the only) grouping 
criterion, possibly modelled as a category flag, internally

- local mail (cache) storage is in maildir format; a local maildir account is
simply one with cache URL == storage URL (implementation detail)

- the current folderstorage subclasses become machines for mapping storage URL 
to cache URL and shifting data from one to the other on request

- the internal mapping of sernum, storage URL, cache URL, category flags and 
performance critical envelope data (what used to be the index) is stored in a 
relational database, such as SQLite, which provides central, transactional, 
integrity guaranteed access to that information through the API 
(implementation detail)

- I imagine access to all of this via a libemailstorage (or even libpimdata, 
or something) which dishes out handles to read-only (vast majority, for mail) 
and read-write instances of mails, handles locking, copy-on-write, etc. 
Whether that is implemented via a server process, which the lib talks to, or
by concurrent access to the above-mentioned database is a yet-to-be-resolved
implementation detail, and mostly orthogonal to the storage layer API, I
believe
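
The index and the cache-first lookup from the list above could be sketched
roughly as follows, using Python's sqlite3 module. This is an illustration
only: the table layout, the MailStore class, the fetch_remote hook, the keep
flag (a stand-in for a caching policy) and the in-memory dict playing the role
of the maildir cache are all assumptions, not an existing API.

```python
import sqlite3
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical schema: one row per mail, plus free-form category flags.
SCHEMA = """
CREATE TABLE message (
    sernum      INTEGER PRIMARY KEY,   -- globally unique serial number
    storage_url TEXT NOT NULL,         -- where the mail really lives
    cache_url   TEXT NOT NULL,         -- where the local copy goes
    subject     TEXT                   -- stand-in for envelope data
);
CREATE TABLE category (
    sernum INTEGER NOT NULL REFERENCES message(sernum),
    label  TEXT NOT NULL,
    PRIMARY KEY (sernum, label)
);
"""


class MailStore:
    def __init__(self, fetch_remote: Callable[[str, str], bytes]) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.executescript(SCHEMA)
        self.fetch_remote = fetch_remote  # hypothetical backend hook
        self.cache: Dict[Tuple[str, str], bytes] = {}  # stands in for maildir

    def add(self, storage_url: str, cache_url: str, subject: str,
            labels: Iterable[str] = ()) -> int:
        # "with self.db" commits on success and rolls back on an exception.
        with self.db:
            cur = self.db.execute(
                "INSERT INTO message (storage_url, cache_url, subject) "
                "VALUES (?, ?, ?)", (storage_url, cache_url, subject))
            sernum = cur.lastrowid
            self.db.executemany(
                "INSERT INTO category (sernum, label) VALUES (?, ?)",
                [(sernum, label) for label in labels])
        return sernum

    def with_label(self, label: str) -> List[int]:
        # Folders can be modelled as just another label, e.g. "folder:INBOX".
        rows = self.db.execute(
            "SELECT sernum FROM category WHERE label = ? ORDER BY sernum",
            (label,))
        return [row[0] for row in rows]

    def part(self, sernum: int, name: str, keep: bool = True) -> bytes:
        row = self.db.execute(
            "SELECT storage_url, cache_url FROM message WHERE sernum = ?",
            (sernum,)).fetchone()
        if row is None:
            raise KeyError(sernum)
        storage_url, cache_url = row
        key = (cache_url, name)
        if key in self.cache:  # cache hit: no trip to the server
            return self.cache[key]
        data = self.fetch_remote(storage_url, name)
        if keep:  # the policy decides whether fetched data stays around
            self.cache[key] = data
        return data
```

Treating the folder as one more label keeps the grouping query uniform, and
the keep flag shows where an account, folder or message policy would plug in.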

Open questions:

- how do accounts fit in? Should an account be a set of credentials for access 
to a set of storage URLs plus a set of attributes, such as cache policies, 
and managed by a pim-wide entity? How about connection tracking, is that 
orthogonal?

- can all of the special features of certain server types (IMAP, GroupWise,
HTTPMail, etc.) be integrated into such a scheme? Things like quota, ACLs, etc.

- what should the query language look like? A special API, aware of mail
semantics? Should URI schemes (mail:/#12345/headers/from,
mail:/#12345/body/attachment) be used, or SQL, or IMAP?

- how to integrate this with Interview? Should folders be filtered (proxy) 
models on a global mailstore model? Sorting and threading as sorted (proxy) 
models on top of that? How much of that should be in the library, and how 
much in KMail? Does it make sense to be able to display a folder in any 
QAbstractItemView? 

- probably many more ...
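
For the URI-scheme option among the query-language ideas above, a small parser
for the two example URIs might look like the sketch below. MailRef and
parse_mail_uri are illustrative names only, and the grammar is inferred from
the two examples rather than from any agreed specification. One thing to keep
in mind: in generic URI syntax (RFC 3986) a "#" introduces the fragment, so a
generic URI parser would treat everything after it as a fragment rather than
as a path, which is why this sketch matches the string by hand.

```python
import re
from typing import NamedTuple, Tuple


class MailRef(NamedTuple):
    sernum: int             # serial number of the mail
    path: Tuple[str, ...]   # requested part, e.g. ("headers", "from")


# Assumed grammar: mail:/#<sernum>[/<segment>]...
_MAIL_URI = re.compile(r"mail:/#(\d+)((?:/[^/]+)*)")


def parse_mail_uri(uri: str) -> MailRef:
    match = _MAIL_URI.fullmatch(uri)
    if match is None:
        raise ValueError(f"not a mail URI: {uri!r}")
    sernum = int(match.group(1))
    # Split "/headers/from" into its non-empty segments.
    path = tuple(part for part in match.group(2).split("/") if part)
    return MailRef(sernum, path)
```

With this, parse_mail_uri("mail:/#12345/headers/from") yields the serial
number 12345 and the path ("headers", "from"), which a store could then map
onto a request for that part of the mail.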