So Akonadi is already a “cache” for your PIM-data, and now we’re trying hard to feed all that data into a second “cache” called Nepomuk, just for some searching? We clearly must be crazy.
Keeping these two caches in sync is not entirely trivial, storing the data in Nepomuk is rather expensive, and obviously we’re duplicating all the data. Rest assured, we have our reasons though.
- Akonadi handles the payload of items stored in it transparently, meaning it has no idea what it is actually caching (apart from some hints such as mimetypes). While that is a very good design decision (it gives great flexibility), it has the drawback that we can’t really search for anything inside the payload, because we don’t know what we’re searching through, where to look, etc.
- The solution to the searching problem is of course building an index, which is a cache of all data optimized for searching. It essentially structures the data in a way that content->item lookups become fast (while normal usage goes the other way round, from item to content). So that already means duplicating all your data (more or less), because we’re trading disk space and memory for search speed. And Nepomuk is what we’re using as the index for that.
Now, there would of course be simpler ways to build a search index than using Nepomuk. But Nepomuk offers far more opportunities than a simple, text-based index, allowing us to build awesome features on top of it, while the simple index would essentially be a dead end.
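To make the "content->item lookup" idea concrete, here is a minimal sketch of the simple, text-based kind of index mentioned above. It is not how Nepomuk works internally; the item ids, the sample texts and the function names `build_index` and `search` are all made up for illustration.

```python
from collections import defaultdict


def build_index(items):
    """Map each lowercase word to the set of item ids whose text contains it."""
    index = defaultdict(set)
    for item_id, text in items.items():
        for word in text.lower().split():
            index[word].add(item_id)
    return index


def search(index, query):
    """Return the ids of items that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical item ids and payload text, for illustration only.
items = {
    1: "Meeting notes for the release",
    2: "Release schedule and party",
    3: "Lunch on Friday",
}
index = build_index(items)
matches = search(index, "release party")
```

Note that the index holds every word of every item a second time, which is exactly the duplication described above: disk space and memory are traded for lookup speed.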
To build that cache we’re doing the following:
- analyze all items in Akonadi
- split them up into individual parts such as (for an email example): subject, plaintext content, email addresses, flags
- store that separated data in Nepomuk in a structured way
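As a rough illustration of the "split them up" step for the email case, the sketch below uses Python's standard `email` module rather than the actual feeder code. The sample message, the `split_email` helper and the dictionary keys are assumptions made for this example; flags are left out because they are not part of the raw message text used here, and only simple non-multipart messages are handled.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical sample message, for illustration only.
RAW = """\
From: Alice <alice@example.org>
To: Bob <bob@example.org>, carol@example.org
Subject: Release plan

Let's ship on Friday.
"""


def split_email(raw):
    """Split a simple, non-multipart email into separately searchable parts."""
    msg = message_from_string(raw)
    return {
        "subject": msg.get("Subject", ""),
        # getaddresses() yields (display name, address) pairs; keep the address.
        "sender": [addr for _, addr in getaddresses(msg.get_all("From", []))],
        "recipients": [addr for _, addr in getaddresses(msg.get_all("To", []))],
        # For a non-multipart message the payload is the body text.
        "plaintext": msg.get_payload().strip(),
    }


parts = split_email(RAW)
```

Each entry of `parts` could then be stored as its own piece of structured data instead of one opaque blob.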
This results in networks of data stored in Nepomuk:
PersonA [hasEMailAddress] addressA
PersonA [hasEMailAddress] addressB
emailA [hasSender] addressA
emailB [hasSender] addressB
So this “network” relates emails to email addresses, email addresses to contacts, and contacts to actual persons. Suddenly you can ask the system for all emails from a person, no matter which of the person’s email addresses was used in the mails. Of course we can add to that IM conversations with the same person, or documents you exchanged during that conversation, … the possibilities are almost endless.
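The "all emails from a person" question can be sketched with the statements above held as plain Python tuples. This is only a toy model: Nepomuk is not queried this way, and the helper names `objects`, `subjects` and `emails_from`, as well as the extra `emailC` statement, are invented for the example.

```python
# Statements as (subject, predicate, object), mirroring the example above.
triples = [
    ("PersonA", "hasEMailAddress", "addressA"),
    ("PersonA", "hasEMailAddress", "addressB"),
    ("emailA", "hasSender", "addressA"),
    ("emailB", "hasSender", "addressB"),
    # Extra statement, not from the text, to show unrelated mail is excluded.
    ("emailC", "hasSender", "addressC"),
]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is stored."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(triples, predicate, obj):
    """All subjects s such that (s, predicate, obj) is stored."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def emails_from(triples, person):
    """Follow person -> addresses -> emails across the two relations."""
    found = set()
    for address in objects(triples, person, "hasEMailAddress"):
        found |= subjects(triples, "hasSender", address)
    return found


result = emails_from(triples, "PersonA")
```

Here `result` contains both `emailA` and `emailB`, even though they were sent from different addresses, because the lookup goes through the person rather than through a single address.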
Based on that information, much more powerful interfaces can be written. For instance, one could write a communication tool which no longer cares which communication channel you’re using. It would dynamically mix IM and email depending on whether and where the other person is currently available for a chat, or would rather have a mail that can be read later on, and it would do so without splitting the conversation across various mail and chat interfaces.
This is of course just one example of many (nor am I claiming the idea as my own, it’s just a nice example of what is possible).
So that’s basically why we took the difficult route for searching (at least that is why I am working on this).
Now, we’re not quite there yet, but we are already starting to see the first fruits of our labor:
- KMail can now automatically complete addresses from all emails you have ever received
- Filtering in KMail does full-text searching, making it a lot easier to find old conversations
- The KPeople library already uses this data for contact merging, which will result in a much nicer address book
- And of course having the data available in Nepomuk enables other developers to start working with it
I’ll follow up on this post with some more technical background on how the feeders work, and possibly some information on the problematic areas from a client perspective (such as the address auto-completion in KMail).