Amarok/Development/DynamicCollectionDevel

This is a simple copy of the original Dynamic Collection wiki page. While the old page is mainly aimed at users, this page still contains the original developer information.

About dynamic collection

Amarok currently does not support tracks on removable devices very well. It assumes that all tracks which are part of the collection are available all the time and that their path does not change over time. Dynamic collection is the working title I just invented for a project to improve the way Amarok identifies the tracks in its collection (feel free to propose others). It should solve much of bug #87391.

Please note that this is only about including files which are not on local disks in Amarok's local collection DB in a better way. It is not about actually using remote collection databases. If I understood Andrew correctly, his SoC proposal was about just that, but by now his ideas are pretty close to mine. Andrew, maybe you have time to add your thoughts to this page.

Current state of Amarok

Amarok uses an audio file's absolute path as a unique identifier. That works fine as long as Amarok's whole collection is permanently accessible under the same path. When parts of the collection are not stored that way, e.g. on external hard disks or network shares, well-known problems arise:

Amarok forgets about the files if incremental scanning is activated and the device is not mounted
Amarok does not handle changing mount points.

The idea behind dynamic collection

To play a file, a valid URL pointing to that file is required. The core of my idea is a change of the way Amarok stores that URL. Each URL can be split into two parts, a first part which identifies the device a file is stored on, and a second part which gives the relative path of the file on that device.

Amarok can then automagically generate a unique identifier of the device and store that identifier together with the relative path in the collection database. All files on the same device would have the same identifier, but different relative paths. Instead of a single field as primary key, Amarok would then use the device identifier and the relative path as a composite primary key.

For this to work, we need ways to identify devices. Some ideas:

the HAL volume.uuid if available
Andrew proposed a hash of the server's name/IP and the share name for network shares (ie hashing on the URL)
For CDs:

CD label for data CDs, if sufficient (md5 hash of the CD's TOC? how large is a TOC?)
the freedb.org disc id - there should be code in KsCD or KAudioCreator

for a fallback, we could just hash on the path of the mount point. loses some functionality, but you expect that with a fallback solution
if everything fails: use a special value (like 0 or -1) to mark that the URL/directory field contains the absolute path and not a relative path. Probably the best thing to do when upgrading the database. (Do we need to upgrade the DB? Wouldn't bumping the version number and thus rescanning be OK? -- the statistics table uses a song's URL as PK and would have to be migrated too...rescanning would give us deviceIds for all songs in the tags table, but we would not have the deviceIds in the statistics table->statistics would not be associated with songs anymore. It might be solvable with additional migration code)
your ideas here....
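The ideas in this list could be tried in order, most specific first, ending with the special "absolute path" value. The following is only a minimal sketch of that fallback chain. The function names, the `info` dictionary keys and the string prefixes are all assumptions made for illustration, and SHA-1 is used here merely as an example hash.

```python
import hashlib
from typing import Callable, List, Optional

# Hypothetical special value meaning "no device identified, the stored path is absolute".
ABSOLUTE_PATH_DEVICE = "-1"


def _digest(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def id_from_volume_uuid(info: dict) -> Optional[str]:
    # Use the volume UUID when the system reports one.
    uuid = info.get("volume_uuid")
    return f"uuid:{uuid}" if uuid else None


def id_from_network_share(info: dict) -> Optional[str]:
    # Hash of server name/IP plus share name, as proposed for network shares.
    server, share = info.get("server"), info.get("share")
    if server and share:
        return "share:" + _digest(f"{server}/{share}")
    return None


def id_from_mount_point(info: dict) -> Optional[str]:
    # Fallback: hash of the mount point path (breaks if the mount point changes).
    mount_point = info.get("mount_point")
    return "mount:" + _digest(mount_point) if mount_point else None


STRATEGIES: List[Callable[[dict], Optional[str]]] = [
    id_from_volume_uuid,
    id_from_network_share,
    id_from_mount_point,
]


def identify_device(info: dict) -> str:
    # Return the first identifier any strategy can produce.
    for strategy in STRATEGIES:
        device_id = strategy(info)
        if device_id is not None:
            return device_id
    return ABSOLUTE_PATH_DEVICE
```

A CD-specific strategy (label or disc id) would slot into `STRATEGIES` in the same way.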

"Plugins" is probably the first thing which comes to one's mind when reading that list.

There are probably better terms than device which could be used here, but it is the only one which came to my mind while writing this. It can mean, among others, partitions on external hard disks, CD-ROMs or mounted network shares.

Amarok can replace a file's absolute path with a device id and a relative path just before storing the song in the database, and generate a currently valid absolute path from the device id and relative path right after retrieving the information from the collection database. This would make the change quite transparent for the rest of the code.
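The two conversions could look roughly like the sketch below. It is not the code from the proof-of-concept patch: the class name `MountTable`, the integer device ids and the `ABSOLUTE` marker are assumptions chosen for illustration.

```python
import posixpath
from typing import Dict, Tuple

# Hypothetical marker: the stored path is absolute because no device matched.
ABSOLUTE = -1


class MountTable:
    """Maps device ids to the mount points that are valid right now."""

    def __init__(self, mounts: Dict[int, str]):
        self._mounts = {dev: posixpath.normpath(mp) for dev, mp in mounts.items()}

    def to_stored(self, absolute_path: str) -> Tuple[int, str]:
        # Called just before writing to the database.
        path = posixpath.normpath(absolute_path)
        best = None
        for dev, mount_point in self._mounts.items():
            prefix = mount_point.rstrip("/") + "/"
            # Prefer the longest matching mount point (e.g. /media/usb over /).
            if path.startswith(prefix):
                if best is None or len(mount_point) > len(self._mounts[best]):
                    best = dev
        if best is None:
            return ABSOLUTE, path
        return best, path[len(self._mounts[best].rstrip("/")) + 1:]

    def to_absolute(self, device_id: int, relative_path: str) -> str:
        # Called right after reading from the database.
        # Raises KeyError if the device is not currently mounted.
        if device_id == ABSOLUTE:
            return relative_path
        return posixpath.join(self._mounts[device_id], relative_path)


# The same stored pair resolves correctly after the stick is mounted elsewhere.
stored = MountTable({1: "/media/usb", 2: "/"}).to_stored("/media/usb/rock/a.mp3")
remounted = MountTable({1: "/mnt/stick", 2: "/"}).to_absolute(*stored)
```

Because only these two functions know about mount points, the rest of the code can keep working with absolute paths.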

It may prove sensible to eventually remove the current URL field, given that it would be generated from the device ID and relative path and would thus be redundant. This would be a large task (with little gain other than a slightly more logical DB schema) and would be best done after the new system is fully working.

Necessary refactoring

In the proof-of-concept patch I sent to the amarok mailing list, the class MountPointManager encapsulates all the code to generate a device id (usually called mediaid there) and a relative path from an absolute path and vice versa. CollectionDB and QueryBuilder handle the changed DB schema as transparently as possible for the rest of the code. Generally, the existence of the dynamic collection can be hidden in the persistence layer to a very high degree.

At least one media device plugin (iPod) stores the file's path in the media device's database and uses it to update Amarok's statistics. That would need to be changed.
A recent commit allows Amarok to use inotify to watch for directory changes. That won't work for unmountable devices (does it work for mountable network shares?). How are we going to implement a directory watch for those? The directory change date might not be available either.
more stuff here...

SQL changes

Everywhere Amarok uses the URL or directory as a return value, we have to add the device id field as a return value too. Additionally, when we use the URL or directory value as a filter, we have to filter using both the URL/directory field (containing the relative path) and the device id field. Special case: filtering using LIKE on the URL/directory field. It is not going to work if the search string matches a part of the mount point, because the mount point is no longer part of the stored value.

We can restrict all SQL queries to the songs which are currently available by adding deviceId IN (<list of available device ids>) to the query's where clause.
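As a rough illustration of that where clause, here is a small sketch using Python's built-in `sqlite3` module. The column names `deviceid`, `url` and `title` and the helper `available_tracks` are assumptions for this example and are not taken from Amarok's actual schema.

```python
import sqlite3
from typing import Iterable, List, Tuple


def available_tracks(conn: sqlite3.Connection,
                     available_device_ids: Iterable[int]) -> List[Tuple]:
    ids = list(available_device_ids)
    if not ids:
        # "IN ()" is not valid SQL, and no device means no playable track anyway.
        return []
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT deviceid, url, title FROM tags "
        f"WHERE deviceid IN ({placeholders}) ORDER BY title"
    )
    return conn.execute(sql, ids).fetchall()


conn = sqlite3.connect(":memory:")
# Composite primary key: device id plus relative path.
conn.execute(
    "CREATE TABLE tags (deviceid INTEGER, url TEXT, title TEXT, "
    "PRIMARY KEY (deviceid, url))"
)
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?)",
    [(1, "rock/a.mp3", "A"), (2, "jazz/b.ogg", "B"), (3, "pop/c.mp3", "C")],
)

# Device 2 is not mounted, so its track is left out of the result.
mounted = available_tracks(conn, [1, 3])
```

Leaving the filter out gives the "whole collection" view, including tracks on devices that are currently missing.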

We might think about introducing an INTEGER primary key field to the tags table while we are refactoring Amarok's SQL queries anyway. It is easier to handle a primary key consisting of a single field than a composite primary key, and we could add the primary key as a uint field to Metabundle. That would allow us direct access to a song's primary key just like it is now with the song's path. --Mkossick 19:49, 9 June 2006 (EDT)

Access protocols

Amarok uses only the file access protocol at the moment. If I understand the news from k4m correctly, one of the devs there made it possible to play songs directly using network protocols like scp. Using a URL like scp://hostname/aDir/aSong as an example, scp://hostname simply identifies the device, and aDir/aSong is the relative path on that device.

can't rely on scp://hostname to stay constant, still need some way to identify it

we can allow the user to change the IP/hostname/share. We need a dialog to manage the devices anyway.
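A small sketch of this split, together with the user-editable mapping from a stable device id to its current address. The function names and the example id `7` are hypothetical; only `urllib.parse.urlsplit` is a real library call.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit


def split_remote_url(url: str) -> Tuple[str, str]:
    # "scp://hostname/aDir/aSong" -> ("scp://hostname", "aDir/aSong")
    parts = urlsplit(url)
    device_root = f"{parts.scheme}://{parts.netloc}"
    return device_root, parts.path.lstrip("/")


def rebuild_url(device_roots: Dict[int, str], device_id: int, relative_path: str) -> str:
    # device_roots holds the address currently configured for each device id.
    return f"{device_roots[device_id]}/{relative_path}"


root, relative = split_remote_url("scp://hostname/aDir/aSong")

# The database keeps (7, relative); the address lives in an editable table.
device_roots = {7: root}
before = rebuild_url(device_roots, 7, relative)

# The user changes the host in the device management dialog.
device_roots[7] = "scp://10.0.0.5"
after = rebuild_url(device_roots, 7, relative)
```

Since the stored key is the device id rather than the host name, changing the address does not touch any row in the tags table.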

ATF and dynamic collection

ATF associates a unique value with a file to keep track of the file even if the user moves it around/renames it outside of Amarok. At the moment, ATF uses the file's unique value to update the file's primary key, the absolute file name. So it should be no problem to use ATF with a dynamic collection by using the composite primary key instead.
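In SQL terms, the ATF update would then set two fields instead of one. The sketch below assumes a hypothetical `uniqueid` column on the tags table and hypothetical column names `deviceid` and `url`. Where ATF really keeps its value is not stated here.

```python
import sqlite3


def relocate(conn: sqlite3.Connection, unique_id: str,
             new_device_id: int, new_relative_path: str) -> int:
    # Point the row identified by the ATF value at its new location.
    # Returns the number of rows updated (0 if the file is unknown).
    cur = conn.execute(
        "UPDATE tags SET deviceid = ?, url = ? WHERE uniqueid = ?",
        (new_device_id, new_relative_path, unique_id),
    )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags (deviceid INTEGER, url TEXT, uniqueid TEXT UNIQUE, "
    "PRIMARY KEY (deviceid, url))"
)
conn.execute("INSERT INTO tags VALUES (1, 'rock/a.mp3', 'f00d')")

# The file was moved from device 1 to device 2 outside of Amarok.
updated = relocate(conn, "f00d", 2, "archive/a.mp3")
```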

Thinking about remote collections (this should probably go somewhere further down the page):

ATF is impossible for collections that have remote databases eg iTunes' DAAP
Do we care enough to support ATF for things like unmounted SMB and FTP?
If not, we are just left with mounted filesystems, which shouldn't be too hard

User-defined device names

We could add a user-definable name to each device in the database. That would allow us to show the user a message like "The file is stored on a CD-ROM which is currently not mounted. Please insert and mount CD <name> and try again." The same could happen for tracks on a USB drive, network share, etc.

In my opinion, this is definitely necessary for the user to be able to keep track of all the devices. One other example of where this would be necessary, aside from the above, is a device configuration/add/remove screen. --Andrewt512 09:22, 31 May 2006 (EDT)

Database Rescanning

Currently, when there are changes to tables such as the tags tables, we drop them and recreate them. This doesn't matter as everything is mounted, so we just recreate the information in the new format by scanning again.

If we had removable media devices in the collection, then it would be bad to lose the information due to a minor update of the tags table. Hence, it will be necessary to write code to upgrade after a version number bump of the tags table.

It would also be bad to lose information due to a full database rescan, which is effectively the same as above, but started by the user. In that case, we should probably only rescan the devices that are available, and leave everything else untouched.
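A user-started rescan that leaves unavailable devices alone might look like the sketch below. The function name `rescan_available` and the columns `deviceid`, `url` and `title` are assumptions for illustration only.

```python
import sqlite3
from typing import Iterable, Tuple


def rescan_available(conn: sqlite3.Connection,
                     scanned: Iterable[Tuple[int, str, str]],
                     available_device_ids: Iterable[int]) -> None:
    """Replace rows of mounted devices only; rows of other devices are kept."""
    ids = list(available_device_ids)
    if not ids:
        return
    placeholders = ", ".join("?" for _ in ids)
    rows = [row for row in scanned if row[0] in ids]
    with conn:  # one transaction: either the whole rescan applies or none of it
        conn.execute(f"DELETE FROM tags WHERE deviceid IN ({placeholders})", ids)
        conn.executemany("INSERT INTO tags VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags (deviceid INTEGER, url TEXT, title TEXT, "
    "PRIMARY KEY (deviceid, url))"
)
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?)",
    [(1, "old.mp3", "Old"), (2, "cd/track01.ogg", "On CD")],
)

# Only device 1 is mounted; the CD (device 2) is not in the drive.
rescan_available(conn, [(1, "new.mp3", "New")], [1])
rows_after = conn.execute(
    "SELECT deviceid, url, title FROM tags ORDER BY deviceid"
).fetchall()
```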

We need to find a better way to store Amarok's collection folders. Currently they are stored as simple string list in Amarok's config file. To make the collection folders work with the Dynamic Collection, they have to be associated with the device they are on. The easiest way would be to store them in the database too, but it might be a requirement to keep the collection folders out of the database so that Amarok can rescan the folders even if the whole database is damaged/deleted. --Mkossick 19:40, 10 June 2006 (EDT)

[Collection]

device1 = list of paths

device2 = list of paths?

Playlists

Amarok saves playlists as m3u files and stores a file's absolute path at the time of the playlist's creation in it. That won't work with Dynamic Collection where we try to avoid storing absolute paths anywhere. I'm not sure how to solve this problem yet.

Remote Collections: The many types of collection

A first thought about collections might suggest that there are simply sources that can be mounted (eg Hard Drives, USB drives, CDs, Samba) and those that can't (eg iTunes' DAAP and Ampache shares, Samba without smbfs, FTP). This, however, naïvely overlooks one far more important distinction: how the information is stored.

In an Ampache share, there is a database of all the music on the server. All the artists present can be listed, without having to see all the song titles. There are even IDs for every artist, album and song. It can even perform searches. In short, it does all the hard work for us.

Compare this to a USB drive, or an FTP share, where we only get to see the files and their locations and have to get the metadata ourselves.

So, there are really four different types of collections.

Mountable database-less:

Hard Drive
USB stick
CD
Samba with smbfs

Mountable with database:

iPod? (the database can be ignored for reading)
some other media players?

Non-mountable database-less:

Samba without smbfs
FTP

Non-mountable with database:

DAAP
Ampache
Magnatune.com or other online music stores

Remote Collections: Designing a plugin architecture

To interoperate with the Dynamic Collection ideas above, a plugin would need to:

Generate unique IDs for each device using a method sensible for the type of device (discussed above in "The Idea Behind Dynamic Collection")
Add and remove rows to/from Amarok's database allowing for scoring, album art etc. to work
Handle conversion from the device ID + relative path to a URL usable by the playback engine (By this stage, the static URL column probably should be removed from the Amarok DB to make this stage flexible: it could even involve downloading the file to a temporary location, if the plugin author thought it necessary.)
Probably far more to be added here...
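The three points above could translate into an interface along these lines. This is only a sketch: the class and method names are invented here, and a real plugin API for Amarok would be written in C++.

```python
import posixpath
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Tuple


class CollectionPlugin(ABC):
    """Hypothetical plugin interface covering the three duties listed above."""

    @abstractmethod
    def device_id(self) -> str:
        """Unique id, generated in a way that suits this type of device."""

    @abstractmethod
    def tracks(self) -> Iterable[Tuple[str, dict]]:
        """(relative path, metadata) pairs to add to the collection database."""

    @abstractmethod
    def playable_url(self, relative_path: str) -> str:
        """URL the playback engine can open right now."""


class MountedFolderPlugin(CollectionPlugin):
    """Simplest case: a mounted, database-less device."""

    def __init__(self, dev_id: str, mount_point: str, files: Dict[str, dict]):
        self._dev_id = dev_id
        self._mount_point = mount_point
        self._files = files  # stands in for the result of a tag scan

    def device_id(self) -> str:
        return self._dev_id

    def tracks(self) -> Iterable[Tuple[str, dict]]:
        return list(self._files.items())

    def playable_url(self, relative_path: str) -> str:
        return posixpath.join(self._mount_point, relative_path)


plugin = MountedFolderPlugin("uuid:abcd", "/media/usb", {"rock/a.mp3": {"title": "A"}})
```

A plugin for a non-mountable source could implement `playable_url` by returning a network URL or by downloading to a temporary file first.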

It is, however, the second point where the real difference between collections with and without databases shows.

Without a database, there is no option but to scan all the files (eg with taglib) to generate the metadata and to add them to Amarok's DB.

With a database, however, there are many possible ways the task can be approached all with different drawbacks:

Importing all the songs in one go into the Amarok DB by asking for the metadata of all songs from the remote DB:

Simple
Could take a long time when a collection is added... might be very bad for DAAP if there are 10s of shares
Updating would presumably require the process be repeated again...
Bandwidth hungry

Importing on demand (ie for an Artist->Album->Song view in the Collection Browser, first get the artist list, then get the album list for an artist when it is expanded. When a song is finally added to the playlist, add that song to Amarok's DB):

No long initial delay
Delays when expanding entries in the Collection Browser could be annoying
No easy way to tell if a song is removed from a collection, unless either we happen to see enough of the tree, or alternatively we go about deliberately checking whether entries still exist (which would be bandwidth hungry)
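Importing on demand amounts to lazy loading with a local cache. A minimal sketch follows, in which `FakeRemote` stands in for a DAAP or Ampache server. Both classes and all method names are hypothetical.

```python
from typing import Dict, List, Optional


class FakeRemote:
    """Stand-in for a remote database that can list artists and albums."""

    def __init__(self, data: Dict[str, List[str]]):
        self._data = data

    def list_artists(self) -> List[str]:
        return sorted(self._data)

    def list_albums(self, artist: str) -> List[str]:
        return sorted(self._data[artist])


class OnDemandBrowser:
    """Fetches each level of the tree only when it is first expanded."""

    def __init__(self, remote: FakeRemote):
        self._remote = remote
        self._artists: Optional[List[str]] = None
        self._albums: Dict[str, List[str]] = {}
        self.requests = 0  # number of round trips made so far

    def artists(self) -> List[str]:
        if self._artists is None:
            self.requests += 1
            self._artists = self._remote.list_artists()
        return self._artists

    def albums(self, artist: str) -> List[str]:
        if artist not in self._albums:
            self.requests += 1
            self._albums[artist] = self._remote.list_albums(artist)
        return self._albums[artist]


browser = OnDemandBrowser(FakeRemote({"Artist A": ["X", "Y"], "Artist B": ["Z"]}))
browser.artists()
browser.albums("Artist A")
browser.albums("Artist A")  # served from the cache, no new request
```

The cache is also where the staleness problem shows up: entries that were removed remotely stay visible until that part of the tree is fetched again.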

To integrate remote collection databases seamlessly into the rest of the collection, we should simply copy the whole remote database (if the network is fast enough to play music from the remote location, bandwidth is probably not an issue). It is almost certain that there will be an option to search in the whole collection instead of only the active collection (active collection: the songs which are stored on devices that are mounted/accessible at the time of the search). To give the user a consistent view of the collection, we have to import the whole remote DB into Amarok's local DB: the user expects to be able to search for all the files in the remote database even if he is not connected to it, just like he is able to do with, for example, unmounted CDs or USB drives. --Mkossick 18:50, 31 May 2006 (EDT)

Personally, I favour Importing on Demand: it has lower bandwidth requirements, and I think we can assume the latency to be reasonable for a remote collection. The problem with all remote collections is that you can't tell if things have been changed. With Importing on Demand, you shrug and carry on until you can see for certain that something has changed, then you make the change yourself as best you can (realistically, you only remove things). With the Import-It-All approach, you need to resynchronise periodically. If you have lots of remote collections (I personally am on a university network where tens of people are running iTunes), that's going to be a lot of downloading. That means it could well feel slower to the user, especially when they first try out the feature (and give up because it takes 10 minutes to scan and eats half of their bandwidth whilst doing so). --Andrewt512 18:41, 10 June 2006 (EDT)

I think we need both options. For things that have a remote database that can be easily queried, there is no need to make a persistent local copy of the database. Other collections might require just that, however. A plugin interface was mentioned earlier and I definitely think that is the way to go. We might, however, need two different kinds of plugins: some that copy all info about the collection to a local database, and some that act as a proxy for a remote database. How this fits together architecturally, however, I have no idea at the moment! --Freespirit 13:22, 12 June 2006 (EDT)

To be continued...

(Reminders of topics still to be covered: Searching remote collections)