Latest revision as of 01:23, 6 November 2012

This page catalogues the file formats Nepomuk supports and the formats that still need to be covered.

Mime Types

MimeType	Status	Plugin	Comments
image/jpeg	Testing	Exiv2Extractor	No Comments
image/png	Testing	Exiv2Extractor	-
image/gif	?	?
image/exif
image/tiff
image/bmp
image/svg
audio/mpeg	Requires Polish	Taglib Extractor
audio/mp4
audio/wav
audio/x-aiff
application/pdf	Implemented - Requires Testing	PopplerExtractor	---
Other Office Formats	?
Ebook Formats	?
Archives	?
video/mpeg	Testing	FFmpeg
video/x-msvideo	Testing	FFmpeg
Other video formats	?
text/plain	Implemented	Plain Text Extractor	This should be extended to support other text files
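
The table above amounts to a lookup from MIME type to extractor plugin. The following is a minimal sketch of that lookup in Python, not Nepomuk code: the `EXTRACTORS` dictionary and the `pick_extractor` function are hypothetical names introduced for illustration, and only the rows that name a plugin in the table are included. The MIME type is guessed from the file extension with the standard `mimetypes` module, which is an assumption and not how Nepomuk detects types.

```python
import mimetypes

# Rows from the table above that name a plugin. The dictionary itself is
# illustrative only and is not part of Nepomuk.
EXTRACTORS = {
    "image/jpeg": "Exiv2Extractor",
    "image/png": "Exiv2Extractor",
    "audio/mpeg": "Taglib Extractor",
    "application/pdf": "PopplerExtractor",
    "video/mpeg": "FFmpeg",
    "video/x-msvideo": "FFmpeg",
    "text/plain": "Plain Text Extractor",
}


def pick_extractor(path):
    """Return (mimetype, plugin name); the plugin is None when no row matches."""
    # Guess the MIME type from the file extension only (an assumption).
    mimetype, _encoding = mimetypes.guess_type(path)
    return mimetype, EXTRACTORS.get(mimetype)
```

A type that has no plugin yet, such as image/gif, comes back with `None` as the plugin, which mirrors the "?" entries in the table.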

Notes

Documents

Microsoft Formats

DOC - OLE 2 Compound Document and Office Open XML - Custom parser by Strigi. What can we use?
XLS - http://qt-project.org/wiki/Handling_Microsoft_Excel_file_format
Other spreadsheet formats

Maybe we can use some LibreOffice or Calligra libraries?

Open document formats

ODF - Strigi had its own built-in parser. What are our options?
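
One option is to read the metadata directly, since an ODF file is a zip package whose `meta.xml` entry carries Dublin Core fields. The sketch below shows this in Python using only the standard library. It is an illustration under that assumption, not an existing Nepomuk or Strigi extractor, and `read_odf_metadata` is a hypothetical name.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside an ODF package's meta.xml.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_odf_metadata(path):
    """Return the Dublin Core fields found in meta.xml.

    `path` may be a file name or a binary file object.
    """
    with zipfile.ZipFile(path) as package:
        root = ET.fromstring(package.read("meta.xml"))
    meta = root.find("office:meta", NS)
    fields = {}
    if meta is None:
        return fields
    # Collect only the fields that are present and non-empty.
    for name in ("title", "creator", "subject", "description", "date"):
        element = meta.find("dc:" + name, NS)
        if element is not None and element.text:
            fields[name] = element.text
    return fields
```

The same zip-plus-XML approach is presumably why a single parser could be shared between ODF and epub.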

Ebook formats

epub - Strigi reuses its ODF parser for epub. We could use libepub.
mobi
rtf
lrf

Check out what Okular uses for all these files and use that.

Other

lyx
tex
cbz - Comic books

Emails

mbox format - How? Something from KDE PIM?
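
Which library to use for mbox is still an open question. As a rough illustration of the headers an mbox extractor would need to pull out, the Python sketch below uses the standard library `mailbox` module. It does not use any KDE PIM library, and `index_mbox` is a hypothetical name.

```python
import mailbox


def index_mbox(path):
    """Return one dict of header fields per message in an mbox file."""
    # create=False: fail instead of creating an empty mailbox when the
    # file does not exist.
    box = mailbox.mbox(path, create=False)
    try:
        return [
            {
                "from": message["From"],
                "to": message["To"],
                "subject": message["Subject"],
                "date": message["Date"],
            }
            for message in box
        ]
    finally:
        box.close()
```

A header that is missing from a message is returned as `None`.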

@@ Line 1: / Line 1: @@
-Nepomuk currently acts as the file indexer for the KDE platform, applications and workspaces. Even though we frequently tout that we are not just a file indexer, we need to index the files properly.
+This page attempts to catalogue the list of files formats Nepomuk supports, and what formats are remaining.
-= File indexing solutions =
+= Mime Types =
-== Strigi ==
+{| class="wikitable" style="text-align: center;"
-The KDE software releases in version 4.9, currently use libstreamanalyzer to index the files. Current problems with strigi -
+! MimeType
+! Status
+! Plugin
+! Comments
-* Difficult to contribute to
+|-
-* No documentation
+| image/jpeg
-* Un-maintained
+| Testing
-* Does not reuse libraries
+| Exiv2Extractor
-* Has its own huge parsers for archives, utf, etc.
+| No Comments
-== Roll our own? ==
+|-
-Maybe it would be better to roll our own file parsers which are just light wrappers over the existing libraries.
+| image/png
+| Testing
+| Exiv2Extractor
+| -
-= File Formats =
+|-
+| image/gif
+| ?
+| ?
-We list down all the different file formats, and which all are supported by the different file indexing solutions.
+|-
+| image/exif
-== Images ==
+|-
+| image/tiff
-* JPEG - Use exiv - strigi also uses exiv - currently broken
+|-
-* PNG - Strigi rolls its own - detects the application name, color depth and interlace mode as well
+| image/bmp
-* GIF - there isn't much metadata
-* EXIF
-* TIFF
-* BMP
-* SVG - Strigi stores them as plain text
-We just use exiv2 and cover almost everything. Plus the code would be super simple.
+|-
+| image/svg
-== Videos ==
+|-
+| audio/mpeg
+| Requires Polish
+| Taglib Extractor
-Strigi uses ffmpeg except for ID3, vorbis and OggS. It also has to seek through the file. Not sure what that is for.
+|-
+| audio/mp4
-Overall, we could just use ffmpeg for everything. It's very fast and pretty much supports all the formats.
+|-
+| audio/wav
-== Audio ==
+|-
-* MP3
+| audio/x-aiff
-* FLAC
-* WAV
-Strigi rolls its own for id3 metadata. We should use taglib or ffmpeg. It seems to handle flac and wav files pretty well.
+|-
+| application/pdf
+| Implemented - Requires Testing
+| PopplerExtractor
+| ---
+|-
+| Other Office Formats
+| ?
+|-
+| Ebook Formats
+| ?
+|-
+| Archives
+| ?
+|-
+| video/mpeg
+| Testing
+| FFmpeg
+|-
+| video/x-msvideo
+| Testing
+| FFmpeg
+|-
+| Other video formats
+| ?
+|-
+| text/plain
+| Plain Text Extractor
+| Implemented
+| This should be extended to support other text files
+|}
+= Notes =
 == Documents ==
-PDF - Strigi uses their own which is crap. We should use poppler.
-ODF - Strigi inbuilt. We should
 === Microsoft Formats ===
@@ Line 58: / Line 105: @@
 === Open document formats ===
-ODF? Custom analyzer by Strigi.
+ODF - Strigi had their own inbuilt. What are our options?
 === Ebook formats ===
-* epub - Strigi reuses their ODF parser for epub
+* epub - Strigi reuses their ODF parser for epub. We could use libepub
 * mobi
 * rtf
 * lrf
-Checkout what Okular uses. Try using that.
+Checkout what Okular uses for all these files and use that.
 === Other ===
@@ Line 75: / Line 122: @@
 == Archives ==
-* tar
+We just need to add the <tt>nfo:Archive</tt> type based on the mimetype. Is there anything else that we can add?
-* gzip
-* whatever ..
-Strigi has its own analyzers for each archive which doesn't really add any metadata. It just adds the type <tt>nfo:Archive</tt>. We can do the same based on the mimetype.
 == Emails ==
-* mbox format - There was a bug report
+* mbox format - How? Something from pim?
-== Text Files ==
-* Text files
-* Source Code
-== ISO images ==
-Add the type based on the mimetype
-== Executable files ==
-Use Mimetype