Projects/Nepomuk/FileIndexing: Difference between revisions

From KDE Community Wiki
No edit summary
(2 intermediate revisions by the same user not shown)
Line 1: Line 1:
Nepomuk currently acts as the file indexer for the KDE platform, applications and workspaces. Even though we frequently tout that we are not just a file indexer, we need to index the files properly.
This page attempts to catalogue the file formats Nepomuk supports, and which formats remain.


= File indexing solutions =
= Mime Types =


== Strigi ==
{| class="wikitable" style="text-align: center;"
As of version 4.9, the KDE software releases use libstreamanalyzer (part of Strigi) to index files. Current problems with Strigi:
! MimeType
! Status
! Plugin
! Comments


* Difficult to contribute to
|-
* No documentation
| image/jpeg
* Un-maintained
| Testing
* Does not reuse libraries
| Exiv2Extractor
* Has its own huge parsers for archives, utf, etc.
| No Comments


== Roll our own? ==
|-
Maybe it would be better to roll our own file parsers which are just light wrappers over the existing libraries.
| image/png
| Testing
| Exiv2Extractor
| -


= File Formats =
|-
| image/gif
| ?
| ?


We list the different file formats and which of them each file indexing solution supports.
|-
| image/exif


== Images ==
|-
| image/tiff


* JPEG - Use exiv2 - Strigi also uses exiv2 - currently broken
|-
* PNG - Strigi rolls its own - detects the application name, color depth and interlace mode as well
| image/bmp
* GIF - there isn't much metadata
* EXIF
* TIFF
* BMP
* SVG - Strigi stores them as plain text


We just use exiv2 and cover almost everything. Plus the code would be super simple.
|-
| image/svg


== Videos ==
|-
| audio/mpeg
| Requires Polish
| Taglib Extractor


Strigi uses ffmpeg except for ID3, vorbis and OggS. It also has to seek through the file. Not sure what that is for.
|-
| audio/mp4


Overall, we could just use ffmpeg for everything. It's very fast and pretty much supports all the formats.
|-
| audio/wav


== Audio ==
|-
* MP3
| audio/x-aiff
* FLAC
* WAV


Strigi rolls its own for id3 metadata. We should use taglib or ffmpeg. It seems to handle flac and wav files pretty well.
|-
| application/pdf
| Implemented - Requires Testing
| PopplerExtractor
| ---
 
|-
| Other Office Formats
| ?
 
|-
| Ebook Formats
| ?
 
|-
| Archives
| ?
 
|-
| video/mpeg
| Testing
| FFmpeg
 
|-
| video/x-msvideo
| Testing
| FFmpeg
 
|-
| Other video formats
| ?
 
|-
| text/plain
| Implemented
| Plain Text Extractor
| This should be extended to support other text files
 
|}
 
= Notes =


== Documents ==
== Documents ==
PDF - Strigi uses their own which is crap. We should use poppler.
ODF - Strigi inbuilt. We should


=== Microsoft Formats ===
=== Microsoft Formats ===
Line 58: Line 105:
=== Open document formats ===
=== Open document formats ===


ODF? Custom analyzer by Strigi.
ODF - Strigi had its own built-in parser. What are our options?


=== Ebook formats ===
=== Ebook formats ===
* epub - Strigi reuses their ODF parser for epub
* epub - Strigi reuses their ODF parser for epub. We could use libepub
* mobi
* mobi
* rtf
* rtf
* lrf
* lrf


Check out what Okular uses. Try using that.
Check out what Okular uses for all these files and use that.


=== Other ===
=== Other ===
Line 75: Line 122:
== Archives ==
== Archives ==


* tar
We just need to add the <tt>nfo:Archive</tt> type based on the mimetype. Is there anything else that we can add?
* gzip
* whatever ..
 
Strigi has its own analyzers for each archive format, which don't really add any metadata. They just add the type <tt>nfo:Archive</tt>. We can do the same based on the mimetype.


== Emails ==
== Emails ==
* mbox format - There was a bug report
* mbox format - How? Something from pim?
 
== Text Files ==
* Text files
* Source Code
 
== ISO images ==
 
Add the type based on the mimetype
 
== Executable files ==
 
Use Mimetype

Latest revision as of 01:23, 6 November 2012

This page attempts to catalogue the file formats Nepomuk supports, and which formats remain.

Mime Types

{| class="wikitable" style="text-align: center;"
! MimeType !! Status !! Plugin !! Comments
|-
| image/jpeg || Testing || Exiv2Extractor || No Comments
|-
| image/png || Testing || Exiv2Extractor || -
|-
| image/gif || ? || ?
|-
| image/exif
|-
| image/tiff
|-
| image/bmp
|-
| image/svg
|-
| audio/mpeg || Requires Polish || Taglib Extractor
|-
| audio/mp4
|-
| audio/wav
|-
| audio/x-aiff
|-
| application/pdf || Implemented - Requires Testing || PopplerExtractor || ---
|-
| Other Office Formats || ?
|-
| Ebook Formats || ?
|-
| Archives || ?
|-
| video/mpeg || Testing || FFmpeg
|-
| video/x-msvideo || Testing || FFmpeg
|-
| Other video formats || ?
|-
| text/plain || Implemented || Plain Text Extractor || This should be extended to support other text files
|}
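
The table can also be read as a lookup from mimetype to extractor plugin. A minimal sketch in Python follows, only to illustrate the idea. The plugin names are the ones listed in the table; the names <tt>EXTRACTORS</tt>, <tt>extractor_for</tt> and <tt>unsupported</tt> are hypothetical and not part of Nepomuk.

```python
# Hypothetical lookup table built from the rows that already name a plugin.
# Only the plugin names come from the table; everything else is illustrative.
EXTRACTORS = {
    "image/jpeg": "Exiv2Extractor",
    "image/png": "Exiv2Extractor",
    "audio/mpeg": "Taglib Extractor",
    "application/pdf": "PopplerExtractor",
    "video/mpeg": "FFmpeg",
    "video/x-msvideo": "FFmpeg",
    "text/plain": "Plain Text Extractor",
}


def extractor_for(mime_type):
    """Return the plugin name for a mimetype, or None if no plugin is listed."""
    return EXTRACTORS.get(mime_type.strip().lower())


def unsupported(mime_types):
    """Return the mimetypes from the input that have no plugin listed yet."""
    return [m for m in mime_types if extractor_for(m) is None]
```

Calling <tt>unsupported</tt> on the full mimetype column would give the list of formats that still need a plugin.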

Notes

Documents

Microsoft Formats

  • DOC - OLE 2 Compound Document and Office Open XML - Custom parser by Strigi. What can we use?
  • XLS - http://qt-project.org/wiki/Handling_Microsoft_Excel_file_format
  • spreadsheet formats

Maybe we can use some LibreOffice or Calligra libraries?

Open document formats

ODF - Strigi had its own built-in parser. What are our options?

Ebook formats

  • epub - Strigi reuses their ODF parser for epub. We could use libepub
  • mobi
  • rtf
  • lrf

Check out what Okular uses for all these files and use that.

Other

  • lyx
  • tex
  • cbz - Comic books

Archives

We just need to add the nfo:Archive type based on the mimetype. Is there anything else that we can add?
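
Adding the type from the mimetype could be as small as the sketch below (Python, for illustration only). The set of archive mimetypes is an assumption and is not exhaustive, and <tt>ARCHIVE_MIME_TYPES</tt> and <tt>extra_types_for</tt> are hypothetical names.

```python
# Assumed, non-exhaustive set of archive mimetypes; extend as needed.
ARCHIVE_MIME_TYPES = frozenset({
    "application/x-tar",
    "application/zip",
    "application/gzip",
    "application/x-bzip2",
    "application/x-xz",
    "application/x-7z-compressed",
})


def extra_types_for(mime_type):
    """Return the extra RDF types to attach to a file with this mimetype."""
    # Drop any parameters such as "; charset=binary" and normalise the case.
    normalized = mime_type.split(";", 1)[0].strip().lower()
    if normalized in ARCHIVE_MIME_TYPES:
        return ["nfo:Archive"]
    return []
```

No archive is opened here, so this is cheap to run on every file.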

Emails

  • mbox format - How? Something from pim?
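
To show what kind of data could be pulled out of an mbox file, here is a small sketch using the <tt>mailbox</tt> module from the Python standard library. It is not based on anything from KDE PIM and does not answer which library Nepomuk should use; <tt>mbox_summaries</tt> is a hypothetical name.

```python
import mailbox


def mbox_summaries(path):
    """Yield (sender, subject) for every message stored in an mbox file."""
    # create=False makes a missing file an error instead of creating it.
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            yield message.get("From", ""), message.get("Subject", "")
    finally:
        box.close()
```

Each yielded pair could then be mapped to whatever email properties the indexer stores.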