Difference between revisions of "Projects/Nepomuk/FileIndexing"

 
Line 1: Line 1:
 
This page attempts to catalogue the file formats Nepomuk supports, and which formats remain to be implemented.
  
{|
+
= Mime Types =
! MimeType !! Status !! Plugin !! Comments
 
  
|- image/jpeg
+
{| class="wikitable" style="text-align: center;"
| Implemented - Requires Testing
+
! MimeType
| Exiv2Extractor
+
! Status
| No Comments
+
! Plugin
 +
! Comments
  
 
|-  
 
|-  
| application/pdf
+
| image/jpeg
| Implemented - Requires Testing  
+
| Testing  
| PopplerExtractor
+
| Exiv2Extractor
 
| No Comments
 
| No Comments
  
 +
|-
 +
| image/png
 +
| Testing
 +
| Exiv2Extractor
 +
| -
  
|}
+
|-
= File indexing solutions =
+
| image/gif
 +
| ?
 +
| ?
  
== Strigi ==
+
|-
As of version 4.9, the KDE software releases use libstreamanalyzer to index files. Current problems with Strigi:
+
| image/exif
  
* Difficult to contribute to
+
|-
* No documentation
+
| image/tiff
* Un-maintained
 
* Does not reuse libraries
 
* Has its own huge parsers for archives, utf, etc.
 
  
== Roll our own? ==
+
|-
Maybe it would be better to roll our own file parsers which are just light wrappers over the existing libraries.
+
| image/bmp
  
= File Formats =
+
|-
 +
| image/svg
  
We list down all the different file formats, and which all are supported by the different file indexing solutions.
+
|-
 +
| audio/mpeg
 +
| Requires Polish
 +
| Taglib Extractor
  
== Images ==
+
|-
 +
| audio/mp4
  
* JPEG - Use exiv - strigi also uses exiv - currently broken
+
|-
* PNG - Strigi rolls its own - detects the application name, color depth and interlace mode as well
+
| audio/wav
* GIF - there isn't much metadata
 
* EXIF
 
* TIFF
 
* BMP
 
* SVG - Strigi stores them as plain text
 
  
We could just use exiv2, which covers almost all of these formats. The code would also be very simple.
+
|-
 +
| audio/x-aiff
  
== Videos ==
+
|-
 +
| application/pdf
 +
| Implemented - Requires Testing
 +
| PopplerExtractor
 +
| ---
  
Strigi uses ffmpeg except for ID3, vorbis and OggS. It also has to seek through the file. Not sure what that is for.
+
|-
 +
| Other Office Formats
 +
| ?
  
Overall, we could just use ffmpeg for everything. It's very fast and pretty much supports all the formats.
+
|-
 +
| Ebook Formats
 +
| ?
  
== Audio ==
+
|-
* MP3
+
| Archives
* FLAC
+
| ?
* WAV
 
  
Strigi rolls its own for id3 metadata. We should use taglib or ffmpeg. It seems to handle flac and wav files pretty well.
+
|-
 +
| video/mpeg
 +
| Testing
 +
| FFmpeg
 +
 
 +
|-
 +
| video/x-msvideo
 +
| Testing
 +
| FFmpeg
 +
 
 +
|-
 +
| Other video formats
 +
| ?
 +
 
 +
|-
 +
| text/plain
 +
| Plain Text Extractor
 +
| Implemented
 +
| This should be extended to support other text files
 +
 
 +
|}
 +
 
 +
= Notes =
  
 
== Documents ==
 
 
PDF - Strigi uses its own parser, which works poorly. We should use poppler.
 
ODF - Strigi inbuilt. We should
 
  
 
=== Microsoft Formats ===
 
Line 74: Line 105:
 
=== Open document formats ===
 
  
ODF? Custom analyzer by Strigi.
+
ODF - Strigi had its own built-in parser. What are our options?
  
 
=== Ebook formats ===
 
* epub - Strigi reuses their ODF parser for epub
+
* epub - Strigi reuses their ODF parser for epub. We could use libepub
 
* mobi
 
 
* rtf
 
 
* lrf
 
  
We could use libepub. + Checkout what Okular uses. Try using that.
+
Check out what Okular uses for all these files and use that.
  
 
=== Other ===
 
Line 91: Line 122:
 
== Archives ==
 
  
* tar
+
We just need to add the <tt>nfo:Archive</tt> type based on the mimetype. Is there anything else that we can add?
* gzip
 
* whatever ..
 
 
 
Strigi has its own analyzers for each archive which doesn't really add any metadata. It just adds the type <tt>nfo:Archive</tt>. We can do the same based on the mimetype.
 
  
 
== Emails ==
 
* mbox format - There was a bug report
+
* mbox format - How? Something from pim?
 
 
== Text Files ==
 
* Text files
 
* Source Code
 
 
 
== ISO images ==
 
 
 
Add the type based on the mimetype
 
 
 
== Executable files ==
 
 
 
Use Mimetype
 

Latest revision as of 01:23, 6 November 2012

This page attempts to catalogue the file formats Nepomuk supports, and which formats remain to be implemented.

Mime Types

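{| class="wikitable" style="text-align: center;"
! MimeType !! Status !! Plugin !! Comments
|-
| image/jpeg || Testing || Exiv2Extractor || No Comments
|-
| image/png || Testing || Exiv2Extractor || -
|-
| image/gif || ? || ? ||
|-
| image/exif || || ||
|-
| image/tiff || || ||
|-
| image/bmp || || ||
|-
| image/svg || || ||
|-
| audio/mpeg || Requires Polish || Taglib Extractor ||
|-
| audio/mp4 || || ||
|-
| audio/wav || || ||
|-
| audio/x-aiff || || ||
|-
| application/pdf || Implemented - Requires Testing || PopplerExtractor || ---
|-
| Other Office Formats || ? || ||
|-
| Ebook Formats || ? || ||
|-
| Archives || ? || ||
|-
| video/mpeg || Testing || FFmpeg ||
|-
| video/x-msvideo || Testing || FFmpeg ||
|-
| Other video formats || ? || ||
|-
| text/plain || Implemented || Plain Text Extractor || This should be extended to support other text files
|}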
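
To make the table above easier to reason about, here is a minimal sketch of a mimetype-to-plugin lookup. The plugin names are taken from the table; the <tt>EXTRACTORS</tt> dictionary and the <tt>find_extractor</tt> function are hypothetical names used only for illustration, and this is Python rather than the actual Nepomuk plugin code.

```python
# Hypothetical registry: plugin names come from the table above,
# the dictionary and lookup function are illustrative assumptions.
EXTRACTORS = {
    "image/jpeg": "Exiv2Extractor",
    "image/png": "Exiv2Extractor",
    "audio/mpeg": "Taglib Extractor",
    "application/pdf": "PopplerExtractor",
    "video/mpeg": "FFmpeg",
    "video/x-msvideo": "FFmpeg",
    "text/plain": "Plain Text Extractor",
}


def find_extractor(mimetype):
    """Return the plugin name registered for a mimetype, or None."""
    # Normalise case and surrounding whitespace before the lookup.
    return EXTRACTORS.get(mimetype.strip().lower())
```

Mimetypes that have no plugin yet (for example image/gif) simply return <tt>None</tt>, which mirrors the empty rows in the table.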

Notes

Documents

Microsoft Formats

  • DOC - OLE 2 Compound Document and Office Open XML - Custom parser by Strigi. What can we use?
  • XLS - http://qt-project.org/wiki/Handling_Microsoft_Excel_file_format
  • Other spreadsheet formats

Maybe we can use some LibreOffice or Calligra libraries?

Open document formats

ODF - Strigi had its own built-in parser. What are our options?

Ebook formats

  • epub - Strigi reuses its ODF parser for epub. We could use libepub
  • mobi
  • rtf
  • lrf

Check out what Okular uses for all these files and use that.

Other

  • lyx
  • tex
  • cbz - Comic books

Archives

We just need to add the nfo:Archive type based on the mimetype. Is there anything else that we can add?
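
A rough sketch of what "add the type based on the mimetype" could look like. The set of archive mimetypes below is an assumption and certainly incomplete, and <tt>types_for_mimetype</tt> is a hypothetical helper, not an existing Nepomuk function.

```python
NFO_ARCHIVE = "nfo:Archive"

# Assumed, incomplete list of archive mimetypes; extend as needed.
ARCHIVE_MIMETYPES = {
    "application/x-tar",
    "application/zip",
    "application/gzip",
    "application/x-bzip2",
    "application/x-7z-compressed",
}


def types_for_mimetype(mimetype):
    """Return the extra ontology types to attach for a given mimetype."""
    if mimetype in ARCHIVE_MIMETYPES:
        return [NFO_ARCHIVE]
    # No archive-specific type for anything else.
    return []
```

No archive is opened or parsed here; the type is decided from the mimetype string alone.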

Emails

  • mbox format - How? Something from pim?
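
Just to illustrate what kind of metadata an mbox extractor might pull out, here is a small sketch using Python's standard <tt>mailbox</tt> module. This is not a proposal to use Python in Nepomuk, and whether something from pim would expose the same fields is an open question; <tt>extract_mbox_metadata</tt> is a hypothetical name.

```python
import mailbox


def extract_mbox_metadata(path):
    """Return one {'from', 'subject'} dict per message in an mbox file."""
    # create=False: fail instead of creating an empty mailbox if path is missing.
    box = mailbox.mbox(path, create=False)
    try:
        return [
            {"from": msg["From"], "subject": msg["Subject"]}
            for msg in box
        ]
    finally:
        box.close()
```

Only the headers are read here; indexing the message bodies would be a separate decision.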

This page was last edited on 6 November 2012, at 01:23. Content is available under Creative Commons License SA 4.0 unless otherwise noted.