Difference between revisions of "Projects/Nepomuk/FileIndexing"

Earlier revision

Nepomuk currently acts as the file indexer for KDE. Even though we frequently tout that we are not just a file indexer, we need to index files properly.

= File Indexing solutions =

== Strigi ==
KDE 4.9 currently uses libstreamanalyzer to index files. Current problems with Strigi:

* Difficult to contribute to
* No documentation
* Unmaintained
* Does not reuse libraries

Lists the current status of indexing different files.

== Roll our own? ==

= File Formats =

We list all the different file formats, and which of them are supported by the different file indexing solutions.

== Images ==

* JPEG - Use exiv - Strigi also uses exiv - currently broken
* PNG - Strigi rolls its own - detects the application name, color depth and interlace mode as well
* GIF - there isn't much metadata
* EXIF
* TIFF
* BMP
* SVG - Strigi stores them as plain text

We could just use exiv2 and cover almost everything. Plus the code would be super simple.

== Videos ==

== Audio ==
** MP3

== Documents ==
** doc
** docx
** odf
** pdfs
** epub
** mobi
** spreadsheet formats
** Presentation Formats
** lyx
** tex
** cbz - Comic books

* Archives
** tar
** gzip
** whatever ..

* Emails
** There was a bug report

* Text Files
** Text files
** Source Code

* ISO images
* Executable Files

Latest revision as of 01:23, 6 November 2012

This page catalogues the file formats Nepomuk can index and the formats that still need support.

Mime Types

{| class="wikitable" style="text-align: center;"
! MimeType
! Status
! Plugin
! Comments
|-
| image/jpeg
| Testing
| Exiv2Extractor
| No Comments
|-
| image/png
| Testing
| Exiv2Extractor
| -
|-
| image/gif
| ?
| ?
|-
| image/exif
|-
| image/tiff
|-
| image/bmp
|-
| image/svg+xml
|-
| audio/mpeg
| Requires Polish
| Taglib Extractor
|-
| audio/mp4
|-
| audio/wav
|-
| audio/x-aiff
|-
| application/pdf
| Implemented - Requires Testing
| PopplerExtractor
| ---
|-
| Other Office Formats
| ?
|-
| Ebook Formats
| ?
|-
| Archives
| ?
|-
| video/mpeg
| Testing
| FFmpeg
|-
| video/x-msvideo
| Testing
| FFmpeg
|-
| Other video formats
| ?
|-
| text/plain
| Implemented
| Plain Text Extractor
| This should be extended to support other text files
|}
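
The table is effectively a lookup from MIME type to extractor plugin. A minimal sketch of that lookup follows, assuming only the plugin names listed in the table; the `EXTRACTORS` mapping and both function names are hypothetical and are not part of Nepomuk's API.

```python
# Plugin names are copied from the table; everything else is illustrative.
EXTRACTORS = {
    "image/jpeg": "Exiv2Extractor",
    "image/png": "Exiv2Extractor",
    "audio/mpeg": "Taglib Extractor",
    "application/pdf": "PopplerExtractor",
    "video/mpeg": "FFmpeg",
    "video/x-msvideo": "FFmpeg",
    "text/plain": "Plain Text Extractor",
}


def extractor_for(mime_type):
    """Return the plugin name for a MIME type, or None if no plugin is listed."""
    return EXTRACTORS.get(mime_type.lower())


def unsupported(mime_types):
    """Return the MIME types from the input that have no plugin, sorted."""
    return sorted(m for m in mime_types if extractor_for(m) is None)
```

Running `unsupported()` over every row of the table would give the list of formats that still need an extractor.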

Notes

Documents

Microsoft Formats

  • DOC/DOCX - OLE 2 Compound Document (.doc) and Office Open XML (.docx) - Custom parser by Strigi. What can we use?
  • XLS - http://qt-project.org/wiki/Handling_Microsoft_Excel_file_format
  • spreadsheet formats

Maybe we can use some LibreOffice or Calligra libraries?

Open document formats

ODF - Strigi had its own built-in parser. What are our options?

Ebook formats

  • epub - Strigi reuses its ODF parser for epub. We could use libepub
  • mobi
  • rtf
  • lrf

Check out what Okular uses for all these files and use that.
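
An EPUB file is a ZIP container, like ODF, which is presumably why one parser could serve both. To show how little is needed for the basic metadata, here is a hedged sketch using only the Python standard library. It is not libepub and not what Okular does; the function name `epub_metadata` is hypothetical, and it reads only the Dublin Core title, creator and language fields.

```python
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces defined by the EPUB container format and Dublin Core.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def epub_metadata(path):
    """Return a dict with the title, creator and language found in an EPUB."""
    with zipfile.ZipFile(path) as archive:
        # META-INF/container.xml points at the package (OPF) document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml has no rootfile entry")
        package = ET.fromstring(archive.read(rootfile.get("full-path")))

    metadata = {}
    for field in ("title", "creator", "language"):
        element = package.find(".//dc:" + field, DC_NS)
        if element is not None and element.text:
            metadata[field] = element.text.strip()
    return metadata
```

A real extractor would also need to handle DRM-protected or malformed files, which this sketch ignores.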

Other

  • lyx
  • tex
  • cbz - Comic books

Archives

We just need to add the nfo:Archive type based on the mimetype. Is there anything else that we can add?
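
Tagging by MIME type could look roughly like the sketch below. The set of archive MIME types is an illustrative assumption, not an agreed list, and `rdf_types_for` is a hypothetical helper; only `nfo:Archive` comes from the note above, with `nfo:FileDataObject` used as the base type for any file.

```python
# Illustrative, incomplete set of MIME types to treat as archives.
ARCHIVE_MIME_TYPES = frozenset({
    "application/x-tar",
    "application/gzip",
    "application/zip",
    "application/x-bzip2",
    "application/x-7z-compressed",
})


def rdf_types_for(mime_type):
    """Return the RDF types to attach to a file with the given MIME type."""
    types = ["nfo:FileDataObject"]
    if mime_type in ARCHIVE_MIME_TYPES:
        types.append("nfo:Archive")
    return types
```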

Emails

  • mbox format - How? Something from pim?
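
To give a feel for what indexing an mbox file involves, here is a small sketch using Python's standard `mailbox` module. It only illustrates the fields one might extract; it does not answer which KDE PIM library to use, and the function name `index_mbox` and the choice of headers are assumptions.

```python
import mailbox


def index_mbox(path):
    """Return one metadata dict per message in the mbox file at `path`."""
    records = []
    # create=False: fail instead of silently creating a missing mailbox.
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            records.append({
                "from": message.get("From", ""),
                "to": message.get("To", ""),
                "subject": message.get("Subject", ""),
                "date": message.get("Date", ""),
            })
    finally:
        box.close()
    return records
```

Each returned dict corresponds to one email, so an indexer would have to decide whether to store one resource per message or one per mbox file.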

This page was last edited on 6 November 2012, at 01:23. Content is available under Creative Commons License SA 4.0 unless otherwise noted.