Sonnet is a spell checking library for Qt-based applications, with automatic language detection.

Generating trigram data files

To generate a trigram data file for a new language, you first need a text corpus in that language. Wikipedia dumps are an easy source; search online for instructions on extracting a plain-text corpus from a Wikipedia dump.

Next, use the "gentrigrams" tool to generate compatible trigram files from this text corpus. It is available from here:

Check it out, build it with "qmake && make", and then run it like so: "./gentrigrams ../path/to/corpus.txt languagecode". This reads the corpus.txt file and writes a file named "languagecode". Copy that file into data/trigrams in the sonnet repository. The sonnet build system automatically parses the files in that directory and creates a file that sonnet can load quickly and easily.
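For readers unfamiliar with trigram data, the following Python sketch illustrates the general idea of counting character trigrams in a corpus and keeping the most frequent ones. It is not the gentrigrams implementation and does not produce files in sonnet's format; the function names, the lowercasing and whitespace handling, and the default limit of 300 are assumptions made for illustration only.

```python
from collections import Counter


def count_trigrams(text: str) -> Counter:
    """Count every overlapping three-character sequence in the text."""
    # Assumption: lowercase the text and collapse whitespace runs so that
    # line breaks in the corpus do not produce spurious trigrams.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def top_trigrams(text: str, limit: int = 300) -> list[tuple[str, int]]:
    """Return the `limit` most frequent trigrams, most frequent first."""
    # The limit of 300 is a hypothetical default, not a sonnet setting.
    return count_trigrams(text).most_common(limit)


if __name__ == "__main__":
    sample = "the cat and the hat"
    for trigram, count in top_trigrams(sample, limit=3):
        print(repr(trigram), count)
```

A language detector can compare the trigram profile of an unknown text against per-language profiles like this one, which is why a larger and cleaner corpus generally gives a more reliable data file.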

This page was last edited on 29 December 2013, at 22:20. Content is available under Creative Commons License SA 4.0 unless otherwise noted.