Download the PHP package andywer/language-detector without Composer
On this page you can find all versions of the PHP package andywer/language-detector. These versions can be downloaded and installed without Composer. Any dependencies are resolved automatically.
Package: language-detector
Short description: PHP library to detect the language of any free text.
License: BSD-4-Clause
Information about the package language-detector
LanguageDetector 
PHP library to detect languages from any free text.
It follows the approach described in the paper: a given text is tokenized into N-grams (whitespace is cleaned up before this step), the tokens are sorted, and the result is compared against a language model.
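The tokenize-and-sort step can be sketched as follows. This is a minimal illustration, not the library's code: the function names `extractNgrams` and `buildProfile`, the N-gram lengths of 1 to 3, and the limit of 300 are assumptions, and the ranking here is a plain frequency count rather than the PageRank-based sorting the library uses by default.

```php
<?php
// Illustrative sketch only; names and defaults are hypothetical.

/**
 * Split a text into overlapping character N-grams of length $min..$max.
 */
function extractNgrams(string $text, int $min = 1, int $max = 3): array
{
    // Collapse runs of whitespace into single spaces and trim the ends.
    $clean = trim((string) preg_replace('/\s+/u', ' ', $text));

    // Split into UTF-8 characters (not bytes).
    $chars = preg_split('//u', $clean, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $count = count($chars);

    $ngrams = [];
    for ($n = $min; $n <= $max; $n++) {
        for ($i = 0; $i + $n <= $count; $i++) {
            $ngrams[] = implode('', array_slice($chars, $i, $n));
        }
    }
    return $ngrams;
}

/**
 * Build a ranked profile: the $limit most frequent N-grams, most frequent
 * first. Ties are broken alphabetically so the result is deterministic.
 */
function buildProfile(string $text, int $limit = 300): array
{
    $counts = array_count_values(extractNgrams($text));
    $ngrams = array_map('strval', array_keys($counts));

    usort($ngrams, function (string $a, string $b) use ($counts): int {
        return $counts[$b] <=> $counts[$a] ?: strcmp($a, $b);
    });

    return array_slice($ngrams, 0, $limit);
}
```

Calling `buildProfile($text)` on a sample text yields a ranked list of N-grams; a language model in this sketch would simply be one such list per language.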
Fork of crodas/languagedetector, since the original package seems abandoned.
How it works
The first thing we need is a language model (which looks like this file), against which texts are compared at classification time. The model must be generated before anything else; this can be done with a script similar to this file.
Once we have our language model file (in this case language.php), we're ready to classify texts by their language.
And that's it.
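The two-phase workflow (build the model once, then load it whenever a text is classified) might look roughly like the sketch below. It does not use the library's API: `saveModel` and `loadModel` are hypothetical helpers, and the array layout is an assumption, so the real language.php may be structured differently.

```php
<?php
// Illustrative sketch only; the real language.php format may differ.

/**
 * Write a model (language => ranked N-gram list) as a PHP file that
 * returns an array when included.
 */
function saveModel(array $profiles, string $path): void
{
    $code = "<?php\nreturn " . var_export($profiles, true) . ";\n";
    if (file_put_contents($path, $code) === false) {
        throw new RuntimeException("Could not write model to {$path}");
    }
}

/**
 * Load a model previously written by saveModel().
 */
function loadModel(string $path): array
{
    $model = include $path;
    if (!is_array($model)) {
        throw new RuntimeException("Invalid model file: {$path}");
    }
    return $model;
}
```

Storing the model as a PHP file keeps loading cheap, since it is a single `include` with no parsing code of your own.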
Algorithms
The project is modular, which means you can provide your own algorithms for sorting and comparing the N-grams. By default the library uses PageRank as the sorting algorithm and out-of-place (described in the paper) as the comparison algorithm.
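As a rough illustration of an out-of-place comparison, the sketch below sums how far each N-gram of the text sits from its rank in a language profile. The function names are hypothetical, and the penalty for N-grams missing from the language profile (here, the profile length) is an assumption; the library's own implementation may differ in these details.

```php
<?php
// Illustrative sketch only; names and the missing-N-gram penalty are assumptions.

/**
 * Out-of-place distance between two ranked N-gram lists (most frequent first).
 * A lower value means the two profiles are more similar.
 */
function outOfPlaceDistance(array $textProfile, array $languageProfile): int
{
    $ranks = array_flip(array_values($languageProfile)); // N-gram => rank
    $penalty = count($languageProfile); // cost of an N-gram not in the profile

    $distance = 0;
    foreach (array_values($textProfile) as $rank => $ngram) {
        $distance += isset($ranks[$ngram])
            ? abs($rank - $ranks[$ngram])
            : $penalty;
    }
    return $distance;
}

/**
 * Return the language whose profile has the smallest distance to the text,
 * or null if no language profiles were given.
 */
function closestLanguage(array $textProfile, array $languageProfiles): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($languageProfiles as $language => $profile) {
        $distance = outOfPlaceDistance($textProfile, $profile);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = (string) $language;
        }
    }
    return $best;
}
```

A custom comparison algorithm would replace `outOfPlaceDistance` with another measure while keeping the "smallest distance wins" selection.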
To supply your own algorithms, change the $config at the learning stage so that it loads your own classes (which, by the way, should implement the relevant interfaces).
Language Detection Training Files
Have a look at the example/samples directory. For more advanced training data, visit the Leipzig Corpora Download Page.
Languages with non-Latin characters
Remember to set the Config's mb property (before creating the language model) if you train for languages based on non-Latin characters. Use UTF-8 encoded texts.
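The following sketch shows why multibyte-aware handling matters for such texts. It does not touch the library's Config; `byteBigrams` and `charBigrams` are hypothetical helpers that contrast byte-wise and character-wise splitting of a UTF-8 string. How the mb property is used internally is not shown here and is not assumed.

```php
<?php
// Illustrative sketch only; helper names are hypothetical.

/**
 * Bigrams taken byte by byte. On multibyte UTF-8 text this cuts
 * characters apart and produces invalid fragments.
 */
function byteBigrams(string $text): array
{
    $bigrams = [];
    for ($i = 0; $i + 2 <= strlen($text); $i++) {
        $bigrams[] = substr($text, $i, 2);
    }
    return $bigrams;
}

/**
 * Bigrams taken character by character, treating the input as UTF-8.
 */
function charBigrams(string $text): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $bigrams = [];
    for ($i = 0; $i + 2 <= count($chars); $i++) {
        $bigrams[] = $chars[$i] . $chars[$i + 1];
    }
    return $bigrams;
}

// Two Japanese characters, written as escapes so the file encoding does not matter.
$sample = "\u{65E5}\u{672C}";

echo count(byteBigrams($sample)), " byte-wise bigrams\n";
echo count(charBigrams($sample)), " character-wise bigrams\n";
```

For plain ASCII input both functions return the same bigrams; the difference only appears once a character takes more than one byte.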