Download the PHP package nitotm/efficient-language-detector without Composer
On this page you can find all versions of the PHP package nitotm/efficient-language-detector. These versions can be downloaded and installed without Composer. Possible dependencies are resolved automatically.
Package efficient-language-detector
Short Description Fast and accurate natural language detection. Detector written in PHP. Nito-ELD, ELD.
License Apache-2.0
Homepage https://github.com/nitotm/efficient-language-detector
Information about the package efficient-language-detector
Efficient Language Detector
Efficient Language Detector (Nito-ELD or ELD) is fast and accurate natural language detection software written in PHP, with speed comparable to existing fast compiled C++ detectors and accuracy within the range of the heaviest and slowest detectors.
It has no dependencies and is 100% PHP, so installation is easy: all that is needed is PHP with the mbstring extension.
ELD is also available in JavaScript and Python.
- Installation
- How to use
- Benchmarks
- Testing
- Languages
Installation
Install the package nitotm/efficient-language-detector with Composer. Alternatively, downloading or cloning the files will work just fine.
How to use?
detect() expects a UTF-8 string and returns an object with a value named language (an ISO 639-1 code or null).
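Because language can be null, calling code usually needs a fallback. The following is a minimal sketch of that handling, not ELD code: `languageOrFallback` is a hypothetical helper, the `'und'` ("undetermined") default is an assumption, and the result objects are plain stand-ins shaped like the description above rather than real detect() output.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of ELD): returns the detected ISO 639-1 code,
 * or $fallback when the result's language value is null.
 */
function languageOrFallback(object $result, string $fallback = 'und'): string
{
    return $result->language ?? $fallback;
}

// Stand-in objects shaped like the documented result.
// In real use, the object would come from the detector's detect() call.
$detected = (object) ['language' => 'es'];
$undetected = (object) ['language' => null];

echo languageOrFallback($detected), PHP_EOL;   // prints "es"
echo languageOrFallback($undetected), PHP_EOL; // prints "und"
```

Keeping the null check in one place avoids scattering it across every call site.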
- To reduce the set of languages to be detected, there are 3 different options. Each only needs to be executed once. (Check the available languages below.)
If needed, we can get the current status of ELD: languages, database type and subset
Benchmarks
I compared ELD with a variety of detectors written in different programming languages, since the interesting part is the algorithm.
URL | Version | Language |
---|---|---|
https://github.com/nitotm/efficient-language-detector/ | 1.0.0 | PHP |
https://github.com/pemistahl/lingua-py | 1.3.2 | Python |
https://github.com/CLD2Owners/cld2 | Aug 21, 2015 | C++ |
https://github.com/google/cld3 | Aug 28, 2020 | C++ |
https://github.com/wooorm/franc | 6.1.0 | JavaScript |
https://github.com/patrickschur/language-detection | 5.2.0 | PHP |
Benchmarks:
- Tweets: 760KB, short sentences of 140 chars max.
- Big test: 10MB, sentences in all 60 languages supported.
- Sentences: 8MB, this is the Lingua sentences test, minus unsupported languages.
Short sentences are what ELD and most detectors focus on, as detection on very short text is unreliable, but I included the Lingua Word pairs (1.5MB) and Single words (880KB) tests to see how they all compare beyond their reliable limits.
These are the results: first execution time, then accuracy.
1. Lingua could have a small advantage, as it participates with 54 languages, 6 fewer.
2. CLD2 and CLD3 return a list of languages; the ones not included in this test were discarded. Since they usually return a single language, I believe they are at a disadvantage.
Also, I confirm that the CLD2 results for short text are correct. They differ from the test on the Lingua page because that test did not use the parameter "bestEffort = True", which makes its benchmark for CLD2 unfair.
Lingua is the average accuracy winner, but at what cost: the same test that takes 2 seconds in ELD or CLD2 takes more than 5 hours in Lingua! It behaves like brute-force software. Also, its lead comes from the single-word and word-pair tests, which are unreliable regardless.
I added ELD-L for comparison. Its database is 2.3x bigger, yet it only increases execution time marginally, a testament to the efficiency of the algorithm. ELD-L is not the main database, as it does not improve language detection in sentences.
Here is the average, per benchmark, of Tweets, Big test & Sentences.
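For readers who want to reproduce this kind of measurement, the two quantities reported (execution time and accuracy) can be computed with a small harness. This is a sketch under assumptions, not the code in benchmark/bench.php: `measure` and `averageAccuracy` are hypothetical names, the detector is any callable that maps a string to a language code or null, and the two samples are toy data, not benchmark data.

```php
<?php
declare(strict_types=1);

/**
 * Runs $detector over labelled samples and reports accuracy and elapsed time.
 * $samples is a list of [text, expectedLanguageCode] pairs.
 */
function measure(callable $detector, array $samples): array
{
    $correct = 0;
    $start = hrtime(true); // monotonic clock, in nanoseconds
    foreach ($samples as [$text, $expected]) {
        if ($detector($text) === $expected) {
            $correct++;
        }
    }
    $elapsedNs = hrtime(true) - $start;
    $total = count($samples);

    return [
        'accuracy' => $total > 0 ? $correct / $total : 0.0,
        'seconds' => $elapsedNs / 1e9,
    ];
}

/** Plain arithmetic mean of per-benchmark accuracies. */
function averageAccuracy(array $accuracies): float
{
    return $accuracies === [] ? 0.0 : array_sum($accuracies) / count($accuracies);
}

// Toy samples and a stub detector, only to show the call shape.
$samples = [['Hello, how are you?', 'en'], ['Hola, cómo te llamas?', 'es']];
$alwaysEnglish = static fn (string $text): ?string => 'en';

$result = measure($alwaysEnglish, $samples);
printf("accuracy: %.2f\n", $result['accuracy']); // prints "accuracy: 0.50"
```

In a real run, the stub would be replaced by a closure that calls the detector under test and returns its language value.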
Testing
- To make sure everything works on your setup, you can execute the following file:
- Also, for composer "autoload-dev", the following line will also execute the tests
- To run the accuracy benchmarks run the benchmark/bench.php file.
Languages
These are the ISO 639-1 codes of the 60 supported languages for Nito-ELD v1
am, ar, az, be, bg, bn, ca, cs, da, de, el, en, es, et, eu, fa, fi, fr, gu, he, hi, hr, hu, hy, is, it, ja, ka, kn, ko, ku, lo, lt, lv, ml, mr, ms, nl, no, or, pa, pl, pt, ro, ru, sk, sl, sq, sr, sv, ta, te, th, tl, tr, uk, ur, vi, yo, zh
Full language names:
Amharic, Arabic, Azerbaijani (Latin), Belarusian, Bulgarian, Bengali, Catalan, Czech, Danish, German, Greek, English, Spanish, Estonian, Basque, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Armenian, Icelandic, Italian, Japanese, Georgian, Kannada, Korean, Kurdish (Arabic), Lao, Lithuanian, Latvian, Malayalam, Marathi, Malay (Latin), Dutch, Norwegian, Oriya, Punjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovene, Albanian, Serbian (Cyrillic), Swedish, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Yoruba, Chinese
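When restricting detection to a subset of languages, it can help to check the requested codes against the supported list first. The sketch below does only that validation. `SUPPORTED_CODES` is copied from the list of codes above, while `splitSubset` is a hypothetical helper and not part of ELD.

```php
<?php
declare(strict_types=1);

// The 60 ISO 639-1 codes listed above for Nito-ELD v1.
const SUPPORTED_CODES = [
    'am', 'ar', 'az', 'be', 'bg', 'bn', 'ca', 'cs', 'da', 'de',
    'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'gu', 'he',
    'hi', 'hr', 'hu', 'hy', 'is', 'it', 'ja', 'ka', 'kn', 'ko',
    'ku', 'lo', 'lt', 'lv', 'ml', 'mr', 'ms', 'nl', 'no', 'or',
    'pa', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq', 'sr', 'sv',
    'ta', 'te', 'th', 'tl', 'tr', 'uk', 'ur', 'vi', 'yo', 'zh',
];

/**
 * Hypothetical helper: normalizes the requested codes and separates the
 * supported ones from the unknown ones.
 */
function splitSubset(array $requested): array
{
    $normalized = array_unique(array_map(
        static fn (string $code): string => strtolower(trim($code)),
        $requested
    ));

    return [
        'valid' => array_values(array_intersect($normalized, SUPPORTED_CODES)),
        'unknown' => array_values(array_diff($normalized, SUPPORTED_CODES)),
    ];
}

$split = splitSubset(['EN', 'es', 'xx']);
echo implode(',', $split['valid']), PHP_EOL;   // prints "en,es"
echo implode(',', $split['unknown']), PHP_EOL; // prints "xx"
```

Only the `valid` codes would then be passed on to whichever subset option is used.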
Future improvements
- Train on bigger datasets and on more languages.
- The tokenizer could separate out characters from languages that have their own alphabet, potentially improving accuracy and reducing the size of the N-grams database. Retraining and testing are needed.
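To illustrate the tokenizer idea, here is a rough sketch of how text in a script used by only one supported language could be recognized before any N-gram lookup. It is not ELD's tokenizer. The `UNIQUE_SCRIPTS` map is a small illustrative assumption (four scripts that, among the codes listed above, appear to belong to a single language), and `languageFromUniqueScript` is a hypothetical name. It relies on PCRE Unicode script properties, which need the `/u` flag.

```php
<?php
declare(strict_types=1);

// Illustrative assumption, not taken from ELD: scripts mapped to the one
// listed language that uses them.
const UNIQUE_SCRIPTS = [
    'Greek' => 'el',
    'Armenian' => 'hy',
    'Georgian' => 'ka',
    'Thai' => 'th',
];

/**
 * Returns the language whose unique script has the most characters in $text,
 * or null when none of the mapped scripts occurs.
 */
function languageFromUniqueScript(string $text): ?string
{
    $best = null;
    $bestCount = 0;
    foreach (UNIQUE_SCRIPTS as $script => $code) {
        // \p{Script} matches a single character belonging to that script.
        $count = preg_match_all('/\p{' . $script . '}/u', $text);
        if ($count !== false && $count > $bestCount) {
            $best = $code;
            $bestCount = $count;
        }
    }

    return $best;
}

var_dump(languageFromUniqueScript('Γειά σου'));
var_dump(languageFromUniqueScript('Hello there'));
```

Text that returns null here would still go through the regular N-gram detection, so a shortcut like this would only ever remove entries from the database, never replace it.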
Donate / Hire
If you wish to donate for open source improvements, hire me for private modifications / upgrades, or contact me, use the following link: https://linktr.ee/nitotm
All versions of efficient-language-detector with dependencies
ext-mbstring Version *