Download the PHP package onoi/tesa without Composer
Package tesa
Short Description A simple library to sanitize text elements
License GPL-2.0+
Homepage https://github.com/onoi/tesa
Information about the package tesa
Tesa (text sanitizer)
The library contains a small collection of helper classes to support sanitization of text or string elements of arbitrary length, with the aim of improving search match confidence during query execution. The library is required by the Semantic MediaWiki project and is deployed independently.
Requirements
- PHP 5.3 / HHVM 3.5 or later
- Recommended to enable the ICU extension
Installation
The recommended installation method for this library is by adding the following dependency to your composer.json.
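The dependency snippet itself is not shown here. As a minimal sketch, assuming a version constraint of ~0.1 (chosen only because 0.1.0 is the release listed in the release notes, not taken from the original instructions), the entry in composer.json might look like this:

```json
{
    "require": {
        "onoi/tesa": "~0.1"
    }
}
```

After editing the file, running composer update from the project root would normally fetch the package together with its dependencies.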
Usage
- SanitizerFactory is expected to be the sole entry point for services and instances when used outside of this library
- IcuWordBoundaryTokenizer is a preferred tokenizer in case the ICU extension is available
- NGramTokenizer is provided to increase CJK match confidence in case the back-end does not provide an explicit ngram tokenizer
- StopwordAnalyzer together with a LanguageDetector is provided as a means to reduce ambiguity of frequent "noise" words from a possible search index
- Synonymizer currently only provides an interface
Contribution and support
If you want to contribute work to the project, please subscribe to the developers mailing list and have a look at the contribution guidelines. A list of people who have made contributions in the past can be found here.
Tests
The library provides unit tests that cover the core functionality and are normally run by the continuous integration platform. Tests can also be executed manually using the composer phpunit command from the root directory.
Release notes
- 0.1.0 Initial release (2016-08-07)
- Added SanitizerFactory with support for a Tokenizer, LanguageDetector, Synonymizer, and StopwordAnalyzer interface
Acknowledgments
- The Transliterator uses the same diacritics conversion table as http://jsperf.com/latinize (except the German diaeresis ä, ü, and ö)
- The stopwords used by the StopwordAnalyzer have been collected from different sources; each json file identifies its origin
- CdbStopwordAnalyzer relies on wikimedia/cdb to avoid using an external database or cache layer (with extra stopwords being available here)
- JaTinySegmenterTokenizer is based on the work of Taku Kudo and his tiny_segmenter.js
- TextCatLanguageDetector uses the wikimedia/textcat library to make predictions about a language
License
All versions of tesa with dependencies
ext-mbstring Version *
wikimedia/cdb Version ~1.0
wikimedia/textcat Version ~1.1