Download the PHP package vladan-me/fingerprint without Composer
On this page you can find all versions of the php package vladan-me/fingerprint. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Package fingerprint
Short Description Provides a custom implementation of fingerprint and ngram algorithms in PHP
License MIT
Information about the package fingerprint
Fingerprint
Fingerprint is an algorithm that was developed for Google Refine (later OpenRefine). The (optional) improvement over the original algorithm is bolded.
- remove leading and trailing whitespace
- change all characters to their lowercase representation
- remove all punctuation and control characters
- ~~normalize extended western characters to their ASCII representation (for example "gödel" → "godel")~~
- **apply synonyms**
- **apply removals**
- split the string into whitespace-separated tokens
- sort the tokens and remove duplicates
- join the tokens back together
Transliteration is the slowest part of the original algorithm, and if you are dealing mostly with English text it is a waste of time. The original algorithm is also limited because it has no synonyms or removals. The synonyms and removals here are based on English, so they have limited applicability to other languages. Consider titles like:
- VP Sales and Marketing
- Vice President Marketing & Sales
- Vice President of Sales and Marketing
- Vice President - Sales and Marketing ... (+100 more ways to write that title, literally)
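The steps listed earlier can be sketched in a few lines of plain PHP to show how these titles collapse to a single key. This is a minimal illustration and not the package's API: the function name `fingerprint_sketch` and the tiny synonym and removal lists are assumptions made for the example. As a simplification, it splits into tokens before applying synonyms and removals, so that only whole tokens are matched.

```php
<?php

/**
 * Illustrative fingerprint key. Hypothetical helper, not the package's API.
 */
function fingerprint_sketch($value, array $synonyms = [], array $removals = [])
{
    // Trim and lowercase (mbstring handles non-ASCII letters when available).
    $value = trim($value);
    $value = function_exists('mb_strtolower')
        ? mb_strtolower($value, 'UTF-8')
        : strtolower($value);

    // Remove punctuation and control characters, but keep whitespace.
    $value = preg_replace('/\p{P}|(?!\s)\p{Cc}/u', '', $value);

    // Tokenize early so synonyms and removals match whole tokens only.
    $tokens = preg_split('/\s+/u', $value, -1, PREG_SPLIT_NO_EMPTY);

    $expanded = [];
    foreach ($tokens as $token) {
        // A synonym may expand to several tokens, e.g. "vp" => "vice president".
        $replacement = isset($synonyms[$token]) ? $synonyms[$token] : $token;
        foreach (explode(' ', $replacement) as $part) {
            $expanded[] = $part;
        }
    }

    // Drop removals, de-duplicate, sort, and join back together.
    $kept = array_unique(array_diff($expanded, $removals));
    sort($kept, SORT_STRING);

    return implode(' ', $kept);
}

// Assumed example lists; the package ships its own, larger ones.
$synonyms = ['vp' => 'vice president'];
$removals = ['and', 'of'];

$titles = [
    'VP Sales and Marketing',
    'Vice President Marketing & Sales',
    'Vice President of Sales and Marketing',
    'Vice President - Sales and Marketing',
];

foreach ($titles as $title) {
    echo fingerprint_sketch($title, $synonyms, $removals), PHP_EOL;
}
// Every title prints: marketing president sales vice
```

Without the synonym and removal lists, the first and third titles would produce different keys from the others, which is the limitation of the original algorithm described above.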
Use cases
- Simple and fast clustering of data.
- Standardization and grouping similar values in the database.
- Situations where you have users typing city/company/street/title in so many ways and you're slowly dying inside with so many combinations...
Documentation
Initialize a Fingerprint type and pass it as a parameter to Fingerprint.
More advanced usage is available for specific types, such as the Company type.
Please look at the tests for common usage.
Synonyms and Removals
They are broken down into two categories: basic synonyms/removals, which cover the most common cases, and all other possible combinations, which can be heavier to compute. For the fastest usage, you don't need all synonyms/removals. All of them are handpicked based on clusters from a large dataset. Of course, there are a lot more, but only the ones that make sense are listed. In some cases a term is both a synonym and a removal at the same time, for example, for the Company type:
'corp' first becomes 'corporation' and then is removed completely.
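The ordering matters here: because synonyms run before removals, only the canonical form needs to appear in the removal list. A small sketch of that two-pass idea follows. The function name `normalize_company_tokens` and both lists are hypothetical and only mirror the 'corp' example; they are not the package's actual data.

```php
<?php

// Hypothetical lists for a Company-like type.
$synonyms = ['corp' => 'corporation'];
$removals = ['corporation'];

function normalize_company_tokens(array $tokens, array $synonyms, array $removals)
{
    // Pass 1: synonyms collapse spelling variants to one canonical token.
    $canonical = array_map(function ($token) use ($synonyms) {
        return isset($synonyms[$token]) ? $synonyms[$token] : $token;
    }, $tokens);

    // Pass 2: removals drop the canonical token, whichever variant was typed.
    return array_values(array_diff($canonical, $removals));
}

print_r(normalize_company_tokens(['acme', 'corp'], $synonyms, $removals));
print_r(normalize_company_tokens(['acme', 'corporation'], $synonyms, $removals));
// Both calls leave only the 'acme' token.
```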
System Requirements
You need PHP >= 5.4.0.
Install
Install fingerprint using Composer (`composer require vladan-me/fingerprint`).
Additional Notes
There's another package named fingerprint-elasticsearch that prepares an Elasticsearch analyzer and filters to use this version of the fingerprint algorithm. This project currently also has an ngram implementation that should likely be separated out at some point.
Contributing
Contributions are welcome and will be fully credited. Please see CONDUCT for details.
License
The MIT License (MIT). Please see LICENSE for more information.