Download the PHP package turanjanin/serbian-language-tools without Composer

On this page you can find all versions of the PHP package turanjanin/serbian-language-tools. These versions can be downloaded and installed without Composer, and any dependencies are resolved automatically.

FAQ

After downloading, add a single include, require_once('vendor/autoload.php');, and then import the classes you need with use statements.

If you use only one package, a project is not needed. If you use more than one package, however, the classes cannot be imported with use statements without a project.

In general, it is recommended to always use a project to download your libraries, since an application normally needs more than one library.
Some PHP packages are not free to download and are therefore hosted in private repositories. Accessing such packages requires credentials. If a package comes from a private repository, enter the credentials in the auth.json textarea.

  • Some hosting environments do not offer terminal or SSH access, which makes it impossible to run Composer.
  • Composer can be complicated to use, especially for beginners.
  • Composer needs a lot of resources, which are not always available on simple web hosting.
  • If you are using private repositories, you don't need to share your credentials. You can set up everything on our site and then give your team members a simple download link.
  • Simplify your Composer build process. Use our own command line tool to download the vendor folder as a binary. This makes your build process faster, and you don't need to expose your credentials for private repositories.

Information about the package serbian-language-tools

Serbian Language Tools - PHP library for Transliteration & Diacritic Restoration

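Serbian Language Tools is a PHP library for dealing with text written in the Serbian language. It features tokenization, diacritic restoration, transliteration between the Cyrillic, Latin and ASCII alphabets, and alphabet detection.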

Requirements

This library requires PHP 7.4 or later with the sqlite3, intl and mbstring extensions.
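A quick way to confirm that an environment meets these requirements is to check them from PHP itself. The following is a minimal sketch; the helper name missingRequirements is hypothetical and not part of the library.

```php
<?php
// Hypothetical helper (not part of the library): returns a list of unmet
// requirements, based on the version and extensions named above.
function missingRequirements(): array
{
    $missing = [];

    if (version_compare(PHP_VERSION, '7.4.0', '<')) {
        $missing[] = 'PHP >= 7.4 (running ' . PHP_VERSION . ')';
    }

    foreach (['sqlite3', 'intl', 'mbstring'] as $extension) {
        if (!extension_loaded($extension)) {
            $missing[] = 'ext-' . $extension;
        }
    }

    return $missing;
}

$missing = missingRequirements();
echo $missing === []
    ? "All requirements met.\n"
    : 'Missing: ' . implode(', ', $missing) . "\n";
```

This is useful on shared hosting without shell access, where the script can simply be opened in a browser.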

Installation

You can install the package via Composer by running composer require turanjanin/serbian-language-tools.

Usage

To use the library, you first need to tokenize the string. Tokenization is the process of splitting a string into a series of tokens, each of which is a run of related characters. This library can recognize the following tokens: Word, Whitespace, URI (which includes URLs, hashtags and at-mentions), Interpunction, HTML and Emoticon.

Tokenizing is done by creating a new instance of the Text class using its named constructor.

The Text object will now contain an array of various tokens that can be processed. You can use this object like any other PHP array, since it implements the ArrayAccess interface.
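To illustrate the idea of tokenization, here is a toy tokenizer written with a single regular expression. It is only a sketch and is not the library's Text class: the function name tokenize and the array shape are assumptions, and only three of the token kinds (Word, Whitespace, Interpunction) are handled.

```php
<?php
// Illustrative only: a toy tokenizer, NOT the library's Text class.
// It splits a UTF-8 string into Word, Whitespace and Interpunction tokens.
function tokenize(string $input): array
{
    $pattern = '/(?<Word>[\p{L}\p{N}]+)'
             . '|(?<Whitespace>\s+)'
             . '|(?<Interpunction>[^\p{L}\p{N}\s]+)/u';

    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

    $tokens = [];
    foreach ($matches as $match) {
        // Exactly one named group is non-empty for each match.
        foreach (['Word', 'Whitespace', 'Interpunction'] as $type) {
            if (isset($match[$type]) && $match[$type] !== '') {
                $tokens[] = ['type' => $type, 'value' => $match[$type]];
                break;
            }
        }
    }

    return $tokens;
}

foreach (tokenize('Šta je to?') as $token) {
    echo $token['type'], ': "', $token['value'], "\"\n";
}
```

Because every character falls into exactly one of the three groups, joining the token values back together reproduces the original string.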

Diacritic Restoration / Diacritization

The Serbian Latin alphabet includes several letters that are not found in the ASCII character set. These letters - č, ć, š, ž, dž, đ - carry diacritics, which are often omitted in everyday communication (social media, email and SMS), mainly because of the widespread use of English keyboard layouts.

This degraded Latin alphabet is easily understood by human readers, but it poses a significant challenge for search engines and natural language processing. Therefore, this library features an algorithm for automated restoration of diacritics in ASCII text, using a dictionary of Serbian words and phrases for context disambiguation.

The algorithm inspects all Word tokens and looks for restoration candidates - the words with s, c, z or dj characters. After that, the following two steps are applied:

  1. The text is searched for the most common phrases and, if any are found, their words are replaced with the diacritical equivalents. This step takes word context into consideration, which makes it possible to favor some less frequent variations. For example, sto hiljada won't be replaced with što hiljada, even though the form što (why) has a much greater frequency than the word sto (hundred).

  2. Every restoration candidate is looked up in the dictionary and, if there are known variations, the token is replaced with a RestoredWord (if there is only one possible variation) or a MultipleRestoredWord (if there are several possible variations). When there is more than one variation, the one with the highest frequency is marked as preferred.
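The two steps above can be sketched in a few lines of plain PHP. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, the dictionary entries and frequencies are made-up placeholders, input is assumed to be lowercase and space-separated, and the sketch simply picks the preferred variation instead of producing RestoredWord or MultipleRestoredWord tokens.

```php
<?php
// Illustrative sketch only. Dictionary data below is a made-up placeholder,
// not taken from the library's SQLite dictionary.

// A word is a restoration candidate if it contains s, c, z or dj.
function isCandidate(string $word): bool
{
    return preg_match('/[scz]|dj/i', $word) === 1;
}

function restore(string $text, array $phrases, array $dictionary): string
{
    $words = explode(' ', $text);
    $locked = [];

    // Step 1: known two-word phrases fix their words, overriding raw frequency.
    for ($i = 0; $i < count($words) - 1; $i++) {
        $pair = $words[$i] . ' ' . $words[$i + 1];
        if (array_key_exists($pair, $phrases)) {
            [$words[$i], $words[$i + 1]] = explode(' ', $phrases[$pair]);
            $locked[$i] = $locked[$i + 1] = true;
        }
    }

    // Step 2: remaining candidates take the most frequent known variation.
    foreach ($words as $i => $word) {
        if (isset($locked[$i]) || !isCandidate($word) || !isset($dictionary[$word])) {
            continue;
        }
        $variations = $dictionary[$word];
        arsort($variations);                       // highest frequency first
        $words[$i] = array_key_first($variations); // the preferred variation
    }

    return implode(' ', $words);
}

// Placeholder data: phrase => restored phrase, word => [variation => frequency].
$phrases = ['sto hiljada' => 'sto hiljada'];
$dictionary = [
    'sto'  => ['što' => 900, 'sto' => 100],
    'kuca' => ['kuća' => 500, 'kuca' => 50],
    'zaba' => ['žaba' => 10],
];

echo restore('sto hiljada', $phrases, $dictionary), "\n";
echo restore('sto je to', $phrases, $dictionary), "\n";
```

Running the phrase step first is what keeps sto hiljada intact, while a standalone sto still falls through to the frequency lookup.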

Diacritic restoration is performed by calling an invokable class provided by the library.

The dictionary needed for this algorithm is stored in a custom-made SQLite database that is included with the library. You can extend this database or use a different storage solution by providing a custom implementation of the Turanjanin\SerbianLanguageTools\Dictionary\Dictionary interface.

Transliteration

The library supports transliteration of text between the Cyrillic, Latin and ASCII alphabets. Transliteration is performed by calling the appropriate invokable class.
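As a rough illustration of what Latin to Cyrillic transliteration involves, the sketch below uses a lookup table and strtr, which always tries the longest key first, so the digraphs lj, nj and dž are converted before their individual letters. The function name latinToCyrillic is hypothetical and this is not the library's code; in particular, it ignores words in which such letter pairs are not digraphs (for example, nj in injekcija).

```php
<?php
// Illustrative only: a table-driven Latin -> Cyrillic converter,
// NOT the library's transliteration classes. Requires ext-mbstring.
function latinToCyrillic(string $text): string
{
    $lower = [
        'lj' => 'љ', 'nj' => 'њ', 'dž' => 'џ',
        'a' => 'а', 'b' => 'б', 'v' => 'в', 'g' => 'г', 'd' => 'д',
        'đ' => 'ђ', 'e' => 'е', 'ž' => 'ж', 'z' => 'з', 'i' => 'и',
        'j' => 'ј', 'k' => 'к', 'l' => 'л', 'm' => 'м', 'n' => 'н',
        'o' => 'о', 'p' => 'п', 'r' => 'р', 's' => 'с', 't' => 'т',
        'ć' => 'ћ', 'u' => 'у', 'f' => 'ф', 'h' => 'х', 'c' => 'ц',
        'č' => 'ч', 'š' => 'ш',
    ];

    // Derive upper-case (LJ) and title-case (Lj) keys from the lower-case map.
    $map = $lower;
    foreach ($lower as $latin => $cyrillic) {
        $upper = mb_strtoupper($cyrillic, 'UTF-8');
        $map[mb_strtoupper($latin, 'UTF-8')] = $upper;
        $map[mb_convert_case($latin, MB_CASE_TITLE, 'UTF-8')] = $upper;
    }

    // strtr() with an array replaces the longest matching key first.
    return strtr($text, $map);
}

echo latinToCyrillic('Ljubav'), "\n";
echo latinToCyrillic('Beograd'), "\n";
```

The reverse direction can be built the same way by flipping the table, although mapping back to ASCII additionally requires dropping the diacritics.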

If you need only transliteration between Latin and Cyrillic alphabets, take a look at the simpler library - turanjanin/serbian-transliterator.

Alphabet Detection

The library can be used to detect whether text is written in the Serbian Cyrillic or Latin alphabet.
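One simple way to approach alphabet detection is to count characters by Unicode script. The sketch below does this with PCRE script properties; the function name detectAlphabet and its return labels are assumptions made for illustration and do not reflect the library's API.

```php
<?php
// Illustrative only: script-based alphabet detection using PCRE Unicode
// properties. NOT the library's detection class.
function detectAlphabet(string $text): string
{
    $cyrillic = preg_match_all('/\p{Cyrillic}/u', $text);
    $latin = preg_match_all('/\p{Latin}/u', $text);

    if ($cyrillic === 0 && $latin === 0) {
        return 'unknown';   // digits, punctuation or empty input
    }
    if ($cyrillic > 0 && $latin > 0) {
        return 'mixed';
    }

    return $cyrillic > 0 ? 'cyrillic' : 'latin';
}

echo detectAlphabet('Београд'), "\n";
echo detectAlphabet('Beograd'), "\n";
```

Note that this only identifies the script. It cannot tell Serbian apart from other languages written in the same alphabet.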

Author

License

The MIT License (MIT). Please see License File for more information.


All versions of serbian-language-tools with dependencies

Requires:
  • php: ^7.4|^8.0
  • ext-sqlite3: *
  • ext-intl: *
  • ext-mbstring: *
