Download the PHP package farzai/thai-word without Composer
On this page you can find all versions of the PHP package farzai/thai-word. You can download and install these versions without Composer; any dependencies are resolved automatically.
Package: thai-word
Short Description: Thai word segmentation library for PHP
License: MIT
Homepage: https://github.com/parsilver/thai-word-php
Information about the package thai-word
Thai Word Segmentation - PHP Library
A library for Thai word segmentation in PHP.
Features
- Thai word segmentation with high accuracy
- Word suggestions for typos and misspellings
- Dictionary loading from local files and remote URLs
- Performance optimizations with caching and memory management
- Batch processing for large text volumes
- Custom configuration with caching, memory limit, and batch size
- Mixed content support (Thai, English, numbers, punctuation)
Requirements
- PHP 8.4+
- Composer
Installation
You can install the package via Composer:
composer require farzai/thai-word
Basic Usage
Using the Facade (Recommended)
Using ThaiSegmenter Directly
Word Suggestions for Typos
How It Works
This library segments Thai text into words and provides intelligent word suggestions through a highly optimized process. Here's how it works step by step:
Step 1: Text Input & Validation
- You provide Thai text as a string to the ThaiSegmenter
- Example: 'สวัสดีครับผมชื่อสมชาย'
- The library validates UTF-8 encoding and handles empty strings
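The validation step can be sketched as follows. This is a minimal illustration, not the library's code: the helper name validateInput is hypothetical, and it assumes malformed UTF-8 is detected with PCRE's /u modifier.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library's API): returns the input
 * unchanged when it is valid UTF-8 and throws otherwise.
 * An empty string counts as valid and is returned as-is.
 */
function validateInput(string $text): string
{
    // With the /u modifier PCRE rejects malformed UTF-8, so preg_match()
    // returns false (not 0 or 1) for invalid input.
    if (preg_match('//u', $text) !== 1) {
        throw new InvalidArgumentException('Input must be valid UTF-8.');
    }

    return $text;
}

$checked = validateInput('สวัสดีครับ');
```

A caller would typically return an empty result early when the validated string is empty, before any dictionary work starts.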
Step 2: Dictionary Loading (Automatic)
The library automatically loads Thai words from several sources, falling back in this order:
- LibreOffice Thai Dictionary: Downloads from the official LibreOffice repository (primary source)
- Local Dictionary Files: Falls back to local dictionary files if available
- Basic Dictionary: Uses built-in common Thai words as a last resort
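One way to picture this fallback order is a list of loaders tried in sequence. The sketch below is an assumption about the general pattern only: loadWithFallback and the three closures are hypothetical stand-ins, and no real download or file access happens.

```php
<?php
declare(strict_types=1);

/**
 * Tries each loader in order and returns the first non-empty word list.
 * Each loader is a hypothetical callable that returns string[] or throws.
 *
 * @param array<string, callable(): array> $loaders
 * @return string[]
 */
function loadWithFallback(array $loaders): array
{
    foreach ($loaders as $loader) {
        try {
            $words = $loader();
            if ($words !== []) {
                return $words;
            }
        } catch (Throwable $e) {
            // This source failed; fall through to the next one.
        }
    }

    return [];
}

$words = loadWithFallback([
    // Simulates the remote source being unreachable.
    'remote' => function (): array {
        throw new RuntimeException('network unavailable');
    },
    // Simulates no local dictionary file being present.
    'local' => fn (): array => [],
    // Stand-in for a small built-in word list.
    'builtin' => fn (): array => ['สวัสดี', 'ครับ', 'ผม'],
]);
```

Swallowing every Throwable keeps the sketch short; a real loader would more likely log the failure or catch a narrower exception type.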
The dictionary is stored in a HashDictionary with O(1) lookup performance.
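A hash-based dictionary of this kind can be approximated with a PHP array whose keys are the words, since isset() on an array key is an average-case constant-time operation. The class below is an illustrative stand-in with hypothetical names, not the library's HashDictionary.

```php
<?php
declare(strict_types=1);

/** Illustrative stand-in for a hash-based dictionary; not the library's class. */
final class SimpleHashDictionary
{
    /** @var array<string, true> words stored as array keys */
    private array $words = [];

    /** Length in code points of the longest stored word. */
    private int $maxLength = 0;

    /** @param string[] $words valid UTF-8 strings */
    public function __construct(array $words = [])
    {
        foreach ($words as $word) {
            $this->add($word);
        }
    }

    public function add(string $word): void
    {
        $this->words[$word] = true;
        // Count code points rather than bytes (Thai letters are 3 bytes each).
        $length = count(preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY));
        $this->maxLength = max($this->maxLength, $length);
    }

    public function contains(string $word): bool
    {
        return isset($this->words[$word]);
    }

    public function maxWordLength(): int
    {
        return $this->maxLength;
    }
}

$dictionary = new SimpleHashDictionary(['สวัสดี', 'ครับ', 'ผม']);
```

Tracking the longest word is a common companion to this structure, because a matcher can then cap how many characters it tries at each position.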
Step 3: Smart Text Processing
The LongestMatchingStrategy algorithm processes text intelligently:
Character Classification:
- Thai characters: Unicode range 0x0E00-0x0E7F for fast detection
- English words: Handled as complete word units
- Numbers: Processed as number sequences (with decimals, commas)
- Punctuation: Handled appropriately with whitespace normalization
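The classification above can be sketched as a per-character check. The function name and the category labels are assumptions made for illustration; only the Thai range 0x0E00-0x0E7F comes from the text.

```php
<?php
declare(strict_types=1);

/**
 * Classifies a single character (one code point) into a coarse category.
 * The category names are illustrative, not taken from the library.
 */
function classifyChar(string $char): string
{
    // Thai block U+0E00..U+0E7F. Note that Thai digits also live here.
    if (preg_match('/^[\x{0E00}-\x{0E7F}]\z/u', $char) === 1) {
        return 'thai';
    }
    if (preg_match('/^[A-Za-z]\z/', $char) === 1) {
        return 'english';
    }
    if (preg_match('/^[0-9]\z/', $char) === 1) {
        return 'number';
    }
    if (preg_match('/^\s\z/u', $char) === 1) {
        return 'whitespace';
    }

    return 'punctuation';
}

$categories = array_map('classifyChar', ['ก', 'a', '7', ' ', '!']);
```

Checking the Thai range first reflects the expectation that most input characters are Thai, so the common case exits after a single test.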
Step 4: Longest Matching Algorithm
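A minimal sketch of greedy longest matching, assuming the textbook form of the technique: at each position, try the longest dictionary word first and fall back to a single character when nothing matches. The function longestMatchSegment is hypothetical, and the library's LongestMatchingStrategy may differ in its details.

```php
<?php
declare(strict_types=1);

/**
 * Greedy longest-match segmentation over Unicode code points.
 * Assumes $text and all dictionary entries are valid UTF-8.
 *
 * @param string[] $dictionary
 * @return string[]
 */
function longestMatchSegment(string $text, array $dictionary): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $lookup = array_fill_keys($dictionary, true);

    // The longest dictionary word bounds how far ahead we need to look.
    $maxLen = 0;
    foreach ($dictionary as $word) {
        $maxLen = max($maxLen, count(preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY)));
    }

    $segments = [];
    $total = count($chars);
    $i = 0;
    while ($i < $total) {
        $matched = 1; // default: emit one unmatched character
        for ($len = min($maxLen, $total - $i); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $i, $len));
            if (isset($lookup[$candidate])) {
                $matched = $len;
                break;
            }
        }
        $segments[] = implode('', array_slice($chars, $i, $matched));
        $i += $matched;
    }

    return $segments;
}

$dictionary = ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย'];
$segments = longestMatchSegment('สวัสดีครับผมชื่อสมชาย', $dictionary);
```

Greedy matching is fast but can pick a long word that leaves an unsegmentable remainder; the sketch does no backtracking to recover from that.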
Step 5: Word Suggestion System (Optional)
When enabled, the library can suggest corrections for typos using advanced similarity algorithms:
Levenshtein Distance Algorithm:
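PHP's built-in levenshtein() compares bytes, which overstates distances for Thai text because each Thai letter occupies three bytes in UTF-8. The sketch below computes the distance over code points instead. The function names, and the normalization of distance into a 0-1 similarity score, are assumptions for illustration rather than the library's confirmed formula.

```php
<?php
declare(strict_types=1);

/** Splits a valid UTF-8 string into an array of code points. */
function toChars(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

/** Levenshtein distance counted in code points, using two rolling rows. */
function unicodeLevenshtein(string $a, string $b): int
{
    $s = toChars($a);
    $t = toChars($b);
    $m = count($s);
    $n = count($t);
    if ($m === 0) {
        return $n;
    }
    if ($n === 0) {
        return $m;
    }

    $prev = range(0, $n);
    for ($i = 1; $i <= $m; $i++) {
        $curr = [$i];
        for ($j = 1; $j <= $n; $j++) {
            $cost = $s[$i - 1] === $t[$j - 1] ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,        // deletion
                $curr[$j - 1] + 1,    // insertion
                $prev[$j - 1] + $cost // substitution or match
            );
        }
        $prev = $curr;
    }

    return $prev[$n];
}

/** Assumed normalization: 1 - distance / length of the longer string. */
function similarity(string $a, string $b): float
{
    $maxLen = max(count(toChars($a)), count(toChars($b)));

    return $maxLen === 0 ? 1.0 : 1.0 - unicodeLevenshtein($a, $b) / $maxLen;
}

$byteDistance = levenshtein('ก', 'a');        // counts bytes
$charDistance = unicodeLevenshtein('ก', 'a'); // counts code points
```

Under this assumed normalization, a one-character input compared with a two-character word scores exactly 0.5.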
Smart Suggestion Integration:
- Single-character only: segmentWithSuggestions() only provides suggestions for single-character segments that are NOT in the dictionary
- Multi-character words: Use the suggest() method directly for multi-character word suggestions
- Threshold requirements: Single-character similarities max out at 0.5, so use threshold ≤ 0.5 for best results
- Configurable similarity thresholds: 0.4-0.5 for single characters, 0.6-0.7 for multi-character words
- Performance-optimized: Caching and length-based filtering for large dictionaries
- Unicode-aware for proper Thai character handling
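Length-based filtering can be sketched as follows, assuming similarity is defined as 1 - distance / longer length (an assumption, not a confirmed detail of the library). Because the edit distance is never smaller than the difference in lengths, the lengths alone give an upper bound on similarity, and candidates whose bound is below the threshold can be skipped without computing a distance. All names here are hypothetical, and the demo uses ASCII words with PHP's byte-based levenshtein() purely to keep the example short.

```php
<?php
declare(strict_types=1);

/** Number of code points in a valid UTF-8 string. */
function charLength(string $s): int
{
    return count(preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY));
}

/** Best similarity two strings of these lengths could possibly reach. */
function similarityUpperBound(int $la, int $lb): float
{
    $max = max($la, $lb);

    return $max === 0 ? 1.0 : 1.0 - abs($la - $lb) / $max;
}

/**
 * @param string[] $dictionary
 * @param callable(string, string): float $similarity
 * @return array<string, float> candidates at or above the threshold, best first
 */
function suggestWithFilter(string $word, array $dictionary, float $threshold, callable $similarity): array
{
    $length = charLength($word);
    $scores = [];
    foreach ($dictionary as $candidate) {
        if (similarityUpperBound($length, charLength($candidate)) < $threshold) {
            continue; // cannot reach the threshold; skip the costly comparison
        }
        $score = $similarity($word, $candidate);
        if ($score >= $threshold) {
            $scores[$candidate] = $score;
        }
    }
    arsort($scores);

    return $scores;
}

// Demo similarity for ASCII input only; it also counts how often it is called.
$calls = 0;
$asciiSimilarity = function (string $a, string $b) use (&$calls): float {
    $calls++;

    return 1.0 - levenshtein($a, $b) / max(strlen($a), strlen($b));
};

$suggestions = suggestWithFilter(
    'helo',
    ['hello', 'help', 'h', 'internationalization'],
    0.6,
    $asciiSimilarity
);
```

The same bound gives one plausible reading of the 0.5 ceiling for single characters: a one-character input that is not itself in the dictionary can at best be one insertion away from a two-character word, and similarityUpperBound(1, 2) is 0.5.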
Step 6: Performance Optimizations
The library includes several optimizations:
- Caching: Recently segmented texts are cached for faster repeat processing
- Batch Processing: Large texts are processed in chunks to manage memory
- Memory Management: Automatic garbage collection and memory optimization
- Adaptive Processing: Different strategies for short, medium, and long texts
- Suggestion Caching: Distance calculations cached for repeated similarity checks
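The caching idea can be illustrated with a small bounded cache wrapped around any segmentation function. This is a sketch under assumptions: CachedSegmenter is a hypothetical class, the eviction policy shown is least-recently-used, and the wrapped closure is a whitespace splitter standing in for a real segmenter.

```php
<?php
declare(strict_types=1);

/** Hypothetical wrapper that memoizes results with least-recently-used eviction. */
final class CachedSegmenter
{
    /** @var array<string, string[]> insertion order doubles as recency order */
    private array $cache = [];

    private int $hits = 0;

    public function __construct(
        private Closure $segment,
        private int $capacity = 100
    ) {
    }

    /** @return string[] */
    public function segment(string $text): array
    {
        if (isset($this->cache[$text])) {
            $this->hits++;
            $result = $this->cache[$text];
            // Re-insert so this entry becomes the most recently used.
            unset($this->cache[$text]);
            $this->cache[$text] = $result;

            return $result;
        }

        $result = ($this->segment)($text);
        if (count($this->cache) >= $this->capacity) {
            // The first key is the least recently used entry.
            unset($this->cache[array_key_first($this->cache)]);
        }
        $this->cache[$text] = $result;

        return $result;
    }

    public function hits(): int
    {
        return $this->hits;
    }
}

$calls = 0;
$cached = new CachedSegmenter(
    function (string $text) use (&$calls): array {
        $calls++;

        return explode(' ', $text); // stand-in for real segmentation
    },
    2
);
$first = $cached->segment('a b');
$second = $cached->segment('a b'); // served from the cache
```

Bounding the cache matters for long-running processes, where an unbounded map of input text to results would grow without limit.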
Step 7: Mixed Content Handling
- Thai words are processed with dictionary lookup
- English words are kept as complete units
- Numbers and punctuation are handled appropriately
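Mixed content can be split into typed runs before any dictionary lookup, so that only the Thai runs go through word segmentation. The regular expression below is an illustrative assumption about how such a pre-tokenizer might look; splitMixedContent is a hypothetical name.

```php
<?php
declare(strict_types=1);

/**
 * Splits text into Thai runs, English words, numbers, and punctuation marks.
 * Whitespace is dropped. Thai runs would still need dictionary segmentation.
 *
 * @return string[]
 */
function splitMixedContent(string $text): array
{
    $pattern = '/[\x{0E00}-\x{0E7F}]+'               // run of Thai characters
        . '|[A-Za-z]+'                               // English word kept whole
        . '|[0-9]+(?:[.,][0-9]+)*'                   // number with decimals or commas
        . '|[^\s\x{0E00}-\x{0E7F}A-Za-z0-9]/u';      // any other single symbol

    preg_match_all($pattern, $text, $matches);

    return $matches[0];
}

$tokens = splitMixedContent('ราคา 1,250.50 บาท (USD)!');
```

Keeping numbers such as 1,250.50 as one token avoids splitting them at the comma or decimal point, which a character-by-character pass would otherwise do.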
Key Components
- ThaiSegmenter: Main orchestrator with performance monitoring and suggestion integration
- HashDictionary: O(1) hash-based word lookup with 70% less memory usage than trie structures
- LongestMatchingStrategy: Optimized algorithm with character classification
- LevenshteinSuggestionStrategy: Unicode-aware word suggestion algorithm with caching
- DictionaryLoaderService: Handles loading from files, URLs, and remote sources
Performance Features
- 3-5x faster processing speed with optimized algorithms
- 50% lower memory usage with hash-based dictionary
- Intelligent suggestions with configurable accuracy thresholds
- Automatic optimization based on text characteristics
- Built-in statistics for performance monitoring
Real Usage Examples
Using the Facade (Simple & Clean)
Using ThaiSegmenter Directly (Advanced Control)
This architecture ensures both accuracy and performance while remaining simple to use.
Advanced Usage
Custom Suggestion Strategies
Performance Monitoring with Suggestions
Batch Processing with Suggestions
Understanding Suggestion Behavior
Important: The segmentWithSuggestions() method only provides suggestions for single-character segments that are NOT found in the dictionary.
Threshold Guidelines:
- Single characters: Use 0.4-0.5 (similarities max out at 0.5)
- Multi-character words: Use 0.6-0.7 (higher precision possible)
Configuration Options
Testing
Changelog
Please see CHANGELOG for more information on what has changed recently.
Contributing
Please see CONTRIBUTING for details.
Security Vulnerabilities
Please review our security policy on how to report security vulnerabilities.
Credits
- parsilver
- All Contributors
Data Sources
- LibreOffice Thai Dictionary - Primary Thai word dictionary source
License
The MIT License (MIT). Please see License File for more information.
All versions of thai-word with dependencies
- psr/http-client: ^1.0
- psr/http-factory: ^1.0
- php-http/discovery: ^1.19
- guzzlehttp/psr7: ^2.0