Download the PHP package pcrov/unicode without Composer
On this page you can find all versions of the php package pcrov/unicode. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Package unicode
Short Description Miscellaneous Unicode utility functions
License MIT
Homepage https://github.com/pcrov/unicode
Information about the package unicode
Unicode
Miscellaneous Unicode utility functions.
Functions
Namespace pcrov\Unicode.
surrogate_pair_to_code_point(int $high, int $low): int
Translates a UTF-16 surrogate pair into a single code point. Wikipedia's UTF-16 article explains surrogate pairs fairly well.
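The arithmetic behind the translation is short enough to sketch. The function below is a hypothetical stand-in, not the package's code: the name sketch_surrogate_pair_to_code_point() is made up, and throwing on out-of-range input is an assumption (how the real function treats bad input is not stated here).

```php
<?php
// Hypothetical sketch of the UTF-16 surrogate pair arithmetic.
// Not the pcrov/unicode implementation.
function sketch_surrogate_pair_to_code_point(int $high, int $low): int
{
    // Assumption: reject anything outside the surrogate ranges.
    if ($high < 0xD800 || $high > 0xDBFF) {
        throw new InvalidArgumentException('High surrogate must be in D800..DBFF');
    }
    if ($low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException('Low surrogate must be in DC00..DFFF');
    }

    // Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}
```

For example, the pair D83D DE00 maps to U+1F600.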
utf8_find_invalid_byte_sequence(string $string): ?int
Returns the position of the first invalid byte sequence or null if the input is valid.
utf8_get_invalid_byte_sequence(string $string): ?string
Returns the first invalid byte sequence or null if the input is valid.
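One way to picture the "position" result is to match the longest well-formed prefix and report where it stops. The sketch below does that with a regular expression built from Table 3-7 of the Unicode standard. It is not how the package works internally (that is not shown here); the function name is hypothetical, and treating the position as a byte offset is an assumption.

```php
<?php
// Hypothetical sketch: byte offset of the first ill-formed UTF-8 sequence,
// or null when the whole string is well formed. Not the package's code.
function sketch_find_invalid_offset(string $string): ?int
{
    // One alternative per row of Table 3-7; no /u flag, so PCRE works on bytes.
    // The possessive *+ avoids backtracking into already accepted characters.
    $wellFormedPrefix = '/\A(?:
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | [\xEE-\xEF][\x80-\xBF]{2}
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )*+/x';

    if (preg_match($wellFormedPrefix, $string, $matches) !== 1) {
        throw new RuntimeException('PCRE error: ' . preg_last_error());
    }

    $validLength = strlen($matches[0]);
    return $validLength === strlen($string) ? null : $validLength;
}
```

A regex keeps the sketch short, but very long inputs may run into PCRE's own limits, which is one reason a byte-by-byte walk can be preferable.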
utf8_get_state_machine(): array
Provides a state machine letting you walk a (potentially endless) UTF-8 sequence byte by byte.
It is in the form of [byte => [valid next byte => ...,], ...]
Example use:
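A rough sketch of walking a machine of this shape byte by byte. The toy machine here is hand-built and hypothetical: it is not the array returned by utf8_get_state_machine(), it only knows ASCII and two-byte sequences, it uses integer byte values as keys, and it treats an empty array as "one character finished, start over". All of those are assumptions made for illustration.

```php
<?php
// Hypothetical toy machine in the [byte => [valid next byte => ...]] shape.
// Covers only 00..7F and the two-byte form C2..DF followed by 80..BF.
function toy_utf8_machine(): array
{
    $continuation = [];
    for ($b = 0x80; $b <= 0xBF; $b++) {
        $continuation[$b] = []; // assumption: empty array ends a character
    }

    $machine = [];
    for ($b = 0x00; $b <= 0x7F; $b++) {
        $machine[$b] = [];
    }
    for ($b = 0xC2; $b <= 0xDF; $b++) {
        $machine[$b] = $continuation;
    }
    return $machine;
}

// Feeds $bytes through $machine one byte at a time. Returns the offset where
// the first rejected or unfinished character starts, or null if all is fine.
function toy_walk(array $machine, string $bytes): ?int
{
    $state = $machine;
    $start = 0; // offset where the current character began
    $length = strlen($bytes);

    for ($i = 0; $i < $length; $i++) {
        $byte = ord($bytes[$i]);
        if (!isset($state[$byte])) {
            return $start; // no transition for this byte
        }
        $state = $state[$byte];
        if ($state === []) {
            // A complete character was consumed; go back to the root.
            $state = $machine;
            $start = $i + 1;
        }
    }

    // Input that stops in the middle of a character is reported as well.
    return $start === $length ? null : $start;
}
```

Because the toy machine stops at two-byte sequences, it rejects valid three- and four-byte characters; a full machine would add the remaining rows of Table 3-7 in the same nested fashion.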
utf8_validate(string $string): bool
Does what it says on the box: returns true if the input is valid UTF-8, false otherwise.
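For comparison, plain PHP can do a similar check without this package by leaning on PCRE's UTF-8 mode. This is a well-known trick rather than the package's approach, and the function name is hypothetical.

```php
<?php
// Hypothetical sketch: validate UTF-8 with PCRE's /u mode.
// preg_match() returns false when the subject is not valid UTF-8.
function sketch_utf8_validate(string $string): bool
{
    return preg_match('//u', $string) === 1;
}
```

The empty pattern matches any string, so the only way for the call to fail is the UTF-8 check PCRE runs on the subject first.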
Data
The test/data directory holds two files containing all possible UTF-8 encoded characters,
all 1,112,064 of them: one as plain text, the other as JSON. These are not included in
packaged stable releases but can be generated with the example utf8_generate_all_code_points()
function above (which returns the plain text string).
Excerpts from the Unicode 10.0.0 standard:
Recreated here for ease of reference. Nobody likes PDFs.
Table 3-6. UTF-8 Bit Distribution
Scalar Value | First Byte | Second Byte | Third Byte | Fourth Byte |
---|---|---|---|---|
00000000 0xxxxxxx | 0xxxxxxx | | | |
00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
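
Table 3-6 translates fairly directly into shifts and masks. The encoder below is a sketch for illustration only; sketch_encode_utf8() is a hypothetical name and not a function of this package, and rejecting surrogates and out-of-range values with an exception is an assumption.

```php
<?php
// Hypothetical sketch: encode one Unicode scalar value as UTF-8,
// following the bit distribution in Table 3-6.
function sketch_encode_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp <= 0x7F) {
        // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // 110yyyyy 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
            . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // 1110zzzz 10yyyyyy 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    // 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}
```

For example, U+20AC comes out as the three bytes E2 82 AC.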
Table 3-7. Well-Formed UTF-8 Byte Sequences
Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
---|---|---|---|---|
U+0000..U+007F | 00..7F | | | |
U+0080..U+07FF | C2..DF | 80..BF | | |
U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
U+D000..U+D7FF | ED | 80..9F | 80..BF | |
U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
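
As a sanity check, the rows of Table 3-7 can be multiplied out to see where the 1,112,064 figure in the Data section comes from. The function below is a hypothetical sketch that only does this arithmetic; it is not part of the package.

```php
<?php
// Hypothetical sketch: count every well-formed UTF-8 sequence by multiplying
// the sizes of the byte ranges in each row of Table 3-7.
function sketch_count_well_formed_sequences(): int
{
    // One entry per table row; each byte position is an inclusive [low, high].
    $rows = [
        [[0x00, 0x7F]],
        [[0xC2, 0xDF], [0x80, 0xBF]],
        [[0xE0, 0xE0], [0xA0, 0xBF], [0x80, 0xBF]],
        [[0xE1, 0xEC], [0x80, 0xBF], [0x80, 0xBF]],
        [[0xED, 0xED], [0x80, 0x9F], [0x80, 0xBF]],
        [[0xEE, 0xEF], [0x80, 0xBF], [0x80, 0xBF]],
        [[0xF0, 0xF0], [0x90, 0xBF], [0x80, 0xBF], [0x80, 0xBF]],
        [[0xF1, 0xF3], [0x80, 0xBF], [0x80, 0xBF], [0x80, 0xBF]],
        [[0xF4, 0xF4], [0x80, 0x8F], [0x80, 0xBF], [0x80, 0xBF]],
    ];

    $total = 0;
    foreach ($rows as $row) {
        $sequences = 1;
        foreach ($row as [$low, $high]) {
            $sequences *= $high - $low + 1;
        }
        $total += $sequences;
    }
    return $total;
}
```

The result is 1,112,064, which is 0x110000 code points minus the 2,048 surrogates.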