License MIT
Tokenizer
A zero-dependency tokenizer written in PHP. Returns an easily navigable Stream object of Token objects with public type, value, offset and length properties.
Simply pass an associative array of the form [match_pattern => type], for example ['\s+' => 'T_WHITESPACE', '[a-zA-Z]\w+' => 'T_STRING'], and the Tokenizer will return all matches as an array of Token objects.
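To make the idea concrete, here is a minimal, self-contained sketch of how a [match_pattern => type] map can be turned into token objects using nothing but preg_match_all(). It is not the library's implementation: SketchToken and sketch_tokenize() are hypothetical names used only for illustration, and the sketch assumes PHP 8.0+ and patterns that contain neither capturing groups nor the "~" delimiter.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a Token: four public properties.
final class SketchToken
{
    public function __construct(
        public string $type,
        public string $value,
        public int $offset,
        public int $length
    ) {
    }
}

/**
 * @param array<string, string> $lexicon [match_pattern => type]
 * @return SketchToken[]
 */
function sketch_tokenize(string $input, array $lexicon): array
{
    $types = array_values($lexicon);

    // Wrap every pattern in its own capture group and join the groups with "|".
    $groups = array_map(fn(string $p): string => '(' . $p . ')', array_keys($lexicon));
    $regex = '~' . implode('|', $groups) . '~';

    preg_match_all($regex, $input, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $tokens = [];
    foreach ($matches as $match) {
        // Group 0 is the whole match; groups 1..n line up with the lexicon order.
        // With PREG_OFFSET_CAPTURE, a group that did not take part has offset -1.
        for ($i = 1; $i < count($match); $i++) {
            [$value, $offset] = $match[$i];
            if ($offset !== -1) {
                $tokens[] = new SketchToken($types[$i - 1], $value, $offset, strlen($value));
                break;
            }
        }
    }

    return $tokens;
}

// Usage: print one line per token.
$tokens = sketch_tokenize('foo bar', ['\s+' => 'T_WHITESPACE', '[a-zA-Z]\w+' => 'T_STRING']);
foreach ($tokens as $token) {
    printf("%s '%s' offset=%d length=%d\n", $token->type, $token->value, $token->offset, $token->length);
}
```

Offsets and lengths in this sketch are byte counts, which is what preg_match_all() and strlen() report.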
Installation
Composer
composer require affinity4/tokenizer
Basic Example
Let's assume we want to create a DSL (Domain Specific Language) for a template engine whose templates look more like code than markup.
Example template snippet:
Now we define our "lexicon", which is passed to the tokenizer:
NOTE:
The lexicon must supply all characters and patterns you expect to encounter in your grammar. Currently you cannot skip any characters. Everything must be tokenized, whether you use it later or not.
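One way to check a lexicon for completeness before handing it to the tokenizer is to look for stretches of input that no pattern matches. The following is a rough sketch that does not use the library at all: sketch_find_gaps() is a hypothetical helper, and the template string and token names are made-up examples. It assumes no pattern contains the "~" delimiter.

```php
<?php
declare(strict_types=1);

/**
 * Returns every stretch of $input that no lexicon pattern matches.
 *
 * @param array<string, string> $lexicon [match_pattern => type]
 * @return array<int, string> uncovered text, keyed by its byte offset
 */
function sketch_find_gaps(string $input, array $lexicon): array
{
    // Join all patterns into one alternation of non-capturing groups.
    $regex = '~(?:' . implode(')|(?:', array_keys($lexicon)) . ')~';
    preg_match_all($regex, $input, $matches, PREG_OFFSET_CAPTURE);

    $gaps = [];
    $cursor = 0;
    foreach ($matches[0] as [$value, $offset]) {
        // Anything between the previous match and this one was skipped.
        if ($offset > $cursor) {
            $gaps[$cursor] = substr($input, $cursor, $offset - $cursor);
        }
        $cursor = $offset + strlen($value);
    }

    // Trailing input after the last match is also uncovered.
    if ($cursor < strlen($input)) {
        $gaps[$cursor] = substr($input, $cursor);
    }

    return $gaps;
}

// Usage: the first lexicon forgets the parentheses, the second covers them.
$template = 'if (user) print name';

$incomplete = ['\s+' => 'T_WHITESPACE', '[a-zA-Z]\w+' => 'T_STRING'];
$complete = $incomplete + ['\(' => 'T_OPEN_PAREN', '\)' => 'T_CLOSE_PAREN'];

var_dump(sketch_find_gaps($template, $incomplete));
var_dump(sketch_find_gaps($template, $complete));
```

An empty result means every character of the input is claimed by some pattern, which is the condition the note above asks for.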
We pass the lexicon to the tokenizer...
From here you just need to write your "finite automaton" and/or your parser.
TIPS
debug()
The Tokenizer has a debug() method, which will return the compiled regex, for you to examine.
TIP:
A good website for testing PHP regexes is: https://regexr.com/
By default the debug method returns the regex as a string. It can also echo it, var_dump it, or "dump and die" (dd() for you Laravel users).
Constants are defined for each of these modes, so you do not have to remember the underlying switch values:
$Tokenizer->debug(Tokenizer::DEBUG_ECHO)
$Tokenizer->debug(Tokenizer::DEBUG_DUMP)
$Tokenizer->debug(Tokenizer::DEBUG_DUMP_AND_DIE)
preg_match_all(): Compilation failed: missing closing parenthesis at offset x
Attempting to match backslashes, or newline chars (e.g. \r|\n|\r\n) is most likely the cause of your troubles.
You will need to double escape backslashes. To save you from having to work this out, I have provided the correct regex patterns for T_ESCAPE_CHAR.
Newlines
Newlines must be replaced with a token before they can be matched. By default, the T_NEWLINE_ALL constant will match ;T_NEWLINE;.
If you need to match individual newline characters for a specific environment you can use the following constants
See the following section, Matching Backslashes and Special Characters, if you want more info.
Matching Backslashes and Special Characters
As mentioned above, backslashes must be double escaped.
So to match a single backslash you must use the regex '\\\\'
(I know, it sucks, but you have to)
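The reason for four backslashes is that the pattern is unescaped twice: once by PHP when it reads the string literal, and once by the regex engine. The sketch below shows both layers with plain preg_match(). sketch_wrap() is a hypothetical helper, and wrapping each pattern in a group is an assumption about how patterns get combined, suggested by the "missing closing parenthesis" error above rather than confirmed by it.

```php
<?php
declare(strict_types=1);

// '\\\\' in PHP source is a two-character string: two backslashes.
$good = '\\\\';

// '\\' in PHP source is a one-character string: a single backslash.
$bad = '\\';

// Hypothetical helper: wrap a pattern in a group, as a combined regex might.
function sketch_wrap(string $pattern): string
{
    return '~(' . $pattern . ')~';
}

// 'a\b' holds a real backslash: in single quotes, "\b" is not an escape sequence.
$subject = 'a\b';

// Two backslashes reach the regex engine, which reads them as one literal backslash.
$goodResult = preg_match(sketch_wrap($good), $subject);

// A single backslash escapes the closing parenthesis, so the group never closes.
// "@" hides the compile warning; preg_match() returns false when compilation fails.
$badResult = @preg_match(sketch_wrap($bad), $subject);

var_dump(strlen($good), $goodResult, $badResult);
```

Without the "@", the second call raises the same kind of "Compilation failed: missing closing parenthesis" warning described above.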
To match special characters (tabs, newlines, carriage returns etc.) you will need to replace them with another token first, and then add a token for the replacement string.
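As a rough illustration of that two-step approach for newlines, the sketch below swaps every newline variant for the ;T_NEWLINE; placeholder mentioned earlier and then matches the placeholder instead of the raw characters. The replacement step is written by hand here with str_replace() and says nothing about how the library does it; the variable names and the sample text are made up.

```php
<?php
declare(strict_types=1);

$source = "first line\r\nsecond line\nthird line";

// Step 1: replace each newline variant with a visible placeholder.
// "\r\n" comes first so a Windows line ending yields one placeholder, not two.
$prepared = str_replace(["\r\n", "\r", "\n"], ';T_NEWLINE;', $source);

// Step 2: add a lexicon entry for the replacement string.
$lexiconEntry = [';T_NEWLINE;' => 'T_NEWLINE'];

// The placeholder is now an ordinary pattern with no backslashes to escape.
$pattern = '~' . array_key_first($lexiconEntry) . '~';
$newlineCount = preg_match_all($pattern, $prepared);

var_dump($prepared, $newlineCount);
```

One caveat with any placeholder scheme: if the input can already contain the literal text ;T_NEWLINE;, it will be indistinguishable from a real newline, so pick a replacement string that cannot occur in your grammar.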
I am working on better internal detection for these patterns and will attempt to provide better error messages when these errors are encountered (I'll go real meta and regex the regex before it's run or something).