Download the PHP package envoymediagroup/columna without Composer
On this page you can find all versions of the php package envoymediagroup/columna. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Package columna
Short Description Columnar analytics for PHP - a pure PHP library to read and write simple columnar files in a performant way.
License MIT
Information about the package columna
Columnar Analytics (in pure PHP)
On GitHub: https://github.com/envoymediagroup/columna
About the project
What does it do?
This library allows you to write and read a simple columnar file format in a performant way with a lightweight, pure PHP implementation.
Why columnar analytics in PHP?
This library started as a scratch-our-own-itch project at Envoy Media Group. We needed fast, columnar analytics that would work well with our all-PHP stack, but found PHP's support and performance for mainstream columnar formats (Parquet, ORC, etc.) to be lacking. So we rolled our own simple columnar format with its own speedy writer and reader.
How battle tested is it?
This library has been in production use as the backbone of Envoy's analytics and business intelligence since early 2022. It processes hundreds of thousands of reads and writes per day, serving both custom reports for business users and automated requests for monitoring and machine learning applications. Bug fixes, feature adds, and improvements are ongoing based on our experience using this library every day in production.
Installation
Add this library to your project using Composer:
composer require envoymediagroup/columna
File format
What file format does this library use to store data? The file extension .scf stands for Simple Columnar Format, and it is simple: all the metadata about the file, its columns, and their definitions and offsets is stored on line 1 in a JSON header. The rest of the file is CSV-like data in a columnar arrangement (each column corresponding to one line in the file) using RLE compression and a Record Separator character as the RLE delimiter. Some extra escaping is applied to strings to increase the range of valid values that can be stored and retrieved. See a sample file here.
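To illustrate the idea of run-length encoding a column, here is a minimal sketch. It is not the library's actual encoder: the function names rleEncode and rleDecode and the [value, count] pair layout are assumptions made for illustration, and the real .scf format adds its own delimiters and escaping.

```php
<?php
// Hypothetical helpers for illustration only; not part of columna's API.

/**
 * Collapse consecutive repeated values into [value, run_length] pairs.
 */
function rleEncode(array $values): array
{
    $runs = [];
    foreach ($values as $value) {
        $last = count($runs) - 1;
        if ($last >= 0 && $runs[$last][0] === $value) {
            // Same value as the previous entry: extend the current run.
            $runs[$last][1]++;
        } else {
            // Different value: start a new run of length 1.
            $runs[] = [$value, 1];
        }
    }
    return $runs;
}

/**
 * Expand [value, run_length] pairs back into the original list.
 */
function rleDecode(array $runs): array
{
    $values = [];
    foreach ($runs as [$value, $count]) {
        for ($i = 0; $i < $count; $i++) {
            $values[] = $value;
        }
    }
    return $values;
}

$column  = ['ios', 'ios', 'ios', 'android', 'android', 'ios'];
$encoded = rleEncode($column);
$decoded = rleDecode($encoded);
```

Run-length encoding pays off most when equal values sit next to each other in a column, which is common for low-cardinality dimensions.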
Usage
Writer
Each columnar file is specific to one date and one metric, with any number of dimensions. For this example, we will assume a metric named clicks and three dimensions named platform_id, site_id, and url. Note that we provide the headers and values as separate inputs to the Writer; this makes sense when we are working with large data sets and want to preserve some memory by not duplicating associative string keys on every array item.
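The memory argument can be seen in a small sketch. The variable names and sample values here are hypothetical, and this shows only the shape of the data, not the Writer's actual method signatures.

```php
<?php
// Hypothetical sample data; the column names follow the example above.

// Column names are stated once...
$headers = ['platform_id', 'site_id', 'url', 'clicks'];

// ...and each row is a plain positional list in the same order.
$values = [
    [1, 10, 'https://example.com/a', 5],
    [1, 10, 'https://example.com/b', 2],
    [2, 11, 'https://example.com/a', 7],
];

// The equivalent associative form repeats every key on every row.
$associative = [];
foreach ($values as $row) {
    $associative[] = array_combine($headers, $row);
}
```

With millions of rows, the positional form avoids carrying the same four string keys on every item.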
Data Types
Currently supported data types are strings, ints, floats, bools, and a special "datetime" type. Datetimes are treated as strings except when evaluating query conditions, where they are parsed with strtotime() and compared as integers (>, <, =, etc.). Nested data is not currently supported. While it is possible to store JSON or other serializations in the string type, these values will not be unserialized by the engine and so cannot be evaluated for nested values. Each column definition includes an empty value that is always used in place of nulls in the data set, so null is never stored in the files or returned when reading a file.
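A minimal sketch of the two behaviours described above: comparing datetimes as integers after strtotime(), and substituting a column's empty value for nulls. The helper names and the choice of empty value are hypothetical; the library's internals may differ.

```php
<?php
// Hypothetical helpers for illustration; not columna's internal code.

// Fix the timezone so strtotime() gives the same result everywhere.
date_default_timezone_set('UTC');

/** True when datetime string $a is strictly later than $b. */
function datetimeGreaterThan(string $a, string $b): bool
{
    // strtotime() returns a Unix timestamp (int), so this is an integer comparison.
    return strtotime($a) > strtotime($b);
}

/** Replace null with the column's configured empty value. */
function withEmptyValue($value, $empty)
{
    return $value ?? $empty;
}

$later  = datetimeGreaterThan('2022-03-02 00:00:00', '2022-03-01 23:59:59');
$filled = array_map(
    fn ($v) => withEmptyValue($v, ''),
    ['/home', null, '/pricing']
);
```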
Usage
Let's walk through using the Writer in the comments below:
Now we have a complete file at $file_path.
CombinedWriter
The regular Writer allows you to take a row-based data set and transform it into a columnar file. The CombinedWriter then allows you to take multiple existing columnar files and combine them into a new columnar file containing all the data from the provided files. This only works if the files you provide are all for the same metric, on the same date, with the same columns. You can use this to distribute the work of generating data sets and files across a large number of workers, and then use another worker to combine those results into a single large file containing all the data for that metric on that date. You can use it like so:
We now have a file at $combined_file_path with all the data in it from the array of $partial_files we collected.
Reader
Here's how to read a file. Note that this library contains both Reader and BundledReader classes. They both do the same thing and you can use them interchangeably, but you will see a slight performance win by using the BundledReader because it reduces the number of include()s PHP has to perform. It's a small win that can add up at scale.
Call with arguments, get array results
To call the Reader normally with arguments:
Call with JSON string workload, get JSON+CSV string results
The Reader is designed for easy use when running a large number of requests distributed over many worker processes using an RPC or messaging framework such as AWS SQS, RabbitMQ, or our own envoymediagroup/lib-rpc. For this reason, the Reader can accept a string as its input and return a string as its output. The request string is a JSON serialization of the Reader arguments. For the result string, the first line is the metadata of the response encoded as JSON, and the following lines are the result data encoded as CSV with a bit of extra escaping for more safety in encoding/decoding strings. The Response class will handle unserializing this string for you. Be sure to use this Response class to parse results, as it will handle unescaping those strings properly.
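To make the layout of the result string concrete, the sketch below splits a made-up response into its JSON metadata line and CSV data lines using only built-in functions. The sample string and its keys are invented for illustration, and the sketch deliberately skips the library's extra escaping, so real results should still go through the Response class.

```php
<?php
// Made-up response string: line 1 is JSON metadata, the rest is CSV.
// The keys "status" and "row_count" are invented for this example.
$response = "{\"status\":\"success\",\"row_count\":2}\n"
    . "1,10,5\n"
    . "2,11,7";

// Split off the first line; everything after it is result data.
[$metaLine, $body] = explode("\n", $response, 2);

$metadata = json_decode($metaLine, true);

$rows = [];
foreach (explode("\n", $body) as $line) {
    if ($line === '') {
        continue; // ignore a trailing blank line
    }
    // Pass the separator, enclosure, and escape arguments explicitly.
    $rows[] = str_getcsv($line, ',', '"', '\\');
}
```

Note that str_getcsv() returns every field as a string; restoring the declared column types is one more job a dedicated response parser has to do.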
An example caller:
An example worker:
Metadata
Metadata looks like this:
Results
Result data set looks like this. Note that you can reference the 'index' field in the 'column_meta' of the metadata to map the indexes in each record to the appropriate column names.
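A minimal sketch of that mapping, assuming 'column_meta' is keyed by column name and each entry carries an 'index' field. The exact shape of the metadata is an assumption here, so check it against a real response.

```php
<?php
// Assumed metadata shape: column name => ['index' => position in each record].
$metadata = [
    'column_meta' => [
        'platform_id' => ['index' => 0],
        'site_id'     => ['index' => 1],
        'clicks'      => ['index' => 2],
    ],
];

// Sample positional records, invented for this example.
$records = [
    [1, 10, 5],
    [2, 11, 7],
];

// Build an index => name lookup once, then reuse it for every record.
$nameByIndex = [];
foreach ($metadata['column_meta'] as $name => $meta) {
    $nameByIndex[$meta['index']] = $name;
}

$named = [];
foreach ($records as $record) {
    $row = [];
    foreach ($record as $index => $value) {
        $row[$nameByIndex[$index]] = $value;
    }
    $named[] = $row;
}
```

Building the lookup once keeps the per-record work to a single array access per field.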
Q&A
Why didn't you use library X, built-in function Y, or design pattern Z?
The short answer is performance. I kept the requirements of this library as small as possible to make the autoload very lightweight and reduce time spent include()ing files, which adds up quickly when you are optimizing for every millisecond. Many of PHP's built-in array functions actually run slower than foreaching the same array. Design patterns with more abstraction mean more classes and more weight. Keeping it simple keeps it fast.
Issues, Feature Requests
See the open issues for a full list of known issues or to submit an issue or feature request.
Of course, if you spot any egregious bugs or security holes, please create an issue and notify me right away (contact info below).
Contributing
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Copy .env.base to .env (required) and update any environment variables (optional)
- Run docker-compose up
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Make changes
- Run docker exec -it columna composer run test to make sure the unit tests pass
- Run docker exec -it columna composer run bundle to create a new BundledReader.php
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Creator: Ryan Marlow
Twitter: @myanrarlow
Email: [email protected]
Acknowledgments
Here are some resources I've found helpful for this project.
All versions of columna with dependencies
ext-json Version *
ext-mbstring Version *