PHP download

Download the PHP package codename/parquet without Composer

On this page you can find all versions of the php package codename/parquet. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.


Vendor codename
Package parquet
Short Description Thrift-based PHP implementation for using the Apache Parquet format
License MIT

FAQ

After the download, you only need one include: require_once('vendor/autoload.php'); After that, import the classes you need with use statements.

If you use only one package, a project is not needed. But if you use more than one package, it is not possible to import the classes with use statements without a project.

In general, it is recommended to always use a project to download your libraries, since an application normally needs more than one library.


Information about the package parquet

php-parquet

This is the first parquet file format reader/writer implementation in PHP, based on the Thrift sources provided by the Apache Foundation. Extensive parts of the code and concepts have been ported from parquet-dotnet (see https://github.com/elastacloud/parquet-dotnet and https://github.com/aloneguid/parquet-dotnet). Therefore, thanks go out to Ivan Gavryliuk (https://github.com/aloneguid).

This package enables you to read and write Parquet files/streams without the use of exotic external extensions (unless you want to use exotic compression methods). It has (almost?) 100% test compatibility with parquet-dotnet regarding the core functionality, verified via PHPUnit.

Important

This repository (and associated package on Packagist) is the official project continuation of jocoon/parquet. Due to various improvements and essential bugfixes, here in codename/parquet, using the legacy package is highly discouraged.

Index

Requirements
Installation
General Remarks
Usage / API
- Reading files
- Writing files
Simplified usage
- Reading using ParquetDataIterator
- Writing using ParquetDataWriter
Complex data handling

Preamble

For some parts of this package, some new patterns had to be invented as I haven't found any implementation that met the requirements. For most cases, there weren't any implementations available, at all.

Some highlights:

GZIP Stream Wrappers (that also write headers and checksums) for usage with fopen() and similar functions
Snappy Stream Wrappers (Snappy compression algorithm) for usage with fopen() and similar functions
Stream Wrappers that specify/open/wrap a resource id instead of (or in addition to) a file path or URI
TStreamTransport as a TTransport implementation for pure streaming Thrift data
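
To illustrate the third highlight, here is a minimal sketch of a user-space stream wrapper that resolves a pseudo-URI to an already opened stream. The class name ResourceIdWrapper and the resid:// scheme are made up for this example and are not php-parquet's actual wrapper classes; only PHP's built-in stream_wrapper_register() mechanism is assumed.

```php
<?php
// Hypothetical sketch: a user-space stream wrapper that resolves a path such
// as "resid://5" to an already opened stream. Not php-parquet's own classes.
final class ResourceIdWrapper
{
    /** @var array<int, resource> open streams, keyed by their resource id */
    private static $registry = [];

    /** @var resource|null stream context, populated by PHP */
    public $context;

    /** @var resource the wrapped stream this handle forwards to */
    private $inner;

    /** Remembers a stream and returns the pseudo-URI that refers to it. */
    public static function register($stream): string
    {
        $id = (int) $stream; // casting a resource to int yields its id
        self::$registry[$id] = $stream;
        return 'resid://' . $id;
    }

    public function stream_open($path, $mode, $options, &$openedPath)
    {
        $id = (int) substr($path, strlen('resid://'));
        if (!isset(self::$registry[$id])) {
            return false;
        }
        $this->inner = self::$registry[$id];
        return true;
    }

    public function stream_read($count)
    {
        return fread($this->inner, $count);
    }

    public function stream_write($data)
    {
        return fwrite($this->inner, $data);
    }

    public function stream_eof()
    {
        return feof($this->inner);
    }

    public function stream_close()
    {
        // The wrapped stream stays open; its owner is responsible for it.
    }
}

stream_wrapper_register('resid', ResourceIdWrapper::class);

$memory = fopen('php://memory', 'w+');
fwrite($memory, 'PAR1'); // the magic bytes a Parquet file starts with
rewind($memory);

// Open the existing stream again, this time through its resource id.
$wrapped = fopen(ResourceIdWrapper::register($memory), 'r');
$magic = fread($wrapped, 4);
fclose($wrapped);
fclose($memory);
```

The wrapper deliberately leaves the underlying stream open on close, so the code that created the stream keeps ownership of it.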

Background

I started developing this library because there was simply no implementation for PHP.

At my company, we needed a quick solution to archive huge amounts of data from a database in a format that is still queryable, extensible from a schema perspective and fault-tolerant. We started testing live 'migrations' via AWS DMS to S3, which ended up crashing on certain amounts of data due to memory limitations. It was also simply too db-oriented, besides the fact that it's easy to accidentally delete data from previous loads. As we have a heavily SDS-oriented and platform-agnostic architecture, storing data as a 1:1 clone of a database, like a dump, is not my preferred way. Instead, I wanted the ability to store data, structured dynamically, the way I wanted, in the same way DMS was exporting to S3. Finally, the project died due to the reasons mentioned above.

But I couldn't get the parquet format out of my head...

The top search result (https://stackoverflow.com/questions/44780419/how-to-create-orc-or-parquet-files-from-php-code) made it look as if a PHP implementation would not take that much effort - but in fact, it did take some (about 2 weeks of non-consecutive work). For me, as a PHP and C# developer, parquet-dotnet was a perfect starting point - not least because its benchmarks are simply too compelling. But I expected the PHP implementation not to meet these levels of performance, as this is an initial implementation, showing the principle. And additionally, no one had done it before.

Raison d'être

As PHP has a huge share regarding web-related projects, this is a MUST-HAVE in times of growing need for big data applications and scenarios. For my personal motivation, this is a way to show PHP has (physically, virtually?) surpassed its reputation as a 'scripting language'. I think - or at least I hope - there are people out there that will benefit from this package and the message it transports. Not only Thrift objects. Pun intended.

Requirements

You'll need several extensions to use this library to the full extent.

bcmath (today, this should be a must-have anyway)
gmp (for working with arbitrarily large integers - and indirectly huge decimals!)
zlib (for GZIP (de-)compression)
snappy (https://github.com/kjdev/php-ext-snappy - sadly, not published yet to PECL - you'll have to compile it yourself - see Installation)

This library was originally developed for and with PHP 7.3, but it should work on PHP > 7 and will be tested on 8, when released. At the moment, tests on PHP 7.1 and 7.2 fail due to some DateTime issues. I'll have a look at it. Tests fully pass on PHP 7.3 and 7.4. At the time of writing, 8.0.0 RC2 is also performing well.

This library highly depends on

packaged/thrift for working with the Thrift-related objects and data (stripped-down version of apache/thrift)
pear/Math_BigInteger for working with binary stored arbitrary-precision decimals (paradox, I know)

As of v0.2, I've also switched to an implementation-agnostic approach of using readers and writers. Now, we're dealing with BinaryReader(Interface) and BinaryWriter(Interface) implementations that abstract the underlying mechanism. I've noticed mdurrant/php-binary-reader is just way too slow. I just didn't want to refactor everything just to try out Nelexa's reading powers. Instead, I've made those two interfaces mentioned above to abstract various packages delivering binary reading/writing. This finally leads to an optimal way of testing/benchmarking different implementations - and also mixing, e.g. using wapmorgan's package for reading while using Nelexa's for writing.
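
The idea of hiding the reading mechanism behind an interface can be sketched roughly like this. The names SketchBinaryReader and StreamBinaryReader and both method signatures are assumptions made for this example; the actual BinaryReader(Interface) in this package may look different.

```php
<?php
// Hypothetical names; php-parquet's real interfaces may look different.
interface SketchBinaryReader
{
    public function readInt32(): int;
    public function readBytes(int $length): string;
}

/** One possible implementation on top of a plain PHP stream. */
final class StreamBinaryReader implements SketchBinaryReader
{
    /** @var resource */
    private $stream;

    public function __construct($stream)
    {
        $this->stream = $stream;
    }

    public function readInt32(): int
    {
        // 'V' decodes an unsigned 32-bit little-endian integer.
        $value = unpack('V', $this->readBytes(4))[1];
        // Reinterpret as signed (assumes a 64-bit PHP build).
        return $value >= 0x80000000 ? $value - 0x100000000 : $value;
    }

    public function readBytes(int $length): string
    {
        $bytes = fread($this->stream, $length);
        if ($bytes === false || strlen($bytes) !== $length) {
            throw new RuntimeException('Unexpected end of stream');
        }
        return $bytes;
    }
}

$stream = fopen('php://memory', 'w+');
fwrite($stream, pack('V', 42) . pack('V', 0xFFFFFFFF) . 'PAR1');
rewind($stream);

$reader = new StreamBinaryReader($stream);
$first = $reader->readInt32();  // 42
$second = $reader->readInt32(); // -1
$tail = $reader->readBytes(4);  // "PAR1"
fclose($stream);
```

Code that only depends on the interface can then be benchmarked against any implementation, which is the point of the abstraction described above.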

As of v0.2.1 I've done the binary reader/writer implementations myself, as no implementation met the performance requirements. Especially for writing, this ultra-lightweight implementation delivers thrice* the performance of Nelexa's buffer.
* intended, I love this word

Alternative 3rd party binary reading/writing packages in scope:

nelexa/buffer
mdurrant/php-binary-reader (reading only)
wapmorgan/binary-stream

Installation

Install this package via Composer, e.g. composer require codename/parquet

The included Dockerfile gives you an idea of the needed system requirements. The most important step is to clone and install php-ext-snappy. At the time of writing, it has not been published to PECL yet.

Please note: php-ext-snappy is a little bit quirky to compile and install on Windows, so this is just a short information for installation and usage on Linux-based systems. As long as you don't need the snappy compression for reading or writing, you can use php-parquet without compiling it yourself.

Helping tools to make life easier

I've found ParquetViewer (https://github.com/mukunku/ParquetViewer) by Mukunku to be a great way of looking into the data to be read or verifying some stuff on a Windows desktop machine. At least, this helps understanding certain mechanisms, as it more-or-less visually assists by simply displaying the data as a table.

API

Usage is almost the same as parquet-dotnet. Please note, we have no using (...) { } block, like in C#. So you have to make sure to close/dispose unused resources yourself or let PHP's GC handle it automatically by its refcounting algorithm. (This is the reason why I don't make use of destructors like parquet-dotnet does.)
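
As a rough illustration of closing resources yourself, a try/finally block gives a similar guarantee in plain PHP. This sketch uses only built-in stream functions and a php://temp stream as a stand-in; it does not show any php-parquet call.

```php
<?php
$stream = fopen('php://temp', 'w+');
try {
    fwrite($stream, 'some bytes');
    rewind($stream);
    $content = stream_get_contents($stream);
} finally {
    // Runs even if an exception is thrown above.
    fclose($stream);
}

$stillOpen = is_resource($stream); // false, the stream has been closed
```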

General remarks

As PHP's type system is completely different to C#, we have to make some additions on how to handle certain data types. For example, a PHP integer is nullable, somehow. An int in C# isn't. This is a point I'm still unsure how to deal with. For now, I've set int (PHP integer) to be nullable - parquet-dotnet is doing this as not-nullable. You can always adjust this behaviour by manually setting on your DataField. Additionally, php-parquet uses a dual way of determining a type. In PHP, a primitive has its own type (integer, bool, float/double, etc.). For class instances (especially DateTime/DateTimeImmutable), the type returned by gettype() is always object. This is the reason a second property for the DataTypeHandlers exists to match, determine and process it: phpClass.
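
The dual type lookup can be tried out in plain PHP. The helper describeValue() below is hypothetical and not part of php-parquet; it only shows what gettype() and get_class() report for primitives, objects and null.

```php
<?php
/**
 * Hypothetical helper, not part of php-parquet: describes a value by its
 * primitive type name and, for objects, its class name.
 */
function describeValue($value): array
{
    return [
        'type'     => gettype($value),
        'phpClass' => is_object($value) ? get_class($value) : null,
    ];
}

$int  = describeValue(42);                                  // type "integer"
$date = describeValue(new DateTimeImmutable('2020-01-01')); // type "object"
$null = describeValue(null);                                // type "NULL"
```

For the DateTimeImmutable instance, only the class name tells the two date classes apart, which is why a type name alone is not enough.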

At the time of writing, not every DataType supported by parquet-dotnet is supported here, too. F.e. I've skipped Int16, SignedByte and some more, but it shouldn't be too complicated to extend to full binary compatibility.

At the moment, this library serves the core functionality needed for reading and writing parquet files/streams. It doesn't include parquet-dotnet's Table, Row, Enumerators/helpers from the C# namespace .

Reading files

Writing files

Simplified Usage

You can also use ParquetDataIterator and ParquetDataWriter for working even with highly complex schemas (f.e. nested data). Though experimental at the time of writing, unit- and integration tests indicate we have a 100% compatibility with Spark, as most of the other Parquet implementations lack certain features or cases of super-complex nesting.

ParquetDataIterator and ParquetDataWriter leverage the 'dynamic-ness' of the PHP type system and (associative) arrays - which only comes to a halt when fully using unsigned 64-bit integers - those can only be partially supported due to the nature of PHP.
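
The limitation comes from PHP only having signed native integers. A short sketch in plain PHP, with no php-parquet API involved, shows what happens beyond PHP_INT_MAX:

```php
<?php
$max = PHP_INT_MAX; // largest native (signed) integer
$beyond = $max + 1; // overflows into a float instead of a larger integer

// UInt64 maximum kept as a numeric string, so no digit is lost.
$uint64Max = '18446744073709551615';

// Converting it to a native integer and back does not give the same digits.
$fitsNativeInt = (string) (int) $uint64Max === $uint64Max;

var_dump(is_int($max), is_float($beyond), $fitsNativeInt);
```

Values in the upper half of the unsigned 64-bit range therefore need a string or an arbitrary-precision representation.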

Reading

ParquetDataIterator automatically iterates over all row groups and data pages, over all columns of the parquet file in the most memory-efficient way possible. This means, it doesn't load all datasets into memory, but does it on a per-datapage/per-row-group basis.
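
The general principle of iterating batch by batch can be sketched with a PHP generator. This is not ParquetDataIterator's implementation; the function iterateRows() and the loader callback are assumptions made purely for illustration.

```php
<?php
/**
 * Hypothetical sketch: yields rows one by one while only ever holding a
 * single batch (think: one row group) in memory.
 *
 * @param callable $loadBatch receives a batch index, returns an array of rows
 */
function iterateRows(callable $loadBatch, int $batchCount): Generator
{
    for ($i = 0; $i < $batchCount; $i++) {
        $rows = $loadBatch($i); // load only the current batch
        foreach ($rows as $row) {
            yield $row;
        }
        unset($rows);           // release the batch before loading the next
    }
}

// Stand-in data source; a real reader would decode a row group here.
$batches = [
    [['id' => 1], ['id' => 2]],
    [['id' => 3]],
];
$loader = function (int $index) use ($batches): array {
    return $batches[$index];
};

$ids = [];
foreach (iterateRows($loader, count($batches)) as $row) {
    $ids[] = $row['id'];
}
```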

Under the hood, it leverages the functionality of DataColumnsToArrayConverter which ultimately does all the 'heavy lifting' regarding Definition and Repetition Levels.

Writing

Vice versa, ParquetDataWriter allows you to write a Parquet file (in-memory or on disk) by passing PHP associative array data, either one row at a time or in batches. Internally, it uses ArrayToDataColumnsConverter to produce data, dictionaries, definition and repetition levels.
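
For the simplest case, flat rows with optional fields, the conversion from rows to columns can be sketched as follows. The function rowsToColumns() is hypothetical and far simpler than ArrayToDataColumnsConverter; it assumes every field is a nullable top-level field, where a definition level of 1 marks a present value and 0 marks null. Nested data additionally needs repetition levels, which are left out here.

```php
<?php
/**
 * Hypothetical sketch: pivots flat rows into per-column values plus
 * definition levels (1 = value present, 0 = null).
 */
function rowsToColumns(array $rows, array $fieldNames): array
{
    $columns = [];
    foreach ($fieldNames as $name) {
        $columns[$name] = ['values' => [], 'definitionLevels' => []];
    }

    foreach ($rows as $row) {
        foreach ($fieldNames as $name) {
            $value = $row[$name] ?? null; // missing keys count as null
            if ($value === null) {
                $columns[$name]['definitionLevels'][] = 0;
            } else {
                $columns[$name]['values'][] = $value;
                $columns[$name]['definitionLevels'][] = 1;
            }
        }
    }

    return $columns;
}

$columns = rowsToColumns(
    [
        ['id' => 1, 'name' => 'a'],
        ['id' => 2, 'name' => null],
    ],
    ['id', 'name']
);
```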

Complex data

php-parquet supports the full nesting capabilities of the Parquet format. You may notice, depending on what field types you're nesting, you'll somehow 'lose' key names. This is by design:

List elements don't have a key - they're array elements
A Map's value field itself has no key in an associative array - the key is provided by the Map's key column
A repeated field is implicitly converted to an array-like structure

Generally speaking, here are the PHP equivalents of the Logical Types of the Parquet Format:

Parquet	PHP	JSON	Note
DataField	primitive	primitive	f.e. string, integer, etc.
ListField	array	array `[]`	element type can be a primitive or even a List, Struct or Map
StructField	associative array	object `{}`	Keys of the assoc. array are the field names inside the StructField
MapField	associative array	object `{}`	Simplified: `array_keys($data['someField'])` and `array_values($data['someField'])`, but for each row

The format is compatible with JSON export data generated by Spark configured with spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", False). By default, Spark strips out null values completely when exporting to JSON.
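
To make the mapping more tangible, here is a made-up row in plain PHP together with its JSON form. All field names are invented for this example, and no php-parquet class is used.

```php
<?php
// Hypothetical row; the field names are made up for illustration.
$row = [
    'id'      => 1,                                      // DataField
    'tags'    => ['a', 'b'],                             // ListField
    'address' => ['city' => 'Berlin', 'zip' => '10115'], // StructField
    'scores'  => ['math' => 90, 'art' => 75],            // MapField
];

// A map is split into its keys and its values, per row.
$mapKeys = array_keys($row['scores']);
$mapValues = array_values($row['scores']);

$json = json_encode($row);
echo $json, PHP_EOL;
```

Note that the struct and the map look identical in PHP and in JSON; only the schema tells them apart.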

Please note: All those field types can be made nullable or non-nullable/required on every nesting level (affects definition levels). Some nullabilities are used f.e. to represent empty lists and distinguish them from a null value for a list.
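
On the PHP side, the difference between an empty list and a null list looks like this (plain PHP, with a made-up field name):

```php
<?php
$withEmptyList = ['tags' => []];   // the list exists, but has no elements
$withNullList  = ['tags' => null]; // there is no list at all

$sameValue = $withEmptyList['tags'] === $withNullList['tags'];

$emptyJson = json_encode($withEmptyList);
$nullJson = json_encode($withNullList);
echo $emptyJson, PHP_EOL, $nullJson, PHP_EOL;
```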

Performance

This package also provides the same benchmark as parquet-dotnet. These are the results on my machine:

	Parquet.Net (.NET Core 2.1)	php-parquet (bare metal 7.3)	php-parquet (dockerized* 7.3)	Fastparquet (python)	parquet-mr (Java)
Read	255ms	1'090ms	1'244ms	154ms**	untested
Write (uncompressed)	209ms	1'272ms	1'392ms	237ms**	untested
Write (gzip)	1'945ms	3'314ms	3'695ms	1'737ms**	untested

* Dockerized on a Windows 10 machine with bind-mounts, which slow down most of those high-IOPS processes.
** It seems fastparquet or Python does some internal caching - the original results on first file opening are way worse (~ 2'700ms)

In general, these tests were performed with gzip compression level 6 for php-parquet. The gzip write time roughly halves at level 1 (minimum compression) and almost doubles at level 9 (maximum compression). Note that level 9 might not yield the smallest file size, but always takes the longest to compress.
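
If you want to get a feeling for the effect of the compression level on your own machine, a small sketch with PHP's zlib functions is enough. The payload below is an arbitrary, highly repetitive sample, so sizes and timings will differ from real column data and from the benchmark above.

```php
<?php
// Repetitive sample payload; real column data will compress differently.
$payload = str_repeat('parquet;', 50000);

$results = [];
foreach ([1, 6, 9] as $level) {
    $start = microtime(true);
    $compressed = gzencode($payload, $level);
    $results[$level] = [
        'bytes'   => strlen($compressed),
        'seconds' => microtime(true) - $start,
    ];
}

foreach ($results as $level => $result) {
    printf("level %d: %d bytes in %.4f s\n", $level, $result['bytes'], $result['seconds']);
}

// Whatever the level, decompression gives back the original bytes.
$roundTrip = gzdecode(gzencode($payload, 9));
```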

Coding Style

As this is a partial port of a package from a completely different programming language, the programming style is pretty much a pure mess. I decided to keep most of the casing (e.g. $writer->CreateRowGroup() instead of ->createRowGroup()) to keep a certain 'visual compatibility' to parquet-dotnet. At least, this is a desirable state from my perspective, as it makes comparing and extending much easier during initial development stages.

Acknowledgements

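Some code parts and concepts have been ported from C#/.NET, see parquet-dotnet (https://github.com/aloneguid/parquet-dotnet and https://github.com/elastacloud/parquet-dotnet).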

License

php-parquet is licensed under the MIT license. See file LICENSE.

Contributing

Feel free to do a PR, if you want. As this is a spare-time OSS project, contributions will help all users of this package, including yourself. Please apply a pinch of common sense when creating PRs and/or issues, there's no template.

All versions of parquet with dependencies

Version v0.7.2 Release 29. Mar 2025

Requires php Version >=7.3 <=8.4.99
ext-gmp Version *
ext-bcmath Version *
ext-zlib Version *
pear/math_biginteger Version ^1.0
packaged/thrift Version ^0.16.0
