Download the PHP package jbzoo/csv-blueprint without Composer

On this page you can find all versions of the PHP package jbzoo/csv-blueprint. You can download/install any of these versions without Composer; dependencies are resolved automatically.

FAQ

After the download, you only need to add a single include, require_once('vendor/autoload.php');. After that, you can import the classes with use statements.

Example:
If you use only one package, you don't need a project. But if you use more than one package, it is not possible to import the classes with use statements without a project.
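To make this concrete, a minimal entry script after extracting the download might look like the sketch below (the commented-out import is a hypothetical placeholder; substitute the real class names from the package):

```php
<?php
// Load the dependencies downloaded from this page (the "vendor" folder).
require_once __DIR__ . '/vendor/autoload.php';

// Then import the classes you need with `use` statements, e.g.:
// use JBZoo\CsvBlueprint\SomeClass;   // hypothetical placeholder
```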

In general, it is recommended to always use a project to download your libraries, since an application normally needs more than one library.
Some PHP packages are not free to download and are therefore hosted in private repositories. In this case, credentials are needed to access such packages. Please use the auth.json textarea to insert credentials if a package comes from a private repository. You can look here for more information.

  • Some hosting environments are not accessible via a terminal or SSH, so Composer cannot be used there.
  • Using Composer can be complicated, especially for beginners.
  • Composer needs considerable resources, which are sometimes not available on a simple webspace.
  • If you are using private repositories, you don't need to share your credentials. You can set up everything on our site and then provide a simple download link to your team members.
  • Simplify your Composer build process: use our own command-line tool to download the vendor folder as a binary. This makes your build process faster, and you don't need to expose your credentials for private repositories.

Information about the package csv-blueprint

CSV Blueprint


Strict and automated line-by-line CSV validation tool based on customizable YAML schemas.

In seconds, make sure every character in a gigabyte file meets your expectations.

I believe it is the simplest, yet most flexible and powerful, CSV validator in the world. ☺️

Features

CLICK to see a typical cross-team workflow

1. **Data Preparation:** Team A generates CSV data adhering to a predefined format and places the file in a shared location accessible to Team B (e.g., a shared repository or cloud storage).
2. **Notification:** Team A notifies Team B that the data is ready via corporate communication channels (email, chat, task management system).
3. **Validation:** Team B uses predefined validation rules stored in the repository to check the CSV file for accuracy and integrity before importing. This includes verifying date formats, numerical values, and the presence of required columns.
4. **Data Import:** After successful validation, Team B imports the data from the CSV file into their system for further processing.
5. **Error Handling:** If validation identifies errors, the process halts, and Team B provides feedback to Team A for data correction.

**Why validation is necessary:**

- **Reduce Errors:** Validating data before import minimizes the likelihood of errors, enhancing data quality.
- **Efficiency:** Prevents time lost on manual error correction after import.
- **Data Consistency:** Ensures data meets the expectations and requirements of Team B, facilitating accurate processing and analysis.
- **Automation:** Storing validation rules in the repository makes automated checking easier and simplifies updating validation criteria.

Live demo

As a live demonstration of how the tool works, you can explore the super minimal repository at demo. For more complex examples and various reporting methods, take a look at the demo pipeline with different report types.

See also:

Table of contents

Usage

Docker container

Ensure you have Docker installed on your machine.
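A typical invocation is sketched below; the image name follows the project's Docker Hub publishing, and the mount point and file paths are placeholders to adapt to your setup:

```sh
# Mount the current directory into the container and validate a file.
# Image name assumed to be jbzoo/csv-blueprint (published on Docker Hub).
docker run --rm                        \
    -v "$(pwd):/parent-host"           \
    jbzoo/csv-blueprint                \
    validate-csv                       \
    --csv=/parent-host/demo.csv        \
    --schema=/parent-host/schema.yml
```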

GitHub Action

You can find launch examples in the workflow demo.
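A minimal workflow step might look like the sketch below; the action reference and input names are assumptions based on the demo, so verify them against the workflow demo itself:

```yml
# .github/workflows/csv-validation.yml (sketch)
- name: Validate CSV
  uses: jbzoo/csv-blueprint@master   # Action ref assumed; pin a real tag.
  with:
    csv: ./data/*.csv                # Path(s) to the CSV file(s).
    schema: ./schemas/demo.yml       # Path to the validation schema.
```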

Phar binary

CLICK to see how to use the PHAR file

Ensure you have PHP installed on your machine.
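A sketch of downloading and running the PHAR; the release URL pattern is an assumption, so grab the actual asset from the GitHub Releases page:

```sh
# Download the PHAR from the latest release (URL pattern is an assumption).
wget https://github.com/jbzoo/csv-blueprint/releases/latest/download/csv-blueprint.phar
chmod +x ./csv-blueprint.phar

# Run the validator with your own paths.
php ./csv-blueprint.phar validate-csv --csv=./demo.csv --schema=./schema.yml
```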

Schema definition

Define your CSV validation schema in YAML for a clear and structured configuration. Alternative formats such as PHP are also supported, accommodating various preferences and workflow requirements.

The provided example illustrates a schema for a CSV file with a header row. It mandates that the id column must not be empty and should only contain integer values. Additionally, the name column is required to have a minimum length of 3 characters, ensuring basic data integrity and usefulness.

Example schema in YAML
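A sketch of such a schema; the rule keys (not_empty, is_int, length_min) follow the project's naming conventions but should be checked against the full schema reference below:

```yml
name: Demo schema

csv:
  header: true            # The file is expected to start with a header row.

columns:
  - name: id
    rules:
      not_empty: true     # The id column must not contain empty values.
      is_int: true        # Only integer values are allowed.

  - name: name
    rules:
      length_min: 3       # Names must be at least 3 characters long.
```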

Full schema description

In the example YAML file, a detailed description of all features is provided. This documentation is verified through automated tests, ensuring it remains current.

Notes:

Below is a comprehensive list of rules, each accompanied by a brief explanation and example for clarity. This section is also validated through automated tests, ensuring the information is consistently accurate.

CLICK to see details about each rule

Extra checks

In addition to the schema outlined in the YAML configuration, the tool performs several underlying checks by default to ensure thorough validation of your CSV files.

Presets and reusable schemas

Presets significantly enhance the efficiency and reusability of schema definitions by letting you share common validation rules (for fields such as user IDs and email addresses) across many CSV files, ensuring consistency.

This uniformity maintains data integrity and simplifies maintenance: a centralized update automatically applies to all linked schemas. Moreover, presets support customization through field-specific rule overrides, enabling both standardization and adaptation to specific needs.

Focus on your task instead of copy-pasting.

Example with presets

Let's look at a real-life example. Suppose you have a "library" of different user-profile validation rules that can be reused across a wide variety of CSV files.

To avoid worrying about integrity, and to spare yourself endless copy-paste, you can reuse ANY(!) existing schema. In fact, this can be considered a form of partial inheritance.
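As a sketch, wiring in a preset looks roughly like this; the presets/preset keys follow the project's docs, while the file name and rules here are illustrative:

```yml
# Register a reusable schema file under an alias.
presets:
  users: ./preset_users.yml

columns:
  # Pull in all rules defined for the "login" column in the preset...
  - preset: users/login
    # ...and override or extend them for this particular CSV file.
    rules:
      length_max: 20
```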

Important notes

If something went wrong

If you're having trouble working with presets and don't understand how CSV Blueprint interprets them under the hood, just add --dump-schema to see the compiled result. There is also a separate CLI command for dumping schemas:
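For example (the flag and command are named above; the paths are placeholders):

```sh
# Show the fully resolved schema while validating.
./csv-blueprint validate-csv --csv=./demo.csv --schema=./schema.yml --dump-schema

# Or dump the compiled schema on its own.
./csv-blueprint dump-schema --schema=./schema.yml
```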

Let's take a look at what this looks like in code.

As a result, you don't just get a pile of validation schemas that is difficult to manage; you get something like a framework(!) tailored to the specifics of your project, which pays off especially when there are dozens or even hundreds of CSV files and rules. Consistency becomes much easier to achieve, and very often that matters a great deal.

Database preset

CLICK to see source code

User data preset

CLICK to see source code

Usage of presets

This short and clear YAML looks under the hood roughly as follows. As you can see, it simplifies your work a lot.

CLICK to see what it looks like in memory.

As a result, readability and maintainability become dramatically easier. You can easily add new rules, change existing ones, and so on.

Complete example with all available syntax

CLICK to see available syntax.

Note: All provided YAML examples pass built-in validation, yet they may not make practical sense. These are intended solely for demonstration and to illustrate potential configurations and features.

Parallel processing

The --parallel option is available for speeding up the validation of CSV files by utilizing more CPU resources effectively.
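For example (whether the bare flag defaults to all available cores is an assumption; check the CLI help for the exact semantics):

```sh
# Let the tool use the available CPU cores (assumed default of the flag).
./csv-blueprint validate-csv --csv=./big.csv --schema=./schema.yml --parallel

# Or cap the number of worker threads explicitly.
./csv-blueprint validate-csv --csv=./big.csv --schema=./schema.yml --parallel=4
```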

Key Points

Complete CLI help message

This section outlines all available options and commands provided by the tool, leveraging the JBZoo/Cli package for its CLI. The tool offers a comprehensive set of options to cater to various needs and scenarios, ensuring flexibility and efficiency in CSV file validation and manipulation.

For detailed information on each command and option, refer to the JBZoo/Cli documentation. This resource provides insights into the functionality and application of the CLI commands, helping users make the most out of the tool's capabilities.

./csv-blueprint validate-csv --help

CLICK to see validate-csv help message

./csv-blueprint validate-schema --help

CLICK to see validate-schema help message

./csv-blueprint dump-schema --help

CLICK to see dump-schema help message

./csv-blueprint create-schema --help

It's beta. Work in progress.

CLICK to see create-schema help message

Report examples

The validation process culminates in a human-readable report detailing any errors identified within the CSV file. While the default report format is a table, the tool supports various output formats, including text, GitHub, GitLab, TeamCity, JUnit, among others, to best suit your project's needs and your personal or team preferences.
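For example, switching the output format is a single CLI option (the --report flag name is an assumption; verify it via --help):

```sh
# Render the validation report in GitHub annotation format.
./csv-blueprint validate-csv --csv=./demo.csv --schema=./schema.yml --report=github
```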

GitHub Action format

To see user-friendly error outputs in your pull requests (PRs), specify report: github. This utilizes annotations to highlight bugs directly within the GitHub interface at the PR level. This feature allows errors to be displayed in the exact location within the CSV file, right in the diff of your Pull Requests. For a practical example, view this live demo PR.
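In a workflow this is one extra input on the action (a sketch, with the same assumptions as the workflow example above):

```yml
- uses: jbzoo/csv-blueprint@master   # Action ref assumed; pin a real tag.
  with:
    csv: ./data/*.csv
    schema: ./schemas/demo.yml
    report: github                   # Render errors as inline PR annotations.
```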

CLICK to see example in GitHub Actions terminal

![GitHub Actions - Terminal](.github/assets/github-actions-termintal.png)

Text format

An optional text format with highlighted keywords for quick navigation.

Table format

When using the table format (default), the output is organized in a clear, easily interpretable table that lists all discovered errors. This format is ideal for quick reviews and sharing with team members for further action.

Notes

Benchmarks

Understanding the performance of this tool is crucial, but it's important to note that its efficiency is influenced by several key factors:

However, to gain a general understanding of performance, refer to the table below.

Profiles:

Divisions:

Note: The Peak Memory metric is primarily of interest when aggregation rules are used, as non-aggregation scenarios typically require no more than 2-4 megabytes of memory, regardless of file size or rule count.

These benchmarks offer a snapshot of the tool's capabilities across a range of scenarios, helping you gauge its suitability for your specific CSV validation needs.

Each speed cell shows lines processed per second (e.g., 786K ≈ 786,000 lines/sec) and the total run time for the ~2M-line test file.

| File / Profile | Metric | Quickest | Minimum | Realistic | All aggregations |
|---|---|---|---|---|---|
| Columns: 1<br>Size: ~8 MB | Cell rules | 786K, 2.5 sec | 386K, 5.2 sec | 189K, 10.6 sec | 184K, 10.9 sec |
| | Agg rules | 1187K, 1.7 sec | 1096K, 1.8 sec | 667K, 3.0 sec | 96K, 20.8 sec |
| | Cell + Agg | 762K, 2.6 sec | 373K, 5.4 sec | 167K, 12.0 sec | 63K, 31.7 sec |
| | Peak Memory | 52 MB | 68 MB | 208 MB | 272 MB |
| Columns: 5<br>Size: 64 MB | Cell rules | 545K, 3.7 sec | 319K, 6.3 sec | 174K, 11.5 sec | 168K, 11.9 sec |
| | Agg rules | 714K, 2.8 sec | 675K, 3.0 sec | 486K, 4.1 sec | 96K, 20.8 sec |
| | Cell + Agg | 538K, 3.7 sec | 308K, 6.5 sec | 154K, 13.0 sec | 61K, 32.8 sec |
| | Peak Memory | 52 MB | 68 MB | 208 MB | 272 MB |
| Columns: 10<br>Size: 220 MB | Cell rules | 311K, 6.4 sec | 221K, 9.0 sec | 137K, 14.6 sec | 135K, 14.8 sec |
| | Agg rules | 362K, 5.5 sec | 354K, 5.6 sec | 294K, 6.8 sec | 96K, 20.8 sec |
| | Cell + Agg | 307K, 6.5 sec | 215K, 9.3 sec | 125K, 16.0 sec | 56K, 35.7 sec |
| | Peak Memory | 52 MB | 68 MB | 208 MB | 272 MB |
| Columns: 20<br>Size: 1.2 GB | Cell rules | 103K, 19.4 sec | 91K, 22.0 sec | 72K, 27.8 sec | 71K, 28.2 sec |
| | Agg rules | 108K, 18.5 sec | 107K, 18.7 sec | 101K, 19.8 sec | 96K, 20.8 sec |
| | Cell + Agg | 102K, 19.6 sec | 89K, 22.5 sec | 69K, 29.0 sec | 41K, 48.8 sec |
| | Peak Memory | 52 MB | 68 MB | 208 MB | 272 MB |

Additional Benchmark Insights:

When running the same validation tests on different hardware configurations, the performance of the tool can vary significantly. Notably, testing on a MacBook 14" M2 Max (2023) yields results that are approximately twice as fast as those observed on the GitHub Actions hardware. This indicates the tool's exceptional performance on modern, high-spec devices.

Conversely, tests conducted on a MacBook Pro (2019) with an Intel 2.4 GHz processor align closely with the GitHub Actions results, suggesting that the benchmark table provided reflects an average performance level for typical engineering hardware.

Brief conclusions

Examples of CSV Files

The CSV files used for benchmark testing are described below. Each file was initially generated with PHP Faker to create the first 2000 lines. Subsequently, the file was replicated into itself 1000 times over, allowing significantly large random files to be created in a matter of seconds.
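This is not the project's actual generator, but a rough shell illustration of the replication trick (10 doublings ≈ 1024× the original rows; the seed file name is hypothetical):

```sh
# seed.csv: 2000 Faker-generated lines including a header (hypothetical input).
head -n 1 seed.csv > big.csv           # keep the header exactly once
tail -n +2 seed.csv > body.csv         # data rows only

# Double the body 10 times: 2^10 = 1024 copies of the original rows.
for i in $(seq 1 10); do
    cat body.csv body.csv > tmp.csv && mv tmp.csv body.csv
done
cat body.csv >> big.csv
```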

A key principle observed in these files is that as the number of columns increases, the length of the values within these columns also tends to increase, following a pattern akin to exponential growth.

Columns: 1, Size: 8.48 MB
Columns: 5, Size: 64 MB
Columns: 10, Size: 220 MB
Columns: 20, Size: 1.2 GB

Run benchmark locally

Make sure you have PHP 8.2+ and Docker installed.

Disadvantages?

The perception that PHP is inherently slow is a common misconception. However, with the right optimization strategies, PHP can perform exceptionally well. For evidence, refer to the article Processing One Billion CSV Rows in PHP!, which demonstrates that PHP can process, aggregate, and calculate data from CSV files at approximately 15 million lines per second! While not all optimizations are currently implemented, the performance is already quite impressive.

However, for the majority of scenarios, these are not deal-breakers. The utility effectively addresses the challenge of validating CSV files in continuous integration (CI) environments. 👍

This utility is designed for immediate use without requiring a deep understanding of its inner mechanics. It adheres to rigorous testing standards, including strict typing and roughly seven linters and static analyzers at their strictest rule levels. Furthermore, every pull request goes through around ten different checks on GitHub Actions, spanning a matrix of PHP versions and modes, ensuring robustness. The extensive coverage and precautions reflect the unpredictability of real-world usage conditions, in the spirit of the Open Source community.

In summary, the tool is developed with the highest standards of modern PHP practices, ensuring it performs as expected.

Coming soon

These are random ideas and plans, with no promises or deadlines. Feel free to help!

CLICK to see the roadmap

- **Batch processing**
  - If the `--csv` option is not specified, then STDIN is used, to allow building pipelines on Unix-like systems.
- **Validation**
  - Multi `filename_pattern`. Support a list of regexes.
  - Multiple values in one cell.
  - Custom cell rule as a callback. Useful when you have a complex rule that can't be described in the schema file.
  - Custom aggregate rule as a callback. Useful when you have a complex rule that can't be described in the schema file.
  - Configurable keyword for null/empty values. By default it's an empty string, but you may want `null`, `nil`, `none`, `empty`, etc. Overridable at the column level.
  - Handle empty files, files with only a header row, or files with only one line of data. One column without a header is also possible.
  - If the `--schema` option is not specified, validate only very basic things (like "is it a CSV file?").
  - Complex rules (like "if field `A` is not empty, then field `B` should not be empty either").
  - Extending with custom rules and custom report formats. Plugins?
  - Input encoding detection + `BOM` (right now it's experimental). It works, but not very accurately... UTF-8 is the best choice for now.
- **Performance and optimization**
  - Using [vectors](https://www.php.net/manual/en/class.ds-vector.php) instead of arrays to optimize memory usage and access speed.
  - Multithreading support for parallel validation of CSV by columns.
- **Mock data generation**
  - Create CSV files based on the schema (like "create 1000 rows with random data based on the schema and rules").
  - Use [Faker](https://github.com/FakerPHP/Faker) for random data generation.
  - [ReverseRegex](https://github.com/enso-media/ReverseRegex) to generate text from a regex.
- **Reporting**
  - More report formats (like JSON, XML, etc). Any ideas?
  - GitLab and JUnit reports should share one structure. It's not easy to implement, but it's a good idea.
  - Merge reports from multiple CSV files into one report. Useful when you have a lot of files and want to see all errors in one place, especially for GitLab and JUnit reports.
- **Misc**
  - Rewrite in Go/Rust. A standalone binary with the same functionality would be a good idea.
  - Install via brew on macOS.
  - Install via apt on Ubuntu.
  - Use it as a PHP SDK, with examples in the Readme.
  - Warnings about deprecated options and features.
  - Add an option `--recommendation` to show a list of recommended rules for the schema, or potential issues in the CSV file or schema. Useful when you are not sure which rules to use.
  - Add an option `--error=[level]` to show only errors of a specific level. Useful when you have a lot of warnings and want to see only errors.
  - More examples and documentation.

PS. [There is a file](tests/schemas/todo.yml) with my ideas and imagination. It's not a valid schema file, just a draft. I'm not sure if I will implement all of them, but I will try to do my best.

Contributing

If you have any ideas or suggestions, feel free to open an issue or create a pull request.

License

MIT License: It's like free pizza - enjoy it, share it, just don't sell it as your own. And remember, no warranty for stomach aches! 😅

See also

Just an interesting fact: I've achieved a personal milestone. The [initial release](https://github.com/jbzoo/csv-blueprint/releases/tag/0.1) of the project was crafted from the ground up in approximately 3 days, interspersed with regular breaks to care for a 4-month-old baby. Reflecting on the first commit and the earliest git tag, it's clear that this was accomplished over a weekend, using spare moments on my personal laptop.

All versions of csv-blueprint with dependencies

| Requires | Version |
|---|---|
| php | ^8.2 |
| ext-mbstring | * |
| league/csv | ^9.15.0 |
| jbzoo/data | ^7.1.1 |
| jbzoo/cli | ^7.2.2 |
| jbzoo/utils | ^7.2.1 |
| jbzoo/ci-report-converter | ^7.2.1 |
| symfony/yaml | ^7.0.3 |
| symfony/filesystem | ^7.0.6 |
| symfony/finder | ^7.0.0 |
| markrogoyski/math-php | ^2.10.0 |
| respect/validation | ^2.3.7 |
| giggsey/libphonenumber-for-php-lite | ^8.13.35 |
| fidry/cpu-core-counter | ^1.1.0 |
Composer command for our command-line client (download client). This client runs in any environment; you don't need a specific PHP version, etc. The first 20 API calls are free.

Standard Composer command:

The package jbzoo/csv-blueprint contains the following files
