Download the PHP package jbzoo/csv-blueprint without Composer
On this page you can find all versions of the PHP package jbzoo/csv-blueprint. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Package csv-blueprint
Short Description CLI Utility for Validating and Generating CSV files based on custom rules. It ensures your data meets specified criteria, streamlining data management and integrity checks.
License MIT
Information about the package csv-blueprint
CSV Blueprint
Strict and automated line-by-line CSV validation tool based on customizable Yaml schemas.
In seconds, make sure every char in a gigabyte file meets your expectations.
I believe it is the simplest yet most flexible and powerful CSV validator in the world. ☺️
Features
- Just create a simple and friendly YAML file with your CSV schema, and the tool will validate your files line by line. You will get a very detailed report pinpointing the row, column, and rule for every issue.
- Out of the box, you have access to over 330 validation rules that can be combined to control the severity of validation.
- You can validate each value (e.g., a date must match a strict format on every line) or the entire column (e.g., the median of all values must be within limits). It's up to you to choose the severity of the rules.
- Use it anywhere as it is packaged in GitHub Actions.
- Create a CSV in your pipelines/ETL/CI and ensure that it meets the most stringent expectations.
- Prepare your own libraries with complex rules using presets. This will help you work with hundreds of different files at the same time.
- Create a schema on the fly based on an existing CSV file, and analyze the data in a CSV file to find out what it contains and get a summary report.
A typical cross-team workflow
1. **Data Preparation:** Team A generates CSV data adhering to a predefined format and places the file in a shared location accessible to Team B (e.g., a shared repository or cloud storage).
2. **Notification:** Team A notifies Team B that the data is ready via corporate communication channels (email, chat, task management system).
3. **Validation:** Team B uses predefined validation rules stored in the repository to check the CSV file for accuracy and integrity before importing. This includes verifying date formats, numerical values, and the presence of required columns.
4. **Data Import:** After successful validation, Team B imports the data from the CSV file into their system for further processing.
5. **Error Handling:** If validation identifies errors, the process halts, and Team B provides feedback to Team A for data correction.

**Why Validation is Necessary:**
- **Reduce Errors:** Validating data before import minimizes the likelihood of errors, enhancing data quality.
- **Efficiency:** Prevents time loss on manual error correction post-import.
- **Data Consistency:** Ensures data meets the expectations and requirements of Team B, facilitating accurate processing and analysis.
- **Automation:** Storing validation rules in the repository simplifies automating the checks and updating the validation criteria.

Live demo
As a live demonstration of how the tool works, you can explore the super minimal repository at demo. For more complex examples and various reporting methods, take a look at the demo pipeline with different report types.
See also:
- PR as a live demo - Note the automatic comments in the diff of the PRs.
- .github/workflows/demo.yml
- demo_invalid.yml
- demo_valid.yml
- demo.csv
Table of contents
- Features
- Table of contents
- Usage
- Schema definition
- Presets and reusable schemas
- Parallel processing
- Complete CLI help message
- Report examples
- Benchmarks
- Disadvantages?
- Coming soon
- Contributing
- License
- See also
Usage
Docker container
Ensure you have Docker installed on your machine.
GitHub Action
You can find launch examples in the workflow demo.
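For orientation, wiring the action into a workflow usually looks roughly like the sketch below. This is a hedged sketch, not the official demo: the `csv` and `schema` input names and the `@master` ref are assumptions here, so verify them against the demo workflow (`.github/workflows/demo.yml`) referenced in the live demo section above.

```yaml
# Minimal workflow sketch (not the official demo). Input names and the action
# ref are assumptions; check .github/workflows/demo.yml for authoritative usage.
name: CSV validation
on: [push]

jobs:
  validate-csv:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate CSV file(s)
        uses: jbzoo/csv-blueprint@master
        with:
          csv: ./data/*.csv        # Path or glob of CSV files to validate
          schema: ./schemas/*.yml  # Path or glob of schema files
```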
Phar binary
Ensure you have PHP installed on your machine.
Schema definition
Define your CSV validation schema in YAML for clear and structured configuration. Alternative formats, such as PHP, are also supported, accommodating various preferences and workflow requirements.
The provided example illustrates a schema for a CSV file with a header row. It mandates that the `id` column must not be empty and should only contain integer values. Additionally, the `name` column is required to have a minimum length of 3 characters, ensuring basic data integrity and usefulness.
Example schema in YAML
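The example itself did not survive on this page, so here is a minimal, hedged reconstruction matching the description above. The rule names `is_int` and `length_min` are assumptions for illustration; consult the full schema description below for the exact spelling.

```yaml
# Minimal schema sketch for the example described above.
# Rule names (is_int, length_min) are assumptions for illustration.
csv:
  delimiter: ','
  header: true

columns:
  - name: id
    rules:
      not_empty: true      # id must not be empty
      is_int: true         # ...and must contain only integers

  - name: name
    rules:
      length_min: 3        # name must be at least 3 characters long
```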
Full schema description
In the example YAML file, a detailed description of all features is provided. This documentation is verified through automated tests, ensuring it remains current.
Notes:
- The traditional typing of columns (e.g., `type: integer`) has been intentionally omitted in favor of rules. These rules can be sequenced and combined freely, offering extensive flexibility for CSV file validation.
- All options are optional unless stated otherwise. You have the liberty to include or omit them as you see fit.
- Specifying an incorrect rule name, using non-existent values (not listed below), or assigning an incompatible variable type for any option will result in a schema validation error. To bypass these errors, you may opt to use the `--skip-schema` flag at your discretion, allowing the use of your custom keys in the schema.
- All rules ignore the empty string except `not_empty`, which does not ignore empty strings (length 0). To enforce a non-empty value, apply `not_empty: true`. Note that a single space counts as a character, making the string length `1`. To prevent such scenarios, include `is_trimmed: true` (see the sketch after this list).
- Rules operate independently; they have no knowledge of or influence over one another.
- When a rule's value is `is_some_rule: true`, it merely serves as an activation toggle. Other values represent rule parameters.
- The sequence of rule execution follows their order in the schema, affecting only the order of error messages in the report.
- Unless explicitly stated, most rules are case-sensitive.
- As a fallback, the `regex` rule is available. However, using clear rule combinations is recommended for greater clarity on validation errors.
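To illustrate the note about empty strings and whitespace, a column definition along these lines (a sketch using the rules named above) rejects both empty values and values padded with spaces:

```yaml
# Sketch: reject empty values and values that only look filled.
columns:
  - name: login
    rules:
      not_empty: true    # "" fails; " " alone would still pass (length 1)
      is_trimmed: true   # ...so add this to reject leading/trailing spaces too
```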
Below is a comprehensive list of rules, each accompanied by a brief explanation and example for clarity. This section is also validated through automated tests, ensuring the information is consistently accurate.
Extra checks
In addition to the schema outlined in the YAML configuration, the tool performs several underlying checks by default to ensure thorough validation of your CSV files.
- The `filename_pattern` rule verifies that the file name adheres to the specified regex pattern, ensuring file naming conventions are followed (see the sketch after this list).
- Ensures that the `name` property is defined for each column, applicable only when `csv.header` is set to `true`, to guarantee header integrity.
- The `required` property, when set to `true`, mandates the presence of the specified column in the CSV file, enhancing data completeness. This is only relevant if `csv.header` is `true`.
- Validates that each row contains the correct number of columns, aligning with the schema's defined structure, to prevent data misalignment.
- The `strict_column_order` rule checks for the correct sequential order of columns as defined in the schema, ensuring structural consistency.
- The `allow_extra_columns` rule asserts no additional columns are present in the CSV file beyond those specified in the schema, maintaining strict data fidelity.
  - For `csv.header: true`, it checks if the schema contains any column `name` not found in the CSV file, addressing header discrepancies.
  - For `csv.header: false`, it compares the number of columns in the schema against those in the CSV file, ensuring schema conformity.
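For orientation, the file-level options mentioned in this list sit alongside the column definitions in the schema. The sketch below shows one plausible layout; the `structural_rules` grouping is an assumption, so confirm the exact keys against the full schema description.

```yaml
# Sketch of the file-level checks described above (exact layout is assumed).
filename_pattern: /\.csv$/i   # File name must match this regex

structural_rules:             # Assumed grouping for the structural checks
  strict_column_order: true   # Columns must follow the order defined below
  allow_extra_columns: false  # No columns beyond those defined in the schema

columns:
  - name: id
    required: true            # Column must be present (only if csv.header: true)
```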
Presets and reusable schemas
Presets significantly enhance the efficiency and reusability of schema definitions in CSV file validation by ensuring consistency across various files with common validation rules for fields like user IDs and email addresses.
This uniformity maintains data integrity and simplifies maintenance by allowing centralized updates that automatically apply to all linked schemas. Moreover, presets support customization through field-specific rule overrides, facilitating both standardization and specific needs adaptation.
Focus on your task rather than on copy-pasting.
Example with presets
Let's look at a real-life example. Suppose you have a "library" of different user profile validation rules that can be used in a wide variety of CSV files.
To avoid worrying about integrity and to spare yourself endless copy-pasting, you can reuse ANY(!) existing schema. In fact, this can be considered a form of partial inheritance.
Important notes
- You can make the chain of inheritance infinitely long, i.e. build chains of the form `grand-parent.yml` -> `parent.yml` -> `child.yml` -> `grandchild.yml` -> etc. Of course, only if you like to take risks ;).
- But be careful with circular dependencies. The tool will not be able to handle them, and they can result in an infinite loop.
- Any(!) of the schema files can be used alone or as a library. The syntax is the same.
- Schemas with presets validate themselves, and if there are any obvious issues, you will see them when you try to use the schema. However, logical conflicts between rules are not checked (it's almost impossible from a code perspective). As mentioned above, rules work in isolation and are not aware of each other, so the set of rules is your responsibility, as always.
- An alias in presets must match the regex pattern `/^[a-z0-9-_]+$/i`. Otherwise, it might break the syntax.
If something went wrong
If you're having trouble working with presets and don't understand how CSV Blueprint interprets your schema under the hood, just add the `--dump-schema` option to see the compiled result. There is also a separate CLI command for dumping the schema: `./csv-blueprint dump-schema`.
Let's take a look at what this looks like in code.
- Define a couple of basic rules for database columns.
- Also, one of the files will contain rules specific only to the users profile.
- And of course, let's make a schema that will simultaneously reuse the rules from these two files.
As a result, you don't just get a pile of validation schemas that are difficult to manage; you get something like a framework(!) tailored to the specifics of your project, which matters especially when there are dozens or even hundreds of CSV files and rules. It becomes much easier to achieve consistency, and very often that is quite important.
Database preset
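The source file is not reproduced on this page, so below is a minimal, hedged sketch of what such a preset could look like. The file name `preset_database.yml` and the exact rule names (`is_int`, `allow_values`) are assumptions for illustration; check the project's documentation for the authoritative syntax.

```yaml
# preset_database.yml (hypothetical file name): reusable rules for typical
# database columns. Rule names are assumptions for illustration.
name: Common database columns

columns:
  - name: id
    rules:
      not_empty: true
      is_int: true

  - name: status
    rules:
      not_empty: true
      allow_values: [ active, inactive ]
```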
User data preset
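Again a hedged sketch rather than the original file: it shows a user-profile preset that itself reuses the database preset via an alias. The `presets:` map and the `alias/column` reference syntax are assumptions for illustration.

```yaml
# preset_users.yml (hypothetical file name): user-profile rules that reuse
# the database preset. The "alias/column" reference syntax is an assumption.
name: User profile columns

presets:
  db: ./preset_database.yml   # Alias must match /^[a-z0-9-_]+$/i

columns:
  - preset: db/id             # Inherit the id rules from the database preset

  - name: login
    rules:
      not_empty: true
      is_trimmed: true
      length_min: 3

  - name: email
    rules:
      not_empty: true
      is_email: true          # Assumed rule name for email validation
```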
Usage of presets
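A hedged sketch of a final schema that pulls rules from both preset files at once and overrides one of them, using the same assumed syntax as above:

```yaml
# Final schema sketch (same assumed preset syntax as in the files above).
presets:
  db: ./preset_database.yml
  users: ./preset_users.yml

columns:
  - preset: db/id

  - preset: users/email

  - preset: users/login
    rules:
      length_min: 5           # Override the inherited minimum length
```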
This resulting schema stays short and clear; under the hood it is expanded into roughly the full merged set of rules (add `--dump-schema` to see what it looks like in memory). As you can see, this simplifies your work a lot.
As a result, readability and maintainability became dramatically easier. You can easily add new rules, change existing, etc.
Complete example with all available syntax
Note: All provided YAML examples pass built-in validation, yet they may not make practical sense. These are intended solely for demonstration and to illustrate potential configurations and features.
Parallel processing
The `--parallel` option is available for speeding up the validation of CSV files by utilizing more CPU resources effectively.
Key Points
- Experimental Feature: This feature is currently experimental and requires further debugging and testing. Although it performs well in synthetic autotests and benchmarks, more practical use cases are needed to validate its stability.
- Use Case: This option is beneficial if you are processing dozens of CSV files, with each file taking 1 second or more to process.
- Default Behavior: If you use `--parallel` without specifying a value, it defaults to using the maximum number of available CPU cores.
- Thread Pool Size: You can set a specific number of threads for the pool. For example, `--parallel=10` will set the thread pool size to 10. It doesn't make much sense to specify more than the number of logical cores in your CPU; otherwise, it will only slow things down a bit due to the system overhead of handling multithreading.
- Disabling Parallelism: Using `--parallel=1` disables parallel processing, which is the default setting if the option is not specified.
- Implementation: The feature relies on the `ext-parallel` PHP extension, which enables the creation of lightweight threads rather than processes. This extension is already included in our Docker image. Ensure that you have the `ext-parallel` extension installed if you are not using our Docker image; it is crucial for the operation of the parallel processing feature. The application always runs in single-threaded mode if the extension is not installed.
Complete CLI help message
This section outlines all available options and commands provided by the tool, leveraging the JBZoo/Cli package for its CLI. The tool offers a comprehensive set of options to cater to various needs and scenarios, ensuring flexibility and efficiency in CSV file validation and manipulation.
For detailed information on each command and option, refer to the JBZoo/Cli documentation. This resource provides insights into the functionality and application of the CLI commands, helping users make the most out of the tool's capabilities.
./csv-blueprint validate-csv --help
./csv-blueprint validate-schema --help
./csv-blueprint dump-schema --help
./csv-blueprint create-schema --help
It's beta. Work in progress.
Report examples
The validation process culminates in a human-readable report detailing any errors identified within the CSV file. While the default report format is a table, the tool supports various output formats, including text, GitHub, GitLab, TeamCity, JUnit, among others, to best suit your project's needs and your personal or team preferences.
GitHub Action format
To see user-friendly error outputs in your pull requests (PRs), specify `report: github`. This utilizes annotations to highlight bugs directly within the GitHub interface at the PR level. This feature allows errors to be displayed in the exact location within the CSV file, right in the diff of your Pull Requests. For a practical example, view this live demo PR.
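In a workflow, that typically means setting the report input on the action step, roughly like the sketch below. As before, the exact input names are assumptions here; the demo workflow is the authoritative reference.

```yaml
# Sketch: the same action step as earlier, with GitHub annotations enabled.
# Input names (csv, schema, report) are assumptions; see the demo workflow.
- name: Validate CSV file(s)
  uses: jbzoo/csv-blueprint@master
  with:
    csv: ./data/*.csv
    schema: ./schemas/*.yml
    report: github   # Errors show up as annotations in the PR diff
```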
![GitHub Actions - Terminal](.github/assets/github-actions-termintal.png)
Text format
Optional format `text` with highlighted keywords for quick navigation.
Table format
When using the `table` format (default), the output is organized in a clear, easily interpretable table that lists all discovered errors. This format is ideal for quick reviews and sharing with team members for further action.
Notes
- The report format for GitHub Actions is `table` by default.
- The tool uses JBZoo/CI-Report-Converter as an SDK to convert reports to different formats, so you can easily integrate it with any CI system.
Benchmarks
Understanding the performance of this tool is crucial, but it's important to note that its efficiency is influenced by several key factors:
- File Size: The dimensions of the CSV file, both in terms of rows and columns, directly impact processing time. Performance scales linearly with file size and is dependent on the capabilities of your hardware, such as CPU and SSD speed.
- Number of Rules: More validation rules per column mean more iterations for processing. Each rule operates independently, so the total time and memory consumption are cumulative across all rules.
- Rule Intensity: While most validation rules are optimized for speed and low memory usage, some, like `interquartile_mean`, can be significantly slower. For instance, `interquartile_mean` might process around 4,000 lines per second, whereas other rules can handle upwards of 50 million lines per second.
However, to gain a general understanding of performance, refer to the table below.
- All tests were conducted on a dataset comprising 2 million lines plus an additional line for the header.
- These results are derived from the most current version, as verified by tests run using GitHub Actions (see workflow.yml). The link provides access to a variety of builds, which are essential for different testing scenarios and experiments. The most representative data can be found under `Docker (latest, XX)`.
- Developer mode was activated for these tests, using the flags `-vvv --debug --profile`.
- The testing environment included the latest Ubuntu + Docker. For more information about the GitHub Actions (GA) hardware used, please see the details about GA hardware.
- The primary performance indicator is the processing speed, measured in lines per second. Note that speeds are presented in thousands of lines per second (`100K` equals 100,000 lines per second).
- Peak RAM usage throughout the duration of each test case serves as an additional performance metric.
Profiles:
- Quickest: Focuses on the fastest rules, either cell or aggregation, providing a baseline.
- Minimum: Uses a set of normal performance rules, with two instances of each.
- Realistic: Represents a mix of rules likely encountered in typical use cases.
- All Aggregations: Tests all aggregation rules simultaneously, illustrating maximum load.
Divisions:
- Cell Rules: Tests only individual cell validation rules.
- Agg Rules: Focuses solely on column-wide aggregation rules.
- Cell + Agg: Combines cell and aggregation rules for comprehensive validation.
- Peak Memory: Indicates the maximum RAM usage, particularly relevant in scenarios with aggregation rules.
Note: The `Peak Memory` metric is primarily of interest when aggregation rules are used, as non-aggregation scenarios typically require no more than 2-4 megabytes of memory, regardless of file size or rule count.
These benchmarks offer a snapshot of the tool's capabilities across a range of scenarios, helping you gauge its suitability for your specific CSV validation needs.
| File / Profile | Metric | Quickest | Minimum | Realistic | All aggregations |
|---|---|---|---|---|---|
| Columns: 1<br>Size: ~8 MB | Cell rules<br>Agg rules<br>Cell + Agg<br>Peak Memory | 786K, 2.5 sec<br>1187K, 1.7 sec<br>762K, 2.6 sec<br>52 MB | 386K, 5.2 sec<br>1096K, 1.8 sec<br>373K, 5.4 sec<br>68 MB | 189K, 10.6 sec<br>667K, 3.0 sec<br>167K, 12.0 sec<br>208 MB | 184K, 10.9 sec<br>96K, 20.8 sec<br>63K, 31.7 sec<br>272 MB |
| Columns: 5<br>Size: 64 MB | Cell rules<br>Agg rules<br>Cell + Agg<br>Peak Memory | 545K, 3.7 sec<br>714K, 2.8 sec<br>538K, 3.7 sec<br>52 MB | 319K, 6.3 sec<br>675K, 3.0 sec<br>308K, 6.5 sec<br>68 MB | 174K, 11.5 sec<br>486K, 4.1 sec<br>154K, 13.0 sec<br>208 MB | 168K, 11.9 sec<br>96K, 20.8 sec<br>61K, 32.8 sec<br>272 MB |
| Columns: 10<br>Size: 220 MB | Cell rules<br>Agg rules<br>Cell + Agg<br>Peak Memory | 311K, 6.4 sec<br>362K, 5.5 sec<br>307K, 6.5 sec<br>52 MB | 221K, 9.0 sec<br>354K, 5.6 sec<br>215K, 9.3 sec<br>68 MB | 137K, 14.6 sec<br>294K, 6.8 sec<br>125K, 16.0 sec<br>208 MB | 135K, 14.8 sec<br>96K, 20.8 sec<br>56K, 35.7 sec<br>272 MB |
| Columns: 20<br>Size: 1.2 GB | Cell rules<br>Agg rules<br>Cell + Agg<br>Peak Memory | 103K, 19.4 sec<br>108K, 18.5 sec<br>102K, 19.6 sec<br>52 MB | 91K, 22.0 sec<br>107K, 18.7 sec<br>89K, 22.5 sec<br>68 MB | 72K, 27.8 sec<br>101K, 19.8 sec<br>69K, 29.0 sec<br>208 MB | 71K, 28.2 sec<br>96K, 20.8 sec<br>41K, 48.8 sec<br>272 MB |
Additional Benchmark Insights:
When running the same validation tests on different hardware configurations, the performance of the tool can vary significantly. Notably, testing on a MacBook 14" M2 Max (2023) yields results that are approximately twice as fast as those observed on the GitHub Actions hardware. This indicates the tool's exceptional performance on modern, high-spec devices.
Conversely, tests conducted on a MacBook Pro (2019) with an Intel 2.4 GHz processor align closely with the GitHub Actions results, suggesting that the benchmark table provided reflects an average performance level for typical engineering hardware.
Brief conclusions
- Cell Rules: These rules are highly CPU-intensive but require minimal RAM, typically around 1-2 MB at peak. The more cell rules applied to a column, the longer the validation process takes due to the additional actions performed on each value.
- Aggregation Rules: These rules operate at incredible speeds, processing anywhere from 10 million to billions of rows per second. However, they are significantly more RAM-intensive. Interestingly, adding over 100 different aggregation rules does not substantially increase memory consumption.
- PHP Array Functions: Not all PHP array functions can operate by reference (`&$var`). Whether a dataset in a column can be manipulated this way is highly dependent on the specific algorithm used. For example, a 20 MB dataset might be duplicated during processing, leading to a peak memory usage of 40 MB. Consequently, optimization techniques that rely on passing data by reference are often ineffective.
- Practical Implications: If processing a 1 GB file within 30-60 seconds is acceptable, and if there is 200-500 MB of RAM available, there may be little need to overly concern oneself with these performance considerations.
- Memory Management: Throughout testing, no memory leaks were observed.
Examples of CSV Files
The CSV files utilized for benchmark testing are described below. These files were initially generated using PHP Faker to create the first 2000 lines. Subsequently, they were replicated 1000 times within themselves, allowing for the creation of significantly large random files in a matter of seconds.
A key principle observed in these files is that as the number of columns increases, the length of the values within these columns also tends to increase, following a pattern akin to exponential growth.
Columns: 1, Size: 8.48 MB
Columns: 5, Size: 64 MB
Columns: 10, Size: 220 MB
Columns: 20, Size: 1.2 GB
Run benchmark locally
Make sure you have PHP 8.2+ and Docker installed.
Disadvantages?
The perception that PHP is inherently slow is a common misconception. However, with the right optimization strategies, PHP can perform exceptionally well. For evidence, refer to the article Processing One Billion CSV Rows in PHP!, which demonstrates that PHP can process, aggregate, and calculate data from CSV files at approximately 15 million lines per second! While not all optimizations are currently implemented, the performance is already quite impressive.
- Yes, it's acknowledged that this tool might not be the fastest available, but it's also far from the slowest. For more details, see the link provided above.
- Yes, the tool is built with PHP—not Python, Go, or PySpark—which may not be the first choice for such tasks.
- Yes, it functions like a standalone binary. The recommendation is simply to use it without overthinking its internal workings.
- Yes, it's recognized that this cannot be used as a Python SDK within a pipeline.
However, for the majority of scenarios, these are not deal-breakers. The utility effectively addresses the challenge of validating CSV files in continuous integration (CI) environments. 👍
This utility is designed for immediate use without necessitating a deep understanding of its inner mechanics. It adheres to rigorous testing standards, including strict typing, approximately seven linters and static analyzers at the highest rule level. Furthermore, every pull request is subjected to around ten different checks on GitHub Actions, spanning a matrix of PHP versions and modes, ensuring robustness. The extensive coverage and precautions are due to the unpredictability of usage conditions, embodying the spirit of the Open Source community.
In summary, the tool is developed with the highest standards of modern PHP practices, ensuring it performs as expected.
Coming soon
These are random ideas and plans, with no promises or deadlines. Feel free to help me!
The current roadmap:
- **Batch processing**
  - If option `--csv` is not specified, then STDIN is used, to build a pipeline in Unix-like systems.
- **Validation**
  - Multi `filename_pattern`. Support a list of regexes.
  - Multi values in one cell.
  - Custom cell rule as a callback. It's useful when you have a complex rule that can't be described in the schema file.
  - Custom aggregate rule as a callback. It's useful when you have a complex rule that can't be described in the schema file.
  - Configurable keyword for null/empty values. By default, it's an empty string, but you may use `null`, `nil`, `none`, `empty`, etc. Overridable at the column level.
  - Handle empty files and files with only a header row, or with only one line of data. One column without a header is also possible.
  - If option `--schema` is not specified, then validate only very basic things (like "is it a CSV file?").
  - Complex rules (like "if field `A` is not empty, then field `B` should not be empty either").
  - Extending with custom rules and custom report formats. Plugins?
  - Input encoding detection + `BOM` (right now it's experimental). It works, but not so accurately... UTF-8 is the best choice for now.
- **Performance and optimization**
  - Using [vectors](https://www.php.net/manual/en/class.ds-vector.php) instead of arrays to optimize memory usage and speed of access.
  - Multithreading support for parallel validation of CSV by columns.
- **Mock data generation**
  - Create CSV files based on the schema (like "create 1000 rows with random data based on schema and rules").
  - Use [Faker](https://github.com/FakerPHP/Faker) for random data generation.
  - [ReverseRegex](https://github.com/enso-media/ReverseRegex) to generate text from regex.
- **Reporting**
  - More report formats (like JSON, XML, etc). Any ideas?
  - GitLab and JUnit reports must share one structure. It's not so easy to implement, but it's a good idea.
  - Merge reports from multiple CSV files into one report. It's useful when you have a lot of files and you want to see all errors in one place, especially for GitLab and JUnit reports.
- **Misc**
  - Rewrite in Go/Rust. It's a good idea to have a standalone binary with the same functionality.
  - Install via brew on MacOS.
  - Install via apt on Ubuntu.
  - Use it as a PHP SDK. Examples in the Readme.
  - Warnings about deprecated options and features.
  - Add option `--recomendation` to show a list of recommended rules for the schema or potential issues in the CSV file or schema. It's useful when you are not sure what rules to use.
  - Add option `--error=[level]` to show only errors with a specific level. It's useful when you have a lot of warnings and you want to see only errors.
  - More examples and documentation.

PS. [There is a file](tests/schemas/todo.yml) with my ideas and imagination. It's not a valid schema file, just a draft. I'm not sure if I will implement all of them, but I will try to do my best.

Contributing
If you have any ideas or suggestions, feel free to open an issue or create a pull request.
License
MIT License: It's like free pizza - enjoy it, share it, just don't sell it as your own. And remember, no warranty for stomach aches! 😅
See also
- Cli - Framework helps create complex CLI apps and provides new tools for Symfony/Console.
- CI-Report-Converter - It converts different error reporting standards for popular CI systems.
- Composer-Diff - See what packages have changed after `composer update`.
- Composer-Graph - Dependency graph visualization of `composer.json` based on Mermaid JS.
- Mermaid-PHP - Generate diagrams and flowcharts with the help of the mermaid script language.
- Utils - Collection of useful PHP functions, mini-classes, and snippets for every day.
- Image - Package provides an object-oriented way to manipulate images as simply as possible.
- Data - Extended implementation of ArrayObject. Use Yml/PHP/JSON/INI files as config. Forget about arrays.
- Retry - Tiny PHP library providing retry/backoff functionality with strategies and jitter.
Just an interesting fact
I've achieved a personal milestone. The [initial release](https://github.com/jbzoo/csv-blueprint/releases/tag/0.1) of the project was crafted from the ground up in approximately 3 days, interspersed with regular breaks to care for a 4-month-old baby. Reflecting on the first commit and the earliest git tag, it's clear that this was accomplished over a weekend, utilizing spare moments on my personal laptop.
All versions of csv-blueprint with dependencies
ext-mbstring Version *
league/csv Version ^9.15.0
jbzoo/data Version ^7.1.1
jbzoo/cli Version ^7.2.2
jbzoo/utils Version ^7.2.1
jbzoo/ci-report-converter Version ^7.2.1
symfony/yaml Version ^7.0.3
symfony/filesystem Version ^7.0.6
symfony/finder Version ^7.0.0
markrogoyski/math-php Version ^2.10.0
respect/validation Version ^2.3.7
giggsey/libphonenumber-for-php-lite Version ^8.13.35
fidry/cpu-core-counter Version ^1.1.0