Download the PHP package ecourty/text-chunker without Composer

On this page you can find all versions of the PHP package ecourty/text-chunker. You can download and install these versions without Composer; any dependencies are resolved automatically.

FAQ

After the download, add a single include, require_once('vendor/autoload.php');, and then import the classes you need with use statements.

Example:
If you use only one package, a project is not needed. If you use more than one package, however, you cannot import the classes with use statements without a project.

In general, it is recommended to always use a project to download your libraries, since an application normally needs more than one library.
Some PHP packages are not free to download and are therefore hosted in private repositories. Accessing such packages requires credentials. If a package comes from a private repository, enter the credentials in the auth.json textarea.

  • Some hosting environments cannot be reached by a terminal or SSH, so Composer cannot be used there.
  • Composer can be complicated to use, especially for beginners.
  • Composer needs a lot of resources, which are not always available on simple web hosting.
  • If you are using private repositories, you don't need to share your credentials. You can set up everything on our site and then give your team members a simple download link.
  • Simplify your Composer build process. Use our command line tool to download the vendor folder as a binary. This makes your build process faster, and you don't need to expose your credentials for private repositories.

Information about the package text-chunker

php-text-chunker

A framework-agnostic PHP library for splitting text and files into meaningful chunks, using pluggable strategies and a composable post-processing pipeline.

Installation

Requirements: PHP >= 8.3


Core Features


Quick Start

Chunk from a string:


Chunking Strategies

Strategy | Splits on | Key options
ParagraphChunkingStrategy | Double newlines (\n\n) | -
SentenceChunkingStrategy | Sentence-ending punctuation (. ! ?) | -
FixedSizeChunkingStrategy | Fixed character count | chunkSize (default: 1000)
DialogueChunkingStrategy | Dialogue lines, context-aware grouping | targetChunkSize, minChunkSize
MarkdownChunkingStrategy | Markdown headers (# to ######) | minHeadingLevel, maxHeadingLevel
WordCountChunkingStrategy | Fixed word count, respects word boundaries | wordCount (default: 200)
RegexChunkingStrategy | Configurable regex pattern | pattern, delimiterPosition (None, Prefix, or Suffix)
LineChunkingStrategy | N consecutive lines per chunk | linesPerChunk (default: 10)
RecursiveChunkingStrategy | Cascade of strategies with a size limit | strategies[], maxChunkSize

RecursiveChunkingStrategy applies strategies[0] to the stream and immediately re-splits any chunk exceeding maxChunkSize using strategies[1], then strategies[2], and so on. It is streaming-safe: it never buffers more than one chunk at a time.
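
The cascade can be pictured with a small standalone sketch. This is not the library's implementation: resplitOversized() and the three splitter closures are hypothetical names, the splitters are plain callables rather than strategy objects, and size is measured in bytes with strlen() for simplicity.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: yields each chunk as-is when it fits, otherwise
 * re-splits it with the next fallback splitter, and so on down the list.
 *
 * @param iterable<string> $chunks output of the first splitter
 * @param list<callable(string): iterable<string>> $fallbacks splitters for oversized chunks
 */
function resplitOversized(iterable $chunks, array $fallbacks, int $maxChunkSize): \Generator
{
    foreach ($chunks as $chunk) {
        // Emit the chunk if it fits, or if no fallback splitter is left.
        if (strlen($chunk) <= $maxChunkSize || $fallbacks === []) {
            yield $chunk;
            continue;
        }

        // Re-split only this chunk; the remaining fallbacks handle its pieces.
        yield from resplitOversized(
            $fallbacks[0]($chunk),
            array_slice($fallbacks, 1),
            $maxChunkSize,
        );
    }
}

// Illustrative splitters (assumptions, not the library's strategies).
$byParagraph = static fn (string $text): array => preg_split('/\n{2,}/', $text, -1, PREG_SPLIT_NO_EMPTY);
$bySentence  = static fn (string $text): array => preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
$byFixedSize = static fn (string $text): array => str_split($text, 20);

$text   = "Short one.\n\nThis paragraph is longer. It has two sentences.";
$result = iterator_to_array(
    resplitOversized($byParagraph($text), [$bySentence, $byFixedSize], 30),
    false,
);
```

Because the function is a generator, only the chunk currently being re-split is held at any moment, which mirrors the one-chunk-at-a-time behaviour described above.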


Post-Processors

Post-processors are applied in sequence after chunking. Chain them with withPostProcessor().

Post-processor | Description | Key options
OverlappingChunkPostProcessor | Prepends the tail of the previous chunk for context continuity | overlapSize (default: 200)
TokenLimitPostProcessor | Splits chunks exceeding a token budget | maxTokens, charactersPerToken
MetadataEnricherPostProcessor | Adds chunk_index, total_chunks, word_count, char_count, source | -
ChunkFilterPostProcessor | Removes empty or too-short chunks | minLength, removeEmpty
ChunkMergerPostProcessor | Merges consecutive small chunks until minChunkSize is reached | minChunkSize (default: 200), separator
TextNormalizationPostProcessor | Collapses whitespace, trims lines, strips control characters | collapseWhitespace, trimLines, stripControlChars
DeduplicationPostProcessor | Removes duplicate chunks by md5 content hash; adds content_hash metadata | -
RegexReplacePostProcessor | Applies ordered [pattern => replacement] substitutions to each chunk's text | replacements[]
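
To make the "applied in sequence" idea concrete, here is a minimal standalone sketch of two chained steps, a filter followed by an overlap. It does not use the library's classes: dropShort() and prependOverlap() are hypothetical functions operating on plain strings, and the overlap is taken from the previous chunk's original text, which is an assumption about the intended behaviour.

```php
<?php
declare(strict_types=1);

/** Hypothetical filter step: drops chunks whose trimmed length is below $minLength. */
function dropShort(iterable $chunks, int $minLength): \Generator
{
    foreach ($chunks as $chunk) {
        if (strlen(trim($chunk)) >= $minLength) {
            yield $chunk;
        }
    }
}

/** Hypothetical overlap step: prepends the last $overlapSize bytes of the previous chunk. */
function prependOverlap(iterable $chunks, int $overlapSize): \Generator
{
    $previous = null;
    foreach ($chunks as $chunk) {
        // The first chunk has no predecessor, so it is passed through unchanged.
        $tail = ($previous !== null && $overlapSize > 0)
            ? substr($previous, -$overlapSize)
            : '';
        yield $tail . $chunk;
        $previous = $chunk;
    }
}

// Each step wraps the previous one, so chunks flow through lazily, one at a time.
$chunks   = ['One. ', '', 'Two. ', 'Three.'];
$pipeline = prependOverlap(dropShort($chunks, 1), 5);
$result   = iterator_to_array($pipeline, false);
```

Order matters in such a chain: filtering before adding overlap means removed chunks never contribute a tail to their successors.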

Configuration Reference

TextChunker

Method | Description
setFile(string $path) | Set source file (streamed)
setText(string $text) | Set source string
withMetadata(array $meta) | Attach global metadata to every chunk
withPostProcessor(...) | Add a post-processor to the pipeline
withPostProcessors(...) | Add multiple post-processors at once (variadic)
withReader(ReaderInterface) | Inject a custom reader (see below)
chunk(ChunkingStrategyInterface) | Returns a Generator<Chunk>

Chunk

Method | Returns
getText() | string: the chunk content
getPosition() | int: index in the sequence
getMetadata() | array: associated metadata
getLength() | int: character count
withMetadata(array) | New Chunk with merged metadata
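
The "new Chunk with merged metadata" behaviour suggests an immutable value object. The sketch below illustrates that pattern with a hypothetical SketchChunk class; it is not the library's Chunk. In particular, the constructor signature, the use of array_merge() (so later keys win), and the byte-based strlen() length are all assumptions.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in mirroring the method names listed above.
final class SketchChunk
{
    public function __construct(
        private readonly string $text,
        private readonly int $position,
        private readonly array $metadata = [],
    ) {
    }

    public function getText(): string
    {
        return $this->text;
    }

    public function getPosition(): int
    {
        return $this->position;
    }

    public function getMetadata(): array
    {
        return $this->metadata;
    }

    public function getLength(): int
    {
        // Byte length; a character count would need a multibyte-aware function.
        return strlen($this->text);
    }

    public function withMetadata(array $metadata): self
    {
        // Returns a copy; the original instance is left untouched.
        return new self($this->text, $this->position, array_merge($this->metadata, $metadata));
    }
}

$original = new SketchChunk('Hello', 0, ['source' => 'demo']);
$enriched = $original->withMetadata(['lang' => 'en']);
```

Returning a copy rather than mutating in place keeps chunks safe to pass through several post-processing steps without one step affecting what another has already seen.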

Custom Readers

By default, setFile() reads from the local filesystem via LocalFileReader. To read from a remote source (S3, Azure Blob, SFTP, etc.), implement ReaderInterface and inject it via withReader().

ReaderInterface has a single method: readChunks(string $path, int $bufferSize): \Generator<string>. Yield string chunks of arbitrary size; the chunking strategies handle the rest. The $path passed to readChunks() is whatever string you gave to setFile(), so it can be an S3 key, a URI, or any identifier your reader understands.

Example with Flysystem (works with S3, Azure, SFTP, GCS, and more):
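
A minimal sketch of such a reader follows. It is not the package's own example: the ReaderInterface declared here is a local stand-in with the signature described above (the real interface ships with the package, under a namespace not shown on this page), and StreamReader and its $opener closure are hypothetical names. The opener is the place where a Flysystem call such as readStream() would presumably go; an in-memory stream is used instead so the sketch runs on its own.

```php
<?php
declare(strict_types=1);

// Local stand-in for the package's ReaderInterface (assumed signature).
interface ReaderInterface
{
    /** @return \Generator<string> */
    public function readChunks(string $path, int $bufferSize): \Generator;
}

// Hypothetical reader: $opener maps a path to a readable stream resource.
final class StreamReader implements ReaderInterface
{
    public function __construct(private \Closure $opener)
    {
    }

    public function readChunks(string $path, int $bufferSize): \Generator
    {
        $stream = ($this->opener)($path);

        try {
            while (!feof($stream)) {
                $buffer = fread($stream, $bufferSize);
                if ($buffer === false || $buffer === '') {
                    break; // read error or nothing left
                }
                yield $buffer;
            }
        } finally {
            // Runs even if the consumer stops iterating early.
            fclose($stream);
        }
    }
}

// Stand-in opener: ignores the path and serves fixed content from memory.
$opener = static function (string $path) {
    $stream = fopen('php://memory', 'w+');
    fwrite($stream, "First paragraph.\n\nSecond paragraph.");
    rewind($stream);

    return $stream;
};

$reader = new StreamReader($opener);
$pieces = iterator_to_array($reader->readChunks('docs/report.txt', 8), false);
```

The buffers produced here cut the text at arbitrary byte offsets, which is acceptable because, as noted above, the chunking strategies are responsible for finding the real boundaries.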


Performance

Benchmarked with PHPBench on real-world datasets (Bible KJV, Les Misérables, Encyclopaedia Britannica 11th Ed.). See BENCHMARKS.md for the full results.

Strategy throughput (Bible KJV, 4.26 MB):

Strategy | Time | Throughput
SentenceChunkingStrategy | 43 ms | ~98 MB/s
FixedSizeChunkingStrategy | 44 ms | ~98 MB/s
LineChunkingStrategy | 46 ms | ~93 MB/s
ParagraphChunkingStrategy | 293 ms | ~15 MB/s
WordCountChunkingStrategy | 377 ms | ~11 MB/s

Post-processor overhead (50 KB excerpt): all 8 processors run in < 3 ms. Chain freely.

The library is streaming-first: most strategies hold only about 2 MB in memory regardless of input file size.
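
As an illustration of what streaming-first means in practice, the sketch below splits paragraphs from a sequence of arbitrary-sized buffers while keeping only the unfinished tail in memory. It is a simplified stand-in, not the library's code: paragraphsFromBuffers() is a hypothetical function, and it treats exactly two newlines as the delimiter.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical streaming splitter: consumes buffers of any size and yields
 * complete paragraphs as soon as their closing delimiter has been seen.
 *
 * @param iterable<string> $buffers
 * @return \Generator<string>
 */
function paragraphsFromBuffers(iterable $buffers): \Generator
{
    $carry = '';

    foreach ($buffers as $buffer) {
        $carry .= $buffer;
        $parts  = explode("\n\n", $carry);

        // The last piece may be an unfinished paragraph, so keep it for later.
        $carry = array_pop($parts);

        foreach ($parts as $part) {
            if ($part !== '') {
                yield $part;
            }
        }
    }

    // Whatever remains after the last buffer is the final paragraph.
    if ($carry !== '') {
        yield $carry;
    }
}

// Buffers deliberately cut mid-word and mid-delimiter.
$buffers    = ['First par', "agraph.\n", "\nSecond.", "\n\nThird"];
$paragraphs = iterator_to_array(paragraphsFromBuffers($buffers), false);
```

Memory use in this sketch is bounded by the longest paragraph plus one buffer, not by the size of the whole input.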


Datasets

The datasets/ directory contains large text corpora used for benchmarking chunking strategies. All texts are in the public domain and sourced from Project Gutenberg.

File | Source | Size | Notes
bible_kjv.txt | King James Bible (PG #10) | ~4.5 MB | Great for sentence and paragraph benchmarks
les_miserables.txt | Les Misérables by Victor Hugo (PG #17489–17496) | ~2.6 MB | All 5 tomes in French, ideal for paragraph chunking
britannica/ | Encyclopaedia Britannica, 11th Edition | ~118 MB | 92 volumes of dense encyclopaedic text

Headers and Project Gutenberg license preambles can be stripped before benchmarking to work with clean content only.


Development

Extending the library

Implement ChunkingStrategyInterface to create a custom strategy, or ChunkPostProcessorInterface for a custom post-processor. See AGENTS.md for detailed guidelines.


All versions of text-chunker with dependencies

Requires: php >=8.3
