Package phpvector
Short Description A vector database in PHP implementing HNSW for approximate nearest-neighbor search and BM25 for hybrid full-text + vector retrieval.
License MIT
PHPVector
A pure-PHP vector database implementing HNSW (Hierarchical Navigable Small World) for approximate nearest-neighbour search and BM25 for full-text retrieval. Both engines can be combined into a single hybrid search pipeline.
Requirements
- PHP 8.2+
- No external PHP extensions required for core functionality
- `ext-pcntl` (optional): enables asynchronous document writes for lower insert latency
Installation
Install the package with Composer: `composer require ezimuel/phpvector`.
Quick start
1. Insert documents
A Document holds a dense embedding vector, optional raw text for BM25, and any metadata you want returned with results. The id field is optional — if omitted, a random UUID v4 is assigned automatically.
2. Vector search
Find the k most similar documents to a query vector using HNSW.
3. Full-text search
Rank documents by BM25 relevance against a text query.
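To make the ranking concrete, here is a minimal standalone sketch of the textbook Okapi BM25 formula. It does not use PHPVector's classes and is not its internal implementation; the function name `bm25Scores()`, the pre-tokenized input, and the defaults `k1 = 1.5` and `b = 0.75` are assumptions chosen for illustration.

```php
<?php
/**
 * Score tokenized documents against a tokenized query with Okapi BM25.
 * Illustrative only; names and defaults are assumptions, not the library API.
 *
 * @param list<list<string>> $docs  tokenized documents
 * @param list<string>       $query tokenized query
 * @return list<float> one score per document, in the same order as $docs
 */
function bm25Scores(array $docs, array $query, float $k1 = 1.5, float $b = 0.75): array
{
    $n = count($docs);
    if ($n === 0) {
        return [];
    }
    $avgLen = array_sum(array_map('count', $docs)) / $n;

    // Document frequency: how many documents contain each term.
    $df = [];
    foreach ($docs as $doc) {
        foreach (array_unique($doc) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $scores = [];
    foreach ($docs as $doc) {
        $tf = array_count_values($doc); // term frequency inside this document
        $len = count($doc);
        $score = 0.0;
        foreach (array_unique($query) as $term) {
            if (!isset($tf[$term])) {
                continue; // term absent from this document contributes nothing
            }
            $idf = log(1 + ($n - $df[$term] + 0.5) / ($df[$term] + 0.5));
            $lengthNorm = $k1 * (1 - $b + $b * $len / $avgLen);
            $score += $idf * ($tf[$term] * ($k1 + 1)) / ($tf[$term] + $lengthNorm);
        }
        $scores[] = $score;
    }
    return $scores;
}
```

Documents that do not contain any query term score `0.0`; repeated occurrences of a term raise the score with diminishing returns, dampened by document length.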
4. Hybrid search
Fuse vector similarity and BM25 scores into a single ranked list.
Reciprocal Rank Fusion (recommended)
RRF is rank-based and scale-invariant — no tuning required.
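The idea behind RRF can be sketched in a few lines of standalone PHP. This is an illustration of the general technique, not the library's code: the function name `reciprocalRankFusion()` and the smoothing constant `k = 60` (a value commonly used in the literature) are assumptions.

```php
<?php
/**
 * Fuse several ranked ID lists with Reciprocal Rank Fusion.
 * score(d) = sum over lists of 1 / (k + rank of d), with ranks starting at 1.
 * Illustrative only; the name and default k are assumptions.
 *
 * @param list<list<int|string>> $rankings each list is ordered best-first
 * @return array<int|string, float> ID => fused score, best first
 */
function reciprocalRankFusion(array $rankings, int $k = 60): array
{
    $scores = [];
    foreach ($rankings as $ranking) {
        foreach (array_values($ranking) as $position => $id) {
            // $position is 0-based, so the rank is $position + 1.
            $scores[$id] = ($scores[$id] ?? 0.0) + 1.0 / ($k + $position + 1);
        }
    }
    arsort($scores); // highest fused score first
    return $scores;
}

// Hypothetical rankings from a vector search and a text search.
$vectorRanking = ['a', 'b', 'c'];
$textRanking   = ['b', 'd', 'a'];
$fused = reciprocalRankFusion([$vectorRanking, $textRanking]);
```

Only positions matter here, never the raw scores, which is why the two engines do not need comparable score scales.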
Weighted combination
Normalises both score ranges to [0, 1] then applies explicit weights.
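As a rough standalone sketch of this mode, the snippet below assumes min-max normalisation and takes the weights as plain floats. The function names and the example weights `0.7` and `0.3` are hypothetical, and the library may normalise differently.

```php
<?php
/** Min-max normalise scores to [0, 1]; a list with no spread maps to all 1.0. */
function minMaxNormalise(array $scores): array
{
    if ($scores === []) {
        return [];
    }
    $min = min($scores);
    $range = max($scores) - $min;
    return array_map(
        fn ($s) => $range > 0 ? ($s - $min) / $range : 1.0,
        $scores
    );
}

/**
 * Combine two ID => score maps after normalising each to [0, 1].
 * An ID missing from one map contributes 0.0 for that engine.
 * Illustrative only; names and default weights are assumptions.
 */
function weightedFusion(
    array $vectorScores,
    array $textScores,
    float $vectorWeight = 0.7,
    float $textWeight = 0.3
): array {
    $v = minMaxNormalise($vectorScores);
    $t = minMaxNormalise($textScores);
    $fused = [];
    foreach (array_keys($v + $t) as $id) {
        $fused[$id] = $vectorWeight * ($v[$id] ?? 0.0) + $textWeight * ($t[$id] ?? 0.0);
    }
    arsort($fused); // highest combined score first
    return $fused;
}
```

Unlike RRF, the result depends on the chosen weights, so this mode suits cases where you want one engine to dominate deliberately.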
Configuration
Both the HNSW and BM25 engines are fully configurable. Pass config objects to the VectorDatabase constructor.
Distance metrics
| Metric | Best for |
|---|---|
| `Distance::Cosine` | Text embeddings, normalised vectors |
| `Distance::Euclidean` | Raw, unnormalized vectors |
| `Distance::DotProduct` | Unit-normalized vectors (faster than Cosine) |
| `Distance::Manhattan` | Sparse vectors, robustness to outliers |
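For reference, the four measures can be written as plain PHP functions. These are standalone helpers with hypothetical names, not the library's `Distance` implementation, and they assume both vectors have the same length and keys.

```php
<?php
/** Dot product of two equal-length vectors. */
function dotProduct(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $x) {
        $sum += $x * $b[$i];
    }
    return $sum;
}

/** Cosine similarity; returns 0.0 if either vector has zero length. */
function cosineSimilarity(array $a, array $b): float
{
    $norm = sqrt(dotProduct($a, $a)) * sqrt(dotProduct($b, $b));
    return $norm > 0.0 ? dotProduct($a, $b) / $norm : 0.0;
}

/** Euclidean (L2) distance. */
function euclideanDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $x) {
        $sum += ($x - $b[$i]) ** 2;
    }
    return sqrt($sum);
}

/** Manhattan (L1) distance. */
function manhattanDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $x) {
        $sum += abs($x - $b[$i]);
    }
    return $sum;
}
```

For unit-length vectors the two norms in `cosineSimilarity()` are both 1, so the result equals `dotProduct()`; skipping the norm computation is what makes the dot product cheaper in that case.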
HNSW tuning cheat-sheet
| Goal | Knob |
|---|---|
| Better recall | Increase efSearch or efConstruction |
| Faster queries | Decrease efSearch |
| Less memory | Decrease M |
| Better graph on clustered data | Keep useHeuristic: true |
Persistence
PHPVector uses a folder-based persistence model. Each database lives in its own directory containing separate files for the HNSW graph, the BM25 index, and one file per document. This design has two key advantages:
- Low memory footprint on load: only the HNSW graph and BM25 index are loaded into memory. Individual document files (`docs/{n}.bin`) are read lazily, only for the documents that appear in search results.
- Low insert latency: document files are written to disk asynchronously in a forked child process (requires `ext-pcntl`), so `addDocument()` returns immediately.
Folder layout
Saving
Pass a path to the constructor to enable persistence. Each addDocument() call writes the document file to docs/ (asynchronously when ext-pcntl is available). Call save() once to flush the HNSW graph and BM25 index — it waits for any outstanding async writes before proceeding.
Loading
Use VectorDatabase::open() to load a previously saved folder. Only hnsw.bin and bm25.bin are read into memory; document files are loaded on demand after search.
Pass the same HNSWConfig (including the same distance metric) that was used when building the index — a RuntimeException is thrown on mismatch.
Custom configuration on open
Note: Only `efSearch` and `bm25Config`/`tokenizer` affect query-time behaviour and can differ from build time. `distance` and the graph parameters (`M`, `efConstruction`) are fixed at build time; `distance` is validated on `open()` and must match.
Incremental updates
You can add new documents to a database that was loaded from disk, then call save() again. The existing document files are left in place; only the new ones are written along with updated index files.
Typical workflow: build once, serve many
Multi-language stop words
Stop words are provided via `StopWordsProviderInterface`.
Example stop-words file content (Italian stop words): `e di a che il la`
Available providers:
- `EnglishStopWords`: English stop words (default)
- `ItalianStopWords`: Italian stop words
- `FileStopWords`: load from file
Deleting and updating documents
Deleted documents are soft-deleted from the HNSW graph (kept for connectivity but excluded from results) and fully removed from the BM25 index. Document files are deleted from disk immediately.
Metadata filtering
Filter search results by document metadata. Filters can be combined with any search method — vector, text, or hybrid.
Creating filters
Use the MetadataFilter value object. All eleven operators are supported:
Filtering search results
Pass filters to any search method. Multiple filters are ANDed together by default.
OR groups (nested arrays)
Wrap filters in a nested array to create OR groups. Filters at the top level are ANDed; filters inside a nested array are ORed.
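The AND/OR semantics described above can be modelled with a small standalone evaluator. In this sketch each filter is a plain closure standing in for a `MetadataFilter` object, and `matchesFilters()` is a hypothetical helper, not a library function.

```php
<?php
/**
 * Evaluate a filter list against one document's metadata.
 * Top-level entries are ANDed; an entry that is itself an array is an OR group.
 * Filters are closures (array $metadata): bool, used here as stand-ins.
 * An empty OR group matches nothing.
 */
function matchesFilters(array $metadata, array $filters): bool
{
    foreach ($filters as $filter) {
        if (is_array($filter)) {
            // OR group: at least one alternative must match.
            $anyMatch = false;
            foreach ($filter as $alternative) {
                if ($alternative($metadata)) {
                    $anyMatch = true;
                    break;
                }
            }
            if (!$anyMatch) {
                return false;
            }
        } elseif (!$filter($metadata)) {
            return false; // a top-level filter failed, so the AND fails
        }
    }
    return true;
}

// Hypothetical filters: status = 'published' AND (lang = 'php' OR lang = 'go').
$isPublished = fn (array $m): bool => ($m['status'] ?? null) === 'published';
$isPhp = fn (array $m): bool => ($m['lang'] ?? null) === 'php';
$isGo  = fn (array $m): bool => ($m['lang'] ?? null) === 'go';
$filters = [$isPublished, [$isPhp, $isGo]];
```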
Over-fetching for filtered queries
When filters are applied, the search may need to examine more candidates than k to find enough matching documents. By default, the search fetches k * 5 candidates, then filters. You can tune this:
Note: Filtered queries may return fewer than `k` results if not enough documents match.
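The over-fetch strategy can be illustrated with a standalone sketch. The `filteredSearch()` helper and its `$overFetch` parameter are hypothetical names; only the default multiplier of 5 comes from the description above.

```php
<?php
/**
 * Fetch k * overFetch candidates, drop those that fail the filter, keep the top k.
 * Illustrative only; names are assumptions, not the library API.
 *
 * @param callable $search  fn(int $limit): list<array>, candidates ordered best-first
 * @param callable $matches fn(array $candidate): bool
 * @return list<array> at most $k candidates, possibly fewer
 */
function filteredSearch(callable $search, callable $matches, int $k, int $overFetch = 5): array
{
    $candidates = $search($k * $overFetch);
    $kept = array_values(array_filter($candidates, $matches));
    return array_slice($kept, 0, $k);
}

// Hypothetical corpus of 20 candidates, already ordered best-first.
$corpus = array_map(fn (int $id): array => ['id' => $id], range(1, 20));
$search = fn (int $limit): array => array_slice($corpus, 0, $limit);
```

A larger multiplier examines more candidates and makes it likelier that `k` matches are found, at the cost of more work per query.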
Updating metadata
Update metadata on existing documents without re-indexing vectors or text:
The patchMetadata() method:
- Merges patch into existing metadata (existing keys preserved unless overwritten)
- Does NOT touch HNSW or BM25 indexes (fast, metadata-only operation)
- Persists immediately when database has a path configured
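The merge behaviour in the first bullet can be pictured with PHP's built-in `array_replace()`. This sketch assumes a shallow, key-by-key merge; whether `patchMetadata()` merges nested arrays recursively is not stated here, and `mergeMetadata()` is a hypothetical helper.

```php
<?php
/**
 * Shallow merge: keys in $patch overwrite, all other existing keys are kept.
 * Illustrative only; assumes a non-recursive merge.
 */
function mergeMetadata(array $existing, array $patch): array
{
    return array_replace($existing, $patch);
}

$existing = ['author' => 'alice', 'status' => 'draft'];
$patch    = ['status' => 'published', 'reviewed' => true];
$merged   = mergeMetadata($existing, $patch);
// $merged is ['author' => 'alice', 'status' => 'published', 'reviewed' => true]
```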
Metadata-only search
Query documents by metadata alone, without a vector or text query:
Note: Documents missing the `sortBy` key are placed at the end of results. All results have `score = 1.0` (no ranking).
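The ordering rule in the note can be sketched as follows. Documents are modelled as plain arrays with a `metadata` key, and `sortByMetadata()` is a hypothetical helper rather than the library's method.

```php
<?php
/**
 * Sort documents by a metadata key; documents lacking the key go last.
 * Every result gets score 1.0, since no ranking is involved.
 * Illustrative only; the array shape and names are assumptions.
 */
function sortByMetadata(array $docs, string $sortBy, bool $ascending = true): array
{
    $hasKey = fn (array $d): bool => array_key_exists($sortBy, $d['metadata']);
    $with = array_values(array_filter($docs, $hasKey));
    $without = array_values(array_filter($docs, fn (array $d): bool => !$hasKey($d)));

    usort($with, function (array $a, array $b) use ($sortBy, $ascending): int {
        $cmp = $a['metadata'][$sortBy] <=> $b['metadata'][$sortBy];
        return $ascending ? $cmp : -$cmp;
    });

    // Attach the constant score to each result.
    return array_map(
        fn (array $d): array => $d + ['score' => 1.0],
        array_merge($with, $without)
    );
}
```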
Strict type comparison
Metadata filtering uses strict type comparison (PHP ===). This means:
- String `'5'` does NOT match integer `5`
- Float `1.0` does NOT match integer `1`
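The difference between strict and loose comparison is plain PHP behaviour and can be checked directly. The `strictEquals()` wrapper is a hypothetical helper used only to make the comparison explicit.

```php
<?php
/** Strict comparison: both type and value must match. */
function strictEquals(mixed $metadataValue, mixed $filterValue): bool
{
    return $metadataValue === $filterValue;
}

var_dump(strictEquals('5', 5));  // bool(false): string vs integer
var_dump(strictEquals(1.0, 1));  // bool(false): float vs integer
var_dump(strictEquals(5, 5));    // bool(true)
var_dump('5' == 5);              // bool(true): loose comparison would match
```

If metadata comes from JSON or form input, cast values to the intended type before storing them so that filters match as expected.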
Custom tokenizer
Implement TokenizerInterface to plug in stemming, lemmatization, or any language-specific logic.
Benchmark
A VectorDBBench-style CLI benchmark lives in benchmark/. It measures index build throughput, serial QPS, P99 tail latency, Recall@k against brute-force ground truth, and persistence speed.
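Two of these metrics are easy to state in code. The helpers below are a standalone sketch with hypothetical names, not the benchmark's source; the percentile uses the nearest-rank method, which is an assumption about how P99 is computed.

```php
<?php
/**
 * Recall@k: fraction of the exact (brute-force) top-k IDs that the
 * approximate search also returned in its top k.
 */
function recallAtK(array $approxIds, array $exactIds, int $k): float
{
    $approx = array_slice($approxIds, 0, $k);
    $exact = array_slice($exactIds, 0, $k);
    if ($exact === []) {
        return 0.0;
    }
    return count(array_intersect($approx, $exact)) / count($exact);
}

/** Nearest-rank percentile, e.g. $p = 99.0 for P99 latency. */
function percentile(array $samples, float $p): float
{
    if ($samples === []) {
        throw new InvalidArgumentException('At least one sample is required.');
    }
    sort($samples);
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[max(0, $rank - 1)];
}
```

With 100 latency samples, P99 under this method is the 99th smallest value, so a single slow outlier does not affect it but two do.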
Available scenarios
| Key | Vectors | Dims | Notes |
|---|---|---|---|
| `xs` | 1,000 | 128 | Quick smoke test |
| `small` | 10,000 | 128 | SIFT-small scale |
| `medium` | 50,000 | 128 | SIFT-medium scale |
| `large` | 100,000 | 128 | Requires ~512 MB RAM |
| `highdim` | 10,000 | 768 | Text-embedding scale (Cohere-style) |
The report is printed as Markdown to stdout (or a file via --output). Progress messages go to stderr so piping works cleanly: php benchmark/benchmark.php > report.md.
Running the tests
Copyright
(C) 2026 by Enrico Zimuel