Download the PHP package textualization/semantic-search without Composer
On this page you can find all versions of the php package textualization/semantic-search. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Package semantic-search
Short Description Semantic search using Ropherta embeddings and SQLite3 Vector extension.
License MIT
Information about the package semantic-search
PHP Semantic Search Classes
These classes contain interfaces for semantic search and implementations of them using Ropherta as the embedder and SQLite3 Vector Search as the vector database. A keyword search using SQLite3 FTS4 and BM25 is also provided.
To populate the vector database, the Ingester class contains a recursive chunker similar to the one available in LangChain, but with some speed-ups plus the ability to refer back to offsets in the source document.
More advanced uses include using vector search as a reranker and using HyDE to obtain symmetric embeddings.
Demo source documents and indexes covering over 35,000 documents from StackOverflow and the PHP documentation are available for download.
Document format for ingestion
The ingestion component takes documents in JSONL format, with one well-formed JSON document per line.
The text should be plain text encoded as UTF-8 (per the JSON spec). The system will split it into chunks on ingestion.
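As a rough illustration of this format, the sketch below writes and reads a JSONL string. The key names (title, text, url, license) are an assumption that mirrors the chunker output keys listed under "Chunking files"; the helper names toJsonl and fromJsonl are hypothetical and not part of the package.

```php
<?php
// Sketch only: key names and helper names are assumptions, not the package API.

/** Encode each document as one JSON object per line. */
function toJsonl(array $docs): string
{
    $lines = [];
    foreach ($docs as $doc) {
        // Newlines inside strings are escaped as \n by json_encode,
        // so every document stays on a single line.
        $lines[] = json_encode($doc, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR);
    }
    return implode("\n", $lines) . "\n";
}

/** Decode a JSONL string back into an array of documents, skipping blank lines. */
function fromJsonl(string $jsonl): array
{
    $docs = [];
    foreach (explode("\n", $jsonl) as $line) {
        if (trim($line) === '') {
            continue;
        }
        $docs[] = json_decode($line, true, 512, JSON_THROW_ON_ERROR);
    }
    return $docs;
}

$docs = [[
    'title' => 'array_map',
    'text' => "Applies a callback.\nSecond line: café.",
    'url' => 'https://example.com/array_map',
    'license' => 'CC-BY',
]];
$jsonl = toJsonl($docs);
```

JSON_THROW_ON_ERROR makes malformed lines fail loudly instead of silently producing null, which is usually what you want before a long indexing run.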
Vector Index dependencies
You need sqlite-vss installed (both vector0.so and vss0.so).
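A quick way to verify the installation is to check that both files sit in the directory PHP loads SQLite extensions from. The sketch below assumes that directory is the one named by the sqlite3.extension_dir setting (the setting SQLite3::loadExtension() relies on); the helper missingVssFiles is hypothetical.

```php
<?php
// Sketch only: the helper name is hypothetical; file names follow the text above.

/** Return the names of the required sqlite-vss files missing from $dir. */
function missingVssFiles(string $dir, array $required = ['vector0.so', 'vss0.so']): array
{
    $missing = [];
    foreach ($required as $file) {
        if (!is_file(rtrim($dir, '/') . '/' . $file)) {
            $missing[] = $file;
        }
    }
    return $missing;
}

// ini_get() returns false when the sqlite3 extension itself is not loaded.
$dir = (string) ini_get('sqlite3.extension_dir');
$missing = $dir === '' ? ['(sqlite3.extension_dir is not set)'] : missingVssFiles($dir);
echo $missing ? 'Missing: ' . implode(', ', $missing) . "\n" : "sqlite-vss files found\n";
```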
php.ini
Set
sqlite3-vss dependencies
Ropherta dependencies
Install the ONNX framework:
If you want to use the multilingual model, do not download this model; follow the instructions in the next section instead. For the English model, download the Sentence RoBERTa ONNX model (this takes a while, the model is 362 MB in size):
Multilingual model
Download the SentencePiece library:
Download the XLM Tokenizer SentencePiece BPE model:
Download the Multilingual-E5-small model (471 MB in size):
Please note: if you have already downloaded the English model, you will need to delete it first. Currently only one model can be installed at a time; this limitation will be lifted in future versions.
Example
Install the SQLite3 extension and Ropherta per the instructions above.
Download the data and indexes from http://textualization.com/download/phpsemsearch_0.1.tar.bz2 (192 MB).
Decompress and move the files vector.db and keyword.db to the root folder.
Keyword search
The top documents seem pretty apt.
Vector search
The results are not as good as keyword search, but they are quite different. Vector search with a symmetric embedder (like Sentence RoBERTa) works better for finding similar documents than for matching queries against documents. For query-against-document search, HyDE (presented below) is a better fit. Alternatively, asymmetric embedders (like InstructOR) can be ported to PHP through ONNX.
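To make the idea of vector search concrete, here is a brute-force sketch that ranks documents by cosine similarity to a query vector. It is a stand-in for the real index, not how sqlite-vss works internally: the toy 3-dimensional vectors, the choice of cosine similarity, and the function names are all assumptions.

```php
<?php
// Sketch only: brute-force ranking over toy vectors, not the sqlite-vss index.

/** Cosine similarity of two equal-length numeric vectors. */
function cosine(array $a, array $b): float
{
    $dot = 0.0;
    $na = 0.0;
    $nb = 0.0;
    foreach ($a as $i => $x) {
        $dot += $x * $b[$i];
        $na += $x * $x;
        $nb += $b[$i] * $b[$i];
    }
    return ($na > 0 && $nb > 0) ? $dot / sqrt($na * $nb) : 0.0;
}

/** Return the $k best document ids mapped to their similarity, best first. */
function nearest(array $query, array $docVectors, int $k): array
{
    $scores = [];
    foreach ($docVectors as $id => $vec) {
        $scores[$id] = cosine($query, $vec);
    }
    arsort($scores); // sort by score, descending, keeping the ids as keys
    return array_slice($scores, 0, $k, true);
}

$docVectors = [
    'doc-a' => [1.0, 0.0, 0.0],
    'doc-b' => [0.7, 0.7, 0.0],
    'doc-c' => [0.0, 0.0, 1.0],
];
$top = nearest([0.9, 0.1, 0.0], $docVectors, 2);
```

With a symmetric embedder, the query vector and the document vectors come from the same model, which is why a short question and a long answer can end up far apart even when they are about the same topic.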
Reranked
This combines keyword and vector search, which should be more precise.
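One common way to combine the two is to let keyword search select candidates and then reorder them by vector similarity. The sketch below shows that pattern under stated assumptions: the ids and scores are made up, the rerank function is hypothetical, and the package's own reranker may combine the signals differently.

```php
<?php
// Sketch only: made-up ids and scores; not the package's reranker.

/**
 * Reorder keyword hits by a vector similarity score.
 *
 * $keywordHits is a list of ids in keyword-rank order;
 * $vectorScore is a callable taking an id and returning a float.
 */
function rerank(array $keywordHits, callable $vectorScore, int $k): array
{
    $scored = [];
    foreach ($keywordHits as $rank => $id) {
        $scored[] = ['id' => $id, 'score' => $vectorScore($id), 'rank' => $rank];
    }
    // Higher similarity first; ties fall back to the original keyword order.
    usort($scored, fn ($a, $b) => [$b['score'], $a['rank']] <=> [$a['score'], $b['rank']]);
    return array_slice(array_column($scored, 'id'), 0, $k);
}

$similarity = ['doc-1' => 0.42, 'doc-2' => 0.91, 'doc-3' => 0.42];
$result = rerank(['doc-1', 'doc-2', 'doc-3'], fn ($id) => $similarity[$id] ?? 0.0, 2);
```

Because only the keyword candidates are scored, the vector step stays cheap even when the full collection is large.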
HyDE
This functionality requires an OpenAI API key, which can be passed through a file, an environment variable, or directly in the constructor. Note that calling OpenAI is very slow. In this example, the ChatGPT text pushes the vector search in directions that are not necessarily ideal; a shorter text might have been better.
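The three ways of supplying the key can be pictured as a simple fallback chain. The sketch below is only an illustration: the function name, the environment variable name, the key file path, and the precedence order (explicit value, then environment, then file) are all assumptions rather than the package's documented behaviour.

```php
<?php
// Sketch only: names, paths and the precedence order are hypothetical.

/** Resolve an API key: explicit argument first, then environment, then file. */
function resolveApiKey(?string $explicit, string $envVar, string $keyFile): ?string
{
    if ($explicit !== null && $explicit !== '') {
        return $explicit;
    }
    $fromEnv = getenv($envVar);
    if ($fromEnv !== false && $fromEnv !== '') {
        return $fromEnv;
    }
    if (is_file($keyFile) && is_readable($keyFile)) {
        // Trim the trailing newline most editors add to the key file.
        $fromFile = trim((string) file_get_contents($keyFile));
        return $fromFile !== '' ? $fromFile : null;
    }
    return null;
}

$key = resolveApiKey(null, 'OPENAI_API_KEY', __DIR__ . '/openai.key');
```

Keeping the key in an environment variable or a file outside version control avoids hard-coding it in scripts.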
Fetching documents
Given a URL and chunk number, the fetch script does the trick:
Note that the keyword index is unchunked (all chunk numbers are 0).
Indexing files
The documents are provided as a JSONL file in the format described at the top of this document.
A sample file is available for download.
To use a reranked index, create the vector and keyword indexes separately.
The vector indexing takes about a day and consumes a significant amount of RAM at the moment.
Chunking files
The recursive chunker can be used standalone:
The output JSONL documents have keys:
- Title (title)
- Text (text)
- URL (url)
- Chunk Number (chunk_num)
- Offset Start (offset_start)
- Offset End (offset_end)
- License (license)
Other tokenizers are possible; see the code in the scripts folder. Passing the string null (n-u-l-l) measures the chunk size in characters instead of tokens.
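To show how recursive chunking can keep track of source offsets, here is a minimal sketch. It is not the package's Ingester: the function chunkText, the separator list, and the choice to measure sizes and offsets in bytes (equal to characters only for ASCII text) are assumptions made for illustration.

```php
<?php
// Sketch only: not the package's chunker. Sizes and offsets are in bytes.

/**
 * Split $text into chunks of at most $max bytes, trying coarse separators
 * first and finer ones only for groups that are still too large.
 * A piece with no separator left is emitted as-is, even if oversized.
 */
function chunkText(string $text, int $max, array $seps = ["\n\n", "\n", " "], int $base = 0): array
{
    if (strlen($text) <= $max || $seps === []) {
        return $text === '' ? [] : [[
            'text' => $text,
            'offset_start' => $base,
            'offset_end' => $base + strlen($text),
        ]];
    }
    $sep = array_shift($seps);
    $chunks = [];
    $start = 0; // where the current group of pieces begins
    $end = 0;   // where the current group ends so far
    $pos = 0;   // offset of the piece being examined
    foreach (explode($sep, $text) as $piece) {
        $pieceEnd = $pos + strlen($piece);
        if ($end > $start && $pieceEnd - $start > $max) {
            // Adding this piece would overflow: emit the group, recursing
            // with the finer separators in case it is still too large.
            $group = substr($text, $start, $end - $start);
            $chunks = array_merge($chunks, chunkText($group, $max, $seps, $base + $start));
            $start = $pos;
        }
        $end = $pieceEnd;
        $pos = $pieceEnd + strlen($sep);
    }
    $group = substr($text, $start, $end - $start);
    return array_merge($chunks, chunkText($group, $max, $seps, $base + $start));
}

$text = "alpha beta\n\ngamma delta epsilon";
$chunks = chunkText($text, 12);
```

Each chunk can be mapped back to the source with substr($text, $chunk['offset_start'], $chunk['offset_end'] - $chunk['offset_start']), and the position of a chunk in the returned array can serve as its chunk number.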
Computing embeddings with custom ONNX model
HyDE-rating files
To expand answers using a completion service (like OpenAI ChatGPT), use:
It populates the field completion from the title in the JSON object.
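The transformation itself is easy to picture: read each JSONL line, generate a completion from its title, and write the object back with the extra field. The sketch below uses a fake completer so it runs offline; the function addCompletions and the callable interface are hypothetical and do not reflect the package's script.

```php
<?php
// Sketch only: $complete stands in for a real completion service call.

/** Add a "completion" field, generated from "title", to every JSONL line. */
function addCompletions(string $jsonl, callable $complete): string
{
    $out = '';
    foreach (explode("\n", $jsonl) as $line) {
        if (trim($line) === '') {
            continue; // tolerate blank lines and the trailing newline
        }
        $doc = json_decode($line, true, 512, JSON_THROW_ON_ERROR);
        $doc['completion'] = $complete($doc['title']);
        $out .= json_encode($doc, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR) . "\n";
    }
    return $out;
}

// A fake completer keeps the example offline; swap in a real API client as needed.
$fake = fn (string $title): string => "A hypothetical answer about {$title}.";
$input = '{"title":"How do I reverse an array in PHP?","url":"https://example.com/q/1"}' . "\n";
$output = addCompletions($input, $fake);
```

Passing the completer as a callable makes it simple to cache or rate-limit the slow API calls without touching the file handling.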
Sponsors
We thank our sponsor: