Download the PHP package textualization/semantic-search without Composer


Information about the package semantic-search

PHP Semantic Search Classes

These classes contain interfaces for semantic search and implementations of them that use Ropherta as the embedder and SQLite3 Vector Search as the vector database. A keyword search using SQLite3 FTS4 and BM25 is also provided.

To populate the vector database, the Ingester class contains a recursive chunker similar to the one available in LangChain, but with some speed-ups plus the ability to refer back to offsets in the source document.

More advanced uses include using vector search as a reranker and using HyDE to obtain symmetric embeddings.

Demo source documents and indexes covering over 35,000 documents from StackOverflow and the PHP documentation are available for download.

Document format for ingestion

The ingestion component takes documents in JSONL format, with one well-formed JSON document per line.

The text should be plain text, UTF-8 encoded (per the JSON spec). The system splits it into chunks on ingestion.
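As a rough illustration of the format, the sketch below parses JSONL content line by line and rejects malformed lines. The key names `url`, `title` and `text` are assumptions made for this example, and `parseJsonl` is a hypothetical helper, not part of the package; check the sample file for the actual schema.

```php
<?php
// Minimal JSONL reader: one well-formed JSON document per line.
// The key name "text" is assumed for illustration only.
function parseJsonl(string $jsonl): array
{
    $docs = [];
    foreach (preg_split('/\r?\n/', $jsonl) as $lineNo => $line) {
        if (trim($line) === '') {
            continue; // tolerate blank lines, e.g. a trailing newline
        }
        try {
            // JSON_THROW_ON_ERROR also rejects strings that are not valid UTF-8
            $doc = json_decode($line, true, 512, JSON_THROW_ON_ERROR);
        } catch (JsonException $e) {
            throw new InvalidArgumentException(
                'Line ' . ($lineNo + 1) . ' is not well-formed JSON',
                0,
                $e
            );
        }
        if (!is_array($doc) || !isset($doc['text']) || !is_string($doc['text'])) {
            throw new InvalidArgumentException(
                'Line ' . ($lineNo + 1) . ' has no string "text" field'
            );
        }
        $docs[] = $doc;
    }
    return $docs;
}
```

Validating each line on its own means one bad record is reported with its line number instead of invalidating the whole file.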

Vector Index dependencies

You need sqlite-vss installed (both vector0.so and vss0.so).

php.ini

Set

sqlite3-vss dependencies

Ropherta dependencies

Install the ONNX framework:

If you want to use the multilingual model, do not download this model; follow the instructions in the next section instead. For the English model, download the Sentence RoBERTa ONNX model (this takes a while, as the model is 362 MB in size):

Multilingual model

Download the SentencePiece library:

Download the XLM Tokenizer SentencePiece BPE model:

Download the Multilingual-E5-small model (471 MB in size):

Please note: if you have already downloaded the English model, you'll need to delete it first. Currently only one model can be installed at a time; this limitation will be lifted in future versions.

Example

Install the SQLite3 extension and Ropherta per the instructions above.

Download the data and indexes from http://textualization.com/download/phpsemsearch_0.1.tar.bz2 (192 MB).

Decompress and move the files vector.db and keyword.db to the root folder.

Keyword search

The top documents seem pretty apt.

Vector search

The results are not as good as with keyword search, but they are really different. Vector search using a symmetric embedder (like Sentence RoBERTa) works better when searching for similar documents than when matching queries against documents. For such cases, HyDE (presented below) is better. Alternatively, asymmetric embedders (like InstructOR) can be ported to PHP through ONNX.

Reranked

Combines keyword and vector search. It should be more precise.
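The idea of using vector search as a reranker can be sketched as follows: take the candidates returned by keyword search and reorder them by cosine similarity to the query embedding. This is a minimal sketch under stated assumptions, not the package's reranking code; the functions `cosine` and `rerank` and the candidate layout (`url`, `vector`) are hypothetical.

```php
<?php
// Cosine similarity between two equal-length float vectors.
function cosine(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $x) {
        $dot += $x * $b[$i];
        $normA += $x * $x;
        $normB += $b[$i] * $b[$i];
    }
    return ($normA > 0 && $normB > 0) ? $dot / sqrt($normA * $normB) : 0.0;
}

// Reorder keyword-search candidates by similarity to the query embedding.
// Each candidate is assumed to carry its own embedding under 'vector'.
function rerank(array $queryVec, array $candidates, int $topK): array
{
    foreach ($candidates as &$candidate) {
        $candidate['score'] = cosine($queryVec, $candidate['vector']);
    }
    unset($candidate); // break the reference left by the loop
    usort($candidates, fn($x, $y) => $y['score'] <=> $x['score']);
    return array_slice($candidates, 0, $topK);
}
```

Keyword search keeps the candidate set small and lexically relevant, so the more expensive similarity computation only runs over a handful of documents.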

HyDE

Using this functionality requires an OpenAI API key. It can be passed through a file, an environment variable or directly in the constructor. Note that calling OpenAI is very slow. In this example the ChatGPT text steers the vector search in some directions that are not necessarily ideal. A shorter text might have been better.
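Conceptually, HyDE embeds a generated hypothetical answer instead of the query itself, so that the vector search compares document-like text with documents. The sketch below shows only that control flow; `hydeSearch` is a hypothetical function, and the three callables are stand-ins for the completion service, the embedder and the vector index rather than the package's actual classes.

```php
<?php
// HyDE control flow with injected dependencies (all stand-ins):
//   $complete: fn(string $query): string           generates a hypothetical answer
//   $embed:    fn(string $text): array             returns an embedding vector
//   $search:   fn(array $vector, int $topK): array queries the vector index
function hydeSearch(
    string $query,
    callable $complete,
    callable $embed,
    callable $search,
    int $topK = 5
): array {
    // 1. Ask the completion service for a hypothetical document answering the query.
    $hypothetical = $complete($query);
    // 2. Embed the hypothetical document, not the original query.
    $vector = $embed($hypothetical);
    // 3. Search the vector index with that embedding.
    return $search($vector, $topK);
}
```

Passing the services in as callables keeps the slow remote call easy to replace with a cache or a stub during testing.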

Fetching documents

Given a URL and chunk number, the fetch script does the trick:

Note that the keyword index is unchunked (all chunk numbers are 0).

Indexing files

The documents are provided as a JSONL file in the format described at the top of this document.

A sample file is available for download.

To use a reranked index, create the vector and keyword indexes separately.

Vector indexing currently takes about a day and consumes a significant amount of RAM.

Chunking files

The recursive chunker can be used standalone:

The output JSONL documents have keys:

Other tokenizers are possible; see the code in the scripts folder. Using the string null (n-u-l-l) measures the size in characters instead of tokens.
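To illustrate the general idea of recursive chunking with offsets back into the source, here is a minimal sketch. It is not the package's chunker: `chunkRecursive` is a hypothetical function, the separator list is an assumption, and sizes and offsets are measured in bytes rather than tokens.

```php
<?php
// Split $text into chunks of at most $maxLen bytes, trying coarse separators
// first and finer ones only for pieces that are still too long.
// Each chunk records its byte offset in the original text.
function chunkRecursive(
    string $text,
    int $maxLen,
    array $seps = ["\n\n", "\n", " "],
    int $base = 0
): array {
    if (strlen($text) <= $maxLen) {
        // Small enough: emit as one chunk, dropping whitespace-only pieces.
        return trim($text) === '' ? [] : [['offset' => $base, 'text' => $text]];
    }
    if ($seps === []) {
        // No separators left: hard cut (may split a multi-byte character).
        $out = [];
        for ($i = 0; $i < strlen($text); $i += $maxLen) {
            $out[] = ['offset' => $base + $i, 'text' => substr($text, $i, $maxLen)];
        }
        return $out;
    }
    $sep = array_shift($seps);
    $pieces = explode($sep, $text);
    $last = count($pieces) - 1;
    $out = [];
    $bufStart = 0; // start of the current buffer within $text
    $bufLen = 0;   // length of the current buffer
    $pos = 0;      // running position within $text
    foreach ($pieces as $i => $piece) {
        // Keep the separator attached to the piece so offsets stay exact.
        $pieceLen = strlen($piece) + ($i < $last ? strlen($sep) : 0);
        if ($bufLen > 0 && $bufLen + $pieceLen > $maxLen) {
            $out = array_merge($out, chunkRecursive(
                substr($text, $bufStart, $bufLen), $maxLen, $seps, $base + $bufStart
            ));
            $bufStart = $pos;
            $bufLen = 0;
        }
        $bufLen += $pieceLen;
        $pos += $pieceLen;
    }
    if ($bufLen > 0) {
        $out = array_merge($out, chunkRecursive(
            substr($text, $bufStart, $bufLen), $maxLen, $seps, $base + $bufStart
        ));
    }
    return $out;
}
```

Because every chunk carries its offset, `substr($source, $chunk['offset'], strlen($chunk['text']))` recovers the chunk from the source document, which is what makes referring back to the original text possible.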

Computing embeddings with custom ONNX model

HyDE-rating files

To expand answers using a completion service (like OpenAI ChatGPT), use:

It populates the field completion from the title in the JSON object.


All versions of semantic-search with dependencies

Requires:

  • textualization/sentence-transphormers, version ^0.0.9
  • orhanerday/open-ai, version ^4.8