Download the PHP package oneofftech/parse-client without Composer
On this page you can find all versions of the php package oneofftech/parse-client. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Package parse-client
Short Description Parse PDF document keeping the structure.
License MIT
Homepage https://github.com/oneofftech/oneofftech-parse-client
Information about the package parse-client
OneOffTech Parse client
Parse client is a PHP library for interacting with the OneOffTech Parse service. OneOffTech Parse extracts text from PDF files while preserving the structure of the document, which improves interaction with Large Language Models (LLMs).
OneOffTech Parse is based on the Parxy extractor. The client can also connect to self-hosted instances of Parxy.
[!NOTE] The Parse client package is under development and is not ready for production use.
Installation
You can install the package via Composer by running `composer require oneofftech/parse-client`.
Usage
The Parse client can connect to self-hosted instances of the Parxy extractor service or to the cloud-hosted OneOffTech Parse service.
Use with a self-hosted instance
A running instance of Parxy is required. Once it is running, instantiate the connector by passing the URL on which the extractor service is listening.
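A minimal sketch of the self-hosted setup described above. The class name `ParseConnector`, its namespace, the constructor argument order (token, then base URL) and the `parse()` method are recalled from the package and are assumptions here; check them against the installed version. The helper name `parseWithSelfHostedInstance`, the placeholder token, the port and the example URLs are illustrative only.

```php
<?php

use OneOffTech\Parse\Client\Connectors\ParseConnector;

/**
 * Parse a PDF through a self-hosted Parxy instance.
 * Hypothetical helper; the connector API used inside is an assumption.
 */
function parseWithSelfHostedInstance(string $baseUrl, string $pdfUrl): mixed
{
    // Assumed signature: new ParseConnector(string $token, string $baseUrl).
    // 'token' is a placeholder value.
    $client = new ParseConnector('token', $baseUrl);

    // The PDF URL must be reachable without authentication.
    return $client->parse($pdfUrl);
}

// Usage, once Composer's autoloader is loaded and Parxy is listening locally:
// $document = parseWithSelfHostedInstance('http://localhost:5000', 'https://example.com/paper.pdf');
```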
[!NOTE]
- The URL of the document must be accessible without authentication.
- Documents are downloaded only for the duration of processing; the file is deleted immediately afterwards.
Use the cloud-hosted service
[!IMPORTANT] The cloud-hosted service is currently in private beta. Drop us a message to request access.
Go to parse.oneofftech.de and obtain an access token. Then instantiate the client with the token and provide the URL of a PDF document.
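A hedged sketch of the cloud-hosted flow. It assumes that `ParseConnector` (class name and namespace recalled from the package, not confirmed here) accepts the access token as its first constructor argument and targets the hosted service when no base URL is given. The helper name `parseWithCloudService` and the `PARSE_ACCESS_TOKEN` environment variable are hypothetical.

```php
<?php

use OneOffTech\Parse\Client\Connectors\ParseConnector;

/**
 * Parse a PDF through the cloud-hosted Parse service.
 * Hypothetical helper; the connector API used inside is an assumption.
 */
function parseWithCloudService(string $accessToken, string $pdfUrl): mixed
{
    // Assumption: with only a token, the connector uses the hosted service URL.
    $client = new ParseConnector($accessToken);

    // The PDF URL must be reachable without authentication.
    return $client->parse($pdfUrl);
}

// Usage, once Composer's autoloader is loaded:
// $token = (string) getenv('PARSE_ACCESS_TOKEN'); // hypothetical variable name
// $document = parseWithCloudService($token, 'https://example.com/paper.pdf');
```

Reading the token from the environment keeps it out of source control.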
[!NOTE]
- The URL of the document must be accessible without authentication.
- Documents are downloaded only for the duration of processing; the file is deleted immediately afterwards.
Specify the preferred extraction method
The Parse service supports different processors: `pymupdf`, `pdfact`, `unstructured` and `llamaparse`. You can specify the preferred processor for each request.
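Selecting a processor per request might look like the sketch below. The `ParseOption` class, the `DocumentProcessor::PYMUPDF` enum case and the named `url`/`options` parameters are recalled from the package and should be treated as assumptions until checked against the installed version. The helper name `parseWithPyMuPdf` is hypothetical.

```php
<?php

use OneOffTech\Parse\Client\Connectors\ParseConnector;
use OneOffTech\Parse\Client\DocumentProcessor;
use OneOffTech\Parse\Client\ParseOption;

/**
 * Request extraction with the pymupdf processor for a single call.
 * Hypothetical helper; the option and enum names are assumptions.
 */
function parseWithPyMuPdf(ParseConnector $client, string $pdfUrl): mixed
{
    return $client->parse(
        url: $pdfUrl,
        // The processor is chosen per request, not per client.
        options: new ParseOption(DocumentProcessor::PYMUPDF),
    );
}
```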
PDFAct vs PyMuPDF
PDFAct recovers more of the document's structure than PyMuPDF: besides text and pagination, it identifies headings, text styles, and page headers and footers. Evaluate which extraction method best suits your application. The table below compares the two.
Feature | PDFAct | PyMuPDF |
---|---|---|
Text extraction | :white_check_mark: | :white_check_mark: |
Pagination | :white_check_mark: | :white_check_mark: |
Headings identification | :white_check_mark: | - |
Text styles (e.g. bold or italic) | :white_check_mark: | - |
Page header | :white_check_mark: | - |
Page footer | :white_check_mark: | - |
Document structure
Parse is designed to preserve the document's structure, so the content is returned as a hierarchy.
For a more in-depth explanation of the structure see Parse Document Model.
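To illustrate what working with a hierarchical result can look like, here is a self-contained sketch that walks a nested tree and collects its text in reading order. The node shape used here (`category`, `text`, `content` keys, with document, page and block levels) is hypothetical and chosen only for illustration; the actual fields are defined by the Parse Document Model.

```php
<?php

declare(strict_types=1);

/**
 * Collect the text of every node in a nested document tree, depth-first.
 * The node shape ('category', 'text', 'content') is hypothetical.
 *
 * @param array<string, mixed> $node
 * @return list<string>
 */
function collectText(array $node): array
{
    $texts = [];

    // A node may carry its own text.
    if (isset($node['text']) && is_string($node['text'])) {
        $texts[] = $node['text'];
    }

    // Recurse into child nodes, preserving their order.
    foreach ($node['content'] ?? [] as $child) {
        $texts = array_merge($texts, collectText($child));
    }

    return $texts;
}

// Hypothetical document: two pages, each holding text blocks.
$document = [
    'category' => 'doc',
    'content' => [
        [
            'category' => 'page',
            'content' => [
                ['category' => 'heading', 'text' => 'Introduction'],
                ['category' => 'body', 'text' => 'First paragraph.'],
            ],
        ],
        [
            'category' => 'page',
            'content' => [
                ['category' => 'body', 'text' => 'Second page text.'],
            ],
        ],
    ],
];

echo implode(PHP_EOL, collectText($document)), PHP_EOL;
```

Because each level nests its children under the same key, one recursive function handles documents, pages and blocks alike.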
Testing
Parse client is tested using Pest. Tests run on each commit and pull request.
To execute the test suite, run Pest from the project root, for example with `vendor/bin/pest`.
Changelog
Please see CHANGELOG for more information on what has changed recently.
Contributing
Thank you for considering contributing to the Parse client! The contribution guide can be found in the CONTRIBUTING.md file.
Security Vulnerabilities
Please review our security policy on how to report security vulnerabilities.
Credits
- OneOffTech
- All Contributors
Supporters
The project is provided and supported by OneOff-Tech (UG).
License
The MIT License (MIT). Please see License File for more information.