Download the PHP package content-extract/content-processor without Composer
On this page you can find all versions of the php package content-extract/content-processor. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Package: content-processor
Short description: Robust PHP library for batch document processing. Extracts content from PDFs/text and generates structured JSON according to user-defined schemas. Now with semantic structuring, OCR support for scanned PDFs, text normalization, and alias-driven field matching. Production-ready, secure, zero unnecessary dependencies.
License: MIT
Homepage: https://github.com/saul9809/content_extract-library
Information about the package content-processor
Content Processor
Production-ready PHP library for batch document processing with intelligent content extraction and structuring.
Framework-agnostic, scalable, and optimized for real-world document pipelines from day one.
Purpose
Process multiple documents (PDFs, text files, images, etc.), extract their content, and convert it into configurable JSON structures ready for bulk loading into databases or services.
Quick Example
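Installation
Install with Composer by running composer require content-extract/content-processor, or add the package to the require section of your composer.json.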
Project Structure
Quick Start
1. Define Your Schema
2. Configure the Processor
3. Consume Results
Testing
Run Examples
Full Test Suite
Code Quality
Available Interfaces
ExtractorInterface
StructurerInterface
SchemaInterface
Processor Options
Implemented Features (Blocks 1-6)
Block 1: Core
- Framework-agnostic design with clean interfaces
- Extractor/Structurer pattern
- JSON schema validation
- Batch processing
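The Extractor/Structurer split and per-document batch handling listed above can be sketched as follows. This is a minimal illustration under stated assumptions: SketchExtractor, SketchStructurer, PlainTextExtractor, FirstLineStructurer, and processBatch are hypothetical names invented here, and the library's real ExtractorInterface and StructurerInterface may have different signatures.

```php
<?php
// Hypothetical interfaces for illustration only; they are NOT the library's
// ExtractorInterface / StructurerInterface.
interface SketchExtractor
{
    public function extract(string $path): string;
}

interface SketchStructurer
{
    /** @return array<string, mixed> */
    public function structure(string $text): array;
}

// Reads a plain-text file; a PDF extractor would implement the same interface.
final class PlainTextExtractor implements SketchExtractor
{
    public function extract(string $path): string
    {
        $text = @file_get_contents($path);
        if ($text === false) {
            throw new RuntimeException("Cannot read $path");
        }
        return $text;
    }
}

// Toy structurer: reports the first line and the number of lines.
final class FirstLineStructurer implements SketchStructurer
{
    public function structure(string $text): array
    {
        $lines = preg_split('/\R/', trim($text));
        return ['title' => $lines[0], 'line_count' => count($lines)];
    }
}

/**
 * @param string[] $paths
 * @return array{results: array<string, array>, errors: array<string, string>}
 */
function processBatch(array $paths, SketchExtractor $extractor, SketchStructurer $structurer): array
{
    $results = [];
    $errors = [];
    foreach ($paths as $path) {
        try {
            $results[$path] = $structurer->structure($extractor->extract($path));
        } catch (Throwable $e) {
            // One unreadable document should not abort the whole batch.
            $errors[$path] = $e->getMessage();
        }
    }
    return ['results' => $results, 'errors' => $errors];
}
```

Keeping extraction and structuring behind separate interfaces means a new input format only needs a new extractor, while the structuring step stays unchanged.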
Block 2: PDF Support
- PdfTextExtractor with smalot/pdfparser
- Batch processing with multiple PDFs
- Robust error handling
Block 3: Semantic Structuring
- SchemaAwareStructurer for intelligent extraction
- Field aliases for semantic guidance
- Text normalization and segmentation
- Advanced warning system
- Type conversion and validation
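To make alias-driven matching, normalization, warnings, and type conversion concrete, here is a minimal sketch. The schema shape (a type plus a list of aliases per field), the "Label: value" line convention, and the function names normalizeLines and structureByAliases are all assumptions made for illustration; they do not describe how SchemaAwareStructurer works internally.

```php
<?php
// Hypothetical sketch: schema shape and function names are assumptions,
// not the library's SchemaAwareStructurer API.

/** Collapse runs of spaces/tabs, trim each line, and drop empty lines. */
function normalizeLines(string $text): array
{
    $lines = preg_split('/\R/', $text);
    $lines = array_map(fn (string $l): string => trim(preg_replace('/[ \t]+/', ' ', $l)), $lines);
    return array_values(array_filter($lines, fn (string $l): bool => $l !== ''));
}

/**
 * @param array<string, array{type: string, aliases: string[]}> $schema
 * @return array{data: array<string, mixed>, warnings: string[]}
 */
function structureByAliases(string $text, array $schema): array
{
    $data = [];
    $warnings = [];
    $lines = normalizeLines($text);

    foreach ($schema as $field => $spec) {
        // The field name itself and every alias are accepted as labels.
        $labels = array_map('strtolower', array_merge([$field], $spec['aliases']));
        $raw = null;
        foreach ($lines as $line) {
            if (!str_contains($line, ':')) {
                continue; // only "Label: value" lines are candidates
            }
            [$label, $value] = array_map('trim', explode(':', $line, 2));
            if (in_array(strtolower($label), $labels, true)) {
                $raw = $value;
                break;
            }
        }
        if ($raw === null) {
            $data[$field] = null;
            $warnings[] = "No match for field '$field'";
            continue;
        }
        if ($spec['type'] === 'int' || $spec['type'] === 'float') {
            if (!is_numeric($raw)) {
                $data[$field] = null;
                $warnings[] = "Field '$field' is not numeric: '$raw'";
                continue;
            }
            $data[$field] = $spec['type'] === 'int' ? (int) $raw : (float) $raw;
        } else {
            $data[$field] = $raw;
        }
    }
    return ['data' => $data, 'warnings' => $warnings];
}
```

With a schema whose name field lists "Full Name" as an alias, a document line such as "Full Name: Ada Lovelace" fills name, while a field with no matching label is set to null and reported as a warning.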
Block 4: Final Result API
- Unified FinalResult object
- Error and warning normalization
- Summary with statistics
- JSON export and serialization
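A unified result object with a statistics summary and JSON export might look roughly like the sketch below. BatchResultSketch and its methods are hypothetical; the FinalResult API itself is not shown in this document, so nothing here should be read as its actual interface.

```php
<?php
// Hypothetical result object; every name here is an assumption and not the
// library's FinalResult class.
final class BatchResultSketch implements JsonSerializable
{
    /**
     * @param array<string, array>    $data     structured output keyed by document
     * @param array<string, string>   $errors   fatal problems keyed by document
     * @param array<string, string[]> $warnings non-fatal notes keyed by document
     */
    public function __construct(
        private array $data,
        private array $errors = [],
        private array $warnings = [],
    ) {
    }

    /** Counts of processed, succeeded, and failed documents plus total warnings. */
    public function summary(): array
    {
        return [
            'total' => count($this->data) + count($this->errors),
            'succeeded' => count($this->data),
            'failed' => count($this->errors),
            'warnings' => array_sum(array_map('count', $this->warnings)),
        ];
    }

    /** Called by json_encode(), so the object exports itself as one JSON document. */
    public function jsonSerialize(): array
    {
        return [
            'summary' => $this->summary(),
            'data' => $this->data,
            'errors' => $this->errors,
            'warnings' => $this->warnings,
        ];
    }
}
```

Because the class implements JsonSerializable, json_encode($result, JSON_PRETTY_PRINT) yields the summary, data, errors, and warnings in a single payload.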
Block 5: Security & Hardening
- File size limits (10 MB default)
- Batch document limits (50 documents default)
- Path traversal protection
- Configurable security validation
- Production-ready defaults
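The limits listed above could be enforced with checks along these lines. This is a sketch under assumptions: validateBatch and the SKETCH_* constants are hypothetical names, and the library's own validation may be configured and reported differently. Only the default values (10 MB per file, 50 documents per batch) come from the list above.

```php
<?php
// Hypothetical sketch; names and return shape are assumptions, not the library's API.
const SKETCH_MAX_FILE_BYTES = 10 * 1024 * 1024; // 10 MB default
const SKETCH_MAX_BATCH_DOCS = 50;               // 50 documents default

/**
 * Returns a list of problems; an empty list means the batch passed.
 *
 * @param string[] $paths
 * @return string[]
 */
function validateBatch(
    array $paths,
    string $baseDir,
    int $maxBytes = SKETCH_MAX_FILE_BYTES,
    int $maxDocs = SKETCH_MAX_BATCH_DOCS
): array {
    $errors = [];
    if (count($paths) > $maxDocs) {
        $errors[] = sprintf('Batch has %d documents; limit is %d.', count($paths), $maxDocs);
    }

    $base = realpath($baseDir);
    if ($base === false) {
        return array_merge($errors, ["Base directory not found: $baseDir"]);
    }
    $prefix = rtrim($base, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;

    foreach ($paths as $path) {
        // realpath() resolves "..", "." and symlinks, so a path that escapes
        // the base directory no longer starts with the base prefix.
        $real = realpath($path);
        if ($real === false || !is_file($real)) {
            $errors[] = "Not a readable file: $path";
            continue;
        }
        if (strncmp($real, $prefix, strlen($prefix)) !== 0) {
            $errors[] = "Path escapes the allowed directory: $path";
            continue;
        }
        if (filesize($real) > $maxBytes) {
            $errors[] = "File exceeds size limit: $path";
        }
    }
    return $errors;
}
```

Comparing resolved paths rather than raw strings is the design choice that matters here: a check on the unresolved input would miss sequences such as "uploads/../secret.pdf".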
Block 6: OCR Support (v1.5.0+)
- PdfOcrExtractor for scanned PDFs using Tesseract
- Automatic fallback when digital extraction fails
- Transparent OCR processing without code changes
- Preserves semantic structuring pipeline
OCR Support (Optional)
This library supports OCR for scanned PDFs using Tesseract OCR.
Requirements
- Tesseract OCR installed on the system
- Language data files (e.g., eng for English)
- Installation is handled by the operating system, not Composer
Automatic Fallback
OCR is automatically used when:
- Digital text extraction returns insufficient text
- Extracted text is empty or below threshold (default: 50 characters)
- Extracted text contains no alphabetic characters
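The fallback conditions above amount to a small predicate over the digitally extracted text. The sketch below is an illustration only: needsOcrFallback is a hypothetical helper, not a function exposed by the library, and only the default threshold of 50 characters comes from the list above.

```php
<?php
// Hypothetical helper, not part of the library's API: decides whether
// digitally extracted text is too poor to use, so OCR should be tried.
function needsOcrFallback(string $text, int $minChars = 50): bool
{
    $trimmed = trim($text);

    // Count characters rather than bytes when mbstring is available.
    $length = function_exists('mb_strlen') ? mb_strlen($trimmed) : strlen($trimmed);

    // Empty or shorter than the threshold (50 characters by default).
    if ($length < $minChars) {
        return true;
    }

    // No alphabetic characters at all (only digits, symbols, or whitespace).
    return preg_match('/\p{L}/u', $trimmed) !== 1;
}
```

A caller would run digital extraction first and invoke the OCR extractor only when this predicate returns true, which keeps OCR out of the path for ordinary digital PDFs.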
Example with OCR
Important Notes
- OCR is optional - the library works fine with digital PDFs
- OCR is NOT installed by Composer
- OCR support does not change schema behavior
- Aliases are still defined by your application
- If Tesseract is not available, clear error messages are provided
Documentation
- ARCHITECTURE.md - Complete architectural design
- SECURITY.md - Security policy and configurable limits
- SEMANTIC_STRUCTURING_GUIDE.md - Schema aliases and matching
- QUICK_START_V1.4.0.md - Quick reference for v1.4.0+
API Reference
FinalResult
Production Ready
The library is tested and ready for production deployment. See SECURITY.md for deployment recommendations.
Requirements
- PHP >= 8.1
- Composer
- (Optional) Tesseract OCR for scanned PDF support
License
MIT