Download the PHP package prinsfrank/pdfparser without Composer

On this page you can find all versions of the PHP package prinsfrank/pdfparser. These versions can be downloaded and installed without Composer, and any dependencies are resolved automatically.

FAQ

After downloading, include the autoloader once by calling require_once('vendor/autoload.php'); and then import the classes you need with use statements.

If you use only one package, a project is not needed. But if you use more than one package, it is not possible to import the classes with use statements without a project.

In general, it is recommended to always use a project to download your libraries, since an application normally needs more than one library.
Some PHP packages are not free to download and are therefore hosted in private repositories. Credentials are needed to access such packages. If a package comes from a private repository, use the auth.json textarea to insert the credentials.

  • Some hosting environments are not accessible via a terminal or SSH, so Composer cannot be used there.
  • Using Composer is sometimes complicated, especially for beginners.
  • Composer needs a lot of resources, which are sometimes not available on simple web hosting.
  • If you are using private repositories, you don't need to share your credentials. You can set up everything on our site and then provide a simple download link to your team members.
  • Simplify your Composer build process. Use our own command line tool to download the vendor folder as a binary. This makes your build process faster, and you don't need to expose your credentials for private repositories.

Information about the package pdfparser


PDF Parser


A low-memory, fast and maintainable conforming PDF Parser

Call for testers

Why this library?

Previously, there wasn't a PDF parsing library that was open source, MIT licensed and under active development. The PDFParser by smalot, while very useful over the years, isn't under active development anymore. The parser by Setasign is neither MIT licensed nor open source. Several other packages rely on Java, JavaScript or Python dependencies being installed, which PHP then calls behind the scenes, losing any type information and underlying structure.

Instead, this package allows for parsing of a wide variety of PDF files while not relying on external dependencies, all while being MIT licensed!

Setup

To start right away, run the following command in your Composer project:
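The command itself did not survive on this page. Assuming the standard Composer workflow and the package name given above, it would most likely be:

```shell
# Adds the package to composer.json and installs it into ./vendor
composer require prinsfrank/pdfparser
```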

Opening a PDF

To open a PDF file, you'll first need to load it and retrieve a Document object. That can be done by either parsing a file directly, or parsing a PDF from a string variable.

Parsing a PDF file

Parsing a PDF from a file directly is the easiest option and also uses the least amount of memory. To do so, simply call the parseFile method on a PdfParser instance:
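A minimal sketch of that call. The namespace PrinsFrank\PdfParser and the file path are assumptions for illustration; check them against the installed package version.

```php
<?php

// Assumes the package was installed with Composer into ./vendor.
require_once __DIR__ . '/vendor/autoload.php';

// Assumed namespace; verify it against the installed package.
use PrinsFrank\PdfParser\PdfParser;

// Placeholder path; point this at a real PDF file.
$document = (new PdfParser())->parseFile('/path/to/file.pdf');
```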

Parsing a PDF from a string

It is also possible to parse a PDF from a string in a variable. To do so, pass the string as an argument to the parseString method on a PdfParser instance. This has a bigger memory footprint while the file is loaded into memory, but the contents are written to a temp file while processing.
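As a rough sketch, parsing from a string could look like the following. Reading the string with file_get_contents is only one way to obtain it; the namespace and the path are assumptions.

```php
<?php

// Assumes the package was installed with Composer into ./vendor.
require_once __DIR__ . '/vendor/autoload.php';

// Assumed namespace; verify it against the installed package.
use PrinsFrank\PdfParser\PdfParser;

// Placeholder source: the string could equally come from an upload or an HTTP response.
$pdfAsString = file_get_contents('/path/to/file.pdf');
if ($pdfAsString === false) {
    throw new RuntimeException('Could not read the PDF file.');
}

// parseString takes the raw PDF contents instead of a path.
$document = (new PdfParser())->parseString($pdfAsString);
```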

The Document

Once you have opened a file from the filesystem with parseFile or from a string variable using parseString, you'll get back an instance of a Document.

While initially parsing the document, a small number of variables are populated in the Document instance that allow for further accessing of that document. This includes:

The document also contains several methods to retrieve specific objects from it. Those are discussed below.

If you want to quickly retrieve all text from a document, you can use the getText method.
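For instance, dumping all text of a document might look like this. This is a sketch that assumes the PrinsFrank\PdfParser namespace and a placeholder path.

```php
<?php

// Assumes the package was installed with Composer into ./vendor.
require_once __DIR__ . '/vendor/autoload.php';

// Assumed namespace; verify it against the installed package.
use PrinsFrank\PdfParser\PdfParser;

$document = (new PdfParser())->parseFile('/path/to/file.pdf');

// getText on the Document returns the text of the whole document at once.
echo $document->getText();
```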

Objects in a Document and their decorators

A PDF is organized in objects, and not all objects are created equal. Some objects might be a Page, while others might be a Font. Some objects are Generic, without a specific type. There are currently 18 specific types and one generic object type. Some of those are described below.

Code specific to a certain object type lives in that object type's decorator. Retrieving text from a Page makes sense, while retrieving text from a Font does not, so the getText method lives in the Page decorator. Below you'll find some documentation for specific object decorators.

If you want to retrieve an object by its number, you can call the $document->getObject($objectNumber) method. If you know that the object with that number is supposed to be of a specific type, you can supply the second argument. For example, if you want to get object 42 which you know is of type Page, you can call the method like this:
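A sketch of such a call, assuming the expected type is passed as a class name and that the Page decorator lives in the namespace shown. Both are assumptions to verify against the installed version.

```php
<?php

// Assumes the package was installed with Composer into ./vendor.
require_once __DIR__ . '/vendor/autoload.php';

// Assumed namespaces; verify them against the installed package.
use PrinsFrank\PdfParser\Document\Object\Decorator\Page;
use PrinsFrank\PdfParser\PdfParser;

$document = (new PdfParser())->parseFile('/path/to/file.pdf');

// Request object 42 and state that it is expected to be a Page.
$page = $document->getObject(42, Page::class);
```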

If the object is not of the correct type, this will result in an exception. If you don't care about the object type, pass null as the second argument or don't supply the second argument at all.

Decorated InformationDictionary objects

If a PDF has a title, producer, author, creator, creationDate or modificationDate, it is stored in an InformationDictionary.

If a PDF has an InformationDictionary, it can be retrieved using the $document->getInformationDictionary() method. Not all PDFs have one, so this method might return null.

To access information from the InformationDictionary, there are several methods available:
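The list of methods is missing here. Assuming one getter per field named in the text above (for example getTitle for title), usage might look like the sketch below. The getter names and the namespace are assumptions.

```php
<?php

// Assumes the package was installed with Composer into ./vendor.
require_once __DIR__ . '/vendor/autoload.php';

// Assumed namespace; verify it against the installed package.
use PrinsFrank\PdfParser\PdfParser;

$document = (new PdfParser())->parseFile('/path/to/file.pdf');

// The dictionary may be absent, so use the nullsafe operator (PHP 8.0+).
$info = $document->getInformationDictionary();

// Assumed getter names, derived from the field names listed above.
$title            = $info?->getTitle();
$producer         = $info?->getProducer();
$author           = $info?->getAuthor();
$creator          = $info?->getCreator();
$creationDate     = $info?->getCreationDate();
$modificationDate = $info?->getModificationDate();
```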

If you want to access non-standard data from the information dictionary, you can also retrieve the entire dictionary from the object:
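A sketch of that access, assuming the decorator exposes the raw dictionary through a getDictionary method. That method name is an assumption and should be checked against the installed version.

```php
<?php

// Assumes the package was installed with Composer into ./vendor.
require_once __DIR__ . '/vendor/autoload.php';

// Assumed namespace; verify it against the installed package.
use PrinsFrank\PdfParser\PdfParser;

$document = (new PdfParser())->parseFile('/path/to/file.pdf');

// getDictionary is an assumed method name; the result is null when the PDF has no InformationDictionary.
$dictionary = $document->getInformationDictionary()?->getDictionary();
```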

Decorated Page objects

Page objects can be retrieved from a document by calling the $document->getPage($pageNumber) method for a single page, or $document->getPages() for all pages. Note that $pageNumber is zero-indexed, so regardless of how page numbers are printed on the pages themselves, the first page in a document is page 0, the second is page 1, and so on.

Once you have a Page object, there are several methods available to retrieve information from that page. The main method of interest here is the $page->getText() method. To retrieve all text from all pages, you could do something like this:
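A minimal sketch of such a loop, assuming getPages returns something iterable. The namespace and path are placeholders.

```php
<?php

// Assumes the package was installed with Composer into ./vendor.
require_once __DIR__ . '/vendor/autoload.php';

// Assumed namespace; verify it against the installed package.
use PrinsFrank\PdfParser\PdfParser;

$document = (new PdfParser())->parseFile('/path/to/file.pdf');

// Print the text of every page, one page after another.
foreach ($document->getPages() as $index => $page) {
    echo 'Text on page ' . $index . ': ' . $page->getText() . PHP_EOL;
}
```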

There is also a getText method available on the Document to retrieve all text at once without even having to retrieve pages.

There are also methods available to get the underlying textObjectCollection using $page->getTextObjectCollection(), the resource dictionary for a page using $page->getResourceDictionary() and the font dictionary using $page->getFontDictionary().


All versions of pdfparser with dependencies

Requires:
  • php: ~8.1.0 || ~8.2.0 || ~8.3.0 || ~8.4.0
  • ext-zlib: *
