PHP download

Download the PHP package nadar/crawler without Composer

On this page you can find all versions of the PHP package nadar/crawler. These versions can be downloaded and installed without Composer. Any dependencies are resolved automatically.

Table of contents
Download nadar/crawler
More information about nadar/crawler
Files in nadar/crawler

Vendor: nadar
Package: crawler
Short description: A highly extendible, dependency-free crawler for HTML, PDFs or any other type of document.
License: MIT

FAQ

How can I use the PHP package after the download?

After the download, include the autoloader once with require_once('vendor/autoload.php'); and then import the classes you need with use statements.

Example:
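A minimal sketch of such an include, assuming the downloaded vendor folder sits next to the script. The class name Nadar\Crawler\Crawler is an assumption (it is not stated on this page), and loadVendorAutoload() is a hypothetical helper introduced only for illustration.

```php
<?php
// Import the class by its namespaced name.
// Assumption: the package's main class is Nadar\Crawler\Crawler.
use Nadar\Crawler\Crawler;

// Hypothetical helper: include the autoloader from the downloaded vendor folder.
// Returns false instead of failing hard when the folder is missing.
function loadVendorAutoload(string $baseDir): bool
{
    $autoload = $baseDir . '/vendor/autoload.php';
    if (!is_file($autoload)) {
        return false;
    }
    require_once $autoload;
    return true;
}

if (loadVendorAutoload(__DIR__)) {
    // After the include, imported classes are resolved through the autoloader.
    var_dump(class_exists(Crawler::class));
} else {
    echo "vendor/autoload.php not found next to this script\n";
}
```

In a plain script the two essential lines are the require_once call and the use statement; the helper only makes a missing vendor folder visible instead of producing a fatal error.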

Do I need to create a project on this site?

If you use only one package, a project is not needed. If you use more than one package, however, you cannot import the classes with use statements without a project.

In general, it is recommended to always use a project to download your libraries, since an application normally needs more than one library.

When is it necessary to insert some auth.json content?

Some PHP packages are not free to download and are therefore hosted in private repositories. Credentials are needed to access such packages. If a package comes from a private repository, use the auth.json textarea to insert the credentials.

What are the advantages of using this site for my Composer projects?

Some hosting environments cannot be accessed by a terminal or SSH, so it is not possible to use Composer there.
Using Composer is sometimes complicated, especially for beginners.
Composer needs a lot of resources, which are sometimes not available on a simple webspace.
If you are using private repositories, you don't need to share your credentials. You can set up everything on our site and then provide a simple download link to your team members.
You can simplify your Composer build process. Use our own command line tool to download the vendor folder as a binary. This makes your build process faster, and you don't need to expose your credentials for private repositories.


Information about the package crawler

Website Crawler for PHP

A highly extendible, dependency-free crawler for HTML, PDFs or any other type of document.

Why another page crawler? There are already very good crawlers around, so these were my goals:

Dependency free - No HTTP client library is used; the crawler relies on as much "native" PHP code as possible to keep the overhead small. It only requires the cURL extension.
Memory efficient - As memory efficient as possible, with little overhead and full control over the code.
Extendible - Attach your own parsers to determine how HTML or any other format is parsed. Parsers for HTML and PDF are available out of the box, and it is very easy to build a parser for your own data type.
Runtime storage - While the crawler runs, certain information must be stored. This is extendible to suit your use case: either use your own database or take the built-in array or file storage system.
Async - It is possible to start the crawler and process every further run cycle as an asynchronous process, e.g. with a PHP queue system like Yii2 Queue.

Installation

Composer is required to install this library:

    composer require nadar/crawler

In order to use the PDF parser, the optional library smalot/pdfparser must be installed:

    composer require smalot/pdfparser

Usage

First, tell the crawler what should be done with the results of a crawler run.

To do so, create your handler. Handlers are the classes that interact with the crawler in order to store your content or results somewhere. The afterRun() method is called whenever a URL has been crawled and receives the results.
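The handler idea can be sketched without the package installed. In the following minimal, self-contained sketch, PageResult and CrawlHandler are hypothetical stand-ins defined here for illustration, not the classes shipped with nadar/crawler; only the afterRun() method name comes from the description above. The package's real handler interface may require additional methods and passes its own result object.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the result object handed to a handler.
final class PageResult
{
    public string $url;
    public string $title;

    public function __construct(string $url, string $title)
    {
        $this->url = $url;
        $this->title = $title;
    }
}

// Hypothetical stand-in for the package's handler interface.
interface CrawlHandler
{
    public function afterRun(PageResult $result): void;
}

// Example handler: keeps every crawled page in an array keyed by URL.
final class ArrayCollectingHandler implements CrawlHandler
{
    /** @var array<string, string> URL => page title */
    public array $pages = [];

    public function afterRun(PageResult $result): void
    {
        $this->pages[$result->url] = $result->title;
    }
}

// Simulate the crawler calling the handler once per crawled URL.
$handler = new ArrayCollectingHandler();
$handler->afterRun(new PageResult('https://example.com/', 'Home'));
$handler->afterRun(new PageResult('https://example.com/about', 'About'));

echo count($handler->pages), " pages stored\n";
```

A real handler would typically write to a database or search index inside afterRun() instead of collecting results in memory.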

Then attach the handler and set up all the information the crawler requires.

Attention: Keep in mind that enabling the PDF parser together with multiple concurrent requests can drastically increase memory usage, especially if there are large PDFs. It is therefore recommended to lower the number of concurrent requests when enabling the PDF parser.

Benchmark

These benchmarks may of course vary depending on internet connection, bandwidth and servers, but all tests were made under the same circumstances. The memory peak varies strongly when the PDF parser is used, so only the HTML parser was tested:

Index Size  Concurrent Requests  Memory Peak  Time  Storage
308         30                   6 MB         19 s  ArrayStorage
308         30                   6 MB         20 s  FileStorage

Still looking for a good website to use for benchmarking. See the benchmark.php file for the test setup.
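How figures like memory peak and time can be collected in PHP is sketched below. This is a minimal sketch using only built-in PHP functions, not the contents of benchmark.php. The measure() helper and the placeholder workload are hypothetical; in a real benchmark the callable would start the crawler.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: run a task and report wall-clock time and peak memory.
// Note: memory_get_peak_usage() reports the peak of the whole PHP process so far,
// not only of the measured task.
function measure(callable $task): array
{
    $start = microtime(true);
    $task();

    return [
        'seconds' => microtime(true) - $start,
        'peak_mb' => memory_get_peak_usage(true) / 1048576,
    ];
}

// Placeholder workload standing in for a crawler run.
$stats = measure(function (): void {
    $buffer = [];
    for ($i = 0; $i < 10000; $i++) {
        $buffer[] = str_repeat('x', 100);
    }
});

printf("%.3f s, %.1f MB peak\n", $stats['seconds'], $stats['peak_mb']);
```

Because the peak is measured per process, each storage variant should be run in its own PHP process if the figures are to be compared.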

Developer Information

For a better understanding, here is an explanation of how the classes are encapsulated and what they are used for.

Crawler: The crawler is the main program; it starts, runs and ends.
Job: The job contains the URL logic for the next cURL download job.
Parsers: The parsers take the job information in combination with the RequestResponse in order to generate a ParserResult.
ParserResult: The parser result represents the result from a parser.
QueueItem: The queue item is extracted from the job and is only used to store that information by means of the StorageInterface.

Lifecycle

Crawler -> Job -> (QueueItem -> Storage) -> RequestResponse -> Parser -> ParserResult -> Result

All versions of crawler with dependencies


Version 1.7.1, released 05 Apr 2022

Requires: ext-curl (version *)



FAQ

How can I use the PHP package after the download?

Do I need to create a project on this site?

When is it necessary to insert some auth.json content?

What are the advantages of using this site for my Composer projects?