Package nadar/crawler
License MIT
Website Crawler for PHP
A highly extensible, dependency-free crawler for HTML, PDFs, or any other type of document.
Why another page crawler? There are already very good crawlers around, so these were my goals:
- Dependency Free - No third-party HTTP client is used; the crawler relies on as much "native" PHP code as possible in order to keep the overhead small. It only requires the cURL extension.
- Memory Efficient - As memory efficient as possible, with less overhead and full control over the code.
- Extensible - Attach your own parsers to determine how HTML or any other format is parsed. Parsers for HTML and PDF are available out of the box, and it's very easy to build a parser for your own data type.
- Runtime Storage - While the crawler runs, certain information must be stored. The storage is extensible to suit your use case: either use your own database or one of the built-in array or file storage systems.
- Async - It's possible to start the crawler and process each further run cycle as an asynchronous process, e.g. with a PHP queue system like Yii2 Queue.
Installation
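Composer is required to install this library. Run `composer require nadar/crawler` in your project.
To use the PDF parser, the optional library smalot/pdfparser must also be installed, for example with `composer require smalot/pdfparser`.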
Usage
- First, we need to tell the crawler what should be done with the results of a crawler run:
Create your handler. Handlers are the classes that interact with the crawler in order to store your content/results somewhere. The `afterRun()` method runs whenever a URL has been crawled and receives the results.
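As a rough illustration of the handler idea, the sketch below uses local stand-in types so that it runs on its own. `PageResult`, `ResultHandler` and `InMemoryIndexHandler` are hypothetical names defined here; only the `afterRun()` method name comes from the text above, and the library's actual interface and result class may look different.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the result object produced by one crawled URL.
final class PageResult
{
    public string $url;
    public string $title;

    public function __construct(string $url, string $title)
    {
        $this->url = $url;
        $this->title = $title;
    }
}

// Hypothetical stand-in for the handler contract: one callback per crawled URL.
interface ResultHandler
{
    public function afterRun(PageResult $result): void;
}

// Example handler that stores results in an array; a real handler
// would typically write to a database or a search index instead.
final class InMemoryIndexHandler implements ResultHandler
{
    /** @var array<string, string> Map of URL to page title. */
    public array $index = [];

    public function afterRun(PageResult $result): void
    {
        $this->index[$result->url] = $result->title;
    }
}

// Simulate two finished crawler runs being handed to the handler.
$handler = new InMemoryIndexHandler();
$handler->afterRun(new PageResult('https://example.com/', 'Example Domain'));
$handler->afterRun(new PageResult('https://example.com/about', 'About'));
```

Keeping the storage logic inside the handler means the crawling code does not need to know where results end up.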
- Then we attach the handler and set up all the information the crawler requires.
Attention: Keep in mind that when you enable the PDF parser and run multiple concurrent requests, memory usage can increase drastically (especially if there are large PDFs)! It is therefore recommended to lower the number of concurrent requests when enabling the PDF parser.
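The link between concurrency and memory can be pictured with plain cURL calls. The following sketch is not the crawler's implementation: `fetchInBatches()` and its parameters are hypothetical, and it only relies on PHP's built-in `curl_multi_*` functions. It downloads URLs in simple batches, so roughly `$concurrency` response bodies are held in memory at the same time, which is why a lower value helps when the documents are large PDFs.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: download $urls in batches of $concurrency and pass
 * each response to $onResponse(url, statusCode, body).
 * Returns the number of URLs processed.
 */
function fetchInBatches(array $urls, int $concurrency, callable $onResponse): int
{
    $processed = 0;

    // array_chunk() needs a size of at least 1.
    foreach (array_chunk($urls, max(1, $concurrency)) as $batch) {
        $multi = curl_multi_init();
        $handles = [];

        foreach ($batch as $url) {
            $handle = curl_init($url);
            curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); // buffer the body in memory
            curl_setopt($handle, CURLOPT_TIMEOUT, 10);
            curl_multi_add_handle($multi, $handle);
            $handles[] = [$url, $handle];
        }

        // Drive all transfers of this batch until none is still running.
        do {
            $status = curl_multi_exec($multi, $running);
            if ($running > 0) {
                curl_multi_select($multi, 1.0); // wait for network activity
            }
        } while ($running > 0 && $status === CURLM_OK);

        foreach ($handles as [$url, $handle]) {
            $statusCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
            $onResponse($url, $statusCode, (string) curl_multi_getcontent($handle));
            curl_multi_remove_handle($multi, $handle);
            $processed++;
        }

        // Drop the handles so this batch's buffered bodies can be freed
        // before the next batch starts.
        unset($handles, $handle);
        curl_multi_close($multi);
    }

    return $processed;
}

// Example call (performs real HTTP requests, so it is left commented out):
// fetchInBatches(['https://example.com/'], 2, function (string $url, int $status, string $body): void {
//     echo $url, ' ', $status, ' ', strlen($body), " bytes\n";
// });
```

With a batch size of 30 and PDFs of several megabytes each, the buffered bodies alone can add up quickly, before any parsing overhead is counted.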
Benchmark
Of course, these benchmarks may vary depending on internet connection, bandwidth, and servers, but all tests were run under the same circumstances. The memory peak varies strongly when using the PDF parser, so we test only with the HTML parser:
Index Size | Concurrent Requests | Memory Peak | Time | Storage |
---|---|---|---|---|
308 | 30 | 6MB | 19s | ArrayStorage |
308 | 30 | 6MB | 20s | FileStorage |
We are still looking for a good website to use for benchmarking. See the `benchmark.php` file for the test setup.
Developer Information
For a better understanding, here is an explanation of how the classes are encapsulated and what they are used for.
- Crawler: The crawler is the main program; it starts, runs, and ends.
- Job: The job contains the URL logic for the next "cURL"/download job.
- Parsers: The parsers take the job information in combination with the RequestResponse in order to generate a ParserResult.
- ParserResult: The parser result represents the result from a parser.
- QueueItem: The queue item is extracted from the job and is only used to store this information via the StorageInterface.
Lifecycle
Crawler -> Job -> (QueueItem -> Storage) -> RequestResponse -> Parser -> ParserResult -> Result
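To make the lifecycle more tangible, here is a deliberately simplified sketch of the same flow. It does not use the library's classes: the `$pages` array replaces real downloads, `parseHtml()` plays the role of a parser, and plain arrays stand in for the storage and the handler. All of these names are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

// Stand-in for the network: URL => HTML body. A real run would download these.
$pages = [
    '/'      => '<title>Home</title><a href="/about">About</a><a href="/">Home</a>',
    '/about' => '<title>About</title><a href="/">Home</a>',
];

// "Parser": turns a response body into a result (title plus discovered links).
function parseHtml(string $url, string $html): array
{
    preg_match('#<title>(.*?)</title>#is', $html, $title);
    preg_match_all('#href="([^"]+)"#i', $html, $links);

    return ['url' => $url, 'title' => $title[1] ?? '', 'links' => $links[1]];
}

// "Storage": the queue of URLs still to visit and the set of URLs already seen.
$queue = ['/'];
$seen = ['/' => true];

// "Handler": collects the titles of all crawled pages.
$results = [];

// "Crawler": take the next queue item, fetch and parse it, hand the result
// to the handler, and queue any links that have not been seen yet.
while ($queue !== []) {
    $url = array_shift($queue);
    if (!isset($pages[$url])) {
        continue; // nothing to download for this URL
    }

    $result = parseHtml($url, $pages[$url]);
    $results[] = $result['title'];

    foreach ($result['links'] as $link) {
        if (!isset($seen[$link])) {
            $seen[$link] = true;
            $queue[] = $link;
        }
    }
}
```

Because the queue and the seen set live in their own variables, swapping them for a file or database backend would not change the loop itself, which mirrors the idea behind the pluggable storage.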