Download the PHP package vdb/php-spider without Composer
On this page you can find all versions of the php package vdb/php-spider. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Information about the package php-spider
PHP-Spider Features
- supports two traversal algorithms: breadth-first and depth-first
- supports crawl depth limiting, queue size limiting and max downloads limiting
- supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
- comes with a useful set of URI filters, such as robots.txt and Domain limiting
- supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
- supports custom request handling logic
- supports Basic, Digest and NTLM HTTP authentication. See example.
- comes with a useful set of persistence handlers (memory, file)
- supports custom persistence handlers
- collects statistics about the crawl for reporting
- dispatches useful events, allowing developers to add even more custom behavior
- supports a politeness policy
This Spider does not support JavaScript.
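To illustrate the difference between the two traversal algorithms listed above, here is a minimal sketch in plain PHP. It does not use PHP-Spider itself: the link graph, the `traverse()` function and its parameters are hypothetical names made up for this illustration, and the library's own implementation may differ.

```php
<?php
// Hypothetical in-memory link graph: page => pages it links to.
$links = [
    '/'    => ['/a', '/b'],
    '/a'   => ['/a/1', '/a/2'],
    '/b'   => ['/b/1'],
    '/a/1' => [],
    '/a/2' => [],
    '/b/1' => [],
];

/**
 * Return the order in which pages are visited, starting at $seed.
 * Breadth-first takes the oldest queued URI first (FIFO);
 * depth-first takes the most recently queued URI first (LIFO).
 */
function traverse(array $links, string $seed, bool $breadthFirst): array
{
    $queue = [$seed];
    $seen = [$seed => true];
    $order = [];

    while ($queue !== []) {
        $current = $breadthFirst ? array_shift($queue) : array_pop($queue);
        $order[] = $current;

        foreach ($links[$current] ?? [] as $next) {
            if (!isset($seen[$next])) {
                // Mark on enqueue so no URI is queued twice.
                $seen[$next] = true;
                $queue[] = $next;
            }
        }
    }

    return $order;
}

echo implode(' ', traverse($links, '/', true)), "\n";
echo implode(' ', traverse($links, '/', false)), "\n";
```

Breadth-first visits every page linked from the start page before going deeper, while depth-first follows one branch to its end before returning to the others. Combined with a crawl depth or queue size limit, the choice determines which pages are reached before the limit applies.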
Installation
The easiest way to install PHP-Spider is with Composer, for example by running composer require vdb/php-spider. Find it on Packagist.
Usage
This is a very simple example. A more complete, real-world example can be found in example/example_complex.php.
Note that by default, the spider stops processing when it encounters a 4XX or 5XX error response. To set the spider up to keep processing, see the link checker example. It uses a custom request handler that configures the default Guzzle request handler not to fail on 4XX and 5XX responses.
First create the spider
Add a URI discoverer. Without it, the spider does nothing. In this case, we want all <a> nodes from a certain <div>
Set some sane options for this example. In this case, we only get the first 10 items from the start page.
Add a listener to collect stats from the Spider and the QueueManager. There are more components that dispatch events you can use.
Execute the crawl
When crawling is done, we could get some info about the crawl
Finally we could do some processing on the downloaded resources. In this example, we will echo the title of all resources
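The steps above can be sketched end to end as follows. This is a hedged sketch, not the project's canonical example: it assumes PHP-Spider was installed with Composer, and that the class names (`Spider`, `XPathExpressionDiscoverer`, `StatsHandler`), the public `maxDepth` and `maxQueueSize` properties, and the `StatsHandler` getters are available under these names in the installed release. The seed URL and the XPath expression are placeholders.

```php
<?php
// Sketch only: class, property and method names are assumed to match the
// installed release of vdb/php-spider; check them against its source.
require __DIR__ . '/vendor/autoload.php';

use VDB\Spider\Discoverer\XPathExpressionDiscoverer;
use VDB\Spider\Spider;
use VDB\Spider\StatsHandler;

// First create the spider. The seed URL is a placeholder.
$spider = new Spider('https://example.com');

// Add a URI discoverer: all <a> nodes inside one <div>.
// The id "content" is hypothetical; adapt the XPath to the target page.
$spider->getDiscovererSet()->set(
    new XPathExpressionDiscoverer("//div[@id='content']//a")
);

// Sane limits: only follow links found on the start page,
// and queue at most 10 URIs.
$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;

// Collect stats from both the Spider and the QueueManager.
$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);

// Execute the crawl.
$spider->crawl();

// Report some info about the crawl.
echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
echo "\n  FAILED:    " . count($statsHandler->getFailed());
echo "\n  PERSISTED: " . count($statsHandler->getPersisted());

// Echo the title of every downloaded resource.
echo "\n\nDOWNLOADED RESOURCES: ";
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    $title = $resource->getCrawler()->filterXPath('//title');
    // Guard against pages without a <title> element.
    echo "\n - " . (count($title) > 0 ? $title->text() : '(no title)');
}
echo "\n";
```

Subscribing the same stats handler to both dispatchers matters because the Spider and the QueueManager dispatch their events separately. With only one subscription, some of the counters would stay at zero.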
Contributing
Contributing to PHP-Spider is as easy as forking the repository on GitHub and submitting a Pull Request. The Symfony documentation contains an excellent guide on how to do that properly: Submitting a Patch.
There are a few requirements for a Pull Request to be accepted:
- Follow the coding standards: PHP-Spider follows the coding standards defined in the PSR-0, PSR-1 and PSR-2 Coding Style Guides.
- Prove that the code works with unit tests and that coverage remains 100%.
Note: An easy way to check whether your code conforms to the PHP-Spider coding standards is to run the script bin/static-analysis, which is part of this repo. It runs the following tools, configured for PHP-Spider: PHP CodeSniffer, PHP Mess Detector and PHP Copy/Paste Detector.
Note: To run PHPUnit with coverage, and to check that coverage is 100%, run bin/coverage-enforce.
Support
For things like reporting bugs and requesting features it is best to create an issue here on GitHub. It is even better to accompany it with a Pull Request. ;-)
License
PHP-Spider is licensed under the MIT license.
All versions of php-spider with dependencies
- ext-dom: *
- ext-pcntl: *
- guzzlehttp/guzzle: ^6.0.0||^7.0.0
- pdepend/pdepend: ^2.16.1
- symfony/css-selector: ^3.0.0||^4.0.0||^5.0.0||^6.0||^7.0
- symfony/dom-crawler: ^3.0.0||^4.0.0||^5.0.0||^6.0||^7.0
- symfony/finder: ^3.0.0||^4.0.0||^5.0.0||^6.0||^7.0
- symfony/event-dispatcher: ^4.0.0||^5.0.0||^6.0||^7.0
- vdb/uri: ^0.3.2
- spatie/robots-txt: ^2.0
- phan/phan: ^4.0||^5.0