Download the PHP package baqend/spider without Composer
On this page you can find all versions of the php package baqend/spider. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Package: spider
Short description: URL spider which crawls a page and all its subpages
License: MIT
Information about the package spider
PHP Spider
URL spider which crawls a page and all its subpages
- Installation
- Usage
- Processors
- URL Handlers
- Alternatives
Installation
Make sure you have Composer installed. Then execute:
composer require baqend/spider
This package requires at least PHP 5.5.9 and has no package dependencies!
Usage
The entry point is the Spider class. For it to work, it requires the following services:
- Queue: Collects URLs to be processed. This package comes with a breadth-first and a depth-first implementation.
- URL Handler: Checks if a URL should be processed. If no URL handler is provided, every URL is processed. See the URL Handlers section for details.
- Downloader: Takes URLs and downloads them. To avoid a dependency on an HTTP client library like Guzzle, the package ships no downloader; you have to implement it yourself.
- Processor: Receives downloaded assets and performs operations on them. See the Processors section for details.
You initialize the spider in the following way:
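How the four services cooperate can be sketched without the package itself. The snippet below is a minimal, self-contained illustration: `crawl()`, the closures, and the in-memory `$site` array are hypothetical stand-ins, not the package's classes or method names.

```php
<?php

// Hypothetical stand-ins only; none of these names come from baqend/spider.

/**
 * Crawls from $startUrl and returns the URLs that were downloaded, in order.
 */
function crawl($startUrl, callable $shouldHandle, callable $download, callable $process)
{
    $queue = new SplQueue();            // FIFO, so URLs are visited breadth-first
    $queue->enqueue($startUrl);
    $seen = [$startUrl => true];        // never enqueue the same URL twice
    $visited = [];

    while (!$queue->isEmpty()) {
        $url = $queue->dequeue();
        if (!$shouldHandle($url)) {
            continue;                   // the URL handler rejected this URL
        }
        $content = $download($url);     // the downloader fetches the asset
        $visited[] = $url;
        // The processor returns the URLs it discovered in the asset.
        foreach ($process($url, $content) as $found) {
            if (!isset($seen[$found])) {
                $seen[$found] = true;
                $queue->enqueue($found);
            }
        }
    }

    return $visited;
}

// A fake in-memory "website": URL => links found on that page.
$site = [
    'https://example.org/'  => ['https://example.org/a', 'https://example.org/b', 'https://other.test/'],
    'https://example.org/a' => ['https://example.org/c'],
    'https://example.org/b' => ['https://example.org/'],
    'https://example.org/c' => [],
];

$visited = crawl(
    'https://example.org/',
    // URL handler: stay on example.org
    function ($url) { return strpos($url, 'https://example.org/') === 0; },
    // Downloader: look the page up in the fake site
    function ($url) use ($site) { return isset($site[$url]) ? implode("\n", $site[$url]) : ''; },
    // Processor: every line of the "page" is a link
    function ($url, $content) { return $content === '' ? [] : explode("\n", $content); }
);
```

Swapping the `SplQueue` for an `SplStack` (with `push()` and `pop()`) would turn the same loop into a depth-first crawl.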
Processors
This package comes with the following built-in processors.
Processor
An aggregate processor: you can add and remove other processors, and it executes them one after another.
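The aggregate idea can be illustrated with a small composite. This is a sketch under assumptions: the interface, the class names, and the choice to pass plain content strings are all hypothetical and are not taken from the package.

```php
<?php

// Hypothetical names throughout; this is not the package's API.
interface AssetProcessor
{
    /** Takes an asset (here simply its content string) and returns the processed content. */
    public function process($content);
}

/** Wraps any callable so it can be used as a processor. */
final class CallbackProcessor implements AssetProcessor
{
    private $callback;

    public function __construct(callable $callback)
    {
        $this->callback = $callback;
    }

    public function process($content)
    {
        return call_user_func($this->callback, $content);
    }
}

/** Runs its child processors in the order they were added. */
final class AggregateProcessor implements AssetProcessor
{
    private $processors = [];

    public function add(AssetProcessor $processor)
    {
        $this->processors[] = $processor;
    }

    public function remove(AssetProcessor $processor)
    {
        $this->processors = array_values(array_filter(
            $this->processors,
            function (AssetProcessor $p) use ($processor) { return $p !== $processor; }
        ));
    }

    public function process($content)
    {
        foreach ($this->processors as $processor) {
            $content = $processor->process($content);  // output of one feeds the next
        }
        return $content;
    }
}

$aggregate = new AggregateProcessor();
$aggregate->add($trim = new CallbackProcessor('trim'));
$aggregate->add($upper = new CallbackProcessor('strtoupper'));
$result = $aggregate->process('  hello  ');
```

Because each processor receives the previous one's output, the order in which processors are added matters.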
HtmlProcessor
Processes HTML assets and enqueues the URLs they contain.
It also rewrites all relative URLs to absolute ones.
If you provide a CssProcessor, it additionally finds style attributes and resolves the URLs within their CSS.
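Making a relative URL absolute means resolving it against the URL of the page it was found on. The helper below is a simplified sketch of that step, not the package's implementation: `resolveUrl()` is a hypothetical name, and it skips query-only and fragment-only references that full RFC 3986 resolution would also handle.

```php
<?php

// Hypothetical helper; a simplified take on RFC 3986 reference resolution.
function resolveUrl($base, $relative)
{
    // A leading scheme ("https:", "mailto:") means the URL is already absolute.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $relative)) {
        return $relative;
    }

    $parts = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative: "//cdn.example.com/x.js" inherits only the scheme.
    if (strpos($relative, '//') === 0) {
        return $parts['scheme'] . ':' . $relative;
    }

    if (strpos($relative, '/') === 0) {
        $path = $relative;                          // root-relative
    } else {
        $basePath = isset($parts['path']) ? $parts['path'] : '/';
        // Keep the directory of the base URL, drop its file name.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relative;
    }

    // Collapse "." and ".." segments without climbing above the root.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }

    return $origin . implode('/', $out);
}

$absolute = resolveUrl('https://example.org/blog/post.html', '../style.css');
// $absolute is 'https://example.org/style.css'
```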
CssProcessor
Processes CSS assets and enqueues the URLs they contain, taken from @import rules and url(...) expressions.
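Finding those URLs can be approximated with two regular expressions. This is only a sketch: `extractCssUrls()` is a hypothetical helper, and the patterns are deliberately simplified (they ignore CSS comments and escaped characters), so the package may well do this differently.

```php
<?php

// Hypothetical helper; simplified patterns, not a CSS parser.
function extractCssUrls($css)
{
    $urls = [];

    // @import "file.css"; or @import 'file.css';
    preg_match_all('~@import\s+([\'"])(.*?)\1~i', $css, $matches);
    foreach ($matches[2] as $url) {
        $urls[] = $url;
    }

    // url(file.png), url("file.png"), url('file.png'),
    // which also covers the @import url(...) form.
    preg_match_all('~url\(\s*([\'"]?)(.*?)\1\s*\)~i', $css, $matches);
    foreach ($matches[2] as $url) {
        $urls[] = $url;
    }

    // Each URL only needs to be enqueued once.
    return array_values(array_unique($urls));
}

$css = <<<'CSS'
@import "base.css";
@import url('theme.css');
body { background: url(img/bg.png); }
h1   { background: url( "img/bg.png" ); }
CSS;

$found = extractCssUrls($css);
// $found is ['base.css', 'theme.css', 'img/bg.png']
```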
ReplaceProcessor
Performs simple str_replace operations on asset contents:
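As a rough illustration of what such a replacement does to an asset, here is a sketch built directly on PHP's `str_replace()`. The function name `replaceInAsset()` and the example URLs are assumptions for illustration and are not the package's constructor or method signatures.

```php
<?php

// Hypothetical stand-in for a replace step; not the package's class.
function replaceInAsset($content, $search, $replace)
{
    // str_replace() matches literally and case-sensitively,
    // and replaces every occurrence, not just the first.
    return str_replace($search, $replace, $content);
}

$html = '<a href="https://example.org/news">News</a>';
$rewritten = replaceInAsset($html, 'https://example.org', 'https://example.com/archive');
// $rewritten is '<a href="https://example.com/archive/news">News</a>'
```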
The ReplaceProcessor does not enqueue other URLs.
StoreProcessor
Takes a URL prefix and a directory, and stores every asset inside that directory at the path its URL has relative to the prefix.
The StoreProcessor does not enqueue other URLs.
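The mapping from URL to file can be sketched as follows. Both functions are hypothetical helpers; in particular, storing directory-style URLs as `index.html`, dropping query strings, and returning `null` for URLs outside the prefix are assumptions of this sketch rather than documented behaviour of the package.

```php
<?php

// Hypothetical helpers; not the package's StoreProcessor.

/** Maps a URL under $prefix to a file path under $directory, or null if it does not belong there. */
function storagePath($url, $prefix, $directory)
{
    $prefix = rtrim($prefix, '/');
    if ($url !== $prefix && strpos($url, $prefix . '/') !== 0) {
        return null;                                   // URL lies outside the prefix
    }
    $relative = ltrim((string) substr($url, strlen($prefix)), '/');
    $relative = preg_replace('~[?#].*$~', '', $relative);   // drop query and fragment
    if (preg_match('~(^|/)\.\.(/|$)~', $relative)) {
        return null;                                   // refuse to escape $directory
    }
    if ($relative === '' || substr($relative, -1) === '/') {
        $relative .= 'index.html';                     // assumption: directory URLs become index.html
    }
    return rtrim($directory, '/') . '/' . $relative;
}

/** Writes $content to the mapped path, creating directories as needed. */
function storeAsset($url, $content, $prefix, $directory)
{
    $path = storagePath($url, $prefix, $directory);
    if ($path === null) {
        return false;
    }
    $dir = dirname($path);
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        return false;
    }
    return file_put_contents($path, $content) !== false;
}

$path = storagePath(
    'https://example.com/archive/css/site.css',
    'https://example.com/archive',
    '/tmp/output'
);
// $path is '/tmp/output/css/site.css'
```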
UrlRewriteProcessor
Changes the URL of an asset to another prefix. Use this to let CssProcessor resolve relative URLs from a different origin.
The UrlRewriteProcessor does not enqueue other URLs.
Also, it does not modify the asset's content – only its URL.
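The effect can be pictured with a tiny sketch. It assumes, purely for illustration, that an asset is an array with `url` and `content` keys; `rewriteAssetUrl()` is a hypothetical name and not part of the package.

```php
<?php

// Hypothetical asset shape: ['url' => string, 'content' => string].
function rewriteAssetUrl(array $asset, $fromPrefix, $toPrefix)
{
    // Only URLs that start with the old prefix are moved to the new one.
    if (strpos($asset['url'], $fromPrefix) === 0) {
        $asset['url'] = $toPrefix . substr($asset['url'], strlen($fromPrefix));
    }
    return $asset;   // the content is returned unchanged
}

$asset = [
    'url'     => 'https://example.org/css/site.css',
    'content' => 'body { background: url(../img/bg.png); }',
];
$moved = rewriteAssetUrl($asset, 'https://example.org', 'https://example.com/archive');
// $moved['url'] is 'https://example.com/archive/css/site.css'
```

A CSS step that runs afterwards would then resolve `../img/bg.png` against the new URL, which is the use case described above.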
URL Handlers
URL handlers tell the spider whether to download and process a URL. There are the following built-in URL handlers:
OriginUrlHandler
Handles only URLs coming from a given origin, e.g. "https://example.org".
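An origin is the combination of scheme, host, and port, so a plain string-prefix check is not quite enough ("https://example.org.evil.test" starts with "https://example.org"). The sketch below compares parsed origins instead; `originOf()` and `hasOrigin()` are hypothetical helpers, not the package's implementation.

```php
<?php

// Hypothetical helpers. Default ports are not normalized in this sketch,
// so "https://example.org:443" and "https://example.org" count as different.
function originOf($url)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;                       // relative or malformed URL
    }
    $origin = strtolower($parts['scheme']) . '://' . strtolower($parts['host']);
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }
    return $origin;
}

function hasOrigin($url, $origin)
{
    $actual = originOf($url);
    return $actual !== null && $actual === originOf($origin);
}

$sameOrigin  = hasOrigin('https://example.org/news/a.html', 'https://example.org');    // true
$otherOrigin = hasOrigin('https://example.org.evil.test/', 'https://example.org');     // false
```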
BlacklistUrlHandler
Rejects URLs that match a blacklist. You can use glob patterns to provide a blacklist:
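Glob matching against URLs can be tried out with PHP's built-in `fnmatch()`. This is a sketch under assumptions: `isBlacklisted()` is a hypothetical helper, the patterns are made up, and the package's own glob dialect may differ from `fnmatch()`.

```php
<?php

// Hypothetical helper built on PHP's fnmatch(); not the package's handler.
function isBlacklisted($url, array $patterns)
{
    foreach ($patterns as $pattern) {
        // Without the FNM_PATHNAME flag, "*" also matches "/",
        // so a single star can span several path segments.
        if (fnmatch($pattern, $url)) {
            return true;
        }
    }
    return false;
}

$blacklist = ['*.php', 'https://example.org/admin/*'];

$skipPhp   = isBlacklisted('https://example.org/index.php', $blacklist);       // true
$skipAdmin = isBlacklisted('https://example.org/admin/users/1', $blacklist);   // true
$keepHtml  = isBlacklisted('https://example.org/index.html', $blacklist);      // false
```

Note that a pattern has to match the whole URL, so `*.php` does not match `https://example.org/index.php?page=2`; a pattern such as `*.php?*` would be needed for that.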
Alternatives
If this project does not match your needs, check the following other projects:
- spatie/crawler (Requires PHP 7)
- vdb/php-spider