Download the PHP package gyaaniguy/pcrawl without Composer
On this page you can find all versions of the php package gyaaniguy/pcrawl. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Package: pcrawl
Short Description: PHP web scraping and crawling library with support for multiple clients, fast parsing, debugging, and on-the-fly changes to various options
License: BSD-4-Clause
Information about the package pcrawl
This is in alpha stage.
PCrawl
PCrawl is a PHP library for crawling and scraping web pages.
It supports multiple clients (curl and Guzzle) and provides options to debug, modify, and parse responses.
Features
- Rapidly create custom clients. Fluently change clients and client options, such as the user agent, with method chaining.
- Responses can be modified using reusable callback functions.
- Debug responses using different criteria: HTTP code, regex, etc.
- Parse responses using the QueryPath library. Several convenience functions are provided.
- Fluent API. Different debuggers, clients, and response modifier objects can be changed on the fly.
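The method chaining mentioned in the features can be illustrated with a minimal sketch. `SketchClient` and its setters are hypothetical names invented for this illustration; they are not PCrawl classes, and the real client API may differ. The only point shown is the pattern: each setter returns `$this`, so options can be changed in one fluent expression.

```php
<?php
// Hypothetical sketch: SketchClient is NOT part of PCrawl.
final class SketchClient
{
    /** @var array<string, mixed> Current client options. */
    private array $options = ['user_agent' => 'sketch/0.1', 'redirects' => 0];

    public function setUserAgent(string $userAgent): self
    {
        $this->options['user_agent'] = $userAgent;
        return $this; // returning $this is what enables chaining
    }

    public function setRedirects(int $max): self
    {
        $this->options['redirects'] = $max;
        return $this;
    }

    /** @return array<string, mixed> */
    public function getOptions(): array
    {
        return $this->options;
    }
}

// Change several options in a single chained expression.
$client = (new SketchClient())
    ->setUserAgent('Mozilla/5.0 (sketch)')
    ->setRedirects(5);

print_r($client->getOptions());
```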
Full Example
We'll try to fetch a bad page, detect the problem using a debugger, and finally change client options to fetch the page correctly.
- Set up some clients
- Let's make some debugger objects
Start fetching!
For testing, we will fetch a page with a client that does not support redirects, then use the redirectDetector to detect whether a redirect occurred.
- If so, we change the client option to support redirects and fetch again.
Use the fullPageDetector to detect whether the page loaded properly.
Then parse the response body using Parser.
Note: the debuggers, clients, and parsers can be reused.
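The flow described in this example can be sketched without PCrawl or network access. Everything below is an assumption made for illustration: `FAKE_SITE`, `fakeFetch`, and `makeDetector` are hypothetical helpers, the list of redirect codes and the `</html>` regex are guesses at what such detectors might check, and PHP's built-in `DOMDocument` stands in for the library's QueryPath-based parser.

```php
<?php
// Hypothetical sketch; none of these names are PCrawl's API.

// A canned "site": the http URL redirects, the https URL serves a full page.
const FAKE_SITE = [
    'http://example.test/'  => ['code' => 301, 'location' => 'https://example.test/', 'body' => ''],
    'https://example.test/' => ['code' => 200, 'location' => null,
        'body' => '<html><head><title>Hello</title></head><body></body></html>'],
];

/** Fetch from the canned site, following at most $maxRedirects redirects. */
function fakeFetch(string $url, int $maxRedirects): array
{
    $response = FAKE_SITE[$url];
    while ($response['location'] !== null && $maxRedirects-- > 0) {
        $response = FAKE_SITE[$response['location']];
    }
    return $response;
}

/** Build a reusable detector that returns true when a response looks bad. */
function makeDetector(array $badCodes = [], array $requiredRegexes = []): callable
{
    return function (array $response) use ($badCodes, $requiredRegexes): bool {
        if (in_array($response['code'], $badCodes, true)) {
            return true; // a forbidden HTTP code was returned
        }
        foreach ($requiredRegexes as $regex) {
            if (preg_match($regex, $response['body']) !== 1) {
                return true; // a required pattern is missing from the body
            }
        }
        return false;
    };
}

$redirectDetector = makeDetector([301, 302, 303, 307, 308]);
$fullPageDetector = makeDetector([], ['#</html>#']);

$url = 'http://example.test/';
$response = fakeFetch($url, 0);        // client without redirect support
$sawRedirect = $redirectDetector($response);
if ($sawRedirect) {
    $response = fakeFetch($url, 5);    // same request with redirects enabled
}

$title = null;
if (!$fullPageDetector($response)) {   // parse only when the page looks complete
    libxml_use_internal_errors(true);  // keep markup warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($response['body']);
    $title = $dom->getElementsByTagName('title')->item(0)->textContent;
}
echo $title, PHP_EOL;
```

Because the detectors are plain closures, the same two objects can be applied to any number of responses, which mirrors the reuse described in the note.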
Detailed Usage
Usage can be divided into four parts:
- Fetching a page
- Modifying the response body
- Debugging the response
- Parsing the response body
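As an illustration of the "modifying the response body" step, here is a minimal sketch of reusable callbacks applied in sequence. `applyModifiers`, `$stripScripts`, and `$collapseWhitespace` are hypothetical names chosen for this example and are not taken from PCrawl; the library's own mechanism for registering callbacks may look different.

```php
<?php
// Hypothetical sketch; the names are illustrative, not PCrawl's API.

/** Apply each modifier, in order, to the response body. */
function applyModifiers(string $body, callable ...$modifiers): string
{
    foreach ($modifiers as $modifier) {
        $body = $modifier($body);
    }
    return $body;
}

// Reusable callbacks: each takes a body and returns a changed body.
$stripScripts = fn(string $body): string =>
    preg_replace('#<script\b[^>]*>.*?</script>#is', '', $body);
$collapseWhitespace = fn(string $body): string =>
    trim(preg_replace('/\s+/', ' ', $body));

$raw = "<p>Hi</p>\n<script>track();</script>\n  <p>there</p>";
$clean = applyModifiers($raw, $stripScripts, $collapseWhitespace);
echo $clean, PHP_EOL;
```

Keeping each modifier as a standalone callable means the same callback can be attached to several clients or pipelines without duplication.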
Installation
- Composer: composer require gyaaniguy/pcrawl
- GitHub:
TODO list
- Leverage guzzlehttp asynchronous support
Standards
All versions of pcrawl with dependencies
guzzlehttp/guzzle Version ^7.5
gravitypdf/querypath Version ^3.0
ext-curl Version *
ext-json Version *