Download the PHP package fievel/webspider without Composer
On this page you can find all versions of the php package fievel/webspider. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Download fievel/webspider
More information about fievel/webspider
Files in fievel/webspider
Package webspider
Short Description webspider
License MIT
Homepage https://github.com/Fievel90/WebSpider
Informations about the package webspider
WebSpider
This repository wraps Guzzle and some Symfony components providing an easy way for spidering websites.
Requirements
- PHP >=5.5
- Guzzle >= 6.0
- Doctrine ORM >= 2.2
- Symfony Components >= 2.7
Installation
Add fievel/webspider
as a require dependency in your composer.json
file:
composer require fievel/webspider
Usage
Extend class WebSpiderAbstract
as needed implementing these methods:
getDataFromResponse: used to extract data from response, default behaviour treats body as plain text;
protected function getDataFromResponse(ResponseInterface $response)
{
return (string) $response->getBody();
}
parseData: used to extract data information, it's possible to initialize Symfony DomCrawler
if needed;
protected function parseData($data)
{
$this->crawler->addHtmlContent($data);
$node = $this->crawler->filter('input');
$value = null;
if ($node->count() > 0) {
$value = $node->first()->attr('value');
}
return $value;
}
handleException: used to handle Guzzle exceptions;
protected function handleException(\Exception $e)
{
return null;
}
The only remaining thing to do is launch the spider created, in order to do that you can use the SpiderManager
service.
$manager = $this->container->get('fievel_web_spider.manager.spider');
$manager->setLogger($this->logger);
$response = null;
try {
$response = $manager->runSpider([
AppBundle\Spiders\CustomSpider::class, // Spider class created
'http://localhost/test-spider', // URL to spidering
'post', // Http method supported by Guzzle
['cookies' => true], // Custom config supported by Guzzle Client
[ // Custom options supported by Guzzle Client
RequestOptions::FORM_PARAMS => [
'full_name' => 'John Doe'
]
]
]);
} catch(\Exception $e) {
}
Features
It's possible to share a storage between subsequent spiders call.
$storage = new SpiderStorage();
$storage->add($sharedData);
$response = $manager->runSpider([
AppBundle\Spiders\CustomSpider::class, // Spider class created
'http://localhost/test-spider', // URL to spidering
'post', // Http method supported by Guzzle
['cookies' => true], // Custom config supported by Guzzle Client
[ // Custom options supported by Guzzle Client
RequestOptions::FORM_PARAMS => [
'full_name' => 'John Doe'
]
],
$storage // Shared storage
]);
It's even possible to create queues and leave the entire execution to the manager.
$queue = new SpiderCallQueue();
$queue->enqueue(
AppBundle\Spiders\FirstPageSpider::class,
'http://localhost/test-spider',
'post',
['cookies' => true],
[
RequestOptions::FORM_PARAMS => [
'full_name' => 'John Doe'
]
]
);
$queue->enqueue(
AppBundle\Spiders\SecondPageSpider::class,
'http://localhost/test-spider',
'get',
['cookies' => true],
[]
);
$response = $manager->runSpiderQueue($queue);
Last but not least, the SpiderManager
will handle retries on failure using a custom GuzzleMiddleware
.
Proxy
Links
All versions of webspider with dependencies
doctrine/orm Version >=2.2
guzzlehttp/guzzle Version ^6.0
symfony/css-selector Version >=2.7
symfony/dom-crawler Version >=2.7
symfony/framework-bundle Version >=2.7