Download the PHP package terminal42/escargot without Composer

On this page you can find all versions of the PHP package terminal42/escargot. You can download/install these versions without Composer; any dependencies are resolved automatically.

FAQ

After the download, you only have to add a single include, require_once('vendor/autoload.php');. After that you can import the package's classes with use statements.
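For example, a minimal bootstrap script could look like this (the Escargot class is used purely for illustration):

    <?php

    // Alias one of the package's classes; it is resolved on first use.
    use Terminal42\Escargot\Escargot;

    // The downloaded archive ships a Composer-compatible autoloader.
    require_once 'vendor/autoload.php';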

Example: if you use only one package, a project is not needed. But if you use more than one package, it is not possible to import the classes with use statements without a project.

In general, it is recommended to always use a project to download your libraries, since an application normally needs more than one library.
Some PHP packages are not free to download and are therefore hosted in private repositories. In this case, credentials are needed to access such packages. Please use the auth.json textarea to insert credentials if a package comes from a private repository. You can look here for more information.

  • Some hosting environments are not accessible via terminal or SSH, so Composer cannot be used there.
  • Using Composer can be complicated, especially for beginners.
  • Composer needs a lot of resources, which are sometimes not available on a simple webspace.
  • If you are using private repositories, you don't need to share your credentials. You can set everything up on our site and then provide a simple download link to your team members.
  • Simplify your Composer build process: use our command line tool to download the vendor folder as a binary. This makes your build process faster, and you don't need to expose your credentials for private repositories.

Information about the package escargot

Escargot - a Symfony HttpClient based Crawler framework


A library that provides everything you need to crawl anything based on HTTP and process the responses in whatever way you prefer based on Symfony components.

Why yet another crawler?

There are so many different implementations in so many programming languages, right? Well, the ones I found in PHP did not really live up to my personal quality standards, and I also wanted something that's built on top of the Symfony HttpClient component and is not bound to crawling websites (HTML) only, but can be used as the foundation for anything you may want to crawl. Hence, yet another library.

What about that name «Escargot»?

When I created this library I didn't want to name it «crawler» or «spider» or anything similar that's been used hundreds of times before. So I started to think about things that actually crawl and one thing that came to my mind immediately were snails. But «snail» doesn't really sound super beautiful and so I just went with the French translation for it which is «escargot». There you go! Also French is a beautiful language anyway and in case you didn't know: tons of libraries in the PHP ecosystem were invented and are still maintained by French people so it's also some kind of tribute to the French PHP community (and Symfony one for that matter).

By the way: thanks to the Symfony HttpClient, Escargot is actually not slow at all ;-)

Installation
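If you do have Composer available, the installation is the standard one-liner:

    composer require terminal42/escargot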

Usage

Everything in Escargot is assigned to a job ID. The reason for this design is that crawling huge amounts of URIs can take a very long time, and chances are pretty high that you'll want to stop at some point and pick up where you left off. For that matter, every Escargot instance also needs a queue plus a base URI collection that defines where to start crawling.

Instantiating Escargot

When you do not have a job ID yet, the factory method is used as follows:
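A minimal sketch, assuming the BaseUriCollection and InMemoryQueue classes shipped with the library and a placeholder start URI:

    <?php

    use Nyholm\Psr7\Uri;
    use Terminal42\Escargot\BaseUriCollection;
    use Terminal42\Escargot\Escargot;
    use Terminal42\Escargot\Queue\InMemoryQueue;

    // One or more URIs to start crawling from (any PSR-7 UriInterface works).
    $baseUris = new BaseUriCollection();
    $baseUris->add(new Uri('https://www.terminal42.ch'));

    // The queue keeps track of every URI of this job.
    $queue = new InMemoryQueue();

    $escargot = Escargot::create($baseUris, $queue);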

In case you already have a job ID because you initiated crawling previously, no base URI collection is needed anymore; the job ID is passed instead (and again, providing your own HTTP client is completely optional):
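A corresponding sketch; $jobId is the ID obtained when the job was first created, and the queue must be one that actually knows the job:

    <?php

    use Terminal42\Escargot\Escargot;
    use Terminal42\Escargot\Queue\InMemoryQueue;

    // An in-memory queue only survives within the same process, so a
    // persistent queue implementation is the usual choice when resuming.
    $queue = new InMemoryQueue();

    // $jobId is the ID you got when the job was created.
    $escargot = Escargot::createFromJobId($jobId, $queue);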

The different queue implementations

As explained before, the queue is an essential part of Escargot: it keeps track of all the URIs that have already been requested, and it is also responsible for picking up where you left off based on a given job ID. You can create your own queue and store the information wherever you like by implementing the QueueInterface. This library ships with implementations for you to use, including an in-memory queue (InMemoryQueue) and a database-backed queue (DoctrineQueue).

Start crawling

After we have our Escargot instance, we can start crawling, which we do by calling the crawl() method:
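That is a single call which works through the queue:

    <?php

    // Processes the queue until it is empty (or a configured limit is hit).
    $escargot->crawl();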

Subscribers

You might be wondering how you can access the results of your crawl process. In Escargot, crawl() does not return anything; instead, everything is passed on to subscribers, which lets you decide exactly what you want to do with the results collected along the way. The flow of every request executed by Escargot is as follows and maps to the corresponding methods of the subscribers:

  1. Decide whether a request should be sent at all (if no subscriber asks for the request, none is executed):

    SubscriberInterface::shouldRequest()

  2. If a request was sent, wait for the first response chunk and decide whether the whole response body should be loaded:

    SubscriberInterface::needsContent()

  3. If the body was requested, the data is passed on to the subscribers once the last response chunk arrives:

    SubscriberInterface::onLastChunk()

Adding a subscriber is accomplished by implementing the SubscriberInterface and registering it using Escargot::addSubscriber():
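A skeleton could look roughly like this; the decision constants and method names follow the flow described above, but treat the exact signatures as an approximation for your installed version:

    <?php

    use Symfony\Contracts\HttpClient\ChunkInterface;
    use Symfony\Contracts\HttpClient\ResponseInterface;
    use Terminal42\Escargot\CrawlUri;
    use Terminal42\Escargot\Subscriber\SubscriberInterface;

    class MySubscriber implements SubscriberInterface
    {
        public function shouldRequest(CrawlUri $crawlUri): string
        {
            // Vote for every URI to be requested.
            return self::DECISION_POSITIVE;
        }

        public function needsContent(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk): string
        {
            // Only load the whole body for successful responses.
            return 200 === $response->getStatusCode()
                ? self::DECISION_POSITIVE
                : self::DECISION_NEGATIVE;
        }

        public function onLastChunk(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk): void
        {
            // The complete body is available here via $response->getContent().
        }
    }

    $escargot->addSubscriber(new MySubscriber());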

According to this flow, the SubscriberInterface asks you to implement the 3 corresponding methods: shouldRequest(), needsContent() and onLastChunk() (see the sketch above).

There are 2 other interfaces which you might want to implement in addition, but you don't have to; one of them, the TagValueResolvingSubscriberInterface, is covered in the «Tags» section below.

Tags

Sometimes you may want to add meta information to a CrawlUri instance so that other subscribers can decide what they want to do with this information, or because it may be relevant during another request. The RobotsSubscriber, for instance, tags CrawlUri instances when they contain a <meta name="robots" content="nofollow"> in the body or when the corresponding X-Robots-Tag header was set. None of the links found on such a URI are followed then, which is decided during the next shouldRequest() call.

There may be use cases where a tag is not enough. Let's say you had a subscriber that wants to add information to a CrawlUri instance which it actually has to load from the filesystem or over HTTP. Maybe no other subscriber ever uses that data, and how would you store all that information in the queue anyway? That's why tag values can be resolved lazily: call $escargot->resolveTagValue($tag) and Escargot asks all subscribers that implement the TagValueResolvingSubscriberInterface for the resolved value.

So if you want to provide lazily loaded information in your subscriber, just add a regular tag, say my-file-info, and implement the TagValueResolvingSubscriberInterface so that it returns the real value once anybody asks for the value of that my-file-info tag.

In other words, sometimes it's enough to only ask $crawlUri->hasTag('foobar-tag') and sometimes you may want to ask Escargot to resolve the tag value using $escargot->resolveTagValue('foobar-tag'). This totally depends on the subscriber.
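Sticking with the hypothetical my-file-info tag from above, a resolving subscriber might look roughly like this (in practice you would add the interface to a regular subscriber that also implements SubscriberInterface; the exact method signature is an assumption):

    <?php

    use Terminal42\Escargot\Subscriber\TagValueResolvingSubscriberInterface;

    class FileInfoResolvingSubscriber implements TagValueResolvingSubscriberInterface
    {
        public function resolveTagValue(string $tag): mixed
        {
            // Only answer for our own tag; other tags are none of our business.
            if ('my-file-info' !== $tag) {
                return null;
            }

            // The expensive work happens only when somebody actually asks.
            return file_get_contents('/path/to/file-info.txt');
        }
    }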

Crawling websites (HTML crawler)

When people read the word «crawl» or «crawler» they usually immediately think of crawling websites. Granted, this is also the main purpose of this library, but if you think about it, nothing you have learnt about Escargot so far was related to crawling websites or HTML. Escargot can crawl anything that's based on HTTP, and you could write a subscriber that extracts new URIs from, say, JSON responses and continues from there.

Awesome, isn't it?

To turn our Escargot instance into a proper web crawler, we can register the 2 subscribers shipped by default: the RobotsSubscriber and the HtmlCrawlerSubscriber.

Using them is done like so:
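Assuming the default subscriber class names, the registration is just two addSubscriber() calls:

    <?php

    use Terminal42\Escargot\Subscriber\HtmlCrawlerSubscriber;
    use Terminal42\Escargot\Subscriber\RobotsSubscriber;

    // Handles robots.txt, X-Robots-Tag headers and <meta name="robots"> tags.
    $escargot->addSubscriber(new RobotsSubscriber());

    // Extracts links from HTML responses and adds them to the queue.
    $escargot->addSubscriber(new HtmlCrawlerSubscriber());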

These two subscribers help us build our crawler, but we still need to add a subscriber that actually returns a positive decision on shouldRequest(); otherwise, no request will ever be executed. This is where you jump in and freely decide whether or not you want to respect the tags of previous subscribers. A possible solution could look like this:
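For instance, sketched under the assumption that the RobotsSubscriber exposes its robots.txt tag as a class constant (TAG_DISALLOWED_ROBOTS_TXT):

    <?php

    use Symfony\Contracts\HttpClient\ChunkInterface;
    use Symfony\Contracts\HttpClient\ResponseInterface;
    use Terminal42\Escargot\CrawlUri;
    use Terminal42\Escargot\Subscriber\RobotsSubscriber;
    use Terminal42\Escargot\Subscriber\SubscriberInterface;

    class MyWebCrawler implements SubscriberInterface
    {
        public function shouldRequest(CrawlUri $crawlUri): string
        {
            // Respect the tag the RobotsSubscriber sets for disallowed URIs.
            if ($crawlUri->hasTag(RobotsSubscriber::TAG_DISALLOWED_ROBOTS_TXT)) {
                return self::DECISION_NEGATIVE;
            }

            return self::DECISION_POSITIVE;
        }

        public function needsContent(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk): string
        {
            return self::DECISION_POSITIVE;
        }

        public function onLastChunk(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk): void
        {
            // Do whatever you want with the result, e.g. index the content.
        }
    }

    $escargot->addSubscriber(new MyWebCrawler());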

You now have a full-fledged web crawler. It's now up to you to decide which tags of the different subscribers you want to respect (or don't care about) and what you actually want to do with the results.

Logging in subscribers

Of course you can always use dependency injection and inject whatever logger service you want into your subscriber. However, there is also a general logger you can pass to Escargot using Escargot::withLogger(). Having all subscribers log to this central logger (exclusively, or in addition to one you injected yourself) ensures that Escargot has one central place where all subscribers log their information. In 99% of all use cases, we want to know what happened, which subscriber it concerns and, ideally, which CrawlUri was being processed.

This is why Escargot automatically passes the logger instance you configured using Escargot::withLogger() on to every subscriber that implements the PSR-3 LoggerAwareInterface. It internally decorates the logger so the subscriber concerned is already known and you don't have to deal with that yourself:
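A sketch of the wiring; Monolog is used purely as an example of a PSR-3 logger, and the immutable with*() style is assumed:

    <?php

    use Monolog\Handler\StreamHandler;
    use Monolog\Logger;

    // Any PSR-3 logger works; Monolog is just an example.
    $logger = new Logger('escargot', [new StreamHandler('php://stdout')]);

    // Subscribers implementing LoggerAwareInterface automatically receive
    // a decorated version of this logger that knows which subscriber logs.
    $escargot = $escargot->withLogger($logger);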

Because the logger is decorated automatically, every entry will eventually end up in the logger you configured using Escargot::withLogger(), together with a PSR-3 $context array that contains ['source' => 'MyWebCrawler'].

To make things easy when you also want the CrawlUri instance to be passed along in the PSR-3 $context array, use the SubscriberLoggerTrait as follows:
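A sketch, assuming the trait lives in the package's root namespace and provides a logWithCrawlUri() helper:

    <?php

    use Psr\Log\LoggerAwareInterface;
    use Psr\Log\LoggerAwareTrait;
    use Psr\Log\LogLevel;
    use Symfony\Contracts\HttpClient\ChunkInterface;
    use Symfony\Contracts\HttpClient\ResponseInterface;
    use Terminal42\Escargot\CrawlUri;
    use Terminal42\Escargot\Subscriber\SubscriberInterface;
    use Terminal42\Escargot\SubscriberLoggerTrait;

    class MyWebCrawler implements SubscriberInterface, LoggerAwareInterface
    {
        use LoggerAwareTrait;
        use SubscriberLoggerTrait;

        public function shouldRequest(CrawlUri $crawlUri): string
        {
            // Adds the CrawlUri instance to the PSR-3 $context array for us.
            $this->logWithCrawlUri($crawlUri, LogLevel::DEBUG, 'Deciding whether to request this URI.');

            return self::DECISION_POSITIVE;
        }

        public function needsContent(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk): string
        {
            return self::DECISION_POSITIVE;
        }

        public function onLastChunk(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk): void
        {
        }
    }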

Configuration

There are different configurations you can apply to the Escargot instance:
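A sketch of the kind of options available; the exact method names (withConcurrency() etc.) should be verified against your installed version:

    <?php

    // All with*() methods return a new, re-configured Escargot instance.
    $escargot = $escargot
        ->withConcurrency(10)       // number of parallel requests
        ->withMaxRequests(1000)     // hard limit of total requests
        ->withMaxDepth(5)           // how deep to follow discovered URIs
        ->withUserAgent('MyCrawler/1.0');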

Projects that use Escargot

Attributions

Roadmap / Ideas


All versions of escargot with dependencies

Requires:

  • php: ^8.1
  • ext-simplexml: *
  • nyholm/psr7: ^1.1
  • psr/http-message: ^1.0 || ^2.0
  • psr/log: ^1.1 || ^2.0 || ^3.0
  • symfony/clock: ^6.2 || ^7.0
  • symfony/dom-crawler: ^5.4 || ^6.0 || ^7.0
  • symfony/event-dispatcher: ^5.4 || ^6.0 || ^7.0
  • symfony/http-client: ^5.4 || ^6.0 || ^7.0
  • webignition/robots-txt-file: ^3.0
