Download the PHP package spatie/crawler without Composer

On this page you can find all versions of the php package spatie/crawler. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.

FAQ

After the download, you have to add one include: require_once('vendor/autoload.php');. After that, import the classes you need with use statements.
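As a minimal sketch of that include-then-import sequence, assuming the downloaded vendor folder sits in the same directory as the script (the path is an assumption about your layout):

```php
<?php

// Load the autoloader shipped in the downloaded vendor folder.
// Assumption: vendor/ lives next to this script.
require_once __DIR__ . '/vendor/autoload.php';

// Import the class so it can be referenced by its short name.
use Spatie\Crawler\Crawler;

$crawler = Crawler::create();
```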

Example:
If you use only one package, a project is not needed. But if you use more than one package, it is not possible to import the classes with use statements without a project.

In general, it is recommended to always use a project to download your libraries, because an application normally needs more than one library.
Some PHP packages are not free to download and because of that hosted in private repositories. In this case some credentials are needed to access such packages. Please use the auth.json textarea to insert credentials, if a package is coming from a private repository. You can look here for more information.

  • Some hosting areas are not accessible by a terminal or SSH. Then it is not possible to use Composer.
  • Using Composer is sometimes complicated, especially for beginners.
  • Composer needs a lot of resources. Sometimes they are not available on a simple webspace.
  • If you are using private repositories you don't need to share your credentials. You can set up everything on our site and then you provide a simple download link to your team member.
  • Simplify your Composer build process. Use our own command line tool to download the vendor folder as binary. This makes your build process faster and you don't need to expose your credentials for private repositories.

Information about the package crawler

🕸 Crawl the web using PHP 🕷


This package provides a class to crawl links on a website. Under the hood Guzzle promises are used to crawl multiple urls concurrently.

Because the crawler can execute JavaScript, it can crawl JavaScript rendered sites. Under the hood Chrome and Puppeteer are used to power this feature.

Support us

We invest a lot of resources into creating best in class open source packages. You can support us by buying one of our paid products.

We highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using. You'll find our address on our contact page. We publish all received postcards on our virtual postcard wall.

Installation

This package can be installed via Composer by running composer require spatie/crawler.

Usage

The crawler can be instantiated like this
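A minimal sketch of instantiating the crawler with one observer. The anonymous observer and its echo output are illustrative, and the method signatures assume the 8.x observer API with optional $foundOnUrl and $linkText arguments:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

// Hypothetical observer that prints what happened to each URL.
$observer = new class extends CrawlObserver {
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        echo "Crawled: {$url}" . PHP_EOL;
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        echo "Failed: {$url}" . PHP_EOL;
    }
};

$crawler = Crawler::create()
    ->setCrawlObserver($observer);
```

Calling $crawler->startCrawling('https://example.com') would then start the crawl; the URL is a placeholder.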

The argument passed to setCrawlObserver must be an object that extends the \Spatie\Crawler\CrawlObservers\CrawlObserver abstract class:
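A sketch of what such an observer could look like. PageLogger is a hypothetical class, and the hook names and signatures are written against the 8.x abstract class as an assumption, so check them against the version you installed:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

// Hypothetical observer that collects successfully crawled URLs.
final class PageLogger extends CrawlObserver
{
    /** @var string[] */
    public array $crawledUrls = [];

    // Called before a URL is requested.
    public function willCrawl(UriInterface $url, ?string $linkText = null): void
    {
    }

    // Called after a URL was crawled successfully.
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        $this->crawledUrls[] = (string) $url;
    }

    // Called when a request failed.
    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        error_log("Could not crawl {$url}: " . $requestException->getMessage());
    }

    // Called once when the crawl has ended.
    public function finishedCrawling(): void
    {
        echo count($this->crawledUrls) . ' pages crawled' . PHP_EOL;
    }
}

$crawler = Crawler::create()
    ->setCrawlObserver(new PageLogger());
```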

Using multiple observers

You can set multiple observers with setCrawlObservers:
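For illustration, a sketch that registers two observers at once. LabelledObserver is a hypothetical class introduced only for this example:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

// Hypothetical observer that prefixes its output with a label.
final class LabelledObserver extends CrawlObserver
{
    public function __construct(private string $label)
    {
    }

    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        echo "[{$this->label}] crawled {$url}" . PHP_EOL;
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        echo "[{$this->label}] failed {$url}" . PHP_EOL;
    }
}

// Pass all observers in a single array.
$crawler = Crawler::create()
    ->setCrawlObservers([
        new LabelledObserver('log'),
        new LabelledObserver('report'),
    ]);
```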

Alternatively you can set multiple observers one by one with addCrawlObserver:
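The one-by-one variant might look like the sketch below. TaggedObserver is a hypothetical class used only to have something to register:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

// Hypothetical observer that tags its output.
final class TaggedObserver extends CrawlObserver
{
    public function __construct(private string $tag)
    {
    }

    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        echo "[{$this->tag}] crawled {$url}" . PHP_EOL;
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        echo "[{$this->tag}] failed {$url}" . PHP_EOL;
    }
}

// Each call appends one more observer to the crawler.
$crawler = Crawler::create()
    ->addCrawlObserver(new TaggedObserver('first'))
    ->addCrawlObserver(new TaggedObserver('second'));
```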

Executing JavaScript

By default, the crawler will not execute JavaScript. This is how you can enable the execution of JavaScript:
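A minimal sketch of turning JavaScript execution on:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;

// Pages will be rendered through Browsershot before links are extracted.
$crawler = Crawler::create()
    ->executeJavaScript();
```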

To get the body HTML after JavaScript has been executed, this package depends on our Browsershot package, which uses Puppeteer under the hood. Here are some pointers on how to install it on your system.

Browsershot will make an educated guess as to where its dependencies are installed on your system. By default, the Crawler will instantiate a new Browsershot instance. You may find the need to set a custom created instance using the setBrowsershot(Browsershot $browsershot) method.
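A sketch of passing a preconfigured Browsershot instance. The node binary path is an assumption about the local system and should be replaced with your own:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Browsershot\Browsershot;
use Spatie\Crawler\Crawler;

// Assumption: node is installed at this location.
$browsershot = (new Browsershot())
    ->setNodeBinary('/usr/local/bin/node');

$crawler = Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript();
```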

Note that the crawler will still work even if you don't have the system dependencies required by Browsershot. These system dependencies are only required if you're calling executeJavaScript().

Filtering certain urls

You can tell the crawler not to visit certain urls by using the setCrawlProfile method. That method expects an object that extends Spatie\Crawler\CrawlProfiles\CrawlProfile:
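As an illustration, a hypothetical profile that skips everything under /admin. The class name and the path rule are made up for this sketch; shouldCrawl is assumed to be the single method a profile has to implement:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

// Hypothetical profile: crawl every URL except those under /admin.
final class SkipAdminPages extends CrawlProfile
{
    public function shouldCrawl(UriInterface $url): bool
    {
        return ! str_starts_with($url->getPath(), '/admin');
    }
}

$crawler = Crawler::create()
    ->setCrawlProfile(new SkipAdminPages());
```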

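This package comes with three CrawlProfiles out of the box:

  • CrawlAllUrls: crawls all urls on all pages, including urls to external sites.
  • CrawlInternalUrls: only crawls the internal urls on the pages of a host.
  • CrawlSubdomains: only crawls the internal urls of a host and its subdomains.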

Custom link extraction

You can customize how links are extracted from a page by passing a custom UrlParser to the crawler.

By default, the LinkUrlParser is used. This parser extracts all links from the href attribute of anchor (a) tags.

There is also a built-in SitemapUrlParser that will extract & crawl all links from a sitemap. It does support sitemap index files.
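A sketch of switching to the sitemap parser. It assumes the parser is selected by class name through setUrlParserClass and that SitemapUrlParser lives in the Spatie\Crawler\UrlParsers namespace, so verify both against your installed version:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;
use Spatie\Crawler\UrlParsers\SitemapUrlParser;

// Assumption: the crawler takes the parser's class name, not an instance.
$crawler = Crawler::create()
    ->setUrlParserClass(SitemapUrlParser::class);
```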

Ignoring robots.txt and robots meta

By default, the crawler will respect robots data. It is possible to disable these checks like so:
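A minimal sketch of disabling the robots checks:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;

// robots.txt, robots meta tags and robots headers are no longer consulted.
$crawler = Crawler::create()
    ->ignoreRobots();
```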

Robots data can come from either a robots.txt file, meta tags or response headers. More information on the spec can be found here: http://www.robotstxt.org/.

Parsing robots data is done by our package spatie/robots-txt.

Accept links with rel="nofollow" attribute

By default, the crawler will reject all links containing attribute rel="nofollow". It is possible to disable these checks like so:
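A minimal sketch of accepting such links:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;

// Links marked rel="nofollow" are queued like any other link.
$crawler = Crawler::create()
    ->acceptNofollowLinks();
```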

Using a custom User Agent

In order to respect robots.txt rules for a custom User Agent you can specify your own custom User Agent.
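A minimal sketch, using 'my-agent' as a placeholder User Agent name:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;

// robots.txt rule groups are matched against this User Agent.
$crawler = Crawler::create()
    ->setUserAgent('my-agent');
```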

You can add your specific crawl rule group for 'my-agent' in robots.txt. For example, a group consisting of the line User-agent: my-agent followed by the line Disallow: / disallows crawling the entire site for crawlers identified by 'my-agent'.

Setting the number of concurrent requests

To improve the speed of the crawl the package concurrently crawls 10 urls by default. If you want to change that number you can use the setConcurrency method.
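A minimal sketch that lowers the concurrency to a single request at a time; the value 1 is only an example:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;

// Crawl urls one by one instead of 10 at a time.
$crawler = Crawler::create()
    ->setConcurrency(1);
```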

Defining Crawl Limits

By default, the crawler continues until it has crawled every page it can find. This behavior might cause issues if you are working in an environment with limitations such as a serverless environment.

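The crawl behavior can be controlled with the following two options:

  • Total crawl limit (setTotalCrawlLimit): limits the total number of URLs to crawl, across all executions.
  • Current crawl limit (setCurrentCrawlLimit): limits how many URLs are crawled during the current execution.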

Let's take a look at some examples to clarify the difference between these two methods.

Example 1: Using the total crawl limit

The setTotalCrawlLimit method allows you to limit the total number of URLs to crawl, no matter how often you call the crawler.
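A sketch of the total limit across two runs. It assumes both runs share one queue (here the built-in ArrayCrawlQueue); the URL is a placeholder:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;

$url = 'https://example.com'; // placeholder
$queue = new ArrayCrawlQueue();

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further, because the total limit has been reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);
```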

Example 2: Using the current crawl limit

The setCurrentCrawlLimit method sets a limit on how many URLs will be crawled per execution. This piece of code will process 5 pages with each execution, without a total limit of pages to crawl.
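A sketch of the per-execution limit, again assuming a shared ArrayCrawlQueue and a placeholder URL:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;

$url = 'https://example.com'; // placeholder
$queue = new ArrayCrawlQueue();

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);
```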

Example 3: Combining the total and current crawl limit

Both limits can be combined to control the crawler:
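A sketch combining a total limit of 10 with a per-execution limit of 5. The helper function, the shared queue and the URL are assumptions made for the example:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;
use Spatie\Crawler\CrawlQueues\CrawlQueue;

// Hypothetical helper: one execution with both limits applied.
function crawlBatch(CrawlQueue $queue, string $url): void
{
    Crawler::create()
        ->setCrawlQueue($queue)
        ->setTotalCrawlLimit(10)
        ->setCurrentCrawlLimit(5)
        ->startCrawling($url);
}

$url = 'https://example.com'; // placeholder
$queue = new ArrayCrawlQueue();

crawlBatch($queue, $url); // Crawls 5 URLs and ends.
crawlBatch($queue, $url); // Crawls the next 5 URLs and ends.
crawlBatch($queue, $url); // Doesn't crawl further, the total limit is reached.
```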

Example 4: Crawling across requests

You can use the setCurrentCrawlLimit to break up long running crawls. The following example demonstrates a (simplified) approach. It's made up of an initial request and any number of follow-up requests continuing the crawl.

Initial Request

To start crawling across different requests, you will need to create a new queue of your selected queue-driver. Start by passing the queue-instance to the crawler. The crawler will start filling the queue as pages are processed and new URLs are discovered. Serialize and store the queue reference after the crawler has finished (using the current crawl limit).
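A simplified sketch of the initial request. It uses the built-in ArrayCrawlQueue as the queue driver and a local file as storage; the file name and URL are assumptions:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;

$url = 'https://example.com';               // placeholder
$queueFile = __DIR__ . '/crawl-queue.ser';  // hypothetical storage location

// Create a queue using your queue driver.
$queue = new ArrayCrawlQueue();

// Crawl the first set of URLs.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store the queue for the next request.
file_put_contents($queueFile, serialize($queue));
```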

Subsequent Requests

For any following requests you will need to unserialize your original queue and pass it to the crawler:
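A simplified sketch of a follow-up request. It assumes an earlier request stored the serialized queue in the file named here; the file name and URL are assumptions:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;

$url = 'https://example.com';               // placeholder
$queueFile = __DIR__ . '/crawl-queue.ser';  // hypothetical storage location

// Unserialize the queue stored by the previous request.
$queue = unserialize(file_get_contents($queueFile));

// Crawl the next set of URLs.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Store the updated queue again.
file_put_contents($queueFile, serialize($queue));
```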

The behavior is based on the information in the queue, so it only works as described when the same queue instance is passed in. When a completely new queue is passed in, the limits of previous crawls won't apply, even for the same website.

An example with more details can be found here.

Setting the maximum crawl depth

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the depth of the crawler you can use the setMaximumDepth method.
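A minimal sketch with a depth of 2, chosen only as an example value:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;

// Do not follow links more than 2 levels away from the start URL.
$crawler = Crawler::create()
    ->setMaximumDepth(2);
```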

Setting the maximum response size

Most html pages are quite small. But the crawler could accidentally pick up on large files such as PDFs and MP3s. To keep memory usage low in such cases the crawler will only use the responses that are smaller than 2 MB. If, when streaming a response, it becomes larger than 2 MB, the crawler will stop streaming the response. An empty response body will be assumed.

You can change the maximum response size.
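A minimal sketch that raises the limit to 3 MB. The value is an example and is assumed to be expressed in bytes:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;

// Use a 3 MB limit (assumed to be given in bytes).
$crawler = Crawler::create()
    ->setMaximumResponseSize(1024 * 1024 * 3);
```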

Add a delay between requests

In some cases you might get rate-limited when crawling too aggressively. To circumvent this, you can use the setDelayBetweenRequests() method to add a pause between every request. This value is expressed in milliseconds.
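A minimal sketch with a pause of 150 milliseconds, chosen only as an example value:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;

// After every crawled page, wait 150 ms.
$crawler = Crawler::create()
    ->setDelayBetweenRequests(150);
```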

Limiting which content-types to parse

By default, every found page will be downloaded (up to setMaximumResponseSize() in size) and parsed for additional links. You can limit which content-types should be downloaded and parsed by calling setParseableMimeTypes() with an array of allowed types.
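A minimal sketch that restricts parsing to HTML and plain text; the two mime types are example values:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;

// Only responses with these content-types are downloaded and parsed.
$crawler = Crawler::create()
    ->setParseableMimeTypes(['text/html', 'text/plain']);
```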

This will prevent downloading the body of pages that have different mime types, like binary files, audio/video, ... that are unlikely to have links embedded in them. This feature mostly saves bandwidth.

Using a custom crawl queue

When crawling a site the crawler will put urls to be crawled in a queue. By default, this queue is stored in memory using the built-in ArrayCrawlQueue.

When a site is very large you may want to store that queue elsewhere, maybe a database. In such cases, you can write your own crawl queue.

A valid crawl queue is any class that implements the Spatie\Crawler\CrawlQueues\CrawlQueue-interface. You can pass your custom crawl queue via the setCrawlQueue method on the crawler.
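A sketch of handing a queue to the crawler. The built-in ArrayCrawlQueue stands in for a custom implementation here, since any object implementing the interface is passed the same way:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;

// Stand-in for your own class implementing the CrawlQueue interface.
$queue = new ArrayCrawlQueue();

$crawler = Crawler::create()
    ->setCrawlQueue($queue);
```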


Change the default base url scheme

By default, the crawler will set the base url scheme to http if none is given. You can change that with setDefaultScheme.
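A minimal sketch that switches the default scheme to https:

```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;

// URLs without a scheme are treated as https URLs.
$crawler = Crawler::create()
    ->setDefaultScheme('https');
```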

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Testing

First, install the Puppeteer dependency, or your tests will fail.

To run the tests you'll have to start the included node based server first in a separate terminal window.

With the server running, you can start testing.

Security

If you've found a bug regarding security please mail [email protected] instead of using the issue tracker.

Postcardware

You're free to use this package, but if it makes it to your production environment we highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using.

Our address is: Spatie, Kruikstraat 22, 2018 Antwerp, Belgium.

We publish all received postcards on our company website.

Credits

License

The MIT License (MIT). Please see License File for more information.


All versions of crawler with dependencies

Requirement               Version
php                       ^8.1
guzzlehttp/guzzle         ^7.3
guzzlehttp/psr7           ^2.0
illuminate/collections    ^10.0|^11.0
nicmart/tree              ^0.8.0
spatie/browsershot        ^3.45|^4.0
spatie/robots-txt         ^2.0
symfony/dom-crawler       ^6.0|^7.0
