Download the PHP package heimrichhannot/crawler without Composer

On this page you can find all versions of the PHP package heimrichhannot/crawler. You can download/install these versions without Composer; possible dependencies are resolved automatically.

FAQ

After the download, you only have to add a single include: require_once('vendor/autoload.php');. After that you can import the classes with use statements.

Example:
If you use only one package, a project is not needed. But if you use more than one package, you cannot import the classes with use statements without a project.
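
As a minimal sketch of the include-then-use workflow described above (assuming this fork keeps the upstream Spatie\Crawler namespace):

```php
<?php

// Load the downloaded vendor folder (one include is enough).
require_once('vendor/autoload.php');

// Import a class from the package with a use statement.
use Spatie\Crawler\Crawler;

$crawler = Crawler::create();
```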

In general, it is recommended to always use a project to download your libraries, since an application normally needs more than one library.
Some PHP packages are not free to download and are therefore hosted in private repositories. In this case, credentials are needed to access such packages. Please use the auth.json textarea to insert credentials if a package comes from a private repository. You can look here for more information.

  • Some hosting environments are not accessible via a terminal or SSH, so Composer cannot be used there.
  • Using Composer is sometimes complicated, especially for beginners.
  • Composer needs a lot of resources, which are sometimes not available on a simple webspace.
  • If you are using private repositories, you don't need to share your credentials. You can set up everything on our site and then provide a simple download link to your team members.
  • Simplify your Composer build process: use our own command line tool to download the vendor folder as a binary. This makes your build process faster, and you don't need to expose your credentials for private repositories.

Information about the package crawler

H&H Crawler

A fork of spatie/crawler v2 with some adjustments. Only used for an internal project.

Crawl links on a website


This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to crawl multiple URLs concurrently.

Because the crawler can execute JavaScript, it can crawl JavaScript-rendered sites. Under the hood, headless Chrome is used to power this feature.

Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.

Installation

This package can be installed via Composer:
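
The Composer command itself was stripped from this page; for this fork it would presumably be:

```bash
composer require heimrichhannot/crawler
```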

Usage

The crawler can be instantiated like this:
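
The original snippet is missing here; based on the upstream spatie/crawler README, instantiation looks roughly like this (MyCrawlObserver is a hypothetical observer, defined in the next section):

```php
use Spatie\Crawler\Crawler;

Crawler::create()
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://example.com');
```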

The argument passed to setCrawlObserver must be an object that implements the \Spatie\Crawler\CrawlObserver interface:
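
The interface definition itself did not survive extraction. A minimal observer might look like the following sketch, assuming signatures along the lines of upstream spatie/crawler; the exact method names in this fork may differ:

```php
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObserver;

// Hypothetical observer implementation; treat the signatures
// as a sketch rather than the fork's definitive API.
class MyCrawlObserver implements CrawlObserver
{
    // Called when the crawler is about to crawl the given url.
    public function willCrawl(UriInterface $url)
    {
        echo "About to crawl: {$url}\n";
    }

    // Called when the crawler has crawled the given url.
    public function hasBeenCrawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null
    ) {
        echo "Crawled: {$url}\n";
    }

    // Called when the crawl has ended.
    public function finishedCrawling()
    {
        echo "Done crawling.\n";
    }
}
```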

Executing JavaScript

By default the crawler will not execute JavaScript. This is how you can enable the execution of JavaScript:
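
The enabling call was lost in extraction; in the upstream README it is a single chained method, which presumably carries over to this fork:

```php
Crawler::create()
    ->executeJavaScript()
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://example.com');
```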

Under the hood headless Chrome is used to execute JavaScript. Here are some pointers on how to install it on your system.

The package will make an educated guess as to where Chrome is installed on your system. You can also manually pass the location of the Chrome binary to executeJavaScript():
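
A sketch of that call, with $pathToChrome standing in for wherever Chrome lives on your system:

```php
$pathToChrome = '/usr/bin/google-chrome'; // adjust to your system

Crawler::create()
    ->executeJavaScript($pathToChrome)
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://example.com');
```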

Filtering certain URLs

You can tell the crawler not to visit certain URLs by using the setCrawlProfile method. That method expects an object that implements the Spatie\Crawler\CrawlProfile interface:
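
The interface body is missing from this page; upstream it consists of a single shouldCrawl method, so a profile that skips a path might look like this sketch (SkipAdminPages is hypothetical):

```php
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfile;

// Hypothetical profile: crawl everything except urls under /admin.
class SkipAdminPages implements CrawlProfile
{
    // Determine if the given url should be crawled.
    public function shouldCrawl(UriInterface $url): bool
    {
        return strpos($url->getPath(), '/admin') !== 0;
    }
}
```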

This package comes with three CrawlProfiles out of the box:
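
The list itself is missing here; in upstream spatie/crawler the three bundled profiles are CrawlAllUrls, CrawlInternalUrls and CrawlSubdomains, and this fork presumably ships the same set. Restricting a crawl to internal links would then look like:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlInternalUrls;

Crawler::create()
    ->setCrawlProfile(new CrawlInternalUrls('https://example.com'))
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://example.com');
```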

Setting the number of concurrent requests

To improve the speed of the crawl, the package concurrently crawls 10 URLs by default. If you want to change that number, you can use the setConcurrency method:
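
The snippet for this is missing; per the upstream README it is a single chained call:

```php
Crawler::create()
    ->setConcurrency(1) // now all urls will be crawled one by one
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://example.com');
```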

Setting the maximum crawl count

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the number of URLs the crawler should crawl, you can use the setMaximumCrawlCount method:
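
Again the example was stripped; a sketch in the same chained style:

```php
Crawler::create()
    ->setMaximumCrawlCount(5) // stop after 5 urls have been crawled
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://example.com');
```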

Setting the maximum crawl depth

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the depth of the crawl, you can use the setMaximumDepth method:
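
A sketch of the corresponding call:

```php
Crawler::create()
    ->setMaximumDepth(2) // don't follow links deeper than 2 levels
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://example.com');
```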

Using a custom crawl queue

When crawling a site, the crawler will put URLs to be crawled in a queue. By default, this queue is stored in memory using the built-in CollectionCrawlQueue.

When a site is very large, you may want to store that queue elsewhere, maybe in a database. In such cases, you can write your own crawl queue.

A valid crawl queue is any class that implements the Spatie\Crawler\CrawlQueue\CrawlQueue interface. You can pass your custom crawl queue via the setCrawlQueue method on the crawler:
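
The wiring example was stripped from this page; per the upstream README, a custom queue is passed like this (MyDatabaseCrawlQueue is a hypothetical implementation of the CrawlQueue interface):

```php
Crawler::create()
    ->setCrawlQueue(new MyDatabaseCrawlQueue())
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://example.com');
```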

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Testing

To run the tests, you'll have to start the included Node-based server first in a separate terminal window.
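
The commands were lost from this page; based on the upstream repository layout, starting the test server presumably looks like:

```bash
cd tests/server
npm install
node server.js
```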

With the server running, you can start testing.
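
The test command itself did not survive extraction; presumably it is the usual PHPUnit invocation:

```bash
vendor/bin/phpunit
```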

Security

If you discover any security-related issues, please email [email protected] instead of using the issue tracker.

Postcardware

You're free to use this package, but if it makes it to your production environment we highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using.

Our address is: Spatie, Samberstraat 69D, 2060 Antwerp, Belgium.

We publish all received postcards on our company website.

Credits

Support us

Spatie is a webdesign agency based in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.

Does your business depend on our contributions? Reach out and support us on Patreon. All pledges will be dedicated to allocating workforce on maintenance and new awesome stuff.

License

The MIT License (MIT). Please see License File for more information.


All versions of crawler with dependencies

Requires:

  • php: ^7.0 || ^8.0
  • symfony/dom-crawler: ^3.0 || ^4.0 || ^5.0 || ^6.0
  • guzzlehttp/guzzle: ^6.3 || ^7.0.1
  • tightenco/collect: >=5.3, <10
  • nicmart/tree: ^0.2.7
  • spatie/browsershot: ^2.4 || ^3.14
Composer command for our command line client (download client). This client runs in every environment; you don't need a specific PHP version, etc. The first 20 API calls are free.
