Download the PHP package scrapy/scrapy without Composer

On this page you can find all versions of the PHP package scrapy/scrapy. You can download and install these versions without Composer; any dependencies are resolved automatically.

FAQ

After the download, you have to include the autoloader once with require_once('vendor/autoload.php');. After that you can import the classes with use statements.

If you use only one package, a project is not needed. But if you use more than one package, you cannot import the classes with use statements without a project.
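As a rough illustration (the imported class below is only an example; use whichever classes your downloaded package provides):

```php
<?php

use Scrapy\Scrapy; // example import; replace with a class from the package you downloaded

// Load the Composer autoloader shipped in the downloaded vendor folder.
require_once('vendor/autoload.php');

// From here on, the imported classes can be used as usual.
```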

In general, it is recommended to always use a project to download your libraries. An application normally needs more than one library.
Some PHP packages are not free to download and are therefore hosted in private repositories. In this case, credentials are needed to access such packages. Please use the auth.json textarea to insert credentials if a package comes from a private repository. You can look here for more information.

  • Some hosting environments are not accessible via a terminal or SSH, so Composer cannot be used there.
  • Using Composer is sometimes complicated, especially for beginners.
  • Composer needs a lot of resources, which are sometimes not available on a simple webspace.
  • If you are using private repositories, you don't need to share your credentials. You can set up everything on our site and then provide a simple download link to your team members.
  • Simplify your Composer build process: use our command line tool to download the vendor folder as a binary. This makes your build process faster, and you don't need to expose your credentials for private repositories.

Information about the package scrapy

Scrapy

PHP web scraping made easy.

Please note: documentation is always a work in progress; please excuse any errors.

Installation

You can install the package via composer:
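For example, with the standard Composer command (the package name is scrapy/scrapy):

```bash
composer require scrapy/scrapy
```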

Basic usage

Scrapy is essentially a reader which can modify the data it reads through a series of tasks. To simply read a URL you can do the following.
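A minimal sketch of that, assuming a fluent builder-style API; the ScrapyBuilder class and method names used here and in the following examples are illustrative, not confirmed by this document:

```php
<?php

use Scrapy\Builders\ScrapyBuilder; // assumed class name, for illustration only

// Build a scraper pointed at a URL and read its HTML.
$scrapy = ScrapyBuilder::make()
    ->url('https://www.example.com')
    ->build();

$result = $scrapy->scrape();
```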

Parsers

Just reading HTML from some source is not a lot of fun. Scrapy allows you to crawl HTML with a simple yet expressive API relying on Symfony's DOM Crawler.

You can think of parsers as actions meant to extract data valuable to you from HTML.

Parser definition

Parsers are meant to be self-contained scraping rules allowing you to extract data from an HTML string.
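A sketch of a parser class. That the process method receives a Crawly instance is stated later in this document; the base class, namespaces, and the second argument are assumptions:

```php
<?php

use Scrapy\Parsers\Parser;   // assumed base class
use Scrapy\Crawlers\Crawly;  // assumed namespace

// Illustrative parser: pulls the page title out of the HTML.
class TitleParser extends Parser
{
    public function process(Crawly $crawly, array $output): array
    {
        $output['title'] = $crawly->filter('title')->string();

        return $output;
    }
}
```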

Adding parsers

Once you have your parsers defined, it's time to add them to Scrapy.
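Continuing the builder sketch from above (the parser registration method name is assumed):

```php
$scrapy = ScrapyBuilder::make()
    ->url('https://www.example.com')
    ->parser(new TitleParser())   // method name assumed
    ->build();

$result = $scrapy->scrape();
```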

Inline parsers

You don't have to write a class for each parser; you can also do inline parsing. Let's see how that would look.
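For instance, an inline parser might be a plain closure (again a sketch, not a confirmed signature):

```php
$scrapy = ScrapyBuilder::make()
    ->url('https://www.example.com')
    ->parser(function (Crawly $crawly, array $output) {
        $output['title'] = $crawly->filter('title')->string();

        return $output;
    })
    ->build();
```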

Passing additional parameters to parsers

Sometimes you want to pass some extra context to your parsers. With Scrapy, you can pass an associative array of parameters which becomes available to every parser.

The same principle applies whether you define parsers as separate classes or inline them as functions.
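A sketch of how this could look; the params method and the way parameters are handed to a parser are assumptions:

```php
$scrapy = ScrapyBuilder::make()
    ->url('https://www.example.com')
    ->params(['currency' => 'EUR'])   // method name assumed
    ->parser(function (Crawly $crawly, array $output, array $params) {
        // Assumed: $params holds the associative array passed above.
        $output['currency'] = $params['currency'];

        return $output;
    })
    ->build();
```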

Crawly

You might have noticed that the first argument to a parser's process method is an instance of the Crawly class.

Crawly is an HTML crawling tool. It is based on Symfony's DOM Crawler.

Crawler initialisation

An instance of Crawly can be created from any string.
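For example (namespace assumed); the sample HTML below is reused by the method examples that follow:

```php
<?php

use Scrapy\Crawlers\Crawly; // assumed namespace

$crawly = new Crawly('
    <div>
        <h1> Products </h1>
        <ul>
            <li class="price">19.99</li>
            <li class="price">42</li>
        </ul>
        <a href="/about">About</a>
    </div>
');
```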

Crawling methods

Crawly provides a few helper methods that make it easier to get the data you want out of HTML.

Filter

Allows you to filter elements with a CSS selector, similar to what document.querySelector('...') does.
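With the $crawly instance created above (chaining behaviour assumed):

```php
// Select all list items using a CSS selector.
$items = $crawly->filter('ul > li');
```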

First

Narrow your selection by taking the first element from it.
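For example (sketch):

```php
// Keep only the first matching list item.
$first = $crawly->filter('li')->first();
```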

Nth

Narrow your selection by taking the nth element from it. Note that indices are 0-based.
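For example (sketch):

```php
// Take the second list item (index 1, since indices are 0-based).
$second = $crawly->filter('li')->nth(1);
```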

Raw

Get access to Symfony's DOM crawler.

Crawly does not aim to replace Symfony's DOM Crawler, but rather to make its usage more pleasant. That's why not all methods are exposed directly through Crawly.

Using the raw method allows you to utilise the underlying Symfony crawler.
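For instance (that raw exposes the Symfony crawler is stated above; the exact chaining is assumed):

```php
// Drop down to the underlying Symfony crawler and use its own API.
$text = $crawly->filter('li')->raw()->text();
```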

Trim

Trims the output string.
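For example (chaining order assumed):

```php
// ' Products ' becomes 'Products'.
$title = $crawly->filter('h1')->trim()->string();
```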

Pluck

Extracts attributes from the selection.
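For example (argument format assumed):

```php
// Collect the href attribute of every link.
$hrefs = $crawly->filter('a')->pluck(['href']);
```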

Count

Returns the count of currently selected nodes.
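For example (sketch):

```php
// Two list items in the sample HTML above.
$howMany = $crawly->filter('li')->count();
```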

Int

Returns the integer value of the current selection.
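For example (sketch):

```php
// The second list item contains '42', which becomes the integer 42.
$quantity = $crawly->filter('li')->nth(1)->int();
```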

Float

Returns the float value of the current selection.
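For example (sketch):

```php
// The first list item contains '19.99', which becomes the float 19.99.
$price = $crawly->filter('.price')->first()->float();
```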

String

Returns the current selection's inner content as a string.
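For example (sketch):

```php
// Inner content of the heading as a plain string.
$heading = $crawly->filter('h1')->string();
```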

Html

Returns HTML string representation of current selection, including the parent element.
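For example (sketch):

```php
// Includes the parent element, e.g. '<h1> Products </h1>'.
$outer = $crawly->filter('h1')->html();
```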

Inner HTML

Returns HTML string representation of current selection, excluding the parent element.
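For example (the method name innerHtml is inferred from the heading above):

```php
// Excludes the parent element, e.g. ' Products '.
$inner = $crawly->filter('h1')->innerHtml();
```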

Exists

Checks if the given selection exists.

You can get a boolean response or raise an exception.
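A sketch; the flag controlling whether an exception is thrown is an assumption:

```php
$hasLinks = $crawly->filter('a')->exists();   // true for the sample HTML above
$crawly->filter('table')->exists(true);       // assumed: would throw instead of returning false
```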

Reset

Resets the crawler back to its original HTML.
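For example (sketch):

```php
// Narrow the selection, then go back to the original HTML.
$first = $crawly->filter('li')->first()->string();
$crawly->reset();
```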

Map

This method creates a new array populated with the results of calling a provided function on every node in a selection.

For each node, the callback function is called with a Crawly instance created from that node. Additionally, the callback function takes a second argument, which is the 0-based index of the node.
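For example (sketch):

```php
// Collects ['0: 19.99', '1: 42'] from the sample HTML above.
$prices = $crawly->filter('li')->map(function (Crawly $node, int $index) {
    return $index . ': ' . $node->string();
});
```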

Node

Returns the first DOMNode of the selection.
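For example (sketch):

```php
// First \DOMNode of the selection, useful for APIs that expect raw DOM nodes.
$domNode = $crawly->filter('h1')->node();
```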

Readers

Readers are data source classes used by Scrapy to fetch the HTML content.

Scrapy comes with some readers predefined, and you can also write your own if you need to.

Using built in readers

Scrapy comes with two built-in readers: UrlReader and FileReader. Let's see how you can use them.
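A sketch; the reader class names come from this document, while the namespaces and builder method are assumptions:

```php
use Scrapy\Readers\UrlReader;   // namespace assumed
use Scrapy\Readers\FileReader;  // namespace assumed

// Read from a URL...
$scrapy = ScrapyBuilder::make()
    ->reader(new UrlReader('https://www.example.com'))
    ->build();

// ...or from a file on disk.
$scrapy = ScrapyBuilder::make()
    ->reader(new FileReader('page.html'))
    ->build();
```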

As you can see, the built-in readers allow you to use Scrapy by reading either from a URL or from a specific file.

Writing custom readers

You don't have to be limited to the built-in readers. Writing your own is a piece of cake.
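A sketch of a custom reader; the base type and the read method name are assumptions:

```php
use Scrapy\Readers\Reader; // assumed base type

// Illustrative reader returning a hard-coded HTML string.
class StaticReader extends Reader
{
    public function read(): string
    {
        return '<html><body><h1>Hello</h1></body></html>';
    }
}
```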

And then use it during the build process.
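Continuing the sketch above (builder method assumed):

```php
$scrapy = ScrapyBuilder::make()
    ->reader(new StaticReader())
    ->build();

$result = $scrapy->scrape();
```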

User agents

A user agent is a computer program representing a user, in this case a Scrapy instance. Scrapy provides several built-in user agents for simulating different crawlers.

Why use custom user agents

User agents only make sense in the context of readers that fetch their data over the HTTP protocol. More precisely, in cases where you want to read a web page that creates its content dynamically using JavaScript.

By default, Scrapy cannot parse JavaScript. This is a problem all web crawlers face. There are numerous techniques for overcoming it, usually by using external services like Prerender which redirect crawling bots to cached HTML pages.

Several user agents are provided to allow Scrapy to present itself as one of the common user agents. Please note that if a web page implements more advanced crawling security checks (for example an IP check), the provided agents would fail, since they only modify the HTTP request headers.

If you want to find out more, there is a great article on pre-rendering over at Netlify.

Using built in agents

Scrapy comes with a few built-in agents you can use.
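A sketch; the concrete agent class and the builder method below are illustrative names, not confirmed by this document:

```php
use Scrapy\Agents\GoogleBotUserAgent; // illustrative class name

$scrapy = ScrapyBuilder::make()
    ->url('https://www.example.com')
    ->userAgent(new GoogleBotUserAgent())   // method name assumed
    ->build();
```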

Writing custom agents

Just like with readers, you can write your own custom user agents.
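A sketch of what a custom agent might look like; the base type and the method it overrides are assumptions:

```php
use Scrapy\Agents\UserAgent; // assumed base type

// Illustrative agent that only sets a custom User-Agent header.
class MyUserAgent extends UserAgent
{
    public function headers(): array
    {
        return ['User-Agent' => 'MyCustomAgent/1.0'];
    }
}
```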

And then use it during the build process.
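Continuing the sketch above (builder method assumed):

```php
$scrapy = ScrapyBuilder::make()
    ->url('https://www.example.com')
    ->userAgent(new MyUserAgent())
    ->build();
```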

Precedence of parameters

One thing to note is the precedence of different parameters you may set during the build process.

Setting the URL is the same as setting the reader to a UrlReader with that URL. On the other hand, explicitly setting a reader takes precedence over explicitly setting the URL and/or user agent.
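Illustrated with the builder sketch used throughout this page (names assumed):

```php
// The explicit reader wins here, so the URL set first would effectively be ignored.
$scrapy = ScrapyBuilder::make()
    ->url('https://www.example.com')
    ->reader(new FileReader('page.html'))
    ->build();
```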

Exception handling

In general, Scrapy tries to handle all possible exceptions by wrapping them in its base exception class, ScrapeException.

What this means is that you can organize your app around a single exception for general error handling.
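For example (the exception class is named in this document; its namespace is assumed):

```php
use Scrapy\Exceptions\ScrapeException; // namespace assumed

try {
    $result = $scrapy->scrape();
} catch (ScrapeException $e) {
    // Every Scrapy failure surfaces here, so one catch block covers general error handling.
    error_log($e->getMessage());
}
```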

A more granular system is planned for a future release, which would allow you to react to specific parser exceptions.

Testing

To run the entire suite of unit tests you can do:
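Typically one of the following, assuming PHPUnit is the test runner and a Composer test script is defined:

```bash
composer test
# or
vendor/bin/phpunit
```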

Changelog

Please see CHANGELOG for more information on what has changed recently.

Credits

License

The MIT License (MIT). Please see License File for more information.


All versions of scrapy with dependencies

Requires:

  • php: ^7.2.5
  • guzzlehttp/guzzle: ~6.0
  • symfony/css-selector: ^4.4
  • symfony/dom-crawler: ^4.4
  • ext-dom: *

Composer command for our command line client (download client): the client runs in every environment, no specific PHP version is needed, and the first 20 API calls are free. Standard composer command: composer require scrapy/scrapy
