Download the PHP package mkocztorz/data-scraper without Composer

On this page you can find all versions of the PHP package mkocztorz/data-scraper. These versions can be downloaded and installed without Composer. Any dependencies are resolved automatically.

FAQ

After the download, include the autoloader once with require_once('vendor/autoload.php');. After that, import the classes with use statements.

Example:
If you use only one package, a project is not needed. But if you use more than one package, it is not possible to import the classes with use statements without a project.

In general, it is recommended to always use a project to download your libraries, since an application normally needs more than one library.
Some PHP packages are not free to download and are therefore hosted in private repositories. Credentials are needed to access such packages. If a package comes from a private repository, insert the credentials in the auth.json textarea.

  • Some hosting environments cannot be accessed through a terminal or SSH, so Composer cannot be used there.
  • Composer can be complicated to use, especially for beginners.
  • Composer needs a lot of resources, which are sometimes not available on simple web space.
  • If you are using private repositories, you don't need to share your credentials. You can set up everything on our site and then provide a simple download link to your team members.
  • Simplify your Composer build process. Use our own command line tool to download the vendor folder as a binary. This makes your build process faster, and you don't need to expose your credentials for private repositories.

Information about the package data-scraper

data-scraper

Install

Install using composer:

Currently the minimum stability is alpha.

What is it?

Data scraper is based on the great Symfony DomCrawler component. It expects a Symfony\Component\DomCrawler\Crawler as its input.

Data scraper focuses on the task of extracting data from HTML that is loaded into a Crawler object. The extraction is done by selecting DOM element(s) with a CSS selector and applying the appropriate extraction method to them. CSS selectors and extraction methods are extendable: you may add your own extensions if needed.

Data scraper allows extracting a single value or an array of values. It allows scraping a complex set of data in one sweep. The result may be a value, an array (list of items) or even nested data structures.

What it is not.

It is not a web spider. It doesn't make any web requests; that's up to you. It doesn't care where the HTML comes from.
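Since fetching the HTML is left to the caller, the only preparation needed is to load that HTML into a Crawler. The following is a minimal sketch using only the Symfony DomCrawler API; it assumes symfony/dom-crawler and symfony/css-selector are available through vendor/autoload.php, and the sample markup is made up for illustration.

```php
<?php
// Assumption: the vendor folder (from Composer or from this download page)
// contains symfony/dom-crawler and symfony/css-selector.
require_once 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// The HTML may come from anywhere: an HTTP client, a file, a cache.
// A literal string stands in for it here.
$html = '<html><body><h1 id="title">Hello</h1></body></html>';

// Load the markup into a Crawler; this object is what data scraper takes as input.
$crawler = new Crawler($html);

// Quick check that the document was parsed: read the text of #title.
echo $crawler->filter('#title')->text(); // prints "Hello"
```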

Learn by example

For those who want to dive right into it, please visit this tutorial or take a look at the example at the bottom.

Working with data scraper

The main entry point to data scraper is \Mkocztorz\DataScraper\Extractor\ExtractorBase. This is a service class where the extraction methods are registered and used. But you're much more likely to use the \Mkocztorz\DataScraper\Std\Extractor that has all the standard extraction methods registered by default.

There is also \Mkocztorz\DataScraper\Html\SelectorProviderBase, which helps to register and use the selectors. A ready-to-use version, \Mkocztorz\DataScraper\Std\SelectorProvider, comes with the default CSS selector registered. Most of the time, while working with data scraper, the selector provider and selector objects will be transparent to you. The Std Extractor uses the Std SelectorProvider by default.

You may think of working with data scraper as a two-step job:

  1. Create a formula that describes where in the HTML the data you want to scrape is located and which method should be used to read it (e.g. the element's text or an attribute).

  2. Apply the formula created in step 1 to the HTML.

You don't need to create the formula for every HTML document you want to scrape. If, for example, you want to scrape a paginated list of items or a set of user profiles, you only need to create the formula once and then apply it to every page of the results or to every user profile.
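The "create once, apply many times" idea can be sketched without the library itself. In the example below a plain PHP closure stands in for a formula; $titleFormula, the h1.name selector and the sample pages are all hypothetical, and only Symfony DomCrawler calls are used. It is an illustration of the workflow, not data scraper's own API.

```php
<?php
// Assumption: symfony/dom-crawler and symfony/css-selector are autoloadable.
require_once 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Step 1: describe once where the data lives and how to read it.
// This closure is a hypothetical stand-in for a data scraper formula.
$titleFormula = function (Crawler $page) {
    $nodes = $page->filter('h1.name');
    // Use the first match, or an empty string when nothing matches.
    return $nodes->count() > 0 ? trim($nodes->first()->text()) : '';
};

// Step 2: apply the same formula to every page.
$pages = array(
    '<html><body><h1 class="name">Alice</h1></body></html>',
    '<html><body><h1 class="name">Bob</h1></body></html>',
);

$names = array();
foreach ($pages as $html) {
    $names[] = $titleFormula(new Crawler($html));
}

print_r($names);
```

For the two sample pages, $names ends up holding Alice and Bob in that order.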

Built-in methods

Extraction Methods

The default extraction methods are in the \Mkocztorz\DataScraper\Extractor\Method namespace. The examples below assume that the Extractor service has been created first:

The examples below show step 1 of the process: creating the formula. When you want to extract the data, you need to have a Crawler with the HTML loaded and call:

Element's text

Class: ExtractElementText

Registered as: text

Extractor method: getText

Params: none

Usage:

Will: get the text from the element with id="title".

Note: It uses the first element matched by the CSS selector.

If the element is not found: returns an empty string.

Element's text with pattern

Class: ExtractElementTextPattern

Registered as: textPattern

Extractor method: getTextPattern

Params: ['pattern'=>The pattern containing ?P named subpattern]

Usage:

Will: get the value matching the pattern in the text of the element with id="title".

Note: It uses the first element matched by the CSS selector.

Note: The pattern has to have a ?P named subpattern.

If the element is not found or the pattern matches nothing: returns an empty string.
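For readers unfamiliar with named subpatterns, the sketch below shows how PCRE's (?P<name>...) syntax behaves with PHP's preg_match, including the "empty string when nothing matches" convention described above. The helper matchNamedGroup and the group name number are hypothetical; the group name that data scraper itself expects is not recoverable from this page.

```php
<?php
// Hypothetical helper, not part of data scraper: returns the named group
// $group from the first match of $pattern in $text, or '' when nothing matches.
function matchNamedGroup($pattern, $text, $group)
{
    if (preg_match($pattern, $text, $matches) === 1 && isset($matches[$group])) {
        return $matches[$group];
    }
    return '';
}

$text = 'Order no. 1234 (shipped)';

// The named subpattern (?P<number>\d+) captures the digits after "no. ".
echo matchNamedGroup('/no\. (?P<number>\d+)/', $text, 'number'), "\n"; // prints 1234

// No "invoice" in the text, so the helper falls back to an empty string.
var_dump(matchNamedGroup('/invoice (?P<number>\d+)/', $text, 'number')); // string(0) ""
```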

Element's attribute value

Class: ExtractAttribute

Registered as: attribute

Extractor method: getAttribute

Params: ['attr'=>attribute name]

Usage:

Will: get the value of the age attribute from the element with id="title".

Note: It uses the first element matched by the CSS selector.

If the element or the attribute is not found: returns an empty string.
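The behaviour described above (first match only, empty string when the element or the attribute is missing) can be reproduced with plain DomCrawler calls. This is a sketch of the semantics, not the library's implementation; firstAttribute is a hypothetical helper and the markup is made up.

```php
<?php
// Assumption: symfony/dom-crawler and symfony/css-selector are autoloadable.
require_once 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical helper, not the library's code.
function firstAttribute(Crawler $crawler, $selector, $attr)
{
    $nodes = $crawler->filter($selector);
    if ($nodes->count() === 0) {
        return ''; // element not found
    }
    // attr() reads from the first node; the cast turns a missing
    // attribute (null in recent DomCrawler versions) into ''.
    return (string) $nodes->first()->attr($attr);
}

$crawler = new Crawler('<html><body><h1 id="title" age="42">Hi</h1></body></html>');

var_dump(firstAttribute($crawler, '#title', 'age'));   // string(2) "42"
var_dump(firstAttribute($crawler, '#title', 'rel'));   // string(0) ""
var_dump(firstAttribute($crawler, '#missing', 'age')); // string(0) ""
```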

Element's attribute value with pattern

Class: ExtractAttributePattern

Registered as: attributePattern

Extractor method: getAttributePattern

Params: ['attr'=>attribute name, 'pattern'=>The pattern containing ?P named subpattern]

Usage:

Will: get the value matching the pattern in the id attribute of the element with id="title".

Note: It uses the first element matched by the CSS selector.

If the element or the attribute is not found, or the pattern matches nothing: returns an empty string.

List of elements

NOTE: This extraction method is different from the previous ones

Class: ExtractList

Registered as: list

Extractor method: getList

Params: Array of key-value pairs. Each value must be another Extraction Method. Key will be used as key in result item.

This extraction method works differently from the previous ones. It is designed to scrape data from lists. By itself, ExtractList doesn't actually scrape any data; instead it runs the given extraction methods on each element found by its selector. Think of it as a kind of foreach control structure. Important: every extraction method used by ExtractList receives a Crawler that contains only one element found by the ExtractList selector (think of it as a kind of namespace). As a result, the selector of a child extraction method applies only within that element.

Usage:

Will: get every list item found by the "ul li" selector and pass it to each of the extraction methods, which in turn do their job. A sample result might look like:

Note: In the params you may use $extract->getList(..) again!

If the list is empty: none of the child extraction methods are executed and the result is an empty array.
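The per-item scoping described above mirrors what DomCrawler's own each() does, so it can be illustrated with plain Crawler calls. The sketch below is not ExtractList itself; the markup and the keys name, link and price are invented for the example. It assumes symfony/dom-crawler 2.3 or later, where each() passes a Crawler (earlier versions passed a DOM node) to the callback.

```php
<?php
// Assumption: symfony/dom-crawler (2.3+) and symfony/css-selector are autoloadable.
require_once 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><ul>'
      . '<li><a href="/a">First</a> <span class="price">10</span></li>'
      . '<li><a href="/b">Second</a> <span class="price">20</span></li>'
      . '</ul></body></html>';

$crawler = new Crawler($html);

// each() hands the callback a Crawler holding exactly one <li>,
// so the inner selectors match only inside that item.
$items = $crawler->filter('ul li')->each(function (Crawler $item) {
    return array(
        'name'  => $item->filter('a')->text(),
        'link'  => $item->filter('a')->attr('href'),
        'price' => $item->filter('.price')->text(),
    );
});

print_r($items);
```

For the sample markup this yields two items: the first has name First, link /a and price 10, the second has name Second, link /b and price 20. When the outer selector matches nothing, each() never calls the callback and returns an empty array.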

Under the hood

How it works and how it can be extended to your needs.

More docs coming soon.

Example

The result:

Licence MIT


All versions of data-scraper with dependencies

Requirements:

  • php >=5.3.3
  • symfony/dom-crawler ~2.1
  • symfony/css-selector ~2
