PHP download

Download the PHP package awanesia/scrap without Composer

On this page you can find all versions of the PHP package awanesia/scrap. You can download and install these versions without Composer. Any dependencies are resolved automatically.

Table of contents
Download awanesia/scrap
More information about awanesia/scrap
Files in awanesia/scrap

Vendor awanesia
Package scrap
Short Description A web scraper for PHP to easily extract data from web pages -> party of laurentvw
License MIT
Homepage http://github.com/awanesia/scrap

Keywords text data parser matcher parse scraper spider crawler content crawling extract Harvest extractor scraping Match scrape mining

FAQ

After the download, include the autoloader once with require_once('vendor/autoload.php');. After that, import the classes you need with use statements.

If you use only one package, a project is not needed. But if you use more than one package, it is not possible to import the classes with use statements without a project.

In general, it is recommended to always use a project to download your libraries. An application normally needs more than one library.

Some PHP packages are not free to download and are therefore hosted in private repositories. Credentials are needed to access such packages. If a package comes from a private repository, use the auth.json textarea to insert the credentials. You can look here for more information.

Some hosting environments cannot be reached by a terminal or SSH, so it is not possible to use Composer there.
Composer can be complicated to use, especially for beginners.
Composer needs a lot of resources. Sometimes they are not available on simple web space.
If you are using private repositories, you don't need to share your credentials. You can set up everything on our site and then provide a simple download link to your team members.
Simplify your Composer build process. Use our own command line tool to download the vendor folder as a binary. This makes your build process faster, and you don't need to expose your credentials for private repositories.


Example code of awanesia/scrap

Information about the package scrap

Scrapher

Scrapher is a PHP library to easily scrape data from web pages.

Getting Started

Installation

Add the package to your composer.json and run composer update.

{
    "require": {
        "laurentvw/scrapher": "2.*"
    }
}

For those still using v1.0 ("LavaCrawler"), the documentation is here: https://github.com/Laurentvw/scrapher/tree/v1.0.2

Basic Usage

In order to start scraping, you need to set the URL(s) or HTML to scrape, and a type of selector to use (for example a regex selector, together with the data you wish to match).
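A minimal sketch of that setup follows. It is reconstructed from the upstream laurentvw/scrapher README, so the class locations (Laurentvw\Scrapher\Scrapher and Laurentvw\Scrapher\Selectors\RegexSelector) and the 'name' key are assumptions for this fork; check them against the downloaded files. The regex is only illustrative.

```php
<?php
require_once 'vendor/autoload.php';

// Assumed class locations (taken from the upstream README); verify in vendor/.
use Laurentvw\Scrapher\Scrapher;
use Laurentvw\Scrapher\Selectors\RegexSelector;

// The page to scrape.
$scrapher = new Scrapher('https://www.google.com/');

// Illustrative regex: group 1 captures the href, group 2 the link text.
$regex = '/<a.*?href=(?:"(.*?)").*?>(.*?)<\/a>/ms';

// Each entry ties a regex group ("id") to a named field in the result.
$matchConfig = array(
    array('name' => 'url',   'id' => 1),
    array('name' => 'title', 'id' => 2),
);

// Attach the selector, then retrieve every match.
$matches = $scrapher->with(new RegexSelector($regex, $matchConfig));
$results = $matches->get();

var_dump($results);
```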

This returns a list of arrays based on the match configuration that was set.

array(29) {
  [0] =>
  array(2) {
    'url' =>
    string(35) "https://www.google.com/webhp?tab=ww"
    'title' =>
    string(6) "Search"
  }
  ...
}

Documentation

Instantiating

When creating an instance of Scrapher, you may optionally pass one or more URLs.

Passing multiple URLs is useful when you want to scrape the same data from different pages, for example when content is split across paginated pages.

If you prefer to fetch the page yourself using a dedicated client/library, you may also simply pass the actual content of a page. This can also be handy if you want to scrape other content besides just web pages (e.g. local files).

In some cases, you may want to add (read: append) URLs or contents on the fly.
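The options above could look roughly like this. The constructor forms follow the text, while the method names addUrl, addUrls, addContent and addContents are recalled from the upstream laurentvw/scrapher README and are assumptions for this fork. The example.com URLs and the HTML string are made up.

```php
<?php
require_once 'vendor/autoload.php';

// Assumed class location (from the upstream README); verify in vendor/.
use Laurentvw\Scrapher\Scrapher;

// One URL.
$scrapher = new Scrapher('https://www.google.com/');

// Several URLs, e.g. paginated listings (made-up addresses).
$scrapher = new Scrapher(array(
    'https://example.com/list?page=1',
    'https://example.com/list?page=2',
));

// Page content fetched by other means (an inline string here).
$html = '<a href="/about">About</a>';
$scrapher = new Scrapher($html);

// Appending on the fly (assumed method names).
$scrapher->addUrl('https://example.com/list?page=3');
$scrapher->addUrls(array('https://example.com/a', 'https://example.com/b'));
$scrapher->addContent('<a href="/contact">Contact</a>');
$scrapher->addContents(array($html, '<a href="/faq">FAQ</a>'));
```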

Matching data using a Selector

Before retrieving or sorting the matched data, you need to choose a selector to match the data you want.

At the moment, Scrapher offers one selector out of the box, RegexSelector, which lets you select data using regular expressions.

A Selector takes an expression and a match configuration as its arguments.

For example, to match all links and their link name, you could do:

Note that the kind of value passed to the "id" key may vary depending on what selector you're using, and can virtually be anything. You can think of the "id" key as the glue between the given expression and its selector.

RegexSelector uses preg_match_all (http://php.net/manual/en/function.preg-match-all.php) under the hood.

For your convenience, when using Regex, a match with 'id' => 0 will return the URL of the crawled page.
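To make the role of "id" concrete, here is a plain-PHP sketch of how a match configuration could be applied to the output of preg_match_all. The helper matchWithConfig is hypothetical and is not part of Scrapher; it only mirrors the behaviour described above, including the id 0 convention. The 'name' key and the sample HTML are assumptions.

```php
<?php
// Hypothetical helper (not Scrapher code): maps regex groups to named fields.
function matchWithConfig($html, $regex, array $config, $sourceUrl = '')
{
    // PREG_SET_ORDER yields one array per match, indexed by group number.
    preg_match_all($regex, $html, $sets, PREG_SET_ORDER);

    $rows = array();
    foreach ($sets as $set) {
        $row = array();
        foreach ($config as $field) {
            if ($field['id'] === 0) {
                // Convention described above: id 0 stands for the page URL.
                $row[$field['name']] = $sourceUrl;
            } else {
                $row[$field['name']] = isset($set[$field['id']]) ? $set[$field['id']] : null;
            }
        }
        $rows[] = $row;
    }
    return $rows;
}

$html   = '<a href="/about">About</a> <a href="/contact">Contact</a>';
$regex  = '/<a.*?href=(?:"(.*?)").*?>(.*?)<\/a>/ms';
$config = array(
    array('name' => 'url',    'id' => 1), // first capture group
    array('name' => 'title',  'id' => 2), // second capture group
    array('name' => 'source', 'id' => 0), // URL of the crawled page
);

$rows = matchWithConfig($html, $regex, $config, 'http://example.com/');
print_r($rows);
```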

Retrieving & Sorting

Once you've specified a selector using the with method, you can start retrieving and/or sorting the data.

Retrieving

Offset & limit
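In the upstream README the chain methods for this are skip() and take(); treat those names as assumptions for this fork. Their effect on a result list corresponds to array_slice in plain PHP, sketched here with made-up rows.

```php
<?php
// Made-up result rows, shaped like the list of arrays returned by get().
$results = array(
    array('title' => 'A'),
    array('title' => 'B'),
    array('title' => 'C'),
    array('title' => 'D'),
);

// Assumed Scrapher equivalent: $matches->skip(1)->take(2)->get();
$offset = 1; // skip the first row
$limit  = 2; // then keep two rows
$page   = array_slice($results, $offset, $limit);

print_r($page);
```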

Sorting

See date_create
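date_create is PHP's built-in function for parsing a date string into a DateTime object, which is what makes date values sortable. In the upstream README it is passed as a third argument to orderBy (for example orderBy('date', 'desc', 'date_create')); that signature is an assumption for this fork. The plain-PHP sketch below shows the same idea with usort and made-up rows.

```php
<?php
// Made-up rows with dates in inconsistent formats.
$results = array(
    array('title' => 'Old', 'date' => '3 March 2014'),
    array('title' => 'New', 'date' => '2015-01-20'),
    array('title' => 'Mid', 'date' => 'Dec 5, 2014'),
);

// Sort newest first by comparing parsed timestamps instead of raw strings.
usort($results, function ($a, $b) {
    $ta = date_create($a['date'])->getTimestamp();
    $tb = date_create($b['date'])->getTimestamp();
    if ($ta === $tb) {
        return 0;
    }
    return $ta < $tb ? 1 : -1;
});

// Collect the titles in their sorted order.
$titles = array();
foreach ($results as $row) {
    $titles[] = $row['title'];
}
print_r($titles);
```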

Filtering

You can filter the matched data to refine your result set. Return true to keep the match, false to filter it out.
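A sketch of such a filter callback, applied here with array_filter so that it runs without the library. In the upstream README the callback is passed to a filter() method on the Scrapher instance; that method name is an assumption for this fork, and the rows are made up.

```php
<?php
// Made-up result rows.
$results = array(
    array('url' => 'https://www.google.com/maps', 'title' => 'Google Maps'),
    array('url' => 'https://example.com/',        'title' => 'Example'),
);

// Return true to keep a match, false to drop it.
$keepGoogle = function ($match) {
    return stripos($match['title'], 'Google') !== false;
};

// Assumed Scrapher equivalent: $matches->filter($keepGoogle)->get();
$filtered = array_values(array_filter($results, $keepGoogle));

print_r($filtered);
```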

Mutating

In order to handle inconsistencies or formatting issues, you can alter the matched values to a more desirable value. Altering happens before filtering and sorting the result set. You can do so by using the apply index in the match configuration array with a closure that takes 2 arguments: the matched value and the URL of the crawled page.
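A sketch of a match configuration with apply closures. The apply index and its two arguments come from the description above; the 'name' key, the sample values and the direct calls at the end are only for illustration.

```php
<?php
$matchConfig = array(
    array(
        'name'  => 'url',
        'id'    => 1,
        // Prefix relative links with the URL of the crawled page.
        'apply' => function ($match, $sourceUrl) {
            if (stripos($match, 'http') !== 0) {
                return rtrim($sourceUrl, '/') . '/' . ltrim($match, '/');
            }
            return $match;
        },
    ),
    array(
        'name'  => 'title',
        'id'    => 2,
        // Remove markup and surrounding whitespace from the link text.
        'apply' => function ($match, $sourceUrl) {
            return trim(strip_tags($match));
        },
    ),
);

// Calling the closures directly, only to show what they return.
$url   = call_user_func($matchConfig[0]['apply'], '/about', 'http://example.com/');
$title = call_user_func($matchConfig[1]['apply'], '<b>About</b> us ', 'http://example.com/');

echo $url, "\n", $title, "\n";
```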

Validation

You may validate the matched data to ensure that the result set always contains the desired result. Validation happens after optionally mutating the data set with apply. To add the validation rules that should be applied to the data, use the validate index in the match configuration array with a closure that takes 2 arguments: the matched value and the URL of the crawled page. The closure should return true if the validation succeeded, and false if the validation failed. Matches that fail the validation will be removed from the result set.

To make validation easier, we recommend using https://github.com/Respect/Validation in your project.
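A sketch of validate closures that use only built-in PHP functions, so it runs without Respect/Validation. The validate index and its two arguments come from the description above; the 'name' key and the sample values are assumptions.

```php
<?php
$matchConfig = array(
    array(
        'name'     => 'url',
        'id'       => 1,
        // Accept only syntactically valid absolute URLs.
        'validate' => function ($match, $sourceUrl) {
            return filter_var($match, FILTER_VALIDATE_URL) !== false;
        },
    ),
    array(
        'name'     => 'title',
        'id'       => 2,
        // Require a non-empty title.
        'validate' => function ($match, $sourceUrl) {
            return trim($match) !== '';
        },
    ),
);

// Calling the URL rule directly, only to show what it returns.
$validUrl = $matchConfig[0]['validate'];
$ok  = $validUrl('https://www.google.com/webhp?tab=ww', 'https://www.google.com/');
$bad = $validUrl('/about', 'https://www.google.com/');

var_dump($ok, $bad);
```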

Logging

If you wish to see the matches that were filtered out, or removed due to failed validation, you can use the getLogs method, which returns an array of message logs.

Did you know?

All methods are chainable

Only the methods get, first, last, count and getLogs will cause the chaining to end, as they all return a certain result.

You can scrape different data from one page

Suppose you're scraping a page, and you want to get all H2 titles, as well as all links on the page. You can do so without having to re-instantiate Scrapher.
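A sketch of both points, reusing one instance for two selectors and ending each chain with a result-returning method. The class locations and the 'name' key are assumptions taken from the upstream laurentvw/scrapher README, and the regular expressions are only illustrative.

```php
<?php
require_once 'vendor/autoload.php';

// Assumed class locations (from the upstream README); verify in vendor/.
use Laurentvw\Scrapher\Scrapher;
use Laurentvw\Scrapher\Selectors\RegexSelector;

$scrapher = new Scrapher('https://www.google.com/');

// Illustrative expressions and match configurations.
$h2Regex  = '/<h2.*?>(.*?)<\/h2>/ms';
$h2Config = array(
    array('name' => 'title', 'id' => 1),
);

$linkRegex  = '/<a.*?href=(?:"(.*?)").*?>(.*?)<\/a>/ms';
$linkConfig = array(
    array('name' => 'url',   'id' => 1),
    array('name' => 'title', 'id' => 2),
);

// Same instance, two selectors; get() and count() end their chains.
$h2Titles  = $scrapher->with(new RegexSelector($h2Regex, $h2Config))->get();
$linkCount = $scrapher->with(new RegexSelector($linkRegex, $linkConfig))->count();

// Messages about matches dropped by filtering or validation.
$logs = $scrapher->getLogs();
```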

About

Author

Laurent Van Winckel - http://www.laurentvw.com - http://twitter.com/Laurentvw

License

Scrapher is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions to Laurentvw\Scrapher are always welcome. You make our lives easier by sending us your contributions through GitHub pull requests.

You may also create an issue to report bugs or request new features.

All versions of scrap with dependencies

PHP Build Version

Package Version

Version v2.0.1 Release 17. Mar 2024

Download

Download latest version of scrap from vendor awanesia

Requires php Version >=5.3.0

Composer command for our command line client (download client): this client runs in any environment, and you don't need a specific PHP version. The first 20 API calls are free.

Standard Composer command

The package awanesia/scrap contains the following files
