Download the PHP package caster/siflawler without Composer

On this page you can find all versions of the PHP package caster/siflawler. You can download and install these versions without Composer; any dependencies are resolved automatically.

FAQ

After the download, you only have to add one include, require_once('vendor/autoload.php');. After that, you can import the classes with use statements.
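For example, a minimal script using the downloaded package could look like this (constructing the crawler still requires the configuration options described further down this page):

    <?php
    // Load all classes from the downloaded vendor folder.
    require_once('vendor/autoload.php');

    // Import the class you want to use with a use statement.
    use siflawler\Crawler;

    // You can now use the package's classes, e.g. new Crawler($options);
    // see the "Configuration" section below for the options siflawler needs.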

If you use only one package, a project is not needed. But if you use more than one package, it is not possible to import the classes with use statements without a project.

In general, it is recommended to always use a project to download your libraries, since an application normally needs more than one library.
Some PHP packages are not free to download and are therefore hosted in private repositories. In that case, credentials are needed to access them. If a package comes from a private repository, please enter the credentials in the auth.json textarea. You can look here for more information.

  • Some hosting environments cannot be accessed via a terminal or SSH, which makes it impossible to use Composer there.
  • Using Composer is sometimes complicated, especially for beginners.
  • Composer needs a lot of resources, which are sometimes not available on a simple webspace.
  • If you are using private repositories, you don't need to share your credentials. You can set up everything on our site and then provide a simple download link to your team members.
  • Simplify your Composer build process: use our command line tool to download the vendor folder as a binary. This makes your build process faster, and you don't need to expose your credentials for private repositories.

Information about the package siflawler

siflawler-php

A simple, flexible crawler, written in PHP.

This little project is easy to install and enables you to crawl one or more pages and extract the data you are interested in.

Interesting features on a page can be found through XPath queries. On top of that, siflawler supports basic CSS selectors with an extension enabling the retrieval of attributes. This way, querying a page is easy even if you do not know XPath.

  1. Dependencies
  2. Usage
  3. Configuration
    1. Mandatory options
    2. Optional options
    3. Querying
  4. Running tests
  5. Contributing
  6. License

Dependencies

To be able to run siflawler, you will need to have the PHP cURL extension installed. This is what siflawler uses to download pages from the website(s) you want to crawl; among other things, it enables siflawler to download pages in parallel. Please open an issue if this is a problem for you and you would like to see support for plain PHP file_get_contents added.

Usage

You can install siflawler using Composer. Either run composer require 'caster/siflawler:~1.2.1' or put the following in your composer.json and run composer install.
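If you go the composer.json route, a minimal file could look like this (only the require entry matters here; other fields are up to you):

    {
        "require": {
            "caster/siflawler": "~1.2.1"
        }
    }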

You are now set up to start crawling.

You may like to do your crawling from the command line. In that case, just run your file as shown in the sketch below. siflawler will give some output letting you know what it is doing if you have set the verbose option to true (which it is by default).
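As a rough sketch, a file crawl.php run with php crawl.php could look like the following. Note that the exact shape of the options and the crawl() entry point are assumptions here, not taken from this page; check the package's own documentation for the real API.

    <?php
    require_once('vendor/autoload.php');

    // Assumption: the configuration is passed to the constructor; whether it
    // should be an array, an object, a JSON string or a file path is not
    // shown on this page.
    $options = array(
        'start' => 'http://example.com/news',   // mandatory: page(s) to start at
        'find'  => 'div.article',               // mandatory: interesting elements
        'get'   => array(                       // mandatory: what to extract
            'title' => 'h1',
            'link'  => 'a',
        ),
    );

    $crawler = new \siflawler\Crawler($options);

    // Assumption: a crawl() method starts crawling; with verbose enabled
    // (the default) siflawler prints progress output while it runs.
    $crawler->crawl();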

Configuration

These are the options you can pass to the \siflawler\Crawler when constructing it.

Mandatory options

The following options are mandatory; siflawler will throw an exception if you forget to pass any of them, as it simply needs to know what to do. To see what you can pass for the selectors/queries, refer to the Querying section.

The start option is simply the URL at which siflawler will (first) look for an HTML page to crawl and get data from. This may also be an absolute path to some file on your local disk. It can even be an array of URLs, paths, or a mix of the two.

The find option can be used to specify how siflawler should locate interesting elements on a page once it has been retrieved. For each interesting element, an object (stdClass) will be created that has the properties specified in the get option. Each key in the get option object maps to a query indicating what to put under that key in the resulting stdClass object.

If you want to crawl multiple pages, use the next option (see next section).
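Putting the mandatory options together, a configuration could look like the sketch below. The values are purely illustrative, and giving the options as a PHP array is an assumption (see the usage sketch above); what is grounded in the description is that start may mix URLs and local paths, and that find and get take queries.

    <?php
    $options = array(
        'start' => array(
            'http://example.com/archive',        // a URL to crawl
            '/home/me/pages/saved-page.html',    // an absolute local path
        ),
        'find'  => 'css:div.post',               // one object per matching element
        'get'   => array(
            'title'  => 'h2.post-title',         // property name => query
            'author' => 'span.author',
            'date'   => 'xpath:.//time/@datetime',
        ),
    );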

Optional options

You can use the following optional options, which are all fairly self-explanatory. The values below are the default values; you only need to include an option if you want to use a different value or want to be explicit.

The max_requests option can be used to limit the number of pages siflawler will request in total. A value of 0 or less means that there is no limit.

The next option can be used to find one or more URLs to crawl next. If this is null, then no next page will be crawled, but you can specify a query to find one or more locations to go to next. This is useful when you want to crawl data that is split over multiple pages using pagination.

The timeout option can be used to specify a timeout in seconds for each request. A value of 0 means that there will be no timeout.

The verbose and warnings options can be used to toggle siflawler output.
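For reference, the optional options with their defaults might be written out as follows. Only verbose defaulting to true is stated explicitly above; the other default values shown are assumptions based on the descriptions.

    <?php
    $optional = array(
        'max_requests' => 0,     // assumed default: 0 (or less) means no limit
        'next'         => null,  // no next page is crawled when this is null
        'timeout'      => 0,     // assumed default: 0 means no timeout
        'verbose'      => true,  // default: print progress output
        'warnings'     => true,  // assumed default: show warnings
    );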

Querying

Everywhere you can specify a query to find elements or attributes of elements, you can do one of two things: either specify an XPath query, or specify a CSS selector. CSS can normally only select nodes, but siflawler understands some additional syntax that allows you to select attribute values as well. Some examples are given below.

Internally, siflawler will translate CSS selectors to XPath queries. If you want to be sure that this cannot go wrong, you should use XPath, but siflawler's CSS support is pretty good and can always be improved if you create an issue 🙂

To distinguish between CSS and XPath, siflawler uses a heuristic. If you want to be sure that this does not go wrong, you can specify a query as css:[your CSS selector] or xpath:[your XPath query] to let siflawler know precisely which query language you are using.
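Here are a few example query strings of both kinds. The selectors and attribute names are illustrative; siflawler's own CSS extension for selecting attributes is not shown on this page, so plain XPath is used for attribute values below.

    <?php
    // Plain XPath queries (attribute selection is standard XPath):
    $find = '//div[@class="post"]';
    $next = '//a[@rel="next"]/@href';

    // A plain CSS selector (translated to XPath internally):
    $title = 'h2.post-title';

    // Explicit prefixes, in case the CSS/XPath heuristic should not guess:
    $author = 'css:span.author';
    $date   = 'xpath://time/@datetime';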

Running tests

This project uses phpunit for automated unit testing. You can easily run the tests by executing composer test. For that to work, you do need to install the dev version of siflawler.
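For example, from a checkout of the repository (so that the dev dependencies, including phpunit, get installed):

    # install siflawler together with its dev dependencies
    composer install

    # run the test suite
    composer test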

Contributing

If you miss something in siflawler, found a problem or if you have something really cool to add to it, feel free to open an issue or pull request on GitHub. I will try to respond as quickly as possible.

License

siflawler - a simple, flexible crawler, written in PHP.

Copyright (C) 2015 Thom Castermans

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.


All versions of siflawler with dependencies

Requires:

  • php: >=5.3.0
  • ext-curl: *
  • ext-dom: *
  • ext-json: *

Composer command for our command line client (download client): the client runs in any environment, so you don't need a specific PHP version. The first 20 API calls are free.
