PHP download

Download the PHP package openbuildings/spiderling without Composer

On this page you can find all versions of the php package openbuildings/spiderling. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.

Table of contents
Download openbuildings/spiderling
More information about openbuildings/spiderling
Files in openbuildings/spiderling

Vendor openbuildings
Package spiderling
Short Description Crawl the web with kohana or phantomjs.
License BSD-3-Clause

FAQ

After the download, you have to make one include require_once('vendor/autoload.php');. After that you have to import the classes with use statements.

Example:

If you use only one package a project is not needed. But if you use more then one package, without a project it is not possible to import the classes with use statements.

In general, it is recommended to use always a project to download your libraries. In an application normally there is more than one library needed.

Some PHP packages are not free to download and because of that hosted in private repositories. In this case some credentials are needed to access such packages. Please use the auth.json textarea to insert credentials, if a package is coming from a private repository. You can look here for more information.

Some hosting areas are not accessible by a terminal or SSH. Then it is not possible to use Composer.
To use Composer is sometimes complicated. Especially for beginners.
Composer needs much resources. Sometimes they are not available on a simple webspace.
If you are using private repositories you don't need to share your credentials. You can set up everything on our site and then you provide a simple download link to your team member.
Simplify your Composer build process. Use our own command line tool to download the vendor folder as binary. This makes your build process faster and you don't need to expose your credentials for private repositories.

Please rate this library. Is it a good library?

Example code of openbuildings/spiderling

Informations about the package spiderling

Spiderling

This is a library for crawling web pages with curl and PhantomJS. Heavily inspired by Capybara. It's a major component in phpunit-spiderling for integration level testing. It can handle AJAX requests easily and allows switching from fast PHP-only drivers to JavaScript-enabled like PhantomJS easily, without modifying the code.

A quick example

This will output the text content of the HTML node li.test fill in some inputs and submit the form.

The DSL

The Page object has a rich DSL for accessing content and filling in forms:

Navigation

visit($page, $query): Direct the browser to open a new URL
current_url(): Retrieve the current URL - this will be affected by redirects or scripts changing the browser URL in any way.
current_path(): This will return the URL without the protocol and host part, usefull for writing more generic code.

Getters

Each node represents a HTML tag on the page, and you can use extensive getter methods to probe its contents. All of these getters are dynamic, meaning that there is no cache involved and each methods sends a call to its appropriate driver.

is_root(): Check if the current element is the root "node"
id(): Get the 'id' of the current node - this ID uniquely identifies the node for the current page.
html(): Get the raw html of the current node - this is like calling outerHTML on a dom element.
tag_name(): Get the tag name of the dom element. e.g. DIV, SPAN, FORM, SELECT
attribute($name): Get an attribute of the current tag. If the tag is empty e.g. <div disabled /> then it will return an empty string. If there is no attribute however, NULL will be returned
text(): Get the text content of an html tag - this is similar to how browsers render HTML tags, all whitespace will be merged to single spaces.
is_visible(): Check if a node is visible. PhantomJS driver will return correct value if the item is hidden via JS, CSS or inline styles.
is_selected(): Check if an option tag is "selected"
is_checked(): Check if an input tag is "checked"
value(): Get the value of an input form tag

An example of using some of these getters, if we have this page:

Then you could write the following PHP code:

Setters

Spiderling also gives you the ability to modify the current page, filling in input fields, pressing buttons and links, submitting forms. This can be accomplished with the low level setters:

set($value): If the node is a representation of an input field (input, textarea or even a select), you can use this method to set its value.
append($text): If you need to append some text to a textfield or input, you can use the append() method instead of set() - this allows performing the operation with fewer round trips to the driver.
click(): If the node represents something that you can click (like a link or a button) you can use the click() method on that node. It will perform the required operation as if a person clicked it, loading the new page and updating the result of current_url / current_path getters.
select_option(): If the node is an option tag, than you can use this method to select it. This will unselect any other selected options in the SELECT tag, unless its marked as "multiple"
unselect_option(): The opposite of select_option()
hover($x = NULL, $y = NULL): hover the mouse over the current node, triggering the javascript / css events and states. You can optionally pass x and y offsets so you can fine-tune this.
drop_files($files): This will trigger all the JavaScript events associated with dropping files on top of a dom element (HTML5 feature).

So having an example form like this:

We could do this script:

Locators

You can find elements not only by CSS selectors (which is the default) but also input elements, buttons and links have special finders. This is referred to as "locator type".

css - the default
xpath - using XPath
link - find links by id, title, text inside the link or even alt text of an image inside the link.
field - find input elements (TEXTAREA, INPUT, SELECT) by id, name, text of the label, pointing to this input, placeholder of the input or the text of the option of a select tag, that does not have a value (usually the default option)
label - find a label tag by id, title, content text or image alt text inside the label
button - find a button by id, title, name, value, text inside the button or alt text of the image inside the button

All of these locator types give you the ability to easily scan the page and select something you are looking for to click or fill without looking at the html of the page at all. Everywhere there is a selector you can enter an array('{locator type}', '{selector}') to change the default locator type.

Here's an example using the previous HTML:

The php code becomes clearer and less brittle - the underlying html can change but your code will still work as expected:

Filters

If using only locators is not enough, you can easily narrow down the search with "filters". They iterate over the found candidates, filtering out ones that don't match. Be careful with them because they load the nodes and check them one-by-one which might be performance intensive, but it is OK in most cases.

Here are the available filters:

visible: TRUE or FALSE - filter out visible or not visible nodes
value: string of value - filter out nodes that don't have a matching value
text: string of text - filter out nodes that don't have the given text
attributes: array of attribute name => attribute value - filter out nodes that don't have all of the given attributes (names and values)
at: specifically select which node of the list to return, all other are filtered out.

Here is how you might use the filters with this HTML:

Finders

Most locator types have a custom method for finding an element with that particular type. There are also some other custom finders which you might find useful:

find($selector, array $filters = array()) - default, uses CSS selectors
find_field($selector, array $filters = array()) - uses 'field' locator type to find input elements
find_link($selector, array $filters = array()) - uses 'link' locator type to find anchor tags
find_button($selector, array $filters = array()) - uses 'button' locator type
not_present($selector, array $filters = array()) - the opposite of "find" makes sure an element is not present on the page
all($selector, array $filters = array()) - returns a Nodelist - an iteratable and countable array-like object that you can 'foreach' and 'count' easily. Have in mind that it features lazy loading, so only when you access a node it gets loaded by the driver. count() does not load any nodes at all.

The previous form example can be rewritten like this:

Actions

Some often used actions that you can perform on the page - modifying inputs, clicking links and buttons, etc. have shortcut methods, to make your code more readable and robust.

Here are all these actions:

click_on($selector, array $filters = array()): Find a node using a CSS selector and click on it
click_link($selector, array $filters = array()): Find a node using the "link" locator type and click on it
click_button($selector, array $filters = array()): Find a node using the "button" locator type and click on it
fill_in($selector, $with, array $filters = array()): Find a node using the "field" locator type and set its value with "$with".
choose($selector, array $filters = array()): Choose a specific radio input tag, finding it with the "field" locator type.
check($selector, array $filters = array()): Find a checkbox input tag using the "field" locator and "check" it.
uncheck($selector, array $filters = array()): Find a checkbox input tag using the "field" locator and "uncheck" it.
attach_file($selector, $file, array $filters = array()): Find a file input tag using the "field" locator and set $file to it.
select($select, $option_filters, array $filters = array()): Find a select tag using the "field" locator, and mark one or more of its options as "selected". If $option_filters is a string then the option with the value of the string will be set, otherwise all the options matching the filters will be set. This allows selecting by value, text or even position.
unselect($select, $option_filters, array $filters = array()): The same as "select" but the matched options are "unselected"
hover_on($select, array $filters = array()): Move the mouse over an element, found by css selector
hover_link($select, array $filters = array()): Move the mouse over an element, found by the link locator type
hover_field($select, array $filters = array()): Move the mouse over an element, found by the "field" locator type
hover_button($select, array $filters = array()): Move the mouse over an element, found by the "button" locator type

Using these methods you can make your code very readable. Also all of these actions return $this, allowing you to chain them easily. Consider the previous example in the Finders section - you can rewrite it like this:

A more complicated example is in order. We will be using the following HTML:

Nesting

When there are multiple elements on the page you might want to be more specific, Spiderling allows you to do this by nesting the nodes - you can call all the actions and finders from "within" a node - so that finders will search only in the children on the node.

For example:

Notice the "end()" method - this allows you to return to the previous level and continue your work from there. Also you can nest multiple times without any problem (you will have to use "end()" multiple times too to "get out of" the nesting)

Misc

There are some more additional methods as part of the DSL:

confirm($confirm): If an alert, or confirm dialogs is open on the page you can use this method to dismiss it (by providing FALSE) or approve it (for confirm dialogs, providing TRUE)
execute($script, $callback = NULL): Perform an arbitrary JavaScript on the page, in the context of a given node. You will be able to access it as the first argument of the callback, e.g. arguments[0]. The result of the JavaScript execution will be returned by the method (by passing through JSON serialization). Optionally you could provide a callback, and the result will be the first argument of the callback (the secound will be the node itself)
screenshot($file): Take a screenshot of the current state of the page, placing it in the $file as a PNG image.

Handling AJAX

Spiderling follows the same philosophy as Capybara in that it does not explicitly support or wait for AJAX calls to finish, however each finder does not immidiately conclude failure if the element is not loaded, but waits a bit (default 2 seconds) before throwing an exception. To take advantage of that when writing your crawlers when you have an AJAX request you need to search for the change the AJAX is about to do:

For example:

Drivers

A great strength of Spiderling is the ability to use different drivers for your code. This allows switching from PHP-only curl parsing of the page to a PhantomJS without modification of the code. For example if we wanted to use a PhantomJS driver instead of the default "Simple" one then we'd need to do this:

There are 4 drivers at present:

__Driver_Simple__: Uses PHP curl to load pages. Does not support JavaScript or browser alert dialogs
__Driver_Kohana__: Uses Kohana framework's native Internal Request class, without opening internet connections at all - very performant if your code already uses Kohana framework.
__Driver_Phantomjs__: Start a PhantomJS server. You would need to have PhantomJS installed and accessible in your PATH. Picks a new port at random so its possible to have multiple PhantomJS browsers open simultaneously.

You can easily write your own Drivers by extending the Driver class and implementing methods yourself. Some drivers do not support all the features, so it's OK to not implement every method.

Now for each driver in detail:

Driver_Simple

Loads the HTML page with curl and then parses it using PHP's native DOM and XPath. All finders are quite fast, so it's your best bet to use this if you do not rely on JavaScript or other browser specific features. It's also very easy to extend in order to make a "native" version for a specific web framework - the only thing you need to implement is the loading part, an example of which you can see with the "Driver_Kohana" class.

Before each request $_GET, $_POST and $_FILES are saved, filled in with appropriate values and later restored, mimicking a real PHP request.

Apart from loading the HTML through curl, you could set the content directly, if you've loaded it by other means.

Here's how that looks:

Generally performing post requests yourself is discouraged as they are not supported by all the drivers. But with Driver_Simple you can perform arbitrary requests, for testing API calls for example. This is accomplished directly through the driver like this:

Driver_Kohana

Uses Kohana framework's native Internal Request (slightly modifying it to trick the framework into thinking its an initial request). It extends __Driver_Simple__.

Also it handles redirects capping them to maximum 8 (configurable) and uses Request::$user_agent as its User Agent.

Example Use

Driver_Phantomjs

Using this driver you can perform all the finds and actions with PhantomJS, using a real WebKit engine with JavaScript, without the need for any graphical environment (headless). You need to have it installed in your PATH, accessaible by invoking "phantomjs".

You can download it from here: http://phantomjs.org/download.html

By default it spawns a new server on a random port from 4445 and 5000.

This should work if you have PhantomJS installed.

If you want to start the server from independently, you can modify the PhantomJS connection, you can also set it up to output messages to a log file as well as have, tweak other parameters.

Setting the "pid file" argument on start, allows the driver to save the pid of the phantomjs server process to that file, and then try to clean up the server when started again, thus making sure you don't have running PhantomJS process all over the place.

License

Under BSD-3-Clause license, read LICENSE file.

All versions of spiderling with dependencies

PHP Build Version

Package Version

Version 0.4.0 Release 17. Feb 2020
create-project require 3 people chose require and
0 people chose create-project.

Download

Download latest version of spiderling from vendor openbuildings

Requires php Version ^7.1
symfony/css-selector Version ^2.3|^3.0
openbuildings/environment-backup Version ~0.1.1

Composer command for our command line client (download client) This client runs in each environment. You don't need a specific PHP version etc. The first 20 API calls are free. Standard composer command

The package openbuildings/spiderling contains the following files

Loading the files please wait ....