Package: unique/scraper
Short description: An abstract component for creating web scrapers
License: MIT
Scraper
This is a helper component that eases the creation of custom website scrapers. It implements the basic logic of iterating listing pages and downloading the items. In order to use it, you must first implement your own ItemListDownloader (by extending AbstractItemListDownloader) and ItemDownloader (by extending AbstractItemDownloader or AbstractJsonItemDownloader) for your particular website.
Installation
This package requires PHP >= 7.4. To install the component, use composer:
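The install command that presumably followed here was stripped; the standard composer form for this package would be:

```shell
composer require unique/scraper
```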
Usage
In order to use this, you must first implement your own ItemListDownloader and ItemDownloader for your particular website.
Most scraping use cases (at least mine) consist of iterating a listing and scraping the items it contains. Maybe one day, as the need arises, I will expand it, but for now the scraper follows this approach.
So, let's assume we have an ad website that has a list of ads. The listing is divided into however many pages, and each page has 20 ads. We need to scrape all the ads.
We first create a class that will represent our scraped Ad. It must implement SiteItemInterface.
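The original code sample was lost here; as a rough sketch, an Ad class might look like the following. The namespace of SiteItemInterface and the exact methods it requires are assumptions, so treat the accessors below as placeholders and check the interface definition in the source:

```php
<?php

use unique\scraper\interfaces\SiteItemInterface;    // namespace assumed

class Ad implements SiteItemInterface {

    protected ?string $id = null;
    protected ?string $url = null;
    protected ?string $title = null;
    protected ?float $price = null;

    // Hypothetical accessors - the real SiteItemInterface may require different methods.
    public function setId( ?string $id ): void { $this->id = $id; }
    public function getId(): ?string { return $this->id; }

    public function setUrl( ?string $url ): void { $this->url = $url; }
    public function getUrl(): ?string { return $this->url; }

    public function setTitle( ?string $title ): void { $this->title = $title; }
    public function setPrice( ?float $price ): void { $this->price = $price; }
}
```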
We then implement ItemListDownloader:
Then we create a downloader for the ad itself:
Or you could extend AbstractJsonItemDownloader, if the ad data is fetched as JSON.
So that takes care of scraping. All that's left now is to create, for example, a command script that initiates the scraping.
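A minimal sketch of such a script follows. The constructor arguments of the list downloader and the name of the method that starts the scraping are assumptions (the original example was stripped), so verify both against the source:

```php
<?php

use GuzzleHttp\Client;

// AdListDownloader and AdDownloader are the classes implemented above.
// Constructor arguments are assumed - check AbstractItemListDownloader for the real signature.
$downloader = new AdListDownloader( new Client(), new AdDownloader() );

// Hypothetical entry point: start iterating the listing from its first page.
$downloader->scrape( 'https://www.example.com/ads' );
```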
You can use the optional LogContainerConsole for logging to the console. It relies on two methods, stdOut() and stdErr(), which you need to implement yourself.
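A console logger might then be sketched like this. The method signatures and visibility are assumptions; only the method names stdOut() and stdErr() come from the documentation above:

```php
<?php

use unique\scraper\LogContainerConsole;    // namespace assumed

class MyConsoleLogger extends LogContainerConsole {

    // Signature assumed: write a message to standard output.
    public function stdOut( string $message ): void {
        echo $message . PHP_EOL;
    }

    // Signature assumed: write a message to standard error.
    public function stdErr( string $message ): void {
        fwrite( STDERR, $message . PHP_EOL );
    }
}
```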
Documentation
Events
You can subscribe to various events triggered by the AbstractItemListDownloader by using the on( string $event_name, callable $handler ) method. Each handler will receive an EventObject, which depends on the event type:
on_list_begin
The event object will be ListBeginEvent. This is a "breakable" event (read on to find out more).
Methods:
getPageNum(): int - returns the page number.
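For example, since ListBeginEvent is breakable, a handler can stop the scraper after a fixed number of pages. A sketch, assuming $downloader is your AbstractItemListDownloader instance and the event class lives in an events namespace:

```php
use unique\scraper\events\ListBeginEvent;    // namespace assumed

$downloader->on( 'on_list_begin', function ( ListBeginEvent $event ) {
    // Stop scraping after page 10.
    if ( $event->getPageNum() > 10 ) {
        $event->break();
    }
} );
```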
on_list_end
The event object will be ListEndEvent.
Methods:
getItemCount(): ItemCount - returns information about the page number, page size and total amount of items.
willContinue(): bool - returns true, if the scraper will continue to the next page.
on_item_begin
The event object will be ItemBeginEvent. This is a "breakable" event (read on to find out more).
Methods:
getId(): string - returns the id of the item.
getUrl(): string - returns the url of the item.
getDomElement(): \DOMElement - returns the corresponding \DOMElement.
on_item_end
The event object will be ItemEndEvent.
Methods:
getItemCount(): ItemCount - returns information about the page number, page size and total amount of items.
getState(): int - one of the state constants found in AbstractItemListDownloader::STATE_*.
getSiteItem(): ?SiteItemInterface - if no errors were found, provides the data of the scraped item.
getDomElement(): \DOMElement - returns the corresponding \DOMElement.
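This is the natural place to persist scraped items. A sketch, assuming the event class lives in an events namespace:

```php
use unique\scraper\events\ItemEndEvent;    // namespace assumed

$downloader->on( 'on_item_end', function ( ItemEndEvent $event ) {
    $item = $event->getSiteItem();
    if ( $item !== null ) {
        // No errors occurred - persist the scraped item here.
    }
} );
```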
on_item_missing_url
The event object will be ItemMissingUrlEvent.
Methods:
getUrl(): ?string - returns the url of the item.
setUrl( ?string $url ) - allows a handler to set a new url.
getDomElement(): \DOMElement - returns the corresponding \DOMElement.
on_break_list
The event object will be BreakListEvent.
Methods:
getCausingEvent(): ?EventObjectInterface - returns the event object that instructed the scraper to break scraping of the list.
Breakable events
These are events that implement BreakableEventInterface and can instruct the scraper either to abort processing of the current item or to abort scraping of the entire list. In PHP terms, they correspond to continue and break in a while loop.
So a breakable event object implements these methods:
shouldSkip(): bool - returns true, if the list item should be skipped.
shouldBreak(): bool - returns true, if the scraping of the list should abort.
continue() - instructs the scraper to proceed with the item.
skip() - instructs the scraper to skip the current item, but proceed with the list.
break() - instructs the scraper to abort the list and stop scraping.
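For example, an on_item_begin handler could skip items that have already been scraped. A sketch; alreadyScraped() is a hypothetical lookup in your own storage, and the event class namespace is assumed:

```php
use unique\scraper\events\ItemBeginEvent;    // namespace assumed

$downloader->on( 'on_item_begin', function ( ItemBeginEvent $event ) {
    // alreadyScraped() is a hypothetical helper checking your own storage.
    if ( alreadyScraped( $event->getId() ) ) {
        $event->skip();    // skip this item, but continue with the rest of the list
    }
} );
```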
More Documentation
For more details on what each and every method does, check out the source code; it should be pretty clearly documented.
Dependencies
ext-json: *
symfony/dom-crawler: ^5.0
unique/events: ^1.0
guzzlehttp/guzzle: ^7.2.0