Download the PHP package webcrawlerapi/sdk without Composer
On this page you can find all versions of the PHP package webcrawlerapi/sdk. These versions can be downloaded and installed without Composer; possible dependencies are resolved automatically.
Package: sdk
Short Description: A PHP SDK for WebCrawler API - turn websites into data
License: MIT
Homepage: https://github.com/webcrawlerapi/webcrawlerapi-php-sdk
Information about the package sdk
WebCrawler API PHP SDK
A PHP SDK for interacting with the WebCrawlerAPI - a powerful web crawling and scraping service.
To use the API, you need an API key from WebCrawlerAPI.
Read the documentation at WebCrawlerAPI Docs for more information.
Requirements
- PHP 8.0 or higher
- Composer
- ext-json PHP extension
- Guzzle HTTP Client 7.0 or higher
Installation
You can install the package via composer:
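```bash
composer require webcrawlerapi/sdk
```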
Usage
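The exact client class name and constructor signature are not shown on this page, so the following is a minimal sketch. It assumes the package is autoloaded via Composer, that the SDK exposes a WebCrawlerAPI\WebCrawlerAPI client constructed with your API key, and that the crawl parameters documented below are passed as named arguments.

```php
<?php

require 'vendor/autoload.php';

use WebCrawlerAPI\WebCrawlerAPI;

// Class name and constructor are assumptions; check the SDK source for the exact API.
$client = new WebCrawlerAPI('your_api_key');

// Start a crawl and wait for it to complete (see crawl() below).
$job = $client->crawl(
    url: 'https://example.com',
    scrapeType: 'markdown',
    itemsLimit: 10
);

echo "Job {$job->id} finished with status {$job->status}\n";
```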
API Methods
crawl()
Starts a new crawling job and waits for its completion. This method will continuously poll the job status until:
- The job reaches a terminal state (done, error, or cancelled)
- The maximum number of polls is reached (default: 100)
The polling interval is determined by the server's recommendedPullDelayMs or defaults to 5 seconds.
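A sketch of a blocking crawl, reusing the assumed client from the Usage example above (class name and named-argument style are assumptions):

```php
// Blocks until the job is done, errored, cancelled, or maxPolls is exhausted.
$job = $client->crawl(
    url: 'https://example.com',
    scrapeType: 'cleaned',
    itemsLimit: 20,
    maxPolls: 50      // stop polling after 50 status checks instead of the default 100
);

echo $job->status . "\n";   // "done", "error", or "cancelled"
```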
crawlAsync()
Starts a new crawling job and returns immediately with a job ID. Use this when you want to handle polling and status checks yourself, or when using webhooks.
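A sketch, again with the assumed client; the webhook URL is a hypothetical placeholder:

```php
// Returns immediately; only the job ID is available at this point.
$response = $client->crawlAsync(
    url: 'https://example.com',
    webhookUrl: 'https://yourserver.com/crawl-webhook'  // hypothetical endpoint
);

echo "Started job {$response->id}\n";
```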
getJob()
Retrieves the current status and details of a specific job.
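A sketch, assuming getJob() takes the job ID returned by crawlAsync():

```php
$job = $client->getJob($response->id);

echo "Status: {$job->status}\n";
echo "Items so far: " . count($job->jobItems) . "\n";
```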
cancelJob()
Cancels a running job. Any items that are not in progress or already completed will be marked as canceled and will not be charged.
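A sketch, assuming cancelJob() likewise takes the job ID:

```php
// Items not yet processed are marked as canceled and are not charged.
$client->cancelJob($job->id);
```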
Parameters
Crawl Methods (crawl and crawlAsync)
- url (required): The seed URL where the crawler starts. Can be any valid URL.
- scrapeType (default: "html"): The type of scraping you want to perform. Can be "html", "cleaned", or "markdown".
- itemsLimit (default: 10): The crawler will stop when it reaches this limit of pages for this job.
- webhookUrl (optional): The URL where the server will send a POST request once the task is completed.
- allowSubdomains (default: false): If true, the crawler will also crawl subdomains.
- whitelistRegexp (optional): A regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.
- blacklistRegexp (optional): A regular expression to blacklist URLs. URLs that match the pattern will be skipped.
- maxPolls (optional, crawl only): Maximum number of status checks before returning (default: 100). An example combining several of these parameters follows this list.
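Putting several parameters together in one call (a sketch; the regular-expression values and webhook URL are hypothetical, and the named-argument style follows the assumption made in the Usage example):

```php
$job = $client->crawl(
    url: 'https://example.com',
    scrapeType: 'markdown',
    itemsLimit: 50,
    allowSubdomains: true,
    whitelistRegexp: '/blog/',       // hypothetical pattern: only crawl blog URLs
    blacklistRegexp: '\?page=',      // hypothetical pattern: skip paginated URLs
    webhookUrl: 'https://yourserver.com/crawl-webhook'  // hypothetical endpoint
);
```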
Responses
CrawlAsync Response
The crawlAsync() method returns a CrawlResponse object with:
- id: The unique identifier of the created job
Job Response
The Job object contains detailed information about the crawling job:
- id: The unique identifier of the job
- orgId: Your organization identifier
- url: The seed URL where the crawler started
- status: The status of the job (new, in_progress, done, error)
- scrapeType: The type of scraping performed
- createdAt: The date when the job was created
- finishedAt: The date when the job was finished (if completed)
- webhookUrl: The webhook URL for notifications
- webhookStatus: The status of the webhook request
- webhookError: Any error message if the webhook request failed
- jobItems: Array of JobItem objects representing crawled pages
- recommendedPullDelayMs: Server-recommended delay between status checks
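For example, inspecting a job using the field names listed above (a sketch; $jobId stands in for an ID obtained from crawlAsync()):

```php
$job = $client->getJob($jobId);

echo "Job {$job->id} for {$job->url}\n";
echo "Status: {$job->status} ({$job->scrapeType})\n";

if ($job->status === 'done') {
    echo 'Crawled ' . count($job->jobItems) . " pages\n";
}
```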
JobItem Properties
Each JobItem object represents a crawled page and contains:
- id: The unique identifier of the item
- jobId: The parent job identifier
- job: Reference to the parent Job object
- originalUrl: The URL of the page
- pageStatusCode: The HTTP status code of the page request
- status: The status of the item (new, in_progress, done, error)
- title: The page title
- createdAt: The date when the item was created
- cost: The cost of the item in $
- referredUrl: The URL where the page was referred from
- lastError: Any error message if the item failed
- getContent(): Method to get the page content based on the job's scrapeType (html, cleaned, or markdown). Returns null if the item's status is not "done" or if content is not available. Content is automatically fetched and cached when accessed.
- rawContentUrl: URL to the raw content (if available)
- cleanedContentUrl: URL to the cleaned content (if scrapeType is "cleaned")
- markdownContentUrl: URL to the markdown content (if scrapeType is "markdown")
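Iterating over a finished job's items and reading their content (a sketch using the properties listed above):

```php
foreach ($job->jobItems as $item) {
    if ($item->status !== 'done') {
        continue;
    }

    echo "{$item->title} ({$item->originalUrl}) - HTTP {$item->pageStatusCode}\n";

    // getContent() fetches and caches the content in the job's scrapeType format;
    // it returns null if the content is not available.
    $content = $item->getContent();
    if ($content !== null) {
        echo substr($content, 0, 200) . "\n";
    }
}
```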
License
MIT License