Download the PHP package baraja-core/webcrawler without Composer
On this page you can find all versions of the php package baraja-core/webcrawler. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Package webcrawler
Short Description Simple package to load a list of URLs and build a sitemap.
License MIT
Homepage https://github.com/baraja-core/webcrawler
Information about the package webcrawler
Web crawler
A simple library for crawling websites by following links, with minimal dependencies.
📦 Installation
It's best to use Composer for installation, and you can also find the package on Packagist and GitHub.
To install, simply run the command composer require baraja-core/webcrawler.
You can use the package manually by creating an instance of the internal classes, or register a DIC extension to link the services directly to the Nette Framework.
How to use
The crawler can run without dependencies.
With the default settings, create an instance and call the crawl() method.
The $result variable will then hold an entity of type CrawledResult.
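The default usage described above can be sketched as follows. This is a minimal sketch, not the package's own listing: it assumes the Crawler class lives in the Baraja\WebCrawler namespace (not stated on this page) and that Composer's autoloader has already been loaded. The helper name crawlSite is hypothetical.

```php
<?php

declare(strict_types=1);

// Assumption: vendor/autoload.php has already been required by the application.
// Assumption: the Crawler class lives in the Baraja\WebCrawler namespace.
use Baraja\WebCrawler\Crawler;

/**
 * Hypothetical helper: crawl a site with the default settings.
 * The returned object is expected to be a CrawledResult entity.
 */
function crawlSite(string $startUrl): object
{
    $crawler = new Crawler(); // no configuration is needed for the defaults

    return $crawler->crawl($startUrl);
}

// Usage (performs real HTTP requests; the URL is a placeholder):
// $result = crawlSite('https://example.com');
```

Wrapping the call in a small function keeps the network access in one place, which makes it easier to swap the start URL or add configuration later.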
Advanced checking of multiple URLs
In a real-world case you often need to download multiple URLs within a single domain and check whether some specific URLs work.
Simple example:
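A minimal sketch of such a check, under stated assumptions: the method name crawlList() and its signature (a starting URL plus an array of additional URLs) are not confirmed on this page, and neither is the Baraja\WebCrawler namespace. The helper name crawlWithChecks and the URLs are hypothetical.

```php
<?php

declare(strict_types=1);

// Assumption: vendor/autoload.php has already been required by the application.
// Assumption: Crawler lives in Baraja\WebCrawler and offers crawlList(string, array).
use Baraja\WebCrawler\Crawler;

/**
 * Hypothetical helper: crawl from a starting URL and additionally
 * check a list of specific URLs on the same domain.
 *
 * @param string[] $extraUrls
 */
function crawlWithChecks(string $startUrl, array $extraUrls): object
{
    $crawler = new Crawler();

    return $crawler->crawlList($startUrl, $extraUrls);
}

// Usage (performs real HTTP requests; the URLs are placeholders):
// $result = crawlWithChecks('https://example.com', [
//     'https://example.com/error-404',
//     'https://example.com/contact',
// ]);
```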
Notice: The robots.txt file and the sitemap will be downloaded automatically if they exist.
Settings
In the constructor of the Crawler service you can define your project-specific configuration.
Pass it as a key-value array. No value is required.
Configuration options:
Option | Default value | Possible values |
---|---|---|
followExternalLinks | false | Bool: Follow links outside the given domain? (false stays within the given domain.) |
sleepBetweenRequests | 1000 | Int: Sleep between requests, in milliseconds. |
maxHttpRequests | 1000000 | Int: Crawler budget limit (maximum number of HTTP requests). |
maxCrawlTimeInSeconds | 30 | Int: Stop crawling when this time limit is exceeded. |
allowedUrls | ['.+'] | String[]: List of valid regular expressions describing allowed URL formats. |
forbiddenUrls | [''] | String[]: List of valid regular expressions describing banned URL formats. |
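To illustrate how the options above might be combined, here is a minimal sketch. It assumes the constructor accepts a Config object from the Baraja\WebCrawler namespace wrapping the key-value array; that class and the namespace are assumptions, not confirmed on this page. The helper names and the chosen values are hypothetical, and the forbiddenUrls pattern is only illustrative because the way patterns are anchored is not specified here.

```php
<?php

declare(strict_types=1);

// Assumption: vendor/autoload.php has already been required by the application.
// Assumption: both classes live in the Baraja\WebCrawler namespace.
use Baraja\WebCrawler\Config;
use Baraja\WebCrawler\Crawler;

/**
 * Hypothetical helper: options for a short, polite crawl of one domain.
 * Keys are taken from the configuration table; values are examples only.
 *
 * @return array<string, mixed>
 */
function politeCrawlOptions(): array
{
    return [
        'followExternalLinks' => false,   // stay within the given domain
        'sleepBetweenRequests' => 2000,   // milliseconds
        'maxHttpRequests' => 200,         // crawler budget limit
        'maxCrawlTimeInSeconds' => 60,    // stop after one minute
        'allowedUrls' => ['.+'],          // same as the default
        'forbiddenUrls' => ['.+\.pdf'],   // illustrative: skip PDF files
    ];
}

/**
 * Hypothetical helper: build a crawler from a key-value options array.
 *
 * @param array<string, mixed> $options
 */
function createConfiguredCrawler(array $options): Crawler
{
    // Assumption: the constructor takes a Config wrapping the options array.
    return new Crawler(new Config($options));
}

// Usage (performs real HTTP requests once crawl() is called):
// $crawler = createConfiguredCrawler(politeCrawlOptions());
// $result = $crawler->crawl('https://example.com');
```

Keeping the options in a plain array makes it easy to override only the values that differ from the defaults.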
📄 License
baraja-core/webcrawler is licensed under the MIT license. See the LICENSE file for more details.