Package: shel/crawler
Short description: Allows crawling of sitemaps and node-trees
License: GPL-3.0
Shel.Crawler for Neos CMS
A crawler for Neos CMS nodes and sites. It can be used to warm up caches after a release or to dump your site as HTML files.
Installation
Run the following command in your project:
composer require shel/crawler
Usage
To crawl all pages based on a single sitemap, run:
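As a sketch, assuming a Flow CLI command named crawler:crawlsitemap with a --url option (both names are unverified; ./flow help lists the actual commands):

./flow crawler:crawlsitemap --url https://example.com/sitemap.xml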
To crawl all pages based on all sitemaps listed in a robots.txt file, run:
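Again a sketch, with an assumed crawler:crawlsitemaps command that takes the robots.txt URL:

./flow crawler:crawlsitemaps --url https://example.com/robots.txt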
Node-based crawling
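As a sketch, assuming a crawler:crawlnodes command with a --siteNodeName option (both names are unverified):

./flow crawler:crawlnodes --siteNodeName my-site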
This command will try to generate all page HTML without issuing actual requests, rendering the pages internally instead. Due to the complexity of the page context, this might not give the desired results, but the resulting HTML of all crawled pages can be stored for further use.
This can be much faster, as all pages are rendered in one process and all caches are reused.
To make this work, you need to provide a valid hostname. This can be done in one of the following ways:
- have an active domain set up for the site (recommended; the crawler will use the first active domain)
- set the Neos.Flow.http.baseUri setting for Neos in your Settings.yaml (see the sketch after this list)
- provide the baseUri via the environment variable CRAWLER_BASE_URI and use the example in Configuration/Production/Settings.yaml
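For the Settings.yaml option, a minimal sketch (Neos.Flow.http.baseUri is the standard Flow setting; the hostname is a placeholder):

    Neos:
      Flow:
        http:
          baseUri: 'https://example.com/'

For the environment-variable option, Configuration/Production/Settings.yaml can point the same setting at CRAWLER_BASE_URI via Flow's %env:…% placeholder syntax:

    Neos:
      Flow:
        http:
          baseUri: '%env:CRAWLER_BASE_URI%'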
To crawl all sites based on their primary active domain:
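As a sketch, assuming a crawler:crawlsites command (name unverified):

./flow crawler:crawlsites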
To crawl all sites based on their primary active domain and use the URLs listed in robots.txt:
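A sketch again; the option for switching to the robots.txt URL list is an assumption:

./flow crawler:crawlsites --useRobotsTxt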
Experimental static file cache
By providing the outputPath option, you can store all crawled content as HTML files.
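For example, reusing the assumed crawler:crawlnodes sketch from above (the --outputPath option name and the target directory are also assumptions):

./flow crawler:crawlnodes --siteNodeName my-site --outputPath ./Web/cache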
You can actually use this as a super simple static file cache by adapting your webserver configuration. Here is an example for nginx:
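A minimal sketch, assuming the crawler wrote its files into a cache directory inside the document root; adapt the paths and fallbacks to your setup:

    location / {
        # Serve the pre-rendered file if the crawler produced one,
        # otherwise fall back to the regular Neos/Flow entry point.
        try_files /cache$uri/index.html $uri $uri/ /index.php?$args;
    }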
Replace the existing try_files part with the given code and adapt the cache path if you use a different one.
This cache feature is really experimental, and you are currently in charge of keeping the files up to date and removing old ones. Known limitations:
- Doesn't clear cache
- Doesn't update automatically on publish
- Ignores Fusion caching configuration
- Shortcuts are ignored (open TODO)
Contributing
Contributions or sponsorships are very welcome.
Dependencies
- php: >=7.4
- chuyskywalker/rolling-curl: ~3.1
- ext-curl: *
- ext-simplexml: *
- ext-libxml: *