Download the PHP package hashbangcode/sitemap_checker without Composer
On this page you can find all versions of the php package hashbangcode/sitemap_checker. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Package: sitemap_checker
Short Description: A PHP library used to download, parse and crawl sitemap.xml files.
License: MIT
Information about the package sitemap_checker
Sitemap Checker
A PHP library used to download, parse and crawl sitemap.xml files.
If the sitemap.xml file is gzipped then the file is unzipped and the contents read.
Installation
Download the PHP files to a directory and run composer install. This will set up everything needed for the application to run.
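As a sketch, assuming the code is obtained from the hashbangcode/sitemap_checker repository on GitHub (adjust the URL to wherever you downloaded the files from), the setup looks like this:
git clone https://github.com/hashbangcode/sitemap_checker.git
cd sitemap_checker
composer install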
Usage
To run the application on the command line, use the following.
php application.php sitemap-checker:run https://www.example.com/sitemap.xml
This will download the sitemap.xml file and report how many URLs were detected before starting to crawl them.
You can also use a shorthand to run the same command:
php application.php sc:run https://www.example.com/sitemap.xml
Once the crawling has finished the command will print out the results.
Options
A number of options exist for this tool.
Result Output
To change what type of result is returned you can supply the --result-file option, or -r for short.
The following rendering types are available:
- Plain, the default renderer (prints the results line by line).
- CSV
- JSON
- XML
- HTML
For example, to output the results as a CSV file you can supply the option like this.
php application.php sc:run -r results.csv https://www.example.com/sitemap.xml
This will automatically trigger the CSV rendering pathway and write the results to a CSV file.
Without this option in place the tool will print results to the command line using "plain" rendering.
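Assuming the renderer is chosen from the file extension of the supplied filename in the same way, the other formats can be requested by changing the extension. For example, to produce a JSON report:
php application.php sc:run -r results.json https://www.example.com/sitemap.xml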
Limit
The --limit option (or -l for short) prevents the tool from checking any more URLs than this limit.
For example, this will only process 10 results, regardless of the number of URLs found.
php application.php sc:run -l 10 https://www.example.com/sitemap.xml
Engine
The --engine option (or -e for short) changes the type of checking engine used.
Options are:
- 'guzzle' : (Default) Run the sitemap checker using Guzzle promises.
- 'chrome' : Run the sitemap checker using headless Chrome. To get this running you'll first need to add the chrome binary to the location ./chrome/chrome (i.e. within the package).
For example, to change the sitemap checker engine to use headless Chrome, use the following.
php application.php sc:run -e chrome https://www.example.com/sitemap.xml
Exclude
Pass a list of URLs to exclude using the --exclude flag (or -x for short). This prevents those URLs from being added to the collection and checked. The value is a comma separated list of URLs to exclude, and wildcards can be used to exclude whole sections of a site.
Some examples:
To prevent the path https://www.example.com/some-page from being used:
php application.php sc:run https://www.example.com/ --exclude='https://www.example.com/some-page'
To prevent anything in https://www.example.com/sub-dir1 and https://www.example.com/sub-dir2 from being used:
php application.php sc:run https://www.example.com/ --exclude='https://www.example.com/sub-dir1/*,https://www.example.com/sub-dir2/*'
To prevent anything on the external site https://www.example2.org from being used:
php application.php sc:run https://www.example.com/ --exclude='https://www.example2.org/*'
Robots
Pass a robots.txt file to the script using --robots (or -t for short). This will download and parse the robots.txt file into a set of exclusion rules.
php application.php sc:run https://www.example.com/ --robots='https://www.example.org/robots.txt'
Note that the robots.txt parser will only use rules that have been created for the User-agent string *.
This option can be used in conjunction with the exclude flag to add to the spider exclusion rules.
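For example, both flags can be supplied in the same run so that the robots.txt rules and a manual exclusion both apply (the excluded path below is purely illustrative):
php application.php sc:run https://www.example.com/ --robots='https://www.example.com/robots.txt' --exclude='https://www.example.com/private/*'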
Testing
Run ./vendor/bin/phpunit to run the PHPUnit tests. All web requests are mocked within the unit tests.
Run composer run test-coverage to run the unit tests and produce a code coverage report. This report is added to the .build directory in the root of the application.
For the coverage report you need to add the following to your xdebug.ini configuration file.
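Assuming Xdebug 3 is being used, enabling coverage mode is typically enough:
xdebug.mode=coverage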
Example Using Classes
To use the classes from this project independently of the console application, do the following.
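As a rough sketch only (the SitemapChecker class and method names below are hypothetical placeholders rather than the package's confirmed API; consult the package source for the real names), independent usage might look something like this:
<?php
// Rough sketch: the SitemapChecker class names below are hypothetical
// placeholders; check the package source for the actual classes.
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client();

// Hypothetical source object that downloads and parses the sitemap.xml
// into a collection of URLs.
$source = new \HashbangCode\SitemapChecker\Source\XmlSitemapSource($client);
$urls = $source->parse('https://www.example.com/sitemap.xml');

// Hypothetical crawler that checks each URL (the Guzzle promises engine).
$crawler = new \HashbangCode\SitemapChecker\Crawler\GuzzlePromiseCrawler($client);
$results = $crawler->crawl($urls);

// Print each URL with its response code.
foreach ($results as $result) {
    echo $result->getUrl() . ' ' . $result->getResponseCode() . PHP_EOL;
}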
To Do
There's still lots to do.
- Add a way to auto-download the Chrome binary.
- Look into using a database or message queue.
- Add Docker container to wrap application.
- Batch processing of URLs (i.e. don't do everything in one go).
- Pick better name for application.
- Add ability to add session cookies for authenticated spidering.
All versions of sitemap_checker with dependencies
- guzzlehttp/guzzle: ^7.5
- symfony/console: ^6.2
- symfony/http-kernel: ^6.2
- symfony/config: ^6.2
- symfony/dependency-injection: ^6.2
- symfony/yaml: ^6.2
- guzzlehttp/promises: ^2.0
- symfony/framework-bundle: ^6.2
- chrome-php/chrome: ^1.8