Download the PHP package deravenedwriter/crawlengine without Composer
On this page you can find all versions of the PHP package deravenedwriter/crawlengine. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.
Package crawlengine
Short Description Crawl Engine is a PHP library that helps to automate the process of logging into password-protected sites and getting needed information from them.
License MIT
Homepage https://github.com/okerefe/crawlengine
Information about the package crawlengine
Read Me
CrawlEngine
Crawl Engine is a PHP library that helps to automate the process of logging into password-protected sites and getting needed information from them. It does this with the help of other great libraries like Guzzle and DomCrawler.
Table Of Contents
- Installation
- Bootstrapping The Engine Class
- Bootstrapping The InputDetail Class
- Getting InputTag Details from a Page Containing a Form
- Resolving Requests with CrawlEngine
Installation
The preferred way of installing CrawlEngine is with Composer, as follows:
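```bash
composer require deravenedwriter/crawlengine
```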
Then ensure your bootstrap file is loading the composer autoloader:
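```php
<?php

require __DIR__ . '/vendor/autoload.php';
```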
Bootstrapping The Engine Class
The Engine class is used for performing most of CrawlEngine's functions. This includes resolving requests, getting form details from pages, and more. The Engine can be initialized roughly as follows (a minimal sketch; the exact namespace is an assumption, so check the package source for the real one):
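```php
<?php

use CrawlEngine\Engine; // namespace is assumed

require __DIR__ . '/vendor/autoload.php';

$engine = new Engine();
```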
Bootstrapping The InputDetail Class
An InputDetail instance describes an input tag of a form. Such an input tag could look like this:
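```html
<input type="text" name="username" placeholder="Enter Username">
```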
The InputDetail class is used to pass field values for a form to the Engine class, and it is also what is returned when the Engine is asked to get the form inputs of a given page.
It contains several properties, including name, which refers to the name of the input in question; type, which refers to the type of the input; and placeholder, which holds its placeholder text.
We can initialize the InputDetail class along these lines (a sketch; the constructor signature shown, name followed by value, is an assumption):
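```php
use CrawlEngine\InputDetail; // namespace is assumed

// Constructor arguments (name, value) are an assumption.
$usernameField = new InputDetail('username', 'JohnDoe');
$passwordField = new InputDetail('password', 'secret');
```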
Getting InputTag Details from a Page Containing a Form
CrawlEngine has a way of accessing websites to analyze the input tags present. Say, for example, a website located at https://example.com/login has a page as shown:
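```html
<form action="/login" method="post">
    <input type="text" name="username" placeholder="Enter Username">
    <input type="password" name="password" placeholder="Enter Password">
    <input type="submit" value="Login">
</form>
```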
We could get an array of all the input tags contained in this page as follows:
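```php
// A sketch: assumes getLoginFields() takes the page URI.
$inputDetails = $engine->getLoginFields('https://example.com/login');

foreach ($inputDetails as $detail) {
    echo $detail->name;        // e.g. "username"
    echo $detail->type;        // e.g. "text"
    echo $detail->placeholder; // e.g. "Enter Username"
}
```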
As noted above, this function returns the input details of the first form found on a page. If there is more than one form, for example:
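```html
<form action="/search" method="get">
    <input type="text" name="query" placeholder="Search...">
</form>
<form action="/login" method="post">
    <input type="text" name="username">
    <input type="password" name="password">
</form>
```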
the function would only return the inputs from the first form element (here, the search form). If you want to return values from the second form, you would pass an additional second argument to the getLoginFields function, as follows:
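```php
// The second argument selects the form; 1-based indexing is assumed.
$inputDetails = $engine->getLoginFields('https://example.com/login', 2);
```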
The above code would fetch form details for the second form on the page.
Resolving Requests with CrawlEngine
To make a request with CrawlEngine, one needs to know a few things about the website being accessed: the URI of the page holding the login form, the URI the form submits to, and the required fields in the form. Say, for example, the login form for a website is located at https://example.com/login and is structured as shown:
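```html
<form action="https://example.com/login" method="post">
    <input type="hidden" name="csrf_token" value="d41d8cd98f00b204">
    <input type="text" name="username" placeholder="Enter Username">
    <input type="password" name="password" placeholder="Enter Password">
    <input type="submit" value="Login">
</form>
```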
Above is what a typical login form might look like. From this form we can see that the URI the form will be submitted to is https://example.com/login, and that we need a valid username and password to log in. We also see that the site generates a CSRF token to validate the request, and that this token is dynamic. You don't have to bother about this field, as CrawlEngine automatically takes care of it. You also don't have to bother about any field that has been pre-filled by the server, unless you wish to change it. When CrawlEngine makes its request, it fetches the form page, records all pre-filled input values, combines them with the ones you give it, and makes the request. So from the page above, we know that we just have to give CrawlEngine a valid username and password. The main function responsible for resolving requests is the resolveRequest method of the Engine class, used as shown below (the exact parameter order is an assumption):
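```php
// A sketch only: resolveRequest()'s exact signature is an assumption.
$fieldDetails = [
    new InputDetail('username', 'JohnDoe'),
    new InputDetail('password', 'secret'),
];

$crawlers = $engine->resolveRequest(
    'https://example.com/login',      // page containing the login form
    $fieldDetails,                    // field values you supply
    ['https://example.com/dashboard'] // contentPagesUri to fetch while logged in
);
```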
That's all you have to do; CrawlEngine does the rest of the magic. It visits the site, takes your given details along with any pre-filled ones found on the site that you didn't overwrite, and submits the form. Then, while logged in like a normal user, it accesses all the contentPagesUri and brings the entire pages back as crawler objects. Let's say, for example, the https://example.com/dashboard page is as follows:
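```html
<html>
    <body>
        <h1>Welcome back, JohnDoe!</h1>
        <p>Here is your private dashboard content.</p>
    </body>
</html>
```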
The resolveRequest function then returns an array of crawlers, one for each of the content pages given. So, for our request above:
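```php
// $crawlers[0] corresponds to https://example.com/dashboard.
$dashboard = $crawlers[0];

// filter() and text() are standard Symfony DomCrawler methods.
echo $dashboard->filter('h1')->text(); // "Welcome back, JohnDoe!"
```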
For more information on crawlers and how to access different values on a page, you can check out the DomCrawler documentation.
By default, CrawlEngine searches for the input fields in the first form it finds on the page containing the form. If there is more than one form on the login page that CrawlEngine will access, like the following:
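```html
<form action="/newsletter" method="post">
    <input type="hidden" name="csrf_token" value="f3a2b1c0">
    <input type="email" name="email" placeholder="Subscribe">
</form>
<form action="https://example.com/login" method="post">
    <input type="hidden" name="csrf_token" value="a9b8c7d6">
    <input type="text" name="username">
    <input type="password" name="password">
</form>
```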
then by default CrawlEngine would reference the first form, so the CSRF token and other pre-filled inputs would come from the first form (here, the newsletter form). If you wish to specify that the request is for the second form, you can do so by adding an extra parameter to the resolveRequest method, as follows:
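```php
// The trailing argument selects the form; its position and
// 1-based indexing are assumptions.
$crawlers = $engine->resolveRequest(
    'https://example.com/login',
    $fieldDetails,
    ['https://example.com/dashboard'],
    2 // use the second form on the page
);
```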
The above tells CrawlEngine that you are not referring to the first form on the page but the second one.
All versions of crawlengine with dependencies
- symfony/css-selector Version ^5.1
- guzzlehttp/guzzle Version ^7.0