Package: advance-phpscraper
Short Description: Advanced PHP web scraping library with plugin support
License: MIT
Advance PHP Scraper
Advance PHP Scraper is a powerful, modular, and extensible PHP library designed for web scraping. It simplifies extracting data from websites, such as links, images, meta tags, structured data, and more, while offering advanced features like plugin support, rate limiting, and asynchronous scraping. Whether you're a beginner or an experienced developer, this library provides a flexible and user-friendly interface to scrape web content efficiently.
This document is crafted to be beginner-friendly, with detailed explanations and examples to help you get started, even if you're new to PHP or web scraping. By the end, you'll know how to install, use, and extend the library with ease.
Table of Contents
- What is Advance PHP Scraper?
- Why Use This Library?
- Who Should Use It?
- Key Features
- Core Scraping Features
- Advanced Features
- Plugin System
- Getting Started
- Prerequisites
- Installation
- Verifying Installation
- Basic Usage: Your First Scrape
- Scraping a Simple Website
- Extracting Links
- Extracting Images
- Extracting Meta Tags
- Using the Command-Line Interface (CLI)
- Intermediate Usage: Leveling Up
- Scraping Sitemaps
- Scraping RSS Feeds
- Parsing Assets (CSV, JSON, XML)
- Checking HTTP Status Codes
- Advanced Usage: Power User Mode
- Rate Limiting: Playing Nice with Servers
- Queue System: Scraping Multiple URLs
- API Integration: Combining Scraping with APIs
- Custom CSS Selectors
- Plugins: Supercharging Your Scraper
- What Are Plugins?
- Available Plugins
- How to Use Plugins
- Learn More About Plugins
- Configuration: Customizing Your Scraper
- Setting User Agent
- Adjusting Timeout
- Following Redirects
- Using Constructor Configuration
- Testing: Ensuring Everything Works
- Running Tests
- Writing Your Own Tests
- Troubleshooting: Solving Common Problems
- Installation Issues
- Scraping Errors
- Plugin Problems
- Contributing: Joining the Community
- License: Understanding Usage Rights
- Resources: Further Learning
What is Advance PHP Scraper?
Advance PHP Scraper is a PHP library that helps you extract data from websites, like a super-smart librarian who can quickly find and summarize books for you. Web scraping is like copying information from a webpage (e.g., product names, prices, or blog titles) using code instead of manually copying and pasting. This library makes it easy to navigate websites, grab specific data, and even handle tricky tasks like scraping JavaScript-heavy pages or processing thousands of URLs at once.
Imagine you’re at a giant library (the internet), and you need to collect all book titles from a specific shelf (a website). Doing this by hand would take forever, but Advance PHP Scraper is like a magical robot that does it for you in seconds. It’s designed to be:
- Easy: Simple commands to get data, even if you’re new to coding.
- Powerful: Handles complex tasks like async scraping or cloud deployment.
- Flexible: Add your own features using plugins, like customizing a Lego set.
Why Use This Library?
There are other scraping tools out there, but here’s why Advance PHP Scraper is special:
- Beginner-Friendly: The code is straightforward, and this guide explains everything like you’re five.
- Modular: Only use the features you need, keeping your project lightweight.
- Robust: Built-in error handling, logging, and rate limiting prevent crashes or bans.
- Extensible: Plugins let you add custom features without touching the core code.
- Free and Open-Source: Use it, modify it, share it—under the MIT License.
Who Should Use It?
- New Coders: If you’re learning PHP and want to try web scraping, this is a great starting point.
- Hobbyists: Want to scrape your favorite blog’s headlines or collect product prices? This is for you.
- Professionals: Need to scrape thousands of pages for data analysis? The library’s advanced features have you covered.
- Educators: Teaching PHP or web scraping? Use this library for hands-on examples.
Key Features
Let’s explore what Advance PHP Scraper can do. Think of these features as tools in a toolbox, each designed for a specific job.
Core Scraping Features
These are the basic tools you’ll use most often:
- Extract Common Data:
  - Links: Grab all `<a>` tags (e.g., URLs and their text).
  - Images: Collect `<img>` tags (e.g., source URLs and alt text).
  - Meta Tags: Extract `<meta>` tags (e.g., description, Open Graph data).
  - Headings: Get `<h1>` to `<h6>` tags for page structure.
  - Paragraphs: Pull `<p>` tag content for text.
  - Structured Data: Extract JSON-LD, Microdata, and RDFa (e.g., schema.org data).
- Sitemap Parsing: Read XML sitemaps to discover all pages on a site.
- RSS Feed Parsing: Extract news or blog feeds.
- Asset Parsing: Process CSV, JSON, or XML files linked on pages.
- Custom Selectors: Use CSS selectors to target specific elements (e.g., `div.content`).
Advanced Features
These tools are for power users:
- Rate Limiting: Control how fast you scrape to avoid server bans (like driving at the speed limit).
- Queue System: Scrape multiple URLs in batches, like a to-do list for your scraper.
- API Integration: Combine scraped data with external APIs (e.g., fetch product details).
- CLI Interface: Run scraping tasks from the command line, perfect for quick jobs.
- Multilingual Support: Handle non-English text with proper encoding (e.g., Spanish, Chinese).
- Error Handling: Logs errors and checks HTTP status codes to keep scraping smooth.
Plugin System
Plugins are like optional upgrades for your toolbox:
- Headless Browsing: Scrape JavaScript-rendered pages (e.g., React apps).
- Async Scraping: Scrape multiple pages at once for speed.
- NLP Analysis: Extract keywords and entities from text.
- PDF Parsing: Read text from linked PDFs.
- Caching: Save scraped data to reduce server load.
- Cloud Deployment: Run scraping tasks on AWS Lambda.
- Custom Plugins: Add your own features (e.g., custom logging).
Getting Started
Let’s set up the library and run your first scrape. This section is like a cooking recipe: follow each step, and you’ll have a working scraper in no time.
Prerequisites
Before you start, you need:
- PHP 7.4 or Higher: The library works with PHP 7.4, 8.0, or 8.1. If your version is lower, download a newer one from php.net.
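You can check which PHP version is installed from a terminal:

```shell
php -v
```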
- Composer: A tool to manage PHP dependencies (like a grocery delivery service for code). Install it from getcomposer.org.
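On Linux/macOS, the quickest route is the official installer script (see getcomposer.org for the current, checksum-verified instructions):

```shell
# Download and run the Composer installer, then make it available globally
php -r "copy('https://getcomposer.org/installer', 'composer-setup.php');"
php composer-setup.php
php -r "unlink('composer-setup.php');"
sudo mv composer.phar /usr/local/bin/composer
composer --version   # confirm it works
```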
- A Text Editor: Use VS Code, Sublime Text, or any editor to write PHP code.
- Internet Connection: Needed to download the library and scrape websites.
Installation
Here’s how to install the library:
- Create a Project Folder: Make a new directory for your scraping project.
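For example (the folder name `my-scraper` is just a placeholder):

```shell
mkdir my-scraper
cd my-scraper
```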
- Install Advance PHP Scraper: Use Composer to download the library and its dependencies.
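The package name matches the GitHub repository:

```shell
composer require rajpurohithitesh/advance-phpscraper
```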
This creates a `vendor/` folder with the library and dependencies like `symfony/browser-kit` and `guzzlehttp/guzzle`.
- Check the Files: After installation, you’ll see:
  - `vendor/`: Contains the library and dependencies.
  - `composer.json`: Lists the project’s dependencies.
  - `composer.lock`: Locks dependency versions.
Verifying Installation
Let’s make sure everything works. Create a file named `test.php` and run it with `php test.php`. If it runs without errors, you’re good to go! If you get an error, check the Troubleshooting section.
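A minimal `test.php` might look like this (a sketch — the `AdvancePHPSraper\Core\Scraper` namespace is the one used in the usage examples later in this guide):

```php
<?php
// test.php — verify that the library autoloads correctly
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();
echo "Scraper loaded successfully\n";
```

If the autoloader or the class is missing, PHP prints a fatal error instead of the success message.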
Basic Usage: Your First Scrape
Now, let’s scrape some real data! Think of this as your first adventure with the library, like learning to ride a bike with training wheels.
Scraping a Simple Website
Let’s scrape the title of a webpage. Create a file named `scrape_title.php` and run it with `php scrape_title.php`.
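Here is `scrape_title.php`, reconstructed from the line-by-line explanation below:

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();             // turn on your robot assistant
$scraper->go('https://example.com');  // fetch the page
$title = $scraper->title();           // read the <title> tag

echo "The page title is: $title\n";
```

For `https://example.com`, this should print the page’s title, “Example Domain”.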
Line-by-Line Explanation:
- `require 'vendor/autoload.php'`: This line is like opening your toolbox, loading all the library’s tools.
- `use AdvancePHPSraper\Core\Scraper`: This tells PHP you want to use the `Scraper` class, like picking a specific tool from the toolbox.
- `$scraper = new Scraper()`: Creates a new scraper, like turning on your robot assistant.
- `$scraper->go('https://example.com')`: Tells the scraper to visit the website, like sending your robot to a library shelf.
- `$title = $scraper->title()`: Asks the scraper to find the `<title>` tag, like asking for the book’s title.
- `echo "The page title is: $title\n"`: Prints the result, like showing off the book you found.
What’s Happening Behind the Scenes?
- The library sends an HTTP request to `https://example.com` using Symfony BrowserKit.
- It loads the HTML into a `Crawler` object (like a super-smart librarian who can read the page).
- The `title()` method searches for the `<title>` tag and returns its text.
Extracting Links
Let’s grab all the links on a page. Create `scrape_links.php` and run it with `php scrape_links.php`.
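A sketch of `scrape_links.php`, using the `links()` return shape (`href`, `text`, `is_nofollow`) described below:

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();
$scraper->go('https://example.com');

foreach ($scraper->links() as $link) {
    echo "URL:  {$link['href']}\n";
    echo "Text: {$link['text']}\n";
    echo "Nofollow: " . ($link['is_nofollow'] ? 'yes' : 'no') . "\n\n";
}
```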
Line-by-Line Explanation:
- `$links = $scraper->links()`: Finds all `<a>` tags and returns an array of link details (like a list of book references).
- `foreach ($links as $link)`: Loops through each link, like flipping through a list.
- `$link['href']`: The URL (e.g., `https://www.iana.org/domains/example`).
- `$link['text']`: The clickable text (e.g., “More information...”).
- `$link['is_nofollow']`: Checks if the link has a `rel="nofollow"` attribute (used by search engines).
Why This is Cool:
- You get detailed info about each link, like whether it’s nofollow (important for SEO).
- The library handles relative URLs (e.g., `/page` becomes `https://example.com/page`).
Extracting Images
Now, let’s grab images. Create `scrape_images.php` and run it with `php scrape_images.php`.
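A sketch of `scrape_images.php` — note that the `src`/`alt` array keys are an assumption (by analogy with `links()`); check the library’s docs for the exact shape:

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();
$scraper->go('https://example.com');

foreach ($scraper->images() as $image) {
    // NOTE: 'src' and 'alt' keys assumed, by analogy with links()
    echo "Source: {$image['src']}\n";
    echo "Alt:    {$image['alt']}\n";
}
```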
Explanation:
- `$images = $scraper->images()`: Finds all `<img>` tags.
- Since `https://example.com` has no images, the output is empty.
- Try a different site with images (e.g., `https://www.wikipedia.org`) to see results.
Why This is Useful:
- You can filter images by size or attributes (e.g., `$scraper->images()->filterByMinDimensions(100, 100)`).
- The library handles lazy-loaded images (e.g., `data-src` attributes).
Extracting Meta Tags
Meta tags contain SEO and social media data. Create `scrape_meta.php` and run it with `php scrape_meta.php`.
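A sketch of `scrape_meta.php`. The outer array is grouped by category (`standard`, `og`, `twitter`, …) as described below; the inner name/content shape is an assumption:

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();
$scraper->go('https://example.com');

foreach ($scraper->meta() as $type => $tags) {
    echo "[$type]\n";
    foreach ((array) $tags as $name => $content) {
        // inner structure assumed: tag name => content value
        echo "  $name: " . (is_array($content) ? json_encode($content) : $content) . "\n";
    }
}
```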
Explanation:
- `$meta = $scraper->meta()`: Returns a categorized array of meta tags (`standard`, `og`, `twitter`, `charset`, `viewport`).
- `$type`: Groups like `standard` (regular meta tags) or `og` (Open Graph for social media).
- Useful for SEO analysis or social media previews.
Using the Command-Line Interface (CLI)
The CLI lets you scrape without writing PHP code, and prints its results as JSON.
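A typical invocation might look like this — the CLI entry-point path is an assumption (check `vendor/bin/` after installing); the `scrape` command and `--extract` option come from the explanation below:

```shell
# Binary name/path is a guess — look in vendor/bin/ for the real one
php vendor/bin/advance-phpscraper scrape https://example.com --extract=links,meta,content
```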
Explanation:
- `scrape`: The CLI command to scrape a URL.
- `--extract=links,meta,content`: Specifies what to extract (options: `links`, `images`, `meta`, `content`, `sitemap`, `rss`).
- The JSON output is easy to parse for scripts or tools.
- Great for quick tasks or automation (e.g., in a cron job).
Intermediate Usage: Leveling Up
Now that you’ve mastered the basics, let’s explore more features to make your scraper smarter.
Scraping Sitemaps
Sitemaps list all pages on a website, like a table of contents for a book. Create `scrape_sitemap.php` and run it with `php scrape_sitemap.php`.
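A sketch of `scrape_sitemap.php`. The `lastmod`/`priority` fields are mentioned below; the `loc` key follows the standard sitemap.xml element names, but the exact return shape is an assumption:

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();
$scraper->go('https://www.wikipedia.org'); // example.com may not publish a sitemap

foreach ($scraper->sitemap() as $entry) {
    // keys assumed from standard sitemap.xml fields
    echo $entry['loc'] . ' (lastmod: ' . ($entry['lastmod'] ?? 'n/a') . ")\n";
}
```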
Explanation:
- `$scraper->sitemap()`: Finds the sitemap URL from `robots.txt` and parses it.
- Since `https://example.com` may not have a sitemap, try a site like `https://www.wikipedia.org` instead.
Why This is Awesome:
- Sitemaps help you discover all pages on a site, perfect for large-scale scraping.
- Includes metadata like `lastmod` (last modified date) and `priority`.
Scraping RSS Feeds
RSS feeds are like news tickers for websites. Create `scrape_rss.php` and run it with `php scrape_rss.php`.
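A sketch of `scrape_rss.php`; the item keys are assumed from the structured fields listed below (title, link, description, date):

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();
$scraper->go('https://www.bbc.com'); // a site that advertises RSS feeds

foreach ($scraper->rssFeed() as $item) {
    // keys assumed from the fields described in this guide
    echo $item['title'] . "\n  " . $item['link'] . "\n";
}
```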
Explanation:
- `$scraper->rssFeed()`: Finds `<link type="application/rss+xml">` tags and parses RSS feeds.
- Try a news site like `https://www.bbc.com`, which advertises feeds.
Why This is Handy:
- Great for scraping news, blogs, or podcasts.
- Returns structured data (title, link, description, date).
Parsing Assets (CSV, JSON, XML)
You can parse files linked on pages. Create `parse_asset.php` and run it with `php parse_asset.php`.
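A sketch of `parse_asset.php`. The CSV URL is a hypothetical placeholder; `fetchAsset()` and `parseCsv()` are from the explanation below:

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();

// Hypothetical CSV file linked from a page you scraped
$content = $scraper->fetchAsset('https://example.com/data.csv');
$rows = $scraper->parseCsv($content, true); // true = first row is headers

print_r($rows);
// For JSON or XML assets, use parseJson($content) or parseXml($content)
```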
Explanation:
- `fetchAsset($url)`: Downloads the file content.
- `parseCsv($content, true)`: Parses CSV, using the first row as headers.
- For JSON or XML, use `parseJson()` or `parseXml()`.
Why This is Useful:
- Extract data from linked files (e.g., product lists in CSV).
- Handles multiple formats for flexibility.
Checking HTTP Status Codes
Ensure a page loaded correctly with `getStatusCode()`:
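A short sketch using `getStatusCode()` and `isErrorPage()`:

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();
$scraper->go('https://example.com');

echo "Status: " . $scraper->getStatusCode() . "\n"; // e.g. 200
if ($scraper->isErrorPage()) {
    echo "Skipping: this page returned an error status (>= 400)\n";
}
```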
Explanation:
- `getStatusCode()`: Returns the HTTP status (e.g., 200 for success, 404 for not found).
- `isErrorPage()`: Returns `true` for status codes >= 400.
- Helps you skip broken pages or handle errors gracefully.
Advanced Usage: Power User Mode
Ready to take your scraper to the next level? These features are like rocket boosters for your scraping adventures.
Rate Limiting: Playing Nice with Servers
Rate limiting prevents your scraper from overwhelming servers, which could lead to bans. Think of it as pacing yourself while eating cookies so you don’t get kicked out of the kitchen. Create `rate_limit.php` and run it with `php rate_limit.php`.
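A sketch of `rate_limit.php`, using the `setRateLimit(3, 1)` call described below:

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();
$scraper->setRateLimit(3, 1); // at most 3 requests per 1 second

$urls = ['https://example.com', 'https://example.org', 'https://example.net'];
foreach ($urls as $url) {
    $scraper->go($url);              // the library pauses automatically
    echo $url . ' => ' . $scraper->title() . "\n";
}
```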
Explanation:
- `setRateLimit(3, 1)`: Limits to 3 requests per second.
- The library pauses between requests (e.g., after 3 requests, it waits 1 second).
- Prevents server overload and IP bans, especially for large-scale scraping.
Tip:
- Start with a conservative limit (e.g., 5 requests/second) and adjust based on the target site’s policies.
- Check the site’s `robots.txt` for crawling guidelines.
Queue System: Scraping Multiple URLs
The queue system lets you scrape multiple URLs efficiently, like a conveyor belt processing orders. Create `queue_scrape.php` and run it with `php queue_scrape.php`.
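A sketch of `queue_scrape.php`, built from `queueUrls()` and `processQueue()` as described below; the callback’s parameter type is an assumption:

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();

$urls = ['https://example.com', 'https://example.org'];

// Callback run against each scraped page (signature assumed)
$callback = function (Scraper $s) {
    return $s->title();
};

$scraper->queueUrls($urls, $callback);
$results = $scraper->processQueue(); // [$url => $callback_result]

foreach ($results as $url => $title) {
    echo $url . ' => ' . ($title ?? 'failed') . "\n";
}
```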
Line-by-Line Explanation:
- `$urls`: An array of URLs to scrape, like a to-do list.
- `$callback`: A function that processes each page (here, it extracts the title).
- `queueUrls($urls, $callback)`: Adds URLs to the queue with the callback.
- `processQueue()`: Runs the scraper on each URL and returns results as `$url => $callback_result`.
- The `foreach` loop displays the results, like checking off your to-do list.
Why This is Powerful:
- Handles errors gracefully (e.g., failed URLs return `null`).
- Scales to thousands of URLs without overwhelming your script.
- Customizable callbacks let you extract any data.
API Integration: Combining Scraping with APIs
You can fetch data from APIs to complement your scraped data, like adding extra toppings to a pizza. Create `api_scrape.php` and run it with `php api_scrape.php`.
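A sketch of `api_scrape.php`. The JSONPlaceholder endpoint is just a convenient public test API chosen for this example, not something the library requires:

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();

// 1) Scrape a page title
$scraper->go('https://example.com');
echo "Scraped title: " . $scraper->title() . "\n";

// 2) Fetch a sample post from a public test API
$post = $scraper->apiRequest('https://jsonplaceholder.typicode.com/posts/1', [], 'GET');
echo "API post title: " . ($post['title'] ?? 'n/a') . "\n";
```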
Explanation:
- `apiRequest($endpoint, $params, $method)`: Sends an HTTP request (GET or POST) to an API and returns the JSON response.
- `$params`: Optional data to send (e.g., query parameters or POST body).
- `$method`: HTTP method (default: GET).
- Here, we scrape the page title and fetch a sample post from a public API.
Use Case:
- Scrape a product page and use an API to get additional details (e.g., stock status).
- Combine scraped news headlines with an API for sentiment analysis.
Custom CSS Selectors
Want to extract something specific, like a div with class content
? Use filter()
:
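For example (a sketch; `filter()` returns a Symfony DomCrawler-style node set, per the library’s dependencies):

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();
$scraper->go('https://example.com');

$node = $scraper->filter('div.content'); // any CSS selector works here
if ($node->count() > 0) {
    echo trim($node->text()) . "\n";
} else {
    echo "No element matched div.content\n";
}
```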
Explanation:
- `filter($selector)`: Uses CSS selectors to target elements (like `div.content`, `.header`, `#main`).
- `count()`: Checks if the element exists.
- `text()`: Gets the text inside the element.
- Powerful for custom scraping when built-in methods (`links()`, `images()`) aren’t enough.
Plugins: Supercharging Your Scraper
Plugins are like apps you install on your phone to add new features. They let you extend Advance PHP Scraper without changing its core code.
What Are Plugins?
A plugin is a PHP class that adds functionality, like rendering JavaScript pages or caching responses. Plugins live in `src/Plugins/custom/` and are managed via `plugins.json`. You can enable/disable them or create your own.
Available Plugins
The library includes six plugins, each explained in detail in the PLUGIN_README.md. Here’s a quick overview:
- HeadlessPlugin: Scrapes JavaScript-rendered content (e.g., React apps).
- AsyncPlugin: Scrapes multiple URLs at once for speed.
- NLPPlugin: Extracts keywords and entities for text analysis.
- DocumentPlugin: Parses PDFs linked on pages.
- CachePlugin: Saves scraped data to reduce server load.
- CloudPlugin: Runs scraping tasks on AWS Lambda.
How to Use Plugins
To use a plugin, enable it and call its methods. Example with `CachePlugin`:
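This page doesn’t show the plugin API itself, so the method name below is an illustrative guess only — see the PLUGIN_README.md for the real calls. The sketch only shows the enable-then-use flow:

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

$scraper = new Scraper();

// HYPOTHETICAL method name — consult PLUGIN_README.md for the actual API
$scraper->enablePlugin('CachePlugin');

$scraper->go('https://example.com');  // repeat visits may now hit the cache
echo $scraper->title() . "\n";
```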
For a complete guide on plugins, including how to enable, disable, or create them, check out the PLUGIN_README.md.
Configuration: Customizing Your Scraper
You can tweak the scraper’s settings to fit your needs, like adjusting a car’s mirrors before driving.
Setting User Agent
The user agent tells servers who’s scraping (like showing your ID at a library). The default is a bot-like string, but you can mimic a browser instead.
Adjusting Timeout
Set how long the scraper waits for a response before giving up.
Following Redirects
Choose whether the scraper follows HTTP redirects automatically.
Using Constructor Configuration
Pass settings when creating the scraper:
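The setter names and constructor keys below are assumptions — this page names the settings (user agent, timeout, redirects) but not the exact methods — so treat this as a sketch and confirm against the library’s README:

```php
<?php
require 'vendor/autoload.php';

use AdvancePHPSraper\Core\Scraper;

// Method names assumed from the setting names in this guide
$scraper = new Scraper();
$scraper->setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64)'); // mimic a browser
$scraper->setTimeout(30);        // wait up to 30 seconds for a response
$scraper->followRedirects(true); // follow HTTP 3xx redirects

// Or pass the same settings when constructing (key names assumed)
$scraper = new Scraper([
    'user_agent'       => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'timeout'          => 30,
    'follow_redirects' => true,
]);
```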
Explanation:
- These settings make your scraper behave differently, like choosing a fast or cautious driving mode.
- Use them to avoid blocks, handle slow servers, or follow redirects.
Testing: Ensuring Everything Works
The library comes with tests to make sure it works perfectly. Think of tests as a quality check, like tasting a cake before serving it.
Running Tests
Install the development dependencies with Composer, then run the test suite with PHPUnit.
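From the project root:

```shell
composer install      # installs dev dependencies such as PHPUnit
vendor/bin/phpunit    # runs the library's test suite
```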
Writing Your Own Tests
Add tests in `tests/`. Example for a custom check:
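A sketch of a custom test (assumes PHPUnit and the namespace used throughout this guide; hitting a live URL inside a test is for illustration only — real tests would stub the HTTP layer):

```php
<?php
// tests/TitleTest.php
use PHPUnit\Framework\TestCase;
use AdvancePHPSraper\Core\Scraper;

class TitleTest extends TestCase
{
    public function testTitleReturnsString(): void
    {
        $scraper = new Scraper();
        $scraper->go('https://example.com');
        $this->assertIsString($scraper->title());
    }
}
```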
Troubleshooting: Solving Common Problems
Even the best tools can hit snags. Here’s how to fix common issues:
Installation Issues
- Error: Composer not found: Install Composer (see Installation).
- Error: PHP version too low: Upgrade to PHP 7.4+:
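How you upgrade depends on your operating system; on Debian/Ubuntu it might look like this (package names vary by distribution and release):

```shell
sudo apt update
sudo apt install php8.1-cli   # package name varies by distro/version
php -v                        # confirm the new version
```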
Scraping Errors
- Error: Could not resolve host: Check your internet connection or URL spelling.
- Error: HTTP 403 Forbidden: Set a browser-like user agent (see Setting User Agent above).
Plugin Problems
- Plugin not loading: Ensure `"enabled": true` in `plugins.json`.
- Dependency missing: Install required packages (e.g., `composer require symfony/panther`).
Contributing: Joining the Community
Love the library? Help make it better! Contribute by fixing bugs, adding features, or improving docs. Read the CONTRIBUTING.md for a detailed guide.
License: Understanding Usage Rights
Advance PHP Scraper is licensed under the MIT License, meaning you can use, modify, and share it freely. See the LICENSE file for details.
Resources: Further Learning
- PHP Basics: PHP The Right Way
- Web Scraping: ScrapingBee Blog
- Symfony BrowserKit: Symfony Docs
- GitHub Repo: github.com/rajpurohithitesh/advance-phpscraper
All versions of advance-phpscraper with dependencies
symfony/browser-kit Version ^5.4
symfony/dom-crawler Version ^5.4
symfony/css-selector Version ^5.4
guzzlehttp/guzzle Version ^7.0
symfony/event-dispatcher Version ^5.4
symfony/console Version ^5.4
symfony/mime Version ^5.4
monolog/monolog Version ^2.0
league/uri Version ^6.5
donatello-za/rake-php-plus Version ^1.0.3
intervention/image Version ^2.7
ext-dom Version *
ext-libxml Version *
ext-gd Version *
ext-simplexml Version *
ext-mbstring Version *
ext-curl Version *
ext-fileinfo Version *
ext-xml Version *
ext-zlib Version *
ext-json Version *
ext-iconv Version *
ext-pcre Version *
ext-ctype Version *
ext-xmlwriter Version *
ext-tokenizer Version *
ext-filter Version *
ext-xmlreader Version *
ext-sockets Version *