PHP download

Download the PHP package yangze/spiderx without Composer

On this page you can find all versions of the php package yangze/spiderx. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.

Table of contents
Download yangze/spiderx
More information about yangze/spiderx
Files in yangze/spiderx

Vendor yangze
Package spiderx
Short Description php SpiderX
License MIT
Homepage https://github.com/zhenyangze/SpiderX

Keywords framework php spider SpiderX

FAQ

After the download, you have to make one include require_once('vendor/autoload.php');. After that you have to import the classes with use statements.

Example:

If you use only one package a project is not needed. But if you use more then one package, without a project it is not possible to import the classes with use statements.

In general, it is recommended to use always a project to download your libraries. In an application normally there is more than one library needed.

Some PHP packages are not free to download and because of that hosted in private repositories. In this case some credentials are needed to access such packages. Please use the auth.json textarea to insert credentials, if a package is coming from a private repository. You can look here for more information.

Some hosting areas are not accessible by a terminal or SSH. Then it is not possible to use Composer.
To use Composer is sometimes complicated. Especially for beginners.
Composer needs much resources. Sometimes they are not available on a simple webspace.
If you are using private repositories you don't need to share your credentials. You can set up everything on our site and then you provide a simple download link to your team member.
Simplify your Composer build process. Use our own command line tool to download the vendor folder as binary. This makes your build process faster and you don't need to expose your credentials for private repositories.

Please rate this library. Is it a good library?

Example code of yangze/spiderx

Informations about the package spiderx

php爬虫脚本

框架只做分发，不做数据处理，需要自己在回调中定制。
不限制采集方式，可以用正则，Xpath，字符串截取。
无限层级采集，可以实现列表->详情，列表->列表->详情，详情->详情等任意姿势采集。
队列去重，可以增量抓取，也可以全量采集。
支持调试模式，实时报表，守护模式。

安装依赖

环境	说明
php	>5.6,最好是php7以上
redis	数据队列

快速开始

在线生成代码： SpiderX生成器

1、复制代码到index.php文件中

2、命令行执行(需要composer下载依赖，时间跟网速有关)

配置说明

字段	类型	说明
name	string	任务名称，队列名称根据name值生成。如果要做分布式的，可以选择用相同的name值
tasknum	int	任务数量，默认为1
start	array	采集入口url
rule	array	采集规则，具体参考下方说明

rule值说明

rule值为数组形式，每个二级元素为一个单元。

字段	类型	说明
name	string	任务名称，队列名称根据name值生成。如果要做分布式的，可以选择用相同的name值
name	string	页面类型，选项为list 或者detail
url	string	入口url,有2种形式，一种是'#article_\d+#'这样的正则，从各个页面中抓取；一种是取其他单元的name值与所取单元的data值组合，比如一个单元name为news_list,data中有个元素为url,则组合成news_list.url赋值给当前字段
data	array	要采集的数据，以回调方式赋值，形式为：`key => function ($pageInfo, $html, $data) { return '';}`

回调说明

通用回调方法

开始任务

任务完成

向队列中添加url数据

重试，url请求失败，重新请求，默认为3次

如果获取不到html数据，可以重写setGetHtml方法

类似的还有setGetLinks方法，抽取页面中的链接，或者其他url存储方式

页面加载回调

需要依赖用户设置的每个rule下面单元的name值。假设我们设置的name值为news. 则对应的回调方法有：

请求url前回调

获取html后回调

解析页面数据后回调，一般用于保存数据

高级玩法

获取页面数据

字符串截取
xpath
正则

设置cookie和header头

需要重写setGetHtml方法

模拟登录

需要要on_start回调中添加自动登录逻辑

部分页面可能需要先get方式获取页面中的参数，然后再发起POST请求

无限级数据采集

实现的方式就是在data的单元中，把url的值设置为上一个单元的name.DataField的形式

参考demo目录sina文件。

post提交表单，post分页抓取数据

实现方式为自定义添加url队列，请求类型method为post，请求参数query为数组或者字符串形式：

快速导出列表数据和表格数据

执行效果

代码参考

参考demo目录

效果参考

命令行

command

报表模式

report

守护模式：

运行后看不到，不截图了。

后续功能

[x] 脚手架，自动生成代码
[ ] 支持深度优先和广度优先
[x] 命令行效果
[x] 异步多线程

All versions of spiderx with dependencies

PHP Build Version

Package Version

Version v2.1.35 Release 27. Feb 2022
create-project require 0 people chose require and
0 people chose create-project.

Download

Download latest version of spiderx from vendor yangze

Requires php Version ^7.0
guzzlehttp/guzzle Version ~6.3
league/climate Version ^3.2
katzgrau/klogger Version ~1.0
inhere/console Version ^2.0
symfony/dom-crawler Version ^4.1
symfony/css-selector Version ^4.1

Composer command for our command line client (download client) This client runs in each environment. You don't need a specific PHP version etc. The first 20 API calls are free. Standard composer command

The package yangze/spiderx contains the following files

Loading the files please wait ....