Download the PHP package yangze/spiderx without Composer

On this page you can find all versions of the php package yangze/spiderx. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.

FAQ

After the download, you have to make one include require_once('vendor/autoload.php');. After that you have to import the classes with use statements.

Example:
If you use only one package a project is not needed. But if you use more then one package, without a project it is not possible to import the classes with use statements.

In general, it is recommended to use always a project to download your libraries. In an application normally there is more than one library needed.
Some PHP packages are not free to download and because of that hosted in private repositories. In this case some credentials are needed to access such packages. Please use the auth.json textarea to insert credentials, if a package is coming from a private repository. You can look here for more information.

  • Some hosting areas are not accessible by a terminal or SSH. Then it is not possible to use Composer.
  • To use Composer is sometimes complicated. Especially for beginners.
  • Composer needs much resources. Sometimes they are not available on a simple webspace.
  • If you are using private repositories you don't need to share your credentials. You can set up everything on our site and then you provide a simple download link to your team member.
  • Simplify your Composer build process. Use our own command line tool to download the vendor folder as binary. This makes your build process faster and you don't need to expose your credentials for private repositories.
Please rate this library. Is it a good library?

Informations about the package spiderx

php爬虫脚本


安装依赖

环境 说明
php >5.6,最好是php7以上
redis 数据队列

快速开始

在线生成代码: SpiderX生成器

1、复制代码到index.php文件中

2、命令行执行(需要composer下载依赖,时间跟网速有关)

配置说明

字段 类型 说明
name string 任务名称,队列名称根据name值生成。如果要做分布式的,可以选择用相同的name值
tasknum int 任务数量,默认为1
start array 采集入口url
rule array 采集规则,具体参考下方说明

rule值说明

rule值为数组形式,每个二级元素为一个单元。

字段 类型 说明
name string 任务名称,队列名称根据name值生成。如果要做分布式的,可以选择用相同的name值
name string 页面类型,选项为list 或者detail
url string 入口url,有2种形式,一种是'#article_\d+#'这样的正则,从各个页面中抓取;一种是取其他单元的name值与所取单元的data值组合,比如一个单元name为news_list,data中有个元素为url,则组合成news_list.url赋值给当前字段
data array 要采集的数据,以回调方式赋值,形式为:key => function ($pageInfo, $html, $data) { return '';}

回调说明

通用回调方法

开始任务

任务完成

向队列中添加url数据

重试,url请求失败,重新请求,默认为3次

如果获取不到html数据,可以重写setGetHtml方法

类似的还有setGetLinks方法,抽取页面中的链接,或者其他url存储方式

页面加载回调

需要依赖用户设置的每个rule下面单元的name值。假设我们设置的name值为news. 则对应的回调方法有:

请求url前回调

获取html后回调

解析页面数据后回调,一般用于保存数据

高级玩法

获取页面数据

  1. 字符串截取

  2. xpath

  3. 正则

设置cookie和header头

需要重写setGetHtml方法

模拟登录

需要要on_start回调中添加自动登录逻辑

部分页面可能需要先get方式获取页面中的参数,然后再发起POST请求

无限级数据采集

实现的方式就是在data的单元中,把url的值设置为上一个单元的name.DataField的形式

参考demo目录sina文件。

post提交表单,post分页抓取数据

实现方式为自定义添加url队列,请求类型method为post,请求参数query为数组或者字符串形式:

快速导出列表数据和表格数据

执行效果

代码参考

参考demo目录

效果参考

命令行

command

报表模式

report

守护模式:

运行后看不到,不截图了。

后续功能


All versions of spiderx with dependencies

PHP Build Version
Package Version
Requires php Version ^7.0
guzzlehttp/guzzle Version ~6.3
league/climate Version ^3.2
katzgrau/klogger Version ~1.0
inhere/console Version ^2.0
symfony/dom-crawler Version ^4.1
symfony/css-selector Version ^4.1
Composer command for our command line client (download client) This client runs in each environment. You don't need a specific PHP version etc. The first 20 API calls are free. Standard composer command

The package yangze/spiderx contains the following files

Loading the files please wait ....