QueryList: A PHP Scraping Tool That Makes Web Data Extraction Simple and Efficient

Collect web page data in a few lines of code with a DOM parsing tool for PHP developers

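In an era where data is king, web data collection has become a core requirement for many projects. Python is often seen as the default language for crawlers, but the PHP ecosystem has a capable tool of its own: QueryList. Built on phpQuery, this library lets PHP developers handle DOM parsing and data collection in a jQuery-like style.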

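What is QueryList?

QueryList is a concise, elegant, and extensible data collection tool for PHP. It describes itself as "an elegant, progressive PHP DOM parsing framework." Its goal is to keep code clean whether you are fetching a page's HTML or extracting content from it.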

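Core features

QueryList covers most PHP scraping needs:

jQuery-style selectors: uses the same CSS3 DOM selectors as jQuery, so it is quick to pick up
Familiar DOM operations: the API closely mirrors jQuery's and is simple and intuitive
List collection: a general-purpose approach for scraping list data
HTTP request suite: supports complex requests such as simulated login, browser spoofing, and HTTP proxies
Encoding handling: built-in handling for garbled text caused by character-encoding mismatches
Content filtering: unwanted content can be filtered out with jQuery selectors
Modular design: highly extensible to meet custom needs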
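To make the encoding item in the list above concrete: garbled text usually appears when a page served in a legacy encoding such as GBK is read as if it were UTF-8. The following is a minimal plain-PHP sketch of that conversion using the mbstring extension. It is not QueryList's internal implementation; the helper name toUtf8 and the GBK fallback are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: return $raw as UTF-8.
// Assumption: any input that is not valid UTF-8 is encoded in $fallback.
function toUtf8(string $raw, string $fallback = 'GBK'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

// Simulate the bytes a GBK-encoded page would deliver.
$gbkBytes = mb_convert_encoding('数据采集', 'GBK', 'UTF-8');

echo toUtf8($gbkBytes), PHP_EOL;   // converted from GBK
echo toUtf8('数据采集'), PHP_EOL;   // passed through unchanged
```

The validity check only distinguishes UTF-8 from everything else, so a real page in a different legacy encoding would need the right fallback passed in.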

Installing QueryList

Installation only requires Composer:

composer require jaeger/querylist

Once installed, load Composer's autoloader (vendor/autoload.php) and QueryList is ready to use in the project.

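Basic usage

Simple example: get all images on a page

require 'vendor/autoload.php'; // Composer autoloader

use QL\QueryList;

// Fetch the page, select every <img>, and collect the src attributes
$images = QueryList::get('http://www.example.com')->find('img')->attrs('src');

// attrs() returns a collection; all() converts it to a plain array
print_r($images->all());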

Scraping search results

Suppose we need to scrape the titles and links from Baidu search results:

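use QL\QueryList;

// get($url, $queryParams, $options): the third argument carries the HTTP request options
$data = QueryList::get('https://www.baidu.com/s?wd=QueryList', null, [
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        // Advertise only encodings the HTTP client can decode; 'br' needs Brotli support in cURL
        'Accept-Encoding' => 'gzip, deflate, br',
    ]
])->rules([
    // field name => [selector, attribute to extract]
    'title' => ['h3', 'text'],
    'link' => ['h3>a', 'href']
])
->range('.result') // each .result block becomes one row
->queryData();

print_r($data);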
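The array returned by queryData() holds one associative row per .result block, keyed by the rule names. Scraped rows often need a cleanup pass before use. The sketch below shows one way to do that in plain PHP; the sample rows and the helper name cleanRows are hypothetical and stand in for real output.

```php
<?php
// Hypothetical rows shaped like the queryData() result for the rules above.
$data = [
    ['title' => 'QueryList docs',  'link' => 'https://example.com/a'],
    ['title' => '  Spaced title ', 'link' => ''],
    ['title' => 'Another result',  'link' => 'https://example.com/b'],
];

// Hypothetical helper: trim both fields and drop rows missing either one.
function cleanRows(array $rows): array
{
    $cleaned = [];
    foreach ($rows as $row) {
        $title = trim($row['title'] ?? '');
        $link  = trim($row['link'] ?? '');
        if ($title === '' || $link === '') {
            continue; // incomplete row, skip it
        }
        $cleaned[] = ['title' => $title, 'link' => $link];
    }
    return $cleaned;
}

$clean = cleanRows($data);
$links = array_column($clean, 'link'); // just the link column

print_r($links);
```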

If you only need the list of titles or the list of links, you can extract them separately:

$ql = QueryList::get('https://www.baidu.com/s?wd=QueryList', null, [
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'Accept-Encoding' => 'gzip, deflate, br',
    ]
]);

$titles = $ql->find('h3>a')->texts(); // list of search result titles
$links = $ql->find('h3>a')->attrs('href'); // list of search result links

print_r($titles->all());
print_r($links->all());
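Because both lists come from the same h3>a selector, they have the same length and order, so they can be zipped back into rows with plain PHP. A minimal sketch, using hypothetical sample arrays in place of $titles->all() and $links->all():

```php
<?php
// Hypothetical stand-ins for $titles->all() and $links->all().
$titleList = ['First result', 'Second result'];
$linkList  = ['https://example.com/1', 'https://example.com/2'];

// Pair the i-th title with the i-th link.
$rows = array_map(
    fn ($title, $link) => ['title' => $title, 'link' => $link],
    $titleList,
    $linkList
);

print_r($rows);
```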

Advanced usage: handling nested data

QueryList can also handle more complex page structures cleanly:

$url = 'https://www.example.com/data/list.php';
$ql = QueryList::get($url);

$data = $ql->find('.class1')->map(function ($row) {
    $type = $row->find('.class2')->html();

    $list = $row->find('span')->map(function ($row1) {
        $title = $row1->find('a')->html();
        $url = $row1->find('a')->attr('href');

        // Extract the author with a regex: 2 to 4 Chinese characters inside half- or full-width parentheses
        preg_match_all('/[（(]([\x{4e00}-\x{9fa5}]{2,4})[）)]/u', $row1->text(), $matches);

        return [
            'title' => $title,
            'author' => $matches[1][0] ?? '',
            'url' => sprintf("https://www.example.com%s", $url)
        ];
    })->all();

    return [
        'type' => $type,
        'data' => $list
    ];
})->all();
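The author regex in the example above can be tried on its own, without any network request. The sketch below wraps the same pattern in a hypothetical helper, extractAuthor, and uses preg_match instead of preg_match_all because only the first match is needed. The sample strings are made up for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: return the first 2-4 character Chinese name found
// inside half-width () or full-width （） parentheses, or '' if none.
function extractAuthor(string $text): string
{
    $pattern = '/[（(]([\x{4e00}-\x{9fa5}]{2,4})[）)]/u';
    if (preg_match($pattern, $text, $m) === 1) {
        return $m[1];
    }
    return '';
}

echo extractAuthor('静夜思（李白）'), PHP_EOL;        // full-width parentheses
echo extractAuthor('Some title (张三丰)'), PHP_EOL;   // half-width parentheses
var_dump(extractAuthor('No author here (John)'));     // Latin letters do not match
```

The character range \x{4e00}-\x{9fa5} covers common CJK ideographs only, so names written in other scripts or longer than four characters yield an empty string.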

Comparison with a Go approach

Compared with goquery in Go, the QueryList code is more compact:

Go implementation:

resp, err := http.Get("https://www.example.com/data?tab=list")
if err != nil {
    resultResp.Code = 1
    resultResp.Err = "Failed to fetch data"
    c.JSON(http.StatusOK, resultResp)
    return
}
defer resp.Body.Close()

doc, err := goquery.NewDocumentFromReader(resp.Body)
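if err != nil {
    resultResp.Code = 1
    resultResp.Err = "Failed to read response body"
    c.JSON(http.StatusOK, resultResp)
    return // stop here; doc is not usable after a parse error
}

doc.Find(".class1").Each(func(i int, s *goquery.Selection) {
    text1 := s.Find(".class2").Text()
    href1, _ := s.Find(".class3").Attr("href")
    imgsrc1, _ := s.Find("a img").Attr("src")
    // Use the values; Go refuses to compile unused local variables.
    log.Println(text1, href1, imgsrc1)
})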

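By comparison, the QueryList version in PHP is shorter and more direct, although part of the difference comes from the explicit error handling in the Go snippet, which the PHP examples above omit.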