32. Hands-on: Scraping a TX Illustrated News Article with PyQuery

2025/4/17 21:14:23

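Preface

Earlier we noted that PyQuery's biggest advantage over the other parsing approaches is that it can "modify the source code", which makes it easier to extract the information we want. Today we take TX News as an example and give a brief demonstration of that advantage.

Goal

Use PyQuery + Markdown to scrape the complete text-and-image content of a single TX News article and save it to a local md file.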

Approach

1. Fetch the page source

2. Parse the HTML

3. Extract the title and the body

4. Download the images

5. Save the file

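Implementation

1. Fetch the page source

# main: runs the whole pipeline (works for any article URL)
def main():
    url = 'see the comment section'  # placeholder; the real article URL is given in the comments
    resp = requests.get(url)
    html = resp.text
    title, essay = get_content(html)
    save_file(title, essay)

main fetches the page source for the URL, passes it to get_content to extract the title and the body (downloading the images along the way), and finally calls save_file to write the result to a local file.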
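
The fetch step in main can be made a little more defensive. The sketch below is only an illustration: fetch_html, its session parameter, and the 10-second timeout are hypothetical choices, not part of the original script. It checks the HTTP status and falls back to the detected encoding when the server does not declare a charset, which is a common cause of garbled Chinese text.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Fetch a page and return its decoded text (hypothetical helper)."""
    # Use the given session if there is one, otherwise plain requests.get.
    getter = session.get if session is not None else requests.get
    resp = getter(url, timeout=timeout)
    # Fail loudly on 4xx/5xx instead of parsing an error page.
    resp.raise_for_status()
    # Without a declared charset, requests assumes ISO-8859-1 for text
    # responses; fall back to the encoding detected from the body.
    if resp.encoding is None or resp.encoding.lower() == "iso-8859-1":
        resp.encoding = resp.apparent_encoding
    return resp.text
```

In main, `html = fetch_html(url)` would then stand in for the two lines that call requests.get and read resp.text.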

2. Parse the HTML

# Extract the title and the body
def get_content(html):
    p = pq(html)

Now for get_content. The first step is to parse the HTML source passed in from main with PyQuery.

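3. Extract the title and the body

Next we pull out the title and the body. Inspecting the page source shows there is only one h1 tag, so we take it directly and can later write it into the Markdown file. The body is more involved: we want the saved file to load its images from local disk, so the images must first be downloaded, and for that we add a download_img function.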

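    title = p("h1")
    essay = p("p.one-p")
    for img in essay("img").items():
        img_src = img.attr("src")
        if not img_src:  # skip images that have no src attribute
            continue
        img_uuid = uuid.uuid4()
        download_img(img_src, img_uuid)
        img.attr("src", f"1_img_src/{img_uuid}.jpg")  # point the image at the local copy
        img.attr("alt", "Image Not Found")
    return title, essay

Pay attention to these two lines:

img.attr("src", f"1_img_src/{img_uuid}.jpg")  # point the image at the local copy
img.attr("alt", "Image Not Found")

For every img element inside the matched p tags, they replace the src attribute with the local path and add an alt attribute as the fallback text shown when the image cannot be displayed.

This is where PyQuery's advantage shows: the parsed document is modified in place, so the HTML we write out later already references the local images.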

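4. Download the images

# Download an image into the local image directory, named with the same uuid
def download_img(img_src, img_uuid):
    download_url = 'https:' + img_src  # assumes src is protocol-relative ("//host/path")
    img_resp = requests.get(download_url)
    os.makedirs("1_img_src", exist_ok=True)  # create the directory if missing (needs `import os`)
    file_path = f"1_img_src/{img_uuid}.jpg"
    with open(file_path, mode='wb') as f:
        f.write(img_resp.content)

This is routine, so there is no detailed walkthrough. Just keep the two arguments apart: img_src is the image's address taken from the page, while img_uuid is the generated name used for the local file.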
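
The 'https:' prefix only works while every src is protocol-relative. As a hedged alternative, the sketch below resolves the src against the page URL with the standard library and creates the image directory on demand. resolve_image_url, local_image_path, and the example.com addresses are hypothetical names for illustration.

```python
import os
from urllib.parse import urljoin


def resolve_image_url(page_url, img_src):
    """Turn an img src (protocol-relative, relative, or absolute) into a full URL."""
    return urljoin(page_url, img_src)


def local_image_path(img_dir, img_uuid):
    """Create img_dir if needed and return where the image should be saved."""
    os.makedirs(img_dir, exist_ok=True)
    return os.path.join(img_dir, f"{img_uuid}.jpg")


page = "https://news.example.com/a/1.html"  # hypothetical article URL
print(resolve_image_url(page, "//img.example.com/x.jpg"))
print(resolve_image_url(page, "/static/z.jpg"))
```

With these helpers, download_img would need the page URL as an extra argument, which is the trade-off for no longer assuming one particular src format.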

5. Save the file

# Save the Markdown file
def save_file(title, essay):
    true_title = title.text()
    with open(f"1_{true_title}.md", mode='w', encoding='utf-8') as f:
        # Markdown accepts inline HTML, so the h1 and p tags are written as-is
        f.write(str(title))
        f.write(str(essay))
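
save_file uses the headline directly as the file name, which fails when the title contains characters such as / or ?. A small sketch of one way to guard against that follows; safe_filename, its 80-character limit, and the "untitled" fallback are illustrative assumptions rather than part of the original script.

```python
import re


def safe_filename(title, max_len=80):
    """Make a title usable as a file name on common file systems."""
    # Replace characters that are not allowed in file names on Windows
    # (a superset of what Linux and macOS reject), plus line breaks and tabs.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title).strip(" .")
    # Fall back to a fixed name when nothing usable is left.
    return cleaned[:max_len] or "untitled"


print(safe_filename('A/B: "C"?'))
```

In save_file this would mean opening f"1_{safe_filename(true_title)}.md" instead of using true_title directly.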

Complete code

"""
PyQuery & Markdown
new.xx.com (see the comment section)
"""

import os
import uuid

import requests
from pyquery import PyQuery as pq


# main: runs the whole pipeline (works for any article URL)
def main():
    url = 'see the comment section'  # placeholder; the real article URL is given in the comments
    resp = requests.get(url)
    html = resp.text
    title, essay = get_content(html)
    save_file(title, essay)


# Extract the title and the body
def get_content(html):
    p = pq(html)
    title = p("h1")
    essay = p("p.one-p")
    for img in essay("img").items():
        img_src = img.attr("src")
        if not img_src:  # skip images that have no src attribute
            continue
        img_uuid = uuid.uuid4()
        download_img(img_src, img_uuid)
        img.attr("src", f"1_img_src/{img_uuid}.jpg")  # point the image at the local copy
        img.attr("alt", "Image Not Found")
    return title, essay


# Download an image into the local image directory, named with the same uuid
def download_img(img_src, img_uuid):
    download_url = 'https:' + img_src  # assumes src is protocol-relative ("//host/path")
    img_resp = requests.get(download_url)
    os.makedirs("1_img_src", exist_ok=True)  # create the directory if missing
    file_path = f"1_img_src/{img_uuid}.jpg"
    with open(file_path, mode='wb') as f:
        f.write(img_resp.content)


# Save the Markdown file
def save_file(title, essay):
    true_title = title.text()
    with open(f"1_{true_title}.md", mode='w', encoding='utf-8') as f:
        # Markdown accepts inline HTML, so the h1 and p tags are written as-is
        f.write(str(title))
        f.write(str(essay))


if __name__ == '__main__':
    main()