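Task: scrape the information for every book on the Douban Books search result pages
Enter the following prompt in ChatGPT:
You are a Python programming expert. Write a Python web-scraping script that follows these steps:
Use the fake-useragent library to set a random User-Agent request header;
Set the chromedriver path to: "D:\Program Files\chromedriver125\chromedriver.exe"
Hide the chromedriver automation fingerprints;
Maximize the Selenium browser window;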
Request headers:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Accept-Encoding: gzip, deflate, br, zstd
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Connection: keep-alive
Host: search.douban.com
Referer: https://search.douban.com/book/subject_search?search_text=chatgpt&cat=1001&start=0
Sec-Ch-Ua: "Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36
Use Selenium to open the page: https://search.douban.com/book/subject_search?search_text=chatgpt&cat=1001&start={pagenumber}
The value of {pagenumber} starts at 0, increases in steps of 15, and ends at 285;
Locate the a tag at xpath=//*[@id="root"]/div/div[2]/div[1]/div[1]/div[{number}]/div/div/div[1]/a, extract its text content ({number} runs from 1 to 15), and write it to column 1 of the Excel sheet;
Locate the div tag at xpath=//*[@id="root"]/div/div[2]/div[1]/div[1]/div[{number}]/div/div/div[3], extract its text content ({number} runs from 1 to 15), and write it to column 2 of the Excel sheet;
Save the Excel file as doubanChatGPT20240606.xlsx in the folder: F:\AI自媒体内容\AI行业数据分析
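The pagination and XPath steps above can be sketched in plain Python, independent of Selenium. This is only an illustrative sketch: the names BASE_URL, ROW_XPATH, page_urls, and row_xpaths are hypothetical and do not appear in the original prompt or script, while the URL and XPath templates are copied from the steps above.

```python
# Sketch only: all helper names here are hypothetical, not from the original script.
BASE_URL = "https://search.douban.com/book/subject_search"
# Common prefix of the two XPaths given in the prompt.
ROW_XPATH = '//*[@id="root"]/div/div[2]/div[1]/div[1]/div[{number}]/div/div'


def page_urls(keyword="chatgpt", first=0, last=285, step=15):
    """Return one search URL per result page (start=0, 15, ..., 285)."""
    return [
        f"{BASE_URL}?search_text={keyword}&cat=1001&start={start}"
        for start in range(first, last + 1, step)
    ]


def row_xpaths(number):
    """Return the (title, description) XPaths for result row `number` (1 to 15)."""
    row = ROW_XPATH.format(number=number)
    return f"{row}/div[1]/a", f"{row}/div[3]"


urls = page_urls()
print(len(urls))      # number of pages the script will visit
print(urls[0])
print(urls[-1])
print(row_xpaths(1))
```

With a step of 15 and a final offset of 285, the loop visits 20 pages, so at most 300 rows are collected.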
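Notes:
Print a message to the screen at every step;
After scraping each record, pause for a random 5-8 seconds;
After scraping each page, pause for a random 6-12 seconds;
Set the request headers to cope with the site's anti-scraping measures;
Some tags may be empty and cause errors during processing; when a tag is empty, skip it and continue with the next one;
DataFrame.append was deprecated in pandas 1.4.0 and removed in pandas 2.0; use the concat function instead of append;
The script runs on Selenium 4 or later, where the executable_path parameter has been replaced by the service parameter;
Ignore SSL errors: add --ignore-certificate-errors and --ignore-ssl-errors to the Chrome options.
Add error handling so that as many exceptions as possible are caught and handled.
Update the User-Agent before every request.
Headless mode: run with the --headless argument to reduce interference. To run in the foreground, remove that line.
Random pauses: pause for a random interval between requests to avoid triggering anti-scraping measures.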
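The two pause rules in the notes (5-8 seconds per record, 6-12 seconds per page) could be wrapped in one small helper. This is a minimal sketch, not part of the original script: the name polite_pause and its sleep/rng parameters are hypothetical, and the demonstration swaps in a stand-in for time.sleep so that it finishes instantly.

```python
import random
import time


def polite_pause(low, high, sleep=time.sleep, rng=random.uniform):
    """Sleep for a random duration between low and high seconds and return it."""
    delay = rng(low, high)
    sleep(delay)
    return delay


# Demonstration: record the delays instead of actually sleeping.
recorded = []
item_delay = polite_pause(5, 8, sleep=recorded.append)    # after each record
page_delay = polite_pause(6, 12, sleep=recorded.append)   # after each page
print(f"would pause {item_delay:.1f}s per record and {page_delay:.1f}s per page")
```

In a real run the default sleep=time.sleep would be used, so polite_pause(5, 8) blocks for the chosen duration.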
Source code:
import time
import random
import pandas as pd
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
# Path to chromedriver
chromedriver_path = "D:\\Program Files\\chromedriver125\\chromedriver.exe"
# Source of random User-Agent strings
ua = UserAgent()
# Configure the Chrome options
chrome_options = Options()
chrome_options.add_argument("--start-maximized")
chrome_options.add_argument("--ignore-certificate-errors")
chrome_options.add_argument("--ignore-ssl-errors")
chrome_options.add_argument("--allow-insecure-localhost")
chrome_options.add_argument("--disable-web-security")
chrome_options.add_argument("--disable-site-isolation-trials")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--headless")  # run in headless mode
# Hide the chromedriver automation fingerprints
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
chrome_options.add_experimental_option('useAutomationExtension', False)
# Initialize the webdriver
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service, options=chrome_options)
# Request headers (only the User-Agent entry is applied to the browser below;
# the other entries are kept for reference)
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Connection": "keep-alive",
    "Host": "search.douban.com",  # host name only, without the scheme
    "Referer": "https://search.douban.com/book/subject_search?search_text=chatgpt&cat=1001&start=0",
    "Sec-Ch-Ua": '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
}
# Collected rows
data = []
# Scrape the result pages
for pagenumber in range(0, 286, 15):
    url = f"https://search.douban.com/book/subject_search?search_text=chatgpt&cat=1001&start={pagenumber}"
    print(f"Scraping page: {url}")
    # Refresh the User-Agent before each request
    headers["User-Agent"] = ua.random
    driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": headers["User-Agent"]})
    driver.get(url)
    # Random pause to reduce the chance of triggering anti-scraping measures
    time.sleep(random.uniform(6, 12))
    for number in range(1, 16):
        try:
            # Locate the a tag that holds the book title
            try:
                book_title_xpath = f'//*[@id="root"]/div/div[2]/div[1]/div[1]/div[{number}]/div/div/div[1]/a'
                book_title = driver.find_element(By.XPATH, book_title_xpath).text
            except Exception as e:
                book_title = ""
                print(f"Could not find the title, error: {e}")
            # Locate the div tag that holds the description
            try:
                book_desc_xpath = f'//*[@id="root"]/div/div[2]/div[1]/div[1]/div[{number}]/div/div/div[3]'
                book_desc = driver.find_element(By.XPATH, book_desc_xpath).text
            except Exception as e:
                book_desc = ""
                print(f"Could not find the description, error: {e}")
            # Skip rows where both fields are empty (for example, a page with fewer than 15 results)
            if not book_title and not book_desc:
                continue
            # Add the row to the list
            data.append([book_title, book_desc])
            print(f"Scraped: {book_title}, {book_desc}")
            # Random pause to reduce the chance of triggering anti-scraping measures
            time.sleep(random.uniform(5, 8))
        except Exception as e:
            print(f"Skipping because of error: {e}")
            continue
# Write the data to an Excel file
columns = ["Title", "Description"]
df = pd.DataFrame(data, columns=columns)
output_path = "F:\\AI自媒体内容\\AI行业数据分析\\doubanChatGPT20240606.xlsx"
df.to_excel(output_path, index=False)
print(f"Data saved to Excel file: {output_path}")
driver.quit()
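One gap in the script above is that it assumes the output folder already exists; if it does not, the save step will typically fail after all the scraping time has been spent. A possible safeguard, sketched under the assumption that creating the folder is acceptable, is shown below. The helper name prepare_output_path is hypothetical, and the demonstration uses a temporary directory instead of the F: drive path from the prompt.

```python
import tempfile
from pathlib import Path


def prepare_output_path(folder, filename):
    """Create `folder` if it is missing and return the full path of the output file."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / filename


# Demonstration in a temporary directory rather than the real output folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = prepare_output_path(Path(tmp) / "reports" / "douban", "doubanChatGPT20240606.xlsx")
    folder_exists = demo_path.parent.is_dir()
    print(demo_path.name, folder_exists)
```

In the script, the returned path could be computed before the scraping loop starts and then passed to df.to_excel, so that a bad folder is noticed early.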