Today I wanted to scrape some policy documents from the policy service site (smejs.cn). The detail-page links can't be found in the HTML source, so open the browser's developer tools and click the element marked by the red box in the screenshot below.
Inspecting the Preview pane shows that the response already contains the id of each policy we want, so the detail-page link can be built by simply concatenating the id onto a fixed base URL, as in the short example below.
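A minimal sketch of that concatenation (the id here is made up for illustration; real ids come from the API response shown later):

policy_id = '66a1b2c3d4'  # hypothetical id, only to show the URL shape
detail_url = 'https://policy.smejs.cn/frontend/policy-service/' + policy_id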
Clicking the Headers tab reveals the API endpoint the page calls on the backend. With that address in hand you can write a Python script (if you're not sure how, GPT does a decent job of generating one).
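Before writing the full scraper, it can help to hit the endpoint once and confirm the JSON layout. This is only a quick sketch: it sends a reduced parameter set, and the field names follow what the full script below expects; if the endpoint rejects the reduced parameters, send the full set from the script instead.

import requests

# One-off request to inspect the structure of the list endpoint
resp = requests.get(
    'https://policy-gateway.smejs.cn/policy/api/policy/getNewPolicyList',
    params={'current': 1, 'size': 15, 'provinceValue': '江苏省', 'isSearch': 'N'},
    headers={'User-Agent': 'Mozilla/5.0'},
    timeout=10,
)
resp.raise_for_status()
for record in resp.json()['data']['records']:
    print(record['id'], record['title'])

Once the response structure is confirmed, the full script below pages through every result and saves them to Excel: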
import requests
import pandas as pd
import logging
import time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
# Request headers
headers = {
    'Content-Type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
# API endpoint for the policy list and base URL for detail pages
base_url = 'https://policy-gateway.smejs.cn/policy/api/policy/getNewPolicyList'
base_policy_url = 'https://policy.smejs.cn/frontend/policy-service/'
# Query parameters (copied from the request captured in the developer tools)
params = {
    'orderBy': '',
    'keyWords': '',
    'genreCode': 'K,A,S,Z',
    'queryPublishBegin': '',
    'queryPublishEnd': '',
    'queryApplyBegin': '',
    'queryApplyEnd': '',
    'typeCondition': '',
    'publishUnit': '',
    'applyObj': '',
    'meetEnterprise': '',
    'title': '',
    'commissionOfficeIds': '',
    'commissionOfficeSearchIds': '',
    'industry': '',
    'relativePlatform': '',
    'level': '',
    'isSearch': 'N',
    'policyType': '',
    'provinceValue': '江苏省',
    'cityValue': '',
    'regionValue': '',
    'current': 1,
    'size': 15,
    'total': 23960,
    'page': 0
}
# Total number of items and items per page (from the response)
total_policies = 23960
page_size = 15
total_pages = (total_policies // page_size) + 1
# Collected policy records
all_policies = []
# Retry policy for rate limiting and transient server errors
retry_strategy = Retry(
    total=5,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["HEAD", "GET", "OPTIONS"]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)
# Iterate over every page
for page in range(total_pages):
    params['current'] = page + 1
    try:
        # verify=False skips TLS certificate verification; drop it if the site's certificate validates
        response = http.get(base_url, headers=headers, params=params, verify=False)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        logging.error(f"Failed to fetch data for page {page + 1}: {e}")
        continue
    data = response.json()
    if 'records' not in data.get('data', {}):
        logging.error(f"No records found for page {page + 1}")
        continue
    records = data['data']['records']
    for record in records:
        policy_id = record.get('id')
        level_value = record.get('levelValue')
        title = record.get('title')
        type_value = record.get('typeValue')
        commission_office_names = record.get('commissionOfficeNames')
        publish_time = record.get('publishTime')
        valid_date_end = record.get('validDateEnd')
        # Build the detail-page URL by appending the policy id to the base URL
        policy_url = base_policy_url + policy_id
        all_policies.append({
            'ID': policy_id,
            'URL': policy_url,
            'Level Value': level_value,
            'Title': title,
            'Type Value': type_value,
            'Commission Office Names': commission_office_names,
            'Publish Time': publish_time,
            'Valid Date End': valid_date_end
        })
    logging.info(f"Fetched data for page {page + 1}")
    time.sleep(1)  # Throttle requests so the server isn't hit too fast
# Convert to a DataFrame
df = pd.DataFrame(all_policies)
# Save to Excel (needs an Excel writer such as openpyxl installed)
df.to_excel('policies.xlsx', index=False)
logging.info("Data saved to policies.xlsx")
Then run the script and wait for the crawl to finish. It could probably be sped up with multiple threads, but I haven't tried that yet and don't know whether the site has anti-scraping measures.
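For reference, a threaded version could look something like the sketch below. I haven't run it against this site, so treat it as an untested outline: it reuses base_url, headers, params, and total_pages from the script above and collects raw records without the field mapping.

from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_page(page_number):
    # Give each worker its own copy of the params so 'current' isn't shared between threads
    page_params = dict(params, current=page_number)
    # requests.Session is not guaranteed thread-safe, so use a plain requests.get here
    resp = requests.get(base_url, headers=headers, params=page_params, verify=False, timeout=30)
    resp.raise_for_status()
    return resp.json().get('data', {}).get('records', [])

results = []
# Keep the pool small to stay polite and reduce the chance of tripping anti-scraping limits
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(fetch_page, p): p for p in range(1, total_pages + 1)}
    for future in as_completed(futures):
        try:
            results.extend(future.result())
        except requests.exceptions.RequestException as e:
            logging.error(f"Failed to fetch page {futures[future]}: {e}")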