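1. Background
Recently, while working on an urban-investment (城投) project, I ran into a problem: with the code written and left untouched, and me only clicking around on the web page, the script returned whatever the page was currently displaying. As a result, I could not scrape the content of a specific module of my own choosing.
2. Example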
import requests
headers = {
"Accept": "application/json, text/javascript, */*; q=0.01",
"Accept-Language": "zh-CN,zh;q=0.9",
"Connection": "keep-alive",
"Cookie": "wzws_sessionid=oGRscM+CZmM1ZWUxgDE4Mi4xNTAuNjMuMTg2gWQ5YzU0NA==; JSESSIONID=QW1HmDZ5fh6fW86HwLxm8woDJUwKu9PF_Z-jH_DD0wW3Ypa38VSk!1171792879; u=6",
"Referer": "https://data.stats.gov.cn/easyquery.htm?cn=E0103",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
"sec-ch-ua": "\"Chromium\";v=\"112\", \"Google Chrome\";v=\"112\", \"Not:A-Brand\";v=\"99\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Windows\""
}
url = "https://防敏/easyquery.htm"  # "防敏" marks the host redacted by the author; substitute the real host before running
params = {
"m": "QueryData",
"dbcode": "fsnd",
"rowcode": "zb",
"colcode": "sj",
"wds": "[{\"wdcode\":\"reg\",\"valuecode\":\"110000\"}]",
"dfwds": "[{\"wdcode\":\"sj\",\"valuecode\":\"1993\"}]",
"k1": "1684310513171"
}
response = requests.get(url, headers=headers, params=params, verify=False)
print(response.text)
print(response)
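With this script left unchanged, when I select 地区生产总值 (gross regional product) on the page, the script returns the gross regional product data.
When I then click through to 支出法地区生产总值 (gross regional product by expenditure approach) on the page, the output changes to the expenditure-approach data, even though the code has not changed at all; it is still the same script.
3. Analysis
Looking at the script, the only value that is not human-readable is the cookie, so I inferred that the cookie causes this behavior. If the request is sent without a cookie, only the gross regional product data can be retrieved, and changing params has no effect. The plan was therefore to find out how the cookie is generated. In practice, however, neither hooking nor keyword search made it easy to locate the code that generates the cookie, which would be consistent with the cookie being issued by the server rather than by page JavaScript. Then it occurred to me: why not simply pick up the cookie that has already been generated?
4. Debugging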
There are two ways to get the cookie here: read Set-Cookie from the response headers, or read response.cookies directly.
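A raw Set-Cookie value carries attributes such as Path and HttpOnly in addition to the name=value pair, while a Cookie request header only needs the pairs. The following is a minimal sketch of that conversion using the standard library; `cookie_header_from_set_cookie` is a hypothetical helper name and the sample cookie values are made up for illustration.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Turn a list of Set-Cookie header values into one Cookie header value."""
    pairs = []
    for raw in set_cookie_values:
        jar = SimpleCookie()
        jar.load(raw)  # parses name=value and attaches Path, HttpOnly, etc. as attributes
        for name, morsel in jar.items():
            pairs.append(f"{name}={morsel.value}")
    return "; ".join(pairs)


# Made-up sample values, one string per Set-Cookie header
sample = ["JSESSIONID=abc123; Path=/; HttpOnly", "u=6; Path=/"]
print(cookie_header_from_set_cookie(sample))  # JSESSIONID=abc123; u=6
```

With requests, response.cookies already exposes the parsed name/value pairs, so this helper only matters when starting from the raw header text.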
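With the cookie problem solved, another issue came up: I need data for a specific year, but analyzing params showed that a request can carry only (province code, module code) or (province code, year), not all three. So how do I get the data for a given module, in a given province, for a given year? My guess was that the cookie controls this as well.
That means I first have to send one request with the module code in params to obtain that module's cookie, and then send the request for the desired year while carrying that cookie.
5. Final script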
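The two parameter shapes described above can be produced by one small helper. This is a minimal sketch: `build_params` and its argument names are hypothetical, and treating k1 as a millisecond timestamp is an assumption based on the sample values in the scripts, not something the site confirms.

```python
import json
import time


def build_params(region_code, wdcode, valuecode, k1=None):
    """Build params for one request: a region plus ONE extra dimension.

    wdcode is "zb" for a module (indicator) code or "sj" for a year.
    """
    def dim(code, value):
        # Compact separators reproduce the JSON strings used in the scripts
        return json.dumps([{"wdcode": code, "valuecode": value}], separators=(",", ":"))

    return {
        "m": "QueryData",
        "dbcode": "fsnd",
        "rowcode": "zb",
        "colcode": "sj",
        "wds": dim("reg", region_code),
        "dfwds": dim(wdcode, valuecode),
        # Assumption: k1 is a millisecond timestamp
        "k1": str(k1 if k1 is not None else int(time.time() * 1000)),
    }


module_params = build_params("110000", "zb", "A0204")  # first request: module
year_params = build_params("110000", "sj", "1993")     # second request: year
print(module_params["dfwds"])
print(year_params["dfwds"])
```

The module request in the final script also sends "h": "1"; that key can be added to the returned dict before the first request.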
import requests
headers = {
"Accept": "application/json, text/javascript, */*; q=0.01",
"Accept-Language": "zh-CN,zh;q=0.9",
"Connection": "keep-alive",
"Referer": "https://data.stats.gov.cn/easyquery.htm?cn=E0103",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
"sec-ch-ua": "\"Chromium\";v=\"112\", \"Google Chrome\";v=\"112\", \"Not:A-Brand\";v=\"99\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Windows\""
}
url = "https://防敏/easyquery.htm"  # "防敏" marks the host redacted by the author; substitute the real host before running
params_base = {
"m": "QueryData",
"dbcode": "fsnd",
"rowcode": "zb",
"colcode": "sj",
"wds": "[{\"wdcode\":\"reg\",\"valuecode\":\"110000\"}]",
"dfwds": "[{\"wdcode\":\"zb\",\"valuecode\":\"A0204\"}]",
"k1": "1684828790804",
"h": "1"
}
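# First request: send the module (indicator) code so the server issues that module's cookie
response = requests.get(url, headers=headers, params=params_base, verify=False)
# Raw Set-Cookie header; response.cookies holds the same cookies in parsed form
set_cookie = response.headers['Set-Cookie']
print(set_cookie)
print('\n')
print(response.cookies)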
# Second request: carry the cookie when querying by year
# Same headers as the first request, plus the cookie obtained above
headers = {**headers, "Cookie": set_cookie}
params = {
"m": "QueryData",
"dbcode": "fsnd",
"rowcode": "zb",
"colcode": "sj",
'wds': '[{"wdcode":"reg","valuecode":"110000"}]',
'dfwds': '[{"wdcode":"sj","valuecode":"1993"}]',
"k1": 1684828790804
}
response = requests.get(url, headers=headers, params=params, verify=False)
print(response.text)
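From here, changing the province code and module code in params (together with the year value in the second request) is enough to fetch the data for a given module, in a given province, for a given year.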
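The same two-step flow can be written more compactly with a session object that stores and resends cookies by itself, as requests.Session does. The sketch below is untested against the real site: `fetch_module_year` and `_dim` are hypothetical names, and `session` is assumed to be any object with a requests-style `get(url, headers=..., params=...)` method.

```python
import json


def _dim(code, value):
    # One wds/dfwds entry, serialized the same way as in the scripts above
    return json.dumps([{"wdcode": code, "valuecode": value}], separators=(",", ":"))


def fetch_module_year(session, url, headers, region, module, year, k1):
    """Send two requests on one session: select the module, then query the year."""
    base = {
        "m": "QueryData",
        "dbcode": "fsnd",
        "rowcode": "zb",
        "colcode": "sj",
        "wds": _dim("reg", region),
        "k1": k1,
    }
    # Step 1: module request; the session keeps any cookie the server sets
    session.get(url, headers=headers, params={**base, "dfwds": _dim("zb", module), "h": "1"})
    # Step 2: year request; the session sends the stored cookie automatically
    return session.get(url, headers=headers, params={**base, "dfwds": _dim("sj", year)})
```

With requests installed, pass `requests.Session()` as `session`, for example `fetch_module_year(requests.Session(), url, headers, "110000", "A0204", "1993", "1684828790804").text`. To mirror the `verify=False` used in the scripts above, set `session.verify = False` on the session first.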