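A simple proxy IP pool for a Python crawler
Ordinary crawlers often run into websites that block their IP address.
Let's start with an ordinary crawler.
Fetching data with requests.get
This code crawls the Douban movie chart. The request URL and the response format were found with the browser's developer tools (F12).
Code: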
import requests


def requestData():
    # URL to crawl
    url: str = "https://movie.douban.com/j/chart/top_list"
    # Query parameters, copied from the request seen in the browser;
    # "start" is the offset of the first entry (page index * limit)
    params: dict = {
        "type": 24,
        "start": 0,
        "limit": 20,
        "interval_id": "100:90",
        "action": ""
    }
    # Request headers
    headers: dict = {
        "Cookie": 'bid=E_4gLcYLK28; douban-fav-remind=1; _pk_id.100001.4cf6=356001ac7c27c8a7.1721138006.; __yadk_uid=3UpO8BdyzrKbVCb1NOAtbGumsp4WCXwl; __utmz=30149280.1721147606.4.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmz=223695111.1721147606.3.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); ll="118281"; _vwo_uuid_v2=DD3C30CAFFD881E01CA061E73D9968226|23b8625e4550d2e13d1dacf343f40f5d; __utma=30149280.457246694.1704531990.1721147606.1721223349.5; __utmc=30149280; __utma=223695111.1791837078.1721138006.1721147606.1721223349.4; __utmb=223695111.0.10.1721223349; __utmc=223695111; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1721223349%2C%22https%3A%2F%2Fwww.google.com%2F%22%5D; _pk_ses.100001.4cf6=1; ap_v=0,6.0; __utmt=1; __utmb=30149280.1.10.1721223349',
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
    }
    # Request up to 1000 pages of 20 entries each
    for i in range(1000):
        params["start"] = 20 * i
        try:
            data = requests.get(url, params=params, headers=headers)
            jsonData = data.json()
            for item in jsonData:
                print("Title: {} Rating: {}".format(item["title"], item["score"]))
        except Exception:
            # Ignore the failure so the remaining pages are still crawled
            pass
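The User-Agent header imitates a browser: it tells the server that the request comes from a browser, which gets past some basic anti-crawling checks. The Cookie header carries our session identity, telling the server who we are; it likewise helps against basic anti-crawling checks.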

for i in range(1000):
    params["start"] = 20 * i
    try:
        data = requests.get(url, params=params, headers=headers)
        jsonData = data.json()
        for item in jsonData:
            print("Title: {} Rating: {}".format(item["title"], item["score"]))
    except Exception:
        pass

The try/except is there so that one failed request does not abort the loop and leave the remaining pages uncrawled.
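As a sketch of a slightly stricter variant of this pattern: instead of silently swallowing every exception, the helper below retries a page a few times and reports what failed. The names `fetch_with_retries`, `fetch_page`, and `max_attempts` are hypothetical, and `fake_fetch` only stands in for `requests.get(...).json()` so that the example runs without network access.

```python
from typing import Callable, List, Optional


def fetch_with_retries(fetch_page: Callable[[int], List[dict]],
                       page: int,
                       max_attempts: int = 3) -> Optional[List[dict]]:
    """Call fetch_page(page) up to max_attempts times; return None if all fail."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_page(page)
        except Exception as e:  # with requests, RequestException is a narrower choice
            print("page {} attempt {} failed: {}".format(page, attempt, e))
    return None


# Stand-in for the real request: fails on the first call, then succeeds.
calls = {"count": 0}


def fake_fetch(page: int) -> List[dict]:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated network failure")
    return [{"title": "Example", "score": "9.0", "page": page}]


result = fetch_with_retries(fake_fetch, page=0)
print(result)
```

The loop in the crawler could call such a helper once per page and skip the page only when it returns None.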
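
jsonData = data.json()
for item in jsonData:
    print("Title: {} Rating: {}".format(item["title"], item["score"]))

This part takes the JSON data returned by the request and prints it. The data could just as well be written to an Excel file or a database; since I don't need to keep it, I simply print it to inspect what was crawled.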
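For readers who do want to keep the data, here is a minimal sketch using only the standard library: CSV text (which Excel can open) and an SQLite table. The sample rows, the `movies` table, and the function names are made up for illustration; only the `title` and `score` fields come from the code above.

```python
import csv
import io
import sqlite3
from typing import Dict, List

# Sample rows shaped like the fields printed above (made-up values).
sample_items: List[Dict[str, str]] = [
    {"title": "Movie A", "score": "9.5"},
    {"title": "Movie B", "score": "9.1"},
]


def to_csv_text(items: List[Dict[str, str]]) -> str:
    """Serialize title/score pairs as CSV, ignoring any other fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "score"],
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


def save_to_sqlite(conn: sqlite3.Connection, items: List[Dict[str, str]]) -> int:
    """Insert title/score pairs into a 'movies' table and return the row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS movies (title TEXT, score TEXT)")
    conn.executemany("INSERT INTO movies (title, score) VALUES (?, ?)",
                     [(item["title"], item["score"]) for item in items])
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]


csv_text = to_csv_text(sample_items)
conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
row_count = save_to_sqlite(conn, sample_items)
print(csv_text)
print(row_count)
```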
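The code above is a simple crawler, but a crawler this simple easily runs into IP-based blocking. Below is a crawler that uses a simple proxy IP pool.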
import requests
from typing import List


def requestDataIp():
    # URL to crawl
    url: str = "https://movie.douban.com/j/chart/top_list"
    # Query parameters, copied from the request seen in the browser;
    # "start" is the offset of the first entry (page index * limit)
    params: dict = {
        "type": 24,
        "start": 0,
        "limit": 20,
        "interval_id": "100:90",
        "action": ""
    }
    # Request headers
    headers: dict = {
        "Cookie": 'bid=E_4gLcYLK28; douban-fav-remind=1; _pk_id.100001.4cf6=356001ac7c27c8a7.1721138006.; __yadk_uid=3UpO8BdyzrKbVCb1NOAtbGumsp4WCXwl; __utmz=30149280.1721147606.4.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmz=223695111.1721147606.3.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); ll="118281"; _vwo_uuid_v2=DD3C30CAFFD881E01CA061E73D9968226|23b8625e4550d2e13d1dacf343f40f5d; __utma=30149280.457246694.1704531990.1721147606.1721223349.5; __utmc=30149280; __utma=223695111.1791837078.1721138006.1721147606.1721223349.4; __utmb=223695111.0.10.1721223349; __utmc=223695111; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1721223349%2C%22https%3A%2F%2Fwww.google.com%2F%22%5D; _pk_ses.100001.4cf6=1; ap_v=0,6.0; __utmt=1; __utmb=30149280.1.10.1721223349',
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
    }
    # Proxy pool (host:port entries)
    ipList: List[str] = [
        "d484.kdltps.com:15818",
        "d484.kdltps.com:15818",
        "117.42.94.126:19634"
    ]
    ipIndex = 0
    # Index of the next page to fetch; advanced only after a successful request
    startIdx = 0
    # Make 100 requests
    for i in range(100):
        proStr = ipList[ipIndex]
        proxies = {
            "http": f"http://{proStr}",
            "https": f"https://{proStr}"
        }
        params["start"] = startIdx * 20
        try:
            data = requests.get(url, params=params, headers=headers,
                                proxies=proxies)
            jsonData = data.json()
            for item in jsonData:
                print("Title: {} Rating: {}".format(item["title"], item["score"]))
            startIdx += 1
        except Exception as e:
            print("Request error: " + str(e))
        # Rotate to the next proxy, wrapping around at the end of the list
        ipIndex += 1
        if ipIndex >= len(ipList):
            ipIndex = 0
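Most of the code is unchanged; only a few things were added.

ipList: List[str] = [
    "d484.kdltps.com:15818",
    "d484.kdltps.com:15818",
    "117.42.94.126:19634"
]

This is the default proxy pool, hard-coded here. The pool could also be stored in a database or an Excel file.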
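As a small sketch of keeping the pool outside the code, the function below reads `host:port` entries from a plain text file, one per line. The file name `proxies.txt`, its comment syntax, and the `load_proxies` helper are assumptions made for this example; a database table or a spreadsheet export would follow the same idea.

```python
import os
import tempfile
from typing import List


def load_proxies(path: str) -> List[str]:
    """Read host:port entries, one per line, skipping blanks, comments, and duplicates."""
    proxies: List[str] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = line.strip()
            if not entry or entry.startswith("#"):
                continue
            if entry not in proxies:
                proxies.append(entry)
    return proxies


# Demo: write a small file with the entries from the list above, then load it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "proxies.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("# host:port\n"
                "d484.kdltps.com:15818\n"
                "d484.kdltps.com:15818\n"
                "\n"
                "117.42.94.126:19634\n")
    ipList = load_proxies(demo_path)

print(ipList)
```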
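
proStr = ipList[ipIndex]
proxies = {
    "http": f"http://{proStr}",
    "https": f"https://{proStr}"
}

The keys select which requests go through the proxy: "http" applies to http:// URLs and "https" to https:// URLs. The values are the address of the proxy itself. Note that in recent versions of requests an https:// proxy address means the connection to the proxy itself uses TLS, so a proxy that only accepts plain HTTP needs an http:// address in both entries.
This is the countermeasure against anti-crawling. Some sites block an IP based on how many requests it makes within a short time; because consecutive requests here go out through different proxies, no single IP reaches that limit as quickly, which also keeps the pool usable instead of having its IPs banned one by one.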
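The rotation itself can also be sketched with `itertools.cycle`, which removes the manual index bookkeeping. The `build_proxies` helper is a hypothetical name, and the example assumes the proxies speak plain HTTP, so it uses an http:// address for both keys; nothing is sent over the network.

```python
from itertools import cycle
from typing import Dict, List


def build_proxies(host_port: str) -> Dict[str, str]:
    """Build the mapping requests expects: target-URL scheme -> proxy address.

    Assumption: the proxy speaks plain HTTP, so both entries use an
    http:// proxy address, even for https:// targets.
    """
    proxy_url = "http://{}".format(host_port)
    return {"http": proxy_url, "https": proxy_url}


ipList: List[str] = ["d484.kdltps.com:15818", "117.42.94.126:19634"]
rotation = cycle(ipList)  # endless round-robin over the pool

# Take the proxy for three consecutive requests.
used = [build_proxies(next(rotation))["https"] for _ in range(3)]
print(used)
```

`cycle` works well for a fixed pool; once proxies need to be removed at run time, an explicit index, as in the article's code, is easier to manage.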
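
try:
    data = requests.get(url, params=params, headers=headers,
                        proxies=proxies)
    jsonData = data.json()
    for item in jsonData:
        print("Title: {} Rating: {}".format(item["title"], item["score"]))
except Exception as e:
    print("Request error: " + str(e))

The error handling here can be improved: at the moment any error is just printed and the loop moves on. If the error shows that the proxy has been banned or has stopped working, we can remove it from the pool and fetch the same page again through another proxy, for example: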
import requests
from typing import List


def requestDataIp():
    # URL to crawl
    url: str = "https://movie.douban.com/j/chart/top_list"
    # Query parameters, copied from the request seen in the browser;
    # "start" is the offset of the first entry (page index * limit)
    params: dict = {
        "type": 24,
        "start": 0,
        "limit": 20,
        "interval_id": "100:90",
        "action": ""
    }
    # Request headers
    headers: dict = {
        "Cookie": 'bid=E_4gLcYLK28; douban-fav-remind=1; _pk_id.100001.4cf6=356001ac7c27c8a7.1721138006.; __yadk_uid=3UpO8BdyzrKbVCb1NOAtbGumsp4WCXwl; __utmz=30149280.1721147606.4.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmz=223695111.1721147606.3.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); ll="118281"; _vwo_uuid_v2=DD3C30CAFFD881E01CA061E73D9968226|23b8625e4550d2e13d1dacf343f40f5d; __utma=30149280.457246694.1704531990.1721147606.1721223349.5; __utmc=30149280; __utma=223695111.1791837078.1721138006.1721147606.1721223349.4; __utmb=223695111.0.10.1721223349; __utmc=223695111; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1721223349%2C%22https%3A%2F%2Fwww.google.com%2F%22%5D; _pk_ses.100001.4cf6=1; ap_v=0,6.0; __utmt=1; __utmb=30149280.1.10.1721223349',
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
    }
    # Proxy pool (host:port entries)
    ipList: List[str] = [
        "d484.kdltps.com:15818",
        "d484.kdltps.com:15818",
        "117.42.94.126:19634"
    ]
    ipIndex = 0
    # Index of the next page to fetch; advanced only after a successful request
    startIdx = 0
    # Make up to 100 requests
    for i in range(100):
        # Stop once every proxy has been removed from the pool
        if not ipList:
            print("No proxies left")
            break
        proStr = ipList[ipIndex]
        proxies = {
            "http": f"http://{proStr}",
            "https": f"https://{proStr}"
        }
        params["start"] = startIdx * 20
        try:
            data = requests.get(url, params=params, headers=headers,
                                proxies=proxies)
            jsonData = data.json()
            for item in jsonData:
                print("Title: {} Rating: {}".format(item["title"], item["score"]))
            startIdx += 1
        except requests.exceptions.ConnectionError:
            # The proxy is unreachable: remove it from the pool.
            # startIdx is unchanged, so the same page is retried.
            ipList.pop(ipIndex)
            print("Proxy error")
        except Exception as e:
            print("Request error: " + str(e))
        # Rotate to the next proxy, wrapping around at the end of the list
        ipIndex += 1
        if ipIndex >= len(ipList):
            ipIndex = 0
In other words, this handler was added:

except requests.exceptions.ConnectionError:
    ipList.pop(ipIndex)
    print("Proxy error")
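One detail of removing entries while rotating by index: after `pop(ipIndex)` the next proxy slides into the removed slot, so incrementing the index afterwards skips it, and an empty pool can no longer be indexed at all. A minimal sketch of a pool that handles both cases is shown below. The `ProxyPool` class and its method names are hypothetical, and the `host:port` values are placeholders.

```python
from typing import List, Optional


class ProxyPool:
    """Round-robin pool that tolerates removals while rotating."""

    def __init__(self, proxies: List[str]) -> None:
        self._proxies = list(proxies)
        self._index = 0

    def current(self) -> Optional[str]:
        """Return the proxy to use now, or None when the pool is empty."""
        if not self._proxies:
            return None
        self._index %= len(self._proxies)
        return self._proxies[self._index]

    def advance(self) -> None:
        """Move on to the next proxy after a request has been attempted."""
        self._index += 1

    def drop_current(self) -> None:
        """Remove the current proxy; the next one slides into its slot."""
        if self._proxies:
            self._index %= len(self._proxies)
            self._proxies.pop(self._index)


pool = ProxyPool(["a:1", "b:2", "c:3"])  # placeholder host:port values
seen = []
seen.append(pool.current())  # first proxy
pool.advance()
seen.append(pool.current())  # second proxy
pool.drop_current()          # second proxy failed: remove it, do not advance
seen.append(pool.current())  # third proxy is used next instead of being skipped
pool.drop_current()
seen.append(pool.current())  # index wraps around to the first proxy
pool.drop_current()
seen.append(pool.current())  # pool is empty now
print(seen)
```

In the crawler loop, `drop_current()` would take the place of `ipList.pop(ipIndex)` in the ConnectionError handler, `advance()` would follow every other attempt, and a `None` from `current()` would end the loop.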