Python Web Scraping Examples (Continued)
IDE: PyCharm or Jupyter Notebook
6. NetEase Cloud Music Charts
🥝 Getting the URL
Remember to remove the #/ from the URL you copy out of the browser. This is important; otherwise the request will not return the page content.
The hot songs chart URL is https://music.163.com/discover/toplist?id=3778678 , and the URLs of the other charts follow the same pattern.
For example:
Soaring chart: https://music.163.com/discover/toplist?id=19723756
Just change the id.
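Since only the id in the query string changes from chart to chart, here is a minimal sketch of building the URLs (the two ids below are the ones listed above; other ids can be read off the chart pages in the same way):

# Chart ids taken from the URLs above; only the id differs between charts
chart_ids = {
    '热歌榜': 3778678,     # hot songs chart
    '飙升榜': 19723756,    # soaring chart
}
for name, chart_id in chart_ids.items():
    print(name, f'https://music.163.com/discover/toplist?id={chart_id}')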
About headers: the request carries a browser User-Agent (copied from the browser's developer tools) so the site treats it like a normal browser visit.
🥝 Implementation
import requests
import re  # regular expression module
import os  # file handling module

filename = 'music\\'
if not os.path.exists(filename):
    os.mkdir(filename)

url = 'https://music.163.com/discover/toplist?id=3778678'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
# Pull (song id, song title) pairs out of the chart page
html_data = re.findall(r'<li><a href="/song\?id=(\d+)">(.*?)</a>', response.text)
# print(html_data)
for num_id, title in html_data:
    # e.g. https://music.163.com/song/media/outer/url?id=1859245776.mp3
    music_url = f'https://music.163.com/song/media/outer/url?id={num_id}.mp3'
    music_content = requests.get(url=music_url, headers=headers).content
    with open('music\\' + title + '.mp3', mode='wb') as f:
        f.write(music_content)
    print(num_id, title)
🥝 Results
The NetEase hot songs chart has 200 songs in total, and they all download within a few minutes.
Some songs appear more than once; duplicates simply overwrite each other, so the folder 📂 ends up with ≤ 200 files.
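If you would rather keep every file instead of letting duplicate titles overwrite each other, one possible tweak (a hypothetical helper, not part of the script above) is to prefix the song id and strip characters Windows forbids in filenames:

import re

def safe_filename(num_id, title):
    # Hypothetical helper: remove characters Windows does not allow in filenames
    # and prefix the song id so duplicate titles do not overwrite each other
    clean = re.sub(r'[\\/:*?"<>|]', '_', title)
    return f'music\\{num_id}_{clean}.mp3'

print(safe_filename('1859245776', '歌名/示例'))  # music\1859245776_歌名_示例.mp3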
7. University Rankings
import requests
import csv
import re

f = open('rank.csv', encoding='utf-8', newline='', mode='a')
csv_writer = csv.writer(f)
csv_writer.writerow(['country', 'rank', 'region', 'score1', 'score2', 'score3',
                     'score4', 'score5', 'score6', 'stars', 'total_score', 'university', 'year'])

# Strip the wrapping HTML and keep only the cell text
def replace(str_):
    str_ = re.findall('<div class="td-wrap"><div class="td-wrap-in">(.*?)</div>', str_)[0]
    return str_

# 1. Send the request
url = "https://www.qschina.cn/sites/default/files/qs-rankings-data/cn/2057712_indicators.txt"
response = requests.get(url)  # requests method: get
# print(response)  # <Response [200]>

# 2. Get the data
json_data = response.json()
# print(json_data)

# 3. Parse the data and write one row per university
for data in json_data['data']:
    country = data['location']
    rank = data['overall_rank']
    region = data['region']
    score1 = replace(data['ind_76'])
    score2 = replace(data['ind_77'])
    score3 = replace(data['ind_36'])
    score4 = replace(data['ind_73'])
    score5 = replace(data['ind_18'])
    score6 = replace(data['ind_14'])
    stars = data['stars']
    total_score = replace(data['overall'])
    university = data['uni']
    university = re.findall('<div class="td-wrap"><div class="td-wrap-in"><a href=".*?" class="uni-link">(.*?)</a></div></div>', university)[0]
    year = "2021"
    print(country, rank, region, score1, score2, score3, score4, score5, score6, stars, total_score, university, year)
    csv_writer.writerow([country, rank, region, score1, score2, score3, score4, score5, score6, stars, total_score, university, year])

f.close()
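To sanity-check what was written, you can read rank.csv back with pandas (a small sketch, assuming the script above has already run and pandas is installed):

import pandas as pd

df = pd.read_csv('rank.csv')
print(df.shape)   # number of universities saved x 13 columns
print(df.head())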
8. Weibo Comments
🥝 Getting the URL
🥝 Implementation
The usual regular expression for matching Chinese text is [\u4e00-\u9fa5]+. Here [\u4e00-\u9fa5] is a character class whose bounds \u4e00 and \u9fa5 are the start and end of the common Chinese character range in Unicode, and + means the class may repeat, so the pattern matches one or more consecutive Chinese characters.
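A quick self-contained illustration of how the pattern behaves on made-up sample text:

import re

text = '转发微博 Nice!! 太好看了123'
# findall returns every run of Chinese characters; join them back into one string
chinese_only = ''.join(re.findall('[\u4e00-\u9fa5]+', text))
print(chinese_only)  # 转发微博太好看了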
import requests
import pandas as pd
import re
# import pprint

# https://m.weibo.cn/
url = "https://m.weibo.cn/comments/hotflow?id=5071592833155928&mid=5071592833155928&max_id_type=0"
header = {'cookie': "SINAGLOBAL=7254774839451.836.1628626364688; SUB=_2AkMWR_ROf8NxqwJRmf8cymjraIt-ygDEieKgGwWVJRMxHRl-yT9jqmUgtRB6PcfaoQpx1lJ1uirGAtLgm7UgNIYfEEnw; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWEs5v92H1qMCCxQX.d-5iG; UOR=,,www.baidu.com; _s_tentry=-; Apache=1090953026415.7019.1632559647541; ULV=1632559647546:8:4:2:1090953026415.7019.1632559647541:1632110419050; WBtopGlobal_register_version=2021092517; WBStorage=6ff1c79b|undefined",
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}

# Send the request and get the data
response = requests.get(url=url, headers=header)
response.encoding = 'utf-8'  # make sure the response body is decoded correctly

# Create an empty DataFrame
df = pd.DataFrame(columns=[
    '用户',
    '地区',
    '评论',
    '日期',
])

# Loop over the comments in the response
for index in response.json()['data']['data']:
    # pprint.pprint(index)
    content = ''.join(re.findall('[\u4e00-\u9fa5]+', index['text']))
    # Build a dict for this comment
    dict_entry = {
        '用户': index['user']['screen_name'],
        '地区': index['source'].replace('来自', ''),
        '评论': content,
        '日期': index['created_at'],
    }
    # Append the row (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
    df = pd.concat([df, pd.DataFrame([dict_entry])], ignore_index=True)
    print(dict_entry)
    # break  # uncomment to stop after the first comment

# Write the result to a CSV file
df.to_csv('weibo评论1.csv', encoding='utf-8-sig', index=False)
🥝 Results
Related reading: the regular expression module re and its applications.
9. Weather Data
# pip install pandas
# pip install lxml
import pandas as pd

url = 'http://www.weather.com.cn/textFC/hb.shtml'
# read_html parses every <table> on the page into a list of DataFrames (lxml only needs to be
# installed, not imported); header=1 uses the second header row, index_col=[0, 1] indexes by area/city
pd.read_html(url, header=1, index_col=[0, 1])[1]
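If you want to keep the table rather than just display it, a minimal follow-up sketch (the output filename weather_hb.csv is just an example):

import pandas as pd

url = 'http://www.weather.com.cn/textFC/hb.shtml'
# Same call as above, but keep the DataFrame and save it to a CSV file
df = pd.read_html(url, header=1, index_col=[0, 1])[1]
df.to_csv('weather_hb.csv', encoding='utf-8-sig')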
10. Fund Data
# pip install html5lib
import pandas as pd
import ssl

# Uncomment this line if the request fails with the SSL error shown below
# ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://fund.eastmoney.com/fund.html#os_0;isall_0;ft_;pt_1'
# read_html needs lxml or html5lib installed to parse the tables on the page
pd.read_html(url)[2]
URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>
Cause: starting with Python 2.7.9, opening an https link with urllib verifies the SSL certificate, and that verification is what fails here.
Solution (note this disables certificate verification entirely):
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
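Putting the fix together with the original call looks roughly like this (only disable verification for pages you trust):

import ssl
import pandas as pd

# Work around CERTIFICATE_VERIFY_FAILED by disabling certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://fund.eastmoney.com/fund.html#os_0;isall_0;ft_;pt_1'
print(pd.read_html(url)[2])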
Partial screenshot of the output; it is quite long so it is not shown in full, but it looks roughly like this: