Python快速入门
简单易懂Python入门
爬虫流程
- 获取网页内容:HTTP请求
- 解析网页内容:Requst库、HTML结果、Beautiful Soup库
- 储存和分析数据
什么是HTTP请求和响应
如何用Python Requests发送请求
-
下载pip
-
macos系统下载:pip3 install requests
通过第二行进行伪装为浏览器请求
实践
import requests
headers = {
"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.6.1 Safari/605.1.15"
}
response = requests.get("https://movie.douban.com/top250",headers=headers)
print(response.text)
什么是HTML网页结构?
HTML常见标签
:链接
- ![在这里插入图片描述](https://img-blog.csdnimg.cn/48567ae1276e494e8f03b3035aa9aa56.png) # Beautiful Soup
- pip3 install bs4
from bs4 import BeautifulSoup
import requests
content = requests.get("http://books.toscrape.com/").text
soup = BeautifulSoup(content,"html.parser")
all_prices = soup.findAll("p",attrs={"class","price_color"})
for price in all_prices:
print(price.string[2:])
实战
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.6.1 Safari/605.1.15"
}
for start_num in range(0,250,25):
response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
all_titles = soup.findAll("span", attrs={"class", "title"})
for title in all_titles:
title_string = title.string
if "/" not in title_string:
print(title_string)
进阶
- 正则表达式
- 多线程
- 数据库
- 数据分析
规则
- 不爬公民隐私数据
- 不爬受著作权保护内容
- 不爬国家事务、国防建设、尖端科学技术等
- 请求数量频率不能过高
- 反爬就不要强行图片
- 了解robots.txt查看可爬和不可爬内容