Python Quick Start

A simple, easy-to-follow introduction to Python

The Web Scraping Workflow

Fetch the page content: HTTP requests (the Requests library)
Parse the page content: HTML structure, the Beautiful Soup library
Store and analyze the data

What Are HTTP Requests and Responses?

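As a rough illustration of what a request and a response look like on the wire, here is a minimal sketch that writes out one exchange as plain text and splits it into its parts. The header values and the response body are made-up placeholders, and `split_message` is a hypothetical helper written for this example, not part of any library.

```python
# A hypothetical HTTP exchange written out as text (placeholder values).
raw_request = (
    "GET /top250 HTTP/1.1\r\n"
    "Host: movie.douban.com\r\n"
    "User-Agent: Mozilla/5.0\r\n"
    "\r\n"
)
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)


def split_message(raw):
    """Split a raw HTTP message into (start line, headers dict, body)."""
    # A blank line separates the start line and headers from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return lines[0], headers, body


request_line, request_headers, _ = split_message(raw_request)
status_line, response_headers, response_body = split_message(raw_response)

# Request line: method, path, protocol version.
method, path, version = request_line.split(" ")
# Status line: protocol version, status code, reason phrase.
status_code = int(status_line.split(" ")[1])

print(method, path, version)
print(status_code, response_headers["Content-Type"])
print(response_body)
```

In practice a library such as Requests builds and parses these messages for you; the sketch only shows which pieces (method, headers, status code, body) it exposes.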

How to Send Requests with Python Requests

Installing with pip
On macOS, install the library with: pip3 install requests

Setting a browser-like User-Agent request header disguises the request as one sent by a browser.

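Practice

import requests

# Send a browser-like User-Agent so the site treats the request as coming from a browser.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.6.1 Safari/605.1.15"
}
# Request the first page of the Douban Top 250 movie list.
response = requests.get("https://movie.douban.com/top250", headers=headers)

# Print the HTML returned by the server.
print(response.text)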

What Is the Structure of an HTML Page?


Common HTML Tags

<a>: link
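
To make the nesting of tags concrete, here is a small sketch that walks a hand-written page with Python's built-in `html.parser` module and prints each opening tag indented by its depth. The sample page, the `OutlinePrinter` class, and the `https://example.com` link are all assumptions made for illustration. It uses the standard library rather than Beautiful Soup, so it runs without installing anything.

```python
from html.parser import HTMLParser

# A tiny hand-written page (placeholder content).
sample_html = """<!DOCTYPE html>
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Heading</h1>
    <p>A paragraph with a <a href="https://example.com">link</a>.</p>
  </body>
</html>"""


class OutlinePrinter(HTMLParser):
    """Record every opening tag with its nesting depth, plus all link targets."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + f"<{tag}>")
        self.depth += 1
        if tag == "a":
            # attrs is a list of (name, value) pairs.
            self.links.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        self.depth -= 1


parser = OutlinePrinter()
parser.feed(sample_html)
print("\n".join(parser.outline))
print(parser.links)
```

The indentation in the output mirrors the tree structure: `<head>` and `<body>` sit inside `<html>`, and the `<a>` link sits inside the `<p>` paragraph.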

![Image](https://img-blog.csdnimg.cn/48567ae1276e494e8f03b3035aa9aa56.png)

# Beautiful Soup

pip3 install bs4

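from bs4 import BeautifulSoup
import requests

# Download the page and keep its HTML as text.
content = requests.get("http://books.toscrape.com/").text

# Parse the HTML with Python's built-in parser.
soup = BeautifulSoup(content, "html.parser")
# Find every <p> element whose class is "price_color".
all_prices = soup.find_all("p", attrs={"class": "price_color"})
for price in all_prices:
    # Skip the two leading characters (the currency prefix) and print the rest.
    print(price.string[2:])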
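
To try the same search without touching the network, the sketch below runs `find_all` and `find` on a hand-written HTML string. The markup and the book titles are placeholders, only loosely modeled on a product listing, and the real page may be structured differently. It also uses the `class_` keyword, which Beautiful Soup accepts as a shorthand for filtering by CSS class.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (placeholder data, not copied from a real site).
sample_html = """
<article class="product_pod">
  <h3><a href="book-1.html" title="Sample Book One">Sample Book One</a></h3>
  <p class="price_color">£10.00</p>
</article>
<article class="product_pod">
  <h3><a href="book-2.html" title="Sample Book Two">Sample Book Two</a></h3>
  <p class="price_color">£12.50</p>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

books = []
for article in soup.find_all("article", class_="product_pod"):
    # find() returns the first matching element inside this article.
    link = article.find("a")
    price = article.find("p", class_="price_color")
    # Tag attributes are read like dictionary keys.
    title = link["title"]
    # Strip the currency symbol before converting the price to a number.
    amount = float(price.string.lstrip("£"))
    books.append((title, amount))

print(books)
```

Searching inside each `article` first keeps a title and its price paired together, which is harder to guarantee when the two are collected in separate page-wide searches.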

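Hands-On Project

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent so the site treats the request as coming from a browser.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.6.1 Safari/605.1.15"
}
# Each page lists 25 movies, so start takes the values 0, 25, ..., 225.
for start_num in range(0, 250, 25):
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    # Find every <span> element whose class is "title".
    all_titles = soup.find_all("span", attrs={"class": "title"})
    for title in all_titles:
        title_string = title.string
        # Print only the titles that do not contain "/".
        if "/" not in title_string:
            print(title_string)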
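
The last step of the workflow, storing the data, is not shown above. One possible approach, sketched here with Python's built-in `csv` module, is to collect the titles in a list and write them to a file. The `save_titles` and `load_titles` helpers, the `titles.csv` file name, and the sample titles are all hypothetical; in the scraper, the `print` call would be replaced by appending each title to a list.

```python
import csv


def save_titles(titles, path):
    """Write the titles to a CSV file, one row per title with its rank."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["rank", "title"])
        for rank, title in enumerate(titles, start=1):
            writer.writerow([rank, title])


def load_titles(path):
    """Read the titles back from the CSV file, in order."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["title"] for row in csv.DictReader(f)]


# Placeholder data standing in for scraped titles.
sample_titles = ["Placeholder Title A", "Placeholder Title B"]
save_titles(sample_titles, "titles.csv")
print(load_titles("titles.csv"))
```

Writing with `encoding="utf-8"` matters here because the scraped titles are mostly Chinese text.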