[Personal notes] Python Web Scraping Study Notes (Part 1): Scraping Basics and Four Simple Examples

news2025/7/7 7:05:23

Python Web Scraping Study Notes (Part 1)

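Basics
Four simple scraping examples
- 1. Fetch the Baidu home page with urlopen and save it
- 2. Get translation suggestions for a word from a translation site
- 3. Get book titles and prices from a web page
- 4. Get the titles of the top 250 movies on Douban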

Basics

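For any web page, you can right-click in the browser and choose View Page Source, but what you see there is not necessarily the same as what the Inspect view in the developer tools shows.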

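Server-side rendering: if the two match, the page is most likely rendered on the server. The data you see on the page is already in the source, because the server sends everything to the client in one go. You only need to request the page and then parse the parts you are interested in out of the returned HTML.
Client-side rendering: if they differ, the page is most likely rendered on the client. The source shown by View Page Source is only a bare HTML skeleton; the data is sent by the server in separate requests and injected into the page by the client, which then re-renders it.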
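
The comparison described above can be approximated in code. This is only a minimal sketch, not a reliable test: `guess_rendering` is a hypothetical helper, and the two HTML strings are made-up stand-ins for a page source fetched over HTTP. It simply checks whether text that is visible in the browser also occurs in the raw source.

```python
def guess_rendering(page_source: str, visible_text: str) -> str:
    """Guess 'server' if the visible text is in the raw source, else 'client'."""
    return "server" if visible_text in page_source else "client"


# Made-up page sources for illustration only.
server_html = "<html><body><ul><li>美丽人生</li></ul></body></html>"
client_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

print(guess_rendering(server_html, "美丽人生"))
print(guess_rendering(client_html, "美丽人生"))
```

A "client" result only means the text is missing from the source that was checked; the data still has to be located with the browser's developer tools.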

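To get the data of a page of the second kind, you need the browser's network capture tool.
For example, a page may display the text “美丽人生” (Life Is Beautiful), yet pressing Ctrl+F in View Page Source finds no such text. That suggests the page is of the second kind, i.e. client-side rendered.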
So where is the text “美丽人生”? Right-click the page and choose Inspect at the bottom of the menu, or simply press F12, to open the developer tools.

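Click the request entries listed on the left one by one and check the Preview pane on the right. The second entry turns out to be the one we need: it contains the text “美丽人生”.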
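Once you have found it, click the Headers tab on the right. For now only a few parts of it matter; the code that follows uses the request URL and the User-Agent request header.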

Write code to try to fetch the data shown in the preview:

import requests

url = 'https://movie.douban.com/j/chart/top_list?type=24&interval_id=100%3A90&action=&start=0&limit=20'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0'
}

resp = requests.get(url=url, headers=headers)

print(resp.text)

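Running this prints all the data seen in the preview, though the output is rather messy. From here on it is only a matter of picking out the fields of interest, which is basic Python rather than scraping, since the data has already been retrieved.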

For example, the following code gets only the movie titles and ratings:

import requests

url = 'https://movie.douban.com/j/chart/top_list?type=24&interval_id=100%3A90&action=&start=0&limit=20'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0'
}

resp = requests.get(url=url, headers=headers)
# The response body is a JSON array, so resp.json() returns a list of dicts
content_list = resp.json()

for content in content_list:
    movie_name = content['title']
    movie_score = content['score']
    print(f'《{movie_name}》, score: {movie_score}')

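
The request URL used above carries its parameters in the query string. As a rough sketch, the standard library can split that query string into a dict, which makes it easier to change a single value. Treating `start` as an offset and `limit` as a page size is an assumption based only on the parameter names, not something the site documents here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

url = 'https://movie.douban.com/j/chart/top_list?type=24&interval_id=100%3A90&action=&start=0&limit=20'

parts = urlsplit(url)
# keep_blank_values=True preserves the empty "action" parameter
params = dict(parse_qsl(parts.query, keep_blank_values=True))
print(params)

# Assumption: "start" is an offset, so the next batch begins at start + limit
params['start'] = str(int(params['start']) + int(params['limit']))
next_url = urlunsplit(parts._replace(query=urlencode(params)))
print(next_url)
```

With requests, the same dict can be passed as `params=params` together with the URL without its query string, instead of building the full URL by hand.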

Four simple scraping examples

1. Fetch the Baidu home page with urlopen and save it

from urllib.request import urlopen

resp = urlopen('http://www.baidu.com')

with open('baidu.html', mode='w', encoding='utf-8') as f:
    f.write(resp.read().decode('utf-8'))
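
The example above assumes the response body is UTF-8. A small sketch of a more defensive decode step follows; `decode_body` is a hypothetical helper, and the sample bytes are made up. With urlopen, the declared charset (if any) can be read via `resp.headers.get_content_charset()`, which returns None when the server does not declare one.

```python
from typing import Optional


def decode_body(raw: bytes, declared_charset: Optional[str] = None,
                fallback: str = "utf-8") -> str:
    """Decode with the declared charset if there is one, else the fallback."""
    charset = declared_charset or fallback
    # errors="replace" avoids an exception on stray bytes
    return raw.decode(charset, errors="replace")


# Made-up sample: a GBK-encoded fragment and a plain ASCII one
gbk_bytes = "<title>百度一下</title>".encode("gbk")
print(decode_body(gbk_bytes, "gbk"))
print(decode_body(b"<title>ok</title>"))
```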

2. Get translation suggestions for a word from a translation site


Reference code:

import requests

url = 'https://fanyi.baidu.com/sug'

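name = input('Enter the word to look up: ')

# The User-Agent belongs in the request headers, not in the form data
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0'
}
data = {
    'kw': name
}

resp = requests.post(url, headers=headers, data=data)

# Take the 'v' field of the first candidate in the 'data' list
fanyi_result = dict(resp.json()['data'][0])['v']
print(fanyi_result)

resp.close()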
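
Indexing `['data'][0]` raises an IndexError when no candidate comes back. The following is a hedged sketch of a safer lookup. `first_suggestion` is a hypothetical helper, the payload shape (a `data` list of dicts with a `v` field) is inferred only from the code above, and the `k` key and sample values are made up for illustration.

```python
from typing import Optional


def first_suggestion(payload: dict) -> Optional[str]:
    """Return the 'v' field of the first candidate, or None if there is none."""
    candidates = payload.get("data") or []
    if not candidates:
        return None
    return candidates[0].get("v")


# Made-up payloads standing in for resp.json()
sample = {"data": [{"k": "apple", "v": "placeholder translation text"}]}
empty = {"data": []}

print(first_suggestion(sample))
print(first_suggestion(empty))
```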

3. Get book titles and prices from a web page

Reference code:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0"
}
url = "http://books.toscrape.com/"
response = requests.get(url=url, headers=headers)
if response.ok:
    print(response.status_code)  # status code; 200 means the request succeeded
    content = response.text
    # "html.parser" tells BeautifulSoup to parse the content as HTML
    soup = BeautifulSoup(content, "html.parser")

    # Book prices: find the p tags by their class attribute; the result is iterable
    all_prices = soup.find_all("p", attrs={"class": "price_color"})
    # print(list(all_prices))
    print("===== Book prices =====")
    for price in all_prices:
        # price.string keeps only the text inside the tag; the slice then drops the leading currency characters
        print(price.string[2:])
    print("===== Book titles =====")
    # Book titles: look inside each h3 for the a tag holding the title
    all_titles = soup.find_all("h3")
    for title in all_titles:
        all_links = title.find_all("a")
        for link in all_links:
            print(link.string)

    response.close()
else:
    print("Request failed")
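
The fixed slice `[2:]` in the code above depends on exactly how many characters precede the number, which can change with the response encoding. As an alternative sketch, a regular expression can pull out the numeric part regardless of the prefix. `parse_price` is a hypothetical helper and the sample strings are illustrative, not taken from an actual run.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[float]:
    """Extract the first decimal number in the text, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


# Illustrative inputs: a clean price and one with a mis-decoded currency sign
print(parse_price("£51.77"))
print(parse_price("Â£51.77"))
print(parse_price("no price here"))
```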

4. Get the titles of the top 250 movies on Douban

Reference code:

import requests
from bs4 import BeautifulSoup

# Get the titles of the top 250 movies on Douban

# Browser identification (User-Agent)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0"
}

i = 1
for start_num in range(0, 250, 25):
    # print(start_num)

    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    # print("Server response status code:", response.status_code)
    response.encoding = "UTF-8"  # set the character encoding explicitly

    if response.ok:  # run the code below only if the server responded normally
        douban_top250_html = response.text
        soup = BeautifulSoup(douban_top250_html, "html.parser")

        # all_titles = soup.find_all("span", attrs={"class": "title"})
        all_titles = soup.find_all("span", class_="title")  # both forms are equivalent

        for title in all_titles:
            title_string = title.string
            # Skip title spans that contain "/"
            if "/" not in title_string:
                print(f"{i}:\t《{title_string}》")
                i += 1
    else:
        print("Request failed!")

    response.close()
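
The paging and filtering logic of the loop above can be isolated and checked without sending any request. This is a minimal sketch under stated assumptions: `page_urls` and `is_primary_title` are hypothetical helper names, and the sample title strings are made up to mimic a span that contains "/".

```python
BASE_URL = "https://movie.douban.com/top250"


def page_urls(total: int = 250, page_size: int = 25) -> list:
    """Build one URL per page, stepping the start offset by page_size."""
    return [f"{BASE_URL}?start={start}" for start in range(0, total, page_size)]


def is_primary_title(text: str) -> bool:
    """Mirror the filter in the loop above: keep titles without a slash."""
    return "/" not in text


urls = page_urls()
print(len(urls), urls[0], urls[-1])

# Made-up strings standing in for the text of span.title elements
sample_titles = ["Primary title", "\xa0/\xa0Alternate title"]
print([t for t in sample_titles if is_primary_title(t)])
```

Keeping these two pieces as plain functions makes it easy to confirm that 250 entries at 25 per page yield 10 requests before running the scraper itself.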