Python Web Scraping, Part 09: Status Codes

2025/4/6 18:08:44

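When building web crawlers in Python, you need a working understanding of HTTP status codes. A status code is the three-digit value a server returns with every response to tell the client whether the request succeeded, failed, or requires further action. Some common status codes and their meanings are listed below.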

Common HTTP status codes

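1xx: Informational
- These codes indicate that the request has been received and is still being processed.
- 100 Continue: the server has received the first part of the request, and the client should go ahead and send the rest.
2xx: Success
- 200 OK: the request succeeded, and the response contains the requested information.
- 201 Created: the request succeeded and a new resource was created.
- 204 No Content: the server processed the request successfully but returns no content.
3xx: Redirection
- 301 Moved Permanently: the requested resource has moved permanently to a new location.
- 302 Found (temporary redirect): the requested resource temporarily resides at a different URI.
- 304 Not Modified: the resource has not been modified since the last request, so the cached copy can be reused.
4xx: Client error
- 400 Bad Request: the server cannot understand the request.
- 401 Unauthorized: the request requires user authentication.
- 403 Forbidden: the server understood the request but refuses to carry it out.
- 404 Not Found: the requested resource does not exist.
5xx: Server error
- 500 Internal Server Error: the server hit an unexpected condition that prevented it from completing the request.
- 502 Bad Gateway: a server acting as a gateway or proxy received an invalid response from the upstream server.
- 503 Service Unavailable: the server is currently unable to handle the request (because of overload or maintenance).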
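
The five classes above follow directly from the first digit of the code, so a crawler can classify any status code without listing every value. The snippet below is a minimal sketch using only the standard library's http.HTTPStatus; the names STATUS_CLASSES and describe_status are illustrative and not part of any library.

```python
from http import HTTPStatus

# Map the first digit of a status code to the name of its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)'."""
    category = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase  # standard reason phrase, if registered
    except ValueError:
        phrase = "Unknown"  # code is not in the standard library's registry
    return f"{code} {phrase} ({category})"

for code in (200, 301, 404, 503):
    print(describe_status(code))
```

Classifying by the first digit also covers codes the list does not mention: an unfamiliar 4xx code can still be treated as a client error.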

Checking HTTP status codes in Python

In Python, you can use the requests library to send an HTTP request and read the status code of the response. Here is a simple example:

import requests

response = requests.get('https://www.example.com')
print(response.status_code)

To react differently depending on the status code, use a conditional statement:

if response.status_code == 200:
    print("Request succeeded")
elif response.status_code == 404:
    print("Page not found")
else:
    print(f"Unexpected status code: {response.status_code}")

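Checking the status code lets your crawler handle each response appropriately, so it does not parse error pages or retry requests needlessly. In practice you will also have to deal with less common status codes and with failures such as dropped connections, which calls for retries, timeouts, and similar measures.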


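Beyond handling the common status codes, a crawler also has to manage exceptions and errors effectively and behave well toward the sites it visits. The sections below cover some more advanced topics.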

Exception handling and retries

Network connections can be unstable, and a server can be temporarily unavailable. A robust crawler should cope with both, for example by retrying failed requests:

import requests
from requests.exceptions import RequestException

def fetch(url, retries=3):
    try:
        response = requests.get(url)
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
        return response
    except RequestException:
        if retries > 0:
            return fetch(url, retries=retries - 1)  # retry recursively
        print(f"Failed to retrieve {url}: no retries left.")
        raise

Timeouts

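A sensible timeout keeps a request from hanging for a long time and so improves the overall throughput of the crawler. The requests library accepts a timeout parameter; if you omit it, the request has no timeout at all: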

response = requests.get('https://www.example.com', timeout=5)  # give up after 5 seconds

Respecting the server's rate limits

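To avoid putting excessive load on the target site, and to avoid getting your IP address banned, a crawler should pace its requests. This can be done by adding delays or by using a proxy pool: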

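import time

def fetch_with_delay(url, max_attempts=3):
    # max_attempts bounds the loop so a persistent 429 cannot retry forever
    for _ in range(max_attempts):
        response = requests.get(url)
        if response.status_code != 429:  # anything other than Too Many Requests
            return response  # the caller decides how to handle other codes
        print("Too many requests, waiting...")
        time.sleep(60)  # wait one minute before trying again
    return response  # still rate-limited after max_attempts tries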
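
The fixed 60-second wait above is a simple default. Servers that send 429 or 503 often include a Retry-After header, whose value is either a number of seconds or an HTTP date, and honoring it is usually the more polite choice. The helper below is a minimal sketch of parsing that header with the standard library; retry_after_seconds and its default and now parameters are hypothetical names introduced for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, default=60.0, now=None):
    """Convert a Retry-After header value into a wait time in seconds.

    Falls back to `default` when the header is missing or unparseable.
    """
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():  # form 1: delay in seconds, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # form 2: an HTTP date
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:  # treat a naive result as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())

# With requests, the value would come from response.headers.get("Retry-After"),
# and the result would be passed to time.sleep().
print(retry_after_seconds("120"))
print(retry_after_seconds(None))
```

Passing now explicitly is only there to make the function easy to test; normal callers can leave it out.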

Handling redirects gracefully

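Redirects are common when crawling, especially when a site changes its URL structure. By default, requests follows 301 and 302 redirects automatically, but sometimes you need to control the process yourself: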

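from urllib.parse import urljoin

response = requests.get('http://old.example.com', allow_redirects=False)
if response.status_code in (301, 302):
    # Location may be a relative URL, so resolve it against the request URL
    new_url = urljoin(response.url, response.headers['Location'])
    response = requests.get(new_url)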
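
The snippet above follows a single redirect. If you take over redirect handling entirely, it helps to cap the number of hops and to detect loops. The sketch below shows one way to do that; follow_redirects and its fetch callback are hypothetical names, and the callback is assumed to return a (status_code, location) pair so the logic can run without a network connection.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}

def follow_redirects(url, fetch, max_hops=5):
    """Follow redirects by hand and return the final URL.

    `fetch(url)` must return (status_code, location_or_None).
    """
    seen = {url}
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url  # not a redirect: this is the final URL
        url = urljoin(url, location)  # resolve relative Location values
        if url in seen:
            raise RuntimeError(f"redirect loop detected at {url}")
        seen.add(url)
    raise RuntimeError(f"gave up after {max_hops} redirects")

# Stand-in for real HTTP calls: maps each URL to (status, Location header).
fake_site = {
    "http://old.example.com/a": (301, "/b"),
    "http://old.example.com/b": (302, "https://new.example.com/c"),
    "https://new.example.com/c": (200, None),
}
print(follow_redirects("http://old.example.com/a", fake_site.__getitem__))
```

With requests, the callback could wrap requests.get(url, allow_redirects=False, timeout=5) and return response.status_code together with response.headers.get("Location").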

Using request headers

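Servers sometimes change their behavior based on request headers such as User-Agent. To imitate a particular browser or device, add the corresponding headers to the request: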

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get('https://www.example.com', headers=headers)

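These techniques help you build a more robust and efficient crawler while respecting network etiquette and applicable laws. In practice, you should also check the target site's robots.txt file so that your crawler stays within the site's crawling policy.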
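
Checking robots.txt can be automated with the standard library's urllib.robotparser. The example below is a small sketch that parses an in-memory robots.txt so it runs without network access; the rules in ROBOTS_TXT and the user agent name "my-crawler" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_text):
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# Ask whether a given user agent may fetch a given URL.
print(parser.can_fetch("my-crawler", "https://www.example.com/index.html"))
print(parser.can_fetch("my-crawler", "https://www.example.com/private/data"))

# Crawl-delay, if the site declares one, is a hint for pacing requests.
print(parser.crawl_delay("my-crawler"))
```

Against a live site you would instead call parser.set_url() with the address of the site's robots.txt and then parser.read() to download it.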

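The following example combines these ideas. It shows how to handle different HTTP status codes and network exceptions while applying practices such as retries and timeouts.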

Example code: exception handling and status code checks in a crawler

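import requests
from requests.exceptions import HTTPError, RequestException
import time

def fetch(url, retries=3, backoff_factor=0.3, status_forcelist=(500, 502, 504), timeout=5):
    """
    Send a GET request, retrying on network errors and selected status codes.

    Parameters:
        url: target URL
        retries: maximum number of attempts
        backoff_factor: base delay for exponential backoff, in seconds
        status_forcelist: status codes that trigger a retry
        timeout: request timeout, in seconds
    """
    for i in range(retries):
        try:
            response = requests.get(url, timeout=timeout)

            # Check the retryable codes first: raise_for_status() would
            # otherwise raise for them before this check could run
            if response.status_code in status_forcelist:
                raise RequestException(f"Got error status code: {response.status_code}")

            response.raise_for_status()  # raises HTTPError for any other 4xx/5xx code
            return response

        except HTTPError:
            raise  # code not in status_forcelist (for example 404): retrying will not help
        except RequestException as e:
            print(f"Request failed with error: {e}")
            if i < retries - 1:  # do not sleep after the final attempt
                wait_time = backoff_factor * (2 ** i)
                print(f"Retrying in {wait_time} seconds...")
                time.sleep(wait_time)

    raise RequestException(f"All {retries} attempts failed for {url}")

# Example usage
url = "https://api.example.com/data"
try:
    response = fetch(url)
    print("Response received successfully.")
    print(response.text)
except RequestException as e:
    print(f"Failed to fetch data from {url}: {e}")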

Code walkthrough

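Retry logic (the fetch function):
- retries sets the maximum number of attempts.
- backoff_factor and the exponential backoff strategy make the delay grow between attempts (starting at 0.3 seconds with the defaults and doubling each time), which avoids hammering the server with rapid repeated requests.
- status_forcelist lists the status codes that trigger a retry, such as 5xx server errors.
Timeout handling (the timeout parameter):
- Sets the maximum time to wait for a response, so a request cannot hang indefinitely.
Exception handling (raise_for_status and catching RequestException):
- raise_for_status raises HTTPError when the status code is in the 4xx or 5xx range.
- RequestException is the base class for the errors requests raises, including connection failures, DNS resolution failures, and timeouts.
Status code checks:
- In addition to calling raise_for_status, the retry logic checks the status code against status_forcelist so that specific server errors are retried.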
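
The parameter names backoff_factor and status_forcelist used by fetch are also the names of options on urllib3's Retry class, which requests can apply through a transport adapter. As an alternative to a hand-written loop, the sketch below mounts such an adapter on a Session. It is one possible setup, not the method used above; make_session is a hypothetical helper name, and the exact delay schedule depends on the installed urllib3 version.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries=3, backoff_factor=0.3, status_forcelist=(500, 502, 504)):
    """Return a Session that retries failed requests at the transport level."""
    retry = Retry(
        total=retries,                      # overall cap on retries
        backoff_factor=backoff_factor,      # base for the exponential backoff
        status_forcelist=status_forcelist,  # status codes that force a retry
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

session = make_session()
# Usage (needs network access), still passing an explicit timeout:
# response = session.get("https://api.example.com/data", timeout=5)
```

Because the adapter is attached to the session, every request made through it shares the same retry policy without extra code at each call site.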