When calling an API from a Python crawler, failed requests are a common problem. They can be caused by network issues, API rate limits, server errors, or other factors. To keep the crawler stable and reliable, these failures need to be handled sensibly. The following approaches are effective:
1. Catch exceptions
Wrapping requests in try-except blocks prevents the program from crashing on an error and gives you a place to handle it appropriately. Common exception types include:
- Network errors (such as ConnectionError): usually indicate a problem with the network connection.
- HTTP errors (such as HTTPError): the request returned an error status code (4xx or 5xx) rather than a success code.
- Parsing errors (such as ValueError): typically occur when parsing HTML or JSON data.
Example code:
Python
import requests
from requests.exceptions import HTTPError, ConnectionError, Timeout
url = "http://example.com/api/data"
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    data = response.json()
except HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
except ConnectionError as conn_err:
    print(f"Connection error occurred: {conn_err}")
except Timeout as timeout_err:
    print(f"Timeout error occurred: {timeout_err}")
except Exception as err:
    print(f"An error occurred: {err}")
2. Retry mechanism
When a request fails, a retry mechanism lets the crawler attempt to fetch the data again. This can be implemented in the following ways:
- Use the retrying library: provides a simple retry decorator.
- Custom retry logic: after catching a specific exception, retry up to a maximum number of attempts with a wait between them.
Example using the retrying library:
Python
from retrying import retry
import requests
# retry up to 3 times, waiting 2000 ms between attempts
@retry(stop_max_attempt_number=3, wait_fixed=2000)
def fetch_url(url):
    response = requests.get(url)
    response.raise_for_status()
    return response.text

url = "http://example.com/api/data"
try:
    data = fetch_url(url)
    print(f"Successfully fetched {url}")
except Exception as err:
    print(f"Failed to fetch {url}: {err}")
Example of custom retry logic:
Python
import time
import requests
def fetch_url(url, max_retries=3, wait_time=2):
    for attempt in range(max_retries):
        try:
            response = requests.get(url)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as req_err:
            print(f"Attempt {attempt + 1} failed: {req_err}")
            time.sleep(wait_time)
    raise Exception(f"Failed to fetch {url} after {max_retries} attempts")

url = "http://example.com/api/data"
try:
    data = fetch_url(url)
    print(f"Successfully fetched {url}")
except Exception as err:
    print(f"Failed to fetch {url}: {err}")
3. Exponential backoff
When the API returns a "429 Too Many Requests" status code, you are sending requests too frequently. In that case, use an exponential backoff strategy: increase the wait time between successive retries (for example 1, 2, 4, 8 seconds), and honor the server's Retry-After header when it is present. This helps avoid being throttled for making requests too quickly.
Example code:
Python
import time
import requests
def fetch_url_with_backoff(url, max_retries=5):
    retry_count = 0
    while retry_count < max_retries:
        try:
            response = requests.get(url)
            response.raise_for_status()
            return response.text
        except requests.exceptions.HTTPError as http_err:
            if http_err.response.status_code == 429:
                # honor Retry-After if the server sends it; otherwise back off exponentially (1, 2, 4, ... seconds)
                retry_after = int(http_err.response.headers.get('Retry-After', 2 ** retry_count))
                print(f"Rate limit exceeded. Retrying in {retry_after} seconds...")
                time.sleep(retry_after)
                retry_count += 1
            else:
                raise
        except requests.exceptions.RequestException as req_err:
            print(f"Request failed: {req_err}")
            break
    raise Exception(f"Failed to fetch {url} after {max_retries} attempts")

url = "http://example.com/api/data"
try:
    data = fetch_url_with_backoff(url)
    print(f"Successfully fetched {url}")
except Exception as err:
    print(f"Failed to fetch {url}: {err}")
4. Logging
When handling exceptions, it is important to record the error details promptly. You can use Python's built-in logging module or a third-party library such as loguru (a loguru sketch follows the example below) to write error logs. This makes it much easier to locate and fix problems.
Example code:
Python
import logging
import requests
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
url = "http://example.com/api/data"
try:
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()
except requests.exceptions.RequestException as req_err:
    logging.error(f"Request failed: {req_err}")
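For loguru, mentioned above, a minimal equivalent might look like the sketch below; the log file name and rotation size are illustrative assumptions, not part of the original example.
Python
import requests
from loguru import logger

# write errors to a rotating log file; the file name and size limit are arbitrary examples
logger.add("crawler_errors.log", rotation="10 MB", level="ERROR")

url = "http://example.com/api/data"
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = response.json()
except requests.exceptions.RequestException as req_err:
    # logger.exception records the traceback as well as the message
    logger.exception(f"Request failed: {req_err}")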
5. Optimize requests
- Cache results: for data that does not need to be refreshed frequently, cache the API responses to avoid unnecessary repeat requests.
- Batch requests: where the API supports it, merge several individual requests into one batch request to reduce the total number of calls.
- Pace your requests: avoid sending a burst of requests in a short period (a simple caching-and-pacing sketch follows this list).
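Caching and pacing can be combined in a few lines. The sketch below is a minimal illustration rather than a production-grade cache: it assumes responses for the same URL can safely be reused for a short, fixed time, and the CACHE_TTL and MIN_INTERVAL values are made-up examples.
Python
import time
import requests

CACHE_TTL = 300       # assumed: cached responses are reused for 5 minutes
MIN_INTERVAL = 1.0    # assumed: at least 1 second between outgoing requests

_cache = {}           # url -> (timestamp, data)
_last_request_time = 0.0

def fetch_cached(url):
    global _last_request_time
    now = time.time()

    # return a cached result if it is still fresh
    cached = _cache.get(url)
    if cached and now - cached[0] < CACHE_TTL:
        return cached[1]

    # pace outgoing requests: wait if the last request was too recent
    wait = MIN_INTERVAL - (now - _last_request_time)
    if wait > 0:
        time.sleep(wait)

    response = requests.get(url, timeout=10)
    _last_request_time = time.time()
    response.raise_for_status()

    data = response.json()
    _cache[url] = (time.time(), data)
    return data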
6. Use a proxy
If your requests are being throttled or your IP has been blocked, you can route requests through a proxy server to hide your real IP address. This helps avoid request failures caused by an IP ban.
Example code:
Python
import requests
proxies = {
    "http": "http://your-proxy-ip:port",
    "https": "http://your-proxy-ip:port"
}

url = "http://example.com/api/data"
try:
    response = requests.get(url, proxies=proxies)
    response.raise_for_status()
    data = response.json()
except requests.exceptions.RequestException as req_err:
    print(f"Request failed: {req_err}")
Summary
The approaches above make API request failures much easier to handle and improve the stability and reliability of a crawler. Catching exceptions properly, adding a retry mechanism, using exponential backoff, logging errors, and keeping the request rate reasonable are all important for keeping a crawler running smoothly.