Python Web Scraping: A Step-by-Step Beginner's Tutorial

2025/4/15 4:33:40

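1. Analyzing the Target Website

2. Importing the Required Libraries

3. Request Parameters

4. Sending a Request to the Site (Using a Proxy IP)

5. Matching

6. Downloading Images and Saving Them to a Folder (os Library)

7. Complete Code

Summary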

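Many people have struggled with how to get started with web scraping in Python. In this tutorial I use a small, fun example to walk through a simple scraping exercise step by step.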

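1. Analyzing the Target Website

Before writing a scraper, we need to understand the structure of the target site and where its data comes from. Our example is an image site: we want to collect the image information and save the images. To understand the page structure, open the browser's developer tools, inspect the page's HTML, and find the tags and attributes that hold the data we want to scrape.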
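To make "find the tags and attributes" concrete, here is a minimal sketch that lists the attributes of every img tag in a piece of HTML. The markup in SAMPLE_HTML is made up for illustration, and the sketch uses the standard library's html.parser instead of BeautifulSoup so that it runs without installing anything; ImgAttrCollector is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for what the developer tools might show.
SAMPLE_HTML = """
<div class="gallery">
  <img class="thumb" src="/static/img/cat.jpg" alt="A cat">
  <img class="thumb" data-src="/static/img/dog.png" alt="A dog">
  <a href="/about">About</a>
</div>
"""


class ImgAttrCollector(HTMLParser):
    """Collect the attributes of every <img> tag that is parsed."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; store it as a dict.
        if tag == "img":
            self.images.append(dict(attrs))


collector = ImgAttrCollector()
collector.feed(SAMPLE_HTML)
for attrs in collector.images:
    print(attrs)
```

In this made-up page the second image keeps its address in data-src rather than src, which is the kind of detail that only shows up when you inspect the real markup before writing the extraction code.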

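2. Importing the Required Libraries

The scraper needs a few libraries. requests sends requests to the target site and fetches the page content; BeautifulSoup parses that content so we can extract the data we need; re handles regular-expression matching; and os is used when saving images to a local folder. requests, BeautifulSoup and the lxml parser used later are third-party packages and can be installed with pip install requests beautifulsoup4 lxml; re and os ship with Python.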

import requests  
from bs4 import BeautifulSoup  
import re  
import os

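3. Request Parameters

A few request parameters deserve attention, in particular User-Agent and Cookie. The User-Agent header makes the request look like it comes from a browser, which lowers the chance that the site identifies the client as a scraper and blocks it. The Cookie header carries the user's login state so that later requests are treated as logged in. Both are set as fields in the request headers, as in the code that follows, which also defines the proxy used in the next step.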

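# Request headers sent with each request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Cookie': 'your cookie here'  # placeholder: replace with your own cookie
}
# Proxy settings; the address below is a placeholder
proxy = {
    'http': 'http://168.168.168.168:16888',
    'https': 'http://168.168.168.168:16888'
}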
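If the proxy address changes often, or the proxy needs a user name and password, it can help to build the dictionary from its parts. The sketch below is one way to do that; build_proxies is a hypothetical helper (not part of requests), and the credentials are invented. It relies on the scheme://user:password@host:port proxy URL form that requests accepts.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None):
    """Return a proxies dict in the shape that requests accepts.

    build_proxies is a hypothetical helper, not a requests function.
    """
    if username and password:
        # Percent-encode credentials so characters such as '@' or ':' are safe.
        auth = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    else:
        auth = ""
    proxy_url = f"http://{auth}{host}:{port}"
    # The same HTTP proxy is used for both http:// and https:// targets.
    return {"http": proxy_url, "https": proxy_url}


print(build_proxies("168.168.168.168", 16888))
print(build_proxies("168.168.168.168", 16888, "user", "p@ss"))
```

The first call reproduces the placeholder dictionary used in this tutorial; the second shows how a password containing '@' is encoded before it goes into the URL.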

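4. Sending a Request to the Site (Using a Proxy IP)

To cope with some anti-scraping measures, we send the request through a proxy IP. You can buy a proxy service or use free proxies; this example uses 站大爷 (Zhandaye) proxy IPs. The proxy address is passed to the HTTP request as its proxy server, so the request reaches the site through the proxy, which reduces the chance of our own IP being blocked.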

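url = 'http://www.example.com/'
# Send the request through the proxy and give up after 10 seconds
response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
response.raise_for_status()  # raise an exception on 4xx/5xx responses
html = response.text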
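Proxies, free ones especially, tend to fail or time out now and then, so a single failed request does not have to end the run. The following is a minimal retry sketch, not part of the original script: fetch_with_retries and its parameters are hypothetical, and the fetch argument stands in for a function that wraps requests.get with the headers, proxy and timeout shown above. A fake fetch function is used here so the example runs without network access.

```python
import time


def fetch_with_retries(fetch, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times, waiting longer after each failure.

    Hypothetical helper: `fetch` would typically wrap requests.get.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:  # real code would catch requests.exceptions.RequestException
            last_error = exc
            if attempt < retries - 1:
                sleep(backoff * (2 ** attempt))  # wait 1x, 2x, 4x ... the backoff
    raise last_error


# Simulated fetch that fails twice before succeeding (no network involved).
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated proxy failure")
    return "<html>ok</html>"


page = fetch_with_retries(flaky_fetch, "http://www.example.com/", sleep=lambda seconds: None)
print(page, len(calls))
```

Passing the sleep function as a parameter is only there to keep the demonstration instant; in a real run the default time.sleep gives the proxy a moment to recover between attempts.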

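5. Matching

This step covers how to match the data we need, with regular expressions or other means, and how to extract the useful information while filtering out the rest.

For example, to match images we can parse the HTML with BeautifulSoup, call find_all to collect every img tag, and then apply a regular expression to each tag's src attribute to keep only the addresses we want. The same approach, with different tags and patterns, works for titles or body text; tags and attributes that are not needed are simply skipped.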

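from urllib.parse import urljoin

soup = BeautifulSoup(html, 'lxml')
img_tags = soup.find_all('img')
os.makedirs('news_images', exist_ok=True)  # the folder must exist before files are written
image_paths = []
for img in img_tags:
    src = img.get('src')  # None when the tag has no src attribute
    # Keep only addresses ending in a common image extension (an assumed filter; adjust per site)
    if src and re.search(r'\.(jpg|jpeg|png|gif|webp)(\?.*)?$', src, re.IGNORECASE):
        img_url = urljoin(url, src)  # resolve relative addresses against the page URL
        img_path = os.path.join('news_images', src.split('?')[0].split('/')[-1])
        if not os.path.exists(img_path):
            try:
                img_response = requests.get(img_url, headers=headers, proxies=proxy, timeout=10)
                img_response.raise_for_status()
                with open(img_path, 'wb') as f:
                    f.write(img_response.content)
            except requests.exceptions.RequestException as e:
                print('Request failed:', e)
                continue
        image_paths.append(img_path)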
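Image addresses in src attributes are often relative ("/static/a.jpg") or protocol-relative ("//cdn.example.com/c.gif"), and not every src points at an image file. The sketch below shows the matching step on its own, with made-up src values: a regular expression keeps the values that look like image files, and urljoin turns them into absolute URLs. select_image_urls, IMAGE_PATTERN and the list of extensions are assumptions for illustration.

```python
import re
from urllib.parse import urljoin

# Assumed pattern: common image extensions, optionally followed by a query string.
IMAGE_PATTERN = re.compile(r"\.(jpg|jpeg|png|gif|webp)(\?.*)?$", re.IGNORECASE)


def select_image_urls(page_url, src_values):
    """Return absolute URLs for the src values that look like image files."""
    urls = []
    for src in src_values:
        if src and IMAGE_PATTERN.search(src):
            urls.append(urljoin(page_url, src))
    return urls


page_url = "http://www.example.com/news/index.html"
src_values = [
    "/static/a.jpg",                     # relative to the site root
    "thumbs/b.PNG?size=small",           # relative to the current page
    "//cdn.example.com/c.gif",           # protocol-relative
    "http://www.example.com/track.php",  # not an image file, filtered out
    None,                                # an img tag without a src attribute
]
image_urls = select_image_urls(page_url, src_values)
for image_url in image_urls:
    print(image_url)
```

Filtering by extension is a simple heuristic; a site that serves images from addresses without an extension would need a different rule, such as checking the Content-Type of the response.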

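6. Downloading Images and Saving Them to a Folder (os Library)

Saving the images relies on the os library. First create the target folder if it does not already exist (os.makedirs also accepts a mode argument if specific permissions are needed), then download each matched image into it. Concretely: take each image's URL from the img tags found by BeautifulSoup, download it with requests.get, and write the bytes to a file in the folder with Python's built-in open function; os supplies the folder creation and path handling. Set a reasonable timeout and handle exceptions when downloading, so that a slow or failed request does not stall or crash the whole run.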

# Run this before the download loop so the folder exists when files are written
os.makedirs('news_images', exist_ok=True)
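Taking the file name from the last part of the URL works for simple addresses but breaks when the URL carries a query string or ends in a slash. Here is a small sketch of the saving step that handles those cases; filename_from_url, save_image and the default name image.bin are hypothetical, and the demonstration writes fake bytes into a temporary folder so no network access is needed.

```python
import os
import tempfile
from urllib.parse import unquote, urlparse


def filename_from_url(img_url, default="image.bin"):
    """Derive a local file name from the last path segment of a URL."""
    path = urlparse(img_url).path  # leaves out the query string and fragment
    name = os.path.basename(unquote(path))
    return name or default


def save_image(content, img_url, folder):
    """Write downloaded bytes into `folder` and return the file path."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    img_path = os.path.join(folder, filename_from_url(img_url))
    with open(img_path, "wb") as f:
        f.write(content)
    return img_path


# Demonstration with fake bytes in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image(
        b"\x89PNG fake bytes",
        "http://www.example.com/img/logo.png?v=2",
        os.path.join(tmp, "news_images"),
    )
    saved_name = os.path.basename(saved)
    saved_size = os.path.getsize(saved)
    print(saved_name, saved_size)
```

In the scraper itself, content would be the response.content of the image request and folder would be 'news_images'.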

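7. Complete Code

This section puts all the steps together in one script. The code is for reference only: the URL, cookie and proxy address are placeholders, and a real site may change its layout, so adjust the script to your target before running it. In your own version you can wrap the steps in functions or classes to make them easier to maintain and reuse, and add comments that explain what each part does and why.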

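import requests
from bs4 import BeautifulSoup
import re
import os
from urllib.parse import urljoin

# Request parameters (the cookie and the proxy address are placeholders)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Cookie': 'your cookie here'
}
proxy = {
    'http': 'http://168.168.168.168:16888',
    'https': 'http://168.168.168.168:16888'
}

# Send the request to the site (through a 站大爷 proxy IP)
url = 'http://www.example.com/'
response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
response.raise_for_status()  # raise an exception on 4xx/5xx responses
html = response.text

# Create the folder first so the downloaded files have somewhere to go
os.makedirs('news_images', exist_ok=True)

# Parse the page and extract the image information
soup = BeautifulSoup(html, 'lxml')
img_tags = soup.find_all('img')
image_paths = []
for img in img_tags:
    src = img.get('src')  # None when the tag has no src attribute
    # Keep only addresses ending in a common image extension (an assumed filter; adjust per site)
    if src and re.search(r'\.(jpg|jpeg|png|gif|webp)(\?.*)?$', src, re.IGNORECASE):
        img_url = urljoin(url, src)  # resolve relative addresses against the page URL
        img_path = os.path.join('news_images', src.split('?')[0].split('/')[-1])
        # Download and save the image unless it is already on disk
        if not os.path.exists(img_path):
            try:
                img_response = requests.get(img_url, headers=headers, proxies=proxy, timeout=10)
                img_response.raise_for_status()
                with open(img_path, 'wb') as f:
                    f.write(img_response.content)
            except requests.exceptions.RequestException as e:
                print('Request failed:', e)
                continue
        image_paths.append(img_path)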