A Walkthrough of Some Web Crawler Code

2025/4/16 23:09:19

import requests
from bs4 import BeautifulSoup
import time
import logging
import json
import concurrent.futures  # needed for concurrent.futures.as_completed below
from concurrent.futures import ThreadPoolExecutor
import random

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# List of target page URLs
urls = [
    'http://example.com',
    'http://example.org',
    # more URLs...
]

# List of User-Agent strings
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    # more User-Agent strings...
]

# List of proxy servers
proxies = [
    {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'},
    # more proxies...
]

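# Session management: one Session reuses connections across requests
session = requests.Session()

# Crawler function
def crawl(url, session):
    try:
        # Pick a User-Agent for this request only; passing it per request
        # avoids mutating the session headers shared by all threads
        headers = {'User-Agent': random.choice(user_agents)}

        # Pick a proxy (None means a direct connection)
        proxy = random.choice(proxies) if proxies else None
        response = session.get(url, headers=headers, timeout=5, proxies=proxy)

        # Check whether the request succeeded
        if response.status_code == 200:
            logging.info(f'Fetched successfully: {url}')
            # Parse the page content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            # Read the <title> tag; .string is None when the tag has nested markup
            title = soup.title.string if soup.title and soup.title.string else 'No title'
            logging.info(f'Page title: {title}')

            # Return the parsed result
            return {
                'url': url,
                'title': title
            }
        else:
            logging.warning(f'Request to {url} failed, status code: {response.status_code}')

    except requests.exceptions.RequestException as e:
        logging.error(f'Request error: {e}')
    except Exception as e:
        logging.error(f'Other error: {e}')
    return None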

# Multithreaded crawl function
def multithread_crawl(urls):
    results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        future_to_url = {executor.submit(crawl, url, session): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            result = future.result()
            if result:
                results.append(result)
    return results

# Main function
def main():
    results = multithread_crawl(urls)
    
    # Save the results to a JSON file
    with open('results.json', 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=4)

if __name__ == '__main__':
    main()

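BeautifulSoup: parses HTML content.

logging: records log messages.

json: reads and writes JSON data.

ThreadPoolExecutor: comes from the concurrent.futures module; it creates a thread pool and manages the worker threads.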
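As a rough illustration of how the thread pool is driven, the sketch below submits a few tasks and collects the results as they finish. `fake_fetch` and `fetch_all` are hypothetical names, and `fake_fetch` stands in for a real network request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(url):
    # Hypothetical stand-in for a network request: no I/O is performed.
    return {'url': url, 'length': len(url)}


def fetch_all(urls, max_workers=4):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the URL it was submitted for.
        future_to_url = {executor.submit(fake_fetch, u): u for u in urls}
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(future_to_url):
            results.append(future.result())
    return results


if __name__ == '__main__':
    print(fetch_all(['http://example.com', 'http://example.org']))
```

Because results arrive in completion order, the output list is not guaranteed to match the order of the input URLs.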

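I.

logging.basicConfig sets up the basic logging configuration.

level=logging.INFO: sets the minimum level that gets logged. Python's log levels, from lowest to highest, are DEBUG, INFO, WARNING, ERROR, and CRITICAL. Setting the level to INFO here means that messages at INFO and above are recorded, while DEBUG messages are dropped. Setting it to DEBUG instead records more detail, which is very useful when debugging a program.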
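The threshold behaviour can be checked directly. This is a minimal sketch: the logger name `level_demo` is made up for the example, and it sets the level on a named logger rather than calling basicConfig, which is an assumption made to keep the demo isolated from any other logging setup.

```python
import logging

# Numeric values of the standard levels, lowest to highest.
levels = ['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL']
level_values = {name: getattr(logging, name) for name in levels}

# A logger set to INFO, mirroring level=logging.INFO in basicConfig.
demo_logger = logging.getLogger('level_demo')  # hypothetical logger name
demo_logger.setLevel(logging.INFO)

# For each level, ask whether a message at that level would be processed.
enabled = {name: demo_logger.isEnabledFor(value)
           for name, value in level_values.items()}

if __name__ == '__main__':
    print(level_values)
    print(enabled)
```

Only DEBUG comes back as disabled, since its numeric value (10) is below the INFO threshold (20).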

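format='%(asctime)s - %(levelname)s - %(message)s': a string that defines the layout of each log line. It has three parts:

%(asctime)s: the timestamp, filled in automatically with the time the log message was created. The default format is year-month-day hour:minute:second,millisecond, for example 2025-04-16 23:09:19,123.

%(levelname)s: the level of the message, such as INFO, WARNING, or ERROR, which makes it easy to see how severe a message is at a glance.

%(message)s: the actual log message, that is, the text you pass to a logging method such as logging.info() or logging.error().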
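To see what this format string produces, the sketch below attaches the same format to a handler that writes into an in-memory buffer. The logger name `format_demo` and the sample messages are invented for the example.

```python
import io
import logging

# Write formatted records into a string buffer instead of the console.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))

format_logger = logging.getLogger('format_demo')  # hypothetical logger name
format_logger.setLevel(logging.INFO)
format_logger.addHandler(handler)
format_logger.propagate = False  # keep the demo output out of the root logger

format_logger.info('page fetched')
format_logger.warning('status code 404')

# One formatted line per record: "<timestamp> - <LEVEL> - <message>"
lines = stream.getvalue().splitlines()

if __name__ == '__main__':
    for line in lines:
        print(line)
```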

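II.

User-Agent

The user_agents list stores different browser User-Agent strings for the crawler. User-Agent is an HTTP request header that tells the server which browser type and version sent the request, along with other system information. A crawler sets it for several reasons: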

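1. Imitating a browser: many websites restrict requests that are obviously from a crawler. Sending a browser's User-Agent makes the request look like an ordinary user's browser and lowers the chance of being identified as a crawler.

2. Better compatibility: sites may support some browsers better than others, so a common User-Agent helps ensure the crawler is served the site's normal resources.

3. Getting past anti-crawling checks: some sites decide whether to allow a request by inspecting its User-Agent, and a realistic browser User-Agent passes that kind of check.

4. Randomization: keeping several User-Agent strings in the user_agents list and picking one at random for each request further lowers the risk of being blocked by the server.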
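The random selection described in point 4 can be sketched as a small helper. `build_headers` and the second pool entry are hypothetical, introduced only for illustration; the first entry is the string used in the code above.

```python
import random

# Illustrative pool of User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    "ExampleBrowser/1.0",  # hypothetical placeholder, not a real browser
]


def build_headers(rng=random):
    # Pick one User-Agent at random and return it as a headers dict.
    # rng can be a seeded random.Random instance for repeatable runs.
    return {'User-Agent': rng.choice(USER_AGENTS)}


if __name__ == '__main__':
    print(build_headers())
```

With requests, the result can be passed per call, for example `session.get(url, headers=build_headers(), timeout=5)`, which avoids changing headers on a session that several threads share.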

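The User-Agent string in this list is:

"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3":

This is a sample User-Agent string. It indicates a 64-bit machine running Windows 10 (Windows NT 10.0; Win64; x64) and Google Chrome version 58.0.3029.110. The AppleWebKit/537.36 and Safari tokens are there for compatibility: Chrome's engine was originally WebKit, the engine Safari also uses, and Chrome kept these tokens after moving to its Blink fork so that sites checking for WebKit or Safari keep working. Note that a real Chrome sends Safari/537.36; the trailing Safari/537.3 in this string looks like it was truncated when copied.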
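To make the structure of the string concrete, the sketch below pulls out its `name/version` tokens with a regular expression. The pattern is a simplification chosen for this one string, not a general User-Agent parser.

```python
import re

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3")

# Each product token has the shape name/version; the parenthesised
# comments (platform details, "KHTML, like Gecko") contain no slash
# and are skipped by this pattern.
tokens = re.findall(r'(\w+)/([\d.]+)', ua)

# The platform details sit in the first parenthesised group.
platform = re.search(r'\(([^)]*)\)', ua).group(1)

if __name__ == '__main__':
    print(tokens)
    print(platform)
```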