What Is a Web Crawler? A Comprehensive Look, from Technical Principles to Real-World Applications (Part V)
21. Cloud-Native Crawler Architecture Design
21.1 Serverless Crawler (AWS Lambda)
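# lambda_function.py
import boto3
import requests
from bs4 import BeautifulSoup

# Create the S3 client at module level so warm invocations can reuse it
s3 = boto3.client('s3')


def lambda_handler(event, context):
    # Fetch the target page. The 10-second timeout is an assumed value that
    # keeps a stalled request from consuming the whole invocation.
    headers = {'User-Agent': 'AWS-Lambda-Crawler/1.0'}
    response = requests.get('https://news.example.com/latest', headers=headers, timeout=10)
    response.raise_for_status()  # fail fast on HTTP 4xx/5xx responses

    # Parse the content
    soup = BeautifulSoup(response.text, 'html.parser')
    articles = []
    for item in soup.select('.news-item'):
        articles.append({
            'title': item.select_one('h2').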