GitHub project: crawl4ai
- Output HTML
- Output Markdown
- Output structured data
- Comparison with BeautifulSoup
crawl4ai
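If I remember correctly, this GitHub project gained more than 3,000 stars yesterday and another 2,000 today. It is a crawling and parsing tool, so let's write a simple demo to get a feel for it.
Here we use crawl4ai to scrape GitHub's daily trending page and send the result to our own mailbox every day.
Output HTML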
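    from crawl4ai import AsyncWebCrawler


    async def github_trend_html():
        # Crawl the trending page and return crawl4ai's cleaned-up HTML.
        async with AsyncWebCrawler(verbose=True) as crawler:
            result = await crawler.arun(
                url="https://github.com/trending",
            )
            assert result.success, "failed to fetch GitHub trending data"
            return result.cleaned_html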
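The output is still HTML, but the original page has been processed: irrelevant and dynamic elements are removed and the HTML structure is simplified.
Output Markdown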
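    from crawl4ai import AsyncWebCrawler


    async def github_trend_md():
        # Same crawl as before; only the returned field differs.
        async with AsyncWebCrawler(verbose=True) as crawler:
            result = await crawler.arun(
                url="https://github.com/trending",
            )
            assert result.success, "failed to fetch GitHub trending data"
            return result.markdown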
Open the result in a Markdown viewer to see how it looks:
Output structured data
    import json

    from crawl4ai import AsyncWebCrawler
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy


    async def github_trend_json():
        # baseSelector matches one element per repository;
        # each field selector is applied inside that element.
        schema = {
            "name": "Github trending",
            "baseSelector": ".Box-row",
            "fields": [
                {
                    "name": "repository",
                    "selector": ".lh-condensed a[href]",
                    "type": "text",
                },
                {
                    "name": "description",
                    "selector": "p",
                    "type": "text",
                },
                {
                    "name": "lang",
                    "type": "text",
                    "selector": "span[itemprop='programmingLanguage']",
                },
                {
                    "name": "stars",
                    "type": "text",
                    "selector": "a[href*='/stargazers']",
                },
                {
                    "name": "today_star",
                    "type": "text",
                    "selector": "span.float-sm-right",
                },
            ],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
        async with AsyncWebCrawler(verbose=True) as crawler:
            result = await crawler.arun(
                url="https://github.com/trending",
                extraction_strategy=extraction_strategy,
                # Skip the local cache so each run fetches fresh data.
                bypass_cache=True,
            )
            assert result.success, "failed to fetch GitHub trending data"
            github_trending_json = json.loads(result.extracted_content)
            for ele in github_trending_json:
                # The link text looks like "owner /\n  repo"; drop all whitespace
                # to get "owner/repo", then prefix the GitHub host.
                ele['repository'] = 'https://github.com/' + ''.join(ele['repository'].split())
            return github_trending_json
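To make the roles of `baseSelector` and `fields` more concrete, here is a minimal sketch of schema-driven extraction using only the standard library. It is not how crawl4ai implements `JsonCssExtractionStrategy`: it swaps CSS selectors for the limited XPath that `xml.etree.ElementTree` supports, and the sample markup, the `extract` helper, and the `base_path` / `path` keys are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a fetched page.
SAMPLE = """
<div>
  <article class="Box-row">
    <h2><a href="/owner-a/repo-a">owner-a / repo-a</a></h2>
    <p>First demo repository</p>
  </article>
  <article class="Box-row">
    <h2><a href="/owner-b/repo-b">owner-b / repo-b</a></h2>
  </article>
</div>
"""

# Same shape as the schema above, but with XPath instead of CSS selectors.
SCHEMA = {
    "base_path": ".//article[@class='Box-row']",
    "fields": [
        {"name": "repository", "path": ".//h2/a"},
        {"name": "description", "path": ".//p"},
    ],
}


def extract(markup: str, schema: dict) -> list[dict]:
    """Return one dict per element matched by base_path."""
    root = ET.fromstring(markup)
    rows = []
    for node in root.findall(schema["base_path"]):
        row = {}
        for field in schema["fields"]:
            # Each field path is evaluated relative to the current row.
            match = node.find(field["path"])
            if match is not None and match.text:
                row[field["name"]] = match.text.strip()
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(extract(SAMPLE, SCHEMA))
```

The point of the design is that the row selector and the per-field selectors live in data rather than in code, so supporting a different page would only mean changing the schema.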
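Unlike the first two approaches, structured output requires a custom schema that defines the data structure to parse. The console printed standard JSON that follows the schema we defined. The data is then put into an HTML template and sent by email every day. Here is how the email looks: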
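The "put the data into an HTML template and email it" step could look roughly like the sketch below, which uses only the standard library. The helper names (`render_rows`, `build_message`), the addresses, and the sample row are placeholders rather than anything from the original demo, and the message is only built here, not sent.

```python
from email.message import EmailMessage
from html import escape

# Field names match the schema used in github_trend_json().
COLUMNS = ["repository", "description", "lang", "stars", "today_star"]


def render_rows(repos: list[dict]) -> str:
    """Render the extracted rows as a simple HTML table."""
    header = "".join(f"<th>{escape(col)}</th>" for col in COLUMNS)
    body = []
    for repo in repos:
        # Missing fields (e.g. no description) become empty cells.
        cells = "".join(
            f"<td>{escape(str(repo.get(col, '')).strip())}</td>" for col in COLUMNS
        )
        body.append(f"<tr>{cells}</tr>")
    return f"<table><tr>{header}</tr>{''.join(body)}</table>"


def build_message(repos: list[dict], sender: str, recipient: str) -> EmailMessage:
    """Build (but do not send) an HTML email with a plain-text fallback."""
    msg = EmailMessage()
    msg["Subject"] = "GitHub trending"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("GitHub trending: open this message in an HTML-capable client.")
    msg.add_alternative(render_rows(repos), subtype="html")
    return msg


if __name__ == "__main__":
    sample = [{"repository": "https://github.com/owner/repo", "stars": "1,234"}]
    print(build_message(sample, "me@example.com", "me@example.com"))
```

Sending would typically go through `smtplib` (for example `SMTP_SSL(...)`, `login(...)`, then `send_message(msg)`) with your mail provider's host and credentials, and the daily run could be triggered by a scheduler such as cron.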
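Comparison with BeautifulSoup
I remember the first time I used soup: for someone who had only ever parsed XML with Java SAX, it felt incredibly convenient. Today I ran a quick test of crawl4ai. Compared with soup: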
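- crawl4ai makes data collection and analysis more convenient
- soup has to be paired with requests to fetch the page; BeautifulSoup itself only handles HTML parsing
- HTML parsing is similar in both, since each relies on CSS selectors, but crawl4ai's schema definition makes parsing more convenient
- For data parsing, besides Markdown and simplified HTML, crawl4ai can also extract structured data through its OpenAI integration (not tried yet)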
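For comparison, here is a rough sketch of the same extraction done the two-step way described above: `requests` fetches the page and BeautifulSoup parses it. It reuses two of the CSS selectors from the schema earlier in this post, which may stop matching if GitHub changes its markup. The function names are illustrative, and both `requests` and `beautifulsoup4` are assumed to be installed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_trending_html(url: str = "https://github.com/trending") -> str:
    """Step 1: download the raw page with requests."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_trending(html: str) -> list[dict]:
    """Step 2: parse the HTML with BeautifulSoup and CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")
    repos = []
    for row in soup.select(".Box-row"):
        link = row.select_one(".lh-condensed a[href]")
        desc = row.select_one("p")
        repos.append(
            {
                # Collapse whitespace so "owner /\n repo" becomes "owner/repo".
                "repository": "".join(link.get_text().split()) if link else "",
                "description": desc.get_text(strip=True) if desc else "",
            }
        )
    return repos


if __name__ == "__main__":
    print(parse_trending(fetch_trending_html()))
```

Here the row loop and the handling of missing elements are written by hand, whereas in the crawl4ai version that logic is driven by the schema.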