6. A Little Progress Every Day: Python Web Scraping with the urllib Library

news2024/10/6 8:40:30

This article is unfinished; to be continued.

urllib is an HTTP request library built into Python. In Python 2.x, sending requests was split across two libraries, urllib and urllib2. In Python 3.x the two were merged into a single urllib package.

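The urllib library consists of four modules:

request module: opens URLs and reads their content.
error module: contains the exceptions raised by urllib.request.
parse module: parses URLs.
robotparser module: parses robots.txt files.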
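
As a rough illustration of how the modules divide the work, the sketch below touches parse, robotparser, and error without making any network request. The search URL, the example.com addresses, and the robots.txt rules are made-up inputs chosen only for the demonstration.

```python
import urllib.error
import urllib.parse
import urllib.robotparser

# urllib.parse: split a URL into its components.
parts = urllib.parse.urlparse("http://www.baidu.com/s?wd=python")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# urllib.robotparser: evaluate robots.txt rules. The rules below are a
# made-up example passed in directly, so no network access is needed.
rules = urllib.robotparser.RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])
blocked = rules.can_fetch("*", "http://example.com/private/data")
allowed = rules.can_fetch("*", "http://example.com/index.html")
print(blocked, allowed)

# urllib.error: HTTPError is a subclass of URLError, so catching URLError
# also covers HTTP status errors raised by urllib.request.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
```

The request module is the one that actually goes over the network, and it is covered in the next section.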

1. Sending requests

A simple example that requests the Baidu home page:

import urllib.request

resp = urllib.request.urlopen("http://www.baidu.com")
print(resp)
print(resp.read())

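Running the code prints the response object and then the raw bytes of the page body:

[Image: screenshot of the output]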

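The urlopen() function provided by the urllib.request module sends an HTTP request. As the output above shows, urlopen() returns an HTTPResponse object, and calling its read() method returns the content of the requested page. read() returns a bytes object, which is hard to read when printed, so call decode('utf-8') to decode it into a str. You can use the dir() function to list all of an object's methods and attributes. The revised code below prints several commonly used attributes and methods of HTTPResponse: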

import urllib.request

resp = urllib.request.urlopen("http://www.baidu.com")
print("resp.geturl：", resp.geturl())
print("resp.msg：", resp.msg)
print("resp.status：", resp.status)
print("resp.version：", resp.version)
print("resp.reason：", resp.reason)
print("resp.debuglevel：", resp.debuglevel)
print("resp.getheaders：", resp.getheaders()[0:2])
print(resp.read().decode('utf-8'))

The output shows each of these values, followed by the decoded HTML of the page:

[Image: screenshot of the output]
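
The text mentions dir() and the bytes-to-str decoding step, so here is a minimal offline sketch of both. It inspects the http.client.HTTPResponse class rather than a live response, and the bytes value is a stand-in built by hand instead of data returned by read(); both choices are assumptions made so the snippet runs without a network connection.

```python
import http.client

# List the public methods and attributes defined on the HTTPResponse class.
# Instance attributes such as status and reason are set when a response is
# created, so call dir() on a live response object to see those as well.
public_names = [name for name in dir(http.client.HTTPResponse)
                if not name.startswith("_")]
print(public_names)

# read() returns bytes; decode() turns them into a str.
raw = "百度一下".encode("utf-8")  # stand-in for the bytes returned by read()
text = raw.decode("utf-8")
print(type(raw).__name__, type(text).__name__, text)
```

If a page is not encoded as UTF-8, decode('utf-8') raises UnicodeDecodeError, so the charset should match what the server actually sends.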

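Also note that Chinese characters in a URL do not conform to the URL standard and must be percent-encoded first. The example below shows the encode and decode calls on a plain ASCII URL: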

import urllib.parse

# quote() and unquote() are documented in urllib.parse
urllib.parse.quote('http://www.baidu.com')
# Encoded: http%3A//www.baidu.com
urllib.parse.unquote('http%3A//www.baidu.com')
# Decoded: http://www.baidu.com
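
To show the Chinese-character case itself, here is a small sketch. The search keyword is a hypothetical value, and the "/s" path with a "wd" query parameter is an assumption about Baidu's search URL used only for illustration; no request is sent.

```python
import urllib.parse

keyword = "爬虫"  # hypothetical search term containing Chinese characters

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character.
encoded = urllib.parse.quote(keyword)
decoded = urllib.parse.unquote(encoded)
print(encoded)
print(decoded)

# urlencode() builds a whole query string and quotes the values for you.
query = urllib.parse.urlencode({"wd": keyword})
url = "http://www.baidu.com/s?" + query
print(url)
```

Encoding only the parts that need it (a path segment or a query value) is usually safer than passing the entire URL through quote(), which also escapes the colon after the scheme.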

2. Fetching binary files

Write the bytes returned by read() directly to a file opened in binary mode:

import urllib.request

pic_url = "https://www.baidu.com/img/bd_logo1.png"
pic_resp = urllib.request.urlopen(pic_url)
pic = pic_resp.read()
with open("bd_logo.png", "wb") as f:
    f.write(pic)
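
For larger files it may be preferable not to hold the whole body in memory. The sketch below is one possible variant, not the approach used above: the download() helper, its parameters, and the 10-second timeout are all hypothetical names and values introduced for illustration. It copies the response to disk in chunks and reports failures that surface as URLError.

```python
import shutil
import urllib.error
import urllib.request


def download(url, dest_path, timeout=10):
    """Save the body of url to dest_path; return True on success."""
    try:
        # The response is a file-like object, so it can be copied to disk
        # in chunks instead of being read into memory all at once.
        with urllib.request.urlopen(url, timeout=timeout) as resp, \
                open(dest_path, "wb") as f:
            shutil.copyfileobj(resp, f)
        return True
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError, so it is caught here too.
        print("download failed:", exc)
        return False
```

A call such as download(pic_url, "bd_logo.png") would replace the read() and write() steps in the listing above. Some failures, such as a timeout while reading the body, can raise other exception types and are not handled by this sketch.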