This article provides sample code for using boto3 under Python 3 to access S3 object storage and list the storage information of millions of file objects.
1. Test environment
The operating system and Python version are as follows:
[root@localhost boto3]# cat /etc/os-release
NAME="openEuler"
VERSION="22.03 LTS"
ID="openEuler"
VERSION_ID="22.03"
PRETTY_NAME="openEuler 22.03 LTS"
ANSI_COLOR="0;31"
[root@localhost boto3]# python3 --version
Python 3.9.9
2. Preparing the runtime environment
1) Required base .whl packages
[root@localhost boto3]# ll packages/
total 13M
-rw-r-----. 1 AAAA AAAA 137K Aug 2 16:06 boto3-1.34.105-py3-none-any.whl
-rw-r-----. 1 AAAA AAAA 12M Aug 2 16:06 botocore-1.34.105-py3-none-any.whl
-rw-r-----. 1 AAAA AAAA 20K Aug 2 16:06 jmespath-1.0.1-py3-none-any.whl
-rw-r-----. 1 AAAA AAAA 225K Aug 2 16:06 python_dateutil-2.9.0.post0-py2.py3-none-any.whl
-rw-r-----. 1 AAAA AAAA 81K Aug 2 16:06 s3transfer-0.10.1-py3-none-any.whl
-rw-r-----. 1 AAAA AAAA 11K Aug 2 16:06 six-1.16.0-py2.py3-none-any.whl
-rw-r-----. 1 AAAA AAAA 141K Aug 2 16:06 urllib3-1.26.18-py2.py3-none-any.whl
2) Create the virtual environment and install the .whl packages
[root@localhost boto3]# python3.9 -m venv myenv
[root@localhost boto3]# source myenv/bin/activate
(myenv) [root@localhost boto3]# ll
total 8.0K
drwxr-x---. 5 root root 74 Aug 2 16:09 myenv
drwxr-x---. 2 AAAA AAAA 4.0K Aug 2 16:06 packages
(myenv) [root@localhost boto3]# cd packages/
(myenv) [root@localhost packages]# pip3 install *
Processing ./boto3-1.34.105-py3-none-any.whl
Processing ./botocore-1.34.105-py3-none-any.whl
Processing ./jmespath-1.0.1-py3-none-any.whl
Processing ./python_dateutil-2.9.0.post0-py2.py3-none-any.whl
Processing ./s3transfer-0.10.1-py3-none-any.whl
Processing ./six-1.16.0-py2.py3-none-any.whl
Processing ./urllib3-1.26.18-py2.py3-none-any.whl
Installing collected packages: six, urllib3, python-dateutil, jmespath, botocore, s3transfer, boto3
Successfully installed boto3-1.34.105 botocore-1.34.105 jmespath-1.0.1 python-dateutil-2.9.0.post0 s3transfer-0.10.1 six-1.16.0 urllib3-1.26.18
(myenv) [root@localhost packages]#
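To check that the offline installation succeeded, the packages can be imported and their versions printed (a quick sanity check added here, not part of the original transcript):
# should both report 1.34.105, matching the wheels installed above
import boto3
import botocore
print(boto3.__version__, botocore.__version__)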
3. Test code
1) Write the connection file
The connection parameters could also be written directly into the code files, but I prefer to keep configuration separate from code, so the settings are stored in their own file:
(myenv) [root@localhost boto3]# cat apiconf.py
setting = {"endpoint_url":"http://192.168.188.13:8080","access_key":"48ES5QR8J70IB3KC93F4","secret_key":"TkhooozMPDd26XP3SbPJSfgcViB0ArShU4sBd33H"}
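If hard-coding credentials in a file is not desirable, the same dictionary can be built from environment variables instead; a minimal sketch (the variable names S3_ENDPOINT_URL, S3_ACCESS_KEY and S3_SECRET_KEY are my own choice, not a standard):
# alternative apiconf.py: read the connection settings from environment variables
import os
setting = {
    "endpoint_url": os.environ["S3_ENDPOINT_URL"],
    "access_key": os.environ["S3_ACCESS_KEY"],
    "secret_key": os.environ["S3_SECRET_KEY"],
}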
2) List all available buckets
Write the code:
(myenv) [root@localhost boto3]# cat bucketList.py
#!/usr/bin/python
#coding=utf-8
#for python 3.*.*
##__author__='daigjianbing'
import boto3
import apiconf
# read the endpoint, access key and secret key from the config file
endpoint_url = apiconf.setting["endpoint_url"]
access_key = apiconf.setting["access_key"]
secret_key = apiconf.setting["secret_key"]
# create the S3 client with the endpoint and credentials
s3 = boto3.client('s3',
                  endpoint_url=endpoint_url,
                  aws_access_key_id=access_key,
                  aws_secret_access_key=secret_key,
                  verify=False)  # verify=False skips SSL certificate verification (works for http as well as https endpoints)
# list all buckets
responses = s3.list_buckets()
buckets = [bucket['Name'] for bucket in responses['Buckets']]
#print('All of Buckets:', buckets)
for bucket in buckets:
    print(bucket)
Run the test:
(myenv) [root@localhost boto3]# python bucketList.py
1541412547839
1541753844586
1543198691659
1543198885291
1543198895193
1543307983636
....
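The list_buckets response also carries each bucket's creation time in its CreationDate field, so the loop above can be extended to print it as well; a small sketch reusing the s3 client defined in bucketList.py:
# print each bucket name together with its creation time
responses = s3.list_buckets()
for bucket in responses['Buckets']:
    print(bucket['Name'], bucket['CreationDate'].strftime('%Y-%m-%d_%H:%M:%S'))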
3) List the objects in a bucket
Write the code (using a bucket containing just over 40,000 objects as the test case):
(myenv) [root@localhost boto3]# cat filesListInBucket.py
#!/usr/bin/python
#coding=utf-8
#for python 3.*.*
##__author__='daigjianbing'
import boto3
import apiconf
import time
# read the endpoint, access key and secret key from the config file
endpoint_url = apiconf.setting["endpoint_url"]
access_key = apiconf.setting["access_key"]
secret_key = apiconf.setting["secret_key"]
# create the S3 resource with the endpoint and credentials
s3 = boto3.resource('s3',
                    endpoint_url=endpoint_url,
                    aws_access_key_id=access_key,
                    aws_secret_access_key=secret_key,
                    verify=False)  # SSL verification not needed, so verify=False
bucket_name = 'myfile'
# connect to the bucket
bucket = s3.Bucket(bucket_name)
objects = []
n = 0
# iterate over every object in the bucket
for obj in bucket.objects.all():
    key = obj.key
    timestep = obj.last_modified
    n = n + 1
    #print(key, timestep)
    print("objectNUM:", n)  # print the running object count
    objects.append((key, timestep))
# sort by the LastModified timestamp, newest first
objects.sort(key=lambda x: x[1], reverse=True)
outtxt = ""
# print the sorted result
for key, timestamp in objects:
    timestamptxt = timestamp.strftime('%Y-%m-%d_%H:%M:%S')  # output format for the object timestamp
    print(key, timestamptxt)
    outtxt = outtxt + str(key) + " " + timestamptxt + "\n"
todaystr = time.strftime('%Y%m%d', time.localtime(time.time()))  # today's date, e.g. 20240805
outfilename = bucket_name + "-" + todaystr + '.log'  # name of the output log file
with open(outfilename, 'w') as f:
    f.write(outtxt)
The code above walks through every object in the bucket, sorts them in descending order of their last-modified time, prints each object's key and timestamp to the screen, and also saves the result to a log file named "bucketname-YYYYMMDD.log".
Actual run:
(myenv) [root@localhost boto3]# python filesListInBucket.py
objectNUM: 1
...
objectNUM: 41741
47c426e6adbb4570937b676281a273e9.mp3 2024-08-05_10:12:08
...
00605045-645f-4572-bf61-87a6227a97da.zip 2019-06-13_09:04:29
(myenv) [root@localhost boto3]# cat myfile-20240805.log
47c426e6adbb4570937b676281a273e9.mp3 2024-08-05_10:12:08
...
00605045-645f-4572-bf61-87a6227a97da.zip 2019-06-13_09:04:29
(myenv) [root@localhost boto3]# cat myfile-20240805.log |wc -l
41741
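One side note before the remarks below: for very large buckets it may be friendlier to memory to write the sorted result straight to the log file instead of first building the whole outtxt string; a variant sketch for the tail of filesListInBucket.py (it reuses the objects list and bucket_name defined earlier in that script):
# variant: stream each line to the log file instead of accumulating one big string
todaystr = time.strftime('%Y%m%d', time.localtime(time.time()))
outfilename = bucket_name + "-" + todaystr + '.log'
with open(outfilename, 'w') as f:
    for key, timestamp in objects:
        f.write(str(key) + " " + timestamp.strftime('%Y-%m-%d_%H:%M:%S') + "\n")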
4. Additional notes
s3.list_objects_v2(Bucket='****') can also list the objects in a bucket, but its output is paginated: each page holds at most 1000 objects, so by default only the first 1000 entries are returned (see the paginator sketch below).
Using objects.all() on s3.Bucket(bucket_name), on the other hand, returns every object in the bucket; in my tests it listed buckets with over a million objects without any problem, it just takes longer to finish.
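If the client interface is preferred anyway, list_objects_v2 can still cover the whole bucket by requesting the pages one after another through a paginator; a minimal sketch using the same apiconf settings:
# walk all objects with list_objects_v2 through a paginator (up to 1000 keys per page)
import boto3
import apiconf
s3 = boto3.client('s3',
                  endpoint_url=apiconf.setting["endpoint_url"],
                  aws_access_key_id=apiconf.setting["access_key"],
                  aws_secret_access_key=apiconf.setting["secret_key"],
                  verify=False)
paginator = s3.get_paginator('list_objects_v2')
n = 0
for page in paginator.paginate(Bucket='myfile'):
    for obj in page.get('Contents', []):
        n = n + 1
        print(obj['Key'], obj['LastModified'])
print("total objects:", n)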
Other file operations can be written by following the same pattern, as the sketch below illustrates.
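For instance, downloading one object and uploading a local file with the same resource-style connection (this reuses the s3 resource and bucket from filesListInBucket.py; the local path and the new object key are placeholders of my own, not from the original test):
# download one existing object to a local file (key taken from the listing above)
bucket.download_file('47c426e6adbb4570937b676281a273e9.mp3', '/tmp/sample.mp3')
# upload a local file as a new object under a different key
bucket.upload_file('/tmp/sample.mp3', 'uploaded/sample.mp3')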