Fast PDB dataset downloads with Python multithreading

news2025/7/12 5:18:43

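Downloading data with multiple threads

I recently wrote a shell script to download a dataset from a website. The dataset was not large, but the download was very slow, so I tried multithreaded downloading and found it much faster. This post explains how several Python modules differ, covers some ways to speed up a program, and compares multithreaded with single-threaded downloading.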

Ways to speed up a program


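Single-threaded, serial execution

An unmodified program runs in single-threaded mode: I/O and CPU work execute one after the other, so the CPU sits idle while I/O is pending and the two never overlap. Python's threading module adds multithreading support.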

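The threading module

In standard CPython, threading has one limitation: it can overlap CPU work with I/O, but it cannot run CPU work on several cores at once. The cause is the global interpreter lock (GIL), which lets only one thread execute Python bytecode at any given moment, no matter how many threads the program starts. A thread releases the GIL while it blocks on I/O, so other threads can run in the meantime. That is why multithreading pays off mainly for I/O-bound tasks.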
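To make the I/O-bound case concrete, here is a minimal sketch. It assumes that time.sleep is a fair stand-in for a blocking network call (it releases the GIL while waiting); the helper names fake_io, run_serial and run_threaded are made up for this illustration.

```python
import threading
import time


def fake_io(delay):
    # Stand-in for a blocking network call (an assumption of this sketch).
    # time.sleep releases the GIL, so other threads can run while we wait.
    time.sleep(delay)


def run_serial(n, delay):
    # Run n simulated I/O calls one after another and return the elapsed time.
    start = time.perf_counter()
    for _ in range(n):
        fake_io(delay)
    return time.perf_counter() - start


def run_threaded(n, delay):
    # Run n simulated I/O calls in n threads and return the elapsed time.
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("serial:  ", run_serial(5, 0.2), "seconds")
    print("threaded:", run_threaded(5, 0.2), "seconds")
```

The serial run has to wait for each call in turn, while the threaded run waits for all of them at once, so its elapsed time should be close to a single delay.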

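The multiprocessing module

The multiprocessing module offers a simple way to use multiple CPU cores by running several processes in parallel. Each process has its own interpreter and its own GIL, so, unlike threading, it can deliver real parallelism for CPU-bound tasks.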
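As a rough sketch of the CPU-bound case, the example below spreads a deliberately slow prime count over a process pool. The function count_primes and the chosen limits are made up for this illustration; only multiprocessing.Pool and its map method come from the standard library.

```python
import multiprocessing


def count_primes(limit):
    # Count the primes below `limit` by trial division (CPU-heavy on purpose).
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical workloads; each one is handed to a separate worker process.
    limits = [20_000, 30_000, 40_000, 50_000]
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(count_primes, limits)
    for limit, result in zip(limits, results):
        print(limit, result)
```

The `if __name__ == "__main__"` guard matters here: on platforms that start workers by re-importing the main module, unguarded pool creation would be triggered again in every child process.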

Python's support for concurrent programming


Now let's see how much the threading module speeds things up by writing a simple crawler.
First, create a file named blog_spide.py with the following content:

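import requests

# "#p{page}" is a URL fragment. Fragments are not sent to the server, so every
# request here fetches the same page, which is still fine for a timing test.
urls = [
    f"https://www.cnblogs.com/#p{page}"
    for page in range(1, 50 + 1)
]

def craw(url):
    # Fetch one page and print its URL and the length of the response body.
    r = requests.get(url)
    print(url, len(r.text))

if __name__ == "__main__":
    # Quick self-test; guarded so it does not run when this module is imported.
    craw(urls[0])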

Next, write the main script:

import threading
import time
from blog_spide import urls, craw

def single_thread():
    # Fetch every URL one after another in the main thread.
    print("single_thread begin")
    for url in urls:
        craw(url)
    print("single_thread end")

def multi_thread():
    # Start one thread per URL, then wait for all of them to finish.
    print("multi_thread begin")
    threads = []
    for url in urls:
        threads.append(threading.Thread(target=craw, args=(url,)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print("multi_thread end")

if __name__ == '__main__':
    start = time.time()
    single_thread()
    end = time.time()
    print("single_thread cost:", end - start, "seconds")

    start = time.time()
    multi_thread()
    end = time.time()
    print("multi_thread cost:", end - start, "seconds")
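The script above starts one thread per URL, which works for 50 URLs but does not scale to a large file list. A common alternative, not used in the original script, is a bounded pool from concurrent.futures. The sketch below is only an illustration: run_in_pool and fake_download are hypothetical names, and the worker is passed in so the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_pool(worker, items, max_workers=8):
    # Apply `worker` to each item with at most `max_workers` threads.
    # Returns {item: result}; a failed item maps to the exception it raised.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                # Keep going so one bad item does not abort the whole batch.
                results[item] = exc
    return results


if __name__ == "__main__":
    def fake_download(name):
        # Hypothetical stand-in for a real download function.
        time.sleep(0.1)
        return len(name)

    print(run_in_pool(fake_download, ["a", "bb", "ccc"], max_workers=3))
```

With the crawler from this post, the call would look like run_in_pool(craw, urls, max_workers=10); the pool size is a guess and should be tuned to what the server tolerates.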