A simple thread pool: from single thread to multithreading, from zero basics to zero basics (站长素材)

2025/4/6 13:39:56

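Multiprocessing (Process): once the data has been read, the CPU has to spend a lot of cycles and time working on it (CPU-bound). Module: multiprocessing

Multithreading (Thread): lots of IO, with a modest number of tasks running at the same time (IO-bound). Module: threading

Coroutines (Coroutine): requests cannot be used here because it is blocking, so use aiohttp instead (recommended). A single thread can run N coroutines (single-threaded, IO-bound). Module: asyncio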
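To make the coroutine case concrete, here is a minimal sketch of one thread running several coroutines. It is an illustration, not a scraper: `fake_fetch` is a hypothetical stand-in that uses `asyncio.sleep` to simulate the network wait an aiohttp request would have, so it runs without any network access.

```python
import asyncio
import threading
import time


async def fake_fetch(n):
    # Hypothetical stand-in for an aiohttp request: sleep instead of real IO.
    await asyncio.sleep(0.2)
    # Record which thread ran this coroutine.
    return n, threading.current_thread().name


async def main():
    # Schedule five coroutines at once; they all wait concurrently.
    return await asyncio.gather(*(fake_fetch(n) for n in range(5)))


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

thread_names = {name for _, name in results}
print(results)
print(f"{len(thread_names)} thread, {elapsed:.2f}s")
```

All five waits overlap inside a single thread, so the total time should be close to one wait rather than five.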

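Single-threaded means there is only the main thread: one line followed straight to the end (slow). Fetching (CPU) and saving (IO) both happen in that one thread, so each step waits for the previous one.

Multithreading means many threads working at the same time (faster).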
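The difference can be measured with a small sketch. Nothing is downloaded here: `fake_download` is a hypothetical stand-in that sleeps for 0.2 seconds to imitate waiting on the network, and the function names are made up for this example.

```python
import threading
import time


def fake_download(n, results):
    # Hypothetical stand-in for one download: sleep instead of real IO.
    time.sleep(0.2)
    results[n] = f"image {n}"


def run_sequential(count):
    # One thread: each task starts only after the previous one ends.
    results = {}
    start = time.perf_counter()
    for n in range(count):
        fake_download(n, results)
    return results, time.perf_counter() - start


def run_threaded(count):
    # One thread per task: all the waits overlap.
    results = {}
    threads = [threading.Thread(target=fake_download, args=(n, results))
               for n in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every thread has finished
    return results, time.perf_counter() - start


seq_results, seq_time = run_sequential(5)
thr_results, thr_time = run_threaded(5)
print(f"single thread: {seq_time:.2f}s, five threads: {thr_time:.2f}s")
```

Both runs produce the same results; only the waiting time changes, which is why threads pay off for IO-heavy work.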

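A lot of what I found online was really hard and not suited to beginners. In the end I found a simple thread pool (Pool) approach on Bilibili. (Super easy to implement.)

This first version does not use multithreading yet.

First, single-threaded scraping (which is already hard enough).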

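The page contains links that throw you off, and you will find you cannot scrape the images through them. What you get back is either empty, or this shows up:

[screenshot]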
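One likely explanation, and this is an assumption inferred from the regex used in the code rather than something checked against the site, is that the page lazy-loads its images: `src` holds a placeholder and the real address sits in `data-original`. The HTML in this sketch is made up to show that shape.

```python
import re

# Made-up HTML in the shape the scraper's regex expects; not copied from the site.
sample_html = '''
<img src="/static/placeholder.gif" data-original="//example.com/img/a_s.jpg" alt="a">
<img src="/static/placeholder.gif" data-original="//example.com/img/b_s.jpg" alt="b">
'''

# Reading src only yields the placeholder.
src_values = re.findall(r'<img src="(.*?)"', sample_html)

# Reading data-original yields protocol-relative addresses (they start with //),
# so a scheme has to be added in front before requesting them.
img_urls = ['https:' + u
            for u in re.findall(r'data-original="(.*?)"', sample_html)]

print(src_values)
print(img_urls)
```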

import os
import re

import requests

count = 0
wenjian = input("Folder to save your photos in: ")
img_path = f"./{wenjian}/"  # where the images are saved
if not os.path.exists(img_path):
    print("That folder does not exist, creating it for you.")
    os.mkdir(img_path)

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}

# Scrape pages 1 to 4; the first page has no numeric suffix in its URL.
for i in range(1, 5):
    if i == 1:
        url = "https://sc.chinaz.com/tupian/nvshengtupian.html"
    else:
        url = f"https://sc.chinaz.com/tupian/nvshengtupian_{i}.html"
    response = requests.get(url, headers=headers)
    response.encoding = "utf-8"
    page_text = response.text
    # The image addresses are taken from the data-original attribute.
    img_urls = re.findall('data-original="(.*?)"', page_text)
    for img in img_urls:
        img = 'https:' + img
        count += 1
        myimg = requests.get(img)
        file_name = f'{img_path}image{count}.jpg'
        # Images and audio are binary, so write them in "wb" mode.
        with open(file_name, "wb") as f:
            f.write(myimg.content)
        print(f"Saving image {count}")

1. Setting utf-8 explicitly is worth doing. It guards against garbled text when the encoding is guessed wrong.

response.encoding= "utf-8"
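As a small illustration of what that line protects against, the sketch below decodes the same bytes with two codecs. It does not call requests at all; the sample text and variable names are just for this example.

```python
# UTF-8 bytes as a server might send them for a page containing Chinese text.
raw = "女生图片".encode("utf-8")

# Decoding with the wrong codec does not raise; it silently produces garbage.
wrong = raw.decode("iso-8859-1")

# Decoding with the right codec restores the text.
right = raw.decode("utf-8")

print(ascii(wrong))          # escaped view of the garbled characters
print(right == "女生图片")
```

Because the wrong decode succeeds without an error, the problem only shows up later as unreadable text, which is a good reason to set the encoding up front.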

This version uses a thread pool (Pool).

import os
import re
import time

import requests
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}


def xieru(task):
    # Download one image and write it to disk; runs in a worker thread.
    index, img_url = task
    print("start")
    myimg = requests.get(img_url)
    file_name = f'{img_path}image{index}.jpg'
    # Images and audio are binary, so write them in "wb" mode.
    with open(file_name, "wb") as f:
        f.write(myimg.content)
    print('end')


start = time.time()
wenjian = input("Folder to save your photos in: ")
img_path = f"./{wenjian}/"  # where the images are saved
if not os.path.exists(img_path):
    print("That folder does not exist, creating it for you.")
    os.mkdir(img_path)

page = input("How many pages do you want to scrape: ")
urls = []
# + 1 so that entering N really scrapes N pages.
for i in range(1, int(page) + 1):
    if i == 1:
        url = "https://sc.chinaz.com/tupian/nvshengtupian.html"
    else:
        url = f"https://sc.chinaz.com/tupian/nvshengtupian_{i}.html"
    response = requests.get(url, headers=headers)
    response.encoding = "utf-8"
    page_text = response.text
    img_urls = re.findall('data-original="(.*?)"', page_text)
    for img in img_urls:
        # Only collect the address here; the download happens in the pool.
        urls.append('https:' + img)

pool = Pool(4)
# Number the tasks so every image gets its own file name.
pool.map(xieru, enumerate(urls, start=1))
pool.close()
pool.join()  # block the main thread until every worker thread has finished
end = time.time()
print(end - start)
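What the pool itself does can be seen without touching the network. In this sketch `fake_save` is a hypothetical stand-in for downloading and saving one image; it only sleeps briefly and reports which worker thread ran it.

```python
import threading
import time
from multiprocessing.dummy import Pool  # thread-based Pool


def fake_save(n):
    # Hypothetical stand-in for downloading and saving one image.
    time.sleep(0.05)
    return n, threading.current_thread().name


pool = Pool(4)                            # four worker threads, reused for every task
results = pool.map(fake_save, range(12))  # blocks until all twelve tasks are done
pool.close()                              # no more tasks will be submitted
pool.join()                               # wait for the workers to exit

order = [n for n, _ in results]
workers = {name for _, name in results}
print(order)
print(f"{len(workers)} worker threads handled {len(results)} tasks")
```

Two things are worth noticing: `map` returns results in the order of the input even though the tasks finish in whatever order the threads get to them, and the number of distinct worker threads never exceeds the pool size.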

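That's it for today. I am currently following the Bilibili uploader 蚂蚁学Python to learn multithreading, multiprocessing, and coroutines.

I tried XPath, but in the end I still find re easier to use. In Selenium, CSS selectors are the easier choice.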
