Celery in a Python project exhausting the MySQL connection limit [misuse of a MySQL connection pool]
Conclusion: the project uses Celery for multi-task management, and the worker task code opens MySQL connections through the PooledDB connection pool from DBUtils. Because every worker process builds its own pool, concurrently executing worker tasks spawns a huge number of database connections and eventually exhausts the MySQL server's connection limit.
In this Celery project I ran 20 worker nodes, each starting 32 prefork child processes. Each child process creates its own connection pool, and PooledDB's maximum was set to 200 connections, so in the worst case the project could open 20 × 32 × 200 = 128,000 database connections.
The Alibaba Cloud RDS instance I used (8 cores, 16 GB) allows 1,600 connections at that spec. So whenever a large batch of database-touching tasks ran concurrently, the connections were exhausted and every service depending on that database went down.
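The worst case can be spelled out numerically (a quick sanity check using the node, process, and pool figures from above):

```python
# Worst-case connection math for the setup described above.
workers = 20            # Celery worker nodes
prefork_children = 32   # prefork child processes per node
pool_max = 200          # PooledDB maxconnections per child process

worst_case = workers * prefork_children * pool_max
rds_limit = 1600        # max_connections on the 8-core / 16 GB RDS instance

print(worst_case)              # 128000
print(worst_case > rds_limit)  # True: two orders of magnitude over the limit
```

Even if each pool only ever opened its `maxcached=10` idle connections, 20 × 32 × 10 = 6,400 still exceeds the 1,600-connection limit by 4×.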
Below is the relevant part of the database-connection code used in the project:
```python
import pymysql
from dbutils.pooled_db import PooledDB
from pymysql.cursors import SSDictCursor

# conf is the project's database settings dict, defined elsewhere
_pool: PooledDB = PooledDB(
    creator=pymysql,
    mincached=0,          # idle connections opened at startup
    maxcached=10,         # max idle connections kept in the pool
    maxshared=100,        # max shared connections (0 or None means all connections are dedicated)
    maxconnections=200,   # pool-wide cap; 0 or None means unlimited
    maxusage=10,          # recycle a connection after 10 uses
    blocking=True,        # block instead of raising when the pool is exhausted
    host=conf["host"], port=conf["port"], user=conf["user"], passwd=conf["pwd"],
    db=conf["dbname"], use_unicode=True, charset='utf8mb4',
    cursorclass=SSDictCursor,
    setsession=['SET AUTOCOMMIT = 1'])
```
The project's Celery runs in the default prefork multi-process mode, but PooledDB is ineffective under prefork-style multiprocessing: each child process ends up with its own private pool, so nothing is actually shared across processes.
Excerpt from the Celery concurrency documentation:
Overview of Concurrency Options
- prefork: The default option, ideal for CPU-bound tasks and most use cases. It is robust and recommended unless there’s a specific need for another model.
- eventlet and gevent: Designed for IO-bound tasks, these models use greenlets for high concurrency. Note that certain features, like soft_timeout, are not available in these modes. These have detailed documentation pages linked below.
- solo: Executes tasks sequentially in the main thread.
- threads: Utilizes threading for concurrency, available if the concurrent.futures module is present.
- custom: Enables specifying a custom worker pool implementation through environment variables.
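Since these database tasks are IO-bound rather than CPU-bound, switching the worker pool away from prefork is one option. A sketch, assuming a Celery app named `app` with a placeholder broker URL (`worker_pool` and `worker_concurrency` are setting names from Celery's configuration reference):

```python
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")  # placeholder broker URL

# Use a green-thread pool for IO-bound tasks. With gevent, many greenlets
# share ONE process, so a single per-process connection pool is actually
# shared by all concurrent tasks -- the pool becomes meaningful again.
app.conf.worker_pool = "gevent"
app.conf.worker_concurrency = 100
```

The same can be done on the command line with `--pool=gevent --concurrency=100`. Note the documentation's caveat above that some features such as `soft_timeout` are unavailable in these modes.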
Excerpt from the DBUtils PooledDB documentation:
Notes
If you are using one of the popular object-relational mappers SQLObject or SQLAlchemy, you won’t need DBUtils, since they come with their own connection pools. SQLObject 2 (SQL-API) is actually borrowing some code from DBUtils to split the pooling out into a separate layer.
Also note that when you are using a solution like the Apache webserver with mod_python or mod_wsgi, then your Python code will be usually run in the context of the webserver’s child processes. So if you are using the pooled_db module, and several of these child processes are running, you will have as much database connection pools. If these processes are running many threads, this may still be a reasonable approach, but if these processes don’t spawn more than one worker thread, as in the case of Apache’s “prefork” multi-processing module, this approach does not make sense. If you’re running such a configuration, you should resort to a middleware for connection pooling that supports multi-processing, such as pgpool or pgbouncer for the PostgreSQL database.
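Following that note: since each prefork child runs a single worker thread, a whole pool per process buys nothing. DBUtils' `PersistentDB`, which keeps one dedicated connection per thread, fits this topology better. A sketch with placeholder credentials (in the real project these would come from the `conf` dict):

```python
import pymysql
from dbutils.persistent_db import PersistentDB

# One persistent connection per thread -- in a single-threaded prefork
# child, that means exactly one MySQL connection per worker process
# (20 nodes x 32 children = 640 connections, well under the 1,600 limit).
persist = PersistentDB(
    creator=pymysql,
    maxusage=100,                        # recycle a connection after 100 uses
    setsession=['SET AUTOCOMMIT = 1'],
    host="127.0.0.1", port=3306,         # placeholders for conf["host"] / conf["port"]
    user="app", passwd="***", db="mydb", # placeholders for the project's credentials
    charset='utf8mb4')

conn = persist.connection()  # returns this thread's dedicated connection
```

For a true cross-process pool, the documentation above suggests pooling middleware instead; for MySQL, a proxy such as ProxySQL plays the role that pgpool/pgbouncer play for PostgreSQL.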
References:
- DBUtils official documentation
- Celery official documentation