Installing and Using MADlib, the Machine Learning Extension for PostgreSQL


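MADlib is an open-source machine learning library that runs inside the database. It supports PostgreSQL and Greenplum, among others, and provides a broad set of analytical methods, including regression, decision trees, random forests, naive Bayes classification, support vector machines, risk models, k-means clustering, text mining, and data validation.

This article shows how to build and install the MADlib extension from source for a PostgreSQL database on Linux, then walks through a basic example of using MADlib from SQL for data analysis.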

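01 Build and Installation

1.1 Installing Dependencies

1. cmake 3.5.2 or later

MADlib is built with cmake, which generates the Makefiles automatically and gives better portability across systems. The cmake version must be at least 3.5.2; older versions may hit a segmentation fault during the build.

Run cmake --version to check the installed version. If it is too old, upgrade it manually. cmake source packages can be downloaded from https://cmake.org/files/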

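# Remove the old cmake
sudo yum remove -y cmake
# Download the source package into any directory and unpack it
wget https://cmake.org/files/v3.27/cmake-3.27.4.tar.gz
tar -zxvf cmake-3.27.4.tar.gz
# Build and install (only the install step needs root)
cd cmake-3.27.4
./configure
make && sudo make install
# Create a symlink
sudo ln -s /usr/local/bin/cmake /usr/bin/cmake
# Or add the cmake path to the environment (CMAKE_HOME is the install prefix, /usr/local by default)
export CMAKE_HOME=/usr/local
export PATH=$PATH:$CMAKE_HOME/bin
# Verify the installation
cmake --version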
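
Before removing the system cmake, the version check described above can be scripted. The following is a minimal sketch, not part of the original procedure: version_ge and cmake_version_from_output are hypothetical helper names, and the comparison assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
# Succeed when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Extract the version number from `cmake --version` style output
# ("cmake version 3.27.4"); prints nothing if the first line does not match.
cmake_version_from_output() {
    printf '%s\n' "$1" | sed -n '1s/^cmake version \([0-9][0-9.]*\).*$/\1/p'
}

required="3.5.2"   # minimum version stated in this article
if command -v cmake >/dev/null 2>&1; then
    current="$(cmake_version_from_output "$(cmake --version)")"
    if version_ge "$current" "$required"; then
        echo "cmake $current is new enough"
    else
        echo "cmake $current is older than $required, upgrade needed"
    fi
else
    echo "cmake not found on PATH"
fi
```

If the script reports a version that is new enough, the uninstall and rebuild steps can be skipped.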

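2. Python 2.6 or Python 2.7

Python 3 is supported only from MADlib 2.0.0 onward. Earlier versions need a Python 2.6 or Python 2.7 environment; CentOS 7 ships with Python 2.7.5 by default.

If your system does not meet this requirement, install a suitable Python version yourself. When resources allow, managing Python environments with Anaconda is recommended, since it avoids most environment problems.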

# Download the Anaconda installer
wget https://repo.anaconda.com/archive/Anaconda3-2023.03-1-Linux-x86_64.sh
# Install Anaconda; choose the install path when prompted
bash Anaconda3-2023.03-1-Linux-x86_64.sh
# Create a symlink (writing to /usr/bin needs root)
sudo ln -s /[your-install-path]/anaconda3/condabin/conda /usr/bin/conda
# Create and activate the environment
conda create -n madlib python=2.7
conda activate madlib
# Leave the environment
conda deactivate

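3. PostgreSQL built with --with-python

MADlib depends on specific PostgreSQL versions. The minimum is PostgreSQL 9.2; PostgreSQL 12 is supported starting with MADlib 1.17.0, and Postgres 13/14/15 starting with MADlib 2.0.0. Download the version that matches your needs.

When building PostgreSQL from source, pass the --with-python option to configure so that the server is built with PL/Python support.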

# Get the source; here it is cloned directly from the GitHub repository, pinned to version 12.7
git clone -b REL_12_7 --depth 1 https://github.com/postgres/postgres.git
# Build and install
cd postgres/
./configure --prefix=[your-postgresql-install-path] --with-python
make && make install

After installation, add the following environment variables to ~/.bashrc (for example with vim ~/.bashrc) to make the tools easier to use

# PostgreSQL PATH
PGHOME=[your-postgresql-install-path]
export PGHOME
PGDATA=$PGHOME/data
export PGDATA
PGCONFIG=$PGHOME/bin/pg_config
export PGCONFIG
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PGHOME/lib
export LD_LIBRARY_PATH
PATH=$PGHOME/bin:$PATH
export PATH
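
To confirm that the PostgreSQL now on PATH really was built with Python support, its recorded configure options can be inspected with pg_config --configure. This is a minimal sketch under the assumption that pg_config resolves to the installation configured above; configured_with_python is a hypothetical helper name.

```shell
# Succeed if a `pg_config --configure` string contains --with-python.
configured_with_python() {
    case "$1" in
        *--with-python*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the pg_config found on PATH, if there is one.
if command -v pg_config >/dev/null 2>&1; then
    if configured_with_python "$(pg_config --configure)"; then
        echo "PostgreSQL was configured with --with-python"
    else
        echo "PostgreSQL was NOT configured with --with-python"
    fi
fi
```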

Finally, initialize the database and start it

cd $PGHOME/bin
# Initialize the database cluster
initdb -U postgres -D $PGDATA -W
# Start the database
pg_ctl -D $PGDATA -l logfile start

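4. Other third-party dependencies

Besides the dependencies above, MADlib needs the following third-party libraries. Their download locations are defined in src/CMakeLists.txt and are used during the cmake build

Boost 1.61.0: http://www.boost.org/
PyXB 1.2.6: http://pyxb.sourceforge.net/
Eigen 3.2.2: http://eigen.tuxfamily.org/index.php?title=Main_Page

The MADlib build downloads these dependencies over the network, which can be slow on a poor connection

It is therefore advisable to download the archives PyXB-1.2.6.tar.gz, eigen-branches-3.2.tar.gz and boost_1_61_0.tar.gz by hand beforehand and keep them locally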

cd [your-3rd-download-path]
# Download PyXB-1.2.6.tar.gz
wget http://sourceforge.net/projects/pyxb/files/PyXB-1.2.6.tar.gz
# Download eigen-branches-3.2.tar.gz
wget https://github.com/madlib/eigen/archive/branches/3.2.tar.gz
mv 3.2.tar.gz eigen-branches-3.2.tar.gz
# Download boost_1_61_0.tar.gz
wget https://sourceforge.net/projects/boost/files/boost/1.61.0/boost_1_61_0.tar.gz
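
Since a failed or partial download only shows up later as a build error, it may help to check that all three archives are in place before configuring MADlib. The sketch below is an illustration only: missing_tarballs is a hypothetical helper, and THIRD_PARTY_DIR is a hypothetical variable standing in for [your-3rd-download-path].

```shell
# Print the name of every expected archive that is missing (or empty) in
# directory $1. Succeeds only when all three archives are present.
missing_tarballs() {
    dir="$1"
    status=0
    for f in PyXB-1.2.6.tar.gz eigen-branches-3.2.tar.gz boost_1_61_0.tar.gz; do
        if [ ! -s "$dir/$f" ]; then
            echo "$f"
            status=1
        fi
    done
    return "$status"
}

# THIRD_PARTY_DIR is hypothetical; it defaults to the current directory here.
if missing_tarballs "${THIRD_PARTY_DIR:-.}"; then
    echo "all third-party archives are present"
else
    echo "download the archives listed above first"
fi
```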

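1.2 Building and Installing MADlib

This section builds and installs the MADlib extension using MADlib 1.17.0 with PostgreSQL 12.16 as the example

1. Getting the source

First, get the source from the GitHub repository or from the MADlib website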

git clone https://github.com/apache/madlib.git
cd madlib
git tag -l
git checkout rel/v1.17.0

2. Building from source

Make sure the PostgreSQL paths are configured in your ~/.bashrc, then run configure to set up the build directory and compile and install MADlib from that directory

# Run from the directory that contains the madlib checkout
cd madlib
./configure --install-prefix=[your-madlib-sourcecode-dir]/build \
    -DPYXB_TAR_SOURCE=[your-3rd-download-path]/PyXB-1.2.6.tar.gz \
    -DEIGEN_TAR_SOURCE=[your-3rd-download-path]/eigen-branches-3.2.tar.gz \
    -DBOOST_TAR_SOURCE=[your-3rd-download-path]/boost_1_61_0.tar.gz \
    -DPOSTGRESQL_EXECUTABLE=$PGHOME/bin/ \
    -DPOSTGRESQL_12_EXECUTABLE=$PGHOME/bin/ \
    -DPOSTGRESQL_12_CLIENT_INCLUDE_DIR=$PGHOME/bin/ \
    -DPOSTGRESQL_12_SERVER_INCLUDE_DIR=$PGHOME/bin/
# Once the Makefiles are generated, compile and install from the build directory
cd build
make -j8 && sudo make install # MADlib writes to /usr, so root privileges are required

3. Installing MADlib

Use src/bin/madpack under the build directory to install the MADlib extension into PostgreSQL. The install command is

# $BUILD_ROOT/src/bin/madpack -s madlib -p postgres -c [user[/password]@][host][:port][/database] install
./madpack -p postgres -c postgres@127.0.0.1:8886/postgres install

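Here $BUILD_ROOT/src/bin/madpack is the path to the madpack tool, -s names the target schema, -p selects the database platform (postgres means PostgreSQL), and -c gives the connection information. Once the command finishes, the installation is complete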
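
The -c argument follows the pattern [user[/password]@][host][:port][/database], where every part is optional. As an illustration of how the pieces fit together, the sketch below assembles such a string from its parts; madpack_conn is a hypothetical helper and not part of madpack itself.

```shell
# Assemble a madpack -c argument from optional parts.
# Arguments: user password host port database (pass "" to omit a part).
madpack_conn() {
    user="$1"; password="$2"; host="$3"; port="$4"; database="$5"
    conn=""
    if [ -n "$user" ]; then
        conn="$user"
        [ -n "$password" ] && conn="$conn/$password"
        conn="$conn@"
    fi
    conn="$conn$host"
    [ -n "$port" ] && conn="$conn:$port"
    [ -n "$database" ] && conn="$conn/$database"
    printf '%s\n' "$conn"
}

# Reproduces the connection string used in this article:
madpack_conn postgres "" 127.0.0.1 8886 postgres
# prints: postgres@127.0.0.1:8886/postgres
```

The result can then be passed along, for example as ./madpack -p postgres -c "$(madpack_conn postgres "" 127.0.0.1 8886 postgres)" install.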

After installation, the install-check command can be used to verify that everything works

# $BUILD_ROOT/src/bin/madpack -p postgres -c [user[/password]@][host][:port][/database] install-check
./madpack -p postgres -c postgres@127.0.0.1:8886/postgres install-check

The options have the same meaning as above. You can also run $BUILD_ROOT/src/bin/madpack --help to see what each option does

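Once the check passes, log in with psql -U postgres -p [port] to verify further that MADlib is usable. Before the extension can be used, PostgreSQL also needs the plpythonu extension. Because PostgreSQL was built with --with-python, it can be created directly with create extension plpythonu;

MADlib installs its objects into the madlib schema, so search_path must also be set before use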

postgres=# create extension plpythonu;
CREATE EXTENSION
postgres=# \dn
  List of schemas
  Name  |  Owner   
--------+----------
 madlib | postgres
 public | postgres
(2 rows)

postgres=# SET search_path TO "$user",madlib,public;
SET
postgres=# \dt
              List of relations
 Schema |       Name       | Type  |  Owner   
--------+------------------+-------+----------
 madlib | migrationhistory | table | postgres
(1 row)

postgres=# select * from migrationhistory;
 id | version |          applied           
----+---------+----------------------------
  1 | 1.17.0  | 2023-11-08 09:18:33.561105
(1 row)

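4. Uninstalling MADlib

If the extension is no longer needed, it cannot be removed with drop extension the way other extensions can. Use the madpack tool instead, with the following command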

# $BUILD_ROOT/src/bin/madpack -s madlib -p postgres -c [user[/password]@][host][:port][/database] uninstall
./madpack -p postgres -c postgres@127.0.0.1:8886/postgres uninstall

After uninstalling, the madlib schema is removed from the PostgreSQL database


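Troubleshooting

If a dependency problem comes up while installing MADlib, reinstalling the dependency at the version named in the error message is usually enough

The author also hit a problem unrelated to dependencies: the build picked up the wrong pg_config path and reported the following error

Found pg_config ("/bin/pg_config"), but pg_config.h file not present in the server include dir (/usr/include/postgresql/server).

This happened because the author's environment contained several PostgreSQL installations, and the /bin/pg_config that was found did not belong to the PostgreSQL instance being targeted. Deleting that file, or moving it aside as a backup, resolves the problem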
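
When several PostgreSQL installations coexist, it can be useful to list every pg_config on PATH and see which ones actually have server headers next to them. The sketch below is one possible way to do that, not the author's procedure: has_server_headers is a hypothetical helper, and it relies only on the pg_config --includedir-server option and on the pg_config.h file named in the error message.

```shell
# Succeed if directory $1 (a server include directory) contains pg_config.h.
has_server_headers() {
    [ -n "$1" ] && [ -f "$1/pg_config.h" ]
}

# Walk PATH, and for every pg_config found report whether its server
# include directory contains pg_config.h.
old_ifs="$IFS"
IFS=:
for d in $PATH; do
    IFS="$old_ifs"
    if [ -x "$d/pg_config" ]; then
        inc="$("$d/pg_config" --includedir-server 2>/dev/null)"
        if has_server_headers "$inc"; then
            echo "$d/pg_config: server headers found in $inc"
        else
            echo "$d/pg_config: pg_config.h missing from '$inc'"
        fi
    fi
done
IFS="$old_ifs"
```

An entry reported as missing its headers, and listed earlier on PATH than the intended installation, is a likely candidate for the stray pg_config described above.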

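02 Basic Usage

This section uses an example from the official MADlib documentation to show the basics of the extension. Using labeled heart-attack patient data stored in a database table, it trains a logistic regression model and then uses that model to predict whether a patient will have a second heart attack.

2.1 Preparing the data

Create the patients table with the columns id, second_attack (whether a second attack occurred), treatment (whether the patient received treatment), and trait_anxiety (the patient's trait anxiety score)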

DROP TABLE IF EXISTS patients, patients_logregr, patients_logregr_summary;
 
CREATE TABLE patients( id INTEGER NOT NULL,
                        second_attack INTEGER,
                        treatment INTEGER,
                        trait_anxiety INTEGER);
                          
INSERT INTO patients VALUES                                                     
(1,     1,      1,      70),
(3,     1,      1,      50),
(5,     1,      0,      40),
(7,     1,      0,      75),
(9,     1,      0,      70),
(11,    0,      1,      65),
(13,    0,      1,      45),
(15,    0,      1,      40),
(17,    0,      0,      55),
(19,    0,      0,      50),
(2,     1,      1,      80),
(4,     1,      0,      60),
(6,     1,      0,      65),
(8,     1,      0,      80),
(10,    1,      0,      60),
(12,    0,      1,      50),
(14,    0,      1,      35),
(16,    0,      1,      50),
(18,    0,      0,      45),
(20,    0,      0,      60);

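2.2 Training the model

Call MADlib's built-in logistic regression function logregr_train, passing the source table, the output table, the label column, and the other settings as arguments. It trains a classification model with the patients table as input, and the calling style resembles that of common Python machine learning libraries.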

SELECT madlib.logregr_train(
    'patients',                                 -- source table
    'patients_logregr',                         -- output table
    'second_attack',                            -- labels
    'ARRAY[1, treatment, trait_anxiety]',       -- features
    NULL,                                       -- grouping columns
    20,                                         -- max number of iteration
    'irls'                                      -- optimizer
    );

After running the function above, the trained model parameters can be inspected

postgres=# \x
Expanded display is on.
postgres=# SELECT * from patients_logregr;
-[ RECORD 1 ]------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
coef                     | {-6.363469941781869,-1.0241060523932703,0.11904491666860625}
log_likelihood           | -9.410182983888758
std_err                  | {3.2138976637509056,1.1710784486031807,0.05497904582693035}
z_stats                  | {-1.979985241457605,-0.8744982486995525,2.1652779686892014}
p_values                 | {0.04770518706981086,0.3818469735304479,0.030366404504615287}
odds_ratios              | {0.0017233763092323062,0.35911735405495454,1.1264205122089475}
condition_no             | 326.0819227915625
num_rows_processed       | 20
num_missing_rows_skipped | 0
num_iterations           | 5
variance_covariance      | {{10.32913819306353,-0.4743046651957294,-0.17199590126004818},{-0.47430466519572945,1.3714247327828326,-0.0011952070338159114},{-0.17199590126004818,-0.0011952070338159114,0.003022695480039707}}

postgres=# \x
Expanded display is off.
postgres=# SELECT unnest(array['intercept', 'treatment', 'trait_anxiety']) as attribute,
        unnest(coef) as coefficient,
        unnest(std_err) as standard_error,
        unnest(z_stats) as z_stat,
        unnest(p_values) as pvalue,
        unnest(odds_ratios) as odds_ratio
 FROM patients_logregr;
   attribute   |     coefficient     |   standard_error    |       z_stat        |        pvalue        |      odds_ratio       
---------------+---------------------+---------------------+---------------------+----------------------+-----------------------
 intercept     |  -6.363469941781869 |  3.2138976637509056 |  -1.979985241457605 |  0.04770518706981086 | 0.0017233763092323062
 treatment     | -1.0241060523932703 |  1.1710784486031807 | -0.8744982486995525 |   0.3818469735304479 |   0.35911735405495454
 trait_anxiety | 0.11904491666860625 | 0.05497904582693035 |  2.1652779686892014 | 0.030366404504615287 |    1.1264205122089475
(3 rows)
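
The columns in this table are related in the usual way for logistic regression: the z statistic is the coefficient divided by its standard error, and the odds ratio is the exponential of the coefficient. The sketch below recomputes both from the values shown, as a sanity check done outside the database with awk; z_and_odds is a hypothetical helper name.

```shell
# Recompute z = coef / std_err and odds ratio = exp(coef) for one row.
# Arguments: coefficient, standard error. Prints "z odds_ratio" to 4 decimals.
z_and_odds() {
    awk -v c="$1" -v s="$2" 'BEGIN { printf "%.4f %.4f\n", c / s, exp(c) }'
}

z_and_odds -1.0241060523932703 1.1710784486031807    # treatment
# prints: -0.8745 0.3591
z_and_odds 0.11904491666860625 0.05497904582693035   # trait_anxiety
# prints: 2.1653 1.1264
```

Both rows agree with the z_stat and odds_ratio columns above once rounded to four decimals.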

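2.3 Using the model

Now use the trained model to predict whether a patient will have a second heart attack within 1 year

No new data is constructed here. Predicting directly on the original table makes it easy to compare the model's output with the known labels. The query below returns the predicted value next to the original value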

postgres=# SELECT p.id, madlib.logregr_predict(m.coef, ARRAY[1, p.treatment, p.trait_anxiety]),
       p.second_attack
FROM patients p, patients_logregr m
ORDER BY p.id;
 id | logregr_predict | second_attack 
----+-----------------+---------------
  1 | t               |             1
  2 | t               |             1
  3 | f               |             1
  4 | t               |             1
  5 | f               |             1
  6 | t               |             1
  7 | t               |             1
  8 | t               |             1
  9 | t               |             1
 10 | t               |             1
 11 | t               |             0
 12 | f               |             0
 13 | f               |             0
 14 | f               |             0
 15 | f               |             0
 16 | f               |             0
 17 | t               |             0
 18 | f               |             0
 19 | f               |             0
 20 | t               |             0
(20 rows)
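
To put a number on how well the predictions match the labels, the rows above can be tallied. The sketch below copies the id, predicted value and second_attack columns from the output and counts agreements with awk; tally_accuracy is a hypothetical helper, and it expects at least one input row.

```shell
# Read "id predicted actual" rows (predicted is t/f, actual is 1/0)
# and print "correct total accuracy".
tally_accuracy() {
    awk '{
        total++
        if (($2 == "t") == ($3 == 1)) correct++
    } END { printf "%d %d %.2f\n", correct, total, correct / total }'
}

# Rows copied from the query output above.
tally_accuracy <<'EOF'
1 t 1
2 t 1
3 f 1
4 t 1
5 f 1
6 t 1
7 t 1
8 t 1
9 t 1
10 t 1
11 t 0
12 f 0
13 f 0
14 f 0
15 f 0
16 f 0
17 t 0
18 f 0
19 f 0
20 t 0
EOF
# prints: 15 20 0.75
```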

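Because the dataset is small, accuracy is only moderate: 15 of the 20 predictions above match the recorded outcome, and these are the same rows the model was trained on. Still, this demonstrates basic logistic regression prediction. Going further, the logregr_predict_prob function returns the probability that the model predicts true; a probability greater than 0.5 is predicted as True, otherwise False.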

postgres=# SELECT p.id, madlib.logregr_predict_prob(coef, ARRAY[1, treatment, trait_anxiety])
FROM patients p, patients_logregr m
ORDER BY p.id;
 id | logregr_predict_prob 
----+----------------------
  1 |   0.7202230289415248
  2 |    0.894354902502046
  3 |   0.1922695417551723
  4 |    0.685513072239347
  5 |  0.16774788150886008
  6 |   0.7980981089151397
  7 |   0.9285680757525024
  8 |   0.9593057636935703
  9 |   0.8775761174314511
 10 |    0.685513072239347
 11 |   0.5867008959433155
 12 |   0.1922695417551723
 13 |  0.11603201063299477
 14 |  0.03838291431349889
 15 |  0.06749762241476068
 16 |   0.1922695417551723
 17 |   0.5458707743026223
 18 |   0.2676754223871351
 19 |  0.39861863928511365
 20 |    0.685513072239347
(20 rows)
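
These probabilities are the logistic (sigmoid) function applied to the linear combination of the features with the trained coefficients. The sketch below reproduces a few of them outside the database with awk, using the coef values from patients_logregr and the 0.5 threshold described above; predict_prob is a hypothetical helper, not a MADlib function.

```shell
# Logistic-regression probability for one patient.
# Arguments: treatment (0/1), trait_anxiety.
# Coefficients (intercept, treatment, trait_anxiety) are taken from the
# patients_logregr output shown earlier in this article.
predict_prob() {
    awk -v t="$1" -v a="$2" 'BEGIN {
        z = -6.363469941781869 + (-1.0241060523932703) * t + 0.11904491666860625 * a
        p = 1 / (1 + exp(-z))
        printf "%.4f %s\n", p, (p > 0.5 ? "t" : "f")
    }'
}

predict_prob 1 70   # patient 1
# prints: 0.7202 t
predict_prob 1 50   # patient 3
# prints: 0.1923 f
predict_prob 0 55   # patient 17
# prints: 0.5459 t
```

The values agree with rows 1, 3 and 17 of the table above when rounded to four decimals.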