1. Create a virtual environment
As a good habit, start by creating a dedicated runtime environment:
conda create -n uie python=3.10.9
conda activate uie
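2. Install the Paddle framework and PaddleNLP
2.1 Install Paddle following the official documentation
Reference: "Getting Started" (开始使用), PaddlePaddle (飞桨), an open-source deep learning platform rooted in industrial practice
First check which CUDA version your server supports by running nvidia-smi. As shown below, mine is 10.2.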
(PyTorch-1.8) [ma-user work]$nvidia-smi
Wed Apr 19 23:35:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:0E.0 Off | 0 |
| N/A 39C P0 28W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Then select the matching CUDA version on the Paddle website and copy the install command it generates.
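The CUDA check above can also be scripted. The following is a minimal sketch, not part of Paddle or PaddleNLP: the helper names `parse_cuda_version` and `detect_cuda_version` are made up for illustration, and the regular expression assumes the header line looks like the nvidia-smi output pasted above. Note that the "CUDA Version" field in nvidia-smi reports the highest CUDA version the driver supports, so the Paddle build you pick should not be newer than that.

```python
import re
import subprocess
from typing import Optional


def parse_cuda_version(smi_output: str) -> Optional[str]:
    """Return the 'CUDA Version' field from nvidia-smi text, or None if absent."""
    match = re.search(r"CUDA Version:\s*([0-9]+(?:\.[0-9]+)?)", smi_output)
    return match.group(1) if match else None


def detect_cuda_version() -> Optional[str]:
    """Run nvidia-smi and parse its header; return None if the tool cannot be run."""
    try:
        out = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        # nvidia-smi is missing or failed, e.g. on a CPU-only machine
        return None
    return parse_cuda_version(out)


if __name__ == "__main__":
    print(detect_cuda_version())
```

On a machine without a GPU driver the script prints None, in which case the CPU build of Paddle is the one to install.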
2.2 Install PaddleNLP
pip install --upgrade paddlenlp
2.2.1 Problem 1: ERROR: Failed building wheel for numpy
-x86_64-3.10/numpy/core/src/multiarray/scalartypes.o -MMD -MF build/temp.linux-x86_64-3.10/build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.o.d" failed with exit status 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for numpy
Failed to build numpy
ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× pip subprocess to install backend dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip
Installing numpy manually and then re-running the PaddleNLP installation still failed:
pip install numpy
Installing through the Baidu PyPI mirror instead succeeded:
python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple
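After an installation that needed a retry, it can be worth confirming what actually ended up in the environment. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the distribution names in the usage part (`paddlepaddle-gpu` for the GPU build, `paddlepaddle` for the CPU build) are assumptions about which Paddle package was installed in section 2.1.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_versions(packages: Sequence[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; adjust to what you installed.
    wanted = ["numpy", "paddlenlp", "paddlepaddle-gpu", "paddlepaddle"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Run it inside the activated uie environment so that it inspects the same interpreter pip installed into.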
3. Download the PaddleNLP source code
git clone https://github.com/PaddlePaddle/PaddleNLP.git
4. Run training
4.1 Preprocess the annotated data
python ../PaddleNLP/model_zoo/uie/doccano.py --doccano_file ./data.json --task_type ext --save_dir ./ --splits 0.7 0.2 0.1 --schema_lang ch
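The `--splits 0.7 0.2 0.1` argument asks for a 70% / 20% / 10% division into training, development, and test sets. The sketch below only illustrates what such a ratio split means; it is not the implementation inside doccano.py, and the function name `split_examples` and its `seed` parameter are made up for this example.

```python
import random
from typing import List, Sequence, Tuple


def split_examples(
    examples: Sequence[str], splits: Sequence[float], seed: int = 42
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle examples and cut them into train/dev/test by the given ratios."""
    if len(splits) != 3 or abs(sum(splits) - 1.0) > 1e-9:
        raise ValueError("splits must be three ratios that sum to 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # round() avoids off-by-one cuts caused by float error (0.7 + 0.2 is slightly below 0.9)
    cut_train = round(len(shuffled) * splits[0])
    cut_dev = round(len(shuffled) * (splits[0] + splits[1]))
    return shuffled[:cut_train], shuffled[cut_train:cut_dev], shuffled[cut_dev:]


if __name__ == "__main__":
    train, dev, test = split_examples([f"doc{i}" for i in range(10)], [0.7, 0.2, 0.1])
    print(len(train), len(dev), len(test))
```

With ten annotated documents this yields 7, 2, and 1 examples respectively, which shows why a very small annotation set leaves almost nothing for evaluation.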
4.2 Fine-tune the model
# $finetuned_model is the output directory; ./checkpoint/model_best matches the task_path used when loading the model later
export finetuned_model=./checkpoint/model_best
python ../PaddleNLP/model_zoo/uie/finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-base \
    --output_dir $finetuned_model \
    --train_path ./train.txt \
    --dev_path ./dev.txt \
    --max_seq_length 512 \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size 16 \
    --num_train_epochs 20 \
    --learning_rate 1e-5 \
    --label_names "start_positions" "end_positions" \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir $finetuned_model \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1
When output like the figure below appears, training has succeeded.
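To judge how the step-based settings in the fine-tuning command interact with the size of your data, a quick calculation helps. The sketch below is plain arithmetic under stated assumptions: a single GPU, no gradient accumulation, and a final partial batch that still counts as a step. The function name `training_schedule` and the example size of 700 training examples are hypothetical; the defaults mirror `--per_device_train_batch_size 16`, `--num_train_epochs 20`, and `--eval_steps 100`.

```python
import math
from typing import Tuple


def training_schedule(
    num_examples: int, batch_size: int = 16, epochs: int = 20, eval_steps: int = 100
) -> Tuple[int, int, int]:
    """Return (steps per epoch, total steps, number of periodic evaluations)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    # An evaluation is triggered every eval_steps optimizer steps
    periodic_evals = total_steps // eval_steps
    return steps_per_epoch, total_steps, periodic_evals


if __name__ == "__main__":
    print(training_schedule(700))  # hypothetical training-set size
```

For 700 examples this gives 44 steps per epoch, 880 steps in total, and 8 periodic evaluations. If the total number of steps comes out below 100, the step-based evaluation and save triggers may never fire during the run, so lowering `--eval_steps` and `--save_steps` for small datasets is worth considering.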
5. Use the model
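from pprint import pprint

from paddlenlp import Taskflow

# Entity types to extract: time, region, indicator name (these should match the labels used during annotation)
schema = ['时间', '地区', '指标名']
# Load the fine-tuned model from the directory written by finetune.py
ie = Taskflow('information_extraction', schema=schema, task_path="./checkpoint/model_best")
# Query: "I want to look up Shandong Province's main business revenue data for 2022"
pprint(ie("我想查询2022年山东省主营业务收入数据"))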
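The extraction result usually needs a little post-processing before it can drive a query. The sketch below assumes the Taskflow call returns a list with one dict per input text, where each schema label maps to a list of spans carrying `text`, `start`, `end`, and `probability` keys; verify this against your own output. The helper name `collect_spans`, the 0.5 threshold, and the sample values are made up for illustration and are not real model output.

```python
from typing import Dict, List


def collect_spans(
    results: List[dict], min_probability: float = 0.5
) -> List[Dict[str, List[str]]]:
    """Keep only the span texts per label whose probability meets the threshold."""
    collected = []
    for item in results:
        collected.append({
            label: [
                span["text"]
                for span in spans
                if span.get("probability", 1.0) >= min_probability
            ]
            for label, spans in item.items()
        })
    return collected


if __name__ == "__main__":
    # Hand-written sample in the assumed output shape, not actual model output
    sample = [{
        "时间": [{"text": "2022年", "start": 4, "end": 9, "probability": 0.98}],
        "地区": [{"text": "山东省", "start": 9, "end": 12, "probability": 0.97}],
        "指标名": [{"text": "主营业务收入", "start": 12, "end": 18, "probability": 0.30}],
    }]
    print(collect_spans(sample))
```

In the sample, the low-probability 指标名 span is filtered out, leaving an empty list for that label; a caller can treat an empty list as "ask the user to clarify" instead of guessing.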