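1. Introduction to DolphinScheduler
Apache DolphinScheduler is a distributed, easily extensible, open-source workflow scheduling system with a visual DAG interface. It targets enterprise scenarios and provides a visual solution for operating tasks and workflows across the full data-processing lifecycle.
Apache DolphinScheduler aims to untangle complex dependencies between big-data tasks and to give applications the relationships between data and the various OPS orchestration steps. It addresses two problems in ETL development: intricate dependencies and the inability to monitor task health. DolphinScheduler assembles tasks as a DAG (Directed Acyclic Graph) in a streaming fashion, so task execution status can be monitored promptly, and it supports operations such as retry, recovery from a specified failed node, pause, resume, and kill.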
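2. Goals of This Chapter
- Deploy DolphinScheduler on a K8S environment
- Use local file storage instead of HDFS and S3
- Run a simple DolphinScheduler application on K8S (with Python3 and MySQL data source support, plus workflow orchestration)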
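3. Prerequisites
- A Kubernetes 1.12+ cluster (required), with Kuboard v3 as the cluster management tool (optional). For the detailed steps, see:
K8S Installation Notes (1): Complete Installation and Configuration of the Master Node
K8S Installation Notes (2): Building a Cluster from Multiple Public-Network Servers
- PV provisioning (storage uses NFS, with the storage class nfs-storage). For the detailed steps, see:
Setting Up an NFS Server and Creating an NFS Storage Class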
4. Installing Helm
Official Helm documentation: https://helm.sh/docs/intro/install/
4.1 Download the required version
Download page: https://github.com/helm/helm/releases
The version I chose is helm-v3.12.2-linux-amd64.
4.2 Upload to the server and extract
tar -zxvf helm-v3.12.2-linux-amd64.tar.gz
4.3 Move the binary to an executable directory
Find the helm binary in the extracted directory and move it to the desired location:
mv linux-amd64/helm /usr/local/bin/helm
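The archive name used in 4.1 and 4.2 follows a fixed pattern, so the download step can be sketched in Python. The get.helm.sh host and both helper names are assumptions that do not appear in the steps above; check the URL against the releases page before relying on it.

```python
def helm_archive_name(version: str, os_name: str = "linux", arch: str = "amd64") -> str:
    """Return the release archive name, e.g. helm-v3.12.2-linux-amd64.tar.gz."""
    return f"helm-v{version.lstrip('v')}-{os_name}-{arch}.tar.gz"


def helm_download_url(version: str, base_url: str = "https://get.helm.sh") -> str:
    """Join the download host (an assumption, see above) with the archive name."""
    return f"{base_url}/{helm_archive_name(version)}"


# The version chosen in this walkthrough.
print(helm_download_url("3.12.2"))
```

The resulting URL can be passed to curl or wget on the server, after which the tar and mv commands above apply unchanged.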
4.4 Install from the script
Alternatively, install Helm with the installer script:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
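5. Installing DolphinScheduler
5.1 Download and extract
Download the source package apache-dolphinscheduler-<version>-src.tar.gz from the download page.
To install a release named dolphinscheduler, the official reference commands are: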
tar -zxvf apache-dolphinscheduler-<version>-src.tar.gz
cd apache-dolphinscheduler-<version>-src/deploy/kubernetes/dolphinscheduler
helm repo add bitnami https://charts.bitnami.com/bitnami
helm dependency update .
helm install dolphinscheduler . --set image.tag=<version>
I chose version 3.1.8, so the actual commands are:
tar -zxvf apache-dolphinscheduler-3.1.8-src.tar.gz
cd apache-dolphinscheduler-3.1.8-src/deploy/kubernetes/dolphinscheduler
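5.2 Change the RAW resource address
The original address is https://raw.githubusercontent.com/, which may be unreachable from mainland China. Switch to the accelerated RAW mirror: https://raw.gitmirror.com
Edit the following setting in Chart.yaml:
vi Chart.yaml
# Original address
#repository: https://raw.githubusercontent.com/bitnami/charts/archive-full-index/bitnami
# Accelerated mirror address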
repository: https://raw.gitmirror.com/bitnami/charts/archive-full-index/bitnami
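If Chart.yaml lists several dependencies, the same substitution can be scripted instead of edited by hand. This is a minimal sketch: the function name and the sample text (including the dependency name) are made up for illustration, and commented-out lines are left untouched so the original address stays visible.

```python
ORIGINAL_HOST = "https://raw.githubusercontent.com"
MIRROR_HOST = "https://raw.gitmirror.com"


def use_mirror(chart_text: str) -> str:
    """Point every active (non-comment) line at the mirror host."""
    out = []
    for line in chart_text.splitlines():
        if not line.lstrip().startswith("#"):
            line = line.replace(ORIGINAL_HOST, MIRROR_HOST)
        out.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(out) + ("\n" if chart_text.endswith("\n") else "")


# Illustrative fragment only; the real Chart.yaml has more fields.
sample = (
    "dependencies:\n"
    "- name: example-dependency\n"
    "  # repository: https://raw.githubusercontent.com/bitnami/charts/archive-full-index/bitnami\n"
    "  repository: https://raw.githubusercontent.com/bitnami/charts/archive-full-index/bitnami\n"
)
print(use_mirror(sample))
```

To apply it to the real file, read Chart.yaml, pass its text through use_mirror, and write the result back before running helm dependency update.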
5.3 Use local file storage instead of HDFS and S3
Edit the following settings in values.yaml:
common:
  configmap:
    RESOURCE_STORAGE_TYPE: "NONE"
    RESOURCE_UPLOAD_PATH: "/dolphinscheduler"
    FS_DEFAULT_FS: "file:///"
  fsFileResourcePersistence:
    enabled: true
    accessModes:
    - "ReadWriteMany"
    storageClassName: "-"
    storage: "20Gi"
Change storageClassName and storage to the actual values for your cluster. Note: the storage class named by storageClassName must support the ReadWriteMany access mode.
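The same overrides can also be passed on the command line with helm's --set flag rather than by editing values.yaml. The sketch below flattens a nested dictionary into --set arguments; the helper name is hypothetical, and the storage class nfs-storage is taken from the prerequisites section, so substitute your own.

```python
import shlex

# Mirrors the values.yaml fragment above, with the NFS storage class filled in.
overrides = {
    "common": {
        "configmap": {
            "RESOURCE_STORAGE_TYPE": "NONE",
            "RESOURCE_UPLOAD_PATH": "/dolphinscheduler",
            "FS_DEFAULT_FS": "file:///",
        },
        "fsFileResourcePersistence": {
            "enabled": True,
            "accessModes": ["ReadWriteMany"],
            "storageClassName": "nfs-storage",
            "storage": "20Gi",
        },
    }
}


def to_set_flags(values, prefix=""):
    """Flatten nested dicts/lists into helm --set style key=value strings."""
    flags = []
    if isinstance(values, dict):
        for key, val in values.items():
            flags += to_set_flags(val, f"{prefix}.{key}" if prefix else key)
    elif isinstance(values, list):
        for index, val in enumerate(values):
            flags += to_set_flags(val, f"{prefix}[{index}]")
    else:
        # YAML booleans are lowercase; everything else is passed through as text.
        text = str(values).lower() if isinstance(values, bool) else str(values)
        flags.append(f"{prefix}={text}")
    return flags


flags = to_set_flags(overrides)
command = ["helm", "install", "dolphinscheduler", ".", "-n", "dolphinscheduler"]
for flag in flags:
    command += ["--set", flag]

# shlex.join quotes arguments such as accessModes[0] so the shell does not glob them.
print(shlex.join(command))
```

Editing values.yaml keeps the configuration in one reviewable file, while --set is convenient for one-off experiments; either way the resulting release is the same.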
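5.4 Deploy
The dolphinscheduler namespace must already exist (or add --create-namespace to the helm install command).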
helm dependency update .
helm install dolphinscheduler . -n dolphinscheduler
Check the deployment result:
kubectl get pod,svc,pvc -o wide -n dolphinscheduler
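Reading the pod table by eye works for a handful of pods; for a repeatable check, the JSON form of the same listing (kubectl get pod -n dolphinscheduler -o json) can be inspected in Python. This is a sketch, not part of the original procedure: the function name and the sample pod names are invented, and only the standard status.phase and containerStatuses[].ready fields are used.

```python
def not_ready_pods(pod_list: dict) -> list:
    """Return the names of pods that are not running with all containers ready."""
    pending = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # completed job pods are not expected to stay ready
        containers = status.get("containerStatuses", [])
        ready = (
            status.get("phase") == "Running"
            and bool(containers)
            and all(c.get("ready") for c in containers)
        )
        if not ready:
            pending.append(pod["metadata"]["name"])
    return pending


# Hand-written sample in the shape of `kubectl get pod -o json` output.
sample = {
    "items": [
        {
            "metadata": {"name": "example-api-0"},
            "status": {"phase": "Running", "containerStatuses": [{"ready": True}]},
        },
        {
            "metadata": {"name": "example-worker-0"},
            "status": {"phase": "Pending"},
        },
    ]
}
print(not_ready_pods(sample))
```

In practice the dictionary would come from json.loads on the kubectl output; an empty result means every pod is up.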
5.5 Access the web UI
Forward the port with port-forward:
kubectl port-forward --address 0.0.0.0 -n dolphinscheduler svc/dolphinscheduler-api 12345:12345
Other access methods, such as NodePort or ingress, are left for you to explore.
Open the web UI at http://localhost:12345/dolphinscheduler/ui
The default user is admin and the default password is dolphinscheduler123.
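The API pod can take a while to start answering after the port-forward is established. A small polling helper, sketched below with only the standard library, waits until the UI URL responds. The function names, the retry count, and the delay are arbitrary choices made for illustration.

```python
import time
import urllib.error
import urllib.request

UI_URL = "http://localhost:12345/dolphinscheduler/ui"


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET on the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_up(url: str, probe=http_ok, attempts: int = 30, delay: float = 2.0) -> bool:
    """Call probe(url) up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

Calling wait_until_up(UI_URL) while the port-forward is running returns True once the page is reachable; the probe argument exists so the loop can be exercised without a live cluster.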
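6. Supporting Python3 and MySQL
6.1 Building the dolphinscheduler-worker image
Download the MySQL driver mysql-connector-java-8.0.16.jar.
Directory structure: place the driver jar, requirements.txt, and the Dockerfile in the same build directory.
Create a new Dockerfile that adds the MySQL driver and installs Python 3: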
FROM dolphinscheduler.docker.scarf.sh/apache/dolphinscheduler-worker:3.1.8
# Add the MySQL driver
COPY ./mysql-connector-java-8.0.16.jar /opt/dolphinscheduler/libs
# Add the custom requirements.txt
COPY ./requirements.txt /tmp
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3-pip && \
    pip3 install --no-cache-dir -r /tmp/requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ && \
    rm -rf /var/lib/apt/lists/*
requirements.txt:
Flask_Cors==3.0.10
pandas==1.4.2
PyMySQL==1.0.2
SQLAlchemy==1.4.32
xlwt==1.3.0
xlsxwriter==3.0.3
gunicorn
greenlet
eventlet
gevent
pypinyin
openpyxl
Build the new image:
docker build -t apache/dolphinscheduler-worker:python3-mysql .
Distribute the newly built image to the cluster nodes.
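Before distributing the image, it can be worth confirming that the packages from requirements.txt actually landed in it. The sketch below parses a requirements file and compares pinned versions with what importlib.metadata reports. It assumes the interpreter in the image is Python 3.8 or newer, handles only the plain name and name==version forms used above, and both function names are invented for illustration.

```python
from importlib import metadata


def parse_requirements(text: str) -> list:
    """Parse 'name' and 'name==version' lines into (name, version-or-None) pairs."""
    reqs = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, _, version = line.partition("==")
        reqs.append((name.strip(), version.strip() or None))
    return reqs


def check_installed(reqs: list) -> list:
    """Return a human-readable problem list; an empty list means all good."""
    problems = []
    for name, wanted in reqs:
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if wanted and found != wanted:
            problems.append(f"{name}: expected {wanted}, found {found}")
    return problems


# A short excerpt of the requirements.txt shown above.
print(parse_requirements("pandas==1.4.2\nPyMySQL==1.0.2\ngunicorn\n"))
```

Inside the container, feeding the contents of /tmp/requirements.txt through parse_requirements and then check_installed gives a quick pass/fail list.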
6.2 Building the dolphinscheduler-api image
Create a new Dockerfile that adds the MySQL driver:
FROM dolphinscheduler.docker.scarf.sh/apache/dolphinscheduler-api:3.1.8
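# Add the MySQL driver
COPY ./mysql-connector-java-8.0.16.jar /opt/dolphinscheduler/libs
Build the new image:
docker build -t apache/dolphinscheduler-api:support-mysql .
Distribute the newly built image to the cluster nodes.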
6.3 Setting PYTHON_HOME
- Set PYTHON_HOME in values.yaml to /usr/bin/python3
- Or edit it in Kuboard: dolphinscheduler namespace > Configuration Center > ConfigMaps > dolphinscheduler-common > Edit
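To confirm that the worker picks up the intended interpreter, a Python task can simply report what it is running on. The snippet below is a minimal sketch meant to be pasted into a Python task; the function name is arbitrary, and the reported path depends on how the image was built.

```python
import platform
import sys


def interpreter_report() -> dict:
    """Collect the interpreter path and version of the current process."""
    return {
        "executable": sys.executable,
        "version": platform.python_version(),
        "is_python3": sys.version_info.major == 3,
    }


print(interpreter_report())
```

If the task log shows a Python 2 version or an unexpected path, PYTHON_HOME has not taken effect for that worker.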
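6.4 Changing the dolphinscheduler-worker image
Point the worker at the image built in 6.1 (apache/dolphinscheduler-worker:python3-mysql).
6.5 Changing the dolphinscheduler-api image
Point the API server at the image built in 6.2 (apache/dolphinscheduler-api:support-mysql).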
7. Basic Usage Verification
7.1 Log in
7.2 Create a tenant
7.3 Create a project
7.4 Create a data source
7.5 Create a project
7.6 Create a workflow
[{
"processDefinition": {
"id": 1,
"code": 10577001612288,
"name": "测试",
"version": 1,
"releaseState": "OFFLINE",
"projectCode": 10576969989760,
"description": "",
"globalParams": "[]",
"globalParamList": [],
"globalParamMap": {},
"createTime": "2023-08-15 09:33:45",
"updateTime": "2023-08-15 09:33:45",
"flag": "YES",
"userId": 1,
"userName": null,
"projectName": null,
"locations": "[{\"taskCode\":10576976496512,\"x\":175,\"y\":216},{\"taskCode\":10576980784256,\"x\":498,\"y\":216},{\"taskCode\":10576997351040,\"x\":834,\"y\":216}]",
"scheduleReleaseState": null,
"timeout": 0,
"tenantId": -1,
"tenantCode": null,
"modifyBy": null,
"warningGroupId": 0,
"executionType": "PARALLEL"
},
"processTaskRelationList": [{
"id": 1,
"name": "",
"processDefinitionVersion": 1,
"projectCode": 10576969989760,
"processDefinitionCode": 10577001612288,
"preTaskCode": 0,
"preTaskVersion": 0,
"postTaskCode": 10576976496512,
"postTaskVersion": 1,
"conditionType": "NONE",
"conditionParams": {},
"createTime": "2023-08-15 09:33:45",
"updateTime": "2023-08-15 09:33:45",
"operator": 1,
"operateTime": "2023-08-15 09:33:45"
}, {
"id": 2,
"name": "",
"processDefinitionVersion": 1,
"projectCode": 10576969989760,
"processDefinitionCode": 10577001612288,
"preTaskCode": 10576976496512,
"preTaskVersion": 1,
"postTaskCode": 10576980784256,
"postTaskVersion": 1,
"conditionType": "NONE",
"conditionParams": {},
"createTime": "2023-08-15 09:33:45",
"updateTime": "2023-08-15 09:33:45",
"operator": 1,
"operateTime": "2023-08-15 09:33:45"
}, {
"id": 3,
"name": "",
"processDefinitionVersion": 1,
"projectCode": 10576969989760,
"processDefinitionCode": 10577001612288,
"preTaskCode": 10576980784256,
"preTaskVersion": 1,
"postTaskCode": 10576997351040,
"postTaskVersion": 1,
"conditionType": "NONE",
"conditionParams": {},
"createTime": "2023-08-15 09:33:45",
"updateTime": "2023-08-15 09:33:45",
"operator": 1,
"operateTime": "2023-08-15 09:33:45"
}],
"taskDefinitionList": [{
"id": 1,
"code": 10576976496512,
"name": "sql",
"version": 1,
"description": "",
"projectCode": 10576969989760,
"userId": 1,
"taskType": "SQL",
"taskParams": {
"localParams": [],
"resourceList": [],
"type": "MYSQL",
"datasource": 1,
"sql": "show tables",
"sqlType": "0",
"preStatements": [],
"postStatements": [],
"segmentSeparator": "",
"displayRows": 10
},
"taskParamList": [],
"taskParamMap": null,
"flag": "YES",
"taskPriority": "MEDIUM",
"userName": null,
"projectName": null,
"workerGroup": "default",
"environmentCode": -1,
"failRetryTimes": 0,
"failRetryInterval": 1,
"timeoutFlag": "CLOSE",
"timeoutNotifyStrategy": null,
"timeout": 0,
"delayTime": 0,
"resourceIds": "",
"createTime": "2023-08-15 09:33:45",
"updateTime": "2023-08-15 09:33:45",
"modifyBy": null,
"taskGroupId": 0,
"taskGroupPriority": 0,
"cpuQuota": -1,
"memoryMax": -1,
"taskExecuteType": "BATCH",
"operator": 1,
"operateTime": "2023-08-15 09:33:45"
}, {
"id": 2,
"code": 10576980784256,
"name": "shell",
"version": 1,
"description": "",
"projectCode": 10576969989760,
"userId": 1,
"taskType": "SHELL",
"taskParams": {
"localParams": [],
"rawScript": "echo \"hello shell\"",
"resourceList": []
},
"taskParamList": [],
"taskParamMap": null,
"flag": "YES",
"taskPriority": "MEDIUM",
"userName": null,
"projectName": null,
"workerGroup": "default",
"environmentCode": -1,
"failRetryTimes": 0,
"failRetryInterval": 1,
"timeoutFlag": "CLOSE",
"timeoutNotifyStrategy": null,
"timeout": 0,
"delayTime": 0,
"resourceIds": "",
"createTime": "2023-08-15 09:33:45",
"updateTime": "2023-08-15 09:33:45",
"modifyBy": null,
"taskGroupId": 0,
"taskGroupPriority": 0,
"cpuQuota": -1,
"memoryMax": -1,
"taskExecuteType": "BATCH",
"operator": 1,
"operateTime": "2023-08-15 09:33:45"
}, {
"id": 3,
"code": 10576997351040,
"name": "python",
"version": 1,
"description": "",
"projectCode": 10576969989760,
"userId": 1,
"taskType": "PYTHON",
"taskParams": {
"localParams": [],
"rawScript": "print(\"hello python\")",
"resourceList": []
},
"taskParamList": [],
"taskParamMap": null,
"flag": "YES",
"taskPriority": "MEDIUM",
"userName": null,
"projectName": null,
"workerGroup": "default",
"environmentCode": -1,
"failRetryTimes": 0,
"failRetryInterval": 1,
"timeoutFlag": "CLOSE",
"timeoutNotifyStrategy": null,
"timeout": 0,
"delayTime": 0,
"resourceIds": "",
"createTime": "2023-08-15 09:33:45",
"updateTime": "2023-08-15 09:33:45",
"modifyBy": null,
"taskGroupId": 0,
"taskGroupPriority": 0,
"cpuQuota": -1,
"memoryMax": -1,
"taskExecuteType": "BATCH",
"operator": 1,
"operateTime": "2023-08-15 09:33:45"
}],
"schedule": null
}]
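The workflow definition above encodes the DAG in processTaskRelationList: each entry links a preTaskCode to a postTaskCode. The sketch below recovers the execution order from those fields with the standard-library graphlib module (Python 3.9+). Treating preTaskCode 0 as "no predecessor" is an assumption read off the first relation in the JSON, and the sample is trimmed to the fields the function needs.

```python
from graphlib import TopologicalSorter


def task_order(export: list) -> list:
    """Return task names in an order that respects the workflow's dependencies."""
    workflow = export[0]
    names = {task["code"]: task["name"] for task in workflow["taskDefinitionList"]}
    sorter = TopologicalSorter()
    for rel in workflow["processTaskRelationList"]:
        pre, post = rel["preTaskCode"], rel["postTaskCode"]
        if pre == 0:
            sorter.add(post)  # assumed: 0 marks an entry task with no predecessor
        else:
            sorter.add(post, pre)
    return [names[code] for code in sorter.static_order()]


# Trimmed copy of the workflow definition above (codes and names unchanged).
sample = [{
    "processTaskRelationList": [
        {"preTaskCode": 0, "postTaskCode": 10576976496512},
        {"preTaskCode": 10576976496512, "postTaskCode": 10576980784256},
        {"preTaskCode": 10576980784256, "postTaskCode": 10576997351040},
    ],
    "taskDefinitionList": [
        {"code": 10576976496512, "name": "sql"},
        {"code": 10576980784256, "name": "shell"},
        {"code": 10576997351040, "name": "python"},
    ],
}]
print(task_order(sample))
```

For this workflow the three tasks form a simple chain, so the SQL task runs first, then the shell task, then the Python task. The same function can be pointed at a full definition loaded with json.load.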