The tool: DataX.....
Download: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
After downloading, extract it: tar -xvzf datax.tar.gz
In theory you can then use it right away. However, DataX's launcher scripts are written in Python 2, so running them with Python 3 will definitely throw errors.
This can be fixed with the approach described here:
DataX的python3使用-CSDN博客
Of course, if you run it with Python 2, just skip this step...
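If you would rather patch the scripts yourself, a rough alternative is to run 2to3 over the launcher scripts; this handles the obvious Python 2 syntax, though the replacement scripts from the post above are the more thoroughly tested route:
# rewrite the Python 2 launcher scripts in place (2to3 keeps .bak backups by default)
2to3 -w datax/bin/*.py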
Also, although DataX is launched through Python, its actual functionality is implemented in Java, not Python. So before running DataX you need to install Java....
First, check your system and architecture, then decide which Java build to install.
Mine is an Ubuntu VM on x86_64, so I need the corresponding download:
Link: Java Downloads | Oracle 中国
Copy it onto the VM and extract it directly:
tar -xvzf jdk-17.0.9_linux-x64_bin.tar.gz
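Since JAVA_HOME below is /usr/lib/jvm, in my setup the extracted JDK contents need to sit directly under that directory. One way to do that (just a sketch; you can extract anywhere as long as JAVA_HOME points at the right place):
sudo mkdir -p /usr/lib/jvm
# strip the top-level jdk-17.0.9 directory so that /usr/lib/jvm/bin exists
sudo tar -xvzf jdk-17.0.9_linux-x64_bin.tar.gz -C /usr/lib/jvm --strip-components=1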
After that, nothing more is needed other than setting the environment variables.
vim ~/.bashrc
Write the following into it. (Note: JAVA_HOME must point to wherever jdk-17.0.9_linux-x64_bin.tar.gz was extracted; the rule is simply that JAVA_HOME/bin is the bin directory of the extracted JDK.)
export JAVA_HOME=/usr/lib/jvm
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
When you're done, remember to run:
source ~/.bashrc
Then you can check:
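For example (the exact version string naturally depends on the JDK you installed):
java -version
# should report something like: java version "17.0.9" ...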
If that prints the version, the Java installation is done.
Next, let's look at the job configuration file DataX needs. We write this file ourselves; it is where we spell out exactly what DataX should do. The file name doesn't matter, but the extension must be .json.
In my case, I want to import data from a MySQL database into a Doris database. (Doris speaks the MySQL protocol on its FE query port, 9030 by default, so the regular mysqlwriter can write into it.) The JSON file can be built like this:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "your_username",
                        "password": "your_password",
                        "column": ["start_time", "cluster", "hostname", "ip", "device", "used_time", "data_type", "end_time"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:mysql://mysql_server_ip:3306/db_name"],
                                "table": ["table_name"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "your_username",
                        "password": "your_password",
                        "column": ["start_time", "cluster", "hostname", "ip", "device", "used_time", "data_type", "end_time"],
                        "preSql": [],
                        "postSql": [],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://doris_server_ip:9030/db_name",
                                "table": ["table_name"]
                            }
                        ]
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 3
            }
        }
    }
}
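Before handing the file to DataX, it doesn't hurt to confirm the JSON is well-formed; Python's built-in json.tool is enough for that (any JSON linter works):
python3 -m json.tool job.json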
Then you can run the command: python3 datax/bin/datax.py job.json
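If you want to sanity-check the installation independently of your own job first, the release tarball normally ships a stream-to-stream sample job (the path below assumes the default layout):
# self-test job that reads from and writes to an in-memory stream, no database needed
python3 datax/bin/datax.py datax/job/job.json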
In theory it should run now, but at this point I hit an error.
I finally fixed it by running: rm -rf datax/plugin/*/._*
For reference: DataX3.0:深入探索与实战应用 - CSDN博客
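As far as I can tell, the root cause is that the release tarball carries macOS ._* metadata entries under plugin/reader and plugin/writer, and DataX tries to load every entry there as a plugin. If the glob above misses any, a find-based sweep catches them too:
# remove any leftover macOS metadata entries anywhere under the plugin tree
find datax/plugin -name "._*" -exec rm -rf {} +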
After that, it is basically usable:
In the end, though, my run still failed, because my VM cannot reach the company's database servers, which is a pity. But overall the setup should work.
Later, in real use, I kept hitting this error:
Well, eventually.......
The cause is that the MySQL driver bundled with DataX is too old, which is why it could never connect to the database. Deleting the old driver and swapping in a newer one fixes it.
See this post: DataX同步mysql数据报错 无法连接mysql - 阿伦啊 - 博客园
Swap in a newer connector and it works. The original one, mysql-connector-java-5.1.34.jar, is too old (delete it).
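Roughly, the swap looks like this (the paths assume the default DataX plugin layout, and the 8.0.33 connector version is only an example; use whatever recent Connector/J you downloaded):
# drop the old driver from both the reader and the writer plugin
rm datax/plugin/reader/mysqlreader/libs/mysql-connector-java-5.1.34.jar
rm datax/plugin/writer/mysqlwriter/libs/mysql-connector-java-5.1.34.jar
# copy in the newer Connector/J jar
cp mysql-connector-j-8.0.33.jar datax/plugin/reader/mysqlreader/libs/
cp mysql-connector-j-8.0.33.jar datax/plugin/writer/mysqlwriter/libs/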
Finally....
$ python3 datax/bin/datax.py job.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2025-03-14 19:11:54.552 [main] INFO VMInfo - VMInfo# operatingSystem class => com.sun.management.internal.OperatingSystemImpl
2025-03-14 19:11:54.558 [main] INFO Engine - the machine info =>
osInfo: Bisheng 11 11.0.12+9
jvmInfo: Linux amd64 4.19.90-2107.6.0.0100.oe1.bclinux.x86_64
cpu num: 16
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [G1 Young Generation, G1 Old Generation]
MEMORY_NAME | allocation_size | init_size
CodeHeap 'profiled nmethods' | 116.37MB | 2.44MB
G1 Old Gen | 1,024.00MB | 970.00MB
G1 Survivor Space | -0.00MB | 0.00MB
CodeHeap 'non-profiled nmethods' | 116.38MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
Metaspace | -0.00MB | 0.00MB
G1 Eden Space | -0.00MB | 54.00MB
CodeHeap 'non-nmethods' | 7.25MB | 2.44MB
2025-03-14 19:11:54.576 [main] INFO Engine -
{
"content":[
{
"reader":{
"name":"mysqlreader",
"parameter":{
"column":[
"start_time",
"cluster",
"hostname",
"ip",
"device",
"used_time",
"data_type",
"end_time"
],
"connection":[
{
"jdbcUrl":[
"jdbc:mysql://xxxxxx:3306/gpu_statistics?useSSL=false"
],
"table":[
"hachi_gpu_day_used_time"
]
}
],
"password":"****************",
"username":"root"
}
},
"writer":{
"name":"mysqlwriter",
"parameter":{
"column":[
"start_time",
"cluster",
"hostname",
"ip",
"device",
"used_time",
"data_type",
"end_time"
],
"connection":[
{
"jdbcUrl":"jdbc:mysql://xxxxxxx:9030/smart_center?useSSL=false",
"table":[
"gpu_day_used_time_hachi_test"
]
}
],
"password":"**********",
"postSql":[],
"preSql":[],
"username":"root",
"writeMode":"insert"
}
}
}
],
"setting":{
"speed":{
"channel":3
}
}
}
2025-03-14 19:11:54.589 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2025-03-14 19:11:54.591 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2025-03-14 19:11:54.591 [main] INFO JobContainer - DataX jobContainer starts job.
2025-03-14 19:11:54.592 [main] INFO JobContainer - Set jobId = 0
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loads generally unnecessary.
2025-03-14 19:11:54.983 [job-0] INFO OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://172.22.20.8:3306/gpu_statistics?useSSL=false&yearIsDateType=false&zeroDateTiminyInt1isBit=false&rewriteBatchedStatements=true.
2025-03-14 19:11:55.005 [job-0] INFO OriginalConfPretreatmentUtil - table:[hachi_gpu_day_used_time] has columns:[id,hostname,ip,device,used_time,start_time,end_time,cluster,dat
2025-03-14 19:11:55.540 [job-0] INFO OriginalConfPretreatmentUtil - table:[gpu_day_used_time_hachi_test] all columns:[
start_time,cluster,hostname,ip,device,used_time,data_type,end_time
].
2025-03-14 19:11:55.574 [job-0] INFO OriginalConfPretreatmentUtil - Write data [
insert INTO %s (start_time,cluster,hostname,ip,device,used_time,data_type,end_time) VALUES(?,?,?,?,?,?,?,?)
], which jdbcUrl like:[jdbc:mysql://172.22.16.154:9030/smart_center?useSSL=false&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatem
2025-03-14 19:11:55.575 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2025-03-14 19:11:55.575 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] do prepare work .
2025-03-14 19:11:55.575 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do prepare work .
2025-03-14 19:11:55.576 [job-0] INFO JobContainer - jobContainer starts to do split ...
2025-03-14 19:11:55.576 [job-0] INFO JobContainer - Job set Channel-Number to 3 channels.
2025-03-14 19:11:55.581 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] splits to [1] tasks.
2025-03-14 19:11:55.582 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] splits to [1] tasks.
2025-03-14 19:11:55.599 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2025-03-14 19:11:55.603 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2025-03-14 19:11:55.604 [job-0] INFO JobContainer - Running by standalone Mode.
2025-03-14 19:11:55.609 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2025-03-14 19:11:55.615 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2025-03-14 19:11:55.615 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2025-03-14 19:11:55.646 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2025-03-14 19:11:55.651 [0-0-0-reader] INFO CommonRdbmsReader$Task - Begin to read record by Sql: [select start_time,cluster,hostname,ip,device,used_time,data_type,end_time fro
] jdbcUrl:[jdbc:mysql://172.22.20.8:3306/gpu_statistics?useSSL=false&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2025-03-14 19:12:05.633 [job-0] INFO StandAloneJobContainerCommunicator - Total 0 records, 0 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTitReaderTime 0.000s | Percentage 0.00%
2025-03-14 19:12:15.645 [job-0] INFO StandAloneJobContainerCommunicator - Total 12800 records, 856913 bytes | Speed 83.68KB/s, 1280 records/s | Error 0 records, 0 bytes | All s | All Task WaitReaderTime 0.150s | Percentage 0.00%
2025-03-14 19:12:25.646 [job-0] INFO StandAloneJobContainerCommunicator - Total 25088 records, 1683744 bytes | Speed 80.75KB/s, 1228 records/s | Error 0 records, 0 bytes | All91s | All Task WaitReaderTime 0.176s | Percentage 0.00%
2025-03-14 19:12:35.647 [job-0] INFO StandAloneJobContainerCommunicator - Total 37376 records, 2513454 bytes | Speed 81.03KB/s, 1228 records/s | Error 0 records, 0 bytes | All62s | All Task WaitReaderTime 0.216s | Percentage 0.00%
2025-03-14 19:12:45.648 [job-0] INFO StandAloneJobContainerCommunicator - Total 51712 records, 3466011 bytes | Speed 93.02KB/s, 1433 records/s | Error 0 records, 0 bytes | All69s | All Task WaitReaderTime 0.265s | Percentage 0.00%
2025-03-14 19:12:55.650 [job-0] INFO StandAloneJobContainerCommunicator - Total 66048 records, 4430671 bytes | Speed 94.21KB/s, 1433 records/s | Error 0 records, 0 bytes | All83s | All Task WaitReaderTime 0.312s | Percentage 0.00%
2025-03-14 19:13:05.651 [job-0] INFO StandAloneJobContainerCommunicator - Total 80384 records, 5391333 bytes | Speed 93.81KB/s, 1433 records/s | Error 0 records, 0 bytes | All52s | All Task WaitReaderTime 0.358s | Percentage 0.00%
2025-03-14 19:13:15.652 [job-0] INFO StandAloneJobContainerCommunicator - Total 94720 records, 6345816 bytes | Speed 93.21KB/s, 1433 records/s | Error 0 records, 0 bytes | All71s | All Task WaitReaderTime 0.410s | Percentage 0.00%
2025-03-14 19:13:25.653 [job-0] INFO StandAloneJobContainerCommunicator - Total 109056 records, 7314509 bytes | Speed 94.60KB/s, 1433 records/s | Error 0 records, 0 bytes | Al705s | All Task WaitReaderTime 0.444s | Percentage 0.00%
2025-03-14 19:13:35.655 [job-0] INFO StandAloneJobContainerCommunicator - Total 121344 records, 8131830 bytes | Speed 79.82KB/s, 1228 records/s | Error 0 records, 0 bytes | Al155s | All Task WaitReaderTime 0.468s | Percentage 0.00%
2025-03-14 19:13:45.656 [job-0] INFO StandAloneJobContainerCommunicator - Total 133632 records, 8952834 bytes | Speed 80.18KB/s, 1228 records/s | Error 0 records, 0 bytes | Al309s | All Task WaitReaderTime 0.492s | Percentage 0.00%
2025-03-14 19:13:55.657 [job-0] INFO StandAloneJobContainerCom