Datax在win10中的安装_windows安装datax_JMzz的博客-CSDN博客
DataX/userGuid.md at master · alibaba/DataX · GitHub
环境准备:
1.JDK(1.8以上,推荐1.8)
2.①Python(推荐Python2.7.X)
②Python(Python3.X.X的可以下载下面的安装包替换)
python3.0需要替换安装目录bin下的3个文件
替换文件下载:链接: 百度网盘 请输入提取码 提取码: re42
3.Apache Maven 3.x (Compile DataX) 非编译安装不需要
Python环境安装这里就不作说明,请自行下载安装。
1、下载解压
备注:我用的Python3.0.X 没有替换python相关文件
安装目录
E:\DATAX\datax
所有脚本请到 E:\DATAX\datax\bin 下执行
cmd
e:
cd E:\DATAX\datax\bin
2、自检脚本
python datax.py ../job/job.json
3、练手配置示例:从stream读取数据并打印到控制台
第一步、创建作业的配置文件(json格式)
可以通过命令查看配置模板: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}
例如:
python datax.py -r streamreader -w streamwriter
返回如下
DataX (DATAX-OPENSOURCE-3.0), From Alibaba ! Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved. Please refer to the streamreader document: https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md Please refer to the streamwriter document: https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md Please save the following configuration as a json file and use python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json to run the job. { "job": { "content": [ { "reader": { "name": "streamreader", "parameter": { "column": [], "sliceRecordCount": "" } }, "writer": { "name": "streamwriter", "parameter": { "encoding": "", "print": true } } } ], "setting": { "speed": { "channel": "" } } } }
根据模板配置json如下:
stream2stream.json
{ "job": { "content": [ { "reader": { "name": "streamreader", "parameter": { "sliceRecordCount": 10, "column": [ { "type": "long", "value": "10" }, { "type": "string", "value": "hello,你好,世界-DataX" } ] } }, "writer": { "name": "streamwriter", "parameter": { "encoding": "UTF-8", "print": true } } } ], "setting": { "speed": { "channel": 5 } } } }
第二步:启动DataX
$ cd {YOUR_DATAX_DIR_BIN}
$ python datax.py ./stream2stream.json
python datax.py ../job/stream2stream.json
同步结束,显示日志如下:
DataX (DATAX-OPENSOURCE-3.0), From Alibaba ! Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved. 2023-03-16 13:52:50.773 [main] INFO MessageSource - JVM TimeZone: GMT+08:00, Locale: zh_CN 2023-03-16 13:52:50.776 [main] INFO MessageSource - use Locale: zh_CN timeZone: sun.util.calendar.ZoneInfo[id="GMT+08:00",offset=28800000,dstSavings=0,useDaylight=false,transitions=0,lastRule=null] 2023-03-16 13:52:50.786 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl 2023-03-16 13:52:50.791 [main] INFO Engine - the machine info => osInfo: Oracle Corporation 1.8 25.172-b11 jvmInfo: Windows 10 amd64 10.0 cpu num: 8 totalPhysicalMemory: -0.00G freePhysicalMemory: -0.00G maxFileDescriptorCount: -1 currentOpenFileDescriptorCount: -1 GC Names [PS MarkSweep, PS Scavenge] MEMORY_NAME | allocation_size | init_size PS Eden Space | 256.00MB | 256.00MB Code Cache | 240.00MB | 2.44MB Compressed Class Space | 1,024.00MB | 0.00MB PS Survivor Space | 42.50MB | 42.50MB PS Old Gen | 683.00MB | 683.00MB Metaspace | -0.00MB | 0.00MB 2023-03-16 13:52:50.815 [main] INFO Engine - { "content":[ { "reader":{ "name":"streamreader", "parameter":{ "column":[ { "type":"long", "value":"10" }, { "type":"string", "value":"hello,你好,世界-DataX" } ], "sliceRecordCount":10 } }, "writer":{ "name":"streamwriter", "parameter":{ "encoding":"UTF-8", "print":true } } } ], "setting":{ "speed":{ "channel":5 } } } 2023-03-16 13:52:50.833 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null 2023-03-16 13:52:50.835 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0 2023-03-16 13:52:50.835 [main] INFO JobContainer - DataX jobContainer starts job. 2023-03-16 13:52:50.837 [main] INFO JobContainer - Set jobId = 0 2023-03-16 13:52:50.855 [job-0] INFO JobContainer - jobContainer starts to do prepare ... 2023-03-16 13:52:50.856 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do prepare work . 2023-03-16 13:52:50.857 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work . 2023-03-16 13:52:50.857 [job-0] INFO JobContainer - jobContainer starts to do split ... 2023-03-16 13:52:50.858 [job-0] INFO JobContainer - Job set Channel-Number to 5 channels. 2023-03-16 13:52:50.859 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] splits to [5] tasks. 2023-03-16 13:52:50.859 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [5] tasks. 2023-03-16 13:52:50.880 [job-0] INFO JobContainer - jobContainer starts to do schedule ... 2023-03-16 13:52:50.889 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups. 2023-03-16 13:52:50.892 [job-0] INFO JobContainer - Running by standalone Mode. 2023-03-16 13:52:50.900 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [5] channels for [5] tasks. 2023-03-16 13:52:50.905 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated. 2023-03-16 13:52:50.906 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated. 2023-03-16 13:52:50.916 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] attemptCount[1] is started 2023-03-16 13:52:50.919 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started 2023-03-16 13:52:50.923 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] attemptCount[1] is started 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 2023-03-16 13:52:50.929 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 2023-03-16 13:52:50.933 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] attemptCount[1] is started 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 10 hello,你好,世界-DataX 2023-03-16 13:52:51.049 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[130]ms 2023-03-16 13:52:51.049 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[120]ms 2023-03-16 13:52:51.050 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] is successed, used[135]ms 2023-03-16 13:52:51.052 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] is successed, used[129]ms 2023-03-16 13:52:51.052 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] is successed, used[119]ms 2023-03-16 13:52:51.053 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks. 2023-03-16 13:53:00.918 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.008s | Percentage 100.00% 2023-03-16 13:53:00.919 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks. 2023-03-16 13:53:00.923 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work. 2023-03-16 13:53:00.923 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do post work. 2023-03-16 13:53:00.923 [job-0] INFO JobContainer - DataX jobId [0] completed successfully. 2023-03-16 13:53:00.924 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: E:\DATAX\datax\hook 2023-03-16 13:53:00.925 [job-0] INFO JobContainer - [total cpu info] => averageCpu | maxDeltaCpu | minDeltaCpu -1.00% | -1.00% | -1.00% [total gc info] => NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime PS MarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s PS Scavenge | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s 2023-03-16 13:53:00.925 [job-0] INFO JobContainer - PerfTrace not enable! 2023-03-16 13:53:00.926 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.008s | Percentage 100.00% 2023-03-16 13:53:00.927 [job-0] INFO JobContainer - 任务启动时刻 : 2023-03-16 13:52:50 任务结束时刻 : 2023-03-16 13:53:00 任务总计耗时 : 10s 任务平均流量 : 95B/s 记录写入速度 : 5rec/s 读出记录总数 : 50 读写失败总数 : 0
4、实际配置
这里只测试了mysql的相关配置,其他的需要继续研究
脚本格式信息可以去https://github.com/alibaba/DataX/查看
因为和clickhouse一起学习,这里可以看到支持对 CLickHouse的写
1、mysqlreader
DataX/mysqlreader/doc/mysqlreader.md at master · alibaba/DataX · GitHub
table column方式和querysql方式是冲突的。只能用一种
干货:
jdbcUrl 可以配置多个,依次检查合法性
table 可以配置多个,需保证多张表是同一schema结构?,table必须包含在connection配置单元中
1、配置一个从Mysql数据库同步抽取数据到本地的作业:
通过table column方式
mysql2stream1.json
{ "job": { "setting": { "speed": { "channel": 3 }, "errorLimit": { "record": 0, "percentage": 0.02 } }, "content": [ { "reader": { "name": "mysqlreader", "parameter": { "username": "root", "password": "sa", "column": [ "ryxm", "rysfz" ], "splitPk": "id", "connection": [ { "table": [ "sys_czry" ], "jdbcUrl": [ "jdbc:mysql://172.16.0.101:3306/qyjx_v3.1?useUnicode=true&characterEncoding=utf8&zeroDateTimeBehavior=convertToNull&useSSL=false&serverTimezone=GMT%2B8&&nullCatalogMeansCurrent=true&allowMultiQueries=true&rewriteBatchedStatements=true" ] } ] } }, "writer": { "name": "streamwriter", "parameter": { "print":true } } } ] } }
python datax.py ../job/mysql2stream1.json
2、配置一个自定义SQL的数据库同步任务到本地内容的作业:
通过querysql方式
mysql2stream2.json
{ "job": { "setting": { "speed": { "channel": 1 } }, "content": [ { "reader": { "name": "mysqlreader", "parameter": { "username": "root", "password": "sa", "connection": [ { "querySql": [ "SELECT ryxm,rysfz,rygh from sys_czry;" ], "jdbcUrl": [ "jdbc:mysql://172.16.0.101:3306/qyjx_v3.1?useUnicode=true&characterEncoding=utf8&zeroDateTimeBehavior=convertToNull&useSSL=false&serverTimezone=GMT%2B8&&nullCatalogMeansCurrent=true&allowMultiQueries=true&rewriteBatchedStatements=true" ] } ] } }, "writer": { "name": "streamwriter", "parameter": { "print": true, "encoding": "UTF-8" } } } ] } }
python datax.py ../job/mysql2stream2.json
2、mysqlwrier
DataX/mysqlwriter/doc/mysqlwriter.md at master · alibaba/DataX · GitHub
1、这里使用一份从内存产生到 Mysql 导入的数据
{ "job": { "setting": { "speed": { "channel": 1 } }, "content": [ { "reader": { "name": "streamreader", "parameter": { "column" : [ { "value": "DataX", "type": "string" } ], "sliceRecordCount": 1000 } }, "writer": { "name": "mysqlwriter", "parameter": { "writeMode": "insert", "username": "root", "password": "root", "column": [ "name" ], "session": [ "set session sql_mode='ANSI'" ], "preSql": [ "delete from test" ], "connection": [ { "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax?useUnicode=true&characterEncoding=gbk", "table": [ "test" ] } ] } } } ] } }
python datax.py ../job/stream2mysql1.json
结果:
2、这里使用一份从mysql(服务器1)产生到 Mysql(本地) 导入的数据
模拟跨服务器、数据库环境
{ "job": { "setting": { "speed": { "channel": 1 } }, "content": [ { "reader": { "name": "mysqlreader", "parameter": { "username": "root", "password": "sa", "connection": [ { "querySql": [ "SELECT ryxm,rysfz from sys_czry;" ], "jdbcUrl": [ "jdbc:mysql://172.16.0.101:3306/qyjx_v3.1?useUnicode=true&characterEncoding=utf8&zeroDateTimeBehavior=convertToNull&useSSL=false&serverTimezone=GMT%2B8&&nullCatalogMeansCurrent=true&allowMultiQueries=true&rewriteBatchedStatements=true" ] } ] } }, "writer": { "name": "mysqlwriter", "parameter": { "writeMode": "insert", "username": "root", "password": "root", "column": [ "name","rysfz" ], "session": [ "set session sql_mode='ANSI'" ], "preSql": [ "delete from test" ], "connection": [ { "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax?useUnicode=true&characterEncoding=gbk", "table": [ "test" ] } ] } } } ] } }
python datax.py ../job/mysql2mysql.json
结果:
异常:
1、如果出现乱码的话
先输入CHCP 65001