Installation and Deployment

Extraction

Deploy MongoShake close to the target, for example on the target host itself.

tar -zxvf mongo-shake-v2.8.1.tgz

Configuration

Main parameters for a homogeneous environment:
[Figure: parameter screenshot, not reproduced]

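Startup

Run the following command to start the sync task. -verbose 0 sends the log output to a file, and nohup with a trailing & keeps the process running in the background.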

nohup ./collector.linux -conf=collector.conf -verbose 0 &

Full Sync

Watch the log output. A line like the following indicates that the full sync has finished:

[09:38:57 CST 2019/06/20] [INFO] (mongoshake/collector.(*ReplicationCoordinator).Run:80) finish full sync, start incr sync with timestamp: fullBeginTs[1560994443], fullFinishTs[1560994737]
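
When the log is written to a file, the completion line above can also be detected by a script. The following is a minimal Python sketch, assuming the message format matches the sample line in this post (other MongoShake versions may word it differently); parse_full_sync_done is a hypothetical helper name, not part of MongoShake.

```python
import re

# Pattern built from the sample log line in this post (an assumption about the format).
FULL_SYNC_DONE = re.compile(
    r"finish full sync, start incr sync with timestamp: "
    r"fullBeginTs\[(\d+)\], fullFinishTs\[(\d+)\]"
)

def parse_full_sync_done(line):
    """Return (begin_ts, finish_ts) if the line reports full-sync completion, else None."""
    m = FULL_SYNC_DONE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

sample = (
    "[09:38:57 CST 2019/06/20] [INFO] "
    "(mongoshake/collector.(*ReplicationCoordinator).Run:80) "
    "finish full sync, start incr sync with timestamp: "
    "fullBeginTs[1560994443], fullFinishTs[1560994737]"
)
begin_ts, finish_ts = parse_full_sync_done(sample)
# Difference between the two timestamps, treating both as Unix seconds.
print(finish_ts - begin_ts)
```

In practice the function would be applied to each line read from collector.log until it returns a non-None value.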

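Problems

During full sync the error below was reported. The source was running MongoDB 3.6.3, which has a bug; a minor-version upgrade to 3.6.23 resolved the problem.
Bug link: https://jira.mongodb.org/browse/SERVER-34810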

[2022/11/26 17:47:46 CST] [CRIT] splitter reader[DocumentReader id[0], src[mongodb://admin:***@192.168.62.25:27017,192.168.62.26:27017,192.168.62.27:27017] ns[{paymentdb test}] query[map[]]] get next document failed: (CursorNotFound) cursor id 75188853290 not found

If you run into other problems, you can report them directly to the author at:
https://github.com/alibaba/MongoShake/issues

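How do I monitor and manage MongoShake's running state?

MongoShake exposes monitoring in two parts, full sync and incremental sync, each on its own port. Full-sync monitoring has been available since v2.4.1; incremental monitoring has been available since v1.0.0.
Note: changes made through the sentinel interface are not persisted. Any variable modified through sentinel is reset when MongoShake restarts and must be configured again. Persistence is planned for a later release.

How do I view the monitoring data?

Just curl the corresponding port. The port is set in the configuration file: before v2.4.1 it was http_profile (default 9100); starting with v2.4.1 it is split into two, full_sync.http_port (default 9101) for full sync and incr_sync.http_port (default 9100) for incremental sync. A plain curl call shows the result (piping it through python -m json.tool makes it easier to read):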

vinllen@~/code/MongoShake$ curl -s  http://127.0.0.1:9101
[{"Uri":"/conf","Method":"GET"},{"Uri":"/progress","Method":"GET"},...]
vinllen@~/code/MongoShake$ curl -s  http://127.0.0.1:9101 | python -m json.tool
[
    {
        "Method": "GET",
        "Uri": "/conf"
    },
    {
        "Method": "GET",
        "Uri": "/progress"
    }
    ... // omitted
]
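
The endpoint index returned above is plain JSON, so it can be turned into a list of URLs to query. This is a minimal Python sketch, assuming the response has the shape shown in the sample; endpoint_uris is a hypothetical helper name, and the base URL is simply the one used in this post.

```python
import json

def endpoint_uris(index_json):
    """Return the URIs of all GET endpoints listed in the index response."""
    return [e["Uri"] for e in json.loads(index_json) if e.get("Method") == "GET"]

# Shortened copy of the sample response (the elided entries are left out).
sample = '[{"Uri":"/conf","Method":"GET"},{"Uri":"/progress","Method":"GET"}]'

for uri in endpoint_uris(sample):
    print("http://127.0.0.1:9101" + uri)
```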

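The response lists conf and progress, and each of these endpoints can be queried with curl in turn. For example, conf returns the configuration file contents: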

vinllen@~/code/MongoShake$ curl -s  http://127.0.0.1:9101/conf | python -m json.tool
{
    "CheckpointInterval": 5000,
    "CheckpointStartPosition": 1585220856,
    "CheckpointStorage": "database",
    "CheckpointStorageCollection": "ckpt_default",
    "CheckpointStorageDb": "mongoshake",
    "CheckpointStorageUrl": "mongodb://xxx:31773,xxx:31772,xxx:31771",
    "ConfVersion": 1,
    "FilterDDLEnable": false,
    "FilterNamespaceBlack": null,
    "FilterNamespaceWhite": null,
    "FilterPassSpecialDb": [],
    "FullSyncCollectionDrop": true,
    "FullSyncCreateIndex": "foreground",
    "FullSyncExecutorDebug": false,
    "FullSyncExecutorFilterOrphanDocument": false,
    "FullSyncExecutorInsertOnDupUpdate": false,
    "FullSyncExecutorMajorityEnable": false,
    "FullSyncHTTPListenPort": 9100,
    "FullSyncReaderCollectionParallel": 6,
    "FullSyncReaderDocumentBatchSize": 128,
    "FullSyncReaderOplogStoreDisk": false,
    "FullSyncReaderOplogStoreDiskMaxSize": 256000,
    "FullSyncReaderReadDocumentCount": 0,
    "FullSyncReaderWriteDocumentParallel": 8,
    "HTTPListenPort": 9100,
    "Id": "mongoshake",
    "IncrSyncAdaptiveBatchingMaxSize": 1024,
    "IncrSyncCollisionEnable": false,
    "IncrSyncConflictWriteTo": "none",
    "IncrSyncDBRef": false,
    "IncrSyncExecutor": 1,
    "IncrSyncExecutorDebug": false,
    "IncrSyncExecutorInsertOnDupUpdate": false,
    "IncrSyncExecutorMajorityEnable": false,
    "IncrSyncExecutorUpsert": false,
    "IncrSyncFetcherBufferCapacity": 256,
    "IncrSyncHTTPListenPort": 9100,
    "IncrSyncMongoFetchMethod": "change_stream",
    "IncrSyncOplogGIDS": [],
    "IncrSyncReaderBufferTime": 1,
    "IncrSyncReaderDebug": "discard",
    "IncrSyncShardKey": "collection",
    "IncrSyncTunnel": "",
    "IncrSyncTunnelAddress": [
        "mongodb://root:***@s-xxx-pub.mongodb.rds.aliyuncs.com:3717"
    ],
    "IncrSyncTunnelMessage": "json",
    "IncrSyncWorker": 8,
    "IncrSyncWorkerBatchQueueSize": 64,
    "IncrSyncWorkerOplogCompressor": "none",
    "LogDirectory": "",
    "LogFileName": "collector.log",
    "LogFlush": false,
    "LogLevel": "info",
    "MasterQuorum": false,
    "MongoConnectMode": "secondaryPreferred",
    "MongoCsUrl": "",
    "MongoSUrl": "",
    "MongoUrls": [
        "mongodb://xxx:31771,xxx:31772,xxx:31773"
    ],
    "SyncMode": "full",
    "SystemProfile": 9200,
    "SystemProfilePort": 9200,
    "TransformNamespace": [],
    "Tunnel": "direct",
    "TunnelAddress": [
        "mongodb://root:***@s-xxx-pub.mongodb.rds.aliyuncs.com:3717"
    ],
    "TunnelMessage": "json",
    "Version": "improve-2.4.1,6badf6dfa00ebc0fc1a34e4864814733c5849daf,release,go1.10.8,2020-04-03_16:15:00"
}

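1. Full-Sync Monitoring

Query the port set in full_sync.http_port with curl. The following endpoints are currently available:

conf. Shows the configuration file contents.
progress. Shows the progress of the full sync.
index. Index sync status (not yet available).
sentinel. Controls the full sync; currently only rate limiting is exposed.

1.1 progress: sync progress

Below is a sample progress response; see the comments for details: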

vinllen@~/code/MongoShake$ curl -s  http://127.0.0.1:9101/progress | python -m json.tool
{
    "progress": "23.40%", // approximate overall progress; 100% means the full sync is complete
    "total_collection_number": 47, // total number of collections
    "finished_collection_number": 11, // number of collections already synced
    "processing_collection_number": 6, // number of collections currently syncing
    "wait_collection_number": 30, // number of collections waiting to sync
    "collection_metric": { // per-collection sync details
        "a.b": "100.00% (4/4)", // db=a, collection=b: 100% complete, 4 of 4 documents synced
        "pock.hh": "9.20% (8064/87667)",  // db=pock, collection=hh: 9.20% complete, 8064 of 87667 documents synced
        "pock.hh2": "100.00% (28/28)",
        "test.mmm": "100.00% (1/1)",
        "test.test": "100.00% (7/7)",
        "ycsb.usertable2": "0.32% (3584/1113323)",
        "ycsb.usertable3": "0.26% (2176/822695)",
        "ycsb.usertable4": "0.51% (2176/428967)",
        "yy.x": "100.00% (4/4)",
        "zz-2018-12-11.cc1": "0.00% (0/289)",
        "zz-2018-12-11.cc2": "0.00% (0/308)",
        "zz-2018-12-11.cc3": "-", // "-" means the sync has not started yet
        "zz-2018-12-12.cc1": "-",
        "zz.flush": "100.00% (1/1)",
        "zz.mmm": "100.00% (25/25)",
        "zz.x": "100.00% (42/42)",
        ... // remainder omitted
    }
}

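In some cases a collection's percentage can exceed 100%. For example, suppose the source collection held 150 documents when the full sync started and grew to 160 during the sync. Progress is still computed against the 150 documents counted before the full sync began, so the displayed percentage goes above 100%. This is normal.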
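
The per-collection strings in collection_metric are easy to parse, which also makes the over-100% case concrete. This is a minimal Python sketch, assuming the value format shown in the sample response; parse_metric and percent are hypothetical helper names.

```python
import re

# Matches values such as "9.20% (8064/87667)", as seen in the sample response.
METRIC = re.compile(r"^\s*[\d.]+%\s*\((\d+)/(\d+)\)\s*$")

def parse_metric(value):
    """Parse a collection_metric value into (synced, total); None means not started ("-")."""
    if value.strip() == "-":
        return None
    m = METRIC.match(value)
    if m is None:
        raise ValueError("unexpected metric format: %r" % value)
    return int(m.group(1)), int(m.group(2))

def percent(synced, total):
    """Recompute the percentage; None if the total is zero."""
    if total == 0:
        return None
    return 100.0 * synced / total

print(parse_metric("9.20% (8064/87667)"))
print(parse_metric("-"))
# 150 documents counted at the start, 160 actually synced: above 100%.
print(round(percent(160, 150), 2))
```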

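1.2 sentinel control

sentinel currently offers only rate limiting. Note that the full-sync limit is not enforced at exactly the rate you specify; it is applied in multiples of full_sync.reader.document_batch_size from the configuration file. For example, with a batch size of 128:

A setting of 100 limits the rate to 128
A setting of 128 limits the rate to 128
A setting of 129 limits the rate to 256
A setting of 1000 limits the rate to 1024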
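
The four examples above are consistent with rounding the configured value up to the next multiple of the batch size. The Python sketch below reproduces that arithmetic; it is inferred from the examples, not taken from MongoShake's source, and effective_tps is a hypothetical name. The batch size of 128 is an assumption matching the examples.

```python
def effective_tps(configured_tps, batch_size=128):
    """Round the configured limit up to a multiple of batch_size; 0 means unlimited."""
    if configured_tps <= 0:
        return 0
    batches = -(-configured_tps // batch_size)  # ceiling division on integers
    return batches * batch_size

for value in (100, 128, 129, 1000):
    print(value, "->", effective_tps(value))
```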

Viewing sentinel

First check the current sentinel configuration:

vinllen@~/code/MongoShake$ curl -s 127.0.0.1:9101/sentinel | python -m json.tool
{
    "TPS": 0 // the default, 0, means no rate limit; a non-zero value enables rate limiting
}

Configuring sentinel

For example, to change the limit from 0 (unlimited) to 1000:

vinllen@~/code/MongoShake$ curl -X POST --data '{"TPS":1000}' 127.0.0.1:9101/sentinel/options
{"sentinel":"success"}

Check again that the setting has taken effect:

vinllen@~/code/MongoShake$ curl -s 127.0.0.1:9101/sentinel | python -m json.tool
{
    "TPS": 1000
}

For the actual current sync rate, check the tps field in the log. In the line below, for example, tps is 384:

[17:57:02 CST 2020/04/07] [INFO] (mongoshake/common.(*ReplicationMetric).startup.func1:175) [name=zz-186-replica-3177x, stage=full, get=616833, tps=384]

Note: if you do not need rate limiting, set the value to 0. Do not set a very large value instead, because that consumes a lot of CPU.
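
To track the real rate over time, the tps value can be pulled out of the metric log line shown above. This is a minimal Python sketch, assuming the bracketed name/stage/get/tps layout of that sample line; parse_metric_line is a hypothetical helper name.

```python
import re

# Layout taken from the sample log line in this post (an assumption about the format).
METRIC_LINE = re.compile(r"\[name=([^,\]]+), stage=([^,\]]+), get=(\d+), tps=(\d+)\]")

def parse_metric_line(line):
    """Return the metric fields of a log line as a dict, or None if the line has none."""
    m = METRIC_LINE.search(line)
    if m is None:
        return None
    return {
        "name": m.group(1),
        "stage": m.group(2),
        "get": int(m.group(3)),
        "tps": int(m.group(4)),
    }

sample = (
    "[17:57:02 CST 2020/04/07] [INFO] "
    "(mongoshake/common.(*ReplicationMetric).startup.func1:175) "
    "[name=zz-186-replica-3177x, stage=full, get=616833, tps=384]"
)
print(parse_metric_line(sample)["tps"])
```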

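2. Incremental Monitoring

Query the port set in incr_sync.http_port (v2.4.1 and later) or http_profile (before v2.4.1) with curl. The endpoints below are currently available. If the roles worker, executor, and persister are unfamiliar, see the internal architecture introduction.

conf. Shows the configuration file contents.
repl. Overall replication status, mainly the checkpoint positions of the current sync.
queue. Usage of the syncer's internal queues.
worker. Basic status of each worker.
executor. Statistics for the write side.
persist. Internal status of the persister role.
sentinel. Internal control information.
sentinel/options. Internal control options.

2.1 repl

Overall replication status, mainly the checkpoint positions of the current sync.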

vinllen@~/code/MongoShake$ curl -s http://127.0.0.1:9100/repl | python -m json.tool
{
    "logs_get": 2, // total number of oplog entries fetched
    "logs_repl": 0,  // total number of oplog entries MongoShake has tried to write to the target
    "logs_success": 0, // number of oplog entries written successfully
    "lsn": { // checkpoint position already fetched (not necessarily written)
        "time": "1970-01-01 08:00:00", // 1970-01-01 is the initial value and means no data has been written yet
        "ts": "0",
        "unix": 0
    },
    "lsn_ack": { // checkpoint position already written to the target (written, but not necessarily persisted)
        "time": "1970-01-01 08:00:00",
        "ts": "0",
        "unix": 0
    },
    "lsn_ckpt": { // checkpoint position already persisted
        "time": "1970-01-01 08:00:00",
        "ts": "0",
        "unix": 0
    },
    "now": { // current time
        "time": "2020-04-03 18:32:28",
        "unix": 1585909948
    },
    "replset": "zz-186-replica-3177x", // name of the source
    "tag": "improve-2.4.1,6badf6dfa00ebc0fc1a34e4864814733c5849daf,release,go1.10.8,2020-04-03_18:19:37", // MongoShake code version
    "tps": 0, // oplog sync rate (qps)
    "who": "mongoshake" // id
}

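If lsn_ckpt has not advanced for a long time and its gap from now keeps growing, the sync is usually in trouble: either the write traffic is too heavy and the lag is increasing, or the sync has stopped. On source versions 3.4 and later, a noop oplog entry is generated periodically even when there are no writes, which keeps lsn_ckpt moving. On versions before 3.4, a growing gap during a long period without writes is normal.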
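
The gap between lsn_ckpt and now can be computed directly from the repl response. This is a minimal Python sketch, assuming the field names shown in the sample; checkpoint_lag_seconds and is_lagging are hypothetical helpers, and the 60-second threshold is an arbitrary illustration, not a MongoShake default.

```python
def checkpoint_lag_seconds(repl):
    """Seconds between the persisted checkpoint and now; None if no checkpoint exists yet."""
    ckpt = repl["lsn_ckpt"]["unix"]
    if ckpt == 0:  # initial value: nothing has been persisted so far
        return None
    return repl["now"]["unix"] - ckpt

def is_lagging(repl, threshold_seconds=60):
    """True if a checkpoint exists and is older than threshold_seconds (assumed value)."""
    lag = checkpoint_lag_seconds(repl)
    return lag is not None and lag > threshold_seconds

# Values from the sample response: no checkpoint has been written yet.
fresh = {"lsn_ckpt": {"unix": 0}, "now": {"unix": 1585909948}}
# Made-up later state for illustration: checkpoint 300 seconds behind now.
behind = {"lsn_ckpt": {"unix": 1585909648}, "now": {"unix": 1585909948}}

print(checkpoint_lag_seconds(fresh), is_lagging(fresh))
print(checkpoint_lag_seconds(behind), is_lagging(behind))
```

As the paragraph above notes, a large gap on a source older than 3.4 with no writes is expected, so a check like this needs to be interpreted together with the source's write activity.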

2.2 queue

Usage of the syncer's internal queues.

vinllen@~/code/MongoShake$ curl -s http://127.0.0.1:9100/queue | python -m json.tool
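{
    "logs_queue_size": 64, // total capacity of logs_queue
    "pending_queue_size": 64, // total capacity of pending_queue
    "persister_buffer_used": 0, // how much of persister_buffer is in use
    "syncer_inner_queue": [ // usage of the syncer's internal queues
        {
            "logs_queue_used": 0, // how much of logs_queue is in use; a full queue (=64) usually means downstream processing is the bottleneck, and 0 usually means fetching is the bottleneck; the same applies to the queues below
            "pending_queue_used": 0, // how much of pending_queue is in use
            "queue_id": 0 // queue id; a replica set has 4 queues by default, and a sharded cluster has one per shard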
        },
        {
            "logs_queue_used": 0,
            "pending_queue_used": 0,
            "queue_id": 1
        },
        {
            "logs_queue_used": 0,
            "pending_queue_used": 0,
            "queue_id": 2
        },
        {
            "logs_queue_used": 0,
            "pending_queue_used": 0,
            "queue_id": 3
        }
    ],
    "syncer_replica_set_name": "zz-186-replica-3177x" // name of the source MongoDB
}
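
The rule of thumb in the comments above (full queue points downstream, empty queue points at fetching) can be applied to every queue in one pass. This is a minimal Python sketch, assuming the field names from the sample response; diagnose_queues and its labels are hypothetical, and an empty queue can of course also just mean an idle source.

```python
def diagnose_queues(queue_status):
    """Map each queue_id to a rough hint based on how full logs_queue is."""
    size = queue_status["logs_queue_size"]
    hints = {}
    for q in queue_status["syncer_inner_queue"]:
        used = q["logs_queue_used"]
        if used >= size:
            hints[q["queue_id"]] = "downstream"  # queue full: writers cannot keep up
        elif used == 0:
            hints[q["queue_id"]] = "fetch"       # queue empty: fetching is the limit (or idle)
        else:
            hints[q["queue_id"]] = "ok"
    return hints

# Made-up values for illustration, using the field names of the sample response.
status = {
    "logs_queue_size": 64,
    "syncer_inner_queue": [
        {"logs_queue_used": 0, "pending_queue_used": 0, "queue_id": 0},
        {"logs_queue_used": 64, "pending_queue_used": 3, "queue_id": 1},
        {"logs_queue_used": 10, "pending_queue_used": 0, "queue_id": 2},
    ],
}
print(diagnose_queues(status))
```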

2.3 worker

Internal status of each worker.

vinllen@~/code/MongoShake$ curl -s http://127.0.0.1:9100/worker | python -m json.tool
[
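    {
        "count": 0, // number of entries this worker has synced so far
        "jobs_in_queue": 0, // number of jobs currently in the worker queue
        "jobs_unack_buffer": 0, // current size of the unacknowledged buffer
        "last_ack": "0", // ts of the last acknowledged oplog entry
        "last_unack": "0", // ts of the last unacknowledged oplog entry
        "worker_id": 0 // worker id; the number of workers is set in the configuration file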
    },
    {
        "count": 0,
        "jobs_in_queue": 0,
        "jobs_unack_buffer": 0,
        "last_ack": "0",
        "last_unack": "0",
        "worker_id": 1
    },
    {
        "count": 0,
        "jobs_in_queue": 0,
        "jobs_unack_buffer": 0,
        "last_ack": "0",
        "last_unack": "0",
        "worker_id": 2
    },
    {
        "count": 0,
        "jobs_in_queue": 0,
        "jobs_unack_buffer": 0,
        "last_ack": "0",
        "last_unack": "0",
        "worker_id": 3
    },
    {
        "count": 0,
        "jobs_in_queue": 0,
        "jobs_unack_buffer": 0,
        "last_ack": "0",
        "last_unack": "0",
        "worker_id": 4
    },
    {
        "count": 0,
        "jobs_in_queue": 0,
        "jobs_unack_buffer": 0,
        "last_ack": "0",
        "last_unack": "0",
        "worker_id": 5
    },
    {
        "count": 0,
        "jobs_in_queue": 0,
        "jobs_unack_buffer": 0,
        "last_ack": "0",
        "last_unack": "0",
        "worker_id": 6
    },
    {
        "count": 0,
        "jobs_in_queue": 0,
        "jobs_unack_buffer": 0,
        "last_ack": "0",
        "last_unack": "0",
        "worker_id": 7
    }
]
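
With eight workers the raw list is long, so a short summary is often easier to read. This is a minimal Python sketch, assuming the field names from the sample response; summarize_workers is a hypothetical helper name.

```python
def summarize_workers(workers):
    """Aggregate the per-worker entries of the worker endpoint."""
    return {
        "total_synced": sum(w["count"] for w in workers),
        "total_in_queue": sum(w["jobs_in_queue"] for w in workers),
        # Workers that still hold entries the target has not acknowledged.
        "workers_with_unacked": [
            w["worker_id"] for w in workers if w["jobs_unack_buffer"] > 0
        ],
    }

# Made-up values for illustration, using the field names of the sample response.
workers = [
    {"count": 120, "jobs_in_queue": 0, "jobs_unack_buffer": 0, "worker_id": 0},
    {"count": 80, "jobs_in_queue": 2, "jobs_unack_buffer": 5, "worker_id": 1},
]
print(summarize_workers(workers))
```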

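2.4 executor (available since v2.4)

Statistics for writes to MongoDB. Each executor corresponds to one worker. By default (incr_sync.shard_key = collection), a collection is written by only one executor, but one executor may handle several collections.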

vinllen@~/code/MongoShake$ curl -s http://127.0.0.1:9100/executor | python -m json.tool
[
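    {
        "ddl": 0, // number of DDL operations written to the target MongoDB
        "delete": 0, // number of delete operations written to the target MongoDB
        "error": 0, // number of write errors
        "id": 0, // id of this executor
        "insert": 0, // number of insert operations written to the target MongoDB
        "unknown": 0, // number of operations of unknown type
        "update": 0 // number of update operations written to the target MongoDB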
    },
    {
        "ddl": 0,
        "delete": 0,
        "error": 0,
        "id": 1,
        "insert": 0,
        "unknown": 0,
        "update": 0
    },
    {
        "ddl": 0,
        "delete": 0,
        "error": 0,
        "id": 2,
        "insert": 0,
        "unknown": 0,
        "update": 0
    },
    {
        "ddl": 0,
        "delete": 0,
        "error": 0,
        "id": 3,
        "insert": 0,
        "unknown": 0,
        "update": 0
    },
    {
        "ddl": 0,
        "delete": 0,
        "error": 0,
        "id": 4,
        "insert": 0,
        "unknown": 0,
        "update": 0
    },
    {
        "ddl": 0,
        "delete": 0,
        "error": 0,
        "id": 5,
        "insert": 0,
        "unknown": 0,
        "update": 0
    },
    {
        "ddl": 0,
        "delete": 0,
        "error": 0,
        "id": 6,
        "insert": 0,
        "unknown": 0,
        "update": 0
    },
    {
        "ddl": 0,
        "delete": 0,
        "error": 0,
        "id": 7,
        "insert": 0,
        "unknown": 0,
        "update": 0
    }
]
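
Because the counters are reported per executor, totals across all executors have to be summed by the caller. This is a minimal Python sketch, assuming the field names from the sample response; executor_totals is a hypothetical helper name.

```python
COUNTERS = ("insert", "update", "delete", "ddl", "unknown", "error")

def executor_totals(executors):
    """Sum each counter over all executors returned by the executor endpoint."""
    return {key: sum(e[key] for e in executors) for key in COUNTERS}

# Made-up values for illustration, using the field names of the sample response.
executors = [
    {"ddl": 1, "delete": 2, "error": 0, "id": 0, "insert": 10, "unknown": 0, "update": 4},
    {"ddl": 0, "delete": 1, "error": 1, "id": 1, "insert": 7, "unknown": 0, "update": 3},
]
totals = executor_totals(executors)
print(totals)
```

A non-zero error total is the first thing worth checking when lsn_ack stops advancing.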

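2.5 sentinel and sentinel/options

sentinel/options controls some internal behavior, such as temporarily pausing the link (it can be resumed later) or limiting the sync rate (TPS). sentinel shows the current settings.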

vinllen@~/code/MongoShake$ curl -s http://127.0.0.1:9100/sentinel | python -m json.tool
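{
    "DuplicatedDump": false, // whether to dump the conflicting data when an insert hits a duplicate on the target; the dump location is set by incr_sync.conflict_write_to
    "OplogDump": 0, // whether to print each oplog entry: 0 = none, 1 = sampled, 2 = all
    "Pause": false, // whether the link is paused
    "TPS": 0 // tps/qps setting for the current sync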
}

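v2.4.6 adds TargetDelay for delayed sync; for the interface, see the article "v2.4.6版本增加延迟同步功能" (v2.4.6 adds delayed sync).
sentinel/options sets the parameters above with a plain curl POST. For example, to pause the link: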

vinllen@~/code/MongoShake$ curl -X POST --data '{"Pause": true}' http://127.0.0.1:9100/sentinel/options
{"sentinel":"success"}
vinllen@~/code/MongoShake$ curl -s http://127.0.0.1:9100/sentinel | python -m json.tool
{
    "DuplicatedDump": false,
    "OplogDump": 0,
    "Pause": true, // now true
    "TPS": 0
}
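
The same POST can be issued from a script, which is handy for re-applying settings after a restart since sentinel changes are not persisted. This is a minimal Python sketch using only the standard library; build_sentinel_request is a hypothetical helper, and sending {"Pause": false} to resume the link is an assumption based on the pause example above, not something this post demonstrates.

```python
import json
import urllib.request

def build_sentinel_request(base_url, options):
    """Build (but do not send) a POST to the sentinel/options endpoint."""
    body = json.dumps(options).encode("utf-8")
    url = base_url.rstrip("/") + "/sentinel/options"
    return urllib.request.Request(url, data=body, method="POST")

# Assumed counterpart of the pause example: set Pause back to false.
req = build_sentinel_request("http://127.0.0.1:9100", {"Pause": False})
print(req.get_method(), req.full_url, req.data)

# To actually send it against a running MongoShake instance:
# print(urllib.request.urlopen(req).read())
```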