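Rancher version: v2.6.8
Kubernetes version: v1.22.13+rke2r1
Flink cluster version: 1.15.0
Flink deployment mode: session cluster
Preface: I ran into many problems while installing by following the official documentation, so I am recording them here to avoid hitting the same pitfalls again.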
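Contents
I. Installing Flink with Docker, following the official site
1. Pre-installation setup
2. Install the jobmanager
3. Install the taskmanager
II. Deploying the Flink cluster on Rancher
1. Create the jobmanager service
2. Create the taskmanager service
3. Verify that the cluster is up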
I. Installing Flink with Docker, following the official site
1. Pre-installation setup
$ FLINK_PROPERTIES="jobmanager.rpc.address: jobmanager"
$ docker network create flink-network
2. Install the jobmanager
$ docker run \
--rm \
--name=jobmanager \
--network flink-network \
--publish 8081:8081 \
--env FLINK_PROPERTIES="${FLINK_PROPERTIES}" \
flink:1.15.4-scala_2.12 jobmanager
3. Install the taskmanager
$ docker run \
--rm \
--name=taskmanager \
--network flink-network \
--env FLINK_PROPERTIES="${FLINK_PROPERTIES}" \
flink:1.15.4-scala_2.12 taskmanager
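Then open the Flink UI at localhost:8081.
II. Deploying the Flink cluster on Rancher
1. Create the jobmanager service
Service name: flink-jobmanager-one
Image: flink:1.15.0
Configure the ports
Cluster IP: 6123 (used for communication within the cluster)
NodePort: 8081:30099 (exposes the Flink UI; the node port must lie in the Kubernetes NodePort range, 30000-32767 by default)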
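The NodePort rule above is easy to check before saving the service. Below is a minimal sketch of such a check. The function name `check_node_port` is hypothetical, and the 30000-32767 bounds are the Kubernetes default, which a cluster administrator may have changed with the API server's `--service-node-port-range` setting.

```shell
# Hypothetical helper: succeeds when a port lies in the default
# Kubernetes NodePort range (30000-32767, assumed unchanged).
check_node_port() {
    port="$1"
    # Reject empty or non-numeric input first.
    case "$port" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$port" -ge 30000 ] && [ "$port" -le 32767 ]
}

if check_node_port 30099; then
    echo "30099 is a valid NodePort"
fi
```

The container port 8081 itself would fail this check, which is why it is mapped to 30099 on the node.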
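Configure the arguments and environment variables
Arguments: jobmanager (marks this service as the jobmanager; the counterpart is taskmanager)
Environment variable: the contents of the mounted flink-conf.yaml file
Contents of flink-conf.yaml: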
parallelism.default: 1
rest.address: 0.0.0.0
rest.bind-address: 0.0.0.0
blob.server.port: 6124
query.server.port: 6125
taskmanager.bind-host: 0.0.0.0
taskmanager.numberOfTaskSlots: 100
taskmanager.memory.process.size: 34560m
jobmanager.rpc.port: 6123
jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager-one
jobmanager.execution.failover-strategy: region
jobmanager.memory.heap.size: 16000m
jobmanager.memory.jvm-metaspace.size: 1600m
jobmanager.memory.jvm-overhead.max: 20480m
jobmanager.memory.jvm-overhead.min: 1m
jobmanager.memory.off-heap.size: 1600m
jobmanager.memory.process.size: 20480m
kubernetes.cluster-id: default
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: file:///opt/flink/ha-storage
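The jobmanager memory values in this configuration are tied together: in Flink's memory model the total process size is the sum of JVM heap, off-heap memory, metaspace and JVM overhead. The following is only a sanity-check sketch with hypothetical variable names and the values copied from the configuration. As far as I understand it, when all other components are set explicitly, Flink derives the overhead as the remainder and expects it to fall between the configured min and max.

```shell
# Values use the "m" unit exactly as written in the jobmanager settings.
process_size=20480   # jobmanager.memory.process.size
heap_size=16000      # jobmanager.memory.heap.size
off_heap_size=1600   # jobmanager.memory.off-heap.size
metaspace_size=1600  # jobmanager.memory.jvm-metaspace.size
overhead_min=1       # jobmanager.memory.jvm-overhead.min
overhead_max=20480   # jobmanager.memory.jvm-overhead.max

# Whatever remains of the process size is left for JVM overhead.
overhead=$((process_size - heap_size - off_heap_size - metaspace_size))
echo "JVM overhead: ${overhead}m"

# The remainder should lie between the configured min and max.
if [ "$overhead" -ge "$overhead_min" ] && [ "$overhead" -le "$overhead_max" ]; then
    echo "memory settings are consistent"
else
    echo "memory settings are inconsistent" >&2
fi
```

With these values 1280m is left for JVM overhead, which lies inside the configured 1m to 20480m window.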
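The detailed configuration is shown in the screenshot below. The key must be FLINK_PROPERTIES, and the value is the entire content of the flink-conf.yaml file from the Flink installation directory. To make it easier to read, I removed the unneeded comments from the file.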
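Removing the comments by hand is error-prone, so the value for FLINK_PROPERTIES can also be produced with a small shell helper. This is only a sketch: the function name `strip_conf_comments` is hypothetical, and it assumes that comments always occupy a whole line (optionally preceded by whitespace).

```shell
# Hypothetical helper: print a Flink config file without blank lines
# and without lines whose first non-blank character is '#'.
strip_conf_comments() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}
```

For example, `strip_conf_comments /opt/flink/conf/flink-conf.yaml` would print the cleaned settings ready to paste into the environment variable, assuming the file sits at that path as it does in the official image.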
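In addition, the JAR files under Flink's lib directory need to be mounted, and in high-availability mode the HA storage path must be mounted as well. I used an NFS mount, as shown in the screenshot below (blurred to avoid trouble); set the actual paths according to your own environment.
That completes the jobmanager configuration; click Save to start the service.
Check the service startup log. If it looks like the following and shows no other errors, the service started normally.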
2023-04-21 03:56:14,800 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Starting StandaloneSessionClusterEntrypoint.
2023-04-21 03:56:14,860 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Install default filesystem.
2023-04-21 03:56:14,865 INFO org.apache.flink.core.fs.FileSystem [] - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2023-04-21 03:56:14,947 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Install security context.
2023-04-21 03:56:14,967 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory [] - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2023-04-21 03:56:14,975 INFO org.apache.flink.runtime.security.modules.JaasModule [] - Jaas file will be created as /tmp/jaas-11053256423370857721.conf.
2023-04-21 03:56:14,988 INFO org.apache.flink.runtime.security.contexts.HadoopSecurityContextFactory [] - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2023-04-21 03:56:14,993 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Initializing cluster services.
2023-04-21 03:56:15,004 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Using working directory: WorkingDirectory(/tmp/jm_016b9c4949a2c1712d36faf922b30dc2).
2023-04-21 03:56:15,437 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Trying to start actor system, external address flink-jobmanager-one:6123, bind address 0.0.0.0:6123.
2023-04-21 03:56:16,875 INFO akka.event.slf4j.Slf4jLogger [] - Slf4jLogger started
2023-04-21 03:56:16,924 INFO akka.remote.RemoteActorRefProvider [] - Akka Cluster not in use - enabling unsafe features anyway because `akka.remote.use-unsafe-remote-features-outside-cluster` has been enabled.
2023-04-21 03:56:16,925 INFO akka.remote.Remoting [] - Starting remoting
2023-04-21 03:56:17,192 INFO akka.remote.Remoting [] - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-one:6123]
2023-04-21 03:56:17,402 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Actor system started at akka.tcp://flink@flink-jobmanager-one:6123
2023-04-21 03:56:17,815 INFO org.apache.flink.runtime.blob.FileSystemBlobStore [] - Creating highly available BLOB storage directory at file:/opt/flink/ha-storage/default/blob
2023-04-21 03:56:19,090 INFO org.apache.flink.runtime.blob.BlobServer [] - Created BLOB server storage directory /tmp/jm_016b9c4949a2c1712d36faf922b30dc2/blobStorage
2023-04-21 03:56:19,096 INFO org.apache.flink.runtime.blob.BlobServer [] - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2023-04-21 03:56:19,116 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl [] - No metrics reporter configured, no metrics will be exposed/reported.
2023-04-21 03:56:19,122 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Trying to start actor system, external address flink-jobmanager-one:0, bind address 0.0.0.0:0.
2023-04-21 03:56:19,155 INFO akka.event.slf4j.Slf4jLogger [] - Slf4jLogger started
2023-04-21 03:56:19,162 INFO akka.remote.RemoteActorRefProvider [] - Akka Cluster not in use - enabling unsafe features anyway because `akka.remote.use-unsafe-remote-features-outside-cluster` has been enabled.
2023-04-21 03:56:19,162 INFO akka.remote.Remoting [] - Starting remoting
2023-04-21 03:56:19,177 INFO akka.remote.Remoting [] - Remoting started; listening on addresses :[akka.tcp://flink-metrics@flink-jobmanager-one:44181]
2023-04-21 03:56:19,195 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Actor system started at akka.tcp://flink-metrics@flink-jobmanager-one:44181
2023-04-21 03:56:19,218 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.metrics.dump.MetricQueryService at akka://flink-metrics/user/rpc/MetricQueryService .
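2. Create the taskmanager service
Service name: flink-taskmanager-one
Image: flink:1.15.0
The taskmanager does not need any port configuration
Configure the arguments and environment variables
Arguments: taskmanager (marks this service as the taskmanager; the counterpart is jobmanager)
Environment variable: the contents of the mounted flink-conf.yaml file
Contents of flink-conf.yaml: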
parallelism.default: 1
rest.address: 0.0.0.0
rest.bind-address: 0.0.0.0
blob.server.port: 6124
query.server.port: 6125
taskmanager.bind-host: 0.0.0.0
taskmanager.memory.process.size: 1728m
taskmanager.numberOfTaskSlots: 100
jobmanager.rpc.port: 6123
jobmanager.bind-host: 0.0.0.0
jobmanager.rpc.address: flink-jobmanager-one
jobmanager.execution.failover-strategy: region
jobmanager.memory.process.size: 1600m
kubernetes.cluster-id: default
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: file:///opt/flink/nfs/ha-storage
execution.checkpointing.interval: 3min
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
execution.checkpointing.max-concurrent-checkpoints: 1
execution.checkpointing.min-pause: 0
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.timeout: 10min
execution.checkpointing.tolerable-failed-checkpoints: 0
execution.checkpointing.unaligned: true
state.backend: rocksdb
state.checkpoints.dir: file:///home/flink/checkpoints
state.savepoints.dir: file:///home/flink/flink/savepoints
state.backend.incremental: true
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 10
restart-strategy.failure-rate.failure-rate-interval: 300s
restart-strategy.failure-rate.delay: 15s
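Besides Flink's lib directory, the taskmanager also needs the savepoint and checkpoint directories mounted (if you only want to get it running for a trial, you can ignore this setting).
That completes the taskmanager configuration; click Save to start the service.
Check the service startup log. If it looks like the following and shows no other errors, the service started normally.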
2023-04-21 03:59:07,347 INFO org.apache.flink.core.fs.FileSystem [] - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2023-04-21 03:59:07,475 INFO org.apache.flink.runtime.state.changelog.StateChangelogStorageLoader [] - StateChangelogStorageLoader initialized with shortcut names {memory,filesystem}.
2023-04-21 03:59:07,485 INFO org.apache.flink.runtime.state.changelog.StateChangelogStorageLoader [] - StateChangelogStorageLoader initialized with shortcut names {memory,filesystem}.
2023-04-21 03:59:07,507 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory [] - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2023-04-21 03:59:07,527 INFO org.apache.flink.runtime.security.modules.JaasModule [] - Jaas file will be created as /tmp/jaas-9454914976650115513.conf.
2023-04-21 03:59:07,540 INFO org.apache.flink.runtime.security.contexts.HadoopSecurityContextFactory [] - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2023-04-21 03:59:08,538 INFO org.apache.flink.runtime.blob.FileSystemBlobStore [] - Creating highly available BLOB storage directory at file:/opt/flink/ha-storage/default/blob
2023-04-21 03:59:09,872 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='default-cluster-config-map'}.
2023-04-21 03:59:09,872 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMapSharedInformer [] - Starting to watch for default/default-cluster-config-map, watching id:7c948823-ecdd-4ba4-b6be-568113b6a8ad
2023-04-21 03:59:09,874 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils [] - Trying to select the network interface and address to use by connecting to the leading JobManager.
2023-04-21 03:59:09,874 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils [] - TaskManager will try to connect for PT10S before falling back to heuristics
2023-04-21 03:59:15,278 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Trying to connect to address flink-jobmanager-one/****:6123
2023-04-21 03:59:15,380 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [flink-jobmanager-one/10.43.79.104:6123] from local address [localhost/127.0.0.1] with timeout [100] due to: connect timed out
2023-04-21 03:59:15,381 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2023-04-21 03:59:15,381 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Stopping KubernetesLeaderRetrievalDriver{configMapName='default-cluster-config-map'}.
2023-04-21 03:59:15,382 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - TaskManager will use hostname/address 'flink-taskmanager-one-5c484b8685-rrsct' (10.42.2.35) for communication.
2023-04-21 03:59:15,382 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMapSharedInformer [] - Stopped to watch for default/default-cluster-config-map, watching id:7c948823-ecdd-4ba4-b6be-568113b6a8ad
2023-04-21 03:59:15,482 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Trying to start actor system, external address ****:0, bind address 0.0.0.0:0.
2023-04-21 03:59:16,610 INFO akka.event.slf4j.Slf4jLogger [] - Slf4jLogger started
2023-04-21 03:59:16,661 INFO akka.remote.RemoteActorRefProvider [] - Akka Cluster not in use - enabling unsafe features anyway because `akka.remote.use-unsafe-remote-features-outside-cluster` has been enabled.
2023-04-21 03:59:16,662 INFO akka.remote.Remoting [] - Starting remoting
2023-04-21 03:59:16,927 INFO akka.remote.Remoting [] - Remoting started; listening on addresses :[akka.tcp://flink@10.42.2.35:33783]
2023-04-21 03:59:17,142 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Actor system started at akka.tcp://flink@*****:33783
2023-04-21 03:59:17,171 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Using working directory: WorkingDirectory(/tmp/tm_10.42.2.35:33783-56875c)
2023-04-21 03:59:17,187 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl [] - No metrics reporter configured, no metrics will be exposed/reported.
2023-04-21 03:59:17,194 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Trying to start actor system, external address *****:0, bind address 0.0.0.0:0.
2023-04-21 03:59:17,226 INFO akka.event.slf4j.Slf4jLogger [] - Slf4jLogger started
2023-04-21 03:59:17,234 INFO akka.remote.RemoteActorRefProvider [] - Akka Cluster not in use - enabling unsafe features anyway because `akka.remote.use-unsafe-remote-features-outside-cluster` has been enabled.
2023-04-21 03:59:17,234 INFO akka.remote.Remoting [] - Starting remoting
2023-04-21 03:59:17,249 INFO akka.remote.Remoting [] - Remoting started; listening on addresses :[akka.tcp://flink-metrics@10.42.2.35:34252]
2023-04-21 03:59:17,272 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Actor system started at akka.tcp://flink-metrics@10.42.2.35:34252
2023-04-21 03:59:17,298 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.metrics.dump.MetricQueryService at akka://flink-metrics/user/rpc/MetricQueryService_10.42.2.35:33783-56875c .
2023-04-21 03:59:17,350 INFO org.apache.flink.runtime.blob.PermanentBlobCache [] - Created BLOB cache storage directory /tmp/tm_10.42.2.35:33783-56875c/blobStorage
2023-04-21 03:59:17,360 INFO org.apache.flink.runtime.blob.TransientBlobCache [] - Created BLOB cache storage directory /tmp/tm_10.42.2.35:33783-56875c/blobStorage
2023-04-21 03:59:17,367 INFO org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled external resources: []
2023-04-21 03:59:17,368 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Starting TaskManager with ResourceID: *****:33783-56875c
2023-04-21 03:59:17,436 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices [] - Temporary file directory '/tmp': total 549 GB, usable 447 GB (81.42% usable)
2023-04-21 03:59:17,442 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager [] - Created a new FileChannelManager for spilling of task related data to disk (joins, sorting, ...). Used directories:
/tmp/flink-io-ebe2ab9f-f60f-4b6b-be7a-b090613e845e
2023-04-21 03:59:17,453 INFO org.apache.flink.runtime.io.network.netty.NettyConfig [] - NettyConfig [server address: /0.0.0.0, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: AUTO, number of server threads: 100 (manual), number of client threads: 100 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2023-04-21 03:59:17,593 INFO org.apache.flink.runtime.io.network.NettyShuffleServiceFactory [] - Created a new FileChannelManager for storing result partitions of BLOCKING shuffles. Used directories:
/tmp/flink-netty-shuffle-bb8f836c-1bde-446a-955d-d771f33a18f0
2023-04-21 03:59:17,760 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool [] - Allocated 128 MB for network buffer pool (number of memory segments: 4096, bytes per segment: 32768).
2023-04-21 03:59:17,782 INFO org.apache.flink.runtime.io.network.NettyShuffleEnvironment [] - Starting the network environment and its components.
2023-04-21 03:59:17,923 INFO org.apache.flink.runtime.io.network.netty.NettyClient [] - Transport type 'auto': using EPOLL.
2023-04-21 03:59:17,927 INFO org.apache.flink.runtime.io.network.netty.NettyClient [] - Successful initialization (took 144 ms).
2023-04-21 03:59:17,955 INFO org.apache.flink.runtime.io.network.netty.NettyServer [] - Transport type 'auto': using EPOLL.
2023-04-21 03:59:18,023 INFO org.apache.flink.runtime.io.network.netty.NettyServer [] - Successful initialization (took 93 ms). Listening on SocketAddress /0:0:0:0:0:0:0:0%0:44943.
2023-04-21 03:59:18,025 INFO org.apache.flink.runtime.taskexecutor.KvStateService [] - Starting the kvState service and its components.
2023-04-21 03:59:18,068 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/rpc/taskmanager_0 .
2023-04-21 03:59:18,098 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMapSharedInformer [] - Starting to watch for default/default-cluster-config-map, watching id:fa1d1e62-3096-49e5-98c6-33cefd409f64
2023-04-21 03:59:18,098 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='default-cluster-config-map'}.
2023-04-21 03:59:18,100 INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Start job leader service.
2023-04-21 03:59:18,102 INFO org.apache.flink.runtime.filecache.FileCache [] - User file cache uses directory /tmp/flink-dist-cache-9504579b-10e1-40d9-8432-5f3e84862ae8
2023-04-21 03:59:19,558 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://flink@flink-jobmanager-one:6123/user/rpc/resourcemanager_0(ac2fae3954b4df272f119171641343f0).
2023-04-21 03:59:19,808 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Resolved ResourceManager address, beginning registration
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jboss.netty.util.internal.ByteBufferUtil (file:/tmp/flink-rpc-akka_3755f8e5-ec9a-425f-9815-a45ebb25f358.jar) to method java.nio.DirectByteBuffer.cleaner()
WARNING: Please consider reporting this to the maintainers of org.jboss.netty.util.internal.ByteBufferUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2023-04-21 03:59:19,923 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful registration at resource manager akka.tcp://flink@flink-jobmanager-one:6123/user/rpc/resourcemanager_0 under registration id f679013e892a684acbbda4efc6a3a0b0.
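The line that matters most in this log is the last one, which reports that the taskmanager registered with the jobmanager's resource manager. A minimal sketch of an automated check follows. The function name `tm_registered` is hypothetical, and it simply searches log text on standard input for that message.

```shell
# Hypothetical helper: read a taskmanager log on stdin and succeed
# if it contains the registration message seen in the log above.
tm_registered() {
    grep -q 'Successful registration at resource manager'
}

# The final log line from above, used here as sample input.
sample='2023-04-21 03:59:19,923 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful registration at resource manager akka.tcp://flink@flink-jobmanager-one:6123/user/rpc/resourcemanager_0 under registration id f679013e892a684acbbda4efc6a3a0b0.'

if printf '%s\n' "$sample" | tm_registered; then
    echo "taskmanager registered"
fi
```

Against a live cluster the log could be piped in with something like `kubectl logs deployment/flink-taskmanager-one | tm_registered`, assuming the workload is a Deployment with that name in the current namespace.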
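3. Verify that the cluster is up
Open the configured node address together with the exposed port in the 30000 range. As the two screenshots below show, if the Flink UI displays the parameter configuration of both the jobmanager and the taskmanager, the installation succeeded.
jobmanager displayed normally:
taskmanager displayed normally, with one entry for each taskmanager you configured: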
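The same check can be done without a browser, because the Flink web UI is backed by a REST API on the same port. Its `/overview` endpoint returns a JSON summary that includes a `taskmanagers` count. The sketch below extracts that count with sed. The function name `tm_count` and the sample JSON are illustrative only, not output captured from this cluster.

```shell
# Hypothetical helper: read the JSON from Flink's /overview endpoint
# on stdin and print the value of its "taskmanagers" field.
tm_count() {
    sed -n 's/.*"taskmanagers"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Made-up sample response, trimmed to a few fields for illustration.
sample='{"taskmanagers":2,"slots-total":200,"slots-available":200,"flink-version":"1.15.0"}'
printf '%s\n' "$sample" | tm_count
```

On the Rancher deployment this could be fed by `curl -s http://<node-ip>:30099/overview | tm_count`, where `<node-ip>` stands for the address of one of your cluster nodes.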
That completes the installation.