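[Kafka Source Walkthrough] How the Message Producer Connects to the Server

Note: the source code described below is from the latest version; older versions may differ.

1. Finding the source entry point

kafka-console-producer.sh is the message producer script. Starting there, we can see the entry point into the source: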

if [ "x$KAFKA_HEAP_OPTS" = "x" ]; then
    export KAFKA_HEAP_OPTS="-Xmx512M"
fi
exec $(dirname $0)/kafka-run-class.sh kafka.tools.ConsoleProducer "$@"

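From the script above, the entry point is kafka.tools.ConsoleProducer, which is a Scala file.

2. Starting the producer from source for debugging

The best way to read source code is under a debugger, setting breakpoints and stepping through as you read, so let's first set up the environment so that the program can run.

Because the Kafka server is configured with authentication, the client needs authentication settings as well; otherwise the connection to the server fails. For how to enable authentication, see my earlier article. We can take the arguments passed to kafka-console-producer.sh at run time and enter the corresponding values in IDEA's Run/Debug Configurations dialog.

Script: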

/kafka/bin/kafka-console-producer.sh --bootstrap-server=127.0.0.1:9092 --topic=notif.test --producer.config=/kafka/config/topic.properties

The contents of topic.properties:

security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN

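These settings can also go in producer.properties, as shown further down.

IDEA dialog:

The parts boxed in red are the differences from running without authentication; without these two parameters, connecting to the server fails. For the contents of client.jaas.conf, see the article on enabling authentication mentioned above. producer.properties is a configuration file that ships with Kafka; we only need to add the following: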

security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
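As a quick sanity check of what these two lines look like to a Java client, the sketch below parses them with java.util.Properties, which is the format used by topic.properties and producer.properties. The class SaslClientProps and its parse() helper are made up for illustration and are not part of Kafka; only the two keys and values come from the configuration above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper class, not part of Kafka.
class SaslClientProps {
    // Parse properties-format text (the format used by topic.properties).
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse("security.protocol=SASL_PLAINTEXT\nsasl.mechanism=PLAIN\n");
        System.out.println(props.getProperty("security.protocol"));
        System.out.println(props.getProperty("sasl.mechanism"));
    }
}
```

A typo in either key would not fail at parse time; it would simply leave the client on its default protocol, which is one reason a misconfigured client can look like a plain connection failure.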

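With everything in place, run it. If the console shows no errors, startup succeeded:

I have to admit something here: I almost never run code that reads its input from the command line. I sat on this screen for half an hour, convinced the connection had failed, checking everywhere for a misconfiguration. Then it occurred to me to look at the server-side log, and it turned out the client had connected. I tried typing below the red log output (see the screenshot), and sure enough, the consumer received it. How embarrassing!

Kafka, if only you had printed a prompt above the place where I am supposed to type...

3. Tracing how the producer connects to the server

Now that the code runs, let's start reading. First, find the entry point, main(), in ConsoleProducer.scala: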

  def main(args: Array[String]): Unit = {

    try {
      val config = new ProducerConfig(args)
      val input = System.in
      val producer = new KafkaProducer[Array[Byte], Array[Byte]](producerProps(config))
      try loopReader(producer, newReader(config.readerClass, getReaderProps(config)), input, config.sync)
      finally producer.close()
      Exit.exit(0)
    } catch {
      case e: joptsimple.OptionException =>
        System.err.println(e.getMessage)
        Exit.exit(1)
      case e: Exception =>
        e.printStackTrace()
        Exit.exit(1)
    }
  }

As shown, it calls val producer = new KafkaProducer[Array[Byte], Array[Byte]](producerProps(config)).

KafkaProducer is Java code. Here is the constructor it ultimately calls:

    KafkaProducer(ProducerConfig config,
                  Serializer<K> keySerializer,
                  Serializer<V> valueSerializer,
                  ProducerMetadata metadata,
                  KafkaClient kafkaClient,
                  ProducerInterceptors<K, V> interceptors,
                  Time time) {
        try {
            this.producerConfig = config;
            this.time = time;

            // many lines omitted here
            this.errors = this.metrics.sensor("errors");
            this.sender = newSender(logContext, kafkaClient, this.metadata);
            String ioThreadName = NETWORK_THREAD_PREFIX + " | " + clientId;
            this.ioThread = new KafkaThread(ioThreadName, this.sender, true);
            this.ioThread.start();
            config.logUnused();
            AppInfoParser.registerAppInfo(JMX_PREFIX, clientId, metrics, time.milliseconds());
            log.debug("Kafka producer started");
        } catch (Throwable t) {
            // call close methods if internal objects are already constructed this is to prevent resource leak. see KAFKA-2121
            close(Duration.ofMillis(0), true);
            // now propagate the exception
            throw new KafkaException("Failed to construct kafka producer", t);
        }
    }

Note the line this.sender = newSender(logContext, kafkaClient, this.metadata); and step into newSender():

    Sender newSender(LogContext logContext, KafkaClient kafkaClient, ProducerMetadata metadata) {
        // some code omitted here
        KafkaClient client = kafkaClient != null ? kafkaClient : ClientUtils.createNetworkClient(producerConfig,
                this.metrics,
                "producer",
                logContext,
                apiVersions,
                time,
                maxInflightRequests,
                metadata,
                throttleTimeSensor,
                clientTelemetryReporter.map(ClientTelemetryReporter::telemetrySender).orElse(null));

        short acks = Short.parseShort(producerConfig.getString(ProducerConfig.ACKS_CONFIG));
        return new Sender(/* arguments omitted */);
    }

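Note this line:

KafkaClient client = kafkaClient != null ? kafkaClient : ClientUtils.createNetworkClient(/* arguments omitted */);

kafkaClient has not been assigned anywhere up to this point, so it is still null and the line reduces to:

KafkaClient client = ClientUtils.createNetworkClient(/* arguments omitted */);

Next, look at ClientUtils.createNetworkClient(), which ultimately calls this overload: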

    public static NetworkClient createNetworkClient(/* parameters omitted */) {
        ChannelBuilder channelBuilder = null;
        Selector selector = null;

        try {
            channelBuilder = ClientUtils.createChannelBuilder(config, time, logContext);
            selector = new Selector(config.getLong(CommonClientConfigs.CONNECTIONS_MAX_IDLE_MS_CONFIG),
                    metrics,
                    time,
                    metricsGroupPrefix,
                    channelBuilder,
                    logContext);
            return new NetworkClient(metadataUpdater,
                    metadata,
                    selector,
                    clientId,
                    maxInFlightRequestsPerConnection,
                    /* remaining arguments omitted */);
        } catch (Throwable t) {
            closeQuietly(selector, "Selector");
            closeQuietly(channelBuilder, "ChannelBuilder");
            throw new KafkaException("Failed to create new NetworkClient", t);
        }
    }

Remember the authentication settings we added in step 2? They are handled inside createNetworkClient(), specifically by this line:

channelBuilder = ClientUtils.createChannelBuilder(config, time, logContext);

createChannelBuilder() is defined as:

    public static ChannelBuilder createChannelBuilder(AbstractConfig config, Time time, LogContext logContext) {
        SecurityProtocol securityProtocol = SecurityProtocol.forName(config.getString(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG));
        String clientSaslMechanism = config.getString(SaslConfigs.SASL_MECHANISM);
        return ChannelBuilders.clientChannelBuilder(securityProtocol, JaasContext.Type.CLIENT, config, null,
                clientSaslMechanism, time, true, logContext);
    }

ChannelBuilders.clientChannelBuilder() only performs the outer checks:

    public static ChannelBuilder clientChannelBuilder(/* parameters omitted */) {

        if (securityProtocol == SecurityProtocol.SASL_PLAINTEXT || securityProtocol == SecurityProtocol.SASL_SSL) {
            if (contextType == null)
                throw new IllegalArgumentException("`contextType` must be non-null if `securityProtocol` is `" + securityProtocol + "`");
            if (clientSaslMechanism == null)
                throw new IllegalArgumentException("`clientSaslMechanism` must be non-null in client mode if `securityProtocol` is `" + securityProtocol + "`");
        }
        return create(securityProtocol, ConnectionMode.CLIENT, contextType, config, listenerName, false, clientSaslMechanism,
                saslHandshakeRequestEnable, null, null, time, logContext, null);
    }

The detailed logic lives in ChannelBuilders.create():

    private static ChannelBuilder create(/* parameters omitted */) {
        Map<String, Object> configs = channelBuilderConfigs(config, listenerName);

        ChannelBuilder channelBuilder;
        switch (securityProtocol) {
            case SSL:
                requireNonNullMode(connectionMode, securityProtocol);
                channelBuilder = new SslChannelBuilder(connectionMode, listenerName, isInterBrokerListener, logContext);
                break;
            case SASL_SSL:
            case SASL_PLAINTEXT:
                // the SASL handling is long, omitted
                break;
            case PLAINTEXT:
                channelBuilder = new PlaintextChannelBuilder(listenerName);
                break;
            default:
                throw new IllegalArgumentException("Unexpected securityProtocol " + securityProtocol);
        }

        channelBuilder.configure(configs);
        return channelBuilder;
    }
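The shape of clientChannelBuilder() plus create() is a precondition check followed by a factory switch on the protocol. A minimal sketch of that shape follows; the Protocol enum, Builder interface, and SimpleBuilder class are hypothetical stand-ins for Kafka's SecurityProtocol and ChannelBuilder types, and the SASL branch is reduced to a label because the real one is omitted above.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch; none of these types are Kafka classes.
class ChannelBuilderSketch {
    enum Protocol { PLAINTEXT, SSL, SASL_PLAINTEXT, SASL_SSL }

    interface Builder {
        void configure(Map<String, ?> configs);
        String describe();
    }

    static final class SimpleBuilder implements Builder {
        private final String kind;
        private boolean configured;

        SimpleBuilder(String kind) { this.kind = kind; }

        @Override public void configure(Map<String, ?> configs) { configured = true; }

        @Override public String describe() { return kind + (configured ? " (configured)" : ""); }
    }

    static Builder create(Protocol protocol, String saslMechanism, Map<String, ?> configs) {
        // Outer check, as in clientChannelBuilder(): SASL protocols need a mechanism.
        boolean sasl = protocol == Protocol.SASL_PLAINTEXT || protocol == Protocol.SASL_SSL;
        if (sasl && saslMechanism == null)
            throw new IllegalArgumentException("saslMechanism must be non-null for " + protocol);

        // Factory switch, as in create(): one builder per protocol family.
        Builder builder;
        switch (protocol) {
            case SSL:
                builder = new SimpleBuilder("ssl");
                break;
            case SASL_SSL:
            case SASL_PLAINTEXT:
                builder = new SimpleBuilder("sasl/" + saslMechanism);
                break;
            case PLAINTEXT:
                builder = new SimpleBuilder("plaintext");
                break;
            default:
                throw new IllegalArgumentException("Unexpected protocol " + protocol);
        }
        builder.configure(configs); // every branch shares the same configure step
        return builder;
    }

    public static void main(String[] args) {
        System.out.println(create(Protocol.SASL_PLAINTEXT, "PLAIN", Collections.emptyMap()).describe());
    }
}
```

With the settings from step 2 (SASL_PLAINTEXT and PLAIN), the SASL branch is the one taken.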

Now back to ClientUtils.createNetworkClient():

    public static NetworkClient createNetworkClient(/* parameters omitted */) {
        ChannelBuilder channelBuilder = null;
        Selector selector = null;

        try {
            channelBuilder = ClientUtils.createChannelBuilder(config, time, logContext);
            selector = new Selector(/* arguments omitted */);
            return new NetworkClient(metadataUpdater,
                    metadata,
                    selector,
                    clientId,
                    maxInFlightRequestsPerConnection,
                    /* remaining arguments omitted */);
        } catch (Throwable t) {
            closeQuietly(selector, "Selector");
            closeQuietly(channelBuilder, "ChannelBuilder");
            throw new KafkaException("Failed to create new NetworkClient", t);
        }
    }

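After the channelBuilder is created, a Selector is created, then a NetworkClient, which is returned. The Selector and NetworkClient constructors only initialize fields and contain nothing noteworthy, so I skip them here.

When that finishes, control returns to KafkaProducer.newSender():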

    Sender newSender(LogContext logContext, KafkaClient kafkaClient, ProducerMetadata metadata) {
        // some code omitted here
        KafkaClient client = kafkaClient != null ? kafkaClient : ClientUtils.createNetworkClient(/* arguments omitted */);

        short acks = Short.parseShort(producerConfig.getString(ProducerConfig.ACKS_CONFIG));
        return new Sender(/* arguments omitted */);
    }

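As seen earlier, ClientUtils.createNetworkClient() returns a NetworkClient. KafkaClient is the interface that NetworkClient implements, so KafkaClient client is in fact a NetworkClient. Once client is assigned, a Sender is created and returned. The Sender constructor is also just initialization, so I skip it as well.

KafkaProducer.newSender() returns the Sender, and control returns to the KafkaProducer constructor:

    KafkaProducer(/* parameters omitted */) {
        try {
            // many lines omitted here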
            this.sender = newSender(logContext, kafkaClient, this.metadata);
            String ioThreadName = NETWORK_THREAD_PREFIX + " | " + clientId;
            this.ioThread = new KafkaThread(ioThreadName, this.sender, true);
            this.ioThread.start();
            config.logUnused();
            AppInfoParser.registerAppInfo(JMX_PREFIX, clientId, metrics, time.milliseconds());
            log.debug("Kafka producer started");
        } catch (Throwable t) {
            // many lines omitted here
        }
    }

After sender is assigned, a KafkaThread is created. Its constructor:

    public KafkaThread(final String name, Runnable runnable, boolean daemon) {
        super(runnable, name);
        configureThread(name, daemon);
    }
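The constructor above only forwards the Runnable and name to Thread and then calls configureThread(). Assuming configureThread() does roughly what its name and arguments suggest (set the daemon flag and install an uncaught-exception handler), the pattern can be sketched as below. WorkerThread and runOnDaemonThread() are hypothetical names, not Kafka code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a thin Thread wrapper; not Kafka code.
class DaemonThreadSketch {
    static final class WorkerThread extends Thread {
        WorkerThread(String name, Runnable runnable, boolean daemon) {
            super(runnable, name);
            // Assumed equivalent of configureThread(name, daemon).
            setDaemon(daemon);
            setUncaughtExceptionHandler((t, e) ->
                    System.err.println("Uncaught exception in thread '" + t.getName() + "': " + e));
        }
    }

    // Start the task on a daemon WorkerThread and wait up to one second for it to finish.
    static boolean runOnDaemonThread(Runnable task) {
        WorkerThread thread = new WorkerThread("sketch-io-thread", task, true);
        thread.start(); // start() is what causes task.run() to execute on the new thread
        try {
            thread.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !thread.isAlive();
    }

    public static void main(String[] args) {
        AtomicBoolean ran = new AtomicBoolean(false);
        boolean finished = runOnDaemonThread(() -> ran.set(true));
        System.out.println(finished && ran.get());
    }
}
```

The point to carry forward is that nothing in the Runnable executes until start() is called, and from then on it executes on the new thread rather than on the caller's.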

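So KafkaThread only adds a little extra setup on top of Thread. Once the KafkaThread is created, the next step is to call start(). The Runnable passed to the KafkaThread constructor is the Sender, so we need to look at Sender.run():

    /**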
     * The main run loop for the sender thread
     */
    @Override
    public void run() {
        log.debug("Starting Kafka producer I/O thread.");

        if (transactionManager != null)
            transactionManager.setPoisonStateOnInvalidTransition(true);

        // main loop, runs until close is called
        while (running) {
            try {
                runOnce();
            } catch (Exception e) {
                log.error("Uncaught error in kafka producer I/O thread: ", e);
            }
        }

        log.debug("Beginning shutdown of Kafka producer I/O thread, sending remaining records.");

        // okay we stopped accepting requests but there may still be
        // requests in the transaction manager, accumulator or waiting for acknowledgment,
        // wait until these are completed.
        while (!forceClose && ((this.accumulator.hasUndrained() || this.client.inFlightRequestCount() > 0) || hasPendingTransactionalRequests())) {
            try {
                runOnce();
            } catch (Exception e) {
                log.error("Uncaught error in kafka producer I/O thread: ", e);
            }
        }

        // Abort the transaction if any commit or abort didn't go through the transaction manager's queue
        while (!forceClose && transactionManager != null && transactionManager.hasOngoingTransaction()) {
            if (!transactionManager.isCompleting()) {
                log.info("Aborting incomplete transaction due to shutdown");
                try {
                    // It is possible for the transaction manager to throw errors when aborting. Catch these
                    // so as not to interfere with the rest of the shutdown logic.
                    transactionManager.beginAbort();
                } catch (Exception e) {
                    log.error("Error in kafka producer I/O thread while aborting transaction when during closing: ", e);
                    // Force close in case the transactionManager is in error states.
                    forceClose = true;
                }
            }
            try {
                runOnce();
            } catch (Exception e) {
                log.error("Uncaught error in kafka producer I/O thread: ", e);
            }
        }

        if (forceClose) {
            // We need to fail all the incomplete transactional requests and batches and wake up the threads waiting on
            // the futures.
            if (transactionManager != null) {
                log.debug("Aborting incomplete transactional requests due to forced shutdown");
                transactionManager.close();
            }
            log.debug("Aborting incomplete batches due to forced shutdown");
            this.accumulator.abortIncompleteBatches();
        }
        try {
            this.client.close();
        } catch (Exception e) {
            log.error("Failed to close network client", e);
        }

        log.debug("Shutdown of Kafka producer I/O thread has completed.");
    }

Because this part is the key, I have not abridged it. run() calls runOnce() in several places, so let's see what that method does:

    /**
     * Run a single iteration of sending
     *
     */
    void runOnce() {
        if (transactionManager != null) {
            try {
                transactionManager.maybeResolveSequences();

                RuntimeException lastError = transactionManager.lastError();

                // do not continue sending if the transaction manager is in a failed state
                if (transactionManager.hasFatalError()) {
                    if (lastError != null)
                        maybeAbortBatches(lastError);
                    client.poll(retryBackoffMs, time.milliseconds());
                    return;
                }

                if (transactionManager.hasAbortableError() && shouldHandleAuthorizationError(lastError)) {
                    return;
                }

                // Check whether we need a new producerId. If so, we will enqueue an InitProducerId
                // request which will be sent below
                transactionManager.bumpIdempotentEpochAndResetIdIfNeeded();

                if (maybeSendAndPollTransactionalRequest()) {
                    return;
                }
            } catch (AuthenticationException e) {
                // This is already logged as error, but propagated here to perform any clean ups.
                log.trace("Authentication exception while processing transactional request", e);
                transactionManager.authenticationFailed(e);
            }
        }

        long currentTimeMs = time.milliseconds();
        long pollTimeout = sendProducerData(currentTimeMs);
        client.poll(pollTimeout, currentTimeMs);
    }
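Stripped of transactions, run() and runOnce() follow a common pattern: loop on a volatile flag, then keep iterating after close is requested until the outstanding work is drained. A simplified sketch of that pattern follows. SenderLoopSketch, its queue, and the demo() helper are hypothetical; the real Sender drains an accumulator and in-flight requests rather than a plain queue.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Hypothetical, simplified stand-in for Sender; not Kafka code.
class SenderLoopSketch implements Runnable {
    private volatile boolean running = true;
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    private final AtomicInteger sent = new AtomicInteger();

    void enqueue(String record) { pending.add(record); }

    void initiateClose() { running = false; }

    int sentCount() { return sent.get(); }

    // One iteration: "send" at most one record, or wait briefly when idle.
    void runOnce() {
        String record = pending.poll();
        if (record != null) {
            sent.incrementAndGet();
        } else {
            LockSupport.parkNanos(1_000_000L); // idle for about 1 ms
        }
    }

    @Override
    public void run() {
        // Main loop: runs until initiateClose() is called.
        while (running) {
            try {
                runOnce();
            } catch (RuntimeException e) {
                System.err.println("Uncaught error in I/O thread: " + e);
            }
        }
        // Shutdown: keep iterating until everything already queued has been sent.
        while (!pending.isEmpty()) {
            runOnce();
        }
    }

    // Enqueue n records, start the loop, request close, and report how many were sent.
    static int demo(int n) {
        SenderLoopSketch sender = new SenderLoopSketch();
        for (int i = 0; i < n; i++) sender.enqueue("record-" + i);
        Thread ioThread = new Thread(sender, "sketch-io-thread");
        ioThread.setDaemon(true);
        ioThread.start();
        sender.initiateClose();
        try {
            ioThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sender.sentCount();
    }

    public static void main(String[] args) {
        System.out.println(demo(3));
    }
}
```

Even though close is requested immediately after start, the drain loop ensures the records queued beforehand are still processed, which mirrors the second while loop in Sender.run().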

The most important call in runOnce() is probably client.poll(). Here is its Javadoc, declared in KafkaClient:

    /**
     * Do actual reads and writes from sockets.
     *
     * @param timeout The maximum amount of time to wait for responses in ms, must be non-negative. The implementation
     *                is free to use a lower value if appropriate (common reasons for this are a lower request or
     *                metadata update timeout)
     * @param now The current time in ms
     * @throws IllegalStateException If a request is sent to an unready node
     */
    List<ClientResponse> poll(long timeout, long now);

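According to the Javadoc, this method performs the actual reads and writes on the sockets.

Now back to the KafkaProducer constructor: once this.ioThread.start() has executed, initialization of the KafkaProducer is essentially complete. But did you notice? Nowhere in the flow so far is there any code that connects to the Kafka server.

At first I suspected I had missed something while reading, so I went through it again and still found no connection step. Time for the debugger. To avoid single-stepping through everything, here is a trick from experience: before starting the debugger, look back over the Java classes involved so far, find the one that contains a method for connecting to the server, and set a breakpoint there.

Only a handful of classes are involved, so the search is easy. I quickly settled on the Selector class: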

    /**
     * Begin connecting to the given address and add the connection to this nioSelector associated with the given id
     * number.
     * <p>
     * Note that this call only initiates the connection, which will be completed on a future {@link #poll(long)}
     * call. Check {@link #connected()} to see which (if any) connections have completed after a given poll call.
     * @param id The id for the new connection
     * @param address The address to connect to
     * @param sendBufferSize The send buffer for the new connection
     * @param receiveBufferSize The receive buffer for the new connection
     * @throws IllegalStateException if there is already a connection for that id
     * @throws IOException if DNS resolution fails on the hostname or if the broker is down
     */
    @Override
    public void connect(String id, InetSocketAddress address, int sendBufferSize, int receiveBufferSize) throws IOException {
        ensureNotRegistered(id);
        SocketChannel socketChannel = SocketChannel.open();
        SelectionKey key = null;
        try {
            configureSocketChannel(socketChannel, sendBufferSize, receiveBufferSize);
            boolean connected = doConnect(socketChannel, address);
            key = registerChannel(id, socketChannel, SelectionKey.OP_CONNECT);

            if (connected) {
                // OP_CONNECT won't trigger for immediately connected channels
                log.debug("Immediately connected to node {}", id);
                immediatelyConnectedKeys.add(key);
                key.interestOps(0);
            }
        } catch (IOException | RuntimeException e) {
            if (key != null)
                immediatelyConnectedKeys.remove(key);
            channels.remove(id);
            socketChannel.close();
            throw e;
        }
    }
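Selector.connect() is built on the standard java.nio non-blocking connect sequence: open a SocketChannel, switch it to non-blocking mode, call connect(), and wait for OP_CONNECT before calling finishConnect(). The stdlib-only sketch below shows that sequence against a throwaway local listener. NonBlockingConnectSketch and its methods are illustrative names, and the handling of an immediately connected channel is simplified compared with Kafka's immediatelyConnectedKeys bookkeeping.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Illustrative sketch using only java.nio; not Kafka code.
class NonBlockingConnectSketch {
    // Initiate a non-blocking connect and wait (via a selector) until it completes or times out.
    static boolean connect(InetSocketAddress address, long timeoutMs) throws IOException {
        try (Selector selector = Selector.open();
             SocketChannel channel = SocketChannel.open()) {
            channel.configureBlocking(false);
            // connect() only initiates the connection; it may or may not complete right away.
            if (channel.connect(address)) {
                return true; // immediately connected, so OP_CONNECT would never fire
            }
            channel.register(selector, SelectionKey.OP_CONNECT);
            long deadline = System.currentTimeMillis() + timeoutMs;
            long remaining;
            while ((remaining = deadline - System.currentTimeMillis()) > 0) {
                selector.select(remaining);
                for (SelectionKey key : selector.selectedKeys()) {
                    // finishConnect() completes the handshake, or throws if the connect failed.
                    if (key.isConnectable() && channel.finishConnect()) {
                        return true;
                    }
                }
                selector.selectedKeys().clear();
            }
            return false;
        }
    }

    // Open a listener on an ephemeral loopback port and connect to it.
    static boolean demo() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            InetSocketAddress address = (InetSocketAddress) server.getLocalAddress();
            return connect(address, 2000);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

This also explains the note in the Javadoc: the connect call itself returns quickly, and the connection is only confirmed on a later poll of the selector.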

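The Javadoc on Selector.connect() matches my guess, so I set a breakpoint there and looked at the thread stack when it was hit:

Surprise: connecting to the server is triggered by KafkaThread.start(). That is why I called Sender.run() the key part and left its code unabridged. The call order: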

run()->runOnce()->maybeSendAndPollTransactionalRequest()->.......

Here is the source of Sender.maybeSendAndPollTransactionalRequest():

    /**
     * Returns true if a transactional request is sent or polled, or if a FindCoordinator request is enqueued
     */
    private boolean maybeSendAndPollTransactionalRequest() {
        // some code omitted
        try {
            // some code omitted
            if (targetNode != null) {
                if (!awaitNodeReady(targetNode, coordinatorType)) {
                    log.trace("Target node {} not ready within request timeout, will retry when node is ready.", targetNode);
                    maybeFindCoordinatorAndRetry(nextRequestHandler);
                    return true;
                }
            } else if (coordinatorType != null) {
                // some code omitted
            } else {
                // some code omitted
            }
            // some code omitted
        }
    }

Step into Sender.awaitNodeReady():

    private boolean awaitNodeReady(Node node, FindCoordinatorRequest.CoordinatorType coordinatorType) throws IOException {
        if (NetworkClientUtils.awaitReady(client, node, time, requestTimeoutMs)) {
            if (coordinatorType == FindCoordinatorRequest.CoordinatorType.TRANSACTION) {
                // Indicate to the transaction manager that the coordinator is ready, allowing it to check ApiVersions
                // This allows us to bump transactional epochs even if the coordinator is temporarily unavailable at
                // the time when the abortable error is handled
                transactionManager.handleCoordinatorReady();
            }
            return true;
        }
        return false;
    }

Then into NetworkClientUtils.awaitReady():

    public static boolean awaitReady(KafkaClient client, Node node, Time time, long timeoutMs) throws IOException {
        if (timeoutMs < 0) {
            throw new IllegalArgumentException("Timeout needs to be greater than 0");
        }
        long startTime = time.milliseconds();

        if (isReady(client, node, startTime) ||  client.ready(node, startTime))
            return true;

        // some code omitted
    }

Then into NetworkClientUtils.isReady():

    public static boolean isReady(KafkaClient client, Node node, long currentTime) {
        client.poll(0, currentTime);
        return client.isReady(node, currentTime);
    }

Then into NetworkClient.poll():

    @Override
    public List<ClientResponse> poll(long timeout, long now) {
        ensureActive();

        // some code omitted

        long metadataTimeout = metadataUpdater.maybeUpdate(now);
        long telemetryTimeout = telemetrySender != null ? telemetrySender.maybeUpdate(now) : Integer.MAX_VALUE;
        // some code omitted

        return responses;
    }

Then into NetworkClient.DefaultMetadataUpdater.maybeUpdate(long):

    class DefaultMetadataUpdater implements MetadataUpdater {

        // some code omitted

        DefaultMetadataUpdater(Metadata metadata) {
            this.metadata = metadata;
            this.inProgress = null;
        }

        // some code omitted

        public long maybeUpdate(long now) {
            // some code omitted
            return maybeUpdate(now, leastLoadedNode.node());
        }
    }

Then into the private overload, NetworkClient.DefaultMetadataUpdater.maybeUpdate(long, Node):

        private long maybeUpdate(long now, Node node) {
            // some code omitted

            if (connectionStates.canConnect(nodeConnectionId, now)) {
                // We don't have a connection to this node right now, make one
                log.debug("Initialize connection to node {} for sending metadata request", node);
                initiateConnect(node, now);
                return reconnectBackoffMs;
            }

            return Long.MAX_VALUE;
        }
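The canConnect() guard is what stops every poll from opening another connection to the same node: a connect is attempted only when there is no connection state for the node yet, or when the node is disconnected and the backoff has elapsed. Below is a rough sketch of that bookkeeping, assuming a fixed backoff. ConnectionBackoffSketch and all of its members are hypothetical, and the real logic behind connectionStates is more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch with a fixed backoff; not Kafka's implementation.
class ConnectionBackoffSketch {
    enum State { CONNECTING, CONNECTED, DISCONNECTED }

    private static final class NodeState {
        State state;
        long lastConnectAttemptMs;
    }

    private final Map<String, NodeState> nodes = new HashMap<>();
    private final long reconnectBackoffMs;

    ConnectionBackoffSketch(long reconnectBackoffMs) {
        this.reconnectBackoffMs = reconnectBackoffMs;
    }

    // True if the node was never tried, or is disconnected and the backoff has elapsed.
    boolean canConnect(String id, long now) {
        NodeState s = nodes.get(id);
        if (s == null) return true;
        return s.state == State.DISCONNECTED && now - s.lastConnectAttemptMs >= reconnectBackoffMs;
    }

    // Record that a connection attempt to the node started at time now.
    void connecting(String id, long now) {
        NodeState s = nodes.computeIfAbsent(id, k -> new NodeState());
        s.state = State.CONNECTING;
        s.lastConnectAttemptMs = now;
    }

    // Record that the node's connection was lost or the attempt failed.
    void disconnected(String id) {
        NodeState s = nodes.get(id);
        if (s != null) s.state = State.DISCONNECTED;
    }

    public static void main(String[] args) {
        ConnectionBackoffSketch states = new ConnectionBackoffSketch(50);
        System.out.println(states.canConnect("0", 0)); // first attempt is allowed
        states.connecting("0", 0);
        System.out.println(states.canConnect("0", 10)); // already connecting
    }
}
```

On the producer's very first poll no state exists for the bootstrap node, so the guard passes and initiateConnect() is called.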

Then into NetworkClient.initiateConnect():

    private void initiateConnect(Node node, long now) {
        String nodeConnectionId = node.idString();
        try {
            connectionStates.connecting(nodeConnectionId, now, node.host());
            InetAddress address = connectionStates.currentAddress(nodeConnectionId);
            log.debug("Initiating connection to node {} using address {}", node, address);
            
            // this is the final entry point for connecting to the server
            selector.connect(nodeConnectionId,
                    new InetSocketAddress(address, node.port()),
                    this.socketSendBuffer,
                    this.socketReceiveBuffer);
        } catch (IOException e) {
            // some code omitted
        }
    }

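At last: this leads into Selector.connect(), the very method where I set the breakpoint, so I won't repeat it here.

By tracing with breakpoints, we finally found how the producer connects to the server. Once connected, it can send messages. Let's look back at ConsoleProducer.main():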

  def main(args: Array[String]): Unit = {

    try {
      val config = new ProducerConfig(args)

      // read input from the console
      val input = System.in

      // create the producer; its I/O thread connects to the server
      val producer = new KafkaProducer[Array[Byte], Array[Byte]](producerProps(config))

      // send messages
      try loopReader(producer, newReader(config.readerClass, getReaderProps(config)), input, config.sync)
      finally producer.close()
      Exit.exit(0)
    } catch {
      // some code omitted
    }
  }

To sum up, main() does three things: