大家好,我是烤鸭:
前几天遇到tidb节点重启后服务无法重连,确切地说是两个服务,一个可以正常重连,一个不行。
问题复现
由于线上执行慢SQL,导致TiDB 单个节点宕机重启。
其中A服务的3个节点和B服务的1个节点开始报错,重启后B服务恢复,A服务还在报错。
使用的组件版本:
A服务:tomcat-jdbc-9.0.71.jar
B服务:HikariCP-4.0.3.jar
数据库配置:
A服务:
spring.xxx-tidb-data-source.url=jdbc:mysql://xxx.xxx.xxx:6000/xxxA?useUnicode=true&characterEncoding=utf8&zeroDateTimeBehavior=convertToNull&useSSL=false&serverTimezone=GMT%2B8
B服务:
spring.datasource.url=jdbc:mysql://xxx.xxx.xxx:6000/xxxB?useSSL=false&autoReconnect=true
可以看到两边的配置和连接池不同。
报错日志
无法重连服务的报错日志:
Caused by: com.mysql.cj.exceptions.ConnectionIsClosedException: No operations allowed after connection closed.
at sun.reflect.GeneratedConstructorAccessor139.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:61)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:105)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:151)
at com.mysql.cj.NativeSession.checkClosed(NativeSession.java:762)
at com.mysql.cj.jdbc.ConnectionImpl.checkClosed(ConnectionImpl.java:569)
at com.mysql.cj.jdbc.ConnectionImpl.prepareStatement$original$McwhO8C0(ConnectionImpl.java:1580)
... 125 common frames omitted
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: The last packet successfully received from the server was 259,130,913 milliseconds ago. The last packet sent successfully to the server was 259,130,914 milliseconds ago.may or may not be greater than the server-side timeout (the driver was unable to determine the value of either the 'wait_timeout' or 'interactive_timeout' configuration values from the server.. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:61)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:105)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:151)
at com.mysql.cj.exceptions.ExceptionFactory.createCommunicationsException(ExceptionFactory.java:167)
at com.mysql.cj.protocol.a.NativeProtocol.readMessage(NativeProtocol.java:520)
at com.mysql.cj.protocol.a.NativeProtocol.checkErrorMessage(NativeProtocol.java:700)
at com.mysql.cj.protocol.a.NativeProtocol.sendCommand(NativeProtocol.java:639)
at com.mysql.cj.protocol.a.NativeProtocol.sendQueryPacket(NativeProtocol.java:987)
at com.mysql.cj.NativeSession.execSQL(NativeSession.java:666)
at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:930)
at com.mysql.cj.jdbc.ClientPreparedStatement.execute$original$mDKgOeFa(ClientPreparedStatement.java:371)
at com.mysql.cj.jdbc.ClientPreparedStatement.execute$original$mDKgOeFa$accessor$UdYpLG4x(ClientPreparedStatement.java)
at com.mysql.cj.jdbc.ClientPreparedStatement$auxiliary$BAL6Kq1Z.call(Unknown Source)
at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:86)
at com.mysql.cj.jdbc.ClientPreparedStatement.execute(ClientPreparedStatement.java)
at sun.reflect.GeneratedMethodAccessor261.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.tomcat.jdbc.pool.StatementFacade$StatementProxy.invoke(StatementFacade.java:118)
问题猜想
找到变量了,就顺着猜测查一下。
- 数据库配置参数 autoReconnect=true
- 连接池 tomcat-jdbc 和 HikariCP
报错日志都来源于 CJCommunicationsException 这个类,其实调用的是 ExceptionFactory.createLinkFailureMessageBasedOnHeuristics,报错日志来源于配置文件。
我保留了一部分一会可能用到的变量
LocalizedErrorMessages.properties
CommunicationsException.2=\ is longer than the server configured value of
CommunicationsException.3=''wait_timeout''
CommunicationsException.4=''interactive_timeout''
CommunicationsException.5=may or may not be greater than the server-side timeout
CommunicationsException.6=(the driver was unable to determine the value of either the
CommunicationsException.7=''wait_timeout'' or ''interactive_timeout'' configuration values from
CommunicationsException.8=the server.
CommunicationsException.11=. You should consider either expiring and/or testing connection validity
CommunicationsException.12=before use in your application, increasing the server configured values for client timeouts,
CommunicationsException.13=or using the Connector/J connection property ''autoReconnect=true'' to avoid this problem.
CommunicationsException.20=Communications link failure
...
Connection.0=Unable to connect to database.
Connection.1=Cannot connect to MySQL server on {0}:{1}.\n\nMake sure that there is a MySQL server running on the machine/port you are trying to connect to and that the machine this software is running on is able to connect to this host/port (i.e. not firewalled). Also make sure that the server has not been started with the --skip-networking flag.\n\n
Connection.2=No operations allowed after connection closed.
Connection.3=Can''t call commit when autocommit=true
Connection.4=Communications link failure during commit(). Transaction resolution unknown.
Connection.5=Unknown Java encoding for the character set with index ''{0}''. Use the ''customCharsetMapping'' property to force it.
Connection.6=Unknown character set index ''{0}'' received from server. The appropriate client character set can be forced via the ''characterEncoding'' property.
Connection.7=Can''t map {0} given for characterSetResults to a supported MySQL encoding.
Connection.8=Unable to use encoding: {0}
看看源码
从报错日志入手,根据链接状态和上次发送、接收时间判断,拼一些异常原因。
可以看出报错日志有几个组成部分:
1、上次发送packet的时间距离现在超过8小时,getLastPacketSentTime > 28800秒(8小时)
2、当前链接已关闭:No operations allowed after connection closed
ExceptionFactory
public static String createLinkFailureMessageBasedOnHeuristics(PropertySet propertySet, ServerSession serverSession,
PacketSentTimeHolder packetSentTimeHolder, PacketReceivedTimeHolder packetReceivedTimeHolder, Throwable underlyingException) {
long serverTimeoutSeconds = 0;
boolean isInteractiveClient = false;
long lastPacketReceivedTimeMs = packetReceivedTimeHolder == null ? 0L : packetReceivedTimeHolder.getLastPacketReceivedTime();
long lastPacketSentTimeMs = packetSentTimeHolder.getLastPacketSentTime();
if (lastPacketSentTimeMs > lastPacketReceivedTimeMs) {
lastPacketSentTimeMs = packetSentTimeHolder.getPreviousPacketSentTime();
}
if (propertySet != null) {
isInteractiveClient = propertySet.getBooleanProperty(PropertyKey.interactiveClient).getValue();
String serverTimeoutSecondsStr = null;
if (serverSession != null) {
serverTimeoutSecondsStr = isInteractiveClient ? serverSession.getServerVariable("interactive_timeout")
: serverSession.getServerVariable("wait_timeout");
}
if (serverTimeoutSecondsStr != null) {
try {
serverTimeoutSeconds = Long.parseLong(serverTimeoutSecondsStr);
} catch (NumberFormatException nfe) {
serverTimeoutSeconds = 0;
}
}
}
StringBuilder exceptionMessageBuf = new StringBuilder();
long nowMs = System.currentTimeMillis();
if (lastPacketSentTimeMs == 0) {
lastPacketSentTimeMs = nowMs;
}
long timeSinceLastPacketSentMs = (nowMs - lastPacketSentTimeMs);
long timeSinceLastPacketSeconds = timeSinceLastPacketSentMs / 1000;
long timeSinceLastPacketReceivedMs = (nowMs - lastPacketReceivedTimeMs);
int dueToTimeout = DUE_TO_TIMEOUT_FALSE;
StringBuilder timeoutMessageBuf = null;
if (serverTimeoutSeconds != 0) {
if (timeSinceLastPacketSeconds > serverTimeoutSeconds) {
dueToTimeout = DUE_TO_TIMEOUT_TRUE;
timeoutMessageBuf = new StringBuilder();
timeoutMessageBuf.append(Messages.getString("CommunicationsException.2"));
timeoutMessageBuf.append(Messages.getString(isInteractiveClient ? "CommunicationsException.4" : "CommunicationsException.3"));
}
// 上次发送成功的时间距离现在超过8小时
} else if (timeSinceLastPacketSeconds > DEFAULT_WAIT_TIMEOUT_SECONDS) {
dueToTimeout = DUE_TO_TIMEOUT_MAYBE;
timeoutMessageBuf = new StringBuilder();
timeoutMessageBuf.append(Messages.getString("CommunicationsException.5"));
timeoutMessageBuf.append(Messages.getString("CommunicationsException.6"));
timeoutMessageBuf.append(Messages.getString("CommunicationsException.7"));
timeoutMessageBuf.append(Messages.getString("CommunicationsException.8"));
}
if (dueToTimeout == DUE_TO_TIMEOUT_TRUE || dueToTimeout == DUE_TO_TIMEOUT_MAYBE) {
exceptionMessageBuf.append(lastPacketReceivedTimeMs != 0
? Messages.getString("CommunicationsException.ServerPacketTimingInfo",
new Object[] { Long.valueOf(timeSinceLastPacketReceivedMs), Long.valueOf(timeSinceLastPacketSentMs) })
: Messages.getString("CommunicationsException.ServerPacketTimingInfoNoRecv", new Object[] { Long.valueOf(timeSinceLastPacketSentMs) }));
if (timeoutMessageBuf != null) {
exceptionMessageBuf.append(timeoutMessageBuf);
}
exceptionMessageBuf.append(Messages.getString("CommunicationsException.11"));
exceptionMessageBuf.append(Messages.getString("CommunicationsException.12"));
exceptionMessageBuf.append(Messages.getString("CommunicationsException.13"));
} else {
//
// Attempt to determine the reason for the underlying exception (we can only make a best-guess here)
//
if (underlyingException instanceof BindException) {
String localSocketAddress = propertySet.getStringProperty(PropertyKey.localSocketAddress).getValue();
boolean interfaceNotAvaliable;
try {
interfaceNotAvaliable = localSocketAddress != null && NetworkInterface.getByName(localSocketAddress) == null;
} catch (SocketException e1) {
interfaceNotAvaliable = false;
}
exceptionMessageBuf.append(interfaceNotAvaliable ? Messages.getString("CommunicationsException.LocalSocketAddressNotAvailable")
: Messages.getString("CommunicationsException.TooManyClientConnections"));
}
}
if (exceptionMessageBuf.length() == 0) {
// We haven't figured out a good reason, so copy it.
exceptionMessageBuf.append(Messages.getString("CommunicationsException.20"));
if (propertySet.getBooleanProperty(PropertyKey.maintainTimeStats).getValue() && !propertySet.getBooleanProperty(PropertyKey.paranoid).getValue()) {
exceptionMessageBuf.append("\n\n");
exceptionMessageBuf.append(lastPacketReceivedTimeMs != 0
? Messages.getString("CommunicationsException.ServerPacketTimingInfo",
new Object[] { Long.valueOf(timeSinceLastPacketReceivedMs), Long.valueOf(timeSinceLastPacketSentMs) })
: Messages.getString("CommunicationsException.ServerPacketTimingInfoNoRecv", new Object[] { Long.valueOf(timeSinceLastPacketSentMs) }));
}
}
return exceptionMessageBuf.toString();
}
再根据上边的两个猜测,找一下对应的源码。
猜想1:autoReconnect 这个参数是有用的
网上很多说这个参数只针对mysql5以下,这个倒是没找到,官网关于这个参数的说明。
https://dev.mysql.com/doc/connectors/en/connector-j-connp-props-high-availability-and-clustering.html#cj-conn-prop_autoReconnect
如果链接断开的情况下,配置了这个参数可以发起重试。如果不配置的话,就会一直报错。
猜想2:HikariCP 会自动重连,tomcat-jdbc不会
可以看到 tomcat-jdbc 执行 invoke,getConnetion一气呵成,在默认配置下,也没有任何校验,如果因为第三方导致的链接断开,就麻烦了。
public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
if (compare(ISCLOSED_VAL,method)) {
return Boolean.valueOf(isClosed());
}
if (compare(CLOSE_VAL,method)) {
if (connection==null)
{
return null; //noop for already closed.
}
PooledConnection poolc = this.connection;
this.connection = null;
pool.returnConnection(poolc);
return null;
} else if (compare(TOSTRING_VAL,method)) {
return this.toString();
} else if (compare(GETCONNECTION_VAL,method) && connection!=null) {
return connection.getConnection();
} else if (method.getDeclaringClass().isAssignableFrom(XAConnection.class)) {
try {
return method.invoke(connection.getXAConnection(),args);
}catch (Throwable t) {
if (t instanceof InvocationTargetException) {
throw t.getCause() != null ? t.getCause() : t;
} else {
throw t;
}
}
}
if (isClosed()) {
throw new SQLException("Connection has already been closed.");
}
if (compare(UNWRAP_VAL,method)) {
return unwrap((Class<?>)args[0]);
} else if (compare(ISWRAPPERFOR_VAL,method)) {
return Boolean.valueOf(this.isWrapperFor((Class<?>)args[0]));
}
try {
PooledConnection poolc = connection;
if (poolc!=null) {
return method.invoke(poolc.getConnection(),args);
} else {
throw new SQLException("Connection has already been closed.");
}
}catch (Throwable t) {
if (t instanceof InvocationTargetException) {
throw t.getCause() != null ? t.getCause() : t;
} else {
throw t;
}
}
}
HikariCP
HikarPool.getConnection 会校验当前是否存活,如果不存活,会关闭当前链接,并会重新创建连接。
问题解决
当天的解决方案比较简单,重启服务就行。不过如果想彻底解决这个问题的话。
-
升级连接池,tomcat-jdbc 升级 druid 或者 HikariCP ,springboot 2.0以上默认是 HikariCP
-
修改配置,如果还使用 tomcat-jdbc :
增加以下配置:
# 连接存活检查 spring.datasource.tomcat.validation-query='SELECT 1' # 指定多少ms执行一次连接校验. spring.datasource.tomcat.validation-interval=3000 # 指定连接校验查询的超时时间 spring.datasource.tomcat.validation-query-timeout=-1
或者连接地址增加 autoReconnect=true
总结
有些问题不经常遇到,即便是预发1比1的环境可能也没办法避免。除了在事故处理的时候可以完善流程,就只能尽量保持架构统一,尤其是老项目。业务侧增加全方面中间件的监控,尤其是存活检测这种。坑是无穷无尽的,踩完了记得遇到相同的跳过去。
相关文章
https://zhuanlan.zhihu.com/p/361130333?utm_id=0
https://blog.csdn.net/zhangxin09/article/details/124901850
https://blog.csdn.net/u014398624/article/details/47746473