背景
启动 xxl-job admin 端,日志开始报获取锁超时
2024-07-11 15:12:45,375 ERROR [xxl-job, admin JobScheduleHelper#scheduleThread] c.x.j.a.c.thread.JobScheduleHelper [JobScheduleHelper.java : 155] >>>>>>>>>>> xxl-job, JobScheduleHelper#scheduleThread error:{}
com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Lock wait timeout exceeded; try restarting transaction
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:124)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:912)
at com.mysql.cj.jdbc.ClientPreparedStatement.execute(ClientPreparedStatement.java:354)
at com.zaxxer.hikari.pool.ProxyPreparedStatement.execute(ProxyPreparedStatement.java:44)
at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.execute(HikariProxyPreparedStatement.java)
at com.xxl.job.admin.core.thread.JobScheduleHelper$1.run(JobScheduleHelper.java:74)
at java.lang.Thread.run(Thread.java:748)
2024-07-11 15:13:35,464 ERROR [xxl-job, admin JobScheduleHelper#scheduleThread] c.x.j.a.c.thread.JobScheduleHelper [JobScheduleHelper.java : 155] >>>>>>>>>>> xxl-job, JobScheduleHelper#scheduleThread error:{}
com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Lock wait timeout exceeded; try restarting transaction
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:124)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:912)
at com.mysql.cj.jdbc.ClientPreparedStatement.execute(ClientPreparedStatement.java:354)
at com.zaxxer.hikari.pool.ProxyPreparedStatement.execute(ProxyPreparedStatement.java:44)
at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.execute(HikariProxyPreparedStatement.java)
at com.xxl.job.admin.core.thread.JobScheduleHelper$1.run(JobScheduleHelper.java:74)
at java.lang.Thread.run(Thread.java:748)
分析
定位 admin 端代码如下
while (!scheduleThreadToStop) {
// Scan Job
long start = System.currentTimeMillis();
Connection conn = null;
Boolean connAutoCommit = null;
PreparedStatement preparedStatement = null;
boolean preReadSuc = true;
try {
conn = XxlJobAdminConfig.getAdminConfig().getDataSource().getConnection();
connAutoCommit = conn.getAutoCommit();
conn.setAutoCommit(false);
preparedStatement = conn.prepareStatement( "select * from xxl_job_lock where lock_name = 'schedule_lock' for update" );
preparedStatement.execute();
// tx start
...
} catch (Exception e) {
if (!scheduleThreadToStop) {
logger.error(">>>>>>>>>>> xxl-job, JobScheduleHelper#scheduleThread error:{}", e);
}
} finally {
// commit
if (conn != null) {
try {
conn.commit();
} catch (SQLException e) {
if (!scheduleThreadToStop) {
logger.error(e.getMessage(), e);
}
}
try {
conn.setAutoCommit(connAutoCommit);
} catch (SQLException e) {
if (!scheduleThreadToStop) {
logger.error(e.getMessage(), e);
}
}
try {
conn.close();
} catch (SQLException e) {
if (!scheduleThreadToStop) {
logger.error(e.getMessage(), e);
}
}
}
// close PreparedStatement
if (null != preparedStatement) {
try {
preparedStatement.close();
} catch (SQLException e) {
if (!scheduleThreadToStop) {
logger.error(e.getMessage(), e);
}
}
}
}
long cost = System.currentTimeMillis()-start;
// Wait seconds, align second
if (cost < 1000) { // scan-overtime, not wait
try {
// pre-read period: success > scan each second; fail > skip this period;
TimeUnit.MILLISECONDS.sleep((preReadSuc?1000:PRE_READ_MS) - System.currentTimeMillis()%1000);
} catch (InterruptedException e) {
if (!scheduleThreadToStop) {
logger.error(e.getMessage(), e);
}
}
}
这段代码在任务执行之前会通过 SELECT * FROM xxl_job_lock WHERE lock_name = 'schedule_lock' FOR UPDATE; 获取行锁,避免在 admin 集群的情况下重复调度,且这个任务的周期在 5 秒以内。debug 后发现每次都会阻塞在获取行锁,并且 50 秒报一次错误日志
SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';
发现获取锁超时时间确实是 50 秒,也就是说这个周期任务每次都会卡在获取行锁阻塞超时,接下来查看是什么阻塞了每次获取锁的事务
SELECT * FROM INFORMATION_SCHEMA.INNODB_TRX;
这条 SQL 是查当前事务,能查出两条记录,一条是 LOCK_WAITS 状态的 SELECT * FROM xxl_job_lock WHERE lock_name = 'schedule_lock' FOR UPDATE; 语句,即被阻塞的事务;另一条是 RUNNING 状态的长事务,从 2 个小时前开始。初步判断是因为这条长事务
等到想查这条长事务具体的 SQL 时,却发现那条长事务已经提交或回滚了,程序又恢复正常了,晕。。。
最后整理一下定位此类问题的 SQL
SELECT * FROM sys.innodb_lock_waits;
SELECT * FROM INFORMATION_SCHEMA.INNODB_TRX;
SELECT * FROM performance_schema.data_locks;
SHOW FULL PROCESSLIST;
select thread_id,processlist_id from performance_schema.threads where processlist_id=1352;
select * from performance_schema.events_statements_history where thread_id=4767;
select * from performance_schema.events_statements_current;
社区
发现 xxl-job github 也有类似 issue:
https://github.com/xuxueli/xxl-job/issues?q=Lock+wait+timeout+exceeded%3B