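Much of the material in this SWT series comes from Google.
(1) Concepts
SWT stands for Software Watchdog Timeout. To monitor whether SystemServer is running normally, Android adds a watchdog (SWT) thread that monitors the important threads and services inside SystemServer. If one of them is found to be blocked for more than 60s, the system is restarted so that it can return to a normal state.
There are two ways to detect blocking:
- call the monitors registered by the services;
- post a message through a Handler to each important Looper thread and check whether it gets handled;
System Server is one of Android's core processes and hosts the core services that apps depend on. If some of its core services or important threads get stuck, the corresponding features stop working.
It is therefore necessary to give the system a chance to reset itself when core services or core threads get stuck. For this purpose Google introduced the System Server watchdog mechanism, which monitors whether core services and core threads are stuck.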
Design block diagram (image not preserved in this text).
(2) Watchdog initialization and monitoring
(A) Watchdog initialization
//frameworks/base/services/core/java/com/android/server/Watchdog.java
public static Watchdog getInstance() {
    if (sWatchdog == null) {
        sWatchdog = new Watchdog();
    }
    return sWatchdog;
}
private Watchdog() {
    mThread = new Thread(this::run, "watchdog");
    // Initialize handler checkers for each common thread we want to check. Note
    // that we are not currently checking the background thread, since it can
    // potentially hold longer running operations with no guarantees about the timeliness
    // of operations there.
    // The shared foreground thread is the main checker. It is where we
    // will also dispatch monitor checks and do other work.
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread. We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));
    // And the animation thread.
    mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
            "animation thread", DEFAULT_TIMEOUT));
    // And the surface animation thread.
    mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
            "surface animation thread", DEFAULT_TIMEOUT));
    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());
    mInterestingJavaPids.add(Process.myPid());
    // See the notes on DEFAULT_TIMEOUT.
    assert DB ||
            DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
    //mtk enhance
    exceptionHang = ExceptionLog.getInstance();
    mTraceErrorLogger = new TraceErrorLogger();
}
public void init(Context context, ActivityManagerService activity) {
    mActivity = activity;
    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
    if (exceptionHang != null) {
        exceptionHang.WDTMatterJava(0);
    }
}
public void start() {
    mThread.start();
}
Android's Watchdog is a singleton that owns its own thread; it is started and initialized while System Server boots.
//frameworks/base/services/java/com/android/server/SystemServer.java
private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
    t.traceBegin("startBootstrapServices");
    // Start the watchdog as early as possible so we can crash the system server
    // if we deadlock during early boot
    t.traceBegin("StartWatchdog");
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.start();
    t.traceEnd();
    //...
    t.traceBegin("InitWatchdog");
    watchdog.init(mSystemContext, mActivityManagerService);
    t.traceEnd();
}
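(B) Types of HandlerChecker
During initialization, Watchdog builds a number of HandlerCheckers, which fall roughly into two categories.
(b_1) Monitor Checker
Used to detect deadlocks that may occur in Monitor objects. Core system services such as AMS, PKMS, and WMS are all Monitor objects.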
private final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
private final HandlerChecker mMonitorChecker;
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
        "foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
public void addMonitor(Monitor monitor) {
    synchronized (mLock) {
        mMonitorChecker.addMonitorLocked(monitor);
    }
}
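Each service's monitor() function merely checks whether the lock it guards can be acquired. A service adds itself (as an implementation of Watchdog.Monitor) to the Watchdog.mMonitorChecker.mMonitors list by calling Watchdog.getInstance().addMonitor(this), and the monitor checker keeps calling Monitor.monitor() on every entry of that list.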
public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor{
    public void monitor() {
        synchronized (this) { }
    }
}
public final class PowerManagerService extends SystemService
        implements Watchdog.Monitor {
    public void monitor() {
        // Grab and release lock for watchdog monitor to detect deadlocks.
        synchronized (mLock) {
        }
    }
}
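Looking at the monitor() implementations in the individual services, the function is very simple: it just acquires the corresponding lock. If a thread is deadlocked, or blocked for some other reason while holding that lock, the lock cannot be acquired and the call to monitor() blocks as well. Watchdog relies on exactly this to detect deadlocks.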
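The lock-probing idea can be tried outside Android with plain Java threads. The following is only a minimal sketch under stated assumptions: LockMonitor, DemoService, and MonitorProbe are hypothetical names invented for this illustration and are not framework classes, and the real Watchdog runs its monitors on the foreground thread rather than on a throwaway thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class MonitorProbe {
    /** Returns true if m.monitor() returned within timeoutMillis. */
    static boolean completesWithin(LockMonitor m, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread checker = new Thread(() -> {
            m.monitor();          // blocks for as long as the lock is held elsewhere
            done.countDown();
        }, "monitor-checker");
        checker.setDaemon(true);  // a stuck checker must not keep the JVM alive
        checker.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        DemoService service = new DemoService();
        System.out.println("lock free, monitor returned: " + completesWithin(service, 1000));
        CountDownLatch release = new CountDownLatch(1);
        service.holdLockUntil(release);
        System.out.println("lock held, monitor returned: " + completesWithin(service, 200));
        release.countDown();
    }
}

/** Hypothetical stand-in for Watchdog.Monitor (not the framework interface). */
interface LockMonitor {
    void monitor();
}

/** Hypothetical service that guards its state with one lock. */
final class DemoService implements LockMonitor {
    private final Object lock = new Object();

    /** Simulates a stuck operation: a daemon thread holds the lock until release fires. */
    void holdLockUntil(CountDownLatch release) {
        CountDownLatch acquired = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                acquired.countDown();
                awaitQuietly(release);
            }
        }, "stuck-worker");
        holder.setDaemon(true);
        holder.start();
        awaitQuietly(acquired); // return only once the lock is really held
    }

    @Override
    public void monitor() {
        synchronized (lock) { } // grab and release, like the services above
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

While the lock is free the probe returns almost immediately; once another thread holds the lock, monitor() cannot return and the probe times out, which is the signal a watchdog can act on.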
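(b_2) Looper Checker
Used to check whether a thread's message queue has been busy for too long. The global message queues that Watchdog creates checkers for in its constructor, such as UI, I/O, and Display, are all checked. In addition, the message queues of some important threads, such as those of AMS and PKMS, are also added to the Looper Checkers; this happens when the corresponding objects are initialized.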
private final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
public void addThread(Handler thread) {
    addThread(thread, DEFAULT_TIMEOUT);
}
public void addThread(Handler thread, long timeoutMillis) {
    synchronized (mLock) {
        final String name = thread.getLooper().getThread().getName();
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
    }
}
- addThread() saves the Handlers of the main threads of services such as PowerManagerService, PackageManagerService, and ActivityManagerService into the Watchdog.mHandlerCheckers list;
- the mMonitorChecker mentioned above is also stored in Watchdog.mHandlerCheckers;
- the Handlers of the foreground thread, ui thread, i/o thread, display thread, and main thread are stored in Watchdog.mHandlerCheckers as well;
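Watchdog keeps checking whether the Loopers of these threads are idle, that is, still able to process messages. If a Looper stays busy beyond the timeout, the thread must be blocked.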
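The queue-probing idea can also be sketched in plain Java. The example below is an illustration under assumptions, not the framework code: a single-threaded ExecutorService stands in for a Looper thread, and QueueProbe with its methods is a hypothetical helper. It posts a no-op task and reports whether the worker reached it in time. The real HandlerChecker works on an Android Handler, which this sketch does not model.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class QueueProbe {
    /** Creates one worker thread with a FIFO task queue (stand-in for a Looper thread). */
    static ExecutorService newWorker(String name) {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, name);
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
    }

    /** Simulates a message handler that does not return until gate is opened. */
    static void blockWorker(ExecutorService worker, CountDownLatch gate) {
        worker.execute(() -> {
            try {
                gate.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    /**
     * Posts a no-op task and reports whether it ran within timeoutMillis.
     * A worker that is stuck in an earlier task never reaches the probe.
     */
    static boolean isResponsive(ExecutorService worker, long timeoutMillis) {
        CountDownLatch ran = new CountDownLatch(1);
        worker.execute(ran::countDown);
        try {
            return ran.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService worker = newWorker("demo-looper");
        System.out.println("idle worker responsive: " + isResponsive(worker, 1000));
        CountDownLatch gate = new CountDownLatch(1);
        blockWorker(worker, gate);
        System.out.println("blocked worker responsive: " + isResponsive(worker, 200));
        gate.countDown();
        worker.shutdown();
    }
}
```

The probe never inspects the worker directly; it only observes whether a message posted to the queue is handled before the deadline, which is why one stuck handler is enough to make the whole thread look blocked.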
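(3) How Watchdog operates
After the initialization above, every object the watchdog needs to monitor is in place. The next question is how it actually monitors them. Watchdog is itself a thread, so the simplest way to find out is to read its run() method.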
private void run() {
    boolean waitedHalf = false; // set once the first 30s (half of the timeout) has elapsed
    boolean mSfHang = false; // set when SurfaceFlinger is hung
    while (true) {
        List<HandlerChecker> blockedCheckers = Collections.emptyList();
        String subject = "";
        final String sfLog;
        boolean allowRestart = true; // whether to restart when an SWT occurs
        int debuggerWasConnected = 0;
        boolean doWaitedHalfDump = false;
        final ArrayList<Integer> pids;
        mSfHang = false;
        if (exceptionHang != null) {
            exceptionHang.WDTMatterJava(300);
        }
        synchronized (mLock) {
            long timeout = CHECK_INTERVAL;
            long sfHangTime;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            // (1) Schedule all HandlerCheckers
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }
            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }
            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            long start = SystemClock.uptimeMillis();
            // (2) Start the periodic check
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    mLock.wait(timeout);
                    // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            // (3) Check the completion state of the HandlerCheckers
            //mtk enhance
            sfHangTime = getSfStatus();
            if (DEBUG) Slog.w(TAG, "**Get SF Time **" + sfHangTime);
            // SurfaceFlinger has been hung for more than 40s
            if (sfHangTime > TIME_SF_WAIT * 2) {
                Slog.v(TAG, "**SF hang Time **" + sfHangTime);
                mSfHang = true;
                blockedCheckers = getBlockedCheckersLocked();
                pids = new ArrayList<>(mInterestingJavaPids);
                subject = "";
            } else {
                // Evaluate the state
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) { // check finished normally, keep checking
                    //after waited_half, system_server not die
                    if (exceptionHang != null && waitedHalf) {
                        exceptionHang.switchFtrace(4);
                    }
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) { // less than 30s so far, keep checking
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) { // between 30s and 60s: dump some info and keep checking
                    if (!waitedHalf) {
                        Slog.i(TAG, "WAITED_HALF");
                        waitedHalf = true;
                        // We've waited half, but we'd need to do the stack trace dump w/o the lock.
                        pids = new ArrayList<>(mInterestingJavaPids);
                        doWaitedHalfDump = true;
                    } else {
                        continue;
                    }
                } else {
                    // (4) Collect the HandlerCheckers that timed out
                    // something is overdue!
                    blockedCheckers = getBlockedCheckersLocked();
                    subject = describeCheckersLocked(blockedCheckers);
                    allowRestart = mAllowRestart;
                    pids = new ArrayList<>(mInterestingJavaPids);
                }
            }
        } // END synchronized (mLock)
        if (doWaitedHalfDump) {
            // We've waited half the deadlock-detection interval. Pull a stack
            // trace and wait another half.
            if (exceptionHang != null) {
                exceptionHang.WDTMatterJava(360);
                exceptionHang.switchFtrace(3);
            }
            ActivityManagerService.dumpStackTraces(pids, null, null,
                    getInterestingNativePids(), null, subject);
            continue;
        }
        // If we got here, that means that the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        // (5) Save important logs and decide, based on the settings, whether to restart the system
        Slog.e(TAG, "**SWT happen **" + subject);
        if (exceptionHang != null) {
            exceptionHang.switchFtrace(2);
        }
        sfLog = (mSfHang && subject.isEmpty()) ? "surfaceflinger hang." : "";
        EventLog.writeEvent(EventLogTags.WATCHDOG, sfLog.isEmpty() ? subject : sfLog);
        if (exceptionHang != null) {
            exceptionHang.WDTMatterJava(420);
        }
        final UUID errorId;
        if (mTraceErrorLogger.isAddErrorIdEnabled()) {
            errorId = mTraceErrorLogger.generateErrorId();
            mTraceErrorLogger.addErrorIdToTrace("system_server", errorId);
        } else {
            errorId = null;
        }
        // Log the atom as early as possible since it is used as a mechanism to trigger
        // Perfetto. Ideally, the Perfetto trace capture should happen as close to the
        // point in time when the Watchdog happens as possible.
        FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject);
        long anrTime = SystemClock.uptimeMillis();
        StringBuilder report = new StringBuilder();
        report.append(MemoryPressureUtil.currentPsiState());
        ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
        StringWriter tracesFileException = new StringWriter();
        final File stack = ActivityManagerService.dumpStackTraces(
                pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
                tracesFileException, subject);
        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        SystemClock.sleep(5000);
        processCpuTracker.update();
        report.append(processCpuTracker.printCurrentState(anrTime));
        report.append(tracesFileException.getBuffer());
        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
        doSysRq('w');
        doSysRq('l');
        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked. (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        final String localSubject = subject;
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
            public void run() {
                // If a watched thread hangs before init() is called, we don't have a
                // valid mActivity. So we can't log the error to dropbox.
                if (mActivity != null) {
                    mActivity.addErrorToDropBox(
                            "watchdog", null, "system_server", null, null, null,
                            sfLog.isEmpty() ? localSubject : sfLog, report.toString(),
                            stack, null, null, null, errorId);
                }
            }
        };
        dropboxThread.start();
        try {
            dropboxThread.join(4000); // wait up to 4 seconds for it to return.
        } catch (InterruptedException ignored) {}
        IActivityController controller;
        synchronized (mLock) {
            controller = mController;
        }
        if ((mSfHang == false) && (controller != null)) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }
        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            if (!Build.IS_USER && isCrashLoopFound()
                    && !WatchdogProperties.should_ignore_fatal_count().orElse(false)) {
                breakCrashLoop();
            }
            exceptionHang.WDTMatterJava(330);
            if (mSfHang) {
                Slog.w(TAG, "SF hang!");
                if (getSfReboot() > 3) {
                    Slog.w(TAG, "SF hang reboot time larger than 3 time, reboot device!");
                    rebootSystem("Maybe SF driver hang, reboot device.");
                } else {
                    setSfReboot();
                }
                Slog.v(TAG, "killing surfaceflinger for surfaceflinger hang");
                String[] sf = new String[] {"/system/bin/surfaceflinger"};
                int[] pid_sf = Process.getPidsForCommands(sf);
                if (pid_sf[0] > 0) {
                    Process.killProcess(pid_sf[0]);
                }
                Slog.v(TAG, "kill surfaceflinger end");
            } else {
                Process.killProcess(Process.myPid());
            }
            System.exit(10);
        }
        waitedHalf = false;
    }
}
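The main logic of the code above is as follows:
- Once Watchdog is running, it enters an infinite loop and calls scheduleCheckLocked() on every HandlerChecker in turn;
- After scheduling the HandlerCheckers, it checks for timeouts periodically. The interval between checks is set by the CHECK_INTERVAL constant, which is 30 seconds;
- Each check calls evaluateCheckerCompletionLocked() to evaluate the completion state of the HandlerCheckers:
(a) COMPLETED means the check has finished
(b) WAITING and WAITED_HALF mean the check is still pending but has not timed out
(c) OVERDUE means the check has timed out. The default timeout is 1 minute, but a monitored object can pass its own value; for example, the HandlerChecker of PKMS uses a 10-minute timeout
- If the timeout expires and a HandlerChecker is still unfinished (OVERDUE), getBlockedCheckersLocked() is called to collect the blocked HandlerCheckers and generate a description of them;
- Logs are saved, including runtime stack traces; these logs are the main evidence for analyzing Watchdog issues. If the system_server process has to be killed, signal 9 is sent to the current process (system_server).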
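The four states can be illustrated with a small pure function. This is a sketch under assumptions rather than the framework implementation: CompletionStateSketch, stateOf, and worstOf are hypothetical names, the numeric ordering of the constants is chosen here so that a larger value means a worse state, and the half-timeout boundary follows the 30s/60s split described in the code comments above.

```java
final class CompletionStateSketch {
    // Ordered so that a larger value means a worse state, which lets the
    // overall result be computed with Math.max. (Ordering assumed for this sketch.)
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /**
     * State of a single checker.
     *
     * @param completed     whether the probe posted to the thread has already run
     * @param elapsedMillis time since the probe was scheduled
     * @param waitMaxMillis timeout configured for this checker
     */
    static int stateOf(boolean completed, long elapsedMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (elapsedMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (elapsedMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** Overall state: the worst state among all checkers. */
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }

    public static void main(String[] args) {
        long defaultTimeout = 60_000L;  // 1 minute, the default mentioned above
        long longTimeout = 600_000L;    // 10 minutes, as for the PKMS checker
        int a = stateOf(false, 45_000L, defaultTimeout);
        int b = stateOf(false, 45_000L, longTimeout);
        System.out.println("default-timeout checker after 45s: " + a);
        System.out.println("10-minute checker after 45s: " + b);
        System.out.println("overall: " + worstOf(a, b));
    }
}
```

Because the overall state is the worst individual state, a single overdue checker is enough to move the loop into the log-collection and kill path, while a checker with a longer timeout can still be WAITING at the same moment.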