A Brief Summary of SWT Exceptions on the MTK Platform (1)

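Much of the material in this SWT series comes from Google.

(1) Concepts

SWT stands for Software Watchdog Timeout. To monitor whether SystemServer is running normally, Android adds a watchdog thread that watches the important threads and services inside SystemServer. If any of them stays blocked for more than 60 s, the system is restarted so that it can return to a normal state.

There are two ways to detect blocking:

Check through the monitors that services register;
Post a message to the Handler of each important Looper thread and check whether that thread is blocked;

The System Server process is one of Android's core processes; it hosts the core services that apps depend on. If some of its core services or important threads get stuck, the corresponding functionality breaks.

It is therefore necessary to give the system a chance to reset itself automatically when a core service or core thread gets stuck. For this purpose Google introduced the System Server watchdog mechanism, which monitors whether core services and core threads are stuck.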

[Figure: design block diagram of the watchdog mechanism]

(2) Watchdog initialization and monitoring

(A) Watchdog initialization

//frameworks/base/services/core/java/com/android/server/Watchdog.java

public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }

        return sWatchdog;
    }

private Watchdog() {
        mThread = new Thread(this::run, "watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
        // And the animation thread.
        mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
                "animation thread", DEFAULT_TIMEOUT));
        // And the surface animation thread.
        mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
                "surface animation thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());

        mInterestingJavaPids.add(Process.myPid());

        // See the notes on DEFAULT_TIMEOUT.
        assert DB ||
                DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;

        //mtk enhance
        exceptionHang = ExceptionLog.getInstance();

        mTraceErrorLogger = new TraceErrorLogger();
    }

public void init(Context context, ActivityManagerService activity) {
        mActivity = activity;
        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
        if (exceptionHang != null) {
            exceptionHang.WDTMatterJava(0);
        }
    }

public void start() {
        mThread.start();
    }

Android's Watchdog is a singleton that runs on its own thread; start() and init() are both called while System Server boots.

//frameworks/base/services/java/com/android/server/SystemServer.java

private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
        t.traceBegin("startBootstrapServices");

        // Start the watchdog as early as possible so we can crash the system server
        // if we deadlock during early boot
        t.traceBegin("StartWatchdog");
        final Watchdog watchdog = Watchdog.getInstance();
        watchdog.start();
        t.traceEnd();

        //...

        t.traceBegin("InitWatchdog");
        watchdog.init(mSystemContext, mActivityManagerService);
        t.traceEnd();
}

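(B) Categories of HandlerChecker

During initialization the Watchdog builds many HandlerCheckers, which fall roughly into two categories.

(b_1) Monitor Checker

This checker detects deadlocks that may occur on Monitor objects. Core system services such as AMS, PKMS, and WMS are all Monitor objects.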

private final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
private final HandlerChecker mMonitorChecker;

mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
        "foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);

public void addMonitor(Monitor monitor) {
        synchronized (mLock) {
            mMonitorChecker.addMonitorLocked(monitor);
        }
    }

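Each service's monitor() function merely checks whether the lock being watched can be acquired. These services call Watchdog.getInstance().addMonitor(this) to add themselves (each implements Watchdog.Monitor) to the Watchdog.mMonitorChecker.mMonitors list, and Monitor.monitor() is then called repeatedly on every entry in that list.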

public class ActivityManagerService extends IActivityManager.Stub 
		implements Watchdog.Monitor{

	public void monitor() {
        synchronized (this) { }
    }
}

public final class PowerManagerService extends SystemService
        implements Watchdog.Monitor {

	public void monitor() {
        // Grab and release lock for watchdog monitor to detect deadlocks.
        synchronized (mLock) {
        }
    }
}

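Looking at the monitor() implementations in the individual services, the function turns out to be very simple: it just acquires the corresponding lock. If another thread is holding that lock because of a deadlock or some other stall, the lock cannot be acquired and monitor() itself blocks. The Watchdog relies on exactly this to detect deadlocks.

(b_2) Looper Checker

This checker detects whether a thread's message queue has been busy for too long. The shared foreground thread that the Watchdog uses for its own monitor checks, along with the global UI, I/O, and Display message queues, are all checked. The message queues of some other important threads, such as those of AMS and PKMS, are also added to the Looper Checker; this happens when the corresponding objects are initialized.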
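The lock-probing idea can be tried outside Android with plain Java threads. The following is only a minimal sketch of the principle, not the framework code: the `Monitor` interface, `FakeService`, `isBlocked()`, and `holdLockForever()` are hypothetical names introduced here, and the timeout is shortened to a few hundred milliseconds.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class MonitorSketch {
    /** Hypothetical stand-in for Watchdog.Monitor. */
    interface Monitor {
        void monitor();
    }

    /** Hypothetical service whose monitor() only grabs and releases its lock. */
    static class FakeService implements Monitor {
        final Object lock = new Object();

        @Override
        public void monitor() {
            synchronized (lock) {
                // Empty on purpose: acquiring the lock is the whole check.
            }
        }
    }

    /** Runs monitor() on a helper thread; returns true if it did not return in time. */
    static boolean isBlocked(Monitor m, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread checker = new Thread(() -> {
            m.monitor();
            done.countDown();
        }, "checker");
        checker.setDaemon(true);
        checker.start();
        try {
            return !done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return true;
        }
    }

    /** Simulates a stuck thread that takes the lock and never releases it. */
    static void holdLockForever(Object lock) {
        CountDownLatch held = new CountDownLatch(1);
        Thread hog = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException ignored) {
                    // Leave the synchronized block if interrupted.
                }
            }
        }, "hog");
        hog.setDaemon(true);
        hog.start();
        try {
            held.await(); // return only once the lock is really held
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        FakeService service = new FakeService();
        System.out.println("idle service blocked:  " + isBlocked(service, 500));
        holdLockForever(service.lock);
        System.out.println("stuck service blocked: " + isBlocked(service, 500));
    }
}
```

The sketch uses a fresh helper thread per probe for simplicity; the listing above shows that the real Watchdog instead dispatches its monitor checks on the shared foreground thread.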

private final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();

public void addThread(Handler thread) {
        addThread(thread, DEFAULT_TIMEOUT);
    }

public void addThread(Handler thread, long timeoutMillis) {
        synchronized (mLock) {
            final String name = thread.getLooper().getThread().getName();
            mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
        }
    }


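addThread() stores the main-thread Handlers of PowerManagerService, PackageManagerService, ActivityManagerService, and several other services in the Watchdog.mHandlerCheckers list;
the mMonitorChecker mentioned above is also stored in Watchdog.mHandlerCheckers;
in addition, the Watchdog constructor stores the Handlers of the foreground thread, ui thread, i/o thread, display thread, and main thread in Watchdog.mHandlerCheckers.

The Watchdog keeps checking whether the Looper of each of these threads becomes idle. If a Looper stays busy for the whole timeout, its thread is considered blocked.

(3) How the Watchdog operates

After the initialization described above, every object the watchdog needs to monitor is in place. The next question is how the monitoring is actually done. The Watchdog is itself a thread, so the simplest way to see how it monitors each object is to read its run() method.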
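The same idea can be imitated in plain Java by treating a single-threaded executor as a stand-in for a Looper thread: post a marker task and see whether it runs in time. This is a rough sketch under that assumption; `Checker`, `respondsWithin()`, `newQueue()`, and `sleepQuietly()` are hypothetical names, and an executor appends the marker to the end of its queue, which may differ from how the real HandlerChecker posts its check.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

public class LooperCheckSketch {
    /** Hypothetical checker: posts a marker task and records whether it ran. */
    static class Checker {
        private final ExecutorService queue;
        private final AtomicBoolean completed = new AtomicBoolean(true);

        Checker(ExecutorService queue) {
            this.queue = queue;
        }

        /** Posts a new marker unless the previous one is still pending. */
        void scheduleCheck() {
            if (completed.compareAndSet(true, false)) {
                queue.execute(() -> completed.set(true));
            }
        }

        boolean isCompleted() {
            return completed.get();
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Single daemon worker thread playing the role of a Looper thread. */
    static ExecutorService newQueue() {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "fake-looper");
            t.setDaemon(true);
            return t;
        });
    }

    /** Returns true if the queue ran the marker task within waitMillis. */
    static boolean respondsWithin(ExecutorService queue, long waitMillis) {
        Checker checker = new Checker(queue);
        checker.scheduleCheck();
        long deadline = System.nanoTime() + waitMillis * 1_000_000L;
        while (!checker.isCompleted() && System.nanoTime() < deadline) {
            sleepQuietly(10);
        }
        return checker.isCompleted();
    }

    public static void main(String[] args) {
        ExecutorService queue = newQueue();
        System.out.println("healthy queue responds: " + respondsWithin(queue, 1000));
        queue.execute(() -> sleepQuietly(3000)); // simulate one long-running message
        System.out.println("busy queue responds:    " + respondsWithin(queue, 200));
    }
}
```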

private void run() {
        boolean waitedHalf = false;	// set once the first 30 s (half the timeout) has elapsed
        boolean mSfHang = false;	// set when surfaceflinger is hung
        while (true) {
            List<HandlerChecker> blockedCheckers = Collections.emptyList();
            String subject = "";
            final String sfLog;
            boolean allowRestart = true;		// whether to restart when an SWT occurs
            int debuggerWasConnected = 0;
            boolean doWaitedHalfDump = false;
            final ArrayList<Integer> pids;

            mSfHang = false;
            if (exceptionHang != null) {
                exceptionHang.WDTMatterJava(300);
            }
            synchronized (mLock) {
                long timeout = CHECK_INTERVAL;
                long sfHangTime;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                
                // (1) Schedule all HandlerCheckers
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();

                // (2) Start the periodic check
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        mLock.wait(timeout);
                        // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                // (3) Check the completion state of the HandlerCheckers
                //mtk enhance
                sfHangTime = getSfStatus();
                if (DEBUG) Slog.w(TAG, "**Get SF Time **" + sfHangTime);

                // surfaceflinger has been hung for more than 40 s
                if (sfHangTime > TIME_SF_WAIT * 2) {
                    Slog.v(TAG, "**SF hang Time **" + sfHangTime);
                    mSfHang = true;
                    blockedCheckers = getBlockedCheckersLocked();
                    pids = new ArrayList<>(mInterestingJavaPids);
                    subject = "";
                } else {
                    // Evaluate the checker state
                    final int waitState = evaluateCheckerCompletionLocked();
                    if (waitState == COMPLETED) {		// all checks finished normally; keep checking
                        //after waited_half, system_server not die
                        if (exceptionHang != null && waitedHalf) {
                            exceptionHang.switchFtrace(4);
                        }
                        // The monitors have returned; reset
                        waitedHalf = false;
                        continue;
                    } else if (waitState == WAITING) {		// less than 30 s; keep checking
                        // still waiting but within their configured intervals; back off and recheck
                        continue;
                    } else if (waitState == WAITED_HALF) {		// between 30 and 60 s; dump some info and keep checking
                        if (!waitedHalf) {
                            Slog.i(TAG, "WAITED_HALF");
                            waitedHalf = true;
                            // We've waited half, but we'd need to do the stack trace dump w/o the lock.
                            pids = new ArrayList<>(mInterestingJavaPids);
                            doWaitedHalfDump = true;
                        } else {
                            continue;
                        }
                    } else {

                        // (4) Collect the HandlerCheckers that have timed out
                        // something is overdue!
                        blockedCheckers = getBlockedCheckersLocked();
                        subject = describeCheckersLocked(blockedCheckers);
                        allowRestart = mAllowRestart;
                        pids = new ArrayList<>(mInterestingJavaPids);
                    }
                }
            } // END synchronized (mLock)

            if (doWaitedHalfDump) {
                // We've waited half the deadlock-detection interval.  Pull a stack
                // trace and wait another half.
                if (exceptionHang != null) {
                    exceptionHang.WDTMatterJava(360);
                    exceptionHang.switchFtrace(3);
                }
                ActivityManagerService.dumpStackTraces(pids, null, null,
                        getInterestingNativePids(), null, subject);
                continue;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.

            // (5) Save important logs and decide, based on the settings, whether to restart the system
            Slog.e(TAG, "**SWT happen **" + subject);
            if (exceptionHang != null) {
                exceptionHang.switchFtrace(2);
            }
            sfLog = (mSfHang && subject.isEmpty()) ? "surfaceflinger hang." : "";
            EventLog.writeEvent(EventLogTags.WATCHDOG, sfLog.isEmpty() ? subject : sfLog);
            if (exceptionHang != null) {
                exceptionHang.WDTMatterJava(420);
            }

            final UUID errorId;
            if (mTraceErrorLogger.isAddErrorIdEnabled()) {
                errorId = mTraceErrorLogger.generateErrorId();
                mTraceErrorLogger.addErrorIdToTrace("system_server", errorId);
            } else {
                errorId = null;
            }

            // Log the atom as early as possible since it is used as a mechanism to trigger
            // Perfetto. Ideally, the Perfetto trace capture should happen as close to the
            // point in time when the Watchdog happens as possible.
            FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject);

            long anrTime = SystemClock.uptimeMillis();
            StringBuilder report = new StringBuilder();
            report.append(MemoryPressureUtil.currentPsiState());
            ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
            StringWriter tracesFileException = new StringWriter();
            final File stack = ActivityManagerService.dumpStackTraces(
                    pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
                    tracesFileException, subject);

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(5000);

            processCpuTracker.update();
            report.append(processCpuTracker.printCurrentState(anrTime));
            report.append(tracesFileException.getBuffer());

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            final String localSubject = subject;
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        // If a watched thread hangs before init() is called, we don't have a
                        // valid mActivity. So we can't log the error to dropbox.
                        if (mActivity != null) {
                            mActivity.addErrorToDropBox(
                                    "watchdog", null, "system_server", null, null, null,
                                    sfLog.isEmpty() ? localSubject : sfLog, report.toString(),
                                    stack, null, null, null, errorId);
                        }
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(4000);  // wait up to 4 seconds for it to return.
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (mLock) {
                controller = mController;
            }
            if ((mSfHang == false) && (controller != null)) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                Slog.w(TAG, "*** GOODBYE!");
                if (!Build.IS_USER && isCrashLoopFound()
                        && !WatchdogProperties.should_ignore_fatal_count().orElse(false)) {
                    breakCrashLoop();
                }
                exceptionHang.WDTMatterJava(330);
                if (mSfHang) {
                    Slog.w(TAG, "SF hang!");
                    if (getSfReboot() > 3) {
                        Slog.w(TAG, "SF hang reboot time larger than 3 time, reboot device!");
                        rebootSystem("Maybe SF driver hang, reboot device.");
                    } else {
                        setSfReboot();
                    }
                    Slog.v(TAG, "killing surfaceflinger for surfaceflinger hang");
                    String[] sf = new String[] {"/system/bin/surfaceflinger"};
                    int[] pid_sf = Process.getPidsForCommands(sf);
                    if (pid_sf[0] > 0) {
                        Process.killProcess(pid_sf[0]);
                    }
                    Slog.v(TAG, "kill surfaceflinger end");
                } else {
                    Process.killProcess(Process.myPid());
                }
                System.exit(10);
            }

            waitedHalf = false;
        }
    }

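The main logic of the code above is as follows:

Once the Watchdog is running it loops forever, calling scheduleCheckLocked() on every HandlerChecker in turn;
After scheduling the HandlerCheckers it periodically checks for timeouts; the interval between checks is set by the CHECK_INTERVAL constant, which is 30 seconds;
On every check it calls evaluateCheckerCompletionLocked() to evaluate the completion state of the HandlerCheckers:
(a) COMPLETED means the check has finished
(b) WAITING and WAITED_HALF mean the check is still pending but has not timed out
(c) OVERDUE means the check has timed out. The default timeout is 1 minute, but a monitored object can pass in its own value; for example, the timeout of the PKMS HandlerChecker is 10 minutes
If the timeout has expired and some HandlerChecker is still unfinished (OVERDUE), getBlockedCheckersLocked() collects the blocked HandlerCheckers and a description of them is generated;
Logs are saved, including runtime stack traces; these logs are the key evidence for resolving Watchdog issues. If it is decided that the system_server process must be killed, signal 9 is sent to the current process (system_server);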
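The four states can be illustrated with a small plain-Java sketch. It follows the thresholds given in the comments of run() above (under half the timeout is WAITING, from half up to the full timeout is WAITED_HALF, anything longer is OVERDUE). The numeric values of the constants, the method names `stateOf()` and `evaluate()`, and the rule that the overall result is the worst state among all checkers are assumptions made for this illustration, not a copy of the framework code.

```java
public class CompletionStateSketch {
    // Assumed ordering: a larger value means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** State of one checker, given how long its check has been pending. */
    static int stateOf(boolean completed, long elapsedMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (elapsedMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (elapsedMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** Overall state: the worst state among all checkers (assumption). */
    static int evaluate(boolean[] completed, long[] elapsedMillis, long[] waitMaxMillis) {
        int worst = COMPLETED;
        for (int i = 0; i < completed.length; i++) {
            worst = Math.max(worst, stateOf(completed[i], elapsedMillis[i], waitMaxMillis[i]));
        }
        return worst;
    }

    public static void main(String[] args) {
        long oneMinute = 60_000L;
        long tenMinutes = 600_000L;
        // A checker with the default 1 minute timeout, pending for 45 s.
        System.out.println(stateOf(false, 45_000L, oneMinute));
        // A checker with a 10 minute timeout, pending for the same 45 s.
        System.out.println(stateOf(false, 45_000L, tenMinutes));
        // Both together: the overall result is the worse of the two.
        System.out.println(evaluate(
                new boolean[] {false, false},
                new long[] {45_000L, 45_000L},
                new long[] {oneMinute, tenMinutes}));
    }
}
```

With per-checker timeouts, a checker that has a 10 minute limit stays in WAITING long after a default checker would already have reached WAITED_HALF or OVERDUE.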