前言

我们排查ANR问题的时候，会发现有时候anr文件中捕获的堆栈信息，并不准确，而且经常会打印下面这样的堆栈：

"main" prio=5 tid=1 Native
  | group="main" sCount=1 ucsCount=0 flags=1 obj=0x71ac6f78 self=0xb400007d1e23a7b0
  | sysTid=6066 nice=-10 cgrp=top-app sched=0/0 handle=0x7e6554b4f8
  | state=S schedstat=( 808244194 250487416 988 ) utm=49 stm=31 core=3 HZ=100
  | stack=0x7fd5329000-0x7fd532b000 stackSize=8188KB
  | held mutexes=
  native: #00 pc 00000000000a33b8  /apex/com.android.runtime/lib64/bionic/libc.so (__epoll_pwait+8) (BuildId: 01331f74b0bb2cb958bdc15282b8ec7b)
  native: #01 pc 0000000000010dfc  /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+176) (BuildId: 5a0d720732600c94ad8354a1188e9f52)
  native: #02 pc 000000000015a56c  /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44) (BuildId: a31474ac581b716d4588f8c97eb06009)
  at android.os.MessageQueue.nativePollOnce(Native method)
  at android.os.MessageQueue.next(MessageQueue.java:335)
  at android.os.Looper.loopOnce(Looper.java:161)
  at android.os.Looper.loop(Looper.java:288)
  at android.app.ActivityThread.main(ActivityThread.java:7872)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:548)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:936)

很明显，这种堆栈代表主线程处于空闲状态，既然空闲为什么会ANR呢？这就涉及到本文要讲的，发生了漂移。简单来说，就是捕获的堆栈，并不是发生ANR时的堆栈，因为堆栈已经漂移了。

堆栈漂移介绍

为什么会出现堆栈的漂移呢？这个其实很好理解，系统不可能时时刻刻去观察所有应用的堆栈状态，等到系统意识到某个应用发生ANR的时候，这时候才能开始去采集。但是对于应用侧来说，不可能等到系统采集了之后再继续执行，而且对于APP来说，收到通知之前，也不知道自身发生了ANR了。所以当应用收到系统的通知，开始对堆栈进行采集的时候，往往已经不是发生ANR时的那个堆栈了。

那么，既然会存在这个时间片上的不准确，产生堆栈漂移，那么这个漂移时间是多久呢？知道这一点，对于我们分析ANR是很有帮助的，我们可以反向去查找对应时间点的主线程任务。所以，接下来我们就通过对源码的阅读，一步一步的了解一下，这个漂移是如何产生的，漂移时间是多久。

原理分析

之前讲解ANR流程的时候，说到最终所有的ANR，都会走到ProcessErrorStateRecord的appNotResponding方法（低版本是AppErrors中的appNotResponding方法），典型的功能内聚设计。接下来，我们就简单分析一下这个流程。

本文以android13的源码为例。

class ProcessErrorStateRecord {
    void appNotResponding(){
        ArrayList<Integer> firstPids = new ArrayList<>(5);
        SparseArray<Boolean> lastPids = new SparseArray<>(20);
        ...
        final int pid = mApp.getPid();
        ...
        //这里打印的05-11 16:41:24.073 1141 1156 I am_anr : [0,3416,com.xt.client,550026821,executing service com.xt.client/.TestService]
        EventLog.writeEvent(EventLogTags.AM_ANR, mApp.userId, pid, mApp.processName,mApp.info.flags, annotation);
        //
        firstPids.add(pid);
        if (!isSilentAnr && !onlyDumpSelf) {
            //往firstPids中添加收集清单
        }
        //构建ANR异常描述
        StringBuilder info = new StringBuilder();
        info.setLength(0);
        info.append("ANR in ").append(mApp.processName);
        ...
        //采集堆栈
        File tracesFile = ActivityManagerService.dumpStackTraces(firstPids,isSilentAnr ? null : processCpuTracker, isSilentAnr ? null : lastPids, nativePids, tracesFileException, offsets, annotation, criticalEventLog);
        //把ANR异常记录到dropbox中
        mService.addErrorToDropBox()
        //生成
        if (mApp.getWindowProcessController().appNotResponding(info.toString(),...
        if (mService.mUiHandler != null) {
            //延时5S显示ANR弹窗
            mService.mUiHandler.sendMessageDelayed(msg, anrDialogDelayMs);
        }
    }
}

appNotResponding的整个流程，并且没有什么耗时的点，主要是采集堆栈的dumpStackTraces方法。

public class ActivityManagerService{
    static File dumpStackTraces(...) {
        if (processCpuTracker != null) {
            processCpuTracker.init();
            //首先，sleep200毫秒，用于CPU使用率采集
            Thread.sleep(200);
        }
        processCpuTracker.update();
        //CPU负载最高的5个也加入采集队列
        extraPids = new ArrayList<>();
        for (int i = 0; i < N && extraPids.size() < 5; i++) {
            extraPids.add(stats.pid);
        }
        //生成trace文件
        tracesFile = createAnrDumpFile(tracesDir);
        Pair<Long, Long> offsets = dumpStackTraces(tracesFile.getAbsolutePath(), firstPids, nativePids, extraPids);
        return tracesFile;
    }
}

可以看到Thread.sleep(200);，线程休眠200毫秒，这是第一个耗时点。

另外system_server对CPU进行分析，以及获取一些必要的状态，也是需要耗费时间的。

接下来我们看一下dumpStackTraces方法，其主要功能就是通过信号量挨个通知被采集的进行已经上报：

public static Pair<Long, Long> dumpStackTraces(String tracesFile, ArrayList<Integer> firstPids,
        ArrayList<Integer> nativePids, ArrayList<Integer> extraPids) {
    //捕获最近使用进程的堆栈
    if (firstPids != null) {
        for (int i = 0; i < num; i++) {
            final int pid = firstPids.get(i);
            final long timeTaken = dumpJavaTracesTombstoned(pid, tracesFile,remainingTime);
        }
    }
    //捕获native进程的堆栈
    if (nativePids != null) {
        for (int pid : nativePids) {
            Debug.dumpNativeBacktraceToFileTimeout(pid, tracesFile, (int) (nativeDumpTimeoutMs / 1000));
        }
    }
    //捕获高负载应用的堆栈
    if (extraPids != null) {
        for (int pid : extraPids) {
            final long timeTaken = dumpJavaTracesTombstoned(pid, tracesFile, remainingTime);
        }
    }
}

很明显，利用Debug采集其他进程的堆栈信息，也属于一个耗时操作。