Linux Shell Programming: Monitoring Process CPU Usage and Capturing High-CPU Process Data with perf

news2025/12/14 0:54:09

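0. Overview

This article presents a shell script that monitors the CPU usage of a set of processes. When a process's CPU usage exceeds a threshold, the script uses perf to capture detailed profiling data for that process.
The script makes some compromises and optimizations so that it can run on ordinary embedded systems.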

1. Flow of the shell script:

[Figure: flowchart of the script; image not available]

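2. Introduction to perf

perf is a powerful performance analysis tool that ships with the Linux kernel and is used to profile and tune system performance. It supports many event types, such as CPU clock, cache hits and misses, and interrupts.

When a process's CPU usage exceeds the configured threshold (for example, 80%), the script captures detailed performance data for it with the following command: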

perf record -F 99 -e cpu-clock -p $pid -g -o "perf-$process_name.data" -- sleep $perf_sleep_time

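-F 99: sample at 99 Hz (99 samples per second).
-e cpu-clock: sample on the cpu-clock software event, which is timer-based and therefore also works where hardware cycle counters are unavailable.
-p $pid: the ID of the process to sample.
-g: record call stacks, which helps locate the bottleneck.
-o "perf-$process_name.data": write the samples to the given file.
-- sleep $perf_sleep_time: sample for as long as sleep runs, that is, $perf_sleep_time seconds (10 in this script).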

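Capturing detailed performance data from a high-CPU process makes it possible to analyze the bottleneck in depth, find the cause of the high CPU usage, and optimize accordingly.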
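Once a capture exists, the data file can be inspected with perf report. The following is a minimal sketch and not part of the original script: the function name show_perf_report is hypothetical, and it assumes perf is on the PATH and that the file was written by the perf record command shown above.

```shell
# Hypothetical helper (not part of the original script): print a text
# report for a data file written by "perf record -o".
show_perf_report() {
    data_file=$1
    # Fail early if the capture does not exist or is empty
    if [ ! -s "$data_file" ]; then
        echo "no perf data in: $data_file" >&2
        return 1
    fi
    # Fail if perf itself is not installed
    if ! command -v perf >/dev/null 2>&1; then
        echo "perf not found on PATH" >&2
        return 1
    fi
    # -i selects the input file; --stdio prints plain text instead of the TUI
    perf report -i "$data_file" --stdio
}
```

For a capture of the top process, this would be called as show_perf_report perf-top.data.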

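For more background, see:
Using perf (flame graphs) to find hot functions and the system calls with the highest latency
How to measure CPU and memory with perf?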

3. The shell script in detail

Log file configuration:

```
# Log file location
LOGFILE="process_monitor.log"
# Redirect standard output and standard error to the log file (append mode)
exec 1>>"$LOGFILE"
exec 2>>"$LOGFILE"
```

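This part sets up the log file and redirects standard output and standard error to it in append mode; standard input is left unchanged.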
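To see what the two exec lines do, here is a small sketch that applies the same redirection inside a subshell, so the calling shell's own output is untouched. The function name demo_log_redirect and the two messages are illustrative assumptions.

```shell
# Illustrative only: demo_log_redirect is a hypothetical name.
# The parentheses run the body in a subshell, so the redirection
# set up by exec ends when the function returns.
demo_log_redirect() (
    logfile=$1
    exec 1>>"$logfile"   # append stdout to the log
    exec 2>>"$logfile"   # append stderr to the same log
    echo "to stdout"
    echo "to stderr" >&2
)
```

After demo_log_redirect /tmp/demo.log, both lines should appear in the file and nothing on the terminal.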

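Running in the background:
```
# Relaunch in the background unless started with the "background" argument
if [ "$1" != "background" ]; then
    "$0" background &
    exit 0
fi
```
This code checks whether the script was started with the background argument. If not, it relaunches itself in the background with that argument and exits. It does not detect another instance that is already running.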
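The relaunch above does not stop a second copy from being started. If that matters, one common approach is a PID file. The sketch below is an assumption layered on top of the article's script, with a hypothetical function name acquire_pidfile and a caller-chosen path. The check is not atomic, so it only guards against accidental double starts.

```shell
# Hypothetical helper: succeed only if no live process owns the PID file.
acquire_pidfile() {
    pidfile=$1
    if [ -s "$pidfile" ]; then
        old_pid=$(cat "$pidfile")
        # kill -0 sends no signal; it only tests whether that PID
        # exists and may be signalled by the current user
        if kill -0 "$old_pid" 2>/dev/null; then
            echo "already running with PID $old_pid" >&2
            return 1
        fi
    fi
    # No owner, or the owner is gone: record our own PID
    echo "$$" > "$pidfile"
}
```

In the monitoring script this could be called right after the relaunch check, for example acquire_pidfile process_monitor.pid || exit 1, where the file name is again only an example.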

Initializing the last-report-time file:

```
# Initialize last report time file
last_report_time_file="last_report_time"
touch "$last_report_time_file"
```

Creates the file that stores the last report time for each PID.

Function that gets the total CPU time:

```
# Function to get the total CPU time (in clock ticks) from /proc/stat
get_total_cpu_time() {
    awk '/^cpu / {print $2 + $3 + $4 + $5 + $6 + $7 + $8}' /proc/stat
}
```

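Reads the total CPU time from /proc/stat by summing the user, nice, system, idle, iowait, irq, and softirq fields of the aggregate cpu line.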
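As a worked example of the summation, the sketch below runs the same awk expression on a file passed as an argument, so it can be tried on a saved copy of /proc/stat. The function name sum_cpu_fields and the sample numbers are made up for illustration.

```shell
# Hypothetical variant of get_total_cpu_time that reads any file with
# the /proc/stat layout, which makes the arithmetic easy to check.
sum_cpu_fields() {
    awk '/^cpu / {print $2 + $3 + $4 + $5 + $6 + $7 + $8}' "$1"
}

# Made-up sample: user nice system idle iowait irq softirq steal ...
sample=$(mktemp)
printf 'cpu  100 20 30 400 5 6 7 0 0 0\ncpu0 100 20 30 400 5 6 7 0 0 0\n' > "$sample"
sum_cpu_fields "$sample"   # 100+20+30+400+5+6+7 = 568
rm -f "$sample"
```

Only the aggregate line matches the pattern, because the per-core lines start with cpu0, cpu1, and so on, without a space after cpu.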

Function that gets the CPU time of a process:

```
# Function to get the process CPU time (in clock ticks) from /proc/[pid]/stat
get_process_cpu_time() {
    pid=$1
    awk '{print $14 + $15 + $16 + $17}' "/proc/$pid/stat"
}
```

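Reads the CPU time of the given process from /proc/[pid]/stat. Fields 14 to 17 are utime, stime, cutime, and cstime, so the total also includes the time of child processes that have been waited for.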
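The second field of /proc/[pid]/stat is the command name in parentheses, and it can contain spaces, which shifts awk's field numbers. A minimal sketch of one way to tolerate that follows; the function name process_cpu_ticks is hypothetical, and it takes a file path so it can be tried on a saved sample.

```shell
# Hypothetical variant of get_process_cpu_time that tolerates spaces
# in the command name. It takes a file with the /proc/[pid]/stat layout.
process_cpu_ticks() {
    # Drop everything up to the last ") ", i.e. the pid and "(comm)".
    # The remaining fields start at the state field, so utime..cstime
    # shift from positions 14-17 to 12-15.
    sed 's/^.*) //' "$1" | awk '{print $12 + $13 + $14 + $15}'
}
```

For a live process it would be called as process_cpu_ticks "/proc/$pid/stat".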

Function that calculates the CPU usage of a process:

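```
# Function to calculate CPU usage of a process over a 1-second window
calculate_cpu_usage() {
    pid=$1
    prev_process_time=$(get_process_cpu_time "$pid")
    prev_total_time=$(get_total_cpu_time)
    sleep 1
    process_time=$(get_process_cpu_time "$pid")
    total_time=$(get_total_cpu_time)

    # The process may have exited while we were sampling
    if [ -z "$prev_process_time" ] || [ -z "$process_time" ]; then
        echo 0
        return
    fi

    process_delta=$((process_time - prev_process_time))
    total_delta=$((total_time - prev_total_time))

    # Avoid division by zero if the total did not advance
    if [ "$total_delta" -le 0 ]; then
        echo 0
        return
    fi

    cpu_usage=$((100 * process_delta / total_delta))
    echo "$cpu_usage"
}
```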

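Computes the CPU usage of the given process as the process's share of total CPU time over a one-second interval. Because the total in /proc/stat covers all cores, 100% means all cores are fully used; a single-threaded process saturating one core of a four-core system reports about 25%.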
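The percentage itself is plain integer arithmetic on two tick deltas. The sketch below isolates it in a hypothetical helper, cpu_percent, and adds an optional core count that rescales the result so that 100 means one fully used core. That rescaling is an assumption for illustration, not the convention the article's script uses.

```shell
# Hypothetical helper: integer CPU percentage from two tick deltas.
# The optional third argument is a core count used to rescale the
# result (assumption; the article's script does not rescale).
cpu_percent() {
    process_delta=$1
    total_delta=$2
    cores=${3:-1}
    if [ "$total_delta" -le 0 ]; then
        # No ticks elapsed: report 0 rather than divide by zero
        echo 0
    else
        echo $((100 * process_delta * cores / total_delta))
    fi
}

cpu_percent 50 400      # prints 12 (share of all cores, truncated)
cpu_percent 50 400 4    # prints 50 (rescaled to a single core)
```

Integer division truncates, so 12.5% is reported as 12; that is usually acceptable for a threshold check such as the 80% used here.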

Function that loads the last report time:

```
# Function to load the last report time for a PID
load_last_report_time() {
    pid=$1
    grep "^$pid=" "$last_report_time_file" | cut -d'=' -f2
}
```

Loads the last report time for a PID from the file.

Function that saves the last report time:

```
# Function to save the last report time for a PID
save_last_report_time() {
    pid=$1
    time=$2
    sed -i "/^$pid=/d" "$last_report_time_file"
    echo "$pid=$time" >> "$last_report_time_file"
}
```

Saves the last report time for a PID to the file, replacing any previous entry.
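Together, the load and save functions act as a tiny key-value store keyed by PID. The sketch below shows the same idea against a caller-supplied file and without sed -i, for systems whose sed lacks in-place editing. The names kv_load and kv_save and the file argument are hypothetical.

```shell
# Hypothetical helpers; each takes the store file as its first argument.
kv_load() {
    # Print the value stored for the key, or nothing if it is absent
    grep "^$2=" "$1" | cut -d'=' -f2
}

kv_save() {
    store=$1
    key=$2
    value=$3
    # Copy every line except the old entry, then append the new one
    sed "/^$key=/d" "$store" > "$store.tmp"
    echo "$key=$value" >> "$store.tmp"
    mv "$store.tmp" "$store"
}
```

The trailing = in the pattern keeps PID 12 from matching the entry for PID 123, the same property the original grep and sed patterns rely on.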

List of monitored processes:

```
# List of process names to monitor
process_names="top systemd"
```

Defines the list of process names to monitor.

Monitoring loop:

```
while true; do
    current_time=$(date +%s)
    for process_name in $process_names; do
        if [ -n "$DEBUG_ON" ]; then
            echo "Checking process: $process_name"
        fi

        # Find all matching process PIDs (substring match on the ps output)
        pids=$(ps aux | grep "$process_name" | grep -v grep | awk '{print $2}')
        for pid in $pids; do
            # Calculate CPU usage
            cpu_usage=$(calculate_cpu_usage "$pid")
            # Check if CPU usage exceeds $max_cpu_usage%
            if [ "$cpu_usage" -gt "$max_cpu_usage" ]; then
                echo "High CPU usage detected for process '$process_name' (PID: $pid): $cpu_usage%"
                # Load the last report time for this PID
                last_time=$(load_last_report_time "$pid")
                last_time=${last_time:-0}
                time_diff=$((current_time - last_time))

                # Check if the last report was at least 60 seconds ago
                if [ "$time_diff" -ge 60 ]; then
                    echo "time_diff: $time_diff, perf record -F 99 -e cpu-clock -p $pid -g -o perf-$process_name.data -- sleep $perf_sleep_time"
                    ps -p "$pid" -o pid,ppid,cmd,%mem,%cpu >> "$LOGFILE"
                    perf record -F 99 -e cpu-clock -p "$pid" -g -o "perf-$process_name.data" -- sleep "$perf_sleep_time"
                    # Save the last report time for this PID
                    save_last_report_time "$pid" "$current_time"

                    # sleep for 1 second
                    sleep 1
                fi
            else
                if [ -n "$DEBUG_ON" ]; then
                    echo "CPU usage for process '$process_name' (PID: $pid): $cpu_usage%"
                fi
            fi
        done
    done

    # Pause between scans so the loop does not spin when no process matches
    sleep 1
done
```

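This is the main monitoring loop. It repeatedly checks the CPU usage of the listed processes and, when a process exceeds the threshold, captures details with perf, at most once every 60 seconds per PID.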
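The ps and grep pipeline in the loop matches substrings of the whole ps line, so a name such as top can also match unrelated commands. One alternative, sketched below under the assumption that /proc/[pid]/comm is available, compares the command name exactly. The function name pids_by_name and its optional second argument are hypothetical; note that the kernel truncates the name in comm to 15 characters.

```shell
# Hypothetical helper: list PIDs whose command name equals $1 exactly.
# The second argument is the proc root (default /proc), which lets the
# function be tried against a fake directory tree.
pids_by_name() {
    name=$1
    root=${2:-/proc}
    for comm_file in "$root"/[0-9]*/comm; do
        # Skip unmatched globs and processes that exited meanwhile
        [ -r "$comm_file" ] || continue
        if [ "$(cat "$comm_file" 2>/dev/null)" = "$name" ]; then
            pid_dir=${comm_file%/comm}
            echo "${pid_dir##*/}"
        fi
    done
}
```

In the loop this could replace the pipeline with pids=$(pids_by_name "$process_name"), which also avoids depending on the column layout of a particular ps implementation.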

4. Complete script

The optimized shell script, suitable for ordinary embedded systems, is shown below:

```
#!/bin/sh

# This script monitors the CPU usage of a list of processes and records
# a perf profile when one of them exceeds a threshold.

DEBUG_ON=1
# Log file location
LOGFILE="process_monitor.log"

# Redirect standard output and standard error to the log file (append mode)
exec 1>>"$LOGFILE"
exec 2>>"$LOGFILE"

# Relaunch in the background unless started with the "background" argument
if [ "$1" != "background" ]; then
    "$0" background &
    exit 0
fi

# Initialize last report time file
last_report_time_file="last_report_time"
touch "$last_report_time_file"

# Function to get the total CPU time (in clock ticks) from /proc/stat
get_total_cpu_time() {
    awk '/^cpu / {print $2 + $3 + $4 + $5 + $6 + $7 + $8}' /proc/stat
}

# Function to get the process CPU time (in clock ticks) from /proc/[pid]/stat
get_process_cpu_time() {
    pid=$1
    awk '{print $14 + $15 + $16 + $17}' "/proc/$pid/stat"
}

# Function to calculate CPU usage of a process over a 1-second window
calculate_cpu_usage() {
    pid=$1
    prev_process_time=$(get_process_cpu_time "$pid")
    prev_total_time=$(get_total_cpu_time)
    sleep 1
    process_time=$(get_process_cpu_time "$pid")
    total_time=$(get_total_cpu_time)

    # The process may have exited while we were sampling
    if [ -z "$prev_process_time" ] || [ -z "$process_time" ]; then
        echo 0
        return
    fi

    process_delta=$((process_time - prev_process_time))
    total_delta=$((total_time - prev_total_time))

    # Avoid division by zero if the total did not advance
    if [ "$total_delta" -le 0 ]; then
        echo 0
        return
    fi

    cpu_usage=$((100 * process_delta / total_delta))
    echo "$cpu_usage"
}

# Function to load the last report time for a PID
load_last_report_time() {
    pid=$1
    grep "^$pid=" "$last_report_time_file" | cut -d'=' -f2
}

# Function to save the last report time for a PID
save_last_report_time() {
    pid=$1
    time=$2
    sed -i "/^$pid=/d" "$last_report_time_file"
    echo "$pid=$time" >> "$last_report_time_file"
}

# List of process names to monitor
process_names="top systemd"

echo "Monitoring CPU usage for processes: $process_names"

# Duration of each perf capture (seconds) and CPU usage threshold (percent)
perf_sleep_time=10
max_cpu_usage=80

# Monitoring loop
while true; do
    current_time=$(date +%s)
    for process_name in $process_names; do
        if [ -n "$DEBUG_ON" ]; then
            echo "Checking process: $process_name"
        fi

        # Find all matching process PIDs (substring match on the ps output).
        # Alternative for ps implementations that print the PID in the
        # first column:
        # pids=$(ps | grep "$process_name" | grep -v grep | awk '{print $1}')
        pids=$(ps aux | grep "$process_name" | grep -v grep | awk '{print $2}')
        for pid in $pids; do
            # Calculate CPU usage
            cpu_usage=$(calculate_cpu_usage "$pid")
            # Check if CPU usage exceeds $max_cpu_usage%
            if [ "$cpu_usage" -gt "$max_cpu_usage" ]; then
                echo "High CPU usage detected for process '$process_name' (PID: $pid): $cpu_usage%"
                # Load the last report time for this PID
                last_time=$(load_last_report_time "$pid")
                last_time=${last_time:-0}
                time_diff=$((current_time - last_time))

                # Check if the last report was at least 60 seconds ago
                if [ "$time_diff" -ge 60 ]; then
                    echo "time_diff: $time_diff, perf record -F 99 -e cpu-clock -p $pid -g -o perf-$process_name.data -- sleep $perf_sleep_time"
                    ps -p "$pid" -o pid,ppid,cmd,%mem,%cpu >> "$LOGFILE"
                    perf record -F 99 -e cpu-clock -p "$pid" -g -o "perf-$process_name.data" -- sleep "$perf_sleep_time"
                    # Save the last report time for this PID
                    save_last_report_time "$pid" "$current_time"

                    # sleep for 1 second
                    sleep 1
                fi
            else
                if [ -n "$DEBUG_ON" ]; then
                    echo "CPU usage for process '$process_name' (PID: $pid): $cpu_usage%"
                fi
            fi
        done
    done

    # Pause between scans so the loop does not spin when no process matches
    sleep 1
done
```