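Redis Performance Troubleshooting Techniques

1. Locating the problem

When a business system sees slow responses or timeout exceptions while accessing Redis, there are three likely causes: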

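Client issue: With the Java Lettuce SDK, high CPU usage in the business application can prevent the client from processing Redis replies in time, which also surfaces as a Command Timeout exception. Check the application's CPU usage.
Network issue: Run ping from the client host to verify whether there is network latency.

Redis performance issue: Use the redis-cli tool to determine whether Redis itself has a performance problem.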

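# Check whether Redis as a whole has a performance problem. When run on the Redis host itself, avg should normally be below 1 ms (redis-cli sends PING and measures the time until the reply arrives; no password is needed)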
redis-cli -h "127.0.0.1" -p "6379" --latency-history
min: 0, max: 1, avg: 0.11 (1323 samples) -- 15.00 seconds range
min: 0, max: 1, avg: 0.10 (1330 samples) -- 15.00 seconds range
min: 0, max: 1, avg: 0.12 (1321 samples) -- 15.01 seconds range
min: 0, max: 1, avg: 0.10 (1319 samples) -- 15.01 seconds range
min: 0, max: 1, avg: 0.10 (1321 samples) -- 15.00 seconds range
min: 0, max: 1, avg: 0.10 (1319 samples) -- 15.01 seconds range
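
When the latency history is captured to a file or collected over a long period, reading it by eye gets tedious. The following is a minimal sketch, assuming the output lines look exactly like the sample above; the function name `slow_windows` and the default 1 ms threshold are illustrative choices, not part of redis-cli.

```python
import re

# Matches summary lines such as:
#   min: 0, max: 1, avg: 0.11 (1323 samples) -- 15.00 seconds range
LATENCY_LINE = re.compile(
    r"min: (?P<min>\d+), max: (?P<max>\d+), avg: (?P<avg>[\d.]+) "
    r"\((?P<samples>\d+) samples\)"
)


def slow_windows(output: str, avg_threshold_ms: float = 1.0) -> list[dict]:
    """Return the sampling windows whose average latency reaches the threshold."""
    windows = []
    for line in output.splitlines():
        match = LATENCY_LINE.search(line)
        if match is None:
            continue  # not a latency summary line
        window = {
            "min": int(match["min"]),
            "max": int(match["max"]),
            "avg": float(match["avg"]),
            "samples": int(match["samples"]),
        }
        if window["avg"] >= avg_threshold_ms:
            windows.append(window)
    return windows


sample = """\
min: 0, max: 1, avg: 0.11 (1323 samples) -- 15.00 seconds range
min: 0, max: 1, avg: 0.10 (1330 samples) -- 15.00 seconds range
"""
print(slow_windows(sample))  # no window reaches 1 ms, so this prints []
```

An empty result suggests Redis itself is responding normally, which points back toward the client or the network.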

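2. Pinpointing the specific Redis performance problem

If the redis-cli check from step 1 shows that Redis itself has a performance problem, the following techniques help pinpoint the specific cause.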

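Check Redis memory usage

Connect to Redis with redis-cli and run info memory, paying particular attention to used_memory_rss_human, total_system_memory_human, and maxmemory_human.

If maxmemory_human is 0, Redis memory usage is unlimited, which is generally not recommended.
Performance problems become likely once used_memory_rss_human reaches 90% of maxmemory (or of total_system_memory_human when maxmemory is 0).
Make sure maxmemory_human is less than total_system_memory_human.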

redis-cli -h "127.0.0.1" -p "6379" -a ""
127.0.0.1:6379> info memory
# Memory
used_memory:904768
# Performance problems become likely once used_memory_human reaches 80% of maxmemory (or of total_system_memory_human when maxmemory is 0)
used_memory_human:883.56K
used_memory_rss:7266304
used_memory_rss_human:6.93M
used_memory_peak:964504
used_memory_peak_human:941.90K
used_memory_peak_perc:93.81%
used_memory_overhead:863304
used_memory_startup:842624
used_memory_dataset:41464
used_memory_dataset_perc:66.72%
allocator_allocated:1803176
allocator_active:7864320
allocator_resident:8912896
total_system_memory:2085294080
total_system_memory_human:1.94G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
# maxmemory is 0, meaning Redis memory usage is unlimited; performance problems become likely when used_memory_human grows large
maxmemory:0
maxmemory_human:0B
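
The memory rule above (compare used_memory_rss against maxmemory, falling back to total_system_memory when maxmemory is 0) can be checked mechanically. This is a rough sketch that works on the raw byte fields of the INFO output rather than the `_human` variants; `parse_info` and `memory_pressure` are hypothetical helper names, and the 90% cut-off is the figure quoted in the text.

```python
def parse_info(text: str) -> dict[str, str]:
    """Turn INFO output ("key:value" lines, "#" section headers) into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and section headers such as "# Memory"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_pressure(info_text: str) -> float:
    """used_memory_rss divided by maxmemory (or total_system_memory if maxmemory is 0)."""
    info = parse_info(info_text)
    limit = int(info["maxmemory"]) or int(info["total_system_memory"])
    return int(info["used_memory_rss"]) / limit


sample = """\
# Memory
used_memory_rss:7266304
total_system_memory:2085294080
maxmemory:0
"""
ratio = memory_pressure(sample)
print(f"{ratio:.2%}", "WARN" if ratio >= 0.9 else "ok")
```

With the sample values the ratio is far below the threshold, so the instance shown here is not under memory pressure.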

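Check the Redis command rate

Connect to Redis with redis-cli and run info stats, paying particular attention to instantaneous_ops_per_sec.
- The maximum ops/sec Redis can sustain depends heavily on key and value sizes; refer to the Redis performance figure below to judge whether Redis has reached its performance limit.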
```
redis-cli -h "127.0.0.1" -p "6379" -a ""
127.0.0.1:6379> info stats
# Stats
instantaneous_ops_per_sec:0
```

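Check for disk performance problems

If Redis is configured with AOF persistence and appendfsync is set to everysec (the AOF is fsynced to disk once per second), poor disk performance will affect Redis read and write performance.

Check whether the Redis log file contains the following message: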
```
Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
```
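Workaround: temporarily set appendfsync to no. If the disk is very slow, Redis performance may still suffer even with appendfsync set to no and the slow-fsync message no longer being logged; the latency-monitor described later can reveal this.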
```
redis-cli -h "127.0.0.1" -p "6379" -a ""
127.0.0.1:6379> config set appendfsync no
```
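
To judge how often the disk is falling behind, it can help to count occurrences of the slow-fsync warning in the Redis log. A small sketch follows; the helper name `count_slow_fsync` is illustrative, and the log path in the usage comment is an assumption (the actual location comes from the `logfile` directive in your configuration).

```python
AOF_SLOW_FSYNC = "Asynchronous AOF fsync is taking too long"


def count_slow_fsync(log_lines) -> int:
    """Count log lines that carry the slow-fsync warning."""
    return sum(1 for line in log_lines if AOF_SLOW_FSYNC in line)


# Usage against a real log file (the path is an assumption):
#   with open("/var/log/redis/redis-server.log", errors="replace") as log:
#       print(count_slow_fsync(log))

sample = [
    "Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF "
    "buffer without waiting for fsync to complete, this may slow down Redis.",
    "an unrelated log line",
]
print(count_slow_fsync(sample))  # prints 1
```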

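Check the number of connections

Once the number of connections reaches the configured maximum, Redis rejects new connections, so confirm that connected_clients stays below maxclients.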

redis-cli -h "127.0.0.1" -p "6379" -a ""
127.0.0.1:6379> info clients
# Clients
connected_clients:1
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:16
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
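
The comparison between connected_clients and maxclients is easy to script. This sketch assumes the INFO clients output includes a maxclients field, as in the sample above; if your Redis version does not report it there, `CONFIG GET maxclients` returns the same limit. `client_headroom` is a hypothetical helper name.

```python
def client_headroom(info_text: str) -> int:
    """Connections still available before connected_clients reaches maxclients."""
    fields = dict(
        line.strip().split(":", 1)
        for line in info_text.splitlines()
        if ":" in line and not line.startswith("#")
    )
    return int(fields["maxclients"]) - int(fields["connected_clients"])


sample = """\
# Clients
connected_clients:1
cluster_connections:0
maxclients:10000
"""
print(client_headroom(sample))  # prints 9999
```

A headroom close to zero means new connections are at risk of being refused, and the usual next step is to look for connection leaks or an undersized client pool limit.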

Check for slow queries (commands with high time complexity)

redis-cli -h "127.0.0.1" -p "6379" -a ""
127.0.0.1:6379> slowlog get 100
1) 1) (integer) 0
   2) (integer) 1668745341
   3) (integer) 1001516
   4) 1) "debug"
      2) "sleep"
      3) "1"
   5) "127.0.0.1:40420"
   6) ""
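
Each slowlog entry holds, in order: an entry id, a Unix timestamp, the execution time in microseconds, the command arguments, the client address, and the client name. The sketch below formats one entry into a readable line; it takes a plain tuple so that it does not depend on any particular client library, and `describe_slowlog_entry` is an illustrative name.

```python
from datetime import datetime, timezone


def describe_slowlog_entry(entry) -> str:
    """Format (id, unix_ts, duration_us, args, client_addr, client_name)."""
    entry_id, unix_ts, duration_us, args, client_addr, _client_name = entry
    when = datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S UTC"
    )
    command = " ".join(args)
    # SLOWLOG reports microseconds; milliseconds are easier to compare
    # with client-side timeouts.
    return f"#{entry_id} {when} {duration_us / 1000:.1f} ms from {client_addr}: {command}"


# The entry shown in the session above.
entry = (0, 1668745341, 1001516, ["debug", "sleep", "1"], "127.0.0.1:40420", "")
print(describe_slowlog_entry(entry))
```

For the entry above the duration works out to roughly one second, which matches the `debug sleep 1` command that produced it.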

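Check whether the operating system has performance problems

# Determine whether the server hosting Redis has a performance problem. The result is only meaningful when compared with a server hosting a healthy Redis instance: a value around twice that of the healthy server indicates an operating system performance problem
# The command only runs simple arithmetic in a loop and must be executed on the Redis server itself, which is why it measures the operating system's intrinsic latency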
redis-cli --intrinsic-latency 60
Max latency so far: 1 microseconds.
Max latency so far: 65 microseconds.
Max latency so far: 107 microseconds.
Max latency so far: 448 microseconds.
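
The "about twice the healthy server" rule can be applied to two captured runs. This is a sketch under the assumption that the output consists of `Max latency so far: N microseconds.` lines as shown above; the helper names and the baseline value of 200 microseconds are made up for illustration.

```python
import re


def max_intrinsic_latency_us(output: str) -> int:
    """Largest value reported in 'Max latency so far: N microseconds.' lines."""
    values = [
        int(v) for v in re.findall(r"Max latency so far: (\d+) microseconds", output)
    ]
    if not values:
        raise ValueError("no 'Max latency so far' lines found")
    return max(values)


def looks_degraded(suspect_us: int, baseline_us: int, factor: float = 2.0) -> bool:
    """True when the suspect host is at least `factor` times slower than the baseline."""
    return suspect_us >= factor * baseline_us


suspect_output = """\
Max latency so far: 1 microseconds.
Max latency so far: 65 microseconds.
Max latency so far: 107 microseconds.
Max latency so far: 448 microseconds.
"""
suspect = max_intrinsic_latency_us(suspect_output)
baseline = 200  # hypothetical value measured on a healthy server
print(suspect, looks_degraded(suspect, baseline))  # prints 448 True
```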

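Enable latency monitoring (especially effective for latency spikes)

Enable latency-monitor. It is especially useful for tracking down latency spikes, that is, cases where slow Redis responses are intermittent or sporadic.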
```
# Enable latency-monitor (the threshold is in milliseconds)
redis-cli -h "127.0.0.1" -p "6379" -a ""
CONFIG SET latency-monitor-threshold 100
```
Redis records latency per event. The following commands show whether any latency events have been recorded:
```
# Show the timestamp and duration of the most recent spike for each event
latency latest
1) 1) "command"           # Event name
   2) (integer) 1439479413  # Unix timestamp
   3) (integer) 381   # Latency of latest event
   4) (integer) 6802    # All time maximum latency

# Show the historical timestamps and durations for the specified event
latency history command
1) 1) (integer) 1425038819   # Unix timestamp
   2) (integer) 383      # Execution time (in ms)
2) 1) (integer) 1425038944
   2) (integer) 4513
   
# Clear the latency records
latency reset

# Show the history of the specified event as an ASCII graph
latency graph command
```
Event list for reference:
- command: regular commands.
- fast-command: O(1) and O(log N) commands.
- fork: the fork(2) system call.
- rdb-unlink-temp-file: the unlink(2) system call.
- aof-fsync-always: the fsync(2) system call when invoked by the appendfsync always policy.
- aof-write: writing to the AOF - a catchall event for write(2) system calls.
- aof-write-pending-fsync: the write(2) system call when there is a pending fsync.
- aof-write-active-child: the write(2) system call when there are active child processes.
- aof-write-alone: the write(2) system call when no pending fsync and no active child process.
- aof-fstat: the fstat(2) system call.
- aof-rename: the rename(2) system call for renaming the temporary file after completing BGREWRITEAOF.
- aof-rewrite-diff-write: writing the differences accumulated while performing BGREWRITEAOF.
- active-defrag-cycle: the active defragmentation cycle.
- expire-cycle: the expiration cycle.
- eviction-cycle: the eviction cycle.
- eviction-del: deletes during the eviction cycle.
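
Once spikes for an event have been collected with latency history, a quick summary makes it easier to see how bad the worst one was and when it happened. The sketch below assumes the history has already been read into a list of (unix_timestamp, latency_ms) pairs, matching the two-field entries shown in the latency history output; `summarize_spikes` is a hypothetical helper name.

```python
def summarize_spikes(history):
    """Summarize (unix_timestamp, latency_ms) pairs from LATENCY HISTORY <event>."""
    if not history:
        return None  # nothing exceeded latency-monitor-threshold
    worst_ts, worst_ms = max(history, key=lambda pair: pair[1])
    return {
        "count": len(history),
        "worst_ms": worst_ms,
        "worst_at": worst_ts,
        "avg_ms": sum(ms for _, ms in history) / len(history),
    }


# The two entries from the latency history example.
history = [(1425038819, 383), (1425038944, 4513)]
print(summarize_spikes(history))
```

Correlating `worst_at` with the slowlog timestamps and the application's own error logs is usually the fastest way to tie a spike to a specific command or background event.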