魏长东

weichangdong

东邪

Redis命令和调优

 吞吐量

Redis提供的INFO命令不仅能够查看实时的吞吐量(ops/sec),还能看到一些有用的运行时信息。下面用grep过滤出一些比较重要的实时信息,比如已连接的和在阻塞的客户端、已用内存、拒绝连接、实时的tps和数据流量等:

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1 info |
 grep -e "connected_clients" -e "blocked_clients" -e "used_memory_human"
 -e "used_memory_peak_human" 
-e "rejected_connections" -e "evicted_keys" -e "instantaneous"

connected_clients:1
blocked_clients:0
used_memory_human:799.66K
used_memory_peak_human:852.35K
instantaneous_ops_per_sec:0
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
evicted_keys:0

延迟

2.2.1 客户端PING

从客户端可以监控Redis的延迟,利用Redis提供的PING命令,不断PING服务端,记录服务端响应PONG的时间。下面开两个终端,一个监控延迟,一个监视服务端收到的命令:

[root@vm redis-3.0.3]# src/redis-cli --latency -h 127.0.0.1
min: 0, max: 1, avg: 0.08

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1
127.0.0.1:6379> monitor
OK
1439361594.867170 [0 127.0.0.1:59737] "PING"
1439361594.877413 [0 127.0.0.1:59737] "PING"
1439361594.887643 [0 127.0.0.1:59737] "PING"
1439361594.897858 [0 127.0.0.1:59737] "PING"
1439361594.908063 [0 127.0.0.1:59737] "PING"
1439361594.918277 [0 127.0.0.1:59737] "PING"
1439361594.928469 [0 127.0.0.1:59737] "PING"
1439361594.938693 [0 127.0.0.1:59737] "PING"
1439361594.948899 [0 127.0.0.1:59737] "PING"
1439361594.959110 [0 127.0.0.1:59737] "PING"

果我们故意用DEBUG命令制造延迟,就能看到一些输出上的变化:

[root@vm redis-3.0.3]# src/redis-cli --latency -h 127.0.0.1
min: 0, max: 1995, avg: 1.60 (2361 samples)

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1
127.0.0.1:6379> debug sleep 1
OK
(1.00s)
127.0.0.1:6379> debug sleep .15
OK
127.0.0.1:6379> debug sleep .5
OK
(0.50s)
127.0.0.1:6379> debug sleep 2
OK
(2.00s)

服务端内部机制

服务端内部的延迟监控稍微麻烦一些,因为延迟记录的默认阈值是0。尽管空间和时间耗费很小,Redis为了高性能还是默认关闭了它。所以首先我们要开启它,设置一个合理的阈值,例如下面命令中设置的100ms:

127.0.0.1:6379> CONFIG SET latency-monitor-threshold 100
OK

因为Redis执行命令非常快,所以我们用DEBUG命令人为制造一些慢执行命令:

127.0.0.1:6379> debug sleep 2
OK
(2.00s)
127.0.0.1:6379> debug sleep .15
OK
127.0.0.1:6379> debug sleep .5
OK

下面就用LATENCY的各种子命令来查看延迟记录:

  • LATEST:四列分别表示事件名、最近延迟的Unix时间戳、最近的延迟、最大延迟。
  • HISTORY:延迟的时间序列。可用来产生图形化显示或报表。
  • GRAPH:以图形化的方式显示。最下面以竖行显示的是指延迟在多久以前发生。
  • RESET:清除延迟记录。

 

127.0.0.1:6379> latency latest
1) 1) "command"
   2) (integer) 1439358778
   3) (integer) 500
   4) (integer) 2000

127.0.0.1:6379> latency history command
1) 1) (integer) 1439358773
   2) (integer) 2000
2) 1) (integer) 1439358776
   2) (integer) 150
3) 1) (integer) 1439358778
   2) (integer) 500

127.0.0.1:6379> latency graph command
command - high 2000 ms, low 150 ms (all time high 2000 ms)
--------------------------------------------------------------------------------
#  
|  
|  
|_#

666
mmm

在执行一条DEBUG命令会发现GRAPH图的变化,多出一条新的柱状线,下面的时间2s就是指延迟刚发生两秒钟:

127.0.0.1:6379> debug sleep 1.5
OK
(1.50s)
127.0.0.1:6379> latency graph command
command - high 2000 ms, low 150 ms (all time high 2000 ms)
--------------------------------------------------------------------------------
#   
|  #
|  |
|_#|

2222
333s
mmm 

还有一个有趣的子命令DOCTOR,它能列出一些指导建议,例如开启慢日志进一步追查问题原因,查看是否有大对象被踢出或过期,以及操作系统的配置建议等。

127.0.0.1:6379> latency doctor
Dave, I have observed latency spikes in this Redis instance. You don't mind talking about it, do you Dave?

1. command: 3 latency spikes (average 883ms, mean deviation 744ms, period 210.00 sec). Worst all time event 2000ms.

I have a few advices for you:

- Check your Slow Log to understand what are the commands you are running which are too slow to execute. Please check http://redis.io/commands/slowlog for more information.
- Deleting, expiring or evicting (because of maxmemory policy) large objects is a blocking operation. If you have very large objects that are often deleted, expired, or evicted, try to fragment those objects into multiple smaller objects.
- I detected a non zero amount of anonymous huge pages used by your process. This creates very serious latency events in different conditions, especially when Redis is persisting on disk. To disable THP support use the command
 'echo never > /sys/kernel/mm/transparent_hugepage/enabled',
 make sure to also add it into /etc/rc.local so that the command will be executed again after a reboot. Note that even if you have already disabled THP, 
you still need to restart the Redis process to get rid of the huge pages already created.

度量延迟Baseline

延迟中的一部分是来自环境的,比如操作系统内核、虚拟化环境等等。Redis提供了让我们度量这一部分延迟基线(Baseline)的方法:

[root@vm redis-3.0.3]# src/redis-cli --intrinsic-latency 100 -h 127.0.0.1
Max latency so far: 2 microseconds.
Max latency so far: 3 microseconds.
Max latency so far: 26 microseconds.
Max latency so far: 37 microseconds.
Max latency so far: 1179 microseconds.
Max latency so far: 1623 microseconds.
Max latency so far: 1795 microseconds.
Max latency so far: 2142 microseconds.

35818026 total runs (avg latency: 2.7919 microseconds / 27918.90 nanoseconds per run).
Worst run took 767x longer than the average latency.

–intrinsic-latency后面是测试的时长(秒),一般100秒足够了。

持续实时监控

Unix的WATCH命令是一个非常实用的工具,它可以实时监视任意命令的输出结果。比如上面我们提到的命令,稍加改造就能变成持续地实时监控工具:

[root@vm redis-3.0.3]# watch -n 1 -d "src/redis-cli -h 127.0.0.1 info | grep -e "connected_clients" -e "blocked_clients" -e "used_memory_human" -e "used_memory_peak_human" -e "rejected_connections" -e "evicted_keys" -e "instantaneous""

Every 1.0s: src/redis-cli -h 127.0.0.1 info | grep -e...  Wed Aug 12 14:30:40 2015

connected_clients:1
blocked_clients:0
used_memory_human:799.66K
used_memory_peak_human:852.35K
instantaneous_ops_per_sec:0
instantaneous_input_kbps:0.01
instantaneous_output_kbps:1.23
rejected_connections:0
evicted_keys:0

[root@vm redis-3.0.3]# watch -n 1 -d "src/redis-cli -h 127.0.0.1 latency graph command"

Every 1.0s: src/redis-cli -h 127.0.0.1 latency graph command                                                                                                               Wed Aug 12 14:33:25 2015

command - high 2000 ms, low 150 ms (all time high 2000 ms)
--------------------------------------------------------------------------------
#
|  #
|  |
|_#|

4441
0006
mmmm

 慢操作日志

像SORT、LREM、SUNION等操作在大对象上会非常耗时,使用时要注意参照官方API上每个命令的算法复杂度。用前面介绍过的慢操作日志监控操作的执行时间。就像主流数据库提供的慢SQL日志一样,Redis也提供了记录慢操作的日志。注意这部分日志只会计算纯粹的操作耗时。

slowlog-log-slower-than设置慢操作的阈值,slowlog-max-len设置保存个数,因为慢操作日志与延迟记录一样,都是保存在内存中的:

127.0.0.1:6379> config set slowlog-log-slower-than 500
OK 
127.0.0.1:6379> debug sleep 1
OK
(0.50s)
127.0.0.1:6379> debug sleep .6
OK
127.0.0.1:6379> slowlog get 10
1) 1) (integer) 2
   2) (integer) 1439369937
   3) (integer) 473178
   4) 1) "debug"
      2) "sleep"
      3) ".6"
2) 1) (integer) 1
   2) (integer) 1439369821
   3) (integer) 499357
   4) 1) "debug"
      2) "sleep"
      3) "1"
3) 1) (integer) 0
   2) (integer) 1439365058
   3) (integer) 417846
   4) 1) "debug"
      2) "sleep"
      3) "1"

输出的四列的含义分别是:记录的自增ID、命令执行时的时间戳、命令的执行耗时(ms)、命令的内容。注意上面的DEBUG命令并没有包含休眠时间,而只是命令的处理时间。