When a SQL statement you wrote has a performance problem, it is easy to feel stuck and simply hand it off to a DBA. Today we analyze an interesting SQL performance case from three perspectives — the application developer, the DBA, and the kernel developer — and work our way from the surface down into the PostgreSQL optimizer.
Problem description
Colleague A came to me, the stand-in DBA, with a SQL performance question:

- A: Two SQL statements differ only in their limit, yet the `limit 1` one runs N times slower than `limit 10`.
- Me: Could it be a caching effect? Try running `limit 10` first and then `limit 1`.
- A: ...... Tried that. `limit 1` is still slow.
Here is how the two statements behave in production.

limit 10:

```sql
select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 10;
```

Execution Time: 1.307 ms

limit 1:

```sql
select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 1;
```

Execution Time: 144.098 ms
Analysis

Execution plans

Since caching has been ruled out, let's first look at how the execution plans differ.
limit 1:

```sql
# explain analyze verbose select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 1;
                                                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..416.25 rows=1 width=73) (actual time=135.213..135.214 rows=1 loops=1)
   Output: xxx
   ->  Index Scan Backward using user_gift_pkey on yay.user_gift  (cost=0.43..368000.44 rows=885 width=73) (actual time=135.212..135.212 rows=1 loops=1)
         Output: xxx
         Filter: ((user_gift.user_id = 11695667) AND (user_gift.user_type = 'default'::user_type))
         Rows Removed by Filter: 330192
 Planning Time: 0.102 ms
 Execution Time: 135.235 ms
(8 rows)

Time: 135.691 ms
```
limit 10:

```sql
# explain analyze verbose select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 10;
                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=868.20..868.22 rows=10 width=73) (actual time=1.543..1.545 rows=10 loops=1)
   Output: xxx
   ->  Sort  (cost=868.20..870.41 rows=885 width=73) (actual time=1.543..1.543 rows=10 loops=1)
         Output: xxx
         Sort Key: user_gift.id DESC
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Scan using idx_user_type on yay.user_gift  (cost=0.56..849.07 rows=885 width=73) (actual time=0.020..1.366 rows=775 loops=1)
               Output: xxx
               Index Cond: (user_gift.user_id = 11695667)
               Filter: (user_gift.user_type = 'default'::user_type)
 Planning Time: 0.079 ms
 Execution Time: 1.564 ms
(12 rows)

Time: 1.871 ms
```
The two plans are clearly different:

- `limit 1`: scans the primary key backwards — Index Scan Backward using user_gift_pkey on yay.user_gift
- `limit 10`: uses the composite index idx_user_type (leading column user_id) to look up the user's rows directly — Index Scan using idx_user_type on yay.user_gift
Why are the plans different?

total cost
PostgreSQL's planner is not actually "wrong" here. For `limit 1`, the total cost in `Limit (cost=0.43..416.25 rows=1 width=73)` is 416.25, so the run cost is 416.25 - 0.43 ≈ 415.8. For `limit 10`, the total cost in `Limit (cost=868.20..868.22 rows=10 width=73)` is 868.22.

Extrapolating the `Index Scan Backward using user_gift_pkey` plan, `limit 1` costs about 416, `limit 2` about 416 × 2 ≈ 832, and `limit 3` about 1248 — which exceeds 868. So at `limit 3` the planner switches to the `Index Scan using idx_user_type` plan.
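This matches how the Limit node is charged — a sketch of the reasoning, assuming the planner scales the child scan's run cost linearly by the fraction of the estimated 885 matching rows the Limit needs:

\[ cost_{limit(k)} \approx 0.43 + \frac{k}{885} \times (368000.44 - 0.43) \approx 0.43 + k \times 415.8 \]

which gives roughly 416 for k = 1, 832 for k = 2, and 1248 for k = 3, so the crossover against 868.22 happens at `limit 3`.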
Verification:

```sql
# explain select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 2;
                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..831.95 rows=2 width=73)
   ->  Index Scan Backward using user_gift_pkey on user_gift  (cost=0.43..367528.67 rows=884 width=73)
         Filter: ((user_id = 11695667) AND (user_type = 'default'::user_type))
(3 rows)

Time: 0.341 ms

# explain select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 3;
                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Limit  (cost=866.19..866.20 rows=3 width=73)
   ->  Sort  (cost=866.19..868.40 rows=884 width=73)
         Sort Key: id DESC
         ->  Index Scan using idx_user_type on user_gift  (cost=0.56..854.76 rows=884 width=73)
               Index Cond: (user_id = 11695667)
               Filter: (user_type = 'default'::user_type)
(6 rows)

Time: 0.352 ms
```
The results confirm it:

- At `limit 2`, the plan is still Index Scan Backward using user_gift_pkey.
- At `limit 3`, the plan switches to Index Scan using idx_user_type on user_gift.
Actual execution time

The estimated cost for `limit 1` (416.25) is indeed lower than the 868.22 estimated for `limit 10`.

But the actual `limit 1` execution takes 135.691 ms, while `limit 10` takes 1.871 ms — about 70 times slower than `limit 10`!
Let's run explain again, this time with the buffers option:

```sql
# explain (analyze, buffers, verbose) select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 1;
                                                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..416.29 rows=1 width=73) (actual time=451.542..451.544 rows=1 loops=1)
   Output: xxx
   Buffers: shared hit=214402 read=5280 dirtied=2302
   I/O Timings: read=205.027
   ->  Index Scan Backward using user_gift_pkey on yay.user_gift  (cost=0.43..368032.94 rows=885 width=73) (actual time=451.540..451.540 rows=1 loops=1)
         Output: xxx
         Filter: ((user_gift.user_id = 11695667) AND (user_gift.user_type = 'default'::user_type))
         Rows Removed by Filter: 333462
         Buffers: shared hit=214402 read=5280 dirtied=2302
         I/O Timings: read=205.027
 Planning Time: 1.106 ms
 Execution Time: 451.594 ms
(12 rows)
```
```sql
# explain (analyze, buffers, verbose) select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 3;
```
From these we can see:

- For `limit 1`, the IO cost is I/O Timings: read=205.027, and Rows Removed by Filter: 333462 shows that 333,462 rows were filtered out.
- For `limit 3`, the IO cost is only I/O Timings: read=10.112.

The line Buffers: shared hit=214402 read=5280 dirtied=2302 shows that the `limit 1` plan had to read 5280 blocks (pages) from disk before it found a row matching the where clause.
Why does it read so much data? Let's look at the statistics:

```text
schemaname             | yay
tablename              | user_gift
attname                | id
inherited              | f
null_frac              | 0
avg_width              | 8
n_distinct             | -1
most_common_vals       |
most_common_freqs      |
histogram_bounds       | {93,9817,19893,28177,.......}
correlation            | 0.788011
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |

schemaname             | yay
tablename              | user_gift
attname                | user_id
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -0.175761
most_common_vals       | {11576819,10299480,14020501,.......,11695667,......}
most_common_freqs      | {0.000353333,0.000326667,0.000246667,......,9.33333e-05,......}
histogram_bounds       | {3,10002181,10005599,10009672,......,11693300,11698290,......}
correlation            | 0.53375
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |

schemaname             | yay
tablename              | user_gift
attname                | user_type
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 3
most_common_vals       | {default, invalid, deleted}
most_common_freqs      | {0.997923,0.00194,0.000136667}
histogram_bounds       |
correlation            | 0.99763
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
```
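These records come from the pg_stats view; a query along the following lines (only standard pg_stats columns, nothing specific to this schema assumed) reproduces them:

```sql
-- per-column statistics the planner uses for selectivity estimation
select schemaname, tablename, attname, null_frac, avg_width, n_distinct,
       most_common_vals, most_common_freqs, histogram_bounds, correlation
from pg_stats
where tablename = 'user_gift'
  and attname in ('id', 'user_id', 'user_type');
```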
From the statistics we can see:

- The most_common_vals list for user_id contains 11695667, so its selectivity can be read directly from the corresponding most_common_freqs entry: 9.33333e-05. For user_type, the value 'default' has a selectivity of 0.997923.
- So the selectivity of `where user_id=11695667 and user_type='default'` is 0.0000933333 * 0.997923 = 0.0000931394.
- The estimated number of matching rows is therefore 0.0000931394 * 9499740 (total rows in the table) ≈ 884.8, which matches the 884 rows in the plan `(cost=0.43..367528.67 rows=884 width=73)`.
The optimizer's estimate assumes the data is uniformly distributed:

- Scanning backwards via user_gift_pkey (primary key id): it only needs to scan about 9499740 / 884 ≈ 10746 rows to hit the first matching record, and no sort is needed (`order by id desc`).
- Scanning via idx_user_type: it finds this user's rows quickly, but then has to sort the 884 rows, and scan + sort is estimated to cost more than the primary-key scan.
But is the data really distributed uniformly? Let's check the actual distribution.

The largest page of the table is 128709:

```sql
# select max(ctid) from user_gift;
     max
-------------
 (128709,29)
(1 row)
```

The largest page containing user_id=11695667 is 124329:

```sql
# select max(ctid), min(ctid) from user_gift where user_id=11695667;
```
The table's own page and tuple counts:

```sql
# SELECT relpages, reltuples FROM pg_class WHERE relname = 'user_gift';
 relpages |  reltuples
----------+-------------
   128875 | 9.49974e+06
(1 row)
```
Tuples stored per page: 9.49974e+06 tuples / 128875 pages = 73.713 tuples/page.

Now the calculation: the table's last heap page is 128709, while the last page containing rows for user 11695667 is 124329. A backward scan therefore has to walk 128709 - 124329 = 4380 pages before it reaches a page that can satisfy the where clause, which means filtering roughly 4380 pages * 73.713 tuples/page ≈ 322862 rows.
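The same back-of-the-envelope arithmetic, written as a query over the constants read above:

```sql
-- pages between the end of the table and the user's last page, times tuples per page
select (128709 - 124329) * (9.49974e+06 / 128875) as estimated_rows_filtered;  -- roughly 322,000
```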
The actual `limit 1` execution scanned 5280 pages (including primary-key index pages) and filtered 333,462 rows — essentially what we estimated:

```text
Rows Removed by Filter: 333462
Buffers: shared hit=214402 read=5280 dirtied=2302
I/O Timings: read=205.027
```
So this user's data is skewed:

- The optimizer, assuming a uniform distribution, expects to scan about 10746 rows.
- In reality it has to scan about 322862 rows.
How long does it take to scan 5280 pages?

Data volume to read: 5280 pages * 8KB/page ≈ 41.2MB.
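That volume can be double-checked directly in psql (8 kB is the default PostgreSQL block size):

```sql
-- 5280 pages of 8192 bytes each
select pg_size_pretty(5280 * 8192::bigint);  -- 41 MB
```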
```bash
[root]$ fio -name iops -rw=randread -bs=8k -runtime=10 -iodepth=1 -filename /dev/sdb -ioengine mmap -buffered=1
```
The fio results show that this database machine's disk reads sequentially at roughly 500 MB/s. If all the data were sequential, scanning ~40 MB would take about 80 ms; if it were all random, it would take around 40 seconds. In practice not all of the access is sequential, and the test was run on a non-production machine with no other IO load.

At this point the problem is essentially pinned down:

> PostgreSQL's optimizer assumes the data is uniformly distributed, so it expects the backward scan to find a matching row almost immediately; in reality this user's rows sit near the front of the table, which is why the actual start-up time is so slow.
A view from the kernel

Let's continue the analysis from inside the PostgreSQL kernel and answer two questions:

- How does the optimizer estimate cost?
- How does it measure actual time?

Table information

Table schema
```text
# \d user_gift;
                                         Table "yay.user_gift"
    Column    |           Type           | Collation | Nullable |                    Default
--------------+--------------------------+-----------+----------+------------------------------------------------
 id           | bigint                   |           | not null | nextval('user_gift_id_seq'::regclass)
 user_id      | integer                  |           | not null |
 ug_name      | character varying(100)   |           | not null |
 expired_time | timestamp with time zone |           |          | now()
 created_time | timestamp with time zone |           | not null | now()
 updated_time | timestamp with time zone |           | not null | now()
 user_type    | user_type                |           | not null | 'default'::user_type
Indexes:
    "user_gift_pkey" PRIMARY KEY, btree (id)
    "idx_user_type" btree (user_id, ug_name)
    "user_gift_ug_name_idx" btree (ug_name)
```

Primary key index:
```sql
# SELECT relpages, reltuples FROM pg_class WHERE relname = 'user_gift_pkey';
 relpages |  reltuples
----------+-------------
    40035 | 9.49974e+06
(1 row)
```

user_id index (idx_user_type):
```sql
# SELECT relpages, reltuples FROM pg_class WHERE relname = 'idx_user_type';
 relpages |  reltuples
----------+-------------
   113572 | 9.49974e+06
(1 row)
```

The table itself has 128875 pages:
```sql
# SELECT relpages, reltuples FROM pg_class WHERE relname = 'user_gift';
 relpages |  reltuples
----------+-------------
   128875 | 9.49974e+06
(1 row)
```

User 11695667 has 775 rows:
```sql
=# select count(1) from user_gift where user_id=11695667;
 count
-------
   775
(1 row)

=# select count(1) from user_gift where user_id=11695667 and user_type = 'default' ;
 count
-------
   775
(1 row)
```

B+tree heights:
```sql
-- height of the primary key index
# select * from bt_metap('user_gift_pkey');
 magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
--------+---------+------+-------+----------+-----------+-------------+-------------------------
 340322 |       3 |  412 |     2 |      412 |         2 |           0 |             9.31928e+06
(1 row)

-- height of idx_user_type
# select * from bt_metap('idx_user_type');
 magic  | version | root  | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
--------+---------+-------+-------+----------+-----------+-------------+-------------------------
 340322 |       3 | 15094 |     3 |    15094 |         3 |           0 |             9.49974e+06
(1 row)
```
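The bt_metap function used here is provided by the pageinspect extension, so on a fresh database it has to be installed first:

```sql
-- pageinspect ships with PostgreSQL as a contrib extension
CREATE EXTENSION IF NOT EXISTS pageinspect;
```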
Estimating cost

start-up cost
PostgreSQL estimates cost differently for each index type; here is how the start-up cost of a B+tree index scan is estimated:

```c
// src/backend/utils/adt/selfuncs.c
void
btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
               Cost *indexStartupCost, Cost *indexTotalCost,
               Selectivity *indexSelectivity, double *indexCorrelation,
               double *indexPages)
{
    ......
    // CPU cost of the log2(N) comparisons made while descending the btree
    descentCost = ceil(log(index->tuples) / log(2.0)) * cpu_operator_cost;
    costs.indexStartupCost += descentCost;
    ......
    // This cost is somewhat arbitrarily set at 50x cpu_operator_cost per page touched
    descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost;
    costs.indexStartupCost += descentCost;
    ......
}
```
The start-up cost estimate is actually quite simple: it only accounts for descending from the B+tree's root page to a leaf page and reading the first tuple from that page.

The start-up cost formula is:

\[ \left \{ ceil({\log_2 (N_{index,tuple})}) + (Height_{index} + 1) \times 50 \right \}\ \times cpu\_operator\_cost \]

- N(index,tuple): number of tuples in the index
- Height(index): height of the index B+tree
- cpu_operator_cost: default value 0.0025
start-up cost of the user_gift_pkey plan

From the table information above:

- N(index,tuple) = 9.49974e+06
- Height(index) = 2

So \[ \left \{ ceil({\log_2 (9499740)}) + (2 + 1) \times 50 \right \}\ \times cpu\_operator\_cost = 174 \times 0.0025 = 0.435 \] which matches the start-up cost of 0.43 estimated by PostgreSQL.
start-up cost of the idx_user_type plan

- N(index,tuple) = 9.49974e+06
- Height(index) = 3

\[ \left \{ ceil({\log_2 (9499740)}) + (3 + 1) \times 50 \right \}\ \times cpu\_operator\_cost = 224 \times 0.0025 = 0.56 \] which matches the start-up cost of 0.56 estimated by PostgreSQL.
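As a quick sanity check, the same arithmetic can be done in psql, plugging in the tuple count and tree heights read from pg_class and bt_metap above:

```sql
-- start-up cost = (ceil(log2(tuples)) + (height + 1) * 50) * cpu_operator_cost
select (ceil(log(2, 9499740.0)) + (2 + 1) * 50) * 0.0025 as pkey_startup_cost,        -- 0.435
       (ceil(log(2, 9499740.0)) + (3 + 1) * 50) * 0.0025 as idx_user_type_startup_cost; -- 0.560
```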
run cost

Estimating the run cost is considerably more complex — there are many branches, so it cannot be reduced to a single fixed formula. Here is only a brief sketch; if you are interested, see the cost_index function in src/backend/optimizer/path/costsize.c in the PostgreSQL source. For this case the calculation can be simulated with a script that follows the same logic.

run cost: \[ run\_cost = index\ cost + heap\ cost \]

index cost: \[ index\ cost = cost\ of\ randomly\ reading\ the\ index\ pages + cost\ of\ processing\ the\ index\ tuples \]

heap (main table) cost: \[ heap\ cost = max\_io\_cost + index\_correlation ^ 2 \times (min\_io\_cost - max\_io\_cost) \]

- index_correlation: the correlation between the index order and the physical order of the table rows; it describes how likely heap access is to be sequential when fetching rows through this index.

max_io_cost (worst-case IO cost), where every page is read randomly: \[ max\_io\_cost = pages\_fetched \times random\_page\_cost \]

min_io_cost (best-case IO cost), where the first page is read randomly and the rest sequentially: \[ min\_io\_cost = 1 \times random\_page\_cost + (pages\_fetched - 1) \times seq\_page\_cost \]
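The page-cost constants in these formulas are planner GUCs (by default seq_page_cost = 1, random_page_cost = 4, and cpu_operator_cost = 0.0025); their current values can be checked with:

```sql
select name, setting
from pg_settings
where name in ('seq_page_cost', 'random_page_cost', 'cpu_operator_cost');
```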
actual start-up time vs estimated start-up cost

One puzzle was glossed over in the analysis above: the estimated start-up cost is the cost from the start of plan execution until the first tuple is read from the table (cost is an arbitrary unit), whereas the actual start-up time is measured from the start of execution until the first tuple satisfying the where clause is found. Why the difference?
SQL processing flow: PostgreSQL parses the SQL into an AST, optimizes it, and then turns the AST into executors that carry out the concrete operations. Without optimization, the executor tree looks like this:

```text
┌──────────────┐
│  projection  │
└──────┬───────┘
       │
       │
┌──────▼──────┐
│    limit    │
└──────┬──────┘
       │
       │
┌──────▼──────┐
│  selection  │
└──────┬──────┘
       │
       │
┌──────▼──────┐
│ index scan  │
└─────────────┘
```
The simplified execution flow is:

- index scan executor: each tuple it scans is handed up to the selection executor
- selection executor: filters the tuple; if it matches, it is handed up to the limit executor, otherwise the index scan executor is called again
- limit executor: once the limit is reached, the rows are handed up to the projection executor
- projection executor: strips out columns that are not in the `select` list
With optimization, the selection executor and projection executor are usually merged into the index scan executor, to reduce the hand-off of tuples between executors:

```text
┌─────────────┐
│    limit    │
└──────┬──────┘
       │
┌──────▼─────────────────────┐
│ index scan                 │
│ (selection + projection    │
│  merged into the scan)     │
└────────────────────────────┘
```
The optimized flow:

- index scan executor: scans a tuple and applies the selection filter; if the tuple matches, it applies the projection and returns the result to limit, otherwise it keeps scanning
- limit executor: once the limit is reached, the rows are returned
The code below shows that PostgreSQL collects its execution-time statistics per executor node:

```c
// src/backend/executor/execProcnode.c
static TupleTableSlot *
ExecProcNodeInstr(PlanState *node)
{
    TupleTableSlot *result;

    InstrStartNode(node->instrument);

    result = node->ExecProcNodeReal(node);

    // record the execution statistics for this node
    InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);

    return result;
}
```
So the actual start-up time runs from starting the executor until the first row matching the `where` clause is produced.

Looking at the actual call stack confirms that the `user_id=xxx` filter has been pushed down into the index scan executor:

```text
---> int4eq(FunctionCallInfo fcinfo) (/home/ken/cpp/postgres/src/backend/utils/adt/int.c:379)
ExecInterpExpr(ExprState * state, ExprContext * econtext, _Bool * isnull) (/home/ken/cpp/postgres/src/backend/executor/execExprInterp.c:704)
ExecInterpExprStillValid(ExprState * state, ExprContext * econtext, _Bool * isNull) (/home/ken/cpp/postgres/src/backend/executor/execExprInterp.c:1807)
ExecEvalExprSwitchContext(ExprState * state, ExprContext * econtext, _Bool * isNull) (/home/ken/cpp/postgres/src/include/executor/executor.h:322)
---> ExecQual(ExprState * state, ExprContext * econtext) (/home/ken/cpp/postgres/src/include/executor/executor.h:391)
ExecScan(ScanState * node, ExecScanAccessMtd accessMtd, ExecScanRecheckMtd recheckMtd) (/home/ken/cpp/postgres/src/backend/executor/execScan.c:227)
---> ExecIndexScan(PlanState * pstate) (/home/ken/cpp/postgres/src/backend/executor/nodeIndexscan.c:537)
ExecProcNodeInstr(PlanState * node) (/home/ken/cpp/postgres/src/backend/executor/execProcnode.c:466)
ExecProcNodeFirst(PlanState * node) (/home/ken/cpp/postgres/src/backend/executor/execProcnode.c:450)
ExecProcNode(PlanState * node) (/home/ken/cpp/postgres/src/include/executor/executor.h:248)
---> ExecLimit(PlanState * pstate) (/home/ken/cpp/postgres/src/backend/executor/nodeLimit.c:96)
ExecProcNodeInstr(PlanState * node) (/home/ken/cpp/postgres/src/backend/executor/execProcnode.c:466)
ExecProcNodeFirst(PlanState * node) (/home/ken/cpp/postgres/src/backend/executor/execProcnode.c:450)
ExecProcNode(PlanState * node) (/home/ken/cpp/postgres/src/include/executor/executor.h:248)
ExecutePlan(EState * estate, PlanState * planstate, _Bool use_parallel_mode, CmdType operation, _Bool sendTuples, uint64 numberTuples, ScanDirection direction, DestReceiver * dest, _Bool execute_once) (/home/ken/cpp/postgres/src/backend/executor/execMain.c:1632)
standard_ExecutorRun(QueryDesc * queryDesc, ScanDirection direction, uint64 count, _Bool execute_once) (/home/ken/cpp/postgres/src/backend/executor/execMain.c:350)
ExecutorRun(QueryDesc * queryDesc, ScanDirection direction, uint64 count, _Bool execute_once) (/home/ken/cpp/postgres/src/backend/executor/execMain.c:294)
ExplainOnePlan(PlannedStmt * plannedstmt, IntoClause * into, ExplainState * es, const char * queryString, ParamListInfo params, QueryEnvironment * queryEnv, const instr_time * planduration, const BufferUsage * bufusage) (/home/ken/cpp/postgres/src/backend/commands/explain.c:571)
ExplainOneQuery(Query * query, int cursorOptions, IntoClause * into, ExplainState * es, const char * queryString, ParamListInfo params, QueryEnvironment * queryEnv) (/home/ken/cpp/postgres/src/backend/commands/explain.c:404)
ExplainQuery(ParseState * pstate, ExplainStmt * stmt, ParamListInfo params, DestReceiver * dest) (/home/ken/cpp/postgres/src/backend/commands/explain.c:275)
```
The code below is the scan implementation; the ExecQual(qual, econtext) call is the tuple filter — selection has been merged into the scan:

```c
// src/backend/executor/execScan.c
TupleTableSlot *
ExecScan(ScanState *node, ExecScanAccessMtd accessMtd, ExecScanRecheckMtd recheckMtd)
{
    ......
    for (;;)
    {
        TupleTableSlot *slot;

        slot = ExecScanFetch(node, accessMtd, recheckMtd);
        ......
        econtext->ecxt_scantuple = slot;

        // Note: the selection check
        if (qual == NULL || ExecQual(qual, econtext))
        {
            if (projInfo)
            {
                return ExecProject(projInfo);
            }
            else
            {
                return slot;
            }
        }
        else
            InstrCountFiltered1(node, 1);
    }
}
```
Solutions

Prevent the primary-key scan

Since the problematic plan scans user_gift_pkey backwards, we can manually keep the optimizer from using that index:

```sql
# explain analyze verbose select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id+0 desc limit 1;
```

Changing `order by id` to `order by id+0` works because `id+0` is an expression, so the optimizer will no longer use the user_gift_pkey index to produce the ordering.

This approach is not suitable for every scenario: if the data is evenly distributed, the user_gift_pkey scan is still the better choice in some cases.
Add a (user_id, id) index

```sql
create index idx_user_id on user_gift(user_id, id);
```

Adding a composite index on the where-clause column plus the sort key avoids the primary-key scan altogether.
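After creating it, re-running explain should show the new index being used (expected behavior, not verified here): the user_id condition moves into the Index Cond and rows already come back in id order, so neither the long backward heap walk nor the Sort node is needed.

```sql
-- re-check the plan; the expected (unverified) shape is an index scan on idx_user_id
-- with "Index Cond: (user_id = 11695667)" and no Sort node
explain (analyze, buffers, verbose)
select xxx from user_gift
where user_id=11695667 and user_type = 'default'
order by id desc limit 1;
```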
Closing thoughts

From ruling out caching, to comparing the query plans, to spotting the data-distribution skew, and finally to reading the kernel source to confirm the cause, we eventually solved the performance problem. Hopefully this little SQL-tuning journey gives you something to take away too.