70.2. 多元統計資訊範例

版本:11

70.2.1. 欄位相依性

Multivariate correlation can be demonstrated with a very simple data set — a table with two columns, both containing the same values:

CREATE TABLE t (a INT, b INT);
INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
ANALYZE t;

第 14.2 節所述,查詢計劃程序可以使用從 pg_class 取得的頁面數量和資料筆數來確定 t 的基數:

SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
relpages | reltuples
----------+-----------
45 | 10000

The data distribution is very simple; there are only 100 distinct values in each column, uniformly distributed.

The following example shows the result of estimating a WHERE condition on the a column:

EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
QUERY PLAN
-------------------------------------------------------------------​------------
Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual rows=100 loops=1)
Filter: (a = 1)
Rows Removed by Filter: 9900

The planner examines the condition and determines the selectivity of this clause to be 1%. By comparing this estimate and the actual number of rows, we see that the estimate is very accurate (in fact exact, as the table is very small). Changing the WHERE condition to use the b column, an identical plan is generated. But observe what happens if we apply the same condition on both columns, combining them with AND:

EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-------------------------------------------------------------------​----------
Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900

The planner estimates the selectivity for each condition individually, arriving at the same 1% estimates as above. Then it assumes that the conditions are independent, and so it multiplies their selectivities, producing a final selectivity estimate of just 0.01%. This is a significant underestimate, as the actual number of rows matching the conditions (100) is two orders of magnitude higher.

This problem can be fixed by creating a statistics object that directs ANALYZE to calculate functional-dependency multivariate statistics on the two columns:

CREATE STATISTICS stts (dependencies) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-------------------------------------------------------------------​------------
Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900

70.2.2. 多元 N-Distinct 組合數量

A similar problem occurs with estimation of the cardinality of sets of multiple columns, such as the number of groups that would be generated by a GROUP BY clause. When GROUP BY lists a single column, the n-distinct estimate (which is visible as the estimated number of rows returned by the HashAggregate node) is very accurate:

EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
QUERY PLAN
-------------------------------------------------------------------​----------------------
HashAggregate (cost=195.00..196.00 rows=100 width=12) (actual rows=100 loops=1)
Group Key: a
-> Seq Scan on t (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000 loops=1)

But without multivariate statistics, the estimate for the number of groups in a query with two columns in GROUP BY, as in the following example, is off by an order of magnitude:

EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
QUERY PLAN
-------------------------------------------------------------------​-------------------------
HashAggregate (cost=220.00..230.00 rows=1000 width=16) (actual rows=100 loops=1)
Group Key: a, b
-> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)

By redefining the statistics object to include n-distinct counts for the two columns, the estimate is much improved:

DROP STATISTICS stts;
CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
QUERY PLAN
-------------------------------------------------------------------​-------------------------
HashAggregate (cost=220.00..221.00 rows=100 width=16) (actual rows=100 loops=1)
Group Key: a, b
-> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)

70.2.3. MCV Lists (最常見值列表)

第 70.2.1 節所述,函數相依性是非常便宜且高效能的統計類型,但其主要侷限性是它們的全域特質(僅在欄位等級追踪相依性,而不是在同一個欄位的值之間追踪)。

本節介紹 MCV(最常用值)列表的多元變型,這是對第 70.1 節中描述的每個欄位統計資訊的延伸功能。這些統計資訊透過儲存單獨的值來解決限制,但是就建構 ANALYZE 中的統計資訊、儲存與計劃時間而言,成本上自然要昂貴得多。

讓我們再次查看第 70.2.1 節中的查詢,但是這次是在同一欄位上建立的 MCV 列表(請確定刪除欄位相依性統計資訊,以確保查詢計劃程序使用新建立的統計資訊)。

DROP STATISTICS stts;
CREATE STATISTICS stts2 (mcv) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-------------------------------------------------------------------​------------
Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900

The estimate is as accurate as with the functional dependencies, mostly thanks to the table being fairly small and having a simple distribution with a low number of distinct values. Before looking at the second query, which was not handled by functional dependencies particularly well, let's inspect the MCV list a bit.

Inspecting the MCV list is possible using pg_mcv_list_items set-returning function.

SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts2';
index | values | nulls | frequency | base_frequency
-------+----------+-------+-----------+----------------
0 | {0, 0} | {f,f} | 0.01 | 0.0001
1 | {1, 1} | {f,f} | 0.01 | 0.0001
...
49 | {49, 49} | {f,f} | 0.01 | 0.0001
50 | {50, 50} | {f,f} | 0.01 | 0.0001
...
97 | {97, 97} | {f,f} | 0.01 | 0.0001
98 | {98, 98} | {f,f} | 0.01 | 0.0001
99 | {99, 99} | {f,f} | 0.01 | 0.0001
(100 rows)

This confirms there are 100 distinct combinations in the two columns, and all of them are about equally likely (1% frequency for each one). The base frequency is the frequency computed from per-column statistics, as if there were no multi-column statistics. Had there been any null values in either of the columns, this would be identified in the nulls column.

When estimating the selectivity, the planner applies all the conditions on items in the MCV list, and then sums the frequencies of the matching ones. See mcv_clauselist_selectivity in src/backend/statistics/mcv.c for details.

Compared to functional dependencies, MCV lists have two major advantages. Firstly, the list stores actual values, making it possible to decide which combinations are compatible.

EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
QUERY PLAN
-------------------------------------------------------------------​--------
Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=0 loops=1)
Filter: ((a = 1) AND (b = 10))
Rows Removed by Filter: 10000

Secondly, MCV lists handle a wider range of clause types, not just equality clauses like functional dependencies. For example, consider the following range query for the same table:

EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a <= 49 AND b > 49;
QUERY PLAN
-------------------------------------------------------------------​--------
Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=0 loops=1)
Filter: ((a <= 49) AND (b > 49))
Rows Removed by Filter: 10000