This chapter builds on the material covered in Section 14.1 and Section 14.2 to show some additional details about how the planner uses the system statistics to estimate the number of rows each part of a query might return. This is a significant part of the planning process, providing much of the raw material for cost calculation.
The intent of this chapter is not to document the code in detail, but to present an overview of how it works. This will perhaps ease the learning curve for someone who subsequently wishes to read the code.
版本:11
Multivariate correlation can be demonstrated with a very simple data set — a table with two columns, both containing the same values:
As explained in Section 14.2, the planner can determine cardinality of t
using the number of pages and rows obtained from pg_class
:
The data distribution is very simple; there are only 100 distinct values in each column, uniformly distributed.
The following example shows the result of estimating a WHERE
condition on the a
column:
The planner examines the condition and determines the selectivity of this clause to be 1%. By comparing this estimate and the actual number of rows, we see that the estimate is very accurate (in fact exact, as the table is very small). Changing the WHERE
condition to use the b
column, an identical plan is generated. But observe what happens if we apply the same condition on both columns, combining them with AND
:
The planner estimates the selectivity for each condition individually, arriving at the same 1% estimates as above. Then it assumes that the conditions are independent, and so it multiplies their selectivities, producing a final selectivity estimate of just 0.01%. This is a significant underestimate, as the actual number of rows matching the conditions (100) is two orders of magnitude higher.
This problem can be fixed by creating a statistics object that directs ANALYZE
to calculate functional-dependency multivariate statistics on the two columns:
A similar problem occurs with estimation of the cardinality of sets of multiple columns, such as the number of groups that would be generated by a GROUP BY
clause. When GROUP BY
lists a single column, the n-distinct estimate (which is visible as the estimated number of rows returned by the HashAggregate node) is very accurate:
But without multivariate statistics, the estimate for the number of groups in a query with two columns in GROUP BY
, as in the following example, is off by an order of magnitude:
By redefining the statistics object to include n-distinct counts for the two columns, the estimate is much improved: