摘要:
直方图的在查询优化中的作用主要是为了优化器中的代价模型提供代价中的统计信息计算, 本文对其进行分析
Histogram In MySQL
mysql hitograms目录中为mysql所提供的直方图的相关基础设施代码。在mysql 8.0之前其没有使用直方图作为统计信息来为查询优化提供支持。
早期的物理优化所提供的信息多为由innodb中关于表信息来直接表述。
直方图的在查询优化中的作用主要是为了优化器中的代价模型提供代价中的统计信息计算,例如:数据分布情况,每个表中样本统计记录数等等。传统的直方图又包括等高和等宽两种。
下面这张图可以直观的反映了直方图(Hist in PostgreSQL)
从Oracle的帮助文档中我们可以看出对应histogram给出的定义如下:
A histogram is a special type of column statistic that provides more detailed information about the data distribution in a table column. A histogram sorts values into "buckets," as you might sort coins into buckets.
Based on the NDV(number of distinct values) and the distribution of the data, the database chooses the type of histogram to create. (In some cases, when creating a histogram, the database samples an internally predetermined number of rows.) The types of histograms are as follows:
-
Frequency histograms and top frequency histograms
-
Height-Balanced histograms (legacy)
-
Hybrid histograms
那么统计信息的作用是什么呢?
By default the optimizer assumes a uniform distribution of rows across the distinct values in a column.
For columns that contain data skew (a nonuniform distribution of data within the column), a histogram enables the optimizer to generate accurate cardinality estimates for filter and join predicates that involve these columns.
For example, a California-based book store ships 95% of the books to California, 4% to Oregon, and 1% to Nevada. The book orders table has 300,000 rows. A table column stores the state to which orders are shipped. A user queries the number of books shipped to Oregon. Without a histogram, the optimizer assumes an even distribution of 300000/3 (the NDV is 3), estimating cardinality at 100,000 rows. With this estimate, the optimizer chooses a full table scan. With a histogram, the optimizer calculates that 4% of the books are shipped to Oregon, and chooses an index scan.
下面看看mysql中是如何实现historgram的吧:首先我们看看在histogram目录下,都有哪些文件:value_map.xxx, value_map_type.h, sigletone.xx histogram.xxx equi_height.xx equi_height_bucket.xx
sigletone.xx中构建相应的直方图信息,如:获取NDV,选择率,bucket等信息。主要是提供一些关于hist操作的对外接口。equi-height.xxx 等高直方图,一个等高的直方图以json方式表现其形式如下:
An equi-height histogram converted to a JSON object, follows the following
"schema":
{
// Last time the histogram was updated. As of now, this means "when the
// histogram was created" (incremental updates are not supported). Date/time
// is given in UTC.
// -- J_DATETIME
"last-updated": "2015-11-04 15:19:51.000000",
// Histogram type. Always "equi-height" for equi-height histograms.
// -- J_STRING
"histogram-type": "equi-height",
// Fraction of NULL values. This is the total fraction of NULL values in the
// original data set.
// -- J_DOUBLE
"null-values": 0.1,
// Histogram buckets. May be an empty array, if for instance the source
// only contain NULL values.
// -- J_ARRAY
"buckets":
[
[
// Lower inclusive value.
// -- Data type depends on the source column.
"0",
// Upper inclusive value.
// -- Data type depends on the source column.
"002a38227ecc7f0d952e85ffe37832d3f58910da",
// Cumulative frequence
// -- J_DOUBLE
0.001978728666831561,
// Number of distinct values in this bucket.
// -- J_UINT
10
]
]
}
histogram主要是实现给定类型的histogram。一个historgram是以json的方式进行存储的。因为,该种方式下,其可以保存的bucket不受限制且相关的数据类型的长度可以获得完整保存下来。
我们看看比较关键的函数, build_histogram函数。从这里可以看出,直方图的类型依赖于其输入数据量的大小以及相应的bucket的数量。如果NDV小于或者等于bucket的数量,我们我们只需要创建一个signleton 的histogram即可满足创建hist的需求。因为,在该singletone hist里面的每个bucket均可以处理一个ndv。
当如果ndv数量大于相应的bucket数量后,那么我们就需要创建多个histogram,用来描述这些数据。这里的histogram和我们常说的histogram是两个不同的事物。我们说的histogram是一个抽象的逻辑概念,而这里的hist则是具象化的hist对象,表示一个具体的hist对象。因此,系统中会存在各种类型的hist对象。
Create a histogram from a value map.
This function will build a histogram from a value map. The histogram type
depends on both the size of the input data, as well as the number of buckets
specified. If the number of distinct values is less than or equal to the
number of buckets, a Singleton histogram will be created. Otherwise, an
equi-height histogram will be created.
The histogram will be allocated on the supplied mem_root, and it is the
callers responsibility to properly clean up when the histogram isn't needed
anymore.
equi_heigth_bucket为等高hist中的每个bucket的定义。每个bucket保存了以下四种类型的数据:
- (1)低的包含性的值;(可以简单理解为bucket中value_map的最小值)Lower inclusive value
- (2)高的包含性值,(可以简单理解bucket中的value_map里最大值);Upper Inclusive value
- (3)累计频率 (0-1.0)
- (4)每个桶里面的 NDV值。
对应后续的直方图在查询优化中的使用,在后续文章中给出,这里就简单的介绍一下mysql中histogram目录下的相关代码。这些目录中的代码相对简单,这里就不再详细给出,会结合后续的查询优化讲解给出。
结语:
原文发表于 Histogram In MySQL