十二
09
接着上篇BLOG的情况,接下去继续分析数据分布情况对索引的影响。
1. 在上例中数据分布是均匀的,我们现在来制造一些数据分布的倾斜,考察一下情况。
############################################################################# # 制造索引列数据分布倾斜 ############################################################################# SQL> select col1,count(*) from alex_t03 group by col1 order by 1; COL1 COUNT(*) ---------- ---------- 1 10000 2 10000 3 10000 5 50000 9 10000 10 10000 已选择6行。 SQL> pro 索引理想选择度 SQL> select trunc((1/count(distinct col1))*100,2)||'%' col_ratio from alex_t03; COL_RATIO -------------------- 16.66% SQL> pro 索引实际选择度(col1=1) SQL> select trunc((select count(*) from alex_t03 where col1=1)/(select count(*) from alex_t03)*100,2)||'%' select_ratio from dual; SELECT_RATIO -------------------- 10% SQL> pro 索引实际选择度(col1=5) SQL> select trunc((select count(*) from alex_t03 where col1=5)/(select count(*) from alex_t03)*100,2)||'%' select_ratio from dual; SELECT_RATIO -------------------- 50%
上述看到,数据分布倾斜后,索引的理想选择度和实际选择度不一样了,CBO会如何选择呢?
2. 使用旧有的统计信息,不重新收集,发现CBO默认还是理想的均匀数据分布的算法。
SQL> select * from alex_t03 where col1=1; ---------------------------------------------------------- Plan hash value: 577105205 ------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 10013 | 117K| 53 (2)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID| ALEX_T03 | 10013 | 117K| 53 (2)| 00:00:01 | |* 2 | INDEX RANGE SCAN | IDX_ALEX_T03_COL1 | 10013 | | 26 (0)| 00:00:01 | ------------------------------------------------------------------------------------------------- SQL> select * from alex_t03 where col1=5; ---------------------------------------------------------- Plan hash value: 577105205 ------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 10013 | 117K| 53 (2)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID| ALEX_T03 | 10013 | 117K| 53 (2)| 00:00:01 | |* 2 | INDEX RANGE SCAN | IDX_ALEX_T03_COL1 | 10013 | | 26 (0)| 00:00:01 | -------------------------------------------------------------------------------------------------
3. 重新收集统计信息,但不构建直方图。
############################################################################# # 重新收集统计信息(不构建直方图) ############################################################################# SQL> select * from alex_t03 where col1=1; ---------------------------------------------------------- Plan hash value: 4180973395 ------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 16649 | 195K| 87 (5)| 00:00:02 | |* 1 | TABLE ACCESS FULL| ALEX_T03 | 16649 | 195K| 87 (5)| 00:00:02 | ------------------------------------------------------------------------------ SQL> select * from alex_t03 where col1=5; ---------------------------------------------------------- Plan hash value: 4180973395 ------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 16649 | 195K| 87 (5)| 00:00:02 | |* 1 | TABLE ACCESS FULL| ALEX_T03 | 16649 | 195K| 87 (5)| 00:00:02 | ------------------------------------------------------------------------------
发现不论实际选择度=10%的COL1=1的情况,还是实际选择度=50%的COL1=5的情况,都走了全表扫描。
这样对COL1=5的情况固然没有问题,但对COL1=1的情况,就不对了。
4. 再重新收集统计信息,这次构建直方图。
############################################################################# # 重新收集统计信息(构建直方图) ############################################################################# SQL> select * from alex_t03 where col1=1; ---------------------------------------------------------- Plan hash value: 577105205 ------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 10345 | 121K| 59 (2)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID| ALEX_T03 | 10345 | 121K| 59 (2)| 00:00:01 | |* 2 | INDEX RANGE SCAN | IDX_ALEX_T03_COL1 | 10461 | | 31 (0)| 00:00:01 | ------------------------------------------------------------------------------------------------- SQL> select * from alex_t03 where col1=5; ---------------------------------------------------------- Plan hash value: 4180973395 ------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 48527 | 568K| 87 (5)| 00:00:02 | |* 1 | TABLE ACCESS FULL| ALEX_T03 | 48527 | 568K| 87 (5)| 00:00:02 | ------------------------------------------------------------------------------
这次,我们发现实际选择度=10%的COL1=1的情况,走了索引扫描,实际选择度=50%的COL1=5的情况,走了全表扫描,这样就对了。
5. 总结
(1)在缺少直方图的情况下,对于索引列的数据分布,尽可能保证其均匀分布,否则会执行计划偏差。
(2)由于收集直方图成本较高,一般情况下不建议收集。
(3)索引列的数据分布在设计之初,尽可能保证较均匀分布,即理想选择度和实际选择度尽可能相等。
no comment untill now