Sunday, February 1, 2015

Why Index Covering Is Not That Important

I was quite surprised that the documentation recommends using multicolumn indexes in moderation: "Multicolumn indexes should be used sparingly. In most situations, an index on a single column is sufficient and saves space and time."

For a long time SQL Server user with a strong habit of using covering indexes, that was counter-intuitive. I had to do some learning to understand why this recommendation makes sense. The following post demonstrates why using covering indexes is less necessary in PostgreSql.

TL;DR;

PostgreSql is very likely to efficiently retrieve data via a single-column index. As a result, there is not much need for covering indexes.

Setting up the Table

We shall start with some basics: building a very narrow table with 1M rows, each row being exactly 16 bytes:

create table narrow_table as

with numbers as(
select generate_series as id from generate_series(0,1048575))
select id, 
sin(id) as sin_id,
123 as some_value 
from numbers;
alter table narrow_table
add constraint pk_narrow_table primary key(id);
create index narrow_table_sin_id on narrow_table(sin_id);
analyze narrow_table;

Running a Simple Query

Suppose that we have a range query on an indexed column sin_id:

select * from narrow_table 
where sin_id between %1 and %2

Will the index be used to satisfy the query? It depends on the number of rows meeting the criteria - if almost all the rows meet the criteria, it is cheaper to just scan the whole table. On the other hand, selecting just a couple of rows via an index is clearly cheaper than scanning the table. Somewhere between these two extreme cases there is the indifference point, the case when scanning the table and accessing it via the index have approximately the same costs.

Let us find this indifference point by running range queries and checking whether an index is used. For example, the following query scans the whole:
 table, and 41% rows meet the filter condition:

select count(distinct some_value),
 count(*)/1048576.0*100 as cardinality_in_percent
from narrow_table 
where sin_id between 0 and 0.96

If we decrease the range just a little bit to (0, 0.95), then 39.89% rows meet the filter condition, and the query planner chooses to scan the index. So the indifference point in this particular case is about 40% - if 40% rows meet the filter condition, it is does not matter much whether we scan the table or use the index: the cost are going to be more or less the same.

After my SQL Server experience, I think that this value, 40%, is very high for such a narrow table (its rows are only 16 bytes), and this is very good news. We'll discuss it more later in this post.

Note: the exact value of indifference point will be different for different tables. It depends on rows' width - for wider tables the value is higher. The reason is simple: if all the pages are going to be read anyway, than it is cheaper to just scan the table. If the table is narrow, then on average there are more rows per page. As such, it is more likely that every page in the table has at least one row that needs to be read to satisfy the query.

Low Sensitivity to Stale Statistics

Our example is a bit unrealistic, because we have not modified the table after creating the statistics, so the statistics are not stale. In real life the statistics are usually a bit off, which may cause the query planner to choose a wrong plan.

Suppose, for example, that the query planner estimated that 40% rows meet the filter condition, and it chose to use the index. However, in reality 41% rows are selected, so the query would run faster if we scanned the whole table. Clearly our plan is suboptimal, but not by much: it runs only 1/40=2.5% worse than the table scan.

Finding Indifference Point on SQL Server - It Is Low

It is very common to create clustering indexes on SQL Server tables, so we shall use such a table in our example. For a SQL Server table with a clustered index, populated just like out PostgreSql one, the indifference point is nowhere near 40%. In fact, it is below 1%. You can build the table on SQL Server, and see for yourself, using the following script:

SELECT ID, SIN(ID) AS UncorrelatedNumber, 123 AS SomeNumber 
INTO NarrowTable FROM TableWith1MRows; 
 
 ALTER TABLE NarrowTable ALTER COLUMN ID INT NOT NULL; 
 ALTER TABLE NarrowTable ADD CONSTRAINT PK_NarrowTable PRIMARY KEY(ID);  
 CREATE INDEX NarrowTable_UncorrelatedNumber ON NarrowTable(UncorrelatedNumber); 
 SELECT COUNT(DISTINCT SomeNumber),  
 
   COUNT(*)/1048576.0*100 AS CardinalityInPercent  
FROM NarrowTable WHERE UncorrelatedNumber BETWEEN 0 AND 0.1; 
 

Note: on SQL Server the indifference point also depends on table size, or, more precisely, on the depth of clustering index - for larger tables the value is lower. This is beyond the scope of this post.

Note: it is essential that the index on UncorrelatedNumber is not clustered. If it were clustered, it would be used to access even very wide ranges on the table. However, we can only have one clustered index per table.

It Is Low Indifference Point That Makes Us Frequently Use Index Covering

As we have seen, the indifference point for a SQL Server non-clustered index on UncorrelatedNumber is low, under 1%, which has two important consequences.

First, we need to address high sensitivity to stale statistics. If, for example, the optimizer thinks that 1% of rows meet the criteria, and chooses to use the index, and the actual number of rows meeting the criteria is 2%, the query runs 100% slower that the table scan.

Second, our index is not going to help if we want to select a range wider than the indifference point, such as 5%.

To address these two problems, we commonly use index covering.

Since both these problems are less pronounced on PostgreSql, the need to create covering indexes on PostgreSql is less pronounced as well.

What do you think?

No comments:

Post a Comment