Sunday, February 1, 2015

Why Index Covering Is Not That Important

I was quite surprised that the documentation recommends using multicolumn indexes in moderation: "Multicolumn indexes should be used sparingly. In most situations, an index on a single column is sufficient and saves space and time."

For a long-time SQL Server user with a strong habit of using covering indexes, this was counter-intuitive. I had to do some learning to understand why this recommendation makes sense. The following post demonstrates why covering indexes are less necessary in PostgreSql.

TL;DR;

PostgreSql is very likely to efficiently retrieve data via a single-column index. As a result, there is not much need for covering indexes.

Setting up the Table

We shall start with some basics: building a very narrow table with 1M rows, each row being exactly 16 bytes:

create table narrow_table as
with numbers as(
select generate_series as id from generate_series(0,1048575))
select id, 
sin(id) as sin_id,
123 as some_value 
from numbers;
alter table narrow_table
add constraint pk_narrow_table primary key(id);
create index narrow_table_sin_id on narrow_table(sin_id);
analyze narrow_table;
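To double-check the row width the planner works with, we can look at the width reported by a plain explain (a quick sanity check; the exact cost figures will differ from system to system):

explain select * from narrow_table;
-- the Seq Scan line should report width=16, the planner's estimate of the row width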

Running a Simple Query

Suppose that we have a range query on an indexed column sin_id:

select * from narrow_table 
where sin_id between %1 and %2

Will the index be used to satisfy the query? It depends on the number of rows meeting the criteria - if almost all the rows meet the criteria, it is cheaper to just scan the whole table. On the other hand, selecting just a couple of rows via an index is clearly cheaper than scanning the table. Somewhere between these two extremes lies the indifference point - the point at which scanning the table and accessing it via the index have approximately the same cost.

Let us find this indifference point by running range queries and checking whether an index is used. For example, the following query scans the whole table, and 41% of rows meet the filter condition:

select count(distinct some_value),
 count(*)/1048576.0*100 as cardinality_in_percent
from narrow_table 
where sin_id between 0 and 0.96

If we decrease the range just a little bit to (0, 0.95), then 39.89% of rows meet the filter condition, and the query planner chooses to scan the index. So the indifference point in this particular case is about 40% - if 40% of rows meet the filter condition, it does not matter much whether we scan the table or use the index: the costs are going to be more or less the same.
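To see which side of the indifference point a given range falls on, we can simply ask for the plan (the exact switch-over point will vary with your data and settings):

explain select * from narrow_table
where sin_id between 0 and 0.95;
-- expect some form of index access (most likely a bitmap scan) just below the
-- indifference point; widen the range to 0.96 and the plan should switch to
-- a sequential scan of the whole table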

Coming from SQL Server, I find this value of 40% very high for such a narrow table (its rows are only 16 bytes), and this is very good news. We'll discuss it in more detail later in this post.

Note: the exact value of the indifference point will be different for different tables. It depends on row width - for wider tables the value is higher. The reason is simple: if all the pages are going to be read anyway, then it is cheaper to just scan the table. If the table is narrow, then on average there are more rows per page, so it is more likely that every page in the table holds at least one row that needs to be read to satisfy the query.
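A quick way to see how many rows fit on a page is to compare the page and row counts that the planner keeps in the system catalog - these are the same numbers its cost model works with:

select relpages, reltuples, reltuples / relpages as rows_per_page
from pg_class
where relname = 'narrow_table';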

Low Sensitivity to Stale Statistics

Our example is a bit unrealistic, because we have not modified the table after creating the statistics, so the statistics are not stale. In real life the statistics are usually a bit off, which may cause the query planner to choose a wrong plan.

Suppose, for example, that the query planner estimated that 40% of rows meet the filter condition, and it chose to use the index. However, in reality 41% of rows are selected, so the query would run faster if we scanned the whole table. Clearly our plan is suboptimal, but not by much: it runs only about 1/40 = 2.5% worse than the table scan.
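A simple way to keep an eye on this is to compare the planner's estimate with the actual row count in the explain analyze output, and to refresh the statistics after bulk changes (a routine check rather than anything specific to this example):

explain analyze select * from narrow_table
where sin_id between 0 and 0.95;
-- compare the estimated rows=... with the actual rows=... in the output;
-- if they drift apart after bulk changes, refresh the statistics:
analyze narrow_table;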

Finding the Indifference Point on SQL Server - It Is Low

It is very common to create clustered indexes on SQL Server tables, so we shall use such a table in our example. For a SQL Server table with a clustered index, populated just like our PostgreSql one, the indifference point is nowhere near 40%. In fact, it is below 1%. You can build the table on SQL Server and see for yourself, using the following script:

-- TableWith1MRows is assumed to be a helper table holding the IDs 0 to 1048575,
-- playing the role of generate_series in the PostgreSql script above
SELECT ID, SIN(ID) AS UncorrelatedNumber, 123 AS SomeNumber
INTO NarrowTable FROM TableWith1MRows;

ALTER TABLE NarrowTable ALTER COLUMN ID INT NOT NULL;
ALTER TABLE NarrowTable ADD CONSTRAINT PK_NarrowTable PRIMARY KEY(ID);
CREATE INDEX NarrowTable_UncorrelatedNumber ON NarrowTable(UncorrelatedNumber);

SELECT COUNT(DISTINCT SomeNumber),
  COUNT(*)/1048576.0*100 AS CardinalityInPercent
FROM NarrowTable WHERE UncorrelatedNumber BETWEEN 0 AND 0.1;

Note: on SQL Server the indifference point also depends on table size, or, more precisely, on the depth of the clustered index - for larger tables the value is lower. This is beyond the scope of this post.

Note: it is essential that the index on UncorrelatedNumber is not clustered. If it were clustered, it would be used to access even very wide ranges on the table. However, we can only have one clustered index per table.

It Is the Low Indifference Point That Makes Us Frequently Use Index Covering

As we have seen, the indifference point for a SQL Server non-clustered index on UncorrelatedNumber is low, under 1%, which has two important consequences.

First, we need to address high sensitivity to stale statistics. If, for example, the optimizer thinks that 1% of rows meet the criteria and chooses to use the index, but the actual number of rows meeting the criteria is 2%, the query runs roughly 100% slower than the table scan.

Second, our index is not going to help if we want to select a range wider than the indifference point, such as 5%.

To address these two problems, we commonly use index covering.
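For example, on SQL Server we could cover our test query with an index along the following lines (the index name here is just an illustration):

CREATE INDEX NarrowTable_UncorrelatedNumber_Covering
ON NarrowTable(UncorrelatedNumber)
INCLUDE(SomeNumber);

Because ID is the clustering key, it is automatically carried in every non-clustered index, so including SomeNumber is enough for our test query to be answered entirely from this index, no matter how wide the selected range is.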

Since both these problems are less pronounced on PostgreSql, the need to create covering indexes on PostgreSql is less pronounced as well.

What do you think?

Saturday, January 24, 2015

Explicitly Stating Your Intent May Get You Better Execution Plan

Suppose that we need to query a PostgreSql database, and get the number of developers who currently have tickets assigned to them. The following query clearly describes the intent, and gets a good execution plan (the scripts that set up test data are provided below):
explain analyze
select count(*) from developers as d
where exists(select * from tickets as t
  where t.developer_id=d.developer_id);
"Aggregate  (cost=141506.53..141506.54 rows=1 width=0) (actual time=1542.890..1542.891 rows=1 loops=1)"

"  ->  Merge Semi Join  (cost=1.59..139010.67 rows=998343 width=0) (actual time=0.076..1461.177 rows=1048576 loops=1)"

"        Merge Cond: (d.developer_id = t.developer_id)"

"        ->  Index Only Scan using pk_developers on developers d  (cost=0.42..36032.07 rows=1048576 width=4) (actual time=0.045..289.787 rows=1048576 loops=1)"

"              Heap Fetches: 1048576"

"        ->  Index Only Scan using tickets_developer_id on tickets t  (cost=0.43..74148.15 rows=2097151 width=4) (actual time=0.024..598.723 rows=2097151 loops=1)"

"              Heap Fetches: 2097252"

"Planning time: 0.927 ms"

"Execution time: 1542.987 ms"

Clearly this execution plan, a merge join, is very efficient. 

There are many more contrived, more complicated ways to accomplish the same thing, such as the following query:
explain analyze
select count(distinct d.developer_id) 
from developers as d join tickets as t
  on t.developer_id=d.developer_id;
"Aggregate  (cost=139472.49..139472.50 rows=1 width=4) (actual time=4273.890..4273.891 rows=1 loops=1)"

"  ->  Hash Join  (cost=36476.96..134229.61 rows=2097151 width=4) (actual time=404.983..2438.207 rows=2097151 loops=1)"

"        Hash Cond: (t.developer_id = d.developer_id)"

"        ->  Seq Scan on tickets t  (cost=0.00..40572.51 rows=2097151 width=4) (actual time=0.008..362.940 rows=2097151 loops=1)"

"        ->  Hash  (cost=19273.76..19273.76 rows=1048576 width=4) (actual time=404.567..404.567 rows=1048576 loops=1)"

"              Buckets: 16384  Batches: 16  Memory Usage: 2321kB"

"              ->  Seq Scan on developers d  (cost=0.00..19273.76 rows=1048576 width=4) (actual time=0.006..179.046 rows=1048576 loops=1)"

"Planning time: 0.870 ms"

"Execution time: 4273.981 ms"
Clearly this query is logically equivalent to the previous one. However, it runs as an expensive hash join, uses up considerable memory, and takes almost three times longer to complete.

As we have seen, it usually makes sense to express our intent clearly. Contrived queries are not only more difficult to understand and maintain, they may also perform worse.

The following script creates and populates the tables used in this post.

create table developers as
with numbers as(
select generate_series as n from generate_series(0,1048575))
select n as developer_id, 
cast(n as text) as name,
cast('Some developer data ........' as text) as some_data 
from numbers;
alter table developers
add constraint pk_developers primary key(developer_id);
drop table if exists tickets;
create table tickets as
with numbers as(
select generate_series as n from generate_series(0,1048575*2))
select n/2 as developer_id,
n as ticket_id, 
cast('Normal' as text) as ticket_priority,
cast('Some ticket data ........' as text) as some_data 
from numbers;
alter table tickets
add constraint pk_ticket primary key(ticket_id);
create index tickets_developer_id on tickets(developer_id);
update tickets set ticket_priority='High'
where ticket_id between 1000 and 1100;
analyze developers;
analyze tickets;
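As a quick sanity check, the row counts should match the ones reported in the execution plans above - 1,048,576 developers and 2,097,151 tickets:

select (select count(*) from developers) as developers,
  (select count(*) from tickets) as tickets;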

Wednesday, January 21, 2015

Learning Bitmap Index Scan, Recheck Cond, and Bitmap Heap Scan

As Postgres scans an index and finds a matching key, it can choose not to read the matching row from the table right away. Instead, it can remember which page that matching row is on, finish scanning the index, and then read each data page with matching rows only once. As a result, Postgres expects to make fewer page reads.

Let us set up an example and see how it works. We shall create a very narrow table, with only 12 bytes per row.
create table narrow_table as
with numbers as(select generate_series as n from generate_series(0,1048575))
select n as seq_number, 
trunc(random()*1048575) as rand_number
from numbers;

alter table narrow_table
add constraint pk_narrow_table primary key(seq_number);
create index narrow_table_rand on narrow_table(rand_number);
cluster narrow_table using pk_narrow_table;
-- vacuum full must be run as a separate command
vacuum full analyze narrow_table;
This table is physically ordered by the seq_number column, as a result of the cluster command. However, there is very little correlation between rand_number values and the physical order of rows:

SELECT attname, correlation
FROM pg_stats
WHERE tablename LIKE '%narrow%';
"seq_number"1
"rand_number"-0.00202257
The following query uses Bitmap Index Scan, Recheck Cond, and Bitmap Heap Scan:
explain analyze select seq_number, rand_number 
from narrow_table
where rand_number between 1 and 1000;
"Bitmap Heap Scan on narrow_table  (cost=23.02..2667.43 rows=1034 width=12) (actual time=0.805..1.848 rows=968 loops=1)"
"  Recheck Cond: ((rand_number >= 1::double precision) AND (rand_number <= 1000::double precision))"
"  Heap Blocks: exact=903"
"  ->  Bitmap Index Scan on narrow_table_rand  (cost=0.00..22.77 rows=1034 width=0) (actual time=0.511..0.511 rows=968 loops=1)"
"        Index Cond: ((rand_number >= 1::double precision) AND (rand_number <= 1000::double precision))"
"Planning time: 0.276 ms"
"Execution time: 1.947 ms"
First, during the Bitmap Index Scan step, the index was scanned, and 968 rows matched the range condition. These rows were on 903 pages, which were read during the Bitmap Heap Scan step. All the rows on those pages were then checked against the range condition, which appears as the Recheck Cond in the execution plan.

The relative complexity of this plan allows Postgres to satisfy the query by reading only 903 pages. This makes sense if random page reads are expensive. Let us have the query planner think that random page reads are no more expensive than sequential ones:
SET random_page_cost = 1;
explain analyze select seq_number, rand_number 
from narrow_table
where rand_number between 1 and 1000;

"Index Scan using narrow_table_rand on narrow_table  (cost=0.42..971.98 rows=1034 width=12) (actual time=0.071..1.355 rows=968 loops=1)"
"  Index Cond: ((rand_number >= 1::double precision) AND (rand_number <= 1000::double precision))"
"Planning time: 0.273 ms"
"Execution time: 1.465 ms"
This time the plan is much simpler -  as Postgres scans the index and finds a matching key, the corresponding row is read right away, via a random page read. Since random page reads are cheap, going for a more complex plan to save less than 10% of reads does not make sense any more.

Also note that the query planner takes into account the correlation between column values and the physical location of rows. In all previous queries the correlation was very low, so every time a matching index key was found, it was most likely on a different page.

Let us see what happens when the correlation is high. Our table has been explicitly clustered on its first column, so the correlation between seq_number and the physical location of rows is 1, which is the highest possible value. This means that as we scan the index and find a matching key, the next matching row is very likely to be on the same page as the previous one.

As a result, even if we set the cost of random page reads to be very high, such as 10 instead of the default 4, there are not going to be many page reads - the next row is very likely to be found right on the page we have just read for the previous match. So the execution plan is going to be a simple index scan:
SET random_page_cost = 10;
explain analyze select seq_number, rand_number from narrow_table
where seq_number between 1 and 1000;
"Index Scan using pk_narrow_table on narrow_table  (cost=0.42..64.20 rows=939 width=12) (actual time=0.062..0.516 rows=1000 loops=1)"
"  Index Cond: ((seq_number >= 1) AND (seq_number <= 1000))"
"Planning time: 0.229 ms"
"Execution time: 0.641 ms"
As we have seen, if random page reads are expensive, and if the correlation between column values and physical location of rows is not very high, the query planner may use Bitmap Index Scan, Recheck Cond, and Bitmap Heap Scan in order to minimize the number of random page reads.