Saturday, January 24, 2015

Explicitly Stating Your Intent May Get You a Better Execution Plan

Suppose that we need to query a PostgreSQL database and get the number of developers who currently have tickets assigned to them. The following query clearly describes the intent, and gets a good execution plan (the script that sets up the test data is provided at the end of this post):
explain analyze
select count(*) from developers as d
where exists(select * from tickets as t
  where t.developer_id=d.developer_id);
"Aggregate  (cost=141506.53..141506.54 rows=1 width=0) (actual time=1542.890..1542.891 rows=1 loops=1)"

"  ->  Merge Semi Join  (cost=1.59..139010.67 rows=998343 width=0) (actual time=0.076..1461.177 rows=1048576 loops=1)"

"        Merge Cond: (d.developer_id = t.developer_id)"

"        ->  Index Only Scan using pk_developers on developers d  (cost=0.42..36032.07 rows=1048576 width=4) (actual time=0.045..289.787 rows=1048576 loops=1)"

"              Heap Fetches: 1048576"

"        ->  Index Only Scan using tickets_developer_id on tickets t  (cost=0.43..74148.15 rows=2097151 width=4) (actual time=0.024..598.723 rows=2097151 loops=1)"

"              Heap Fetches: 2097252"

"Planning time: 0.927 ms"

"Execution time: 1542.987 ms"

Clearly this execution plan, a merge semi join over two index-only scans, is very efficient.
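
As a side note, the same intent can also be expressed with IN; in recent PostgreSQL versions the planner normally transforms this form into a semi-join as well, so it should get a plan similar to the one above (not shown here):

explain analyze
select count(*) from developers as d
where d.developer_id in (select t.developer_id from tickets as t);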

There are many other, more contrived and complicated ways to accomplish the same thing, such as the following query:
explain analyze
select count(distinct d.developer_id)
from developers as d join tickets as t
  on t.developer_id=d.developer_id;
"Aggregate  (cost=139472.49..139472.50 rows=1 width=4) (actual time=4273.890..4273.891 rows=1 loops=1)"

"  ->  Hash Join  (cost=36476.96..134229.61 rows=2097151 width=4) (actual time=404.983..2438.207 rows=2097151 loops=1)"

"        Hash Cond: (t.developer_id = d.developer_id)"

"        ->  Seq Scan on tickets t  (cost=0.00..40572.51 rows=2097151 width=4) (actual time=0.008..362.940 rows=2097151 loops=1)"

"        ->  Hash  (cost=19273.76..19273.76 rows=1048576 width=4) (actual time=404.567..404.567 rows=1048576 loops=1)"

"              Buckets: 16384  Batches: 16  Memory Usage: 2321kB"

"              ->  Seq Scan on developers d  (cost=0.00..19273.76 rows=1048576 width=4) (actual time=0.006..179.046 rows=1048576 loops=1)"

"Planning time: 0.870 ms"

"Execution time: 4273.981 ms"
Clearly this query is logically equivalent to the previous one, yet it runs as an expensive hash join, using considerable memory and taking almost three times as long to complete.
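
The intent can be buried even deeper, for example behind a correlated aggregate. The following sketch is logically equivalent as well; its plan is not shown here, but a formulation like this typically pushes the optimizer toward evaluating a subquery per developer row instead of a simple semi-join:

explain analyze
select count(*) from developers as d
where (select count(*) from tickets as t
  where t.developer_id=d.developer_id) > 0;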

As we have seen, it usually makes sense to express our intent clearly. Contrived queries are not only more difficult to understand and maintain; they may also perform worse.

The following script creates and populates the tables used in this post.

-- 1,048,576 developers
create table developers as
with numbers as(
select generate_series as n from generate_series(0,1048575))
select n as developer_id,
cast(n as text) as name,
cast('Some developer data ........' as text) as some_data
from numbers;

alter table developers
add constraint pk_developers primary key(developer_id);

-- roughly two tickets per developer (2,097,151 rows)
drop table if exists tickets;
create table tickets as
with numbers as(
select generate_series as n from generate_series(0,1048575*2))
select n/2 as developer_id,
n as ticket_id,
cast('Normal' as text) as ticket_priority,
cast('Some ticket data ........' as text) as some_data
from numbers;

alter table tickets
add constraint pk_ticket primary key(ticket_id);
create index tickets_developer_id on tickets(developer_id);

update tickets set ticket_priority='High'
where ticket_id between 1000 and 1100;

analyze developers;
analyze tickets;
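
If you want to verify the setup, the row counts should match the numbers seen in the plans above, 1048576 developers and 2097151 tickets:

select count(*) from developers;
select count(*) from tickets;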





