query plan not optimal

From: Marc Cousin
Subject: query plan not optimal
Msg-id: 52B311C4.1070108@gmail.com
Replies: Re: query plan not optimal
         Re: query plan not optimal
List: pgsql-performance
Hi,

I'm running into what I feel is a bit of a limitation of the optimizer (or something I don't understand :) ).

Sorry, this is a rather long mail.

I have a workaround for the problem below, but I don't like cheating the optimizer for no good reason.

First a little bit of context, because the example below may feel dumb if you read it with no explanation:

I'm working on the bacula project, trying to further optimize data insertion into the database (we've been working on this subject for a few years now, as backup speed is a core functionality of any backup software).

Bacula is backup software: we store all of the backup's metadata in the database. That means all backed up files…

The two very large tables are file and path:

  * file contains metadata about ALL files that have been backed up and are still available in the database. It sometimes contains several billion records.

  * path contains the full path of a file. It is stored outside of file to save (lots of) space, as most paths are identical across many files. It still usually contains 20 to 50 million records, and can be as big as 20GB.

So we have something like this:

  create table file (fileid bigserial, filename text, pathid bigint);
  alter table file add primary key (fileid);

  create table path (pathid bigserial, path text);
  alter table path add primary key (pathid);
  create unique index idx_path on path(path);

I removed some columns from the file table: we store stat(), a checksum, etc. They're not needed for this example.

In order to insert data efficiently, backups first store data into temp tables which look like this:

  create table batch (path text, filename text);

(once again I removed all metadata columns)

If you want to create a batch table with useful data to replicate what I'm going to show, you can try something like
this:

  find /home -printf '%h\t%f\n' |  psql -c "COPY batch FROM STDIN" bacula
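
If you don't have a handy /home to scan, a synthetic batch gives a similar shape of data (this is purely made up for the example: about 20 files per directory, rows already grouped by path the way a real backup would produce them):

  -- purely synthetic data, just for illustration: ~20 files per directory
  insert into batch (path, filename)
    select '/home/dir' || (i / 20), 'file' || i
    from generate_series(1, 500000) as i;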

We analyze the batch table: in the real code it is a temp table, so this analyze is compulsory:

  analyze batch;

Then we insert missing paths. This is one of the plans that fail, but we'll focus on the second one: as we are starting from an empty path table, this query won't be realistic.

  insert into path (path) select path from batch where not exists (select 1 from path where path.path=batch.path) group by path;

We analyze:

  analyze path;


So now we insert into the file table.

insert into file (pathid,filename) select pathid, filename from batch join path using (path);

Here is the plan:

bacula=# explain select pathid, filename from batch join path using (path);
                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join  (cost=822.25..22129.85 rows=479020 width=26)
   Hash Cond: (batch.path = path.path)
   ->  Seq Scan on batch  (cost=0.00..11727.20 rows=479020 width=85)
   ->  Hash  (cost=539.89..539.89 rows=22589 width=81)
         ->  Seq Scan on path  (cost=0.00..539.89 rows=22589 width=81)
(5 lignes)


I get this plan almost all the time. Let's add a bunch of useless records to path to be more realistic and make things worse: usually, bacula inserts data in batches of 500,000 rows, path is very large (millions of records) and bigger than work_mem (it would have to be 20GB in the most extreme cases), so the hash join may thrash the disks heavily (many servers can be inserting at the same time).

Let's simulate this (remove indexes before, put them back afterwards :) )

  insert into path (path) select generate_series(1000000,50000000); -- create a realistic path table
  analyze path;
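
(For completeness, the index juggling mentioned above could look like this; a minimal sketch only, since the real schema has more indexes, which could be handled the same way:)

  -- sketch of "remove indexes before, put them back afterwards"
  drop index idx_path;                          -- before the bulk insert above
  create unique index idx_path on path(path);   -- after it, followed by analyze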

Here is the plan:

bacula=# explain (analyze,buffers,verbose) select pathid, filename from batch join path using (path);
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1646119.66..1945643.59 rows=479020 width=26) (actual time=27275.240..36745.904 rows=479020 loops=1)
   Output: path.pathid, batch.filename
   Hash Cond: (batch.path = path.path)
   Buffers: shared hit=130760 read=179917 written=1823, temp read=224130 written=223876
   ->  Seq Scan on public.batch  (cost=0.00..11727.20 rows=479020 width=85) (actual time=0.259..176.031 rows=479020 loops=1)
         Output: batch.filename, batch.path
         Buffers: shared read=6937 written=1823
   ->  Hash  (cost=793966.96..793966.96 rows=49022696 width=16) (actual time=27274.725..27274.725 rows=49022590 loops=1)
         Output: path.pathid, path.path
         Buckets: 131072  Batches: 128  Memory Usage: 18329kB
         Buffers: shared hit=130760 read=172980, temp written=218711
         ->  Seq Scan on public.path  (cost=0.00..793966.96 rows=49022696 width=16) (actual time=0.231..9650.452 rows=49022590 loops=1)
               Output: path.pathid, path.path
               Buffers: shared hit=130760 read=172980
 Total runtime: 36781.919 ms
(15 lignes)


It seems like a logical choice (and it's working quite well, but we only have 1 of these running, and my path table is still very small compared to real use cases).

Let's force a nested loop (we don't usually do it this way; we lower the *_page_cost settings instead):

  set enable_hashjoin TO off; set enable_mergejoin TO off;

bacula=# explain (analyze,buffers,verbose) select pathid, filename from batch join path using (path);
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.56..4001768.10 rows=479020 width=26) (actual time=2.303..15371.237 rows=479020 loops=1)
   Output: path.pathid, batch.filename
   Buffers: shared hit=2403958 read=7539
   ->  Seq Scan on public.batch  (cost=0.00..11727.20 rows=479020 width=85) (actual time=0.340..160.142 rows=479020 loops=1)
         Output: batch.path, batch.filename
         Buffers: shared read=6937
   ->  Index Scan using idx_path on public.path  (cost=0.56..8.32 rows=1 width=16) (actual time=0.030..0.031 rows=1 loops=479020)
         Output: path.pathid, path.path
         Index Cond: (path.path = batch.path)
         Buffers: shared hit=2403958 read=602
 Total runtime: 15439.043 ms


As you can see, it's more than twice as fast, with a very high hit ratio on the path table, even though we start from a cold cache (which I did here, for both PostgreSQL and the OS). We get an excellent hit ratio because the batch table contains few distinct paths (several files per directory) and is already quite clustered, as it comes from a backup, which is of course performed directory by directory.
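
(The same hit ratio can be checked outside of EXPLAIN through the cumulative statistics views; a rough sketch, assuming the counters were reset just before running the query:)

  -- buffer hits vs. reads per table since the last stats reset
  select relname, heap_blks_hit, heap_blks_read, idx_blks_hit, idx_blks_read
  from pg_statio_user_tables
  where relname in ('path', 'batch');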

What I think is the cause of the problem is that the planner doesn't take into account that we are going to fetch the exact same values all the time in the path table, so we'll have a very good hit ratio. Maybe the n_distinct of batch.path could be used to refine the caching effect on the index scan? The correlation is almost 0, but that's normal: the directories aren't sorted alphabetically, but all records for a given path are grouped together.
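
(For reference, the statistics I'm talking about are already gathered by ANALYZE and visible in pg_stats, for instance:)

  -- n_distinct and physical correlation collected for batch.path
  select n_distinct, correlation
  from pg_stats
  where tablename = 'batch' and attname = 'path';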

For now, we work around this by using very low values for seq_page_cost and random_page_cost for these 2 queries. I just feel that maybe PostgreSQL could do a bit better here, so I wanted to submit this use case for discussion.
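
Concretely, the workaround looks something like this (the cost values below are only illustrative; SET LOCAL keeps the change scoped to the transaction):

  begin;
  -- make index scans look cheap for just this statement; values are illustrative
  set local seq_page_cost = 0.1;
  set local random_page_cost = 0.1;
  insert into file (pathid, filename)
    select pathid, filename from batch join path using (path);
  commit;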

Regards

Marc

