Discussion: query plan not optimal

query plan not optimal

From:
Marc Cousin
Date:
Hi,

I'm running into what I feel is a bit of a limitation of the optimizer (or something I don't understand :) ).

Sorry, this is a rather long mail.

I have a workaround for the problem below, but I don't like cheating the optimizer for no good reason.

First a little bit of context, because the example below may feel dumb if you read it with no explanation:

I'm working on the Bacula project, trying to further optimize data insertion into the database (we've been working on
this subject for a few years now, as backup speed is a core feature of any backup software).

Bacula is backup software: we store all of a backup's metadata in the database, which means every backed-up file…

The two very large tables are file and path:

  * file contains metadata about ALL files that have been backed up and are still available in the database. It
    sometimes contains several billion records.

  * path contains the full path of a file. It is stored outside the file table to save (lots of) space, as most paths
    are shared by many files. It still usually contains 20 to 50 million records, and can be as big as 20GB.

So we have something like this:

  create table file (fileid bigserial, filename text, pathid bigint);
  alter table file add primary key (fileid);

  create table path (pathid bigserial, path text);
  alter table path add primary key (pathid);
  create unique index idx_path on path(path);

I removed some columns from the file table: we store stat(), a checksum, etc. They're not needed for this example.

In order to insert data efficiently, backups first store data into temp tables which look like this:

  create table batch (path text, filename text);

(once again I removed all metadata columns)

If you want to create a batch table with useful data to replicate what I'm going to show, you can try something like
this:

  find /home -printf '%h\t%f\n' |  psql -c "COPY batch FROM STDIN" bacula

We analyze the batch table (in the real code it is a temp table, which autovacuum cannot analyze, so the explicit ANALYZE is compulsory):

  analyze batch;

Then we insert the missing paths. This is one of the plans that fail, but we'll focus on the second one: as we are starting
from an empty path table here, this query won't be realistic.

  insert into path (path)
    select path from batch
    where not exists (select 1 from path where path.path = batch.path)
    group by path;

We analyze:

  analyze path;


So now we insert into the file table.

  insert into file (pathid, filename) select pathid, filename from batch join path using (path);

Here is the plan:

bacula=# explain select pathid, filename from batch join path using (path);
                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join  (cost=822.25..22129.85 rows=479020 width=26)
   Hash Cond: (batch.path = path.path)
   ->  Seq Scan on batch  (cost=0.00..11727.20 rows=479020 width=85)
   ->  Hash  (cost=539.89..539.89 rows=22589 width=81)
         ->  Seq Scan on path  (cost=0.00..539.89 rows=22589 width=81)
(5 lignes)


I get this plan almost all the time. Let's add a bunch of useless records to path to be more realistic and make things
worse: usually, Bacula inserts data in batches of 500,000 rows, and path is very large (millions of records) and bigger
than work_mem (it would have to be 20GB in the most extreme cases), so the hash join may thrash the disks heavily (many
servers can be inserting at the same time).

Let's simulate this (remove indexes before, put them back afterwards :) )

  insert into path (path) select generate_series(1000000,50000000); -- create a realistic path table
  analyze path;
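
For completeness, the index juggling mentioned above would look something like this (the constraint name path_pkey is just PostgreSQL's default; check \d path for the real names on your system):

  drop index idx_path;
  alter table path drop constraint path_pkey;
  -- ... bulk insert and analyze as above ...
  alter table path add primary key (pathid);
  create unique index idx_path on path (path);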

Here is the plan:

bacula=# explain (analyze,buffers,verbose) select pathid, filename from batch join path using (path);
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1646119.66..1945643.59 rows=479020 width=26) (actual time=27275.240..36745.904 rows=479020 loops=1)
   Output: path.pathid, batch.filename
   Hash Cond: (batch.path = path.path)
   Buffers: shared hit=130760 read=179917 written=1823, temp read=224130 written=223876
   ->  Seq Scan on public.batch  (cost=0.00..11727.20 rows=479020 width=85) (actual time=0.259..176.031 rows=479020 loops=1)
         Output: batch.filename, batch.path
         Buffers: shared read=6937 written=1823
   ->  Hash  (cost=793966.96..793966.96 rows=49022696 width=16) (actual time=27274.725..27274.725 rows=49022590 loops=1)
         Output: path.pathid, path.path
         Buckets: 131072  Batches: 128  Memory Usage: 18329kB
         Buffers: shared hit=130760 read=172980, temp written=218711
         ->  Seq Scan on public.path  (cost=0.00..793966.96 rows=49022696 width=16) (actual time=0.231..9650.452 rows=49022590 loops=1)
               Output: path.pathid, path.path
               Buffers: shared hit=130760 read=172980
 Total runtime: 36781.919 ms
(15 lignes)


It seems like a logical choice (and it works quite well, but here we only have one of these queries running, and my path
table is still very small compared to real use cases).

Let's force a nested loop (we don't usually do it that way; we lower the *_page_cost settings instead):

  set enable_hashjoin TO off; set enable_mergejoin TO off;

bacula=# explain (analyze,buffers,verbose) select pathid, filename from batch join path using (path);
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.56..4001768.10 rows=479020 width=26) (actual time=2.303..15371.237 rows=479020 loops=1)
   Output: path.pathid, batch.filename
   Buffers: shared hit=2403958 read=7539
   ->  Seq Scan on public.batch  (cost=0.00..11727.20 rows=479020 width=85) (actual time=0.340..160.142 rows=479020 loops=1)
         Output: batch.path, batch.filename
         Buffers: shared read=6937
   ->  Index Scan using idx_path on public.path  (cost=0.56..8.32 rows=1 width=16) (actual time=0.030..0.031 rows=1 loops=479020)
         Output: path.pathid, path.path
         Index Cond: (path.path = batch.path)
         Buffers: shared hit=2403958 read=602
 Total runtime: 15439.043 ms


As you can see, it is more than twice as fast, with a very high hit ratio on the path table, even if we start from a cold
cache (which I did here, for both PostgreSQL and the OS). We have an excellent hit ratio because the batch table contains
few different paths (several files in a directory), and is already quite clustered, as it comes from a backup, which is of
course performed directory by directory.

What I think is the cause of the problem is that the planner doesn't take into account that we are going to fetch the
exact same values all the time from the path table, so we'll have a very good hit ratio. Maybe the n_distinct of
batch.path could be used to refine the caching effect on the index scan? The correlation is almost 0, but that's
normal: the directories aren't sorted alphabetically, yet all records for a given path are grouped together.
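
For reference, the statistics the planner has to work with here can be inspected with something like this (just an illustration, not part of my original test):

  select n_distinct, correlation
    from pg_stats
   where tablename = 'batch' and attname = 'path';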

For now, we work around this by using very low values for seq_page_cost and random_page_cost for these 2 queries. I
just feel that maybe PostgreSQL could do a bit better here, so I wanted to submit this use case for discussion.
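
Concretely, the workaround is roughly the following, scoped to the transaction doing the insert (a sketch with illustrative values, not the exact code we ship):

  begin;
  set local seq_page_cost = 0.1;     -- illustrative value
  set local random_page_cost = 0.1;  -- illustrative value
  insert into file (pathid, filename)
    select pathid, filename from batch join path using (path);
  commit;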

Regards

Marc


Re: query plan not optimal

From:
Jeff Janes
Date:
>                                                             QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=0.56..4001768.10 rows=479020 width=26) (actual time=2.303..15371.237 rows=479020 loops=1)
>    Output: path.pathid, batch.filename
>    Buffers: shared hit=2403958 read=7539
>    ->  Seq Scan on public.batch  (cost=0.00..11727.20 rows=479020 width=85) (actual time=0.340..160.142 rows=479020 loops=1)
>          Output: batch.path, batch.filename
>          Buffers: shared read=6937
>    ->  Index Scan using idx_path on public.path  (cost=0.56..8.32 rows=1 width=16) (actual time=0.030..0.031 rows=1 loops=479020)
>          Output: path.pathid, path.path
>          Index Cond: (path.path = batch.path)
>          Buffers: shared hit=2403958 read=602
>  Total runtime: 15439.043 ms
>
> As you can see, more than twice as fast, and a very high hit ratio on the path table, even if we start from a cold cache (I did, here, both PostgreSQL and OS). We have an excellent hit ratio because the batch table contains few different path (several files in a directory), and is already quite clustered, as it comes from a backup, which is of course performed directory by directory.

What is your effective_cache_size set to?

Cheers,

Jeff

Re: query plan not optimal

From:
Marc Cousin
Date:

On 19/12/2013 19:33, Jeff Janes wrote:
>                                                                 QUERY PLAN
>     ----------------------------------------------------------------------------------------------------------------------------------
>      Nested Loop  (cost=0.56..4001768.10 rows=479020 width=26) (actual time=2.303..15371.237 rows=479020 loops=1)
>        Output: path.pathid, batch.filename
>        Buffers: shared hit=2403958 read=7539
>        ->  Seq Scan on public.batch  (cost=0.00..11727.20 rows=479020 width=85) (actual time=0.340..160.142 rows=479020 loops=1)
>              Output: batch.path, batch.filename
>              Buffers: shared read=6937
>        ->  Index Scan using idx_path on public.path  (cost=0.56..8.32 rows=1 width=16) (actual time=0.030..0.031 rows=1 loops=479020)
>              Output: path.pathid, path.path
>              Index Cond: (path.path = batch.path)
>              Buffers: shared hit=2403958 read=602
>      Total runtime: 15439.043 ms
>
>
>     As you can see, more than twice as fast, and a very high hit ratio
>     on the path table, even if we start from a cold cache (I did, here,
>     both PostgreSQL and OS). We have an excellent hit ratio because the
>     batch table contains few different path (several files in a
>     directory), and is already quite clustered, as it comes from a
>     backup, which is of course performed directory by directory.
>
>
> What is your effective_cache_size set to?
>
> Cheers,
>
> Jeff

Yeah, I had forgotten to set it up correctly on this test environment
(its value is correctly set in production environments). Putting it to a
few gigabytes here gives me this cost:

bacula=# explain select pathid, filename from batch join path using (path);
                                 QUERY PLAN
----------------------------------------------------------------------------
 Nested Loop  (cost=0.56..2083904.10 rows=479020 width=26)
   ->  Seq Scan on batch  (cost=0.00..11727.20 rows=479020 width=85)
   ->  Index Scan using idx_path on path  (cost=0.56..4.32 rows=1 width=16)
         Index Cond: (path = batch.path)
(4 lignes)

It still chooses the hash join, but by a smaller margin.

And it will still only access a very small part of path (always the same
5000 records) during the query, which isn't accounted for in the cost,
if I understand correctly?
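
That figure is simply the number of distinct paths in the batch, e.g. (not part of the test above):

  select count(distinct path) from batch;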


Re: query plan not optimal

From:
Kevin Grittner
Date:
Marc Cousin <cousinmarc@gmail.com> wrote:

> Then we insert missing paths. This is one of the plans that fail

> insert into path (path)
>   select path from batch
>     where not exists
>           (select 1 from path where path.path=batch.path)
>     group by path;

I know you said you wanted to focus on a different query, but this
one can easily be optimized.  Right now it is checking for an
existing row in path for each row in batch; and it only needs to
check once for each path.  One way to write it would be:

insert into path (path)
  select path from (select distinct path from batch) b
    where not exists
          (select 1 from path p where p.path = b.path);

> So now we insert into the file table.
>
> insert into file (pathid,filename)
>   select pathid, filename from batch join path using (path);

> What I think is the cause of the problem is that the planner
> doesn't take into account that we are going to fetch the exact
> same values all the time in the path table, so we'll have a very
> good hit ratio.

It kinda takes that into account for the index part of things via
the effective_cache_size setting.  That should normally be set to
50% to 75% of machine RAM.

> Maybe the n_distinct from batch.path could be used to refine the
> caching effect on the index scan ?

Interesting idea.

> For now, we work around this by using very low values for
> seq_page_cost and random_page_cost for these 2 queries.

If you are not already doing so, you might want to try setting
cpu_tuple_cost to something in the 0.03 to 0.05 range.  I have
found that the default is too low compared to other cpu cost
factors, and raising it makes the exact settings for page costs
less sensitive -- that is, you get good plans over a wider range of
page cost settings.  I have sometimes been unable to get a good
plan for a query without boosting this, regardless of what I do
with other settings.

Running with a development build on my 16GB development PC, I got
your fast plan with your "big data" test case by making only this
one adjustment from the postgresql.conf defaults:

set effective_cache_size = '2GB';

I also got the fast plan if I left effective_cache_size at the
default and only changed:

set cpu_tuple_cost = 0.03;

I know that there have been adjustments to cost calculations for
use of large indexes in both minor and major releases.  If a little
sensible tuning of cost factors to better match reality doesn't do
it for you, you might want to consider an upgrade.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: query plan not optimal

From:
Marc Cousin
Date:
On 19/12/2013 21:36, Kevin Grittner wrote:
> Marc Cousin <cousinmarc@gmail.com> wrote:
>
>> Then we insert missing paths. This is one of the plans that fail
>> insert into path (path)
>>     select path from batch
>>       where not exists
>>             (select 1 from path where path.path=batch.path)
>>       group by path;
> I know you said you wanted to focus on a different query, but this
> one can easily be optimized.  Right now it is checking for an
> existing row in path for each row in batch; and it only needs to
> check once for each path.  One way to write it would be:
>
> insert into path (path)
>    select path from (select distinct path from batch) b
>      where not exists
>            (select 1 from path p where p.path = b.path);

Yeah, I know, that's why I said I didn't want to focus on this one… we
already do this optimization :)

Thanks anyway :)


Re: query plan not optimal

From:
Jeff Janes
Date:
On Thursday, December 19, 2013, Marc Cousin wrote:


> Yeah, I had forgotten to set it up correctly on this test environment
> (its value is correctly set in production environments). Putting it to a
> few gigabytes here gives me this cost:
>
> bacula=# explain select pathid, filename from batch join path using (path);
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  Nested Loop  (cost=0.56..2083904.10 rows=479020 width=26)
>    ->  Seq Scan on batch  (cost=0.00..11727.20 rows=479020 width=85)
>    ->  Index Scan using idx_path on path  (cost=0.56..4.32 rows=1 width=16)
>          Index Cond: (path = batch.path)
> (4 lignes)
>
> It still chooses the hash join though, but by a smaller margin.

This is still a tangent from your original point, but if I create index on path (path, pathid), then I can get an index only scan.  This actually is not much faster when everything is already cached, but the planner thinks it will be about 2x faster because it assumes the vm block accesses are free.  So this might be enough to tip it over for you.
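
Something along these lines, for instance (the index name is made up, and the vacuum is just to make sure the visibility map is current so the index-only scan actually pays off):

  create index idx_path_covering on path (path, pathid);
  vacuum analyze path;
  explain (analyze, buffers)
    select pathid, filename from batch join path using (path);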


> And it still only will access a very small part of path (always the same
> 5000 records) during the query, which isn't accounted for in the cost if
> I understand correctly ?

I think you are correct, that it doesn't take account of ndistinct being 10 to 100 fold less than ntuples on the outer loop, which theoretically could propagate down to the table size used in connection with effective_cache_size.

It seems to me the cost of the hash join is being greatly underestimated, which I think is more important than the nested loop being overestimated.  (And in my hands, the merge join is actually the winner both in the planner and in reality, but I suspect this is because all of your fake paths are lexically greater than all of the real paths)
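
One way to see the merge join locally, mirroring the enable_* trick used earlier in the thread, is something like:

  set enable_hashjoin to off;
  set enable_nestloop to off;
  explain (analyze, buffers)
    select pathid, filename from batch join path using (path);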

Also, you talked earlier about cheating the planner by lowering random_page_cost.  But why is that cheating?  If caching means the cost is truly lower... 

Cheers,

Jeff

Re: query plan not optimal

From:
Marc Cousin
Date:
On 29/12/2013 19:51, Jeff Janes wrote:
> On Thursday, December 19, 2013, Marc Cousin wrote:
>
>
>
>     Yeah, I had forgotten to set it up correctly on this test environment
>     (its value is correctly set in production environments). Putting it to a
>     few gigabytes here gives me this cost:
>
>     bacula=# explain select pathid, filename from batch join path using (path);
>                                      QUERY PLAN
>     ----------------------------------------------------------------------------
>      Nested Loop  (cost=0.56..2083904.10 rows=479020 width=26)
>        ->  Seq Scan on batch  (cost=0.00..11727.20 rows=479020 width=85)
>        ->  Index Scan using idx_path on path  (cost=0.56..4.32 rows=1 width=16)
>              Index Cond: (path = batch.path)
>     (4 lignes)
>
>     It still chooses the hash join though, but by a smaller margin.
>
>
> This is still a tangent from your original point, but if I create index
> on path (path, pathid), then I can get an index only scan.  This
> actually is not much faster when everything is already cached, but the
> planner thinks it will be about 2x faster because it assumes the vm
> block accesses are free.  So this might be enough to tip it over for you.

Yeah, still a tangent :)

Many Bacula users don't have index-only scans (the one I was having
trouble with, for example), as they are still running a PostgreSQL
version older than 9.2.

>
>
>     And it still only will access a very small part of path (always the same
>     5000 records) during the query, which isn't accounted for in the cost if
>     I understand correctly ?
>
>
> I think you are correct, that it doesn't take account of ndistinct being
> 10 to 100 fold less than ntuples on the outer loop, which theoretically
> could propagate down to the table size used in connection with
> effective_cache_size.
>
> It seems to me the cost of the hash join is being greatly
> underestimated, which I think is more important than the nested loop
> being overestimated.  (And in my hands, the merge join is actually the
> winner both in the planner and in reality, but I suspect this is because
> all of your fake paths are lexically greater than all of the real paths)

Yes, probably.

>
> Also, you talked earlier about cheating the planner by lowering
> random_page_cost.  But why is that cheating?  If caching means the cost
> is truly lower...

It feels like cheating because I feel I'm compensating for what looks like
a "bad" estimate of the cost: the nested loop is very fast, even if
nothing is cached at the beginning. We could hardcode low *_page_cost
values in Bacula's code for this query, but it is not that good to put
them in postgresql.conf as we currently do, since some other queries
suffer from those very low costs. Anyway, it would be even better if it
wasn't needed at all, hence this post :)