Discussion: query plan not optimal
Hi,

I'm running into what feels like a bit of a limitation of the optimizer (or something I don't understand :) ). Sorry, this is a rather long mail. I have a workaround for the problem below, but I don't like cheating the optimizer for no good reason.

First, a little bit of context, because the example below may feel dumb if you read it with no explanation: I'm working on the Bacula project, trying to further optimize data insertion into the database (we've been working on this subject for a few years now, as backup speed is a core functionality of any backup software). Bacula is a backup software: we store all of the backup's metadata in the database. That means all backed-up files…

The two very large tables are file and path:

* file contains metadata about ALL files that have been backed up and are still available in the database. It sometimes contains several billion records.
* path contains the full path of a file. It is stored out of file to save (lots of) space, as most paths are identical across files. It still usually contains 20 to 50 million records, and can be as big as 20GB.

So we have something like this:

create table file (fileid bigserial, filename text, pathid bigint);
alter table file add primary key (fileid);
create table path (pathid bigserial, path text);
alter table path add primary key (pathid);
create unique index idx_path on path(path);

I removed some columns from the file table: we store stat(), a checksum, etc. They're not needed for this example.
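To put rough numbers on the space argument for storing path out of file, here is a back-of-envelope sketch (all sizes below are illustrative assumptions picked to match the orders of magnitude quoted above, not measurements):

```python
# Rough estimate of the space saved by normalizing paths out of file.
# All figures are illustrative assumptions, not measurements.
AVG_PATH_LEN = 80            # bytes; assumed average path length
FILE_ROWS = 1_000_000_000    # "several billion records" -> use 1e9
DISTINCT_PATHS = 30_000_000  # "20 to 50 million records" -> use 30e6

denormalized = FILE_ROWS * AVG_PATH_LEN  # path text repeated on every file row
# path text stored once, plus an 8-byte bigint pathid per file row
normalized = DISTINCT_PATHS * AVG_PATH_LEN + FILE_ROWS * 8

savings_gb = (denormalized - normalized) / 1024**3
print(f"~{savings_gb:.0f} GB saved by normalizing paths")
```

Even with these conservative guesses, the normalized layout saves tens of gigabytes, which matches the "(lots) of space" claim.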
In order to insert data efficiently, backups first store data into temp tables which look like this:

create table batch (path text, filename text);

(once again, I removed all the metadata columns)

If you want to create a batch table with useful data to replicate what I'm going to show, you can try something like this:

find /home -printf '%h\t%f\n' | psql -c "COPY batch FROM STDIN" bacula

We analyze the batch table (in the real code it is a temp table, so this is compulsory):

analyze batch;

Then we insert the missing paths. This is one of the plans that fail, but we'll focus on the second one: as we are starting from an empty path table here, this query won't be realistic.

insert into path (path)
  select path from batch
  where not exists (select 1 from path where path.path=batch.path)
  group by path;

We analyze:

analyze path;

So now we insert into the file table:

insert into file (pathid,filename)
  select pathid, filename from batch join path using (path);

Here is the plan:

bacula=# explain select pathid, filename from batch join path using (path);
                               QUERY PLAN
-----------------------------------------------------------------------
 Hash Join  (cost=822.25..22129.85 rows=479020 width=26)
   Hash Cond: (batch.path = path.path)
   ->  Seq Scan on batch  (cost=0.00..11727.20 rows=479020 width=85)
   ->  Hash  (cost=539.89..539.89 rows=22589 width=81)
         ->  Seq Scan on path  (cost=0.00..539.89 rows=22589 width=81)
(5 lignes)

I get this plan almost all the time. Let's add a bunch of useless records to path to be more realistic and make things worse: usually, bacula inserts data in 500,000-row batches, and path is very large (millions of records) and bigger than work_mem (it would have to be 20GB in the most extreme cases), so it may thrash disks heavily (many servers can be inserting at the same time).
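Logically, the missing-path insert above is just a set difference between the distinct paths in batch and the paths already stored. A toy Python model of that logic (made-up sample data, just to make the semantics explicit):

```python
# Toy model of:
#   insert into path (path)
#     select path from batch
#     where not exists (select 1 from path where path.path=batch.path)
#     group by path;
existing_paths = {"/home/alice", "/home/bob"}   # rows already in path
batch_rows = [
    ("/home/alice", "a.txt"),
    ("/home/alice", "b.txt"),   # duplicate path: GROUP BY collapses it
    ("/home/carol", "c.txt"),
]

# GROUP BY path -> distinct paths; NOT EXISTS -> drop paths already stored
missing = {path for path, _ in batch_rows} - existing_paths
existing_paths |= missing       # the INSERT itself

print(sorted(missing))          # ['/home/carol']
```

The point of doing it this way is that each distinct path is checked once, not once per file.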
Let's simulate this (remove the indexes before, put them back afterwards :) ):

insert into path (path) select generate_series(1000000,50000000); -- create a realistic path table
analyze path;

Here is the plan:

bacula=# explain (analyze,buffers,verbose) select pathid, filename from batch join path using (path);
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1646119.66..1945643.59 rows=479020 width=26) (actual time=27275.240..36745.904 rows=479020 loops=1)
   Output: path.pathid, batch.filename
   Hash Cond: (batch.path = path.path)
   Buffers: shared hit=130760 read=179917 written=1823, temp read=224130 written=223876
   ->  Seq Scan on public.batch  (cost=0.00..11727.20 rows=479020 width=85) (actual time=0.259..176.031 rows=479020 loops=1)
         Output: batch.filename, batch.path
         Buffers: shared read=6937 written=1823
   ->  Hash  (cost=793966.96..793966.96 rows=49022696 width=16) (actual time=27274.725..27274.725 rows=49022590 loops=1)
         Output: path.pathid, path.path
         Buckets: 131072  Batches: 128  Memory Usage: 18329kB
         Buffers: shared hit=130760 read=172980, temp written=218711
         ->  Seq Scan on public.path  (cost=0.00..793966.96 rows=49022696 width=16) (actual time=0.231..9650.452 rows=49022590 loops=1)
               Output: path.pathid, path.path
               Buffers: shared hit=130760 read=172980
 Total runtime: 36781.919 ms
(15 lignes)

It seems like a logical choice (and it's working quite well here, but we only have one of these running, and my path table is still very small compared to real use cases).

Let's force a nested loop (we don't usually do it this way; we lower the *_page_cost instead):

set enable_hashjoin TO off;
set enable_mergejoin TO off;

bacula=# explain (analyze,buffers,verbose) select pathid, filename from batch join path using (path);
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.56..4001768.10 rows=479020 width=26) (actual time=2.303..15371.237 rows=479020 loops=1)
   Output: path.pathid, batch.filename
   Buffers: shared hit=2403958 read=7539
   ->  Seq Scan on public.batch  (cost=0.00..11727.20 rows=479020 width=85) (actual time=0.340..160.142 rows=479020 loops=1)
         Output: batch.path, batch.filename
         Buffers: shared read=6937
   ->  Index Scan using idx_path on public.path  (cost=0.56..8.32 rows=1 width=16) (actual time=0.030..0.031 rows=1 loops=479020)
         Output: path.pathid, path.path
         Index Cond: (path.path = batch.path)
         Buffers: shared hit=2403958 read=602
 Total runtime: 15439.043 ms

As you can see, it's more than twice as fast, with a very high hit ratio on the path table, even though we start from a cold cache (I did here, for both PostgreSQL and the OS). We get an excellent hit ratio because the batch table contains few distinct paths (several files per directory) and is already quite clustered, as it comes from a backup, which is of course performed directory by directory.

What I think is the cause of the problem is that the planner doesn't take into account that we are going to fetch the exact same values all the time in the path table, so we'll have a very good hit ratio. Maybe the n_distinct of batch.path could be used to refine the caching effect on the index scan? The correlation is almost 0, but that's normal: the directories aren't sorted alphabetically, but all records for a given path are grouped together.

For now, we work around this by using very low values of seq_page_cost and random_page_cost for these two queries. I just feel that maybe PostgreSQL could do a bit better here, so I wanted to submit this use case for discussion.

Regards

Marc
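The buffer counts in the nested-loop plan make the caching argument concrete: 479,020 index probes on path produced 2,403,958 buffer hits against only 602 reads, because the probes keep revisiting the same few thousand paths. A quick check of the arithmetic (numbers taken from the EXPLAIN output above; the distinct-path count is the ~5000 figure mentioned later in the thread):

```python
# Numbers from the nested-loop EXPLAIN (analyze, buffers) output above.
probes = 479_020          # loops of the index scan on idx_path
shared_hit = 2_403_958    # buffer hits on path during the index scans
shared_read = 602         # actual block reads on path

hit_ratio = shared_hit / (shared_hit + shared_read)
print(f"hit ratio: {hit_ratio:.4%}")   # ~99.97%

# With ~5000 distinct paths among 479020 probes, only ~1% of probes can
# even be a "first touch" of a path -- all the rest must hit cache.
distinct_paths = 5_000
print(f"first-touch upper bound: {distinct_paths / probes:.2%}")
```

This is exactly the effect the planner's per-probe cost of 8.32 does not capture: after the first touch of each hot path, every subsequent probe is essentially free.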
>                                                             QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=0.56..4001768.10 rows=479020 width=26) (actual time=2.303..15371.237 rows=479020 loops=1)
>    Output: path.pathid, batch.filename
>    Buffers: shared hit=2403958 read=7539
>    ->  Seq Scan on public.batch  (cost=0.00..11727.20 rows=479020 width=85) (actual time=0.340..160.142 rows=479020 loops=1)
>          Output: batch.path, batch.filename
>          Buffers: shared read=6937
>    ->  Index Scan using idx_path on public.path  (cost=0.56..8.32 rows=1 width=16) (actual time=0.030..0.031 rows=1 loops=479020)
>          Output: path.pathid, path.path
>          Index Cond: (path.path = batch.path)
>          Buffers: shared hit=2403958 read=602
>  Total runtime: 15439.043 ms
>
> As you can see, it's more than twice as fast, with a very high hit ratio on the path table, even though we start from a cold cache (I did here, for both PostgreSQL and the OS). We get an excellent hit ratio because the batch table contains few distinct paths (several files per directory) and is already quite clustered, as it comes from a backup, which is of course performed directory by directory.
What is your effective_cache_size set to?
Cheers,
Jeff
On 19/12/2013 19:33, Jeff Janes wrote:
> [...]
>
> What is your effective_cache_size set to?

Yeah, I had forgotten to set it up correctly on this test environment (its value is correctly set in production environments). Putting it at a few gigabytes here gives me this cost:

bacula=# explain select pathid, filename from batch join path using (path);
                                 QUERY PLAN
----------------------------------------------------------------------------
 Nested Loop  (cost=0.56..2083904.10 rows=479020 width=26)
   ->  Seq Scan on batch  (cost=0.00..11727.20 rows=479020 width=85)
   ->  Index Scan using idx_path on path  (cost=0.56..4.32 rows=1 width=16)
         Index Cond: (path = batch.path)
(4 lignes)

It still chooses the hash join, though by a smaller margin.
And it will still access only a very small part of path (always the same 5000 records) during the query, which isn't accounted for in the cost, if I understand correctly?
Marc Cousin <cousinmarc@gmail.com> wrote:

> Then we insert missing paths. This is one of the plans that fail
>
> insert into path (path)
>   select path from batch
>   where not exists
>     (select 1 from path where path.path=batch.path)
>   group by path;

I know you said you wanted to focus on a different query, but this one can easily be optimized. Right now it is checking for an existing row in path for each row in batch; and it only needs to check once for each path. One way to write it would be:

insert into path (path)
  select path from (select distinct path from batch) b
  where not exists
    (select 1 from path p where p.path = b.path);

> So now we insert into the file table.
>
> insert into file (pathid,filename)
>   select pathid, filename from batch join path using (path);

> What I think is the cause of the problem is that the planner
> doesn't take into account that we are going to fetch the exact
> same values all the time in the path table, so we'll have a very
> good hit ratio.

It kinda takes that into account for the index part of things via the effective_cache_size setting. That should normally be set to 50% to 75% of machine RAM.

> Maybe the n_distinct from batch.path could be used to refine the
> caching effect on the index scan ?

Interesting idea.

> For now, we work around this by using very low values for
> seq_page_cost and random_page_cost for these 2 queries.

If you are not already doing so, you might want to try setting cpu_tuple_cost to something in the 0.03 to 0.05 range. I have found that the default is too low compared to other cpu cost factors, and raising it makes the exact settings for page costs less sensitive -- that is, you get good plans over a wider range of page cost settings. I have sometimes been unable to get a good plan for a query without boosting this, regardless of what I do with other settings.
Running with a development build on my 16GB development PC, I got your fast plan with your "big data" test case by making only this one adjustment from the postgresql.conf defaults:

set effective_cache_size = '2GB';

I also got the fast plan if I left effective_cache_size at the default and only changed:

set cpu_tuple_cost = 0.03;

I know that there have been adjustments to cost calculations for use of large indexes in both minor and major releases. If a little sensible tuning of cost factors to better match reality doesn't do it for you, you might want to consider an upgrade.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
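Kevin's cpu_tuple_cost suggestion can be read as rebalancing the plan comparison: the hash join must process every row of the huge path table, so its total cost grows much faster with cpu_tuple_cost than the nested loop's does. A schematic model of that effect (the component costs below are made up for illustration; this is not the real planner arithmetic):

```python
# Schematic, NOT the actual PostgreSQL cost model:
#   total_cost ~ io_cost + cpu_tuple_cost * tuples_processed
# Hypothetical component costs, chosen to illustrate the crossover.
HASH_IO, HASH_TUPLES = 850_000, 49_500_000   # must hash all ~49M path rows
NL_IO, NL_TUPLES = 1_500_000, 1_000_000      # ~480k outer rows + ~480k probes

def cost(io, tuples, cpu_tuple_cost):
    return io + cpu_tuple_cost * tuples

for ctc in (0.01, 0.03):   # default vs Kevin's suggested range
    hj = cost(HASH_IO, HASH_TUPLES, ctc)
    nl = cost(NL_IO, NL_TUPLES, ctc)
    winner = "hash join" if hj < nl else "nested loop"
    print(f"cpu_tuple_cost={ctc}: hash={hj:,.0f} nl={nl:,.0f} -> {winner}")
```

With these numbers the hash join wins at the default 0.01 but loses at 0.03, which is the flip Kevin observed.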
On 19/12/2013 21:36, Kevin Grittner wrote:
> Marc Cousin <cousinmarc@gmail.com> wrote:
>
>> Then we insert missing paths. This is one of the plans that fail
>>
>> insert into path (path)
>>   select path from batch
>>   where not exists
>>     (select 1 from path where path.path=batch.path)
>>   group by path;
>
> I know you said you wanted to focus on a different query, but this
> one can easily be optimized. Right now it is checking for an
> existing row in path for each row in batch; and it only needs to
> check once for each path. One way to write it would be:
>
> insert into path (path)
>   select path from (select distinct path from batch) b
>   where not exists
>     (select 1 from path p where p.path = b.path);

Yeah, I know, that's why I said I didn't want to focus on this one… we already do this optimization :)

Thanks anyway :)
On Thursday, December 19, 2013, Marc Cousin wrote:
> Yeah, I had forgotten to set it up correctly on this test environment
> (its value is correctly set in production environments). Putting it to a
> few gigabytes here gives me this cost:
>
> bacula=# explain select pathid, filename from batch join path using (path);
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  Nested Loop  (cost=0.56..2083904.10 rows=479020 width=26)
>    ->  Seq Scan on batch  (cost=0.00..11727.20 rows=479020 width=85)
>    ->  Index Scan using idx_path on path  (cost=0.56..4.32 rows=1 width=16)
>          Index Cond: (path = batch.path)
> (4 lignes)
>
> It still chooses the hash join though, but by a smaller margin.
This is still a tangent from your original point, but if I create an index on path (path, pathid), then I can get an index-only scan. This actually is not much faster when everything is already cached, but the planner thinks it will be about 2x faster because it assumes the visibility map block accesses are free. So this might be enough to tip it over for you.
> And it still only will access a very small part of path (always the same
> 5000 records) during the query, which isn't accounted for in the cost if
> I understand correctly ?
I think you are correct that it doesn't take account of ndistinct being 10 to 100 fold less than ntuples on the outer loop, which theoretically could propagate down to the table size used in connection with effective_cache_size.
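For reference, the planner does already discount repeated page visits against effective_cache_size: index_pages_fetched() in costsize.c uses the Mackert-Lohman approximation. A simplified sketch of that approximation (adapted from memory of the costsize.c formula; the page counts below are rough figures derived from the plans in this thread) shows how much the estimate would shrink if the "table size" fed into it were the hot set of ~5000 paths instead of the whole relation:

```python
def index_pages_fetched(N, T, b):
    """Simplified Mackert-Lohman estimate (after PostgreSQL's
    index_pages_fetched in costsize.c): distinct pages fetched when N
    tuples are fetched from a T-page relation with a b-page cache."""
    if T <= b:
        return min(2.0 * T * N / (2.0 * T + N), float(T))
    lim = 2.0 * T * b / (2.0 * T - b)
    if N <= lim:
        return 2.0 * T * N / (2.0 * T + N)
    return b + (N - lim) * (T - b) / T

N = 479_020              # index probes, one per batch row
b = 2 * 1024**3 // 8192  # effective_cache_size = 2GB in 8kB pages

# Planner's view: probes can land anywhere in the whole path relation
# (~303k pages, per the seq-scan buffer counts in the big-data plan).
print(f"full table: ~{index_pages_fetched(N, 303_740, b):,.0f} pages")
# If ndistinct propagated down: probes only ever touch ~5000 hot pages.
print(f"hot set:    ~{index_pages_fetched(N, 5_000, b):,.0f} pages")
```

The gap between the two estimates (hundreds of thousands of pages versus a few thousand) is essentially the mispricing being discussed.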
It seems to me the cost of the hash join is being greatly underestimated, which I think is more important than the nested loop being overestimated. (And in my hands, the merge join is actually the winner both in the planner and in reality, but I suspect this is because all of your fake paths are lexically greater than all of the real paths)
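One concrete way the hash join cost can be optimistic: once the inner table outgrows work_mem, the build side spills into batches (the big-data plan above reports "Batches: 128"), and every batch is written to and re-read from temp files. A rough model of the batching (the per-tuple overhead figure is an assumption; batch counts are rounded up to a power of two, as PostgreSQL does):

```python
import math

# Figures from the "big data" EXPLAIN: ~49M path rows of width 16.
rows, width = 49_022_590, 16
OVERHEAD = 32                 # assumed per-tuple hash-table overhead, bytes
table_bytes = rows * (width + OVERHEAD)

work_mem = 18 * 1024 * 1024   # ~18MB, roughly the per-batch memory in the plan

# Number of batches, rounded up to the next power of two
nbatch = 2 ** math.ceil(math.log2(table_bytes / work_mem))
print(f"hash table ~{table_bytes / 1024**3:.1f} GB -> {nbatch} batches")
```

Under these assumptions the model reproduces the 128 batches in the plan, and with a 20GB path table the spill traffic (and the underestimation) only gets worse.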
Also, you talked earlier about cheating the planner by lowering random_page_cost. But why is that cheating? If caching means the cost is truly lower...
Cheers,
Jeff
On 29/12/2013 19:51, Jeff Janes wrote:
> On Thursday, December 19, 2013, Marc Cousin wrote:
>
> [...]
>
> This is still a tangent from your original point, but if I create an index
> on path (path, pathid), then I can get an index only scan. This
> actually is not much faster when everything is already cached, but the
> planner thinks it will be about 2x faster because it assumes the vm
> block accesses are free. So this might be enough to tip it over for you.

Yeah, still a tangent :) Many bacula users don't have index-only scans (the one I was having trouble with, for example), as they are still using a PostgreSQL version older than 9.2.

> I think you are correct, that it doesn't take account of ndistinct being
> 10 to 100 fold less than ntuples on the outer loop, which theoretically
> could propagate down to the table size used in connection with
> effective_cache_size.
>
> It seems to me the cost of the hash join is being greatly
> underestimated, which I think is more important than the nested loop
> being overestimated. (And in my hands, the merge join is actually the
> winner both in the planner and in reality, but I suspect this is because
> all of your fake paths are lexically greater than all of the real paths)

Yes, probably.

> Also, you talked earlier about cheating the planner by lowering
> random_page_cost. But why is that cheating? If caching means the cost
> is truly lower...

It feels like cheating because I'm compensating for what looks like a "bad" estimate of the cost: the nested loop is very fast even if nothing is cached at the beginning. We could hardcode low *_page_cost values in bacula's code for this query, but it is not that good to put them in postgresql.conf as we currently do, since some other queries suffer from those very low costs. Anyway, it would be even better if it wasn't needed at all, hence this post :)