Обсуждение: Idea how to get rid of Bitmap Heap Scan
Hello! There are many questions on internet about whether it is possible to optimize "Bitmap Heap Scan" somehow without answer, so seems like problem is rather important. The query I need to optimize is: EXPLAIN SELECT date_trunc('day', d.created_at) AS day, COUNT(*) AS download FROM downloads d WHERE d.file_id in (select id from files where owner_id = 443) AND d.download_status != 0 AND d.created_at >= '2009-12-05' AND d.created_at < '2009-12-16' GROUP BY 1; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=15809.49..17126.20 rows=87781 width=8) -> Hash Semi Join (cost=5809.51..15368.11 rows=88276 width=8) Hash Cond: (d.file_id = files.id) -> Index Scan using idx_downloads_created_at on downloads d (cost=0.00..7682.73 rows=88276 width=16) Index Cond: ((created_at >= '2009-12-05 00:00:00'::timestamp without time zone) AND (created_at < '2009-12-16 00:00:00'::timestamp without time zone)) -> Hash (cost=5741.51..5741.51 rows=5440 width=8) -> Bitmap Heap Scan on files (cost=106.42..5741.51 rows=5440 width=8) Recheck Cond: (owner_id = 443) -> Bitmap Index Scan on idx_files_owner (cost=0.00..105.06 rows=5440 width=0) Index Cond: (owner_id = 443) The problem here is that we are forced to fetch "files" in Bitmap Heap Scan. But actually there is no need for the whole "files" record. The necessary data is only "files" ids. The idea is to avoid fetching data from "files" table, and get the ids from index! (probably it is a little bit tricky, but it is a performance area...) I created an index with following command: create index idx_files_owner_id ON files (owner_id, id); and even tried to remove old index to enforce postgresql to use newly created index. But postgresql still do Bitmap Heap Scan. (The other idea is to use raw_id as a primary key of "files" table to don't extend index. But I don't know whether it is possible at all or this idea have some drawbacks) I think it worth to learn postgreql to do this trick especially taking into account there are many questions about whether it is possible to optimize such a queries. If there is an known solution to this problem please provide a link to it. With best regards, Michael Mikhulya.
On Fri, 18 Dec 2009, Michael N. Mikhulya wrote: > The problem here is that we are forced to fetch "files" in Bitmap Heap Scan. > But actually there is no need for the whole "files" record. The > necessary data is only "files" ids. > > The idea is to avoid fetching data from "files" table, and get the ids > from index! (probably it is a little bit tricky, but it is a > performance area...) Unfortunately, the index does not contain enough information to accomplish this. This is due to Postgres' advanced concurrency control system. Postgres needs to fetch the actual rows from the files table in order to check whether that row is visible in the current transaction, and a Bitmap Index Scan is the fastest way to do this. You can speed this up in Postgres 8.4 by having a RAID array and setting the effective_concurrency configuration to the number of spindles in the RAID array, or by having gobs of RAM and keeping everything in cache. Matthew -- A good programmer is one who looks both ways before crossing a one-way street. Considering the quality and quantity of one-way streets in Cambridge, it should be no surprise that there are so many good programmers there.
Thank you very much. I catch the point why it is done so. But I'm curious whether it is still possible to don't fetch data from files table just because inappropriate ids (e.g. removed ones) will not produce any wrong effect just because them indirectly "checked" on downloads table? Here I mean that if we get id (from index) for file which is actually removed, then we will not find anything in downloads table. Probably my knowledge about MVCC is too little to see whole picture, so if it is not hard to you please point the "failure" scenario (when we get wrong result) or locking issue, ... Michael Mikhulya > Unfortunately, the index does not contain enough information to accomplish > this. This is due to Postgres' advanced concurrency control system. Postgres > needs to fetch the actual rows from the files table in order to check > whether that row is visible in the current transaction, and a Bitmap Index > Scan is the fastest way to do this. > > You can speed this up in Postgres 8.4 by having a RAID array and setting the > effective_concurrency configuration to the number of spindles in the RAID > array, or by having gobs of RAM and keeping everything in cache. > > Matthew > > -- > A good programmer is one who looks both ways before crossing a one-way > street. > Considering the quality and quantity of one-way streets in Cambridge, it > should be no surprise that there are so many good programmers there. >
On Fri, Dec 18, 2009 at 4:18 PM, Michael N. Mikhulya <m.mikhulya@gmail.com> wrote: > Thank you very much. I catch the point why it is done so. > > But I'm curious whether it is still possible to don't fetch data from > files table just because inappropriate ids (e.g. removed ones) will > not produce any wrong effect just because them indirectly "checked" on > downloads table? > Here I mean that if we get id (from index) for file which is actually > removed, then we will not find anything in downloads table. > Probably my knowledge about MVCC is too little to see whole picture, > so if it is not hard to you please point the "failure" scenario (when > we get wrong result) or locking issue, ... Yup this ought to be possible and fruitful, I believe Heikki already produced a partial patch to this end. If you're interested in working on it you could skim back in the logs and start with that. I don't recall any special keywords to search on but it might be in one of the threads for the "visibility map" or it might be under "index-only scans". A word of warning, in my experience the hardest part for changes like this isn't the executor changes (which in this case wouldn't be far from easy) but the planner changes to detect when this new plan would be better. -- greg
On Fri, Dec 18, 2009 at 12:29 PM, Greg Stark <gsstark@mit.edu> wrote: > A word of warning, in my experience the hardest part for changes like > this isn't the executor changes (which in this case wouldn't be far > from easy) but the planner changes to detect when this new plan would > be better. There's also the problem of making the visibility map crash-safe. I think I heard you might have some ideas on that one - has it been discussed on -hackers? ...Robert
On Sun, Dec 20, 2009 at 2:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 18, 2009 at 12:29 PM, Greg Stark <gsstark@mit.edu> wrote: >> A word of warning, in my experience the hardest part for changes like >> this isn't the executor changes (which in this case wouldn't be far >> from easy) but the planner changes to detect when this new plan would >> be better. > > There's also the problem of making the visibility map crash-safe. I > think I heard you might have some ideas on that one - has it been > discussed on -hackers? Not sure what ideas you mean. In the original poster's plan that isn't an issue. We could scan the index, perform the joins and restriction clauses, and only check the visibility on the resulting tuples which slip through them all. That would be possible even without crash-safe visibility bits. -- greg
Greg Stark <gsstark@mit.edu> writes: > In the original poster's plan that isn't an issue. We could scan the > index, perform the joins and restriction clauses, and only check the > visibility on the resulting tuples which slip through them all. That > would be possible even without crash-safe visibility bits. Yeah, this was floated years ago as being a potentially interesting approach when all the join-condition fields are indexed. You end up never having to fetch rows that don't pass the join. It certainly seems reasonably straightforward on the executor side. As Greg said, the hard part is planning it sanely. regards, tom lane
On Sun, Dec 20, 2009 at 11:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: >> In the original poster's plan that isn't an issue. We could scan the >> index, perform the joins and restriction clauses, and only check the >> visibility on the resulting tuples which slip through them all. That >> would be possible even without crash-safe visibility bits. > > Yeah, this was floated years ago as being a potentially interesting > approach when all the join-condition fields are indexed. You end up > never having to fetch rows that don't pass the join. > > It certainly seems reasonably straightforward on the executor side. > As Greg said, the hard part is planning it sanely. Yeah, but that seems REALLY hard. First, there's the difficulty of actually generating all the paths and costing them appropriately. A plan to perform an index scan but defer the heap fetches until later has a hidden cost associated with it: the heap fetches will cost something, but we don't know how much until we get the row estimate for the node where we choose to implement them. Without knowing that cost, it's hard to be confident in discarding other plans that are apparently more expensive. That's probably solvable by adopting a more sophisticated method for comparing costs, but that gets you to the second problem, which is doing all of this with reasonable performance. You're going to have a lot more paths than we do now, and there will be many queries for which there are only trivial cost differences between them (like any query where most or all of the joins have a selectivity of exactly 1.0, which is a very common case for me). ...Robert