Обсуждение: Idea how to get rid of Bitmap Heap Scan

Поиск
Список
Период
Сортировка

Idea how to get rid of Bitmap Heap Scan

От
"Michael N. Mikhulya"
Дата:
Hello!

There are many questions on internet about whether it is possible to
optimize "Bitmap Heap Scan" somehow without answer, so seems like
problem is rather important.

The query I need to optimize is:

EXPLAIN SELECT date_trunc('day', d.created_at) AS day, COUNT(*) AS
download FROM downloads d WHERE d.file_id in (select id from files
where owner_id = 443) AND d.download_status != 0 AND d.created_at >=
'2009-12-05' AND d.created_at < '2009-12-16' GROUP BY 1;

       QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=15809.49..17126.20 rows=87781 width=8)
   ->  Hash Semi Join  (cost=5809.51..15368.11 rows=88276 width=8)
         Hash Cond: (d.file_id = files.id)
         ->  Index Scan using idx_downloads_created_at on downloads d
(cost=0.00..7682.73 rows=88276 width=16)
               Index Cond: ((created_at >= '2009-12-05
00:00:00'::timestamp without time zone) AND (created_at < '2009-12-16
00:00:00'::timestamp without time zone))
         ->  Hash  (cost=5741.51..5741.51 rows=5440 width=8)
               ->  Bitmap Heap Scan on files  (cost=106.42..5741.51
rows=5440 width=8)
                     Recheck Cond: (owner_id = 443)
                     ->  Bitmap Index Scan on idx_files_owner
(cost=0.00..105.06 rows=5440 width=0)
                           Index Cond: (owner_id = 443)

The problem here is that we are forced to fetch "files" in Bitmap Heap Scan.
But actually there is no need for the whole "files" record. The
necessary data is only "files" ids.

The idea is to avoid fetching data from "files" table, and get the ids
from index! (probably it is a little bit tricky, but it is a
performance area...)

I created an index with following command:
create index idx_files_owner_id ON files (owner_id, id);
and even tried to remove old index to enforce postgresql to use newly
created index.
But postgresql still do Bitmap Heap Scan.

(The other idea is to use raw_id as a primary key of "files" table to
don't extend index. But I don't know whether it is possible at all or
this idea have some drawbacks)

I think it worth to learn postgreql to do this trick especially taking
into account there are many questions about whether it is possible to
optimize such a queries.

If there is an known solution to this problem please provide a link to it.

With best regards,
Michael Mikhulya.

Re: Idea how to get rid of Bitmap Heap Scan

От
Matthew Wakeling
Дата:
On Fri, 18 Dec 2009, Michael N. Mikhulya wrote:
> The problem here is that we are forced to fetch "files" in Bitmap Heap Scan.
> But actually there is no need for the whole "files" record. The
> necessary data is only "files" ids.
>
> The idea is to avoid fetching data from "files" table, and get the ids
> from index! (probably it is a little bit tricky, but it is a
> performance area...)

Unfortunately, the index does not contain enough information to accomplish
this. This is due to Postgres' advanced concurrency control system.
Postgres needs to fetch the actual rows from the files table in order to
check whether that row is visible in the current transaction, and a Bitmap
Index Scan is the fastest way to do this.

You can speed this up in Postgres 8.4 by having a RAID array and setting
the effective_concurrency configuration to the number of spindles in the
RAID array, or by having gobs of RAM and keeping everything in cache.

Matthew

--
 A good programmer is one who looks both ways before crossing a one-way street.
 Considering the quality and quantity of one-way streets in Cambridge, it
 should be no surprise that there are so many good programmers there.

Re: Idea how to get rid of Bitmap Heap Scan

От
"Michael N. Mikhulya"
Дата:
Thank you very much. I catch the point why it is done so.

But I'm curious whether it is still possible to don't fetch data from
files table just because inappropriate ids (e.g. removed ones) will
not produce any wrong effect just because them indirectly "checked" on
downloads table?
Here I mean that if we get id (from index) for file which is actually
removed, then we will not find anything in downloads table.
Probably my knowledge about MVCC is too little to see whole picture,
so if it is not hard to you please point the "failure" scenario (when
we get wrong result) or locking issue, ...

Michael Mikhulya

> Unfortunately, the index does not contain enough information to accomplish
> this. This is due to Postgres' advanced concurrency control system. Postgres
> needs to fetch the actual rows from the files table in order to check
> whether that row is visible in the current transaction, and a Bitmap Index
> Scan is the fastest way to do this.
>
> You can speed this up in Postgres 8.4 by having a RAID array and setting the
> effective_concurrency configuration to the number of spindles in the RAID
> array, or by having gobs of RAM and keeping everything in cache.
>
> Matthew
>
> --
> A good programmer is one who looks both ways before crossing a one-way
> street.
> Considering the quality and quantity of one-way streets in Cambridge, it
> should be no surprise that there are so many good programmers there.
>

Re: Idea how to get rid of Bitmap Heap Scan

От
Greg Stark
Дата:
On Fri, Dec 18, 2009 at 4:18 PM, Michael N. Mikhulya
<m.mikhulya@gmail.com> wrote:
> Thank you very much. I catch the point why it is done so.
>
> But I'm curious whether it is still possible to don't fetch data from
> files table just because inappropriate ids (e.g. removed ones) will
> not produce any wrong effect just because them indirectly "checked" on
> downloads table?
> Here I mean that if we get id (from index) for file which is actually
> removed, then we will not find anything in downloads table.
> Probably my knowledge about MVCC is too little to see whole picture,
> so if it is not hard to you please point the "failure" scenario (when
> we get wrong result) or locking issue, ...


Yup this ought to be possible and fruitful, I believe Heikki already
produced a partial patch to this end. If you're interested in working
on it you could skim back in the logs and start with that. I don't
recall any special keywords to search on but it might be in one of the
threads for the "visibility map" or it might be under "index-only
scans".

A word of warning, in my experience the hardest part for changes like
this isn't the executor changes (which in this case wouldn't be far
from easy) but the planner changes to detect when this new plan would
be better.

--
greg

Re: Idea how to get rid of Bitmap Heap Scan

От
Robert Haas
Дата:
On Fri, Dec 18, 2009 at 12:29 PM, Greg Stark <gsstark@mit.edu> wrote:
> A word of warning, in my experience the hardest part for changes like
> this isn't the executor changes (which in this case wouldn't be far
> from easy) but the planner changes to detect when this new plan would
> be better.

There's also the problem of making the visibility map crash-safe.  I
think I heard you might have some ideas on that one - has it been
discussed on -hackers?

...Robert

Re: Idea how to get rid of Bitmap Heap Scan

От
Greg Stark
Дата:
On Sun, Dec 20, 2009 at 2:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 18, 2009 at 12:29 PM, Greg Stark <gsstark@mit.edu> wrote:
>> A word of warning, in my experience the hardest part for changes like
>> this isn't the executor changes (which in this case wouldn't be far
>> from easy) but the planner changes to detect when this new plan would
>> be better.
>
> There's also the problem of making the visibility map crash-safe.  I
> think I heard you might have some ideas on that one - has it been
> discussed on -hackers?

Not sure what ideas you mean.

In the original poster's plan that isn't an issue. We could scan the
index, perform the joins and restriction clauses, and only check the
visibility on the resulting tuples which slip through them all. That
would be possible even without crash-safe visibility bits.

--
greg

Re: Idea how to get rid of Bitmap Heap Scan

От
Tom Lane
Дата:
Greg Stark <gsstark@mit.edu> writes:
> In the original poster's plan that isn't an issue. We could scan the
> index, perform the joins and restriction clauses, and only check the
> visibility on the resulting tuples which slip through them all. That
> would be possible even without crash-safe visibility bits.

Yeah, this was floated years ago as being a potentially interesting
approach when all the join-condition fields are indexed.  You end up
never having to fetch rows that don't pass the join.

It certainly seems reasonably straightforward on the executor side.
As Greg said, the hard part is planning it sanely.

            regards, tom lane

Re: Idea how to get rid of Bitmap Heap Scan

От
Robert Haas
Дата:
On Sun, Dec 20, 2009 at 11:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <gsstark@mit.edu> writes:
>> In the original poster's plan that isn't an issue. We could scan the
>> index, perform the joins and restriction clauses, and only check the
>> visibility on the resulting tuples which slip through them all. That
>> would be possible even without crash-safe visibility bits.
>
> Yeah, this was floated years ago as being a potentially interesting
> approach when all the join-condition fields are indexed.  You end up
> never having to fetch rows that don't pass the join.
>
> It certainly seems reasonably straightforward on the executor side.
> As Greg said, the hard part is planning it sanely.

Yeah, but that seems REALLY hard.  First, there's the difficulty of
actually generating all the paths and costing them appropriately.  A
plan to perform an index scan but defer the heap fetches until later
has a hidden cost associated with it: the heap fetches will cost
something, but we don't know how much until we get the row estimate
for the node where we choose to implement them.  Without knowing that
cost, it's hard to be confident in discarding other plans that are
apparently more expensive.  That's probably solvable by adopting a
more sophisticated method for comparing costs, but that gets you to
the second problem, which is doing all of this with reasonable
performance.  You're going to have a lot more paths than we do now,
and there will be many queries for which there are only trivial cost
differences between them (like any query where most or all of the
joins have a selectivity of exactly 1.0, which is a very common case
for me).

...Robert