Discussion: Query performance help with 'shadow table' approach.
Hi all,

I've got a large table that has 15 million+ rows in it, and a set of queries I've been trying to speed up. The table has a primary key column and a couple hundred other columns. These queries basically do a 'select max(primary_key_column) from table group by column1, column2'. Because of the group by, we end up with a sequential scan of the entire table, which proves to be costly.

Since the table has a ton of columns, I set up a smaller table that will house a copy of some of the data that the query uses: the primary key column and the two columns I do my 'group by' on. My application is smart enough to update this 'shadow' table whenever the main table is updated, so it will accurately mirror the other table. This shadow table will also only contain one row for every column1 and column2 combination (due to the group by), and for those rows, will have the max of the primary key. Even with this, the 'shadow' table will have about 14 million rows, compared to the 15 million in the main table.

Here is an example query that I'm working with:

postgres=# explain select T2.pkey_sid, T2.column54, T2.column44, T2.column67
           FROM public.mytable AS T2
           JOIN public.mytable_shadow AS T3 ON (T2.pkey_sid = T3.pkey_sid)
           WHERE T3.column1 >= 1072310434 AND T3.column1 <= 1074124834;
                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Hash Join  (cost=118310.65..2250928.27 rows=409600 width=8)
   Hash Cond: (t2.pkey_sid = t3.pkey_sid)
   ->  Seq Scan on mytable t2  (cost=0.00..2075725.51 rows=15394251 width=8)
   ->  Hash  (cost=113190.65..113190.65 rows=409600 width=8)
         ->  Bitmap Heap Scan on mytable_shadow t3  (cost=12473.65..113190.65 rows=409600 width=8)
               Recheck Cond: ((1072310434 <= column1) AND (column1 <= 1074124834))
               ->  Bitmap Index Scan on mytable_shadow_pkey  (cost=0.00..12371.25 rows=409600 width=0)
                     Index Cond: ((1072310434 <= column1) AND (column1 <= 1074124834))
(8 rows)

So the issue here comes in retrieving the needed data from my main table. The resulting row count is estimated at 409,600, and retrieving the primary keys associated with those rows is actually really easy. However, when we take those 409,600 rows back to the main table to retrieve the other columns I need, the planner just does a sequential scan, as that is most likely going to be faster than hitting the index and then fetching the columns I need for all 400K+ rows.

Things to note:
1. If I reduce my where clause's range, then the sequential scan turns into an index scan, but sadly this can't always be done.
2. I have appropriate indexes where they need to be. The issue is in the query planner not using them due to it (I assume) just being faster to scan the whole table when the data set it needs is as large as it is.
3. Without this shadow table, my query would look _something_ like this (the idea being: retrieve a certain set of columns from the rows with the max(primary key) based on my group by):

select pkey_sid, column54, column44, column67
from public.mytable
where pkey_sid in (select max(pkey_sid) from public.mytable group by column1, column2);

So I need to see how I can speed this up. Is my approach misguided, or are there other ways I can go about it? Any thoughts, suggestions, or info would be greatly appreciated. I tried to explain it all simply; if I can be clearer, let me know.

Thanks,
- Brian F
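[For reference, a minimal sketch of what such a shadow table and its supporting index might look like. The names mytable_shadow, pkey_sid, column1 and column2 come from the thread; the column types, the primary key choice, and the separate index on column1 are assumptions, since the original definitions aren't shown.]

-- Assumed sketch: one row per (column1, column2) holding the max primary key.
CREATE TABLE public.mytable_shadow (
    pkey_sid bigint NOT NULL,            -- max(pkey_sid) for this combination
    column1  bigint NOT NULL,
    column2  bigint NOT NULL,
    PRIMARY KEY (column1, column2)
);

-- The range predicate in the posted plan filters on column1 alone, so an index
-- whose leading column is column1 is what the bitmap scan would be using.
CREATE INDEX mytable_shadow_column1_idx ON public.mytable_shadow (column1);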
Hi,

On 14 September 2011 07:44, Brian Fehrle <brianf@consistentstate.com> wrote:
> 2. I have appropriate indexes where they need to be. The issue is in the
> query planner not using them due to it (I assume) just being faster to scan
> the whole table when the data set it needs is as large as it is.

Try reducing random_page_cost to 2, which biases the planner towards index scans (set random_page_cost = 2 before the query; assuming the defaults of 1 for seq_page_cost and 4 for random_page_cost), and run "explain analyze". Sometimes it is also worth disabling nested loop joins (set enable_nestloop = off).

Finally, you can increase default_statistics_target (or use ALTER TABLE ... ALTER COLUMN ... SET STATISTICS) to 100 (8.4 has this as the default) on selected columns or the whole table, and then run ANALYZE on that table.

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
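[In session terms, the experiment Ondrej describes could look roughly like the following sketch. The query and column names are taken from the original post; wrapping the planner settings in a transaction with SET LOCAL is an assumption made here so nothing persists beyond the test.]

-- Session-local planner experiment (sketch; values from the suggestions above).
BEGIN;
SET LOCAL random_page_cost = 2;      -- bias the planner towards index scans
SET LOCAL enable_nestloop = off;     -- optionally rule out nested loop joins

EXPLAIN ANALYZE
SELECT T2.pkey_sid, T2.column54, T2.column44, T2.column67
FROM public.mytable AS T2
JOIN public.mytable_shadow AS T3 ON (T2.pkey_sid = T3.pkey_sid)
WHERE T3.column1 >= 1072310434 AND T3.column1 <= 1074124834;
ROLLBACK;

-- Larger statistics samples for the grouped columns, then re-analyze:
ALTER TABLE public.mytable ALTER COLUMN column1 SET STATISTICS 100;
ALTER TABLE public.mytable ALTER COLUMN column2 SET STATISTICS 100;
ANALYZE public.mytable;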
On 13 Sep 2011, at 23:44, Brian Fehrle wrote:

> These queries basically do a 'select max(primary_key_column) from table group by column1, column2.' Because of the group by, we would result in a sequential scan of the entire table which proves to be costly.

That seems to suggest a row where the primary key that has the max value is "special" in some way. Making them more easily distinguishable from "normal" rows seems like a good idea here.

> Since the table has a ton of columns, I set up a smaller table that will house a copy of some of the data that the query uses, the Primary Key column, and the two columns I do my 'group by' on.

That's one way to distinguish these special rows from the rest. You could also mark them as special using an extra column and/or create an expression-based index over just those rows.

However, especially with the below section in mind, it would appear your data could be normalised a bit more (your splitting off that shadow table is a step in doing so, in fact).

I'm also wondering, does your primary key have actual meaning? It would appear to just indicate the order in which the records were created (I'm assuming it's a serial-type surrogate PK, and not a natural one).

> This shadow table will also only contain one row for every column1 and column2 combination (due to the group by), and for those rows, will have the max of the primary key. Even with this, the 'shadow' table will have about 14 million rows, compared to the 15 million in the main table.

Don't (column1, column2) make up a key then? I get the feeling you should split your table in 3 sections:

Table 1: main lookup (PK: pkey_sid)
Table 2: Variation lookup (PK: (column1, column2), FK: pkey_sid)
Table 3: Data (FK: the above)

> So the issue here comes in retrieving the needed data from my main table. The resulting rows is estimated to be 409,600, and the retrieving of the primary key's that are associated with those rows is actually really easy. However, when we take those 409,600 rows back to the main table to retrieve the other columns I need, the planner is just doing a sequential scan as it's most likely going to be faster than hitting the index then retrieving the columns I need for all 400K+ rows.

Is that estimate accurate? If not, see Ondrej's suggestions.

That is only about 1/30th of your table. I don't think a seqscan makes sense here unless your data is distributed badly.

> Things to note:
> 1. If I reduce my where clause's range, then the sequential scan turns into an index scan, but sadly this can't always be done.

Does it make sense to CLUSTER your data in some sense? That would improve the data distribution issue and would probably push the threshold for a seqscan up some.

Cheers,

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
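[A rough sketch of the three-way split Alban outlines, under the assumption that the wide payload columns hang off pkey_sid. All table names and types here are illustrative; the thread does not give the real schema, and whether the "Data" table should reference the main lookup or the variation lookup is left open in the suggestion.]

-- Sketch of the suggested split (assumed names and types; not the actual schema).
CREATE TABLE main_lookup (
    pkey_sid bigint PRIMARY KEY
);

CREATE TABLE variation_lookup (
    column1  bigint NOT NULL,
    column2  bigint NOT NULL,
    pkey_sid bigint NOT NULL REFERENCES main_lookup (pkey_sid),
    PRIMARY KEY (column1, column2)
);

-- The "couple hundred" payload columns would live here, keyed by pkey_sid.
CREATE TABLE data (
    pkey_sid bigint NOT NULL REFERENCES main_lookup (pkey_sid),
    column54 bigint,
    column44 bigint,
    column67 bigint
    -- ... remaining payload columns
);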
On 09/14/2011 01:10 AM, Alban Hertroys wrote:
> On 13 Sep 2011, at 23:44, Brian Fehrle wrote:
>
>> These queries basically do a 'select max(primary_key_column) from table group by column1, column2.' Because of the group by, we would result in a sequential scan of the entire table which proves to be costly.
> That seems to suggest a row where the primary key that has the max value is "special" in some way. Making them more easily distinguishable from "normal" rows seems like a good idea here.
>
>> Since the table has a ton of columns, I set up a smaller table that will house a copy of some of the data that the query uses, the Primary Key column, and the two columns I do my 'group by' on.
> That's one way to distinguish these special rows from the rest. You could also mark them as special using an extra column and/or create an expression-based index over just those rows.
>
> However, especially with the below section in mind, it would appear your data could be normalised a bit more (your splitting off that shadow table is a step in doing so, in fact).
>
> I'm also wondering, does your primary key have actual meaning? It would appear to just indicate the order in which the records were created (I'm assuming it's a serial-type surrogate PK, and not a natural one).
>
It isn't a serial type, and the id increment is handled by the application.

>> This shadow table will also only contain one row for every column1 and column2 combination (due to the group by), and for those rows, will have the max of the primary key. Even with this, the 'shadow' table will have about 14 million rows, compared to the 15 million in the main table.
> Don't (column1, column2) make up a key then? I get the feeling you should split your table in 3 sections:
> Table 1: main lookup (PK: pkey_sid)
> Table 2: Variation lookup (PK: (column1, column2), FK: pkey_sid)
> Table 3: Data (FK: the above)
>
(column1, column2) could possibly have multiple occurrences of the combination, such as 4 rows where column1 = 54 and column2 = 86. In those cases with multiple rows, I just want the one with the max(primary_key). I'm looking into options like this, but at the moment changing the base table structure is out of the question; adding tables along the side to try to speed things up is OK. I'm trying not to cause changes in the application.

>> So the issue here comes in retrieving the needed data from my main table. The resulting rows is estimated to be 409,600, and the retrieving of the primary key's that are associated with those rows is actually really easy. However, when we take those 409,600 rows back to the main table to retrieve the other columns I need, the planner is just doing a sequential scan as it's most likely going to be faster than hitting the index then retrieving the columns I need for all 400K+ rows.
> Is that estimate accurate? If not, see Ondrej's suggestions.
>
> That is only about 1/30th of your table. I don't think a seqscan makes sense here unless your data is distributed badly.
>
Yeah, the more I look at it, the more I think it's postgres _thinking_ that it's faster to do a sequential scan. I'll be playing with the random_page_cost that Ondrej suggested, and schedule a time where I can do some explain analyzes (production server and all).

>> Things to note:
>> 1. If I reduce my where clause's range, then the sequential scan turns into an index scan, but sadly this can't always be done.
> Does it make sense to CLUSTER your data in some sense?
> That would improve the data distribution issue and would probably push the threshold for a seqscan up some.
>
> Cheers,
>
> Alban Hertroys
>
Thanks, I'll be reporting back in with my next findings.

- Brian F

> --
> If you can't see the forest for the trees,
> cut the trees and you'll see there is no forest.
>
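[As a side note on Brian's point about only wanting the max(primary key) row per (column1, column2) combination: one way to express that directly in PostgreSQL is DISTINCT ON. A sketch against the main table, using the column names from the thread; this is only an alternative formulation of the subselect in the original post, not necessarily a faster plan, since it still reads the whole table.]

-- Sketch: fetch the payload columns of the max-pkey_sid row for each
-- (column1, column2) combination in a single pass over the main table.
-- DISTINCT ON is PostgreSQL-specific; the ORDER BY decides which row survives.
SELECT DISTINCT ON (column1, column2)
       pkey_sid, column54, column44, column67
FROM public.mytable
ORDER BY column1, column2, pkey_sid DESC;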
On 14 Sep 2011, at 20:45, Brian Fehrle wrote:

>> That is only about 1/30th of your table. I don't think a seqscan makes sense here unless your data is distributed badly.
>>
> Yeah, the more I look at it, the more I think it's postgres _thinking_ that it's faster to do a sequential scan. I'll be playing with the random_page_cost that Ondrej suggested, and schedule a time where I can do some explain analyzes (production server and all).

Before you do that, turn off seqscans (there's a session option for that) and see if index scans are actually faster.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
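[The session option Alban refers to is presumably enable_seqscan. A quick, non-persistent way to run that test, reusing the query from the original post:]

-- Session-local test: discourage sequential scans, then compare timings.
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT T2.pkey_sid, T2.column54, T2.column44, T2.column67
FROM public.mytable AS T2
JOIN public.mytable_shadow AS T3 ON (T2.pkey_sid = T3.pkey_sid)
WHERE T3.column1 >= 1072310434 AND T3.column1 <= 1074124834;

RESET enable_seqscan;   -- restore the default for the rest of the session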