Thread: need to repeat the same condition on joined tables in order to choose the proper plan


need to repeat the same condition on joined tables in order to choose the proper plan

From:
Svetlin Manavski
Date:
Hi performance gurus,

One of the reasons I prefer PostgreSQL is that it does not implement hints. However, I have run into a situation where I seem forced to use a hint-like construct:
I have 2 tables in PostgreSQL 9.0:

tcpsessions - about 4M rows in the lab, hundreds of millions of rows in production
PRIMARY KEY (detectorid, id)

tcpsessiondata - about 2-5 times bigger than tcpsessions
FOREIGN KEY (detectorid, sessionid) REFERENCES tcpsessions (detectorid, id)
There is an index on (detectorid, sessionid)

For completeness: tcpsessiondata is actually partitioned following the official documentation, but I will spare you the details unless they are needed. For the purposes of this message, all the data is in one child table: tcpsessiondata_default.
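A minimal sketch of the schema as described (column types are inferred from the ::INT8 casts and the query's column list, not taken from the actual DDL, which is not shown here):

```sql
-- Hypothetical minimal DDL matching the description above;
-- exact types and any additional columns are assumptions.
CREATE TABLE tcpsessions (
    detectorid  integer NOT NULL,
    id          bigint  NOT NULL,
    PRIMARY KEY (detectorid, id)
);

CREATE TABLE tcpsessiondata (
    detectorid             integer NOT NULL,
    sessionid              bigint  NOT NULL,
    createdtime            bigint,
    datafromsource         bigint,
    datafromdestination    bigint,
    packetsfromsource      bigint,
    packetsfromdestination bigint,
    FOREIGN KEY (detectorid, sessionid)
        REFERENCES tcpsessions (detectorid, id)
);

-- The index the bitmap scans in the plans below rely on:
CREATE INDEX idx_tcpsessiondata_detectoridandsessionid
    ON tcpsessiondata (detectorid, sessionid);
```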

When I run the following simple query:

select
    (createdtime / 60000000000) as timegroup,
    (sum(datafromsource)+sum(datafromdestination)) as numbytes,
    (sum(packetsfromsource)+sum(packetsfromdestination)) as numpackets
from
    tcpsessiondata SD, tcpsessions SS
where
    SD.detectorid = SS.detectorid
    and SD.sessionid = SS.id
    and SD.detectorid = 1
    and SD.sessionid >= 1001000000000::INT8 and SD.sessionid <= 2001000000000::INT8
group by
    timegroup
order by
    timegroup asc

I get the following plan:
"Sort  (cost=259126.13..259126.63 rows=200 width=32) (actual time=32526.762..32526.781 rows=20 loops=1)"
"  Output: ((sd.createdtime / 60000000000::bigint)), (((sum(sd.datafromsource) + sum(sd.datafromdestination)) / 1048576::numeric)), ((sum(sd.packetsfromsource) + sum(sd.packetsfromdestination)))"
"  Sort Key: ((sd.createdtime / 60000000000::bigint))"
"  Sort Method:  quicksort  Memory: 26kB"
"  ->  HashAggregate  (cost=259112.49..259118.49 rows=200 width=32) (actual time=32526.657..32526.700 rows=20 loops=1)"
"        Output: ((sd.createdtime / 60000000000::bigint)), ((sum(sd.datafromsource) + sum(sd.datafromdestination)) / 1048576::numeric), (sum(sd.packetsfromsource) + sum(sd.packetsfromdestination))"
"        ->  Hash Join  (cost=126553.43..252603.29 rows=520736 width=32) (actual time=22400.430..31291.838 rows=570100 loops=1)"
"              Output: sd.createdtime, sd.datafromsource, sd.datafromdestination, sd.packetsfromsource, sd.packetsfromdestination, (sd.createdtime / 60000000000::bigint)"
"              Hash Cond: (sd.sessionid = ss.id)"
"              ->  Append  (cost=0.00..100246.89 rows=520736 width=42) (actual time=2382.160..6226.906 rows=570100 loops=1)"
"                    ->  Seq Scan on appqosdata.tcpsessiondata sd  (cost=0.00..18.65 rows=1 width=42) (actual time=0.002..0.002 rows=0 loops=1)"
"                          Output: sd.createdtime, sd.datafromsource, sd.datafromdestination, sd.packetsfromsource, sd.packetsfromdestination, sd.detectorid, sd.sessionid"
"                          Filter: ((sd.sessionid >= 1001000000000::bigint) AND (sd.sessionid <= 2001000000000::bigint) AND (sd.detectorid = 1))"
"                    ->  Bitmap Heap Scan on appqosdata.tcpsessiondata_default sd  (cost=11001.37..100228.24 rows=520735 width=42) (actual time=2382.154..5278.319 rows=570100 loops=1)"
"                          Output: sd.createdtime, sd.datafromsource, sd.datafromdestination, sd.packetsfromsource, sd.packetsfromdestination, sd.detectorid, sd.sessionid"
"                          Recheck Cond: ((sd.detectorid = 1) AND (sd.sessionid >= 1001000000000::bigint) AND (sd.sessionid <= 2001000000000::bigint))"
"                          ->  Bitmap Index Scan on idx_tcpsessiondata_default_detectoridandsessionid  (cost=0.00..10871.19 rows=520735 width=0) (actual time=2351.865..2351.865 rows=574663 loops=1)"
"                                Index Cond: ((sd.detectorid = 1) AND (sd.sessionid >= 1001000000000::bigint) AND (sd.sessionid <= 2001000000000::bigint))"
"              ->  Hash  (cost=72340.48..72340.48 rows=3628356 width=10) (actual time=19878.891..19878.891 rows=3632586 loops=1)"
"                    Output: ss.detectorid, ss.id"
"                    Buckets: 8192  Batches: 64  Memory Usage: 2687kB"
"                    ->  Seq Scan on appqosdata.tcpsessions ss  (cost=0.00..72340.48 rows=3628356 width=10) (actual time=627.164..14586.202 rows=3632586 loops=1)"
"                          Output: ss.detectorid, ss.id"
"                          Filter: (ss.detectorid = 1)"
"Total runtime: 32543.224 ms"


As we can see, the planner chooses an index scan on tcpsessiondata_default (as expected) but a seq scan on tcpsessions. However, if I add the following ugly conditions to my query:
    and SS.detectorid = 1
    and SS.id >= 1001000000000::INT8 and SS.id <= 2001000000000::INT8
so that the full query now becomes:

select
    (createdtime / 60000000000) as timegroup,
    (sum(datafromsource)+sum(datafromdestination)) as numbytes,
    (sum(packetsfromsource)+sum(packetsfromdestination)) as numpackets
from
    tcpsessiondata SD, tcpsessions SS
where
    SD.detectorid = SS.detectorid
    and SD.sessionid = SS.id
    and SD.detectorid = 1
    and SD.sessionid >= 1001000000000::INT8 and SD.sessionid <= 2001000000000::INT8
    and SS.detectorid = 1
    and SS.id >= 1001000000000::INT8 and SS.id <= 2001000000000::INT8
group by
    timegroup
order by
    timegroup asc

well, now I get an index scan on tcpsessions as well, and the running time is about a third of the previous one:

"Sort  (cost=157312.59..157313.09 rows=200 width=32) (actual time=9682.748..9682.764 rows=20 loops=1)"
"  Output: ((sd.createdtime / 60000000000::bigint)), (((sum(sd.datafromsource) + sum(sd.datafromdestination)) / 1048576::numeric)), ((sum(sd.packetsfromsource) + sum(sd.packetsfromdestination)))"
"  Sort Key: ((sd.createdtime / 60000000000::bigint))"
"  Sort Method:  quicksort  Memory: 26kB"
"  ->  HashAggregate  (cost=157298.94..157304.94 rows=200 width=32) (actual time=9682.649..9682.692 rows=20 loops=1)"
"        Output: ((sd.createdtime / 60000000000::bigint)), ((sum(sd.datafromsource) + sum(sd.datafromdestination)) / 1048576::numeric), (sum(sd.packetsfromsource) + sum(sd.packetsfromdestination))"
"        ->  Hash Join  (cost=32934.67..150744.28 rows=524373 width=32) (actual time=3695.016..8370.629 rows=570100 loops=1)"
"              Output: sd.createdtime, sd.datafromsource, sd.datafromdestination, sd.packetsfromsource, sd.packetsfromdestination, (sd.createdtime / 60000000000::bigint)"
"              Hash Cond: (sd.sessionid = ss.id)"
"              ->  Append  (cost=0.00..100948.71 rows=524373 width=42) (actual time=2318.568..4799.985 rows=570100 loops=1)"
"                    ->  Seq Scan on appqosdata.tcpsessiondata sd  (cost=0.00..18.65 rows=1 width=42) (actual time=0.001..0.001 rows=0 loops=1)"
"                          Output: sd.createdtime, sd.datafromsource, sd.datafromdestination, sd.packetsfromsource, sd.packetsfromdestination, sd.detectorid, sd.sessionid"
"                          Filter: ((sd.sessionid >= 1001000000000::bigint) AND (sd.sessionid <= 2001000000000::bigint) AND (sd.detectorid = 1))"
"                    ->  Bitmap Heap Scan on appqosdata.tcpsessiondata_default sd  (cost=11080.05..100930.06 rows=524372 width=42) (actual time=2318.563..3789.844 rows=570100 loops=1)"
"                          Output: sd.createdtime, sd.datafromsource, sd.datafromdestination, sd.packetsfromsource, sd.packetsfromdestination, sd.detectorid, sd.sessionid"
"                          Recheck Cond: ((sd.detectorid = 1) AND (sd.sessionid >= 1001000000000::bigint) AND (sd.sessionid <= 2001000000000::bigint))"
"                          ->  Bitmap Index Scan on idx_tcpsessiondata_default_detectoridandsessionid  (cost=0.00..10948.96 rows=524372 width=0) (actual time=2305.322..2305.322 rows=574663 loops=1)"
"                                Index Cond: ((sd.detectorid = 1) AND (sd.sessionid >= 1001000000000::bigint) AND (sd.sessionid <= 2001000000000::bigint))"
"              ->  Hash  (cost=30159.60..30159.60 rows=185726 width=10) (actual time=1345.307..1345.307 rows=194971 loops=1)"
"                    Output: ss.detectorid, ss.id"
"                    Buckets: 8192  Batches: 4  Memory Usage: 2297kB"
"                    ->  Bitmap Heap Scan on appqosdata.tcpsessions ss  (cost=3407.46..30159.60 rows=185726 width=10) (actual time=483.572..1069.292 rows=194971 loops=1)"
"                          Output: ss.detectorid, ss.id"
"                          Recheck Cond: ((ss.id >= 1001000000000::bigint) AND (ss.id <= 2001000000000::bigint))"
"                          Filter: (ss.detectorid = 1)"
"                          ->  Bitmap Index Scan on idx_tcpsessions_id  (cost=0.00..3361.02 rows=201751 width=0) (actual time=451.242..451.242 rows=219103 loops=1)"
"                                Index Cond: ((ss.id >= 1001000000000::bigint) AND (ss.id <= 2001000000000::bigint))"
"Total runtime: 9682.905 ms"


Let me also add that if I remove the conditions on SD but keep the ones on SS, I get an index scan on tcpsessions BUT a seq scan on tcpsessiondata.

Let's now suppose that the index scan on both tables is the best choice, as the planner itself selects it in one of the three cases. (It is also the fastest plan, as we extract only 200,000 rows out of 4 million in this example.) But I am really surprised to see that the planner needs me to explicitly specify the same condition twice like this:

    SD.detectorid = SS.detectorid
    and SD.sessionid = SS.id
    and SD.detectorid = 1
    and SD.sessionid >= 1001000000000::INT8 and SD.sessionid <= 2001000000000::INT8
    and SS.detectorid = 1
    and SS.id >= 1001000000000::INT8 and SS.id <= 2001000000000::INT8

in order to use the primary key on SS, even though it is absolutely clear that "SD.detectorid = SS.detectorid and SD.sessionid = SS.id". I hope you agree that repeating the same condition on SS looks very much like giving a hint to use the index there. But I feel very uncomfortable using such an ugly condition, especially knowing that I am doing it to "force an index". On the other hand, I am terrified that in production we may end up with a seq scan on hundreds of millions of rows just to extract 200,000.

Would you please explain this behavior, and how would you suggest I proceed?

Thanks for any comments,
Svetlin Manavski



Re: need to repeat the same condition on joined tables in order to choose the proper plan

From:
Tom Lane
Date:

Svetlin Manavski <svetlin.manavski@gmail.com> writes:
> I am really surprised to see that the planner needs me to explicitly specify
> the same condition twice like this:

>     SD.detectorid = SS.detectorid
>     and SD.sessionid = SS.id
>     and SD.detectorid = 1
>     and SD.sessionid >= 1001000000000::INT8 and SD.sessionid <=
> 2001000000000::INT8
>     and SS.detectorid = 1
>     and SS.id >= 1001000000000::INT8 and SS.id <= 2001000000000::INT8

The planner does infer implied equalities, eg, given A = B and B = C
it will figure out that A = C.  What you are asking is for it to derive
inequalities, eg infer A < C from A = B and B < C.  That would be
considerably more work for considerably less reward, since the sort of
situation where this is helpful doesn't come up very often.  On balance
I don't believe it's a good thing for us to do: I think it would make
PG slower on average because on most queries it would just waste time
looking for this sort of situation.

(In this example, the SS.detectorid = 1 clause is in fact unnecessary,
since the planner will infer it from SD.detectorid = SS.detectorid and
SD.detectorid = 1.  But it won't infer the range conditions on SS.id
from the range conditions on SD.sessionid or vice versa.)
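Concretely, the distinction Tom describes can be sketched as follows (illustrative SQL, not taken from the thread's example; the comments spell out which clauses the planner derives on its own):

```sql
-- Given these clauses:
--   SD.detectorid = SS.detectorid  AND  SD.detectorid = 1
-- the planner derives the implied equality by itself:
--   SS.detectorid = 1
--
-- But given:
--   SD.sessionid = SS.id  AND  SD.sessionid >= 1001000000000
-- it does NOT derive:
--   SS.id >= 1001000000000
-- so the range condition must be written explicitly on both sides:
SELECT SS.id
FROM tcpsessiondata SD
JOIN tcpsessions   SS
  ON SD.detectorid = SS.detectorid AND SD.sessionid = SS.id
WHERE SD.detectorid = 1                      -- SS.detectorid = 1 is implied
  AND SD.sessionid BETWEEN 1001000000000 AND 2001000000000
  AND SS.id        BETWEEN 1001000000000 AND 2001000000000;  -- must be repeated by hand
```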

            regards, tom lane

Re: need to repeat the same condition on joined tables in order to choose the proper plan

From:
Robert Klemme
Date:
On 14.06.2011 18:29, Tom Lane wrote:
> Svetlin Manavski<svetlin.manavski@gmail.com>  writes:
>> I am really surprised to see that the planner needs me to explicitly specify
>> the same condition twice like this:
>
>>      SD.detectorid = SS.detectorid
>>      and SD.sessionid = SS.id
>>      and SD.detectorid = 1
>>      and SD.sessionid>= 1001000000000::INT8 and SD.sessionid<=
>> 2001000000000::INT8
>>      and SS.detectorid = 1
>>      and SS.id>= 1001000000000::INT8 and SS.id<= 2001000000000::INT8
>
> The planner does infer implied equalities, eg, given A = B and B = C
> it will figure out that A = C.  What you are asking is for it to derive
> inequalities, eg infer A<  C from A = B and B<  C.  That would be
> considerably more work for considerably less reward, since the sort of
> situation where this is helpful doesn't come up very often.  On balance
> I don't believe it's a good thing for us to do: I think it would make
> PG slower on average because on most queries it would just waste time
> looking for this sort of situation.
>
> (In this example, the SS.detectorid = 1 clause is in fact unnecessary,
> since the planner will infer it from SD.detectorid = SS.detectorid and
> SD.detectorid = 1.  But it won't infer the range conditions on SS.id
> from the range conditions on SD.sessionid or vice versa.)

Is that the same for IN?  Would it help in this particular case to use
SS.id IN (SELECT ... WHERE ... > ... AND ... < ...), or a CTE?
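For illustration, the two variants being asked about might look like this (a sketch only; whether the planner then scans SS by index is exactly the open question here):

```sql
-- Variant with IN and an explicit range on SS.id:
SELECT SD.createdtime
FROM tcpsessiondata SD
WHERE SD.detectorid = 1
  AND SD.sessionid IN (SELECT id FROM tcpsessions
                       WHERE detectorid = 1
                         AND id BETWEEN 1001000000000 AND 2001000000000);

-- Variant with a CTE. Note that in 9.0 a CTE is an optimization
-- fence: it is always materialized, and conditions are not pushed
-- into or out of it.
WITH s AS (
    SELECT detectorid, id
    FROM tcpsessions
    WHERE detectorid = 1
      AND id BETWEEN 1001000000000 AND 2001000000000
)
SELECT SD.createdtime
FROM tcpsessiondata SD
JOIN s ON SD.detectorid = s.detectorid AND SD.sessionid = s.id
WHERE SD.sessionid BETWEEN 1001000000000 AND 2001000000000;
```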

Kind regards

    robert

Re: need to repeat the same condition on joined tables in order to choose the proper plan

From:
Svetlin Manavski
Date:
Thanks Tom, this explains the behavior. But is there a more elegant way to achieve the only acceptable plan (index scan on both tables) without that ugly syntax? It really looks like syntax tailored to help the current PostgreSQL planner make the right decision. (I am aware of the radical solutions which impact the rest of the connection or the entire DB.)

As for the generic case, I do understand that deriving inequalities may be inefficient. I just want to point out that this is the case of joining and filtering on a field which is the foreign key in one table and the primary key in the other. That should be massively common in every non-trivial DB application. Maybe it does make sense to handle that specific case in the planner, doesn't it?

Thank you,
Svetlin Manavski



On Tue, Jun 14, 2011 at 5:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Svetlin Manavski <svetlin.manavski@gmail.com> writes:
> I am really surprised to see that the planner needs me to explicitly specify
> the same condition twice like this:

>     SD.detectorid = SS.detectorid
>     and SD.sessionid = SS.id
>     and SD.detectorid = 1
>     and SD.sessionid >= 1001000000000::INT8 and SD.sessionid <=
> 2001000000000::INT8
>     and SS.detectorid = 1
>     and SS.id >= 1001000000000::INT8 and SS.id <= 2001000000000::INT8

The planner does infer implied equalities, eg, given A = B and B = C
it will figure out that A = C.  What you are asking is for it to derive
inequalities, eg infer A < C from A = B and B < C.  That would be
considerably more work for considerably less reward, since the sort of
situation where this is helpful doesn't come up very often.  On balance
I don't believe it's a good thing for us to do: I think it would make
PG slower on average because on most queries it would just waste time
looking for this sort of situation.

(In this example, the SS.detectorid = 1 clause is in fact unnecessary,
since the planner will infer it from SD.detectorid = SS.detectorid and
SD.detectorid = 1.  But it won't infer the range conditions on SS.id
from the range conditions on SD.sessionid or vice versa.)

                       regards, tom lane