Обсуждение: Optimizer refuses to hash join

Поиск
Список
Период
Сортировка

Optimizer refuses to hash join

От
Stan Bielski
Дата:
Hello,

I having a great deal of difficulty getting postgres to do a hash join.
Even if I disable nestloop and mergejoin in postgres.conf, the optimizer
still refuses to select hash join. This behavior is killing my
performance.

Postgres version is 7.3.2 and relevant tables are vacuum analyzed.

Here's an overview of what I'm doing:

I have one table of network logs ordered by time values. The other table
is a set of hosts (approximately 60) that are infected by a worm. I want
to do this query on the dataset:

standb=# explain SELECT count (allflow_tv_sobig.tv_s) FROM
allflow_tv_sobig, blaster_set WHERE allflow_tv_sobig.src =
blaster_set.label AND allflow_tv_sobig.tv_s >= 1060101118::bigint and
allflow_tv_sobig.tv_s < 1060187518::bigint;
                                                       QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=185785.06..185785.06 rows=1 width=32)
   ->  Merge Join  (cost=174939.71..184986.38 rows=319472 width=32)
         Merge Cond: ("outer".label = "inner".src)
         ->  Index Scan using blaster_set_x on blaster_set
(cost=0.00..3.67 rows=66 width=12)
         ->  Sort  (cost=174939.71..178073.92 rows=1253684 width=20)
               Sort Key: allflow_tv_sobig.src
               ->  Index Scan using allflow_tv_sobig_x on allflow_tv_sobig
(cost=0.00..47955.63 rows=1253684 width=20)
                     Index Cond: ((tv_s >= 1060101118::bigint) AND (tv_s <
1060187518::bigint))
(8 rows)

Basically I just want to use the smaller table as a filtering mechanism so
that I only get resulted for hosts in that table. Rather than do the
sensible thing, which is scan the list of infected hosts, then scan the
traffic table and ignore entries that aren't in the first list, the
optimizer insists on SORTING the table of network traffic according to
source address. Considering that this table is very large, these queries
are taking forever.

Doing it in a nested loop, while it doesn't require sorting, still takes a
very long time as well.

Is there anyway that I can force the optimizer to do this the right way,
aside from adding each IP manually to a disgustingly bloated 'where'
clause?


Thanks,
-S




Re: Optimizer refuses to hash join

От
Stephan Szabo
Дата:
On Tue, 27 Jul 2004, Stan Bielski wrote:

> I having a great deal of difficulty getting postgres to do a hash join.
> Even if I disable nestloop and mergejoin in postgres.conf, the optimizer
> still refuses to select hash join. This behavior is killing my
> performance.
>
> Postgres version is 7.3.2 and relevant tables are vacuum analyzed.
>
> Here's an overview of what I'm doing:
>
> I have one table of network logs ordered by time values. The other table
> is a set of hosts (approximately 60) that are infected by a worm. I want
> to do this query on the dataset:
>
> standb=# explain SELECT count (allflow_tv_sobig.tv_s) FROM
> allflow_tv_sobig, blaster_set WHERE allflow_tv_sobig.src =
> blaster_set.label AND allflow_tv_sobig.tv_s >= 1060101118::bigint and
> allflow_tv_sobig.tv_s < 1060187518::bigint;

Can you send explain analyze results for the normal case and the nested
loop case?  It's generally more useful than plain explain.

I'd also wonder if blaster_set.label is unique such that you might be able
to write the condition as an exists clause and if that's better.  If you
were running 7.4, I'd suggest IN, but that'll certainly be painful in 7.3.

Re: Optimizer refuses to hash join

От
Tom Lane
Дата:
Stan Bielski <bielski@ece.cmu.edu> writes:
> I having a great deal of difficulty getting postgres to do a hash join.
> Even if I disable nestloop and mergejoin in postgres.conf, the optimizer
> still refuses to select hash join.

Are you sure the join condition is hashjoinable?  You didn't say
anything about the datatypes involved ...

If it is, the other possibility is that you need to increase sort_mem
to accommodate the hash table.

            regards, tom lane

Re: Optimizer refuses to hash join

От
Tom Lane
Дата:
Stan Bielski <bielski@ece.cmu.edu> writes:
> On Thu, 29 Jul 2004, Tom Lane wrote:
>> Are you sure the join condition is hashjoinable?  You didn't say
>> anything about the datatypes involved ...

> My apologies. The columns that I want to join are both type 'inet'.
> Shouldn't that be hashjoinable?

Depends on your PG version.  The raw type isn't hashjoinable, because
its '=' operator ignores the inet-vs-cidr flag.  Before 7.4 the operator
was (correctly) marked not hashjoinable.  In 7.4 it was (incorrectly)
marked hashjoinable, due no doubt to momentary brain fade on my part.
For 7.5 it is hashjoinable and the join will actually work, because we
added a specialized hash function that also ignores the inet-vs-cidr flag.

If you are joining data that is all inet or all cidr (no mixtures),
then 7.4 works okay, which is why we didn't notice the bug right away.
If that's good enough for now, you could emulate the 7.4 behavior in
earlier releases by setting the oprcanhash flag in pg_operator for the
inet equality operator.

            regards, tom lane