Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?
Дата
Msg-id 7570.1288151284@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?
Список pgsql-performance
Robert Haas <robertmhaas@gmail.com> writes:
> I'm also a bit suspicious of the fact that the hash condition has a
> cast to text on both sides, which implies, to me anyway, that the
> underlying data types are not text.  That might mean that the query
> planner doesn't have very good statistics, which might mean that the
> join selectivity estimates are wackadoo, which can apparently cause
> this problem:

Um ... you're guilty of the same thing as the OP, ie not showing how
you got this example.  But I'm guessing that it was something like

create table little as select * from generate_series(1,10) a;
create table big as select * from generate_series(1,100000) a;
... wait for auto-analyze of big ...
explain select * from little, big where little.a = big.a;

Here, big is large enough to prod autovacuum into analyzing it,
whereas little isn't.  So when the planner runs, it sees

(1) big is known to have 100000 rows, and big.a is known unique;
(2) little is estimated to have many fewer rows, but nothing is
    known about the distribution of little.a.

In this situation, it's going to prefer to hash big, because hash join
behaves pretty nicely when the inner rel is uniformly distributed and
the outer not, but not nicely at all when it's the other way round.
It'll change its mind as soon as you analyze little, but it doesn't
like taking a chance on an unknown distribution.  See cost_hashjoin
and particularly estimate_hash_bucketsize.

I'm not convinced this explains Scott's results though --- the numbers
he's showing don't seem to add up even if you assume a pretty bad
distribution for the smaller rel.

            regards, tom lane

В списке pgsql-performance по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: odd postgresql performance (excessive lseek)
Следующее
От: Tom Lane
Дата:
Сообщение: Re: odd postgresql performance (excessive lseek)