Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?

Поиск
Список
Период
Сортировка
От Scott Carey
Тема Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?
Дата
Msg-id E486BD28-B70F-4E5D-B4B1-ED84EA019B80@richrelevance.com
обсуждение исходный текст
Ответ на Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?
Список pgsql-performance
On Oct 26, 2010, at 8:48 PM, Tom Lane wrote:

> Robert Haas <robertmhaas@gmail.com> writes:
>> I'm also a bit suspicious of the fact that the hash condition has a
>> cast to text on both sides, which implies, to me anyway, that the
>> underlying data types are not text.  That might mean that the query
>> planner doesn't have very good statistics, which might mean that the
>> join selectivity estimates are wackadoo, which can apparently cause
>> this problem:
>
> Um ... you're guilty of the same thing as the OP, ie not showing how
> you got this example.  But I'm guessing that it was something like
>
> create table little as select * from generate_series(1,10) a;
> create table big as select * from generate_series(1,100000) a;
> ... wait for auto-analyze of big ...
> explain select * from little, big where little.a = big.a;
>
> Here, big is large enough to prod autovacuum into analyzing it,
> whereas little isn't.  So when the planner runs, it sees
>
> (1) big is known to have 100000 rows, and big.a is known unique;
> (2) little is estimated to have many fewer rows, but nothing is
>    known about the distribution of little.a.
>
> In this situation, it's going to prefer to hash big, because hash join
> behaves pretty nicely when the inner rel is uniformly distributed and
> the outer not, but not nicely at all when it's the other way round.
> It'll change its mind as soon as you analyze little, but it doesn't
> like taking a chance on an unknown distribution.  See cost_hashjoin
> and particularly estimate_hash_bucketsize.

The type of hash table will make a difference in how it behaves with skew.  Open addressing versus linking, etc.

>
> I'm not convinced this explains Scott's results though --- the numbers
> he's showing don't seem to add up even if you assume a pretty bad
> distribution for the smaller rel.

Answering both messages partially:

The join is on varchar columns.  So no they are cast ::text because its from two slightly different varchar
declarationsto ::text. 

The small relation in this case is unique on the key.  But I think postgres doesn't know that because there is a unique
indexon: 

(id, name) and the query filters for id = 12 in the example, leaving name unique.  But postgres does not identify this
asa unique condition for the key.  However, there are only about 150 distinct values of id, so even if 'name' collides
sometimesacross ids, there can be no more than 150 values that map to one key. 

I gave a partial plan, the parent joins are sometimes anti-joins.  The general query form is two from the temp table:
An update to a main table where the unique index keys match the temp (inner join)
An insert into the main table where the unique index keys do not exist (NOT EXISTS query, results in anti-join).  The
largerelation is the one being inserted/updated into the main table.   
Both have the same subplan that takes all the time in the middle -- a hash that hashes the large table and probes from
thesmall side.  I am still completely confused on how the hashjoin is calculating costs.  In one of my examples it
seemsto be way off.   


Why does hashjoin behave poorly when the inner relation is not uniformly distributed and the outer is?  In particular
foranti-join this should be the best case scenario. 

My assumption is this:  Imagine the worst possible skew on the small table.  Every value is in the same key and there
are20,000 entries in one list under one key.  The large table is in the outer relation, and probes for every element
andhas uniform distribution, 20 million -- one thousand rows for each of 20,000 keys.   
Inner Join:
  The values that match the key join agains the list.  There is no faster way to join once the list is identified.  It
isessentially nested loops per matching key.  1000 matches each output 20,000 rows.  If the hashjoin was reversed, then
theinner relation would be the large table and the outer relation would be the small table.  There would be 20,000
matchesthat each output 1000 rows. 
Semi-Join:
  The same thing happens, but additionally rows that match nothing are emitted.  If the relation that is always kept is
thelarge one, it makes more sense to hash the small. 
Anti-Join:
  Two relations, I'll call one "select" the other is "not exists".  The "select" relation is the one we are keeping,
butonly if its key does not match any key in the "not exists" relation. 
Case 1:  The large uniform table is "select" and the small one "not exists"
  It is always optimal to have the small one as the inner hashed relation, no matter what the skew.  If the inner
relationis the "not exists" table, only the key needs to be kept in the inner relation hash, the values or number of
matchesdo not matter, so skew is irrelevant and actually makes it faster than even distribution of keys. 
Case 2:  The small relation is the "select"
  Using the large "not exists" relation as the inner relation works well, as only the existence of a key needs to be
kept. Hashing the small "select" relation works if it is small enough.  Whenever a key matches from the outer relation,
removeit from the hash, then at the end take what remains in the hash as the result.  This is also generally immune to
skew.  

Am I missing something?  Is the Hash that is built storing each tuple in an open-addressed entry, and thus sensitive to
key-valueskew? or something like that?  If the hash only has one entry per key, with a linked list of values that match
thekey, I don't see how skew is a factor for hashjoin.  I am probably missing something. 


>
>             regards, tom lane


В списке pgsql-performance по дате отправления:

Предыдущее
От: Scott Marlowe
Дата:
Сообщение: Re: CPUs for new databases
Следующее
От: Greg Smith
Дата:
Сообщение: Re: AIX slow buffer reads