Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Melanie Plageman
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date:
Msg-id: CAAKRu_Z-3vRCTsgnL_iXy-Y5e9-GyFTPwYdqpN+=1g860zb8eA@mail.gmail.com
In response to: Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers

On Thu, Sep 5, 2019 at 10:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Seems like a good time for me to try to summarise what I think the
main problems are here:

1.  The match-bit storage problem already discussed.  The tuples that
each process receives while reading from SharedTupleStore are
non-deterministic (like other parallel scans).  To use a bitmap-based
approach, I guess we'd need to invent some way to give the tuples a
stable identifier within some kind of densely packed number space that
we could use to address the bitmap, or take the IO hit and write all
the tuples back.  That might involve changing the way SharedTupleStore
holds data.

I've dealt with this by adding a tuplenum to the SharedTupleStore
itself, which is atomically incremented in sts_puttuple().
In ExecParallelHashJoinPartitionOuter(), as each worker writes tuples
to the batch files with sts_puttuple(), the counter is incremented, so
each tuple gets a unique number.
To persist this number, I added the tuplenum to the metadata section
of the MinimalTuple (along with the hashvalue -- a comment there said
this metadata could be used for other things in the future, so it
seemed like a good place to put it) and write that out to the batch
file.
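
Roughly, the shape is as follows (a sketch only; OuterTupleMeta and
sts_next_tuplenum are names I'm inventing here for illustration, not what
the actual patch calls them):

/*
 * Per-tuple metadata written in front of each MinimalTuple in the batch
 * file.  PHJ already stores the hash value there; the tuple number simply
 * rides along with it.
 */
typedef struct OuterTupleMeta
{
    uint32      hashvalue;
    uint64      tuplenum;       /* dense, 0-based id within this batch */
} OuterTupleMeta;

/*
 * Counter living in the SharedTuplestore control struct; each writer claims
 * the next number atomically from inside sts_puttuple().
 */
static inline uint64
sts_next_tuplenum(pg_atomic_uint64 *ntuples)
{
    return pg_atomic_fetch_add_u64(ntuples, 1);
}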

At the end of ExecParallelHashJoinPartitionOuter(), I make the outer
match status bitmap file, using the final tuplenum count to determine
the number of bytes to write to it. Each worker gets its own file
containing a bitmap with enough bytes to represent every tuple in that
batch.

Because one worker may beat the other(s) and build the whole batch
file for a batch before the others have a chance, I also create the
outer match status bitmap file for workers who missed out, in
ExecParallelHashJoinOuterGetTuple(), again sized from the final tuplenum.
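
For illustration, the sizing and zero-filling amount to something like this
(a sketch; sts->ntuples stands in for the counter described above, and I'm
glossing over exactly where and how the file is created):

uint64      ntuples = pg_atomic_read_u64(&sts->ntuples);
size_t      nbytes = (ntuples + 7) / 8;     /* one bit per outer tuple */
BufFile    *bitmap = BufFileCreateTemp(false);
char        zeros[BLCKSZ] = {0};

/* zero-fill so every tuple starts out "unmatched" */
while (nbytes > 0)
{
    size_t      chunk = Min(nbytes, sizeof(zeros));

    BufFileWrite(bitmap, zeros, chunk);
    nbytes -= chunk;
}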
 

2.  Tricky problems relating to barriers and flow control.  First, let
me explain why PHJ doesn't support full/right outer joins yet.  At
first I thought it was going to be easy, because, although the shared
memory hash table is read-only after it has been built, it seems safe
to weaken that only slightly and let the match flag be set by any
process during probing: it's OK if two processes clobber each other's
writes, as the only transition is a single bit going strictly from 0
to 1, and there will certainly be a full memory barrier before anyone
tries to read those match bits.  Then during the scan for unmatched,
you just have to somehow dole out hash table buckets or ranges of
buckets to processes on a first-come-first-served basis.  But.... then
I crashed into the following problem:

* You can't begin the scan for unmatched tuples until every process
has finished probing (ie until you have the final set of match bits).
* You can't wait for every process to finish probing, because any
process that has emitted a tuple might never come back if there is
another node that is also waiting for all processes (ie deadlock
against another PHJ doing the same thing), and probing is a phase that
emits tuples.

Generally, it's not safe to emit tuples while you are attached to a
Barrier, unless you're only going to detach from it, not wait at it,
because emitting tuples lets the program counter escape your control.
Generally, it's not safe to detach from a Barrier while accessing
resources whose lifetime it controls, such as a hash table, because
then it might go away underneath you.

The PHJ plans that are supported currently adhere to that programming
rule and so don't have a problem: after the Barrier reaches the
probing phase, processes never wait for each other again so they're
free to begin emitting tuples.  They just detach when they're done
probing, and the last to detach cleans up (frees the hash table etc).
If there is more than one batch, they detach from one batch and attach
to another when they're ready (each batch has its own Barrier), so we
can consider the batches to be entirely independent.

There is probably a way to make a scan-for-unmatched-inner phase work,
possibly involving another Barrier or something like that, but I ran
out of time trying to figure it out and wanted to ship a working PHJ
for the more common plan types.  I suppose PHLJ will face two variants
of this problem: (1) you need to synchronise the loops (you can't dump
the hash table in preparation for the next loop until all have
finished probing for the current loop), and yet you've already emitted
tuples, so you're not allowed to wait for other processes and they're
not allowed to wait for you, and (2) you can't start the
scan-for-unmatched-outer until all the probe loops belonging to one
batch are done.  The first problem is sort of analogous to a problem I
faced with batches in the first place, which Robert and I found a
solution to by processing the batches in parallel, and could perhaps
be solved in the same way: run the loops in parallel (if that sounds
crazy, recall that every worker has its own quota of work_mem and the
data is entirely prepartitioned up front, which is why we are able to
run the batches in parallel; in contrast, single-batch mode makes a
hash table with a quota of nparticipants * work_mem).  The second
problem is sort of analogous to the existing scan-for-unmatched-inner
problem that I haven't solved.
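
(To restate the programming rule above in code form: once probing has
begun, a worker only ever detaches, it never waits. This is a simplified
sketch, not the actual nodeHashjoin.c logic, and free_batch_resources() is
a placeholder name.)

/*
 * BarrierArriveAndDetach() returns true for the last participant to leave,
 * which is therefore the only one that can still safely touch, and free,
 * the shared hash table for this batch.
 */
if (BarrierArriveAndDetach(&batch_barrier))
    free_batch_resources();     /* placeholder for the real cleanup */
/* either way, the hash table must not be referenced after this point */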


I "solved" these problems for now by having all workers except for one
detach from the outer batch file after finishing probing. The last
worker to arrive does not detach from the batch and instead iterates
through all of the workers' outer match status files (one per
participant, tracked in the shared-memory SharedTuplestoreParticipant)
and creates a single unified bitmap. All the other workers continue to
wait at the barrier until the sole remaining worker has finished
iterating through the outer match status bitmap files.
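
To make the combining step concrete, it amounts to OR-ing the
per-participant bitmap files byte by byte into a single file. A rough
sketch (worker_bitmaps, combined, bitmap_nbytes, and nparticipants are
invented names, and the real code has to position the files and handle
errors more carefully):

unsigned char srcbuf[8192];
unsigned char merged[8192];
size_t      off;

for (off = 0; off < bitmap_nbytes; off += sizeof(merged))
{
    size_t      len = Min(sizeof(merged), bitmap_nbytes - off);
    int         p;
    size_t      i;

    memset(merged, 0, len);
    for (p = 0; p < nparticipants; p++)
    {
        /* each participant's bitmap file is read sequentially */
        if (BufFileRead(worker_bitmaps[p], srcbuf, len) != len)
            elog(ERROR, "could not read outer match status bitmap");
        for (i = 0; i < len; i++)
            merged[i] |= srcbuf[i];
    }
    BufFileWrite(combined, merged, len);
}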

Admittedly, I'm still fighting with this step a bit, but my intent is
to have all the backends wait until the lone remaining worker has
created the unified bitmap. Then that worker, which is still attached
to the outer batch, will scan the outer batch file against the unified
outer match status bitmap and emit unmatched tuples.

I thought that the other workers could move on and stop waiting at the
barrier once the lone remaining worker has scanned their outer match
status files. All the probe loops would be done by then, and the worker
that is emitting tuples would not be referencing the inner-side hash
table at all, only the outer batch file and the combined bitmap.
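
The final scan would then look roughly like this (a sketch only:
OuterTupleMeta is the illustrative struct from earlier, the combined bitmap
is treated as an in-memory byte array for brevity, and
ExecParallelHashJoinEmitUnmatched() is a placeholder for however the
unmatched tuple actually gets projected with nulls on the inner side):

MinimalTuple    tuple;
OuterTupleMeta  meta;

sts_begin_parallel_scan(outer_tuples);
while ((tuple = sts_parallel_scan_next(outer_tuples, &meta)) != NULL)
{
    uint64      byteno = meta.tuplenum / 8;
    int         bitno = meta.tuplenum % 8;

    /* bit still zero => no inner tuple ever matched this outer tuple */
    if ((combined_bitmap[byteno] & (1 << bitno)) == 0)
        ExecParallelHashJoinEmitUnmatched(tuple);
}
sts_end_parallel_scan(outer_tuples);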

--
Melanie Plageman
