Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

Поиск
Список
Период
Сортировка
От Peter Geoghegan
Тема Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Дата
Msg-id CAH2-WzkepmVTzF4EDJ7T-5urhtpZgL_ZYAV9=SsyKi2YGd_mTw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Список pgsql-hackers
On Mon, Feb 5, 2018 at 9:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> # LogicalTapeFreeze() may write out its first block when it is dirty but
> # not full, and then immediately read the first block back in from its
> # BufFile as a BLCKSZ-width block.  This can only occur in rare cases
> # where next to no tuples were written out, which is only possible with
> # parallel external tuplesorts.
>
> So, if I understand correctly what you're saying here, valgrind is
> totally cool with us writing out an only-partially-initialized block
> to a disk file, but it's got a real problem with us reading that data
> back into the same memory space it already occupies.

That's not quite it. Valgrind is cool with a BufFileWrite(), which
doesn't result in an actual write() because the buffile.c stdio-style
buffer (which isn't where the uninitialized bytes originate from)
isn't yet filled. The actual write() comes later, and that's the point
that Valgrind complains. IOW, Valgrind is cool with copying around
uninitialized memory before we do anything with the underlying values
(e.g., write(), something that affects control flow).

> I presume that it's common for the tail of the final block
> written to be uninitialized, but normally when we then go read block
> 0, that's some other, fully initialized block.

It certainly is common. In the case of logtape.c, we almost always
write out some garbage bytes, even with serial sorts. The only
difference here is the *sense* in which they're garbage: they're
uninitialized bytes, which Valgrind cares about, rather than byte from
previous writes that are left behind in the buffer, which Valgrind
does not care about.

>> It might seem like my suppression is overly broad, or not broad
>> enough, since it essentially targets LogicalTapeFreeze(). I don't
>> think it is, though, because this can occur in two places within
>> LogicalTapeFreeze() -- it can occur in the place we actually saw the
>> issue on lousyjack, from the ltsReadBlock() call within
>> LogicalTapeFreeze(), as well as a second place -- when
>> BufFileExportShared() is called. I found that you have to tweak code
>> to prevent it happening in the first place before you'll see it happen
>> in the second place.
>
> I don't quite see how that would happen, because BufFileExportShared,
> at least AFAICS, doesn't touch the buffer?

It doesn't have to -- at least not directly. Valgrind remembers that
the uninitialized memory from logtape.c buffers are poisoned -- it
"spreads". The knowledge that the bytes are poisoned is tracked as
they're copied around. You get the error on the write() from the
BufFile buffer, despite the fact that you can make the error go away
by using palloc0() instead of palloc() within logtape.c, and nowhere
else.

-- 
Peter Geoghegan


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Claudio Freire
Дата:
Сообщение: Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently
Следующее
От: Tomas Vondra
Дата:
Сообщение: Re: WIP: BRIN multi-range indexes