Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

Поиск
Список
Период
Сортировка
От Dmitry Astapov
Тема Re: BUG #17619: AllocSizeIsValid violation in parallel hash join
Дата
Msg-id CAFQUnFiiPUOWihzjH7gevyaGGnX3tskr_3YFtNhv-AzpzY-jrg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: BUG #17619: AllocSizeIsValid violation in parallel hash join  (Thomas Munro <thomas.munro@gmail.com>)
Ответы Re: BUG #17619: AllocSizeIsValid violation in parallel hash join  (Dmitry Astapov <dastapov@gmail.com>)
Список pgsql-bugs


On Thu, Sep 22, 2022 at 9:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Sep 22, 2022 at 7:46 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
> (gdb) p size
> $2 = 1702125924

Thanks for the detailed report.  Hmm.  That size, on a little-endian
system, is equivalent to the byte sequence "date\0\0\0\0", which looks
pretty suspiciously like the inside of a tuple, and not its size.  We
must have got out of sync somehow.

Great catch! I added some debugging output to the BufFileRead and sts_read_tuple and it looks to me that the sequence of events immediately before the problem happens is like this:

  1. sts_read_tuple is called multiple times to read from a particular file, and a bunch of tuples are read this way
  2. sts_read_tuple is called once more. In my experiments, I always had read_ntuples_available - 2 == read_ntuples at this point, but maybe this is just a coincidence.
    1.  inside this sts_read_tuple_call, BufFileRead is called to read metadata, and succeeds (and this chunk of the file looks -- to my naive eye -- exactly like any other tuple start)
    2. Second BufFileRead call happens to get the size, and reads the correct size (I see what's in the file, I could find corresponding values in the database tables, and it all lines up). 
    3. Third BufFileRead happens to read the actual tuple. So far so good, and sts_read_tuple finishes.
  3. At this point, the position in the file (buf_file -> currOffset + post) points right after the last tuple read, but what's there does not look like the start of the next tuple to me. It seems like there was another, longer tuple written there before (with quite long rhs.payload value) and somehow the tuple that was just read was written over it, so we have "<tuple read on step 2 above><current read position here><what looks like leftovers from another tuple><a bunch of zeroes><new tuple start>"
  4. So another sts_read_tuple call happens, and it reads 4 bytes of this garbage as metadata, and the next 4 bytes as length, and this is when everything derails...
I also double-checked at the logs of the (test) database where this issue was first spotted (and where a lot of queries similar to the one I am using to trigger this issue are happening) and found out that along with "invalid memory alloc request size" we also got a couple of occasions of "Unexpected chunk in shared tuplestore temporary file. Expected overflow chunk", which looks like another error that could be potentially explained by BufFileRead's inside sts_read_tuple reading "garbage".

I'm happy to build a patched version of 13.8 (or another revision if you want me to) to collect more debug information for you if this could help.

--
D. Astapov

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Re: BUG #17619: AllocSizeIsValid violation in parallel hash join
Следующее
От: PG Bug reporting form
Дата:
Сообщение: BUG #17620: JIT-Compiler causes Segmentation Fault