Re: BUG #15225: [XX000] ERROR: invalid DSA memory alloc request size 1073741824 / Where: parallel worker

From: Thomas Munro
Subject: Re: BUG #15225: [XX000] ERROR: invalid DSA memory alloc request size 1073741824 / Where: parallel worker
Msg-id: CAEepm=1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A@mail.gmail.com
In reply to: Re: BUG #15225: [XX000] ERROR: invalid DSA memory alloc request size 1073741824 / Where: parallel worker  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-bugs

On Sun, Jun 10, 2018 at 3:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> Here is a tidier version that I'd like to commit before beta2, if
>> there are no objections.  I've added this to the PostgreSQL 11 open
>> items page.
>
> That looks fine as far as it goes.

Thanks.  Pushed!

> But I noticed that the code that
> actually resizes the hashtable looks like
>
>                 /* Double the size of the bucket array. */
>                 pstate->nbuckets *= 2;
>                 size = pstate->nbuckets * sizeof(dsa_pointer_atomic);
>                 hashtable->batches[0].shared->size += size / 2;
>                 dsa_free(hashtable->area, hashtable->batches[0].shared->buckets);
>                 hashtable->batches[0].shared->buckets =
>                     dsa_allocate(hashtable->area, size);
>                 buckets = (dsa_pointer_atomic *)
>                     dsa_get_address(hashtable->area,
>                                     hashtable->batches[0].shared->buckets);
>                 for (i = 0; i < pstate->nbuckets; ++i)
>                     dsa_pointer_atomic_init(&buckets[i], InvalidDsaPointer);
>
> with no apparent provision for failure of the realloc.  So the first
> question is does the query actually fail cleanly if dsa_allocate gets
> an OOM,

If dsa_allocate() fails then an error is raised and the query fails
cleanly using the standard parallel query error propagation machinery
(as this bug report actually showed).

> and the second is whether we really want to fail the query in
> such a case, rather than plow ahead with the existing bucket array.

Right, we could use dsa_allocate_extended(..., DSA_ALLOC_NO_OOM) and
only free and replace the original array if that succeeds, but
otherwise keep using it.  On the other hand, we don't do the
equivalent for non-parallel hash joins (and it probably wouldn't be
reached much considering the widespread use of overcommit).
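
To make the "only free and replace if that succeeds" idea concrete, here is
a minimal standalone sketch. It is not the Parallel Hash code: BucketTable,
alloc_no_oom() and try_double_buckets() are hypothetical names, plain
malloc() stands in for dsa_allocate_extended(..., DSA_ALLOC_NO_OOM), and
the simulate_failure argument exists only so the fallback path can be
exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared bucket array state. */
typedef struct BucketTable
{
    size_t      nbuckets;   /* current number of buckets */
    uintptr_t  *buckets;    /* current bucket array */
} BucketTable;

/*
 * Hypothetical allocator that reports failure by returning NULL instead of
 * raising an error.  It plays the role that an allocation with
 * DSA_ALLOC_NO_OOM would play; simulate_failure forces the failure path.
 */
static void *
alloc_no_oom(size_t size, int simulate_failure)
{
    if (simulate_failure)
        return NULL;
    return malloc(size);
}

/*
 * Try to double the bucket array.  Returns 1 if the table now uses a new,
 * empty array of twice the size, or 0 if allocation failed and the table
 * still uses the original array untouched.
 */
static int
try_double_buckets(BucketTable *table, int simulate_failure)
{
    size_t      new_nbuckets;
    uintptr_t  *new_buckets;
    size_t      i;

    assert(table != NULL && table->nbuckets > 0);

    new_nbuckets = table->nbuckets * 2;
    new_buckets = alloc_no_oom(new_nbuckets * sizeof(uintptr_t),
                               simulate_failure);
    if (new_buckets == NULL)
        return 0;               /* plow ahead with the existing array */

    /* Start with all buckets empty; tuples would be reinserted later. */
    for (i = 0; i < new_nbuckets; ++i)
        new_buckets[i] = 0;

    /* Only now is it safe to give up the old array. */
    free(table->buckets);
    table->buckets = new_buckets;
    table->nbuckets = new_nbuckets;
    return 1;
}
```

The point of the ordering is that nothing is freed or overwritten until the
new allocation is known to have succeeded, which is the opposite of the
free-then-allocate order in the quoted code.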

Unfortunately DSA_ALLOC_NO_OOM doesn't suppress all errors, in
particular not the Linux-only case where posix_fallocate() fails,
because dsm_create() calls dsm_impl_op(..., elevel=ERROR).  So,
independently of whether Parallel Hash should use that facility, I see
now that I need to fix that.  Perhaps the
DSM_CREATE_NULL_IF_MAXSEGMENTS flag to dsm_create() should be renamed
(or supplemented with another flag) and generalised to suppress the
error from posix_fallocate().  Currently the only in-tree user of
DSA_ALLOC_NO_OOM is dshash.c, which could leak a control object as a
result of this problem (it's using DSA_ALLOC_NO_OOM not because it has
a fall-back strategy for lack of memory but because it wants to clean
up something else before raising an OOM error of its own).  I will
look into that and start a new thread.

> BTW, that method of updating shared->size would do credit to
> Rube Goldberg.  Wouldn't this:
>
>                 hashtable->batches[0].shared->size = size;
>
> be both faster and less confusing?

The local variable 'size' is the size of the new bucket array, but
'hashtable->batches[0].shared->size' tracks the chunks of tuples too
so we can't just overwrite it.  We want to subtract the old bucket
array size and then add the new bucket array size, which amounts to
adding half the new size.  Yeah, that needs a comment.  Also, I
dropped the ball on point 2 of
https://wiki.postgresql.org/wiki/Parallel_Hash so the initial array
size isn't actually counted yet.  I will propose a clean-up patch for
that soon.
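
For what it's worth, the arithmetic can be spelled out in a small standalone
sketch. account_bucket_doubling() is a hypothetical helper, not code from
the tree, and it assumes the old bucket array has already been counted in
the running total (which, per the above, is not yet true of the initial
array).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the updated running total after the bucket array doubles.
 * shared_size covers tuple chunks plus the old bucket array (assumption).
 */
static size_t
account_bucket_doubling(size_t shared_size, size_t old_nbuckets,
                        size_t bucket_width)
{
    size_t      old_array = old_nbuckets * bucket_width;
    size_t      new_array = 2 * old_array;
    size_t      long_form;
    size_t      shortcut;

    assert(shared_size >= old_array);

    /* Subtract the old array, then add the new one ... */
    long_form = shared_size - old_array + new_array;

    /* ... which is the same as adding half of the new array's size. */
    shortcut = shared_size + new_array / 2;

    assert(long_form == shortcut);
    return shortcut;
}
```

With 1000 bytes of tuple chunks and 1024 buckets of 8 bytes each, the total
goes from 1000 + 8192 to 1000 + 16384; simply assigning the new array size
would lose the 1000 bytes of chunks.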

-- 
Thomas Munro
http://www.enterprisedb.com

