Re: [HACKERS] Parallel Hash take II

From: Thomas Munro
Subject: Re: [HACKERS] Parallel Hash take II
Date:
Msg-id: CAEepm=0XjazRX_bLCD7dSKwCQ+vS1Vczxbvpnj57U5aO7VhEWg@mail.gmail.com
In reply to: Re: [HACKERS] Parallel Hash take II  (Andres Freund <andres@anarazel.de>)
Responses: Re: [HACKERS] Parallel Hash take II  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Thu, Dec 14, 2017 at 11:45 AM, Andres Freund <andres@anarazel.de> wrote:
> +       bool            overflow;               /* Continuation of previous chunk? */
> +       char            data[FLEXIBLE_ARRAY_MEMBER];
> +} SharedTuplestoreChunk;
>
> Ah. I was thinking we could have the 'overflow' variable be an int,
> indicating the remaining length of the oversized tuple. That'd allow us
> to skip ahead to the end of the oversized tuple in concurrent processes
> after hitting it.

Right, that is a bit better as it avoids extra read-skip cycles for
multi-overflow-chunk cases.  Done that way.

> +                       if (accessor->write_pointer + accessor->sts->meta_data_size >=
> +                               accessor->write_end)
> +                               elog(ERROR, "meta-data too long");
>
> That seems more like an Assert than a proper elog? Given that we're
> calculating size just a few lines above...

It's an error because the logic is not smart enough to split the
optional meta-data and tuple size over multiple chunks.  I have added
comments there to explain.  That error can be reached by calling
test_sharedtuplestore(1, 1, 1, 32756, 1), but 32755 is OK.  My goal
here is to support arbitrarily large tuples, not arbitrarily large
per-tuple meta-data, since for my use case I only need 4 bytes (a hash
value).  This could be improved if required by later features
(probably anyone wanting more general meta-data would want variable
sized meta-data anyway, whereas this is fixed, and it would also be
nice if oversized tuples didn't have to start at the beginning of a
new chunk).

I fixed two nearby fencepost bugs: I made the limit that triggers that
error smaller by sizeof(uint32), and fixed a problem with small tuples
appearing after an oversized tuple in a final overflow chunk (found by
hacking the test module to create mixtures of different-sized tuples).

> +                       if (accessor->sts->meta_data_size > 0)
> +                               memcpy(accessor->write_pointer, meta_data,
> +                                          accessor->sts->meta_data_size);
> +                       written = accessor->write_end - accessor->write_pointer -
> +                               accessor->sts->meta_data_size;
> +                       memcpy(accessor->write_pointer + accessor->sts->meta_data_size,
> +                                  tuple, written);
>
> Also, shouldn't the same Assert() be here as well if you have it above?

No, when it comes to the tuple we just write as much of it as will
fit, and write the rest in the loop below.  Comments improved to make
that clear.

> +                       ereport(ERROR,
> +                                       (errcode_for_file_access(),
> +                                        errmsg("could not read from shared tuplestore temporary file"),
> +                                        errdetail("Short read while reading meta-data")));
>
> The errdetail doesn't follow the style guide (not a sentence ending with
> .), and seems internal-ish. I'm ok with keeping it, but perhaps we
> should change all these to be errdetail_internal()? Seems pointless to
> translate all of them.

Done.

> +               LWLockAcquire(&p->lock, LW_EXCLUSIVE);
> +               eof = p->read_page >= p->npages;
> +               if (!eof)
> +               {
> +                       read_page = p->read_page;
> +                       p->read_page += STS_CHUNK_PAGES;
> +               }
> +               LWLockRelease(&p->lock);
>
> So if we went to the world I'm suggesting, with overflow containing the
> length till the end of the tuple, this'd probably would have to look a
> bit different.

Yeah.  I almost wanted to change it back to a spinlock but now it's
grown bigger again...

> +               if (!eof)
> +               {
> +                       SharedTuplestoreChunk chunk_header;
> +
> +                       /* Make sure we have the file open. */
> +                       if (accessor->read_file == NULL)
> +                       {
> +                               char            name[MAXPGPATH];
> +
> +                               sts_filename(name, accessor, accessor->read_participant);
> +                               accessor->read_file =
> +                                       BufFileOpenShared(accessor->fileset, name);
> +                               if (accessor->read_file == NULL)
> +                                       elog(ERROR, "could not open temporary file %s", name);
>
> Isn't this more an Assert or just not anything? There's now way
> BufFileOpenShared should ever return NULL, no?

Right.  As of commit 923e8dee this can no longer return NULL (instead
it would raise an error), so I removed this redundant check.

> +                       /* If this is an overflow chunk, we skip it. */
> +                       if (chunk_header.overflow)
> +                               continue;
> +
> +                       accessor->read_ntuples = 0;
> +                       accessor->read_ntuples_available = chunk_header.ntuples;
> +                       accessor->read_bytes = offsetof(SharedTuplestoreChunk, data);
>
> Perhaps somewhere around here comment that we'll just loop around and
> call sts_read_tuple() in the next loop iteration?

Done.

-- 
Thomas Munro
http://www.enterprisedb.com
