Re: [HACKERS] Subtle bug in "Simplify tape block format" commit

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
Дата
Msg-id 2044d466-beb9-b2ef-aec8-408a8ca4c5d0@iki.fi
обсуждение исходный текст
Ответ на [HACKERS] Subtle bug in "Simplify tape block format" commit  (Peter Geoghegan <pg@heroku.com>)
Ответы Re: [HACKERS] Subtle bug in "Simplify tape block format" commit  (Peter Geoghegan <pg@bowt.ie>)
Список pgsql-hackers
On 01/07/2017 12:33 AM, Peter Geoghegan wrote:
> Previously, all calls to ltsGetFreeBlock() were immediately followed
> by a corresponding call to ltsWriteBlock(); we wrote out the
> newly-allocated-in-first-pass block there and then. It's a good idea
> for a corresponding ltsWriteBlock() call to happen immediately, and
> it's *essential* for it to happen before any later block is written
> during the first write pass (when tuples are initially dumped out to
> form runs), since, as noted above ltsWriteBlock():
>
> /*
>  * Write a block-sized buffer to the specified block of the underlying file.
>  *
>  * NB: should not attempt to write beyond current end of file (ie, create
>  * "holes" in file), since BufFile doesn't allow that.  The first write pass
>  * must write blocks sequentially.
> ...

I completely missed that rule :-(.

> However, a "write beyond current end of file" now seems to actually be
> attempted at times, resulting in an arcane error during sorting. This
> is more or less because LogicalTapeWrite() doesn't heed the warning
> from ltsGetFreeBlock(), as seen here:
>
>     while (size > 0)
>     {
>         if (lt->pos >= TapeBlockPayloadSize)
>         {
>             ...
>
>             /*
>              * First allocate the next block, so that we can store it in the
>              * 'next' pointer of this block.
>              */
>             nextBlockNumber = ltsGetFreeBlock(lts);
>
>             /* set the next-pointer and dump the current block. */
>             TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;
>             ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
>             ...
>         }
>         ...
>
>     }
>
> Note that LogicalTapeWrite() now allocates each block (calls
> ltsGetFreeBlock()) one block in advance of current block, immediately
> before *current* block is written (not the next block/block just
> allocated). So, the corresponding ltsWriteBlock() call *must* happen
> at an unspecified time in the future, typically the next time control
> ends up at exactly the same point, where the block that was "next"
> becomes "current".
>
> I'm not about to argue that we should go back to following this
> ltsGetFreeBlock() rule, though; I can see why Heikki refactored
> LogicalTapeWrite() to allocate blocks a block in advance. Still, we
> need to be more careful in avoiding the underlying problem that the
> ltsGetFreeBlock() rule was intended to prevent. Attached patch 0001-*
> has logtape.c be sure to write out a tape's buffer every time
> tuplesort ends a run.

Hmm. The LogicalTapeEndWriting() function is very similar to the 
LogicalTapePause() function I had in the pause/resume patch. They both 
flush the last block to disk. The difference is that LogicalTapePause() 
also free'd the buffer, and read back the last block from the disk if 
you continued writing, while LogicalTapeEndWriting() keeps a copy of it 
in memory.

With the proposed fix (or with the pause/resume), you can only write to 
a single tape at a time. Not a problem at the moment, but something to 
consider. At least, would need more comments to make that more clear, 
and an Assert would be nice.

Alternatively, we could fix this with a small change in ltsWriteBlock(), 
see attached patch. When we're about to create a hole in the file, write 
all-zero blocks to avoid the creating hole, before the block itself. 
That's not quite as efficient as writing the actual block contents into 
the hole, which avoids having to write it later, but probably won't make 
any measurable difference in practice. I'm inclined to do that, because 
it relaxes the rules on what you're allowed to do, in what order, which 
makes this more robust in general. We coul *also* have something like 
LogicalTapeEndWriting() or LogicalTapePause(), for efficiency, but it 
doesn't seem that important.

> I have a test case.
 > ...

Thanks for the analysis!

- Heikki


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: tushar
Дата:
Сообщение: Re: [HACKERS] Parallel Index-only scan
Следующее
От: Ashutosh Bapat
Дата:
Сообщение: Re: [HACKERS] Push down more full joins in postgres_fdw