Re: WAL logging problem in 9.4.3?

From: Heikki Linnakangas
Subject: Re: WAL logging problem in 9.4.3?
Date:
Msg-id 55AFC302.1060805@iki.fi
In reply to: Re: WAL logging problem in 9.4.3?  (Simon Riggs <simon@2ndQuadrant.com>)
Responses: Re: WAL logging problem in 9.4.3?  (Simon Riggs <simon@2ndQuadrant.com>)
Re: WAL logging problem in 9.4.3?  (Robert Haas <robertmhaas@gmail.com>)
Re: WAL logging problem in 9.4.3?  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List: pgsql-hackers
On 07/22/2015 11:18 AM, Simon Riggs wrote:
> On 10 July 2015 at 00:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>>>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>>>> different performance problem, perhaps we could have COPY create a new
>>>> relfilenode when it does this.  That should be safe if the table was
>>>> previously empty.
>>
>>> I'm not convinced that cab9a0656c36739f needs to survive in that
>>> form. To me only allowing one COPY to benefit from the wal_level =
>>> minimal optimization has a significantly higher cost than
>>> cab9a0656c36739f.
>>
>> What evidence have you got to base that value judgement on?
>>
>> cab9a0656c36739f was based on an actual user complaint, so we have good
>> evidence that there are people out there who care about the cost of
>> truncating a table many times in one transaction.  On the other hand,
>> I know of no evidence that anyone's depending on multiple sequential
>> COPYs, nor intermixed COPY and INSERT, to be fast.  The original argument
>> for having this COPY optimization at all was to make restoring pg_dump
>> scripts in a single transaction fast; and that use-case doesn't care
>> about anything but a single COPY into a virgin table.
>>
>
> We have to backpatch this fix, so it must be both simple and effective.
>
> Heikki's suggestions may be best, maybe not, but they don't seem
> backpatchable.
>
> Tom's suggestion looks good. So does Andres' suggestion. I have coded both.

Thanks. For comparison, I wrote a patch to implement what I had in mind.

When a WAL-skipping COPY begins, we add an entry for that relation in a
"pending-fsyncs" hash table. Whenever we perform any action on a heap
that would normally be WAL-logged, we check if the relation is in the
hash table, and skip WAL-logging if so.
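
To make the bookkeeping concrete, here is a toy standalone C model of
that table. All names here (PendingFsync, remember_pending_fsync, and
so on) are made up for illustration; they are not the identifiers used
in the actual patch, and the real code uses a proper dynamic hash
table rather than a fixed array:

#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 64   /* toy bound; the real table grows as needed */

typedef struct PendingFsync
{
    int      rel_id;     /* stand-in for the relation's relfilenode */
    unsigned skip_from;  /* first block not WAL-logged (refined below) */
    bool     truncated;  /* a truncation record was WAL-logged */
} PendingFsync;

static PendingFsync pending[MAX_PENDING];
static int n_pending = 0;

/* Is this relation registered in the pending-fsyncs table? */
static PendingFsync *
lookup_pending(int rel_id)
{
    for (int i = 0; i < n_pending; i++)
        if (pending[i].rel_id == rel_id)
            return &pending[i];
    return NULL;
}

/* Called when a WAL-skipping COPY begins on a relation. */
static void
remember_pending_fsync(int rel_id, unsigned nblocks_now)
{
    if (lookup_pending(rel_id) != NULL)
        return;   /* already registered; keep the older watermark */
    pending[n_pending++] = (PendingFsync){ rel_id, nblocks_now, false };
}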

That was a simplified explanation. In reality, when a WAL-skipping
COPY begins, we also record the current size of the relation as a
watermark. Actions on blocks beyond that old size are not WAL-logged,
while actions on lower-numbered blocks are. This ensures that if you
did any INSERTs on the table before the COPY, new actions on blocks
that were already WAL-logged by those INSERTs are WAL-logged too.
Conversely, if you perform INSERTs after (or during, from a trigger)
the COPY and they modify the new pages, those actions are not
WAL-logged. So starting a WAL-skipping COPY splits the relation into
two parts: an initial part that is WAL-logged as usual, and a later
part that is not. (There is one loose end here, marked with XXX in the
patch: the case where a cold, i.e. non-HOT, UPDATE involves one page
below the watermark and one above it.)
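
Continuing the toy model above, the per-action decision then reduces
to one check against that watermark (again, hypothetical names, not
the patch's code):

/* Should an action on this block of this relation be WAL-logged? */
static bool
needs_wal(int rel_id, unsigned blockno)
{
    PendingFsync *e = lookup_pending(rel_id);

    if (e == NULL)
        return true;   /* no WAL-skipping COPY touched this relation */
    if (e->truncated)
        return true;   /* truncation was WAL-logged; see further below */

    /* Blocks below the watermark existed before the COPY and may
     * already have WAL records; blocks at or above it are new. */
    return blockno < e->skip_from;
}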

The actual fsync() has been moved to the end of the transaction, since
we now skip WAL-logging of actions after the COPY as well.
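
In the toy model, that commit-time step would look roughly like this,
with smgr_fsync_rel() standing in for whatever actually syncs the
relation's file (a hypothetical stub, not a real backend function):

extern void smgr_fsync_rel(int rel_id); /* hypothetical: sync the file */

/* At COMMIT: flush every relation whose changes skipped WAL. */
static void
at_commit_pending_fsyncs(void)
{
    for (int i = 0; i < n_pending; i++)
    {
        /* Relations whose truncation was WAL-logged (see below) had
         * all later actions WAL-logged too, so no fsync is needed. */
        if (!pending[i].truncated)
            smgr_fsync_rel(pending[i].rel_id);
    }
    n_pending = 0;
}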

And truncations complicate things further. If we emit a truncation WAL
record in the transaction, we also make an entry in the hash table to
record that. All later operations on a relation truncated that way
must be WAL-logged as usual, because replaying the truncation record
will destroy the data again, even if we fsync the file at commit. But
we still optimize for "BEGIN; CREATE; COPY; TRUNCATE; COPY;" style
patterns, because if we truncate a relation that has already been
marked for fsync-at-COMMIT, we don't need to WAL-log the truncation
either.
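
In toy form, that truncation logic decides whether the truncation
itself needs a WAL record and updates the table accordingly
(hypothetical names throughout, not the patch's code):

/* A TRUNCATE of rel_id inside the current transaction.
 * Returns true if the truncation must be WAL-logged. */
static bool
truncate_relation(int rel_id)
{
    PendingFsync *e = lookup_pending(rel_id);

    if (e != NULL && !e->truncated)
    {
        /* Already marked for fsync-at-COMMIT: skip the WAL record and
         * drop the watermark to zero, since the whole file is new
         * again.  This keeps CREATE; COPY; TRUNCATE; COPY fast. */
        e->skip_from = 0;
        return false;
    }

    /* Otherwise WAL-log the truncation and record that we did, so
     * that all later actions on the relation are WAL-logged too. */
    if (e == NULL)
        pending[n_pending++] = (PendingFsync){ rel_id, 0, true };
    else
        e->truncated = true;
    return true;
}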


This is more invasive than I'd like to backpatch, but I think it's the
simplest approach that works, and doesn't disable any of the important
optimizations we have.

>> And what reason is there to think that this would fix all the problems?
>
> I don't think either suggested fix could be claimed to be a great solution,
> since there is little principle here, only heuristic. Heikki's solution
> would be the only safe way, but is not backpatchable.

I can't get too excited about a half-fix that leaves you with data
corruption in some scenarios.

I wrote a little test script to test all these different scenarios
(attached). Both of your patches fail with the script.

- Heikki


