Thread: Sketch of a fix for that truncation data corruption issue

Sketch of a fix for that truncation data corruption issue

From: Tom Lane
We got another report today [1] that seems to be due to the problem
we've seen before with failed vacuum truncations leaving corrupt state
on-disk [2].  Reflecting on that some more, it seems to me that we're
never going to get to a solution that everybody finds acceptable without
some rather significant restructuring at the buffer-access level.
Since looking for a back-patchable solution has yielded no progress in
eight years, what if we just accept that we will only fix this in HEAD,
and think outside the box about how we could fix it if we're willing
to change internal APIs as much as necessary?

After doodling for awhile with that in mind, it seems like we might be
able to fix it by introducing a new buffer state "truncation pending" that
we'd apply to empty-but-dirty buffers that we don't want to write out.
Aside from fixing the data corruption issue, this sketch has the
significant benefit that we don't need to take AccessExclusiveLock
anymore to do a vacuum truncation: it seems sufficient to lock out
would-be writers of the table.  VACUUM's truncation logic would go
like this:

1. Take ShareLock on table to lock out writers but not readers.
(We might need ShareRowExclusive, not sure.)  As with existing
code that takes AccessExclusiveLock, this is a lock upgrade,
so it could fail but that's OK, we just don't truncate.

2. Scan backwards from relation EOF to determine last page to be deleted
(same as existing logic).

3. Truncate FSM and VM.  (This can be done outside critical section
because it's harmless if we fail later; the lost state can be
reconstructed, and anyway we know all the forgotten-about pages are
empty.)

4. Issue WAL truncation record, and make sure it's flushed.

5. Begin critical section, so that any hard ERROR below becomes PANIC.
(We don't really expect any error, but it's not OK for the vacuum
process to disappear without having undone any truncation-pending marks.)

6. Re-scan buffers from first to last page to be deleted, using a fetch
mode that doesn't pull in pages not already present in buffer pool.
(That way, we don't issue any reads during this phase, reducing the risk
of unwanted ERROR/PANIC.)  As we examine each buffer:
* If not empty of live tuples, panic :-(
* If clean, delete buffer.
* If dirty, mark as truncation pending.
Remember the first and last page numbers that got marked as trunc pending.

7. Issue ftruncate(s), working backwards if the truncation spans multiple
segment files.  Don't error out on syscall failure, just stop truncating
and note boundary of successful truncation.

8. Re-scan buffers from first to last trunc pending page, again skipping
anything not found in buffer pool.  Anything found should be trunc
pending, or possibly empty if a reader pulled it in concurrently.
Either delete if it's above what we successfully truncated to, or drop
the trunc-pending bit (reverting it to dirty state) if not.

9. If actual truncation boundary was different from plan, issue another
WAL record saying "oh, we only managed to truncate to here, not there".

10. End critical section.

11. Release table ShareLock.
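
To make the shape of that concrete, here's a rough pseudo-C sketch of the
whole sequence.  It won't compile: everything with a _sketch suffix is
invented shorthand for the corresponding step above, not an existing
function; only the lock, WAL-flush and critical-section primitives are the
real ones, and old_nblocks is assumed to be the relation's current length.

    static void
    lazy_truncate_heap_sketch(Relation rel, BlockNumber old_nblocks)
    {
        BlockNumber target;     /* intended new relation length */
        BlockNumber actual;     /* how far ftruncate really got us */

        /* 1. Lock out writers but not readers; a failed lock upgrade
         *    just means we skip truncation this time. */
        if (!ConditionalLockRelation(rel, ShareLock))
            return;

        /* 2. Scan backwards from EOF for the last deletable page. */
        target = count_nondeletable_pages_sketch(rel, old_nblocks);
        if (target >= old_nblocks)
        {
            UnlockRelation(rel, ShareLock);
            return;
        }

        /* 3. FSM/VM truncation is harmless if we fail later, so it can
         *    stay outside the critical section. */
        truncate_fsm_vm_sketch(rel, target);

        /* 4. WAL-log the intended truncation and make sure it's on disk. */
        XLogFlush(log_truncate_sketch(rel, target));

        /* 5. From here on any hard ERROR must become PANIC; we can't
         *    vanish leaving trunc-pending marks behind. */
        START_CRIT_SECTION();

        /* 6. Re-scan [target, old_nblocks): drop clean buffers, mark
         *    empty-but-dirty ones "truncation pending", and never pull
         *    in pages that aren't already resident. */
        mark_trunc_pending_sketch(rel, target, old_nblocks);

        /* 7. ftruncate, last segment first; on syscall failure just stop
         *    and remember how far we actually got. */
        actual = smgr_truncate_noerror_sketch(rel, target);

        /* 8. Second pass over the trunc-pending range: delete buffers
         *    above 'actual', revert the rest to plain dirty. */
        resolve_trunc_pending_sketch(rel, target, actual);

        /* 9. Admit it in WAL if we fell short of the plan. */
        if (actual != target)
            log_truncate_actual_sketch(rel, actual);

        /* 10-11. */
        END_CRIT_SECTION();
        UnlockRelation(rel, ShareLock);
    }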


Now, what does the "truncation pending" buffer flag do exactly?

* The buffer manager is not allowed to drop such a page from the pool,
nor to attempt to write it out; it's just in limbo till vacuum's
critical section completes.

* Readers: assume page is empty, move on.  (A seqscan could actually
stop, knowing that all later pages must be empty too.)  Or, since
the page must be empty of live tuples, readers could just process it
normally, but I'd rather they didn't.

* Writers: error, should not be able to see this state.

* Background writer: ignore and move on (must *not* try to write
dirty page, since truncate might've already happened).

* Checkpointer: I think checkpointer has to wait for the flag to go
away :-(.  We can't mark a checkpoint complete if there are dirty
trunc-pending pages.  This aspect could use more thought, perhaps.
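
Restated in code form (BM_TRUNC_PENDING, both enums and the function below
are purely hypothetical; they just encode the four rules above):

    typedef enum { TP_READER, TP_WRITER, TP_BGWRITER, TP_CHECKPOINTER } TPCaller;
    typedef enum { TP_TREAT_AS_EMPTY, TP_ERROR, TP_SKIP, TP_WAIT } TPAction;

    static TPAction
    trunc_pending_policy(TPCaller who)
    {
        switch (who)
        {
            case TP_READER:
                /* Assume the page is empty and move on; a seqscan could
                 * even stop here, since all later pages must be empty. */
                return TP_TREAT_AS_EMPTY;
            case TP_WRITER:
                /* Locked out by the ShareLock, so seeing this is a bug. */
                return TP_ERROR;
            case TP_BGWRITER:
                /* Must not write the dirty page: the file underneath it
                 * may already have been truncated away. */
                return TP_SKIP;
            case TP_CHECKPOINTER:
                /* Can't declare the checkpoint complete over dirty
                 * trunc-pending pages; wait for VACUUM to clear them. */
                return TP_WAIT;
        }
        return TP_ERROR;    /* keep the compiler quiet */
    }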

WAL replay looks like this:

* Normal truncate record: just truncate heap, FSM, and VM to where
it says, and discard buffers above that.

* "Only managed to truncate to here" record: write out empty heap
pages to fill the space from original truncation target to actual.
This restores the on-disk situation to be equivalent to what it
was in master, assuming all the dirty pages eventually got written.
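
In redo terms, with a made-up record layout and helpers, that's roughly:

    typedef struct xl_truncate_sketch
    {
        RelFileNode rnode;
        BlockNumber target_blkno;   /* planned truncation point */
        BlockNumber actual_blkno;   /* only meaningful in the second record */
        bool        is_adjustment;  /* "only managed to truncate to here"? */
    } xl_truncate_sketch;

    static void
    truncate_redo_sketch(XLogReaderState *record)
    {
        xl_truncate_sketch *xlrec = (xl_truncate_sketch *) XLogRecGetData(record);

        if (!xlrec->is_adjustment)
        {
            /* Normal record: truncate heap, FSM and VM to the target
             * and forget any buffers above it. */
            drop_buffers_above_sketch(xlrec->rnode, xlrec->target_blkno);
            truncate_all_forks_sketch(xlrec->rnode, xlrec->target_blkno);
        }
        else
        {
            /* Pad from the actual boundary up to the original target
             * with physically empty pages, matching what the master
             * will have on disk once its dirty pages get written. */
            BlockNumber blkno;

            for (blkno = xlrec->actual_blkno; blkno < xlrec->target_blkno; blkno++)
                write_empty_heap_page_sketch(xlrec->rnode, blkno);
        }
    }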

It's slightly annoying to write the second truncation record inside
the critical section, but I think we need to because if we don't,
there's a window where we don't write the second record at all and
so the post-replay situation is different from what the master had
on disk.  Note that if we do crash in the critical section, the
post-replay situation will be that we truncate to the original target,
which seems fine since nothing else could've affected the table state.
(Maybe we should issue the first WAL record inside the critical
section too? Not sure.)

One issue with not holding AEL is that there are race conditions
wherein readers might attempt to fetch pages beyond the file EOF
(for example, a seqscan that started before the truncation began
would attempt to read up to the old EOF, or a very slow indexscan
might try to follow a pointer from a since-deleted index entry).
So we would have to change things to regard that as a non-error
condition.  That might be fine, but it does seem like it's giving up
some error detection capability.  If anyone's sufficiently worried
about that, we could keep the lock level at AEL; but personally
I think reducing the lock level is worth enough to be willing to make
that compromise.

Another area that's possibly worthy of concern is whether a reader
could attempt to re-set bits in the FSM or VM for to-be-deleted pages
after the truncation of those files.  We might need some interlock to
prevent that.  Or, perhaps, just re-truncate them after the main
truncation?  Or maybe it doesn't matter if there's bogus data in
the maps.

Also, I'm not entirely sure whether there's anything in our various
replication logic that's dependent on vacuum truncation taking AEL.
Offhand I'd expect the reduced use of AEL to be a plus, but maybe
I'm missing something.

Thoughts?  Are there obvious holes in this plan?

            regards, tom lane

[1] https://www.postgresql.org/message-id/28278.1544463193@sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/flat/5BBC590AE8DF4ED1A170E4D48F1B53AC@tunaPC


Re: Sketch of a fix for that truncation data corruption issue

From: Robert Haas
On Tue, Dec 11, 2018 at 5:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> We got another report today [1] that seems to be due to the problem
> we've seen before with failed vacuum truncations leaving corrupt state
> on-disk [2].  Reflecting on that some more, it seems to me that we're
> never going to get to a solution that everybody finds acceptable without
> some rather significant restructuring at the buffer-access level.
> Since looking for a back-patchable solution has yielded no progress in
> eight years, what if we just accept that we will only fix this in HEAD,
> and think outside the box about how we could fix it if we're willing
> to change internal APIs as much as necessary?

+1.

> 9. If actual truncation boundary was different from plan, issue another
> WAL record saying "oh, we only managed to truncate to here, not there".

I don't entirely understand how this fix addresses the problems in
this area, but this step sounds particularly scary.  Nothing
guarantees that the second WAL record ever gets replayed.

> * "Only managed to truncate to here" record: write out empty heap
> pages to fill the space from original truncation target to actual.
> This restores the on-disk situation to be equivalent to what it
> was in master, assuming all the dirty pages eventually got written.

This is equivalent only in a fairly loose sense, right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sketch of a fix for that truncation data corruption issue

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Dec 11, 2018 at 5:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 9. If actual truncation boundary was different from plan, issue another
>> WAL record saying "oh, we only managed to truncate to here, not there".

> I don't entirely understand how this fix addresses the problems in
> this area,

Well, the point is to not fail if an ftruncate() call fails.  The hard
part, of course, is to adequately maintain/restore consistency when
that happens.

> ... but this step sounds particularly scary.  Nothing
> guarantees that the second WAL record ever gets replayed.

I'm not following?  How would a slave not replay that record, other
than by diverging to a new timeline?  (in which case it's okay
if it doesn't have exactly the master's state)

>> * "Only managed to truncate to here" record: write out empty heap
>> pages to fill the space from original truncation target to actual.
>> This restores the on-disk situation to be equivalent to what it
>> was in master, assuming all the dirty pages eventually got written.

> This is equivalent only in a fairly loose sense, right?

Right, specifically in the sense that logically empty pages (containing no
live tuples) get replaced by physically empty pages.  We sort of do that
now when we truncate: the truncated-away pages may not be physically
empty, but whenever we next extend the relation, we'll materialize a new
physically empty page where that page had been.

There are at least two variants of the idea that seem worth studying:
one is to fill the not-successfully-truncated space with zeroes rather than
valid empty pages, and the other is to not re-extend the relation at all,
but just proceed as though the original truncation had succeeded fully.
My concern about the latter is mostly that a slave following the WAL
stream might see commands to write pages that are not contiguous with
what it thinks the file EOF is, and that could lead to either bogus errors
or weird situations with "holes" in files.  Maybe we could make that
work, though.  The fill-with-zeroes idea is sort of a compromise in
between the other two, and could be better or worse depending on code
details that I've not really looked into yet.  But it'd make this
situation look much like the case where we crash between smgrextend'ing
a rel and writing a valid page into the space, which works AFAIK.

Anyway, if your assumption is that WAL replay must yield bit-for-bit
the same state of the not-truncated pages that the master would have,
then I doubt we can make this work.  In that case we're back to the
type of solution you rejected eight years ago, where we have to write
out pages before truncating them away.

            regards, tom lane


Re: Sketch of a fix for that truncation data corruption issue

From: Laurenz Albe
Tom Lane wrote:
> We got another report today [1] that seems to be due to the problem
> we've seen before with failed vacuum truncations leaving corrupt state
> on-disk [2].  Reflecting on that some more, [...]

This may seem heretical, but I'll say it anyway.

Why don't we do away with vacuum truncation for good?
Is that a feature that does anybody any good?
To me it has always seemed to be more a wart than a feature, like
someone just thought it was low hanging fruit without considering
all the implications.

VACUUM doesn't reclaim space, VACUUM (FULL) does.  That's the way it
(mostly) is, so why complicate matters unnecessarily?

Yours,
Laurenz Albe



Re: Sketch of a fix for that truncation data corruption issue

From: Andres Freund
Hi,

On 2018-12-11 07:09:34 +0100, Laurenz Albe wrote:
> Tom Lane wrote:
> > We got another report today [1] that seems to be due to the problem
> > we've seen before with failed vacuum truncations leaving corrupt state
> > on-disk [2].  Reflecting on that some more, [...]
> 
> This may seem heretical, but I'll say it anyway.
> 
> Why don't we do away with vacuum truncation for good?
> Is that a feature that does anybody any good?
> To me it has always seemed to be more a wart than a feature, like
> someone just thought it was low hanging fruit without considering
> all the implications.

There's a lot of workloads that I've seen that'd regress. And probably a
lot more that we don't know about.  I don't see how we could go there.

Greetings,

Andres Freund


Re: Sketch of a fix for that truncation data corruption issue

From: Robert Haas
On Tue, Dec 11, 2018 at 3:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > ... but this step sounds particularly scary.  Nothing
> > guarantees that the second WAL record ever gets replayed.
>
> I'm not following?  How would a slave not replay that record, other
> than by diverging to a new timeline?  (in which case it's okay
> if it doesn't have exactly the master's state)

If it's following the master, it will.  But replication can be paused
indefinitely, or a slave can be promoted to be a master.

> Anyway, if your assumption is that WAL replay must yield bit-for-bit
> the same state of the not-truncated pages that the master would have,
> then I doubt we can make this work.  In that case we're back to the
> type of solution you rejected eight years ago, where we have to write
> out pages before truncating them away.

How much have you considered the possibility that my rejection of that
approach was a stupid and wrong-headed idea?  I'm not sure I still
believe that not writing those buffers would have a meaningful
performance cost.  Truncating relations isn't that common of an
operation, and also, we could mitigate the impacts by having the scan
that identifies the truncation point also write any dirty buffers
after that point.  We'd have to recheck after upgrading our relation
lock, but odds are good that in the normal case we wouldn't add much
to the time when we hold the stronger lock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sketch of a fix for that truncation data corruption issue

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Dec 11, 2018 at 3:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Anyway, if your assumption is that WAL replay must yield bit-for-bit
>> the same state of the not-truncated pages that the master would have,
>> then I doubt we can make this work.  In that case we're back to the
>> type of solution you rejected eight years ago, where we have to write
>> out pages before truncating them away.

> How much have you considered the possibility that my rejection of that
> approach was a stupid and wrong-headed idea?  I'm not sure I still
> believe that not writing those buffers would have a meaningful
> performance cost.

Well, if *you're* willing to entertain that possibility, I'm on board.
That would certainly lead to a much simpler, and probably back-patchable,
fix.

> Truncating relations isn't that common of an
> operation, and also, we could mitigate the impacts by having the scan
> that identifies the truncation point also write any dirty buffers
> after that point.  We'd have to recheck after upgrading our relation
> lock, but odds are good that in the normal case we wouldn't add much
> to the time when we hold the stronger lock.

Hm, not quite following this?  We have to lock out writers before we
try to identify the truncation point.

            regards, tom lane


Re: Sketch of a fix for that truncation data corruption issue

From: Peter Geoghegan
On Tue, Dec 11, 2018 at 12:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
> How much have you considered the possibility that my rejection of that
> approach was a stupid and wrong-headed idea?  I'm not sure I still
> believe that not writing those buffers would have a meaningful
> performance cost.  Truncating relations isn't that common of an
> operation, and also, we could mitigate the impacts by having the scan
> that identifies the truncation point also write any dirty buffers
> after that point.

I too suspect that it would be okay to regress truncation. Certainly,
there are workloads that totally depend on truncation for reasonable
performance, but even that doesn't necessarily imply that it consumes
too many cycles. I'm okay with imposing costs on a minority workload
provided the benefit is there, and the penalty isn't that noticeable
in realistic scenarios, to real users.

-- 
Peter Geoghegan


Re: Sketch of a fix for that truncation data corruption issue

From: Andres Freund
Hi,

On 2018-12-10 15:38:55 -0500, Tom Lane wrote:
> Also, I'm not entirely sure whether there's anything in our various
> replication logic that's dependent on vacuum truncation taking AEL.
> Offhand I'd expect the reduced use of AEL to be a plus, but maybe
> I'm missing something.

It'd be a *MAJOR* plus.  One of the biggest operational headaches for
using a HS node for querying is that there'll often be conflicts due to
vacuum truncating relations (which logs an AEL), even if
hot_standby_feedback is used.  There's been multiple proposals to
allow disabling truncations just because of that.

Greetings,

Andres Freund


Re: Sketch of a fix for that truncation data corruption issue

From: Andres Freund
Hi,

On 2018-12-10 15:38:55 -0500, Tom Lane wrote:
> Reflecting on that some more, it seems to me that we're never going to
> get to a solution that everybody finds acceptable without some rather
> significant restructuring at the buffer-access level.

I'm thinking about your proposal right now.  Here's what I'd previously had
in mind for fixing this, at a somewhat higher level than your proposal -
more with an angle towards getting rid of the AEL (both to allow truncation
to happen with concurrent writers and to avoid the HS cancellation issues)
than towards robustness.

I'm listing this mostly as fodder for thought: since I didn't aim for
robustness, it doesn't quite achieve the same guarantees you seem to be
angling for. But if we're potentially going for a HEAD-only solution, I
think it's reasonable to see if we can combine ideas.



For truncation:

1) Conditionally take extension lock

2) Determine current size

3) From the back of the relation, probe the buffer manager for each page,
   one by one.  If the page is *not* in the buffer manager, install a
   buffer descriptor that's marked invalid and lock it exclusively.  If it
   *is* in buffers, try to conditionally acquire a cleanup lock; go to 4)
   if that's not available.

   Also go to 4) once we've accumulated more than ~128 locked pages.

4) If 3) did not lock any pages, give up.

5) Truncate FSM/VM.

6) In a critical section: WAL log truncation for all the locked pages,
   truncate pages, and drop buffers + locks.  After this every attempt
   to read in such a page would fail.
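
In rough pseudo-C (the _sketch names are inventions, only the extension-lock
and size primitives are real, and the ~128 cap is the lwlock limit mentioned
below):

    #define MAX_TRUNC_PAGES_SKETCH 128

    static void
    incremental_truncate_sketch(Relation rel)
    {
        BlockNumber nblocks, blkno, keep;
        int         locked = 0;

        /* 1) conditionally take the extension lock */
        if (!ConditionalLockRelationForExtension(rel, ExclusiveLock))
            return;

        /* 2) current size */
        nblocks = RelationGetNumberOfBlocks(rel);
        keep = nblocks;

        /* 3) probe backwards: reserve-and-lock pages that aren't resident,
         *    conditionally cleanup-lock the ones that are, and stop when
         *    either fails or we hit the cap */
        for (blkno = nblocks; blkno > 0; blkno--)
        {
            if (!lock_tail_page_sketch(rel, blkno - 1))
                break;
            keep = blkno - 1;
            if (++locked >= MAX_TRUNC_PAGES_SKETCH)
                break;
        }

        if (locked > 0)             /* 4) give up if nothing got locked */
        {
            /* 5) FSM/VM first */
            truncate_fsm_vm_sketch(rel, keep);

            /* 6) WAL-log, truncate and drop buffers + locks atomically */
            START_CRIT_SECTION();
            log_and_truncate_sketch(rel, keep);
            drop_locked_tail_buffers_sketch(rel, keep, nblocks);
            END_CRIT_SECTION();
        }

        UnlockRelationForExtension(rel, ExclusiveLock);
    }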


This obviously has the fairly significant drawback that truncations can
only happen in relatively small increments, about a megabyte (limited by
the number of concurrently held lwlocks). But because no locks are
required that block writes, that's much less of an issue than
previously.  I assume we'd still want an additional heuristic that
doesn't even start trying to truncate unless there's enough trailing free
space to make it worthwhile - but we could apply that check without a lock
and recheck afterwards.

A second drawback is that during the truncation other processes would
need to wait, uninterruptibly in an lwlock no less, till the truncation
is finished, if they try to read one of the empty blocks.

One issue is that:
> One issue with not holding AEL is that there are race conditions
> wherein readers might attempt to fetch pages beyond the file EOF
> (for example, a seqscan that started before the truncation began
> would attempt to read up to the old EOF, or a very slow indexscan
> might try to follow a pointer from a since-deleted index entry).
> So we would have to change things to regard that as a non-error
> condition.  That might be fine, but it does seem like it's giving up
> some error detection capability.  If anyone's sufficiently worried
> about that, we could keep the lock level at AEL; but personally
> I think reducing the lock level is worth enough to be willing to make
> that compromise.

becomes a bit more of a prominent problem, because it's no longer just the
occasional racing read access that needs to treat a page-after-EOF as a
non-fatal condition.  But that doesn't strike me as a large additional
problem.


> After doodling for awhile with that in mind, it seems like we might be
> able to fix it by introducing a new buffer state "truncation pending" that
> we'd apply to empty-but-dirty buffers that we don't want to write out.
> Aside from fixing the data corruption issue, this sketch has the
> significant benefit that we don't need to take AccessExclusiveLock
> anymore to do a vacuum truncation: it seems sufficient to lock out
> would-be writers of the table.  VACUUM's truncation logic would go
> like this:

It's possible that combining your "truncation pending" flag with my
sketch above would let us avoid the "waiting for lwlock" issue.


> 7. Issue ftruncate(s), working backwards if the truncation spans multiple
> segment files.  Don't error out on syscall failure, just stop truncating
> and note boundary of successful truncation.

ISTM on the back-branches the least we should do is to make the
vacuum truncations happen in a critical section. That sucks, but it's
sure better than weird corruption due to on-disk state diverging from
the in-memory state.
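
I.e. just something like the below, hand-waving over the exact call site
and the current smgrtruncate() signature; any ERROR inside the critical
section gets promoted to PANIC:

    START_CRIT_SECTION();
    smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
    END_CRIT_SECTION();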


> Another area that's possibly worthy of concern is whether a reader
> could attempt to re-set bits in the FSM or VM for to-be-deleted pages
> after the truncation of those files.  We might need some interlock to
> prevent that.  Or, perhaps, just re-truncate them after the main
> truncation?  Or maybe it doesn't matter if there's bogus data in
> the maps.

It might matter for the VM - if that says "all frozen", we'd be in
trouble later.

Greetings,

Andres Freund


Re: Sketch of a fix for that truncation data corruption issue

From: Robert Haas
On Wed, Dec 12, 2018 at 6:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, if *you're* willing to entertain that possibility, I'm on board.
> That would certainly lead to a much simpler, and probably back-patchable,
> fix.

I think we should, then. Simple is good.

Just thinking about this a bit, the problem with truncating first and
then writing the WAL record is that if the WAL record never makes it
to disk, any physical standbys will end up out of sync with the
master, leading to disaster. But the problem with writing the WAL
record first is that the actual operation might fail, and then
standbys will end up out of sync with the master, leading to disaster.
The obvious way to finesse that latter problem is just PANIC if
ftruncate() fails -- then we'll crash restart and retry, and if we
still can't do it, well, the DBA will have to fix that before the
system can come on line.  I'm not sure that's really all that bad --
if we can't truncate, we're kinda hosed.  How, other than a
permissions problem, does that even happen?

Your sketch upthread tries to fix it another way -- write a second
record that says essentially "never mind".  But that leads to the
master and the standby not really being in quite equivalent states.
I'm not sure whether that's really OK. If any future operation on the
master depends on some aspect of the page state that wasn't recreated
exactly on the standby, then replay will run into trouble.

I wonder if we could get away with defining a truncation event as
setting all pages beyond the truncation point to all-zeroes, with the
number of those pages that actually exist at the filesystem level as
an accidental detail.  So if the master can't ftruncate(), it's also
OK if it just zeroes all the buffers beyond that point.  But once it
emits the WAL record, it must do one or the other, or else PANIC.  The
standby has the same options.
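
Very roughly, with invented helper names, the invariant would be "once the
record is out, end up in one of the two equivalent states or die":

    static void
    apply_truncation_event_sketch(Relation rel, BlockNumber new_nblocks)
    {
        XLogRecPtr  lsn = log_truncation_event_sketch(rel, new_nblocks);

        XLogFlush(lsn);

        if (physically_truncate_sketch(rel, new_nblocks))
            return;                 /* the normal, cheap outcome */

        /* Couldn't shrink the file: zeroing the tail is defined to be
         * just as good, since how many all-zero pages exist on disk is
         * an accidental detail.  If even that fails, give up hard. */
        if (!zero_tail_pages_sketch(rel, new_nblocks))
            elog(PANIC, "could not truncate or zero pages beyond block %u",
                 new_nblocks);
    }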

> > Truncating relations isn't that common of an
> > operation, and also, we could mitigate the impacts by having the scan
> > that identifies the truncation point also write any dirty buffers
> > after that point.  We'd have to recheck after upgrading our relation
> > lock, but odds are good that in the normal case we wouldn't add much
> > to the time when we hold the stronger lock.
>
> Hm, not quite following this?  We have to lock out writers before we
> try to identify the truncation point.

I thought we'd make a tentative identification of the truncation point,
upgrade the lock, and then recheck.
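
Roughly this, with invented names and hand-waving over the details:

    static void
    truncate_with_prewrite_sketch(Relation rel)
    {
        BlockNumber guess, final;

        /* Under the ordinary vacuum lock: guess the truncation point and
         * write out any dirty buffers beyond it. */
        guess = find_truncation_point_sketch(rel);
        flush_dirty_buffers_from_sketch(rel, guess);

        /* Now upgrade; give up quietly if we can't. */
        if (!ConditionalLockRelation(rel, AccessExclusiveLock))
            return;

        /* Recheck under the strong lock.  Normally nothing has changed,
         * so this pass writes little or nothing. */
        final = find_truncation_point_sketch(rel);
        flush_dirty_buffers_from_sketch(rel, final);

        do_truncate_sketch(rel, final);
        UnlockRelation(rel, AccessExclusiveLock);
    }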

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sketch of a fix for that truncation data corruption issue

From: Andres Freund
Hi,

On 2018-12-12 10:49:59 +0900, Robert Haas wrote:
> Just thinking about this a bit, the problem with truncating first and
> then writing the WAL record is that if the WAL record never makes it
> to disk, any physical standbys will end up out of sync with the
> master, leading to disaster. But the problem with writing the WAL
> record first is that the actual operation might fail, and then
> standbys will end up out of sync with the master, leading to disaster.
> The obvious way to finesse that latter problem is just PANIC if
> ftruncate() fails -- then we'll crash restart and retry, and if we
> still can't do it, well, the DBA will have to fix that before the
> system can come on line.  I'm not sure that's really all that bad --
> if we can't truncate, we're kinda hosed.  How, other than a
> permissions problem, does that even happen?

I think it's correct to panic in that situation. As you say it's really
unlikely for that to happen in normal circumstances (as long as we
handle obvious stuff like EINTR) - and added complexity to avoid it
seems very unlikely to be tested.
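
Something like this at the bottom of the stack, say; the wrapper is made up,
only ftruncate/EINTR/elog are real:

    static void
    truncate_file_or_panic(int fd, off_t nbytes, const char *path)
    {
        int     rc;

        /* retry the one failure mode we expect to be transient */
        do
        {
            rc = ftruncate(fd, nbytes);
        } while (rc < 0 && errno == EINTR);

        if (rc < 0)
            elog(PANIC, "could not truncate file \"%s\" to %lld bytes: %m",
                 path, (long long) nbytes);
    }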

Greetings,

Andres Freund


Re: Sketch of a fix for that truncation data corruption issue

From: Stephen Frost
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2018-12-10 15:38:55 -0500, Tom Lane wrote:
> > Also, I'm not entirely sure whether there's anything in our various
> > replication logic that's dependent on vacuum truncation taking AEL.
> > Offhand I'd expect the reduced use of AEL to be a plus, but maybe
> > I'm missing something.
>
> It'd be a *MAJOR* plus.  One of the biggest operational headaches for
> using a HS node for querying is that there'll often be conflicts due to
> vacuum truncating relations (which logs an AEL), even if
> hot_standby_feedback is used.  There's been multiple proposals to
> allow disabling truncations just because of that.

Huge +1 from me here, we've seen this too.  Getting rid of the conflict
when using a HS node for querying would be fantastic.

Thanks!

Stephen


Re: Sketch of a fix for that truncation data corruption issue

From: Sergei Kornilov
Hello

>>  > Also, I'm not entirely sure whether there's anything in our various
>>  > replication logic that's dependent on vacuum truncation taking AEL.
>>  > Offhand I'd expect the reduced use of AEL to be a plus, but maybe
>>  > I'm missing something.
>>
>>  It'd be a *MAJOR* plus. One of the biggest operational headaches for
>>  using a HS node for querying is that there'll often be conflicts due to
>>  vacuum truncating relations (which logs an AEL), even if
>>  hot_standby_feedback is used. There's been multiple proposals to
>>  allow disabling truncations just because of that.
>
> Huge +1 from me here, we've seen this too. Getting rid of the conflict
> when using a HS node for querying would be fantastic.

One small ping... This topic has been inactive for a long time, but this
would be a great improvement for a future release.  I observe such problems
from time to time... (so far, we at least have a workaround with the
vacuum_truncate option)

regards, Sergei



Re: Sketch of a fix for that truncation data corruption issue

From: Alvaro Herrera
On 2018-Dec-11, Tom Lane wrote:

> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Dec 11, 2018 at 3:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Anyway, if your assumption is that WAL replay must yield bit-for-bit
> >> the same state of the not-truncated pages that the master would have,
> >> then I doubt we can make this work.  In that case we're back to the
> >> type of solution you rejected eight years ago, where we have to write
> >> out pages before truncating them away.
> 
> > How much have you considered the possibility that my rejection of that
> > approach was a stupid and wrong-headed idea?  I'm not sure I still
> > believe that not writing those buffers would have a meaningful
> > performance cost.
> 
> Well, if *you're* willing to entertain that possibility, I'm on board.
> That would certainly lead to a much simpler, and probably back-patchable,
> fix.

Hello,

Has this problem been fixed?  I was under the impression that it had
been, but I spent some 20 minutes now looking for code, commits, or
patches in the archives, and I can't find anything relevant.  Maybe it
was fixed in some different way that's not so obviously connected?

Thanks,

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: Sketch of a fix for that truncation data corruption issue

From: Tom Lane
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Has this problem been fixed?  I was under the impression that it had
> been, but I spent some 20 minutes now looking for code, commits, or
> patches in the archives, and I can't find anything relevant.  Maybe it
> was fixed in some different way that's not so obviously connected?

As far as I can see from a quick look at the code, nothing has been
done that would alleviate this problem: smgrtruncate still calls
DropRelationBuffers before truncating.

Have you run into a new case of it?  I don't recall having seen
many field complaints about this since 2018.

            regards, tom lane