Thread: Is pg_control file crashsafe?


Is pg_control file crashsafe?

From
Alex Ignatov
Date:
Hello everyone!
We have an issue with a truncated pg_control file on Windows after a power failure.
My questions are:
1) Is pg_control protected from, say, a power crash or a partial write?
2) How does PG update pg_control? By writing to it in place, or by writing to a temp file and then renaming that file to pg_control so that the update is atomic?
3) Can PG keep multiple copies of pg_control to be more fault tolerant?

PS During some experiments we found that there is currently no way to do crash recovery with a "restored" version of pg_control (based on some manipulations with pg_resetxlog).
Only by running pg_resetxlog with its parameters set to values taken from the WAL file (via pg_xlogdump) could we at least start PG, and we saw that the PG state was as of the last checkpoint. But we have no real confidence that PG is in a consistent state (the pg_resetxlog docs warn about this too).


Alex Ignatov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: Is pg_control file crashsafe?

From
Bruce Momjian
Date:
On Thu, Apr 28, 2016 at 09:58:00PM +0000, Alex Ignatov wrote:
> Hello everyone!
> We have some issue with truncated pg_control file on Windows after power
> failure.
> My questions is : 
> 1) Is pg_control protected from say , power crash or partial write? 
> 2) How PG update pg_control? By writing in it or writing in some temp file and
> after that rename it to pg_control to be atomic?

We write pg_control in a single write() OS call:
   if (write(fd, buffer, PG_CONTROL_SIZE) != PG_CONTROL_SIZE)

> 3) Can PG have  multiple pg_control copy to be more fault tolerant?
> 
> PS During some experiments we found that at present time there is no any method
> to do crash recovery with "restored" version of pg_control (based on some
> manipulations with pg_resetxlog ).
>  Only by using pg_resetxlog and setting it parameters to values taken from wal
> file (pg_xlogdump)we can at least start PG and saw that PG state is at the
> moment of last check point. But we have no real confidence that PG is in
> consistent state(also docs on pg_resetxlogs told us about it too)

We have talked about improving the reliability of pg_control, but
failures are so rare we have never done anything to improve it.  I know
Tatsuo has talked about making pg_control more reliable, so I am CC'ing
him.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Is pg_control file crashsafe?

From
Alex Ignatov
Date:

On 01.05.2016 0:55, Bruce Momjian wrote:
> On Thu, Apr 28, 2016 at 09:58:00PM +0000, Alex Ignatov wrote:
>> Hello everyone!
>> We have some issue with truncated pg_control file on Windows after power
>> failure.
>> My questions is :
>> 1) Is pg_control protected from say , power crash or partial write?
>> 2) How PG update pg_control? By writing in it or writing in some temp file and
>> after that rename it to pg_control to be atomic?
> We write pg_controldata in one write() OS call:
>
>      if (write(fd, buffer, PG_CONTROL_SIZE) != PG_CONTROL_SIZE)
>
>> 3) Can PG have  multiple pg_control copy to be more fault tolerant?
>>
>> PS During some experiments we found that at present time there is no any method
>> to do crash recovery with "restored" version of pg_control (based on some
>> manipulations with pg_resetxlog ).
>>   Only by using pg_resetxlog and setting it parameters to values taken from wal
>> file (pg_xlogdump)we can at least start PG and saw that PG state is at the
>> moment of last check point. But we have no real confidence that PG is in
>> consistent state(also docs on pg_resetxlogs told us about it too)
> We have talked about improving the reliability of pg_control, but
> failures are so rare we have never done anything to improve it.  I know
> Tatsuo has talked about making pg_control more reliable, so I am CC'ing
> him.
>
Oh! Good. Thank you!
It is rare, but as we have now seen, it happens to us too. One of our
customers had this issue last week =)

I think that rename can help a little bit. At least on some filesystems
it is an atomic operation.

-- 
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: Is pg_control file crashsafe?

From
Tom Lane
Date:
Alex Ignatov <a.ignatov@postgrespro.ru> writes:
> I think that rename can help a little bit. At least on some FS it is 
> atomic operation.

Writing a single sector ought to be atomic too.  I'm very skeptical that
it'll be an improvement to just move the risk from one filesystem
operation to another; especially not to one where there's not even a
terribly portable way to request fsync.
        regards, tom lane



Re: Is pg_control file crashsafe?

From
Andres Freund
Date:
Hi,

On 2016-04-28 21:58:00 +0000, Alex Ignatov wrote:
> We have some issue with truncated pg_control file on Windows after
> power failure.My questions is : 1) Is pg_control protected from say ,
> power crash or partial write?

It should be. I think to make progress on this thread we're going to
need a bit more detail about the exact corruption. Did the length of
the file change? Did the checksum fail? Did you just observe contents
that were too old?

Greetings,

Andres Freund



Re: Is pg_control file crashsafe?

From
Alex Ignatov
Date:


On 03.05.2016 2:21, Andres Freund wrote:
> Hi,
>
> On 2016-04-28 21:58:00 +0000, Alex Ignatov wrote:
>> We have some issue with truncated pg_control file on Windows after
>> power failure.My questions is : 1) Is pg_control protected from say ,
>> power crash or partial write?
>
> It should be. I think to make progress on this thread we're going to
> need a bit more details about the exact corruption. Was the length of
> the file change? Did the checksum fail? Did you just observe too old
> contents?
>
> Greetings,
>
> Andres Freund
>
>

The length was 0 bytes after the crash. This was Windows with NTFS on SSDs
in RAID 1. The file was zeroed after the power loss.

Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Is pg_control file crashsafe?

From
Alex Ignatov
Date:

On 03.05.2016 2:17, Tom Lane wrote:
> Alex Ignatov <a.ignatov@postgrespro.ru> writes:
>> I think that rename can help a little bit. At least on some FS it is
>> atomic operation.
>
> Writing a single sector ought to be atomic too.  I'm very skeptical that
> it'll be an improvement to just move the risk from one filesystem
> operation to another; especially not to one where there's not even a
> terribly portable way to request fsync.
>
>             regards, tom lane
>
>
pg_control is 8k long (I think that is the length of one page in the
default PG compile settings).
I also think that an 8k write can be atomic. Even if writing one sector
is atomic, nobody can say which sector of the 8k pg_control write gets
written first. It could be the last sector or, say, sector number 10 of
16. That is why I mentioned renaming from a tmp file to pg_control.
Renaming is usually an atomic operation in a filesystem. After a power
loss we would then have either the old version of pg_control or the new
version of it, but not a torn pg_control file.
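
The write-to-temp-file-then-rename idea described here can be sketched in C roughly as follows. This is only an illustration under POSIX assumptions, not PostgreSQL code: the helper name `replace_file_atomically` and the `.tmp` suffix are hypothetical, and the final directory fsync is the step for which there is no fully portable equivalent (it does not carry over to Windows as written).

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): replace file `name` inside
 * directory `dir` with `len` bytes from `buf`, using a temp file plus rename.
 * Returns 0 on success, -1 on failure.
 */
static int
replace_file_atomically(const char *dir, const char *name,
                        const void *buf, size_t len)
{
    char tmp_path[1024];
    char final_path[1024];
    int  fd;
    int  dir_fd;

    assert(dir != NULL && name != NULL && buf != NULL);

    snprintf(tmp_path, sizeof(tmp_path), "%s/%s.tmp", dir, name);
    snprintf(final_path, sizeof(final_path), "%s/%s", dir, name);

    /* 1. Write the complete new contents to a temporary file and flush it. */
    fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0)
    {
        unlink(tmp_path);
        return -1;
    }

    /* 2. Swap the new file into place; readers see the old or the new file. */
    if (rename(tmp_path, final_path) != 0)
    {
        unlink(tmp_path);
        return -1;
    }

    /* 3. Flush the directory entry so the rename itself survives a crash. */
    dir_fd = open(dir, O_RDONLY);
    if (dir_fd < 0)
        return -1;
    if (fsync(dir_fd) != 0)
    {
        close(dir_fd);
        return -1;
    }
    return close(dir_fd);
}
```

Note that the sketch trades one in-place write for three filesystem operations (temp-file write, rename, directory flush), each of which has to be durable for the scheme to help.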


Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Is pg_control file crashsafe?

From
Amit Kapila
Date:
On Wed, May 4, 2016 at 4:02 PM, Alex Ignatov <a.ignatov@postgrespro.ru> wrote:
> On 03.05.2016 2:17, Tom Lane wrote:
>> Alex Ignatov <a.ignatov@postgrespro.ru> writes:
>>> I think that rename can help a little bit. At least on some FS it is
>>> atomic operation.
>>
>> Writing a single sector ought to be atomic too.  I'm very skeptical that
>> it'll be an improvement to just move the risk from one filesystem
>> operation to another; especially not to one where there's not even a
>> terribly portable way to request fsync.
>>
>>                         regards, tom lane
>
> pg_control is 8k long(i think it is legth of one page in default PG
> compile settings).
> I also think that 8k recording can be atomic. Even if recording of one
> sector is atomic nobody can say about what sector from 8k record of
> pg_control  should be written first. It can be last sector or say sector
> number 10 from 16.

The actual data written is always sizeof(ControlFileData), which should be less than one sector.  I think we can get a torn write for pg_control only if, while writing and fsyncing, the filesystem maps that data to different sectors.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Is pg_control file crashsafe?

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Wed, May 4, 2016 at 4:02 PM, Alex Ignatov <a.ignatov@postgrespro.ru>
> wrote:
>> On 03.05.2016 2:17, Tom Lane wrote:
>>> Writing a single sector ought to be atomic too.

>> pg_control is 8k long(i think it is legth of one page in default PG
>> compile settings).

> The actual data written is always sizeof(ControlFileData) which should be
> less than one sector.

Yes.  We don't care what happens to the rest of the file as long as the
first sector's worth is updated atomically.  See the comments for
PG_CONTROL_SIZE and the code in ReadControlFile/WriteControlFile.

We could change to a different PG_CONTROL_SIZE pretty easily, and there's
certainly room to argue that reducing it to 512 or 1024 would be more
efficient.  I think the motivation for setting it at 8K was basically
"we're already assuming that 8K writes are efficient, so let's assume
it here too".  But since the file is only written once per checkpoint,
efficiency is not really a key selling point anyway.  If you could make
an argument that some other size would reduce the risk of failures,
it would be interesting --- but I suspect any such argument would be
very dependent on the quirks of a specific file system.

One point worth considering is that on most file systems, rewriting
a fraction of a page is *less* efficient than rewriting a full page,
because the kernel first has to read in the old contents to fill
the disk buffer it's going to partially overwrite with new data.
This motivates against trying to reduce the write size too much.
        regards, tom lane
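
The scheme described in this message (a small checksummed struct at the start of a zero-padded, fixed-size file, written in one call and verified on read) can be illustrated with the following C sketch. All names here (`DemoControlData`, `demo_write_control`, `demo_read_control`, the 512-byte sector size) are assumptions made for the example; the real logic lives in ReadControlFile/WriteControlFile and uses PostgreSQL's own CRC routines rather than the plain CRC-32 shown.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define DEMO_CONTROL_SIZE 8192  /* padded on-disk size, in the spirit of PG_CONTROL_SIZE */
#define DEMO_SECTOR_SIZE  512   /* assumed smallest unit the disk writes atomically */

/* Hypothetical stand-in for ControlFileData; the real struct has many more fields. */
typedef struct DemoControlData
{
    uint64_t system_identifier;
    uint64_t checkpoint_lsn;
    uint32_t crc;               /* last field: covers every byte before it */
} DemoControlData;

/* Plain bitwise CRC-32, used here only for illustration. */
static uint32_t
demo_crc32(const unsigned char *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;

    while (n--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Write the struct, zero-padded to the full file size, in one write() call. */
static int
demo_write_control(const char *path, DemoControlData *data)
{
    unsigned char buffer[DEMO_CONTROL_SIZE];
    int           fd;

    /* Only the first sector matters, so the payload must fit inside it. */
    assert(sizeof(DemoControlData) <= DEMO_SECTOR_SIZE);

    data->crc = demo_crc32((const unsigned char *) data,
                           offsetof(DemoControlData, crc));
    memset(buffer, 0, sizeof(buffer));
    memcpy(buffer, data, sizeof(*data));

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, buffer, sizeof(buffer)) != (ssize_t) sizeof(buffer) ||
        fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Read the struct back; reject short files and checksum mismatches. */
static int
demo_read_control(const char *path, DemoControlData *out)
{
    int     fd = open(path, O_RDONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = read(fd, out, sizeof(*out));
    close(fd);
    if (n != (ssize_t) sizeof(*out))
        return -1;
    if (out->crc != demo_crc32((const unsigned char *) out,
                               offsetof(DemoControlData, crc)))
        return -1;
    return 0;
}
```

In this sketch a damaged first sector or a zero-length file is detected at read time, but not repaired; detection is all the checksum provides.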



Re: Is pg_control file crashsafe?

From
Amit Kapila
Date:
On Wed, May 4, 2016 at 8:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Wed, May 4, 2016 at 4:02 PM, Alex Ignatov <a.ignatov@postgrespro.ru>
> > wrote:
> >> On 03.05.2016 2:17, Tom Lane wrote:
> >>> Writing a single sector ought to be atomic too.
>
> >> pg_control is 8k long(i think it is legth of one page in default PG
> >> compile settings).
>
> > The actual data written is always sizeof(ControlFileData) which should be
> > less than one sector.
>
> Yes.  We don't care what happens to the rest of the file as long as the
> first sector's worth is updated atomically.  See the comments for
> PG_CONTROL_SIZE and the code in ReadControlFile/WriteControlFile.
>
> We could change to a different PG_CONTROL_SIZE pretty easily, and there's
> certainly room to argue that reducing it to 512 or 1024 would be more
> efficient.  I think the motivation for setting it at 8K was basically
> "we're already assuming that 8K writes are efficient, so let's assume
> it here too".  But since the file is only written once per checkpoint,
> efficiency is not really a key selling point anyway.  If you could make
> an argument that some other size would reduce the risk of failures,
> it would be interesting --- but I suspect any such argument would be
> very dependent on the quirks of a specific file system.
>

How about using 512 bytes as the write size and performing direct writes for the control file rather than going through the OS buffer cache?   Alex, is the issue reproducible? If we try to solve it in some way, we need a way to test the fix as well.
 
>
> One point worth considering is that on most file systems, rewriting
> a fraction of a page is *less* efficient than rewriting a full page,
> because the kernel first has to read in the old contents to fill
> the disk buffer it's going to partially overwrite with new data.
> This motivates against trying to reduce the write size too much.
>

Yes, you are quite right, and I observed that recently during my work on WAL Re-Writes [1].  However, I think that won't be an issue if we use direct writes for the control file.


[1] - http://www.postgresql.org/message-id/CAA4eK1+=O33dZZ=jBtjXBFyD67R5dLcqFyOMj4f-qmFXBP1OOQ@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Is pg_control file crashsafe?

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> How about using 512 bytes as a write size and perform direct writes rather
> than going via OS buffer cache for control file?

Wouldn't that fail outright under a lot of implementations of direct write;
ie the request needs to be page-aligned, for some not-very-determinate
value of page size?

To repeat, I'm pretty hesitant to change this logic.  While this is not
the first report we've ever heard of loss of pg_control, I believe I could
count those reports without running out of fingers on one hand --- and
that's counting since the last century. It will take quite a lot of
evidence to convince me that some other implementation will be more
reliable.  If you just come and present a patch to use direct write, or
rename, or anything else for that matter, I'm going to reject it out of
hand unless you provide very strong evidence that it's going to be more
reliable than the current code across all the systems we support.
        regards, tom lane



Re: Is pg_control file crashsafe?

From
Thomas Munro
Date:
On Thu, May 5, 2016 at 4:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> How about using 512 bytes as a write size and perform direct writes rather
>> than going via OS buffer cache for control file?
>
> Wouldn't that fail outright under a lot of implementations of direct write;
> ie the request needs to be page-aligned, for some not-very-determinate
> value of page size?
>
> To repeat, I'm pretty hesitant to change this logic.  While this is not
> the first report we've ever heard of loss of pg_control, I believe I could
> count those reports without running out of fingers on one hand --- and
> that's counting since the last century. It will take quite a lot of
> evidence to convince me that some other implementation will be more
> reliable.  If you just come and present a patch to use direct write, or
> rename, or anything else for that matter, I'm going to reject it out of
> hand unless you provide very strong evidence that it's going to be more
> reliable than the current code across all the systems we support.

I'm not sure how those ideas address the reported problem anyway: the
*length* was unexpectedly zero after a crash.  UpdateControlFile
doesn't change the length of the control file, since it doesn't
specify O_TRUNC or O_APPEND and it always writes the same size.  So it
seems like a pretty weird failure mode affecting filesystem metadata
(which I wouldn't expect to change anyway, but I would expect to be
journaled if it did), not a file-contents-atomicity problem.  Whether
or not the page cache is involved in a write to a preallocated file
doesn't seem relevant to a case of unexpected truncation, and the
atomic rename trick doesn't seem relevant either unless someone with
expert knowledge of NTFS could explain how a crash could lead to
truncation in the first place, and how rename would help.
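
The point about file length can be illustrated with a small C sketch: a file that is opened without O_TRUNC or O_APPEND and overwritten with a buffer of its full, fixed size keeps the same length after every update. The helper names `demo_create_file` and `demo_overwrite_in_place` and the 8192-byte size are assumptions for the example, not PostgreSQL code.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_FILE_SIZE 8192     /* fixed size, preallocated once */

/* Create the file once at its full size (roughly what happens at initdb time). */
static int
demo_create_file(const char *path)
{
    unsigned char buffer[DEMO_FILE_SIZE];
    int           fd;

    memset(buffer, 0, sizeof(buffer));
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, buffer, sizeof(buffer)) != (ssize_t) sizeof(buffer) ||
        fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/*
 * Overwrite the existing file in place and return its size afterwards,
 * or -1 on error.
 */
static off_t
demo_overwrite_in_place(const char *path, unsigned char fill)
{
    unsigned char buffer[DEMO_FILE_SIZE];
    struct stat   st;
    int           fd;

    assert(path != NULL);
    memset(buffer, fill, sizeof(buffer));

    /* No O_TRUNC and no O_APPEND: the write lands on offset 0 of old data. */
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (write(fd, buffer, sizeof(buffer)) != (ssize_t) sizeof(buffer) ||
        fsync(fd) != 0 ||
        fstat(fd, &st) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return st.st_size;
}
```

Because the update path never asks the filesystem to change the length, a zero-length file after a crash points at something other than the application's write pattern.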

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Is pg_control file crashsafe?

From
Amit Kapila
Date:
On Thu, May 5, 2016 at 11:52 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Thu, May 5, 2016 at 4:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Amit Kapila <amit.kapila16@gmail.com> writes:
> >> How about using 512 bytes as a write size and perform direct writes rather
> >> than going via OS buffer cache for control file?
> >
> > Wouldn't that fail outright under a lot of implementations of direct write;
> > ie the request needs to be page-aligned, for some not-very-determinate
> > value of page size?
> >

Right, it should be at least the page size.

>
> > To repeat, I'm pretty hesitant to change this logic.  While this is not
> > the first report we've ever heard of loss of pg_control, I believe I could
> > count those reports without running out of fingers on one hand --- and
> > that's counting since the last century. It will take quite a lot of
> > evidence to convince me that some other implementation will be more
> > reliable.  If you just come and present a patch to use direct write, or
> > rename, or anything else for that matter, I'm going to reject it out of
> > hand unless you provide very strong evidence that it's going to be more
> > reliable than the current code across all the systems we support.
>
> I'm not sure how those ideas address the reported problem anyway: the
> *length* was unexpectedly zero after a crash.  UpdateControlFile
> doesn't change the length of the control file, since it doesn't
> specify O_TRUNC or O_APPEND and it always writes the same size.  So it
> seems like a pretty weird failure mode affecting filesystem metadata
> (which I wouldn't expect to change anyway, but I would expect to be
> journaled if it did), not a file-contents-atomicity problem.  Whether
> or not the page cache is involved in a write to a preallocated file
> doesn't seem relevant to a case of unexpected truncation, and the
> atomic rename trick doesn't seem relevant either unless someone with
> expert knowledge of NTFS could explain how a crash could lead to
> truncation in the first place, and how rename would help.
>

I think the real reason for the truncation is not known, or at least has not been discussed here.  It seems to me that the ideas are being discussed on the mere speculation that the current way of writing can lead to corruption in certain cases.  I think it would be better to first dig into the actual cause of the problem.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Is pg_control file crashsafe?

From
Andres Freund
Date:
On 2016-05-05 00:32:29 -0400, Tom Lane wrote:
> To repeat, I'm pretty hesitant to change this logic.  While this is not
> the first report we've ever heard of loss of pg_control, I believe I could
> count those reports without running out of fingers on one hand --- and
> that's counting since the last century. It will take quite a lot of
> evidence to convince me that some other implementation will be more
> reliable.  If you just come and present a patch to use direct write, or
> rename, or anything else for that matter, I'm going to reject it out of
> hand unless you provide very strong evidence that it's going to be more
> reliable than the current code across all the systems we support.

https://lwn.net/SubscriberLink/686150/9697c313930fbe99/ :

"Jeff Moyer pointed out that sector tearing can happen on block devices
like SSDs, which is not what users expect. "
"Actually, what I said was that sector tearing doesn't usually happen on
SSDs due to the nature of the FTL. Traditional storage, however, never
guaranteed sector atomicity, but it usually does provide it."

FWIW, at the LSF/MM session Robert and I attended I talked to a Seagate
and a WD (IIRC) employee, and their answer echoed the second comment
from above. It's unlikely, but entirely possible to get torn sectors
after power outages. What's worse, if you get one it's entirely possible
that future *reads* will not just return torn contents, but an error.

Greetings,

Andres Freund



Re: Is pg_control file crashsafe?

From
Greg Stark
Date:
On 5 May 2016 12:32 am, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>
> To repeat, I'm pretty hesitant to change this logic.  While this is not
> the first report we've ever heard of loss of pg_control, I believe I could
> count those reports without running out of fingers on one hand --- and
> that's counting since the last century. It will take quite a lot of
> evidence to convince me that some other implementation will be more
> reliable.  If you just come and present a patch to use direct write, or
> rename, or anything else for that matter, I'm going to reject it out of
> hand unless you provide very strong evidence that it's going to be more
> reliable than the current code across all the systems we support.

One thing we could do without much worry of being less reliable would be to
keep two copies of pg_control. Write one, fsync, then write to the other
and fsync that one.

Oracle keeps a copy of the old control file so that you can always go back
to an older version if a hardware or software bug corrupts it. But they keep
a lot more data in their control file and they can be quite large.
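
The two-copy idea ("write one, fsync, then write to the other and fsync that one") might look roughly like the C sketch below. The record layout, the toy checksum, and the names `demo_write_both` and `demo_read_either` are hypothetical and chosen only to keep the example short; nothing here is existing PostgreSQL code.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical miniature control record. */
typedef struct DemoRecord
{
    uint64_t checkpoint_lsn;
    uint32_t checksum;          /* toy checksum, not a real CRC */
} DemoRecord;

static uint32_t
demo_checksum(uint64_t v)
{
    return ((uint32_t) v ^ (uint32_t) (v >> 32)) ^ 0xA5A5A5A5u;
}

/* Write one copy and flush it before returning. */
static int
demo_write_one(const char *path, uint64_t lsn)
{
    DemoRecord rec;
    int        fd;

    memset(&rec, 0, sizeof(rec));
    rec.checkpoint_lsn = lsn;
    rec.checksum = demo_checksum(lsn);

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, &rec, sizeof(rec)) != (ssize_t) sizeof(rec) ||
        fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Read one copy; fail on a short (e.g. zero-length) file or bad checksum. */
static int
demo_read_one(const char *path, uint64_t *lsn)
{
    DemoRecord rec;
    int        fd = open(path, O_RDONLY);
    ssize_t    n;

    if (fd < 0)
        return -1;
    n = read(fd, &rec, sizeof(rec));
    close(fd);
    if (n != (ssize_t) sizeof(rec) ||
        rec.checksum != demo_checksum(rec.checkpoint_lsn))
        return -1;
    *lsn = rec.checkpoint_lsn;
    return 0;
}

/* The second copy is touched only after the first is safely on disk. */
static int
demo_write_both(const char *primary, const char *backup, uint64_t lsn)
{
    assert(primary != NULL && backup != NULL);
    if (demo_write_one(primary, lsn) != 0)
        return -1;
    return demo_write_one(backup, lsn);
}

/* Prefer the primary copy; fall back to the backup if it is unusable. */
static int
demo_read_either(const char *primary, const char *backup, uint64_t *lsn)
{
    if (demo_read_one(primary, lsn) == 0)
        return 0;
    return demo_read_one(backup, lsn);
}
```

With this ordering at most one copy is being modified at any moment, so a crash during either write should leave the other copy intact. Whether a fallback copy that is one update behind is sufficient for recovery is a separate question that the sketch does not address.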

Re: Is pg_control file crashsafe?

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> One thing we could do without much worry of being less reliable would be to
> keep two copies of pg_control. Write one, fsync, then write to the other
> and fsync that one.

Hmm, interesting thought.  Without knowing more about the filesystem
problem that the OP had, it's hard to tell whether this would have saved
us; but in principle it sounds like it would be more reliable.
        regards, tom lane



Re: Is pg_control file crashsafe?

From
Alex Ignatov
Date:
On 06.05.2016 0:42, Greg Stark wrote:
> On 5 May 2016 12:32 am, "Tom Lane" <tgl@sss.pgh.pa.us
> <mailto:tgl@sss.pgh.pa.us>> wrote:
>  >
>  > To repeat, I'm pretty hesitant to change this logic.  While this is not
>  > the first report we've ever heard of loss of pg_control, I believe I
> could
>  > count those reports without running out of fingers on one hand --- and
>  > that's counting since the last century. It will take quite a lot of
>  > evidence to convince me that some other implementation will be more
>  > reliable.  If you just come and present a patch to use direct write, or
>  > rename, or anything else for that matter, I'm going to reject it out of
>  > hand unless you provide very strong evidence that it's going to be more
>  > reliable than the current code across all the systems we support.
>
> One thing we could do without much worry of being less reliable would be
> to keep two copies of pg_control. Write one, fsync, then write to the
> other and fsync that one.
>
> Oracle keeps a copy of the old control file so that you can always go
> back to an older version if a hardware or software bug currupts it. But
> they keep a lot more data in their control file and they can be quite large.
>
Oracle can create more than one copy of the control file. The copies are
identical, not an old copy and a current one. And their advice is simply to
store these copies on separate storage to be more fault tolerant.

PS By the way, in point 3 of my initial post about "is pg_control safe" I
wrote some thoughts about multiple copies of the pg_control file. Glad to
see that we share the same view on this issue.



Re: Is pg_control file crashsafe?

From
Alex Ignatov
Date:
On 05.05.2016 7:16, Amit Kapila wrote:
> On Wed, May 4, 2016 at 8:03 PM, Tom Lane <tgl@sss.pgh.pa.us
> <mailto:tgl@sss.pgh.pa.us>> wrote:
>  >
>  > Amit Kapila <amit.kapila16@gmail.com
> <mailto:amit.kapila16@gmail.com>> writes:
>  > > On Wed, May 4, 2016 at 4:02 PM, Alex Ignatov
> <a.ignatov@postgrespro.ru <mailto:a.ignatov@postgrespro.ru>>
>  > > wrote:
>  > >> On 03.05.2016 2:17, Tom Lane wrote:
>  > >>> Writing a single sector ought to be atomic too.
>  >
>  > >> pg_control is 8k long(i think it is legth of one page in default PG
>  > >> compile settings).
>  >
>  > > The actual data written is always sizeof(ControlFileData) which
> should be
>  > > less than one sector.
>  >
>  > Yes.  We don't care what happens to the rest of the file as long as the
>  > first sector's worth is updated atomically.  See the comments for
>  > PG_CONTROL_SIZE and the code in ReadControlFile/WriteControlFile.
>  >
>  > We could change to a different PG_CONTROL_SIZE pretty easily, and there's
>  > certainly room to argue that reducing it to 512 or 1024 would be more
>  > efficient.  I think the motivation for setting it at 8K was basically
>  > "we're already assuming that 8K writes are efficient, so let's assume
>  > it here too".  But since the file is only written once per checkpoint,
>  > efficiency is not really a key selling point anyway.  If you could make
>  > an argument that some other size would reduce the risk of failures,
>  > it would be interesting --- but I suspect any such argument would be
>  > very dependent on the quirks of a specific file system.
>  >
>
> How about using 512 bytes as a write size and perform direct writes
> rather than going via OS buffer cache for control file?   Alex, is the
> issue reproducible (to ensure that if we try to solve it in some way, do
> we have way to test it as well)?
>
>  >
>  > One point worth considering is that on most file systems, rewriting
>  > a fraction of a page is *less* efficient than rewriting a full page,
>  > because the kernel first has to read in the old contents to fill
>  > the disk buffer it's going to partially overwrite with new data.
>  > This motivates against trying to reduce the write size too much.
>  >
>
> Yes, you are very much right and I have observed that recently during my
> work on WAL Re-Writes [1].  However, I think that won't be the issue if
> we use direct writes for control file.
>
>
> [1] -
> http://www.postgresql.org/message-id/CAA4eK1+=O33dZZ=jBtjXBFyD67R5dLcqFyOMj4f-qmFXBP1OOQ@mail.gmail.com
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com <http://www.enterprisedb.com/>

Hi!
No, the issue happened only once. Also, our attempts to reproduce it have
not been successful so far.