WAL Recovery Bug in 7.2.3

Поиск
Список
Период
Сортировка
От Josh Berkus
Тема WAL Recovery Bug in 7.2.3
Дата
Msg-id 200301161355.26027.josh@agliodbs.com
обсуждение исходный текст
Список pgsql-bugs
Affects: 7.2.3 (possibly down to 7.1.0)
Frequency:  Very Rare
Effect When Occurring:   Database corruption
Difficulty of Fix:  Trivial
Certianty of Diagnosis:  about 50% according to Tom

This bug was analyzed by Tom Lane; I'm just writing it up.

In 7.2.3 (and possibly in earlier versions) these two rows in xlog.c are ou=
t=20
of order:
        FlushBufferPool();
        CheckPointCLOG();

Per Tom's analysis:
"7.2.* does checkpoint operations in the wrong order: CreateCheckPoint
does
        FlushBufferPool();
        CheckPointCLOG();
        ... create and write checkpoint WAL record ...
The reason this is the wrong order is that CheckPointCLOG() only issues
write()s of dirty pg_clog pages; it does not fsync them.  Thus, it is
possible that the checkpoint WAL record will be flushed to disk while
the clog page writes are still hanging about in kernel disk cache.
If there is a system crash before the kernel gets around to sync'ing
the dirty clog pages to disk, then we lose --- on restart, the WAL logic
will only replay WAL entries after the latest checkpoint, and so any
transaction commits occurring before the checkpoint would fail to be
re-marked in pg_clog."

As an error scenario, this seems rather farfetched; Postgres would have to =
be=20
killed, a second time, while in recovery mode at a moment between=20
FlushBufferPool() and CheckPointCLOG().   A remote enough possibility to=20
ignore.   Except that it seems to have happened twice, to two different=20
users.

The scenario under which this bug becomes critical is this:

1) In the middle of a large UPDATE statement, the Postgres server loses pow=
er=20
from a general power outage or local building short.
2) This server is not buffered by a UPS.
3) Due to work on the power system or weather damage, power comes back on,=
=20
then off after a few minutes, cycling off-and-on 4-5 times (this is not=20
farfetched; during the California "power crisis" I saw it happen several=20
times).
4) This has the possible effect of repeatedly downing Postgres while it is =
in=20
recovery mode.=20=20=20
5) Sooner or later, the up-down effect "gets lucky" and postgres goes down=
=20
while FlushBufferCache() is finishing up.
6) The user ends up with two versions of one or more of their records marke=
d=20
as valid by Postgres.  Per Tom's analysis of one such problem:

"Well, here's what I've found so far.  The two tuples in question have
header data like so (as printed by pg_filedump):

 Item  28 -- Length:  248  Offset: 7944 (0x1f08)  Flags: USED
  OID: 487894  CID: min(0) max(0)  XID: min(9776912) max(17920315)
  Block Id: 4664  linp Index: 1   Attributes: 31   Size: 36
  infomask: 0x0903 (HASNULL|HASVARLENA|XMIN_COMMITTED|XMAX_INVALID)=20
  t_bits: [0]: 0xff [1]: 0xfe [2]: 0xdf [3]: 0x7d=20

 Item   2 -- Length:  248  Offset: 7944 (0x1f08)  Flags: USED
  OID: 487894  CID: min(0) max(0)  XID: min(9777615) max(10180711)
  Block Id: 4666  linp Index: 1   Attributes: 31   Size: 36
  infomask: 0x2903 (HASNULL|HASVARLENA|XMIN_COMMITTED|XMAX_INVALID|UPDATED)=
=20
  t_bits: [0]: 0xff [1]: 0xfe [2]: 0xdf [3]: 0x7d=20

What appears to have happened is this: transaction 9776912 created the
row initially (the first of the two items is evidently the first
incarnation of the row, since it does not have HEAP_UPDATED set).  A
little while later, transaction 9777615 updated the row, creating the
second tuple.  Our problem is that both tuples appear to be committed
good --- both have XMIN_COMMITTED set.

Digging into the pg_clog data, I find that 9776912 is shown as
"committed", as expected.  But 9777615 is shown as "in progress" --- the
clog entry has not been marked as either committed or aborted!"


Since this bug has been fixed in 7.3.1, it's not cirtical to release a patc=
h.=20=20=20
HOWEVER, given the triviality of the fix ... simply swapping those two line=
s=20
in xlog.c ... does everyone think it would be a good idea to post a notice =
on=20
the lists of the issue and the fix?

While it easy enough to tell users, "Upgrade, or get a UPS" this is not=20
practical for everyone.

And is there any potential issue with swapping those two lines?


--=20
-Josh Berkus
 Aglio Database Solutions
 San Francisco

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Bug #876: Bugs Inserts Arrays
Следующее
От: "bigapple"
Дата:
Сообщение: permission leak