Обсуждение: WAL shortcoming causes missing-pg_clog-segment problem

Поиск
Список
Период
Сортировка

WAL shortcoming causes missing-pg_clog-segment problem

От
Tom Lane
Дата:
I think I've identified a primary cause for the "no such pg_clog file"
problem that we've seen reported several times.

A look at htup.h shows that the WAL only stores the low 8 bits of a
tuple's t_infomask (see xl_heap_header struct).  There is some fooling
around in heapam.c's WAL redo routines to try to reconstitute some of
the high-order bits, for example this:
           htup->t_infomask = HEAP_XMAX_INVALID | xlhdr.mask;

But this is implicitly assuming that we can reconstruct the
XMIN_COMMITTED bit at will.  That was true when the WAL code was
written, but with 7.2's ability to recycle allegedly-no-longer-needed
pg_clog data, we cannot simply drop commit status bits.

The only scenario I've been able to identify in which this actually
causes a failure is when VACUUM FULL moves an old tuple and then shortly
afterwards (before the next checkpoint) there is a crash.  Post-crash,
the tuple move will be redone from WAL, and the moved tuple will be
inserted with zeroed-out commit status bits.  When we next examine the
tuple, we have to try to retrieve its commit status from pg_clog ... but
it's not there anymore.

As far as I can see, the only realistic solution is to store the full 16
bits of t_infomask in the WAL.  We could do this without increasing WAL
size by dropping the t_hoff field from xl_heap_header --- t_hoff is
computable given the number of attributes and the HASNULL/HASOID bits,
both of which are available.  (Actually, we could save some space now
by getting rid of t_oid in xl_heap_header; it's not necessary given that
OID isn't in the fixed tuple header anymore.)

This will require a WAL format change of course.  Fortunately we can do
that without forcing a complete initdb (people will have to run
pg_resetxlog if they want to update a 7.3beta2 database without initdb).

I see no way to fix the problem in the context of 7.2.  Perhaps we
should put out a bulletin warning people to avoid VACUUM FULL in 7.2,
or at least to do CHECKPOINT as soon as possible after one.
        regards, tom lane


HOLD ON BETA2

От
Bruce Momjian
Дата:
[ Subject changed.]

Marc, please hold on announcing beta2 until we get this resolved.  Thanks.

---------------------------------------------------------------------------

Tom Lane wrote:
> I think I've identified a primary cause for the "no such pg_clog file"
> problem that we've seen reported several times.
> 
> A look at htup.h shows that the WAL only stores the low 8 bits of a
> tuple's t_infomask (see xl_heap_header struct).  There is some fooling
> around in heapam.c's WAL redo routines to try to reconstitute some of
> the high-order bits, for example this:
> 
>             htup->t_infomask = HEAP_XMAX_INVALID | xlhdr.mask;
> 
> But this is implicitly assuming that we can reconstruct the
> XMIN_COMMITTED bit at will.  That was true when the WAL code was
> written, but with 7.2's ability to recycle allegedly-no-longer-needed
> pg_clog data, we cannot simply drop commit status bits.
> 
> The only scenario I've been able to identify in which this actually
> causes a failure is when VACUUM FULL moves an old tuple and then shortly
> afterwards (before the next checkpoint) there is a crash.  Post-crash,
> the tuple move will be redone from WAL, and the moved tuple will be
> inserted with zeroed-out commit status bits.  When we next examine the
> tuple, we have to try to retrieve its commit status from pg_clog ... but
> it's not there anymore.
> 
> As far as I can see, the only realistic solution is to store the full 16
> bits of t_infomask in the WAL.  We could do this without increasing WAL
> size by dropping the t_hoff field from xl_heap_header --- t_hoff is
> computable given the number of attributes and the HASNULL/HASOID bits,
> both of which are available.  (Actually, we could save some space now
> by getting rid of t_oid in xl_heap_header; it's not necessary given that
> OID isn't in the fixed tuple header anymore.)
> 
> This will require a WAL format change of course.  Fortunately we can do
> that without forcing a complete initdb (people will have to run
> pg_resetxlog if they want to update a 7.3beta2 database without initdb).
> 
> I see no way to fix the problem in the context of 7.2.  Perhaps we
> should put out a bulletin warning people to avoid VACUUM FULL in 7.2,
> or at least to do CHECKPOINT as soon as possible after one.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


HOLD ON BETA2

От
Bruce Momjian
Дата:
Oops, I see beta2's on the web site with yesterday's date.  Have they
been announced?

-rw-r--r--  1 70  70   1072573 Sep 25 10:15 postgresql-test-7.3b2.tar.gz

---------------------------------------------------------------------------

Tom Lane wrote:
> I think I've identified a primary cause for the "no such pg_clog file"
> problem that we've seen reported several times.
> 
> A look at htup.h shows that the WAL only stores the low 8 bits of a
> tuple's t_infomask (see xl_heap_header struct).  There is some fooling
> around in heapam.c's WAL redo routines to try to reconstitute some of
> the high-order bits, for example this:
> 
>             htup->t_infomask = HEAP_XMAX_INVALID | xlhdr.mask;
> 
> But this is implicitly assuming that we can reconstruct the
> XMIN_COMMITTED bit at will.  That was true when the WAL code was
> written, but with 7.2's ability to recycle allegedly-no-longer-needed
> pg_clog data, we cannot simply drop commit status bits.
> 
> The only scenario I've been able to identify in which this actually
> causes a failure is when VACUUM FULL moves an old tuple and then shortly
> afterwards (before the next checkpoint) there is a crash.  Post-crash,
> the tuple move will be redone from WAL, and the moved tuple will be
> inserted with zeroed-out commit status bits.  When we next examine the
> tuple, we have to try to retrieve its commit status from pg_clog ... but
> it's not there anymore.
> 
> As far as I can see, the only realistic solution is to store the full 16
> bits of t_infomask in the WAL.  We could do this without increasing WAL
> size by dropping the t_hoff field from xl_heap_header --- t_hoff is
> computable given the number of attributes and the HASNULL/HASOID bits,
> both of which are available.  (Actually, we could save some space now
> by getting rid of t_oid in xl_heap_header; it's not necessary given that
> OID isn't in the fixed tuple header anymore.)
> 
> This will require a WAL format change of course.  Fortunately we can do
> that without forcing a complete initdb (people will have to run
> pg_resetxlog if they want to update a 7.3beta2 database without initdb).
> 
> I see no way to fix the problem in the context of 7.2.  Perhaps we
> should put out a bulletin warning people to avoid VACUUM FULL in 7.2,
> or at least to do CHECKPOINT as soon as possible after one.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: HOLD ON BETA2

От
Tom Lane
Дата:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Oops, I see beta2's on the web site with yesterday's date.  Have they
> been announced?

> -rw-r--r--  1 70  70   1072573 Sep 25 10:15 postgresql-test-7.3b2.tar.gz

No, but they've been out there for more than a day.  I think it's too
late to retract beta2.  We could fix this problem and then do a quick
beta3, though...
        regards, tom lane


Re: HOLD ON BETA2

От
Bruce Momjian
Дата:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Oops, I see beta2's on the web site with yesterday's date.  Have they
> > been announced?
> 
> > -rw-r--r--  1 70  70   1072573 Sep 25 10:15 postgresql-test-7.3b2.tar.gz
> 
> No, but they've been out there for more than a day.  I think it's too
> late to retract beta2.  We could fix this problem and then do a quick
> beta3, though...

Yes, that's the only solution.  Let's not announce beta2 and make things
worse.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073