Discussion: Tracking down log segment corruption

Tracking down log segment corruption

From: Charles Duffy
Date:
Howdy, all.

I have a log-shipping replication environment (based on PostgreSQL
8.3.4) using pg_lesslog+LZOP for compression of archived segments (kept
around long-term for possible use in PITR). The slave recently fell out
of synchronization, restoring a series of segments and then failing with
a SIGABRT (and doing the same after each automated restart):

[2-1] LOG:  starting archive recovery
[3-1] LOG:  restore_command = '/opt/extropy/postgres/bin/restore-segment
/exports/pgwal/segments.recoveryq %f %p %r'
[4-1] LOG:  restored log file "00000001000000140000000B" from archive
[5-1] LOG:  automatic recovery in progress
[6-1] LOG:  restored log file "000000010000001400000009" from archive
[7-1] LOG:  redo starts at 14/9127270
[8-1] LOG:  restored log file "00000001000000140000000A" from archive
[9-1] LOG:  restored log file "00000001000000140000000B" from archive
[10-1] LOG:  restored log file "00000001000000140000000C" from archive
[11-1] LOG:  restored log file "00000001000000140000000D" from archive
[12-1] LOG:  restored log file "00000001000000140000000E" from archive
[13-1] LOG:  restored log file "00000001000000140000000F" from archive
[14-1] LOG:  restored log file "000000010000001400000010" from archive
[15-1] WARNING:  specified item offset is too large
[15-2] CONTEXT:  xlog redo insert_upper: rel 1663/16384/17763; tid 2960/89
[16-1] PANIC:  btree_insert_redo: failed to add item
[16-2] CONTEXT:  xlog redo insert_upper: rel 1663/16384/17763; tid 2960/89
[1-1] LOG:  startup process (PID 17310) was terminated by signal 6: Aborted
[2-1] LOG:  aborting startup due to startup process failure

Replacing only 000000010000001400000010 with a pristine (never processed
with pg_compresslog) copy made no difference, but doing so for all of
the involved segments permitted the slave to pick up where it left off
and continue replaying.

It seems clear, then, that pg_lesslog was responsible in some way for
this corruption, and that there is some germane difference between the
original and { pg_compresslog | pg_decompresslog } version of one of the
involved segments. I've tried using xlogdump
[http://xlogviewer.projects.postgresql.org/] on both versions to look
for differing output, but it segfaults even with the "good" log segments.
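
For reference, the "rel A/B/C" values in those CONTEXT lines are
tablespace OID / database OID / relfilenode, so the affected relation
can at least be identified with catalog lookups along these lines (a
rough sketch, using the OIDs from the log above):

SELECT spcname FROM pg_tablespace WHERE oid = 1663;   -- tablespace
SELECT datname FROM pg_database WHERE oid = 16384;    -- database
-- then, connected to that database:
SELECT relname, relkind FROM pg_class WHERE relfilenode = 17763;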

Does anyone have suggestions as to how I should go about tracking down
the root cause of this issue?

Thanks!

Re: Tracking down log segment corruption

From: Gordon Shannon
Date:
I just ran into the same problem.  Both servers are running 8.4.3, and
the standby server had been running for 2 days, processing many thousands of
logs successfully.  Here's my error:

4158   2010-05-02 11:12:09 EDT [26445]LOG:  restored log file
"0000000100003C77000000C3" from archive
4158   2010-05-02 11:12:09 EDT [26446]LOG:  restored log file
"0000000100003C77000000C4" from archive
4158   2010-05-02 11:12:09 EDT [26447]WARNING:  specified item offset is too
large
4158   2010-05-02 11:12:09 EDT [26448]CONTEXT:  xlog redo insert: rel
48777166/22362/48778276; tid 2/2
4158   2010-05-02 11:12:09 EDT [26449]PANIC:  btree_insert_redo: failed to
add item
4158   2010-05-02 11:12:09 EDT [26450]CONTEXT:  xlog redo insert: rel
48777166/22362/48778276; tid 2/2
4151   2010-05-02 11:12:09 EDT [1]LOG:  startup process (PID 4158) was
terminated by signal 6: Aborted
4151   2010-05-02 11:12:09 EDT [2]LOG:  terminating any other active server
processes

Any suggestions?


Charles Duffy wrote:
>
>
> [14-1] LOG:  restored log file "000000010000001400000010" from archive
> [15-1] WARNING:  specified item offset is too large
> [15-2] CONTEXT:  xlog redo insert_upper: rel 1663/16384/17763; tid 2960/89
> [16-1] PANIC:  btree_insert_redo: failed to add item
> [16-2] CONTEXT:  xlog redo insert_upper: rel 1663/16384/17763; tid 2960/89
> [1-1] LOG:  startup process (PID 17310) was terminated by signal 6:
> Aborted
> [2-1] LOG:  aborting startup due to startup process failure
>
>



Re: Tracking down log segment corruption

From: Tom Lane
Date:
Gordon Shannon <gordo169@gmail.com> writes:
> I just ran into the same problem.  Both servers are running 8.4.3, and
> the standby server had been running for 2 days, processing many thousands of
> logs successfully.  Here's my error:

> 4158   2010-05-02 11:12:09 EDT [26445]LOG:  restored log file
> "0000000100003C77000000C3" from archive
> 4158   2010-05-02 11:12:09 EDT [26446]LOG:  restored log file
> "0000000100003C77000000C4" from archive
> 4158   2010-05-02 11:12:09 EDT [26447]WARNING:  specified item offset is too
> large
> 4158   2010-05-02 11:12:09 EDT [26448]CONTEXT:  xlog redo insert: rel
> 48777166/22362/48778276; tid 2/2
> 4158   2010-05-02 11:12:09 EDT [26449]PANIC:  btree_insert_redo: failed to
> add item
> 4158   2010-05-02 11:12:09 EDT [26450]CONTEXT:  xlog redo insert: rel
> 48777166/22362/48778276; tid 2/2
> 4151   2010-05-02 11:12:09 EDT [1]LOG:  startup process (PID 4158) was
> terminated by signal 6: Aborted
> 4151   2010-05-02 11:12:09 EDT [2]LOG:  terminating any other active server
> processes

Hmm ... AFAICS the only way to get that message when the incoming TID's
offsetNumber is only 2 is for the index page to be completely empty
(not zeroes, else PageAddItem's sanity check would have triggered,
but valid and empty).  What that smells like is a software bug, like
failing to emit a WAL record in a case where it was necessary.  Can you
identify which index this was?  (Look for relfilenode 48778276 in the
database with OID 22362.)  If so, can you give us any hints about
unusual things that might have been done with that index?
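
As a rough sketch, using the OIDs from your CONTEXT lines, that lookup
would be something like:

SELECT datname FROM pg_database WHERE oid = 22362;
-- then, connected to that database:
SELECT oid, relname, relkind, reltablespace
FROM pg_class
WHERE relfilenode = 48778276;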

> Any suggestions?

As far as recovering goes, there's probably not much you can do except
resync the standby from scratch.  But it would be nice to get to the
bottom of the problem, so that we can fix the bug.  Have you got an
archive of this xlog segment and the ones before it, and would you be
willing to let a developer look at them?

            regards, tom lane

Re: Tracking down log segment corruption

From: Gordon Shannon
Date:


On Sun, May 2, 2010 at 11:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:


Hmm ... AFAICS the only way to get that message when the incoming TID's
offsetNumber is only 2 is for the index page to be completely empty
(not zeroes, else PageAddItem's sanity check would have triggered,
but valid and empty).  What that smells like is a software bug, like
failing to emit a WAL record in a case where it was necessary.  Can you
identify which index this was?  (Look for relfilenode 48778276 in the
database with OID 22362.)  If so, can you give us any hints about
unusual things that might have been done with that index?

Interesting. There is no pg_class entry for 22362.  There is, however, an entry for that filenode. It's an index I created Saturday morning, around 6 AM.

select oid, * from pg_class where relfilenode=48778276;
-[ RECORD 1 ]--+---------------------------
oid            | 48777488
relname        | cts_20100501_topic_date_nk
relnamespace   | 2200
reltype        | 0
relowner       | 16412
relam          | 403
relfilenode    | 48778276
reltablespace  | 48777166
relpages       | 2476
reltuples      | 58879
reltoastrelid  | 0
reltoastidxid  | 0
relhasindex    | f
relisshared    | f
relistemp      | f
relkind        | i
relnatts       | 2
relchecks      | 0
relhasoids     | f
relhaspkey     | f
relhasrules    | f
relhastriggers | f
relhassubclass | f
relfrozenxid   | 0
relacl         | null
reloptions     | null

Possibly relevant facts:

- The WSB server went active on Friday around 3:30PM
- On Friday evening, I added about 11 tablespaces.  I noted the new files on the WSB, no problems.
- On Sat morning, I created a partitioned table cts_20100501 (inherits from another table) and 4 indexes.
- This morning, I was doing some table maintenance on the master and discovered I had created this table and its indexes in the wrong tablespace.
I wanted the table in ts29, but had it in ts30.  Vice versa for the indexes.  So I moved them. This is from my command history:
   
alter index cts_20100501_natural_uk set tablespace ts30;
alter index cts_20100501_pkey set tablespace ts30;
alter index cts_20100501_topic_date_nk set tablespace ts30;
alter index cts_20100501_updated_nk set tablespace ts30;
alter table cts_20100501 set tablespace ts29;

These commands worked fine on the master, yet this seems suspiciously relevant.
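
For what it's worth, here is a rough way to double-check where things
ended up (a NULL spcname means the relation sits in the database's
default tablespace):

SELECT c.relname, t.spcname
FROM pg_class c
LEFT JOIN pg_tablespace t ON t.oid = c.reltablespace
WHERE c.relname IN ('cts_20100501', 'cts_20100501_topic_date_nk');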


> Any suggestions?

As far as recovering goes, there's probably not much you can do except
resync the standby from scratch.  But it would be nice to get to the
bottom of the problem, so that we can fix the bug.  Have you got an
archive of this xlog segment and the ones before it, and would you be
willing to let a developer look at them?


Before I received your reply, I had already started the re-sync, and unfortunately had already deleted the WAL files in question.
If it happens again, I will certainly keep them, and would be happy to share them.

Regards,
Gordon


Re: Tracking down log segment corruption

From: Tom Lane
Date:
Gordon Shannon <gordo169@gmail.com> writes:
> Interesting. There is no pg_class entry for 22362.

No, this would be a pg_database row with that OID.  But it looks like
you found the relevant index anyway.

> ... There is, however, an
> entry for that filenode. It's an index I created Sat AM, about 6AM.
> ...
> - This morning, I was doing some table maintenance on the master and
> discovered I had created this table and its indexes in the wrong tablespace.
> I wanted the table in ts29, but had it in ts30.  Vice versa for the
> indexes.  So I moved them. This is from my command history:

> alter index cts_20100501_natural_uk set tablespace ts30;
> alter index cts_20100501_pkey set tablespace ts30;
> alter index cts_20100501_topic_date_nk set tablespace ts30;
> alter index cts_20100501_updated_nk set tablespace ts30;
> alter table cts_20100501 set tablespace ts29;

> These commands worked fine on the master, yet this seems suspiciously
> relevant.

Yeah, perhaps so.  What time did the failure on the standby occur (how
long after you did those moves)?  Is it reasonable to assume that this
was the first subsequent access to the index?

            regards, tom lane

Re: Tracking down log segment corruption

From: Gordon Shannon
Date:


On Sun, May 2, 2010 at 12:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
No, this would be a pg_database row with that OID.  But it looks like
you found the relevant index anyway.

Yup, realized that on second reading.
 
> These commands worked fine on the master, yet this seems suspiciously
> relevant.


Yeah, perhaps so.  What time did the failure on the standby occur (how
long after you did those moves)?  Is it reasonable to assume that this
was the first subsequent access to the index?


Bingo.  Yes it is reasonable.  It was 25 seconds between my altering the index in question and the server crash.

My local commands (times in MDT; my machine's clock is 15 seconds ahead of the server's):

09:10:52> alter index cts_20100501_natural_uk set tablespace ts30;
ALTER INDEX
Time: 787.790 ms
09:11:41> alter index cts_20100501_pkey set tablespace ts30;
ALTER INDEX
Time: 468.526 ms
09:11:51> alter index cts_20100501_topic_date_nk set tablespace ts30;
ALTER INDEX
Time: 385.322 ms
09:11:59> alter index cts_20100501_updated_nk set tablespace ts30;
ALTER INDEX
Time: 963.150 ms
09:12:10> alter table cts_20100501 set tablespace ts29;
ALTER TABLE

And from the WSB log (times in EDT):

4158   2010-05-02 11:12:09 EDT [26446]LOG:  restored log file "0000000100003C77000000C4" from archive
4158   2010-05-02 11:12:09 EDT [26447]WARNING:  specified item offset is too large
4158   2010-05-02 11:12:09 EDT [26448]CONTEXT:  xlog redo insert: rel 48777166/22362/48778276; tid 2/2
4158   2010-05-02 11:12:09 EDT [26449]PANIC:  btree_insert_redo: failed to add item


Re: Tracking down log segment corruption

From: Tom Lane
Date:
Gordon Shannon <gordo169@gmail.com> writes:
> Bingo.  Yes it is reasonable.  It was 25 seconds between my altering the
> index in question and the server crash.

Sounds like we have a smoking gun.  Could you show all your non-default
postgresql.conf settings on the master?  I'm wondering about
full_page_writes in particular, but might as well gather all the
relevant data while we're at it.  Easiest way is:

select name,setting from pg_settings where source not in ('default','override');

            regards, tom lane

Re: Tracking down log segment corruption

From: Gordon Shannon
Date:


On Sun, May 2, 2010 at 12:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Gordon Shannon <gordo169@gmail.com> writes:
> Bingo.  Yes it is reasonable.  It was 25 seconds between my altering the
> index in question and the server crash.

Sounds like we have a smoking gun.  Could you show all your non-default
postgresql.conf settings on the master?  I'm wondering about
full_page_writes in particular, but might as well gather all the
relevant data while we're at it.  Easiest way is:

select name,setting from pg_settings where source not in ('default','override');

                       regards, tom lane

 archive_command                 | cp %p /var/lib/pgsql/wal/%f.wrk; mv /var/lib/pgsql/wal/%f.wrk /var/lib/pgsql/wal/%f
 archive_mode                    | on
 authentication_timeout          | 20
 autovacuum_analyze_scale_factor | 0.05
 autovacuum_freeze_max_age       | 1500000000
 autovacuum_vacuum_cost_delay    | -1
 autovacuum_vacuum_scale_factor  | 0.1
 checkpoint_segments             | 128
 checkpoint_warning              | 300
 commit_siblings                 | 1
 cpu_index_tuple_cost            | 0.001
 cpu_operator_cost               | 0.0005
 cpu_tuple_cost                  | 0.003
 DateStyle                       | ISO, MDY
 deadlock_timeout                | 5000
 default_text_search_config      | pg_catalog.english
 effective_cache_size            | 6291456
 lc_messages                     | en_US.UTF-8
 lc_monetary                     | en_US.UTF-8
 lc_numeric                      | en_US.UTF-8
 lc_time                         | en_US.UTF-8
 listen_addresses                | *
 log_autovacuum_min_duration     | 0
 log_destination                 | stderr
 log_directory                   | /var/log/postgres
 log_filename                    | pg%d.log
 log_line_prefix                 | %p %u %r %t [%l]
 log_min_duration_statement      | 180000
 log_rotation_age                | 1440
 log_rotation_size               | 0
 log_temp_files                  | 0
 log_timezone                    | US/Eastern
 log_truncate_on_rotation        | on
 logging_collector               | on
 maintenance_work_mem            | 8388608
 max_connections                 | 500
 max_stack_depth                 | 2048
 port                            | 5432
 search_path                     | public, archive, _slony_cluster
 shared_buffers                  | 2359296
 standard_conforming_strings     | on
 synchronous_commit              | off
 temp_tablespaces                | ts27,ts28,ts29,ts30,ts31,ts32,ts33,ts34,ts35,ts36,ts37
 TimeZone                        | US/Eastern
 timezone_abbreviations          | Default
 track_functions                 | pl
 vacuum_freeze_min_age           | 500000000
 vacuum_freeze_table_age         | 1300000000
 wal_buffers                     | 1024
 work_mem                        | 204800

Re: Tracking down log segment corruption

From: Tom Lane
Date:
Gordon Shannon <gordo169@gmail.com> writes:
> [ corruption on a standby slave after an ALTER SET TABLESPACE operation ]

Found it, I think.  ATExecSetTableSpace transfers the copied data to the
slave by means of XLOG_HEAP_NEWPAGE WAL records.  The replay function
for this (heap_xlog_newpage) is failing to pay any attention to the
forkNum field of the WAL record.  This means it will happily write FSM
and visibility-map pages into the main fork of the relation.  So if the
index had any such pages on the master, it would immediately become
corrupted on the slave.  Now indexes don't have a visibility-map fork,
but they could have FSM pages.  And an FSM page would have the right
header information to look like an empty index page.  So dropping an
index FSM page into the main fork of the index would produce the
observed symptom.

I'm not 100% sure that this is what bit you, but it's clearly a bug and
AFAICS it could produce the observed symptoms.

This is a seriously, seriously nasty data corruption bug.  The only bit
of good news is that ALTER SET TABLESPACE seems to be the only operation
that can emit XLOG_HEAP_NEWPAGE records with forkNum different from
MAIN_FORKNUM, so that's the only operation that's at risk.  But if you
do do that, not only are standby slaves going to get clobbered, but the
master could get corrupted too if you were unlucky enough to have a
crash and replay from WAL shortly after completing the ALTER.  And it's
not only indexes that are at risk --- tables could get clobbered the
same way.
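
(If an index did get clobbered that way on the master, it could
presumably be repaired after the fact by rebuilding it from the heap,
for example

REINDEX INDEX cts_20100501_topic_date_nk;

but a table whose main fork got overwritten the same way would not be
recoverable by any such trick.)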

My crystal ball says there will be update releases in the very near
future.

            regards, tom lane

Re: Tracking down log segment corruption

From: Gordon Shannon
Date:
Sounds like you're on it.  Just wanted to share one additional piece, in case it helps.

Just before the ALTER INDEX SET TABLESPACE was issued, there were some writes to the table in question inside a serializable transaction. The transaction committed at 11:11:58 EDT and consisted of, among a couple thousand writes to sibling tables, 4 writes (an unknown combination of inserts and updates) to cts_20100501, which definitely affected the index in question.

In any case, I will cease and desist from ALTER SET TABLESPACE for a while!
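
If I do need to move an index before a fixed release is out, my plan is
to build a replacement directly in the target tablespace and swap it in
rather than use ALTER ... SET TABLESPACE, along these lines (the column
list here is only a guess):

CREATE INDEX CONCURRENTLY cts_20100501_topic_date_nk_new
    ON cts_20100501 (topic_id, created_date)  -- column list is a guess
    TABLESPACE ts30;
DROP INDEX cts_20100501_topic_date_nk;
ALTER INDEX cts_20100501_topic_date_nk_new
    RENAME TO cts_20100501_topic_date_nk;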

Thanks!
Gordon


On Sun, May 2, 2010 at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Found it, I think.  ATExecSetTableSpace transfers the copied data to the
slave by means of XLOG_HEAP_NEWPAGE WAL records.  The replay function
for this (heap_xlog_newpage) is failing to pay any attention to the
forkNum field of the WAL record.  This means it will happily write FSM
and visibility-map pages into the main fork of the relation.  So if the
index had any such pages on the master, it would immediately become
corrupted on the slave.  Now indexes don't have a visibility-map fork,
but they could have FSM pages.  And an FSM page would have the right
header information to look like an empty index page.  So dropping an
index FSM page into the main fork of the index would produce the
observed symptom.

I'm not 100% sure that this is what bit you, but it's clearly a bug and
AFAICS it could produce the observed symptoms.

This is a seriously, seriously nasty data corruption bug.  The only bit
of good news is that ALTER SET TABLESPACE seems to be the only operation
that can emit XLOG_HEAP_NEWPAGE records with forkNum different from
MAIN_FORKNUM, so that's the only operation that's at risk.  But if you
do do that, not only are standby slaves going to get clobbered, but the
master could get corrupted too if you were unlucky enough to have a
crash and replay from WAL shortly after completing the ALTER.  And it's
not only indexes that are at risk --- tables could get clobbered the
same way.

My crystal ball says there will be update releases in the very near
future.

                       regards, tom lane


Re: Tracking down log segment corruption

From: Tom Lane
Date:
Gordon Shannon <gordo169@gmail.com> writes:
> In any case, I will cease and desist from ALTER SET TABLESPACE for a while!

Here's the applied patch, if you are interested in testing it.

            regards, tom lane

Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.277.2.1
diff -c -r1.277.2.1 heapam.c
*** src/backend/access/heap/heapam.c    24 Aug 2009 02:18:40 -0000    1.277.2.1
--- src/backend/access/heap/heapam.c    2 May 2010 22:19:04 -0000
***************
*** 4195,4202 ****
       * Note: the NEWPAGE log record is used for both heaps and indexes, so do
       * not do anything that assumes we are touching a heap.
       */
!     buffer = XLogReadBuffer(xlrec->node, xlrec->blkno, true);
      Assert(BufferIsValid(buffer));
      page = (Page) BufferGetPage(buffer);

      Assert(record->xl_len == SizeOfHeapNewpage + BLCKSZ);
--- 4195,4204 ----
       * Note: the NEWPAGE log record is used for both heaps and indexes, so do
       * not do anything that assumes we are touching a heap.
       */
!     buffer = XLogReadBufferExtended(xlrec->node, xlrec->forknum, xlrec->blkno,
!                                     RBM_ZERO);
      Assert(BufferIsValid(buffer));
+     LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
      page = (Page) BufferGetPage(buffer);

      Assert(record->xl_len == SizeOfHeapNewpage + BLCKSZ);