Обсуждение: Corrupted index on 9.0.3 streaming hot standby

Поиск

Список

Период

Сортировка

Corrupted index on 9.0.3 streaming hot standby

От

Jakub Ouhrabka

Дата:

25 февраля 2011 г., 10:20:40

Hi,

we've found that we have corrupted index on 9.0.3 streaming hot standby.
Master works ok. There is one more non-streaming standby which is ok as
well. Platform is 64bit Linux.

Database cluster and this database were created on 9.0.2 and than
upgraded to 9.0.3. We are not aware of any crash on either master or
streaming standby but we didn't investigate it deeply yet.

For details about corrupted index see below. The table and index in
question are mostly read-only (several queries per second) writes happen
only few times a day.

We've backed up whole cluster and recreated it.

Shall we investigate it further? How? Is it possible that we make some
mistake when doing initial backup which caused corruption? Is there a
way to check other indexes?

Thanks,

Kuba

set enable_bitmapscan to on ;

explain analyze  select * from tXX where colXX = '18';
                                                          QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on tXX  (cost=4.49..96.56 rows=30 width=102) (actual
time=0.018..0.018 rows=0 loops=1)
    Recheck Cond: (colXX = '18'::text)
    ->  Bitmap Index Scan on tXX_colXX_idx  (cost=0.00..4.48 rows=30
width=0) (actual time=0.015..0.015 rows=0 loops=1)
          Index Cond: (colXX = '18'::text)
  Total runtime: 0.053 ms

set enable_bitmapscan to off ;

explain analyze  select * from tXX where colXX = '18';
                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------
  Seq Scan on tXX  (cost=0.00..736.65 rows=30 width=102) (actual
time=1.579..8.727 rows=30 loops=1)
    Filter: (colXX = '18'::text)
  Total runtime: 8.766 ms

Re: Corrupted index on 9.0.3 streaming hot standby

От

Alvaro Herrera

Дата:

25 февраля 2011 г., 14:54:11

Excerpts from Jakub Ouhrabka's message of vie feb 25 08:20:33 -0300 2011:

> For details about corrupted index see below. The table and index in
> question are mostly read-only (several queries per second) writes happen
> only few times a day.
>
> We've backed up whole cluster and recreated it.
>
> Shall we investigate it further? How? Is it possible that we make some
> mistake when doing initial backup which caused corruption? Is there a
> way to check other indexes?

Do you still have the WAL files?

--
Ãlvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Corrupted index on 9.0.3 streaming hot standby

От

Jakub Ouhrabka

Дата:

25 февраля 2011 г., 15:46:37

Hi,

> Do you still have the WAL files?

what do you mean exactly? We don't have full history of WAL files from
creation of hot streaming standby. They are recycled. We have set of WAL
files from the point we discovered the corrupted index and stopped the
cluster. But it was probably days after the index was corrupted...

Kuba

Re: Corrupted index on 9.0.3 streaming hot standby

От

Robert Haas

Дата:

03 марта 2011 г., 14:24:35

2011/2/25 Jakub Ouhrabka <kuba@comgate.cz>:
> Hi,
>
> we've found that we have corrupted index on 9.0.3 streaming hot standby.
> Master works ok. There is one more non-streaming standby which is ok as
> well. Platform is 64bit Linux.
>
> Database cluster and this database were created on 9.0.2 and than upgraded
> to 9.0.3. We are not aware of any crash on either master or streaming
> standby but we didn't investigate it deeply yet.
>
> For details about corrupted index see below. The table and index in question
> are mostly read-only (several queries per second) writes happen only few
> times a day.
>
> We've backed up whole cluster and recreated it.
>
> Shall we investigate it further? How? Is it possible that we make some
> mistake when doing initial backup which caused corruption? Is there a way to
> check other indexes?

The obvious way this could happen is if there were a system crash on
the standby.  In theory that should be OK too, but if fsync isn't
working properly due to a settings or disk controller configuration
problem or similar, then it might not be.

See also http://wiki.postgresql.org/wiki/Reliable_Writes

If there hasn't been a system crash on the standby, then it's harder
to explain.  It'd be interesting to compare the disk blocks in the
index on the standby with the disk blocks in the index on the master
and figure out which ones are different and in what way.  pg_filedump
might be useful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Corrupted index on 9.0.3 streaming hot standby

От

Jakub Ouhrabka

Дата:

03 марта 2011 г., 16:02:17

Hi Robert,

 > If there hasn't been a system crash on the standby, then it's harder
 > to explain.  It'd be interesting to compare the disk blocks in the
 > index on the standby with the disk blocks in the index on the master
 > and figure out which ones are different and in what way.  pg_filedump
 > might be useful.

I think this is the case. We can even reproduce it: take another backup
of uncorrupted master and the slave is again corrupted.

The strange thing is that this only affects streaming replication
standby, not wal files shipping standby. Maybe we're doing something
wrong...

Could the reason for the strange behaviour (see previous mail) be
something different than corrupted index?

We'll try to reproduce it in testing environment and possibly create
reproducible test case.  I'll definitely report back...

Regards,

Kuba

Re: Corrupted index on 9.0.3 streaming hot standby

От

Robert Haas

Дата:

03 марта 2011 г., 16:09:35

On Thu, Mar 3, 2011 at 11:39 AM, Jakub Ouhrabka
<jakub.ouhrabka@comgate.cz> wrote:
> Hi Robert,
>
>> If there hasn't been a system crash on the standby, then it's harder
>> to explain. =A0It'd be interesting to compare the disk blocks in the
>> index on the standby with the disk blocks in the index on the master
>> and figure out which ones are different and in what way. =A0pg_filedump
>> might be useful.
>
> I think this is the case. We can even reproduce it: take another backup of
> uncorrupted master and the slave is again corrupted.

Well, in that case, I'd *really* like to see you compare the two
files.  Maybe you could reproduce the problem, ideally stop both
servers (or at least CHECKPOINT), and then for each block in the
affected index run:

pg_filedump -i -R $BLOCKNUMBER $FILE > b.$BLOCKNUMBER

...on the master and on the standby.  Then diff the master version of
each file with the standby version and see what pops out.

> The strange thing is that this only affects streaming replication standby,
> not wal files shipping standby. Maybe we're doing something wrong...

Maybe, but I can't think what would cause this.

--=20
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Corrupted index on 9.0.3 streaming hot standby

От

Jakub Ouhrabka

Дата:

07 марта 2011 г., 13:26:15

Hi,

for the archives: the root cause is locale. Both linux/debian servers
were set to the same locale (cs_CZ.UTF-8) but there is slightly
different definition of this locale on those systems - once numbers are
before letters and another time numbers are after letters. Then the
index appears to be broken...

So definitely it's not a postgresql bug. It'd be nice if there are some
safeguards against this mismatch, although I don't know how to do it...

Sorry for the noise.

Kuba

Re: Corrupted index on 9.0.3 streaming hot standby

От

Robert Haas

Дата:

07 марта 2011 г., 16:14:42

On Mon, Mar 7, 2011 at 9:15 AM, Jakub Ouhrabka
<jakub.ouhrabka@comgate.cz> wrote:
> for the archives: the root cause is locale. Both linux/debian servers were
> set to the same locale (cs_CZ.UTF-8) but there is slightly different
> definition of this locale on those systems - once numbers are before letters
> and another time numbers are after letters. Then the index appears to be
> broken...
>
> So definitely it's not a postgresql bug. It'd be nice if there are some
> safeguards against this mismatch, although I don't know how to do it...

Wow.  That sucks.  But thanks for reporting back on what happened.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Corrupted index on 9.0.3 streaming hot standby