Re: Fwd: Data corruption after restarting replica

From: Adrian Klaver
Subject: Re: Fwd: Data corruption after restarting replica
Message-ID: 54E50347.9020906@aklaver.com
In reply to: Fwd: Data corruption after restarting replica  (Novák, Petr <novakp@avast.com>)
Responses: Re: Fwd: Data corruption after restarting replica  (Novák, Petr <novakp@avast.com>)
List: pgsql-general
On 02/16/2015 02:44 AM, Novák, Petr wrote:
> Hello,
>
> sorry for posting to a second list, but as I've received no reply
> there, I'm trying my luck here.
>
> Thanks
> Petr
>
>
> ---------- Forwarded message ----------
> From: Novák, Petr <novakp@avast.com>
> Date: Tue, Feb 10, 2015 at 12:49 PM
> Subject: Data corruption after restarting replica
> To: pgsql-bugs@postgresql.org
>
>
> Hi all,
>
> we're experiencing data corruption after switching a streaming replica to primary.
> This is not the first time I've encountered this issue, so I'll try to
> describe it in more detail.
>
> For this particular cluster we have 6 servers in two datacenters (3 in
> each). There are two instances running on each server, each with its
> own port and datadir. On the first two servers in each datacenter one
> instance is primary and the other is replica for the primary from the
> other server. The third server holds two offsite replicas from the other
> datacenter (for DR purposes).
>
> Each replica was set up by taking a pg_basebackup from the primary
> (pg_basebackup -h <hostname> -p 5430 -D /data2/basebackup -P -v -U
> <user> -x -c fast). Then the directories from initdb were replaced with
> the ones from the basebackup (only the configuration files remained),
> the replica was started, and it connected successfully to the primary. It
> ran with no problems, keeping up with the primary. We were
> experiencing some connection problems between the two datacenters, but
> replication didn't break.
>
> Then we needed to take one datacenter offline due to hardware
> maintenance. So I shut the applications down, verified that no
> more clients were connected to the primary, then shut the primary down and
> restarted the replica without recovery.conf, and the applications were
> started using the new db with no problem. Other replica even
> successfully reconnected to this new primary.

What other replica?
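
One thing worth ruling out for the switchover described above is that the replica was restarted without recovery.conf before it had replayed everything the old primary wrote. Below is a minimal sketch, assuming you note the output of pg_current_xlog_location() on the primary and of pg_last_xlog_replay_location() on the replica (both exist in 9.1 and 9.2) before the switch. The helper names are hypothetical, and the location values in the usage note are made up for illustration.

```python
def parse_xlog_location(text):
    """Parse an 'X/Y' xlog location (two hex fields) into a comparable tuple."""
    high, low = text.strip().split("/")
    return (int(high, 16), int(low, 16))


def replica_caught_up(primary_location, replica_replay_location):
    """Return True if the replica has replayed at least up to the
    position recorded on the primary."""
    primary = parse_xlog_location(primary_location)
    replica = parse_xlog_location(replica_replay_location)
    return replica >= primary
```

For example, replica_caught_up("16/B374D848", "16/B374D000") is False, so that replica would still be behind. The primary writes a little more WAL during shutdown (the shutdown checkpoint), so a position recorded beforehand is only a lower bound on what the replica needs to have replayed.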

>
> A few hours after the switch, lines appeared in the server log (which
> had not appeared before), indicating corruption:
>
> ERROR:  index "account_username_key" contains unexpected zero page at
> block 1112135
> ERROR:  right sibling's left-link doesn't match: block 476354 links to
> 1062443 instead of expected 250322 in index "account_pkey"
>
> ..and many more reporting corruption in several other indexes.

What happened to the primary you shut down?
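
To get a full list of the indexes named in errors like the two quoted above, something along these lines could be run over the server log. It is only a sketch: it assumes the messages have the same shape as those two (the index name in double quotes after the word "index"), and the function names are hypothetical.

```python
import re

# Matches: index "some_name"
INDEX_NAME = re.compile(r'index "([^"]+)"')


def affected_indexes(log_text):
    """Return the index names mentioned in the log, in first-seen order."""
    # Messages may be wrapped across lines, so collapse whitespace first.
    flat = " ".join(log_text.split())
    names = []
    for name in INDEX_NAME.findall(flat):
        if name not in names:
            names.append(name)
    return names


def reindex_statements(names):
    """Build one REINDEX INDEX statement per index name."""
    return ['REINDEX INDEX "{}";'.format(name) for name in names]
```

Note that REINDEX blocks writes to the table while it runs, and rebuilding a unique index will fail while duplicate rows are present, so those have to be cleaned up first.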

>
> The issue was resolved by creating new indexes and dropping the
> affected ones, although there were already some duplicates in the
> data that had to be resolved, as some of the indexes were unique.
>
> This particular case uses Postgres 9.1.14 on both primary and replica,
> but I've experienced similar behavior on 9.2.9. The OS is CentOS 6.6 in all
> cases. This may mean that there is something wrong with our
> configuration or the replication setup steps, but I've set up another
> instance using the same steps with no problem.
>
> Fsync-related settings are at their defaults. Data directories are on
> RAID10 arrays with BBUs. The filesystem is ext4, mounted with the nobarrier
> option.
>
> The database is fairly large, ~120GB, with several 50mil+ row tables, lots of
> indexes and FK constraints. It is mostly read;
> updates/inserts/deletes are only several rows/s.
>
> Any help will be appreciated.
>
> Petr Novak
>
> System Engineer
> Avast s.r.o.
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com

