Re: Bug in visibility map WAL-logging

Поиск
Список
Период
Сортировка
От Matheus de Oliveira
Тема Re: Bug in visibility map WAL-logging
Дата
Msg-id CAJghg4+8gneMTtSw7R==5b9T719mqfsCAzar2qcHpYo5odcD2A@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Bug in visibility map WAL-logging  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Ответы Re: Bug in visibility map WAL-logging  (Matheus de Oliveira <matioli.matheus@gmail.com>)
Re: Bug in visibility map WAL-logging  (Andres Freund <andres@2ndquadrant.com>)
Список pgsql-hackers

On Tue, Jan 7, 2014 at 5:36 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 01/07/2014 07:15 PM, Matheus de Oliveira wrote:
Hi folks,


On Fri, Dec 13, 2013 at 9:47 AM, Heikki Linnakangas <hlinnakangas@vmware.com
wrote:

lazy_vacuum_page() does this:

1. START_CRIT_SECTION()
2. Remove dead tuples from the page, marking the itemid's unused.
3. MarkBufferDirty
4. if all remaining tuples on the page are visible to everyone, set the
all-visible flag on the page, and call visibilitymap_set() to set the VM
bit.
5 visibilitymap_set() writes a WAL record about setting the bit in the
visibility map.
6. write the WAL record of removing the dead tuples.
7. END_CRIT_SECTION().

See the problem? Setting the VM bit is WAL-logged first, before the
removal of the tuples. If you stop recovery between the two WAL records,
the page's VM bit in the VM map will be set, but the dead tuples are still
on the page.

This bug was introduced in Feb 2013, by commit
fdf9e21196a6f58c6021c967dc5776a16190f295, so it's present in 9.3 and master.

The fix seems quite straightforward, we just have to change the order of
those two records. I'll go commit that.



With a lot of help from Andres on IRC (thanks a lot man), I think (he
thinks actually, =P ) I may ran into the issue this bug can cause. I'm
sending this e-mail to (1) confirm if my issue is really caused by that bug
and if that is the case, (2) let you guys know the problems it can cause.

Scenario:

Master: 9.3.1 (I know, trying to go to 9.3.2)
Slave: 9.3.2

The slave was running with hot_standby=off (because of other bug Andres is
working on), but it stopped working with the following log lines:

2014-01-07 12:38:04 BRST [22668]: [7-1] user=,db= WARNING:  page 1076 of
relation base/883916/151040222 is uninitialized
2014-01-07 12:38:04 BRST [22668]: [8-1] user=,db= CONTEXT:  xlog redo
visible: rel 1663/883916/151040222; blk 1076
2014-01-07 12:38:04 BRST [22668]: [9-1] user=,db= PANIC:  WAL contains
references to invalid pages
2014-01-07 12:38:04 BRST [22668]: [10-1] user=,db= CONTEXT:  xlog redo
visible: rel 1663/883916/151040222; blk 1076
2014-01-07 12:38:04 BRST [22665]: [3-1] user=,db= LOG:  startup process
(PID 22668) was terminated by signal 6: Aborted

(Complete log at https://pgsql.privatepaste.com/343f3190fe).

I used pg_xlogdump to (try to) find the block the error relates to:

$ pg_xlogdump pg_xlog/00000001000002BC0000002B 00000001000002BC0000007F |
grep -n -C5 '883916/151040222; blk 1076'

2088220 ... Heap2 ...  24/   192, ... lsn: 2BC/46AB8B20 ... desc: clean:
rel 1663/883916/151040222; blk 1073 remxid 107409880
2088221 ... Heap2 ...  20/    52, ... lsn: 2BC/46AB8BE0 ... desc: visible:
rel 1663/883916/151040222; blk 1074
2088222 ... Heap2 ...  24/   128, ... lsn: 2BC/46AB8C18 ... desc: clean:
rel 1663/883916/151040222; blk 1074 remxid 107409880
> ...

Hmm. The xlogdump indeed shows that the order of 'clean' and 'visible' is incorrect, but I don't immediately see how that could cause the PANIC.

Well... I also didn't understand if it could cause the PANIC. If I got right, it could only cause a visibility map bit with wrong value (e.g. first change the bit but fails to mark the tuple, in case of a failure in between - which seems that not happened). Is that right?
 
Why is the page uninitialized in the standby? If VACUUM is removing some dead tuples from it, it certainly should exist and be correctly initialized.


That I don't know for sure...
 
How did you set up the standby? Did you initialize it from an offline backup of the master's data directory, perhaps? The log shows that the startup took the the "crash recovery first, then start archive recovery" path, because there was no backup label file. In that mode, the standby assumes that the system is consistent after replaying all the WAL in pg_xlog, which is correct if you initialize from an offline backup or atomic filesystem snapshot, for example. But "WAL contains references to invalid pages" could also be a symptom of an inconsistent base backup, cause by incorrect backup procedure. In particular, I have to ask because I've seen it before: you didn't delete backup_label from the backup, did you?

Well, I cannot answer this right now, but makes all sense and is possible.

I used a script created by someone else that does pg_start_backup + rsync + pg_stop_backup, but in fact I didn't look at this script to see if it is doing something nasty, as removing backup_label. I'll be able to check it tomorrow and so I'll come back to give a definitive answer.

@andres, if it is really removing backup_label it could also cause that other issue we saw on Monday, right? (yes I did run the same script).

BTW, is there any situation that backup_label should be removed? I see no reason for doing this, and also have not yet saw someone doing it, so I didn't even thought that could be it.

Thank you guys for your attention.

Best regards,
--
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Gierth
Дата:
Сообщение: Re: WITHIN GROUP patch
Следующее
От: Robert Haas
Дата:
Сообщение: Re: dynamic shared memory and locks