Re: Some pgq table rewrite incompatibility with logical decoding?

From: Tomas Vondra
Subject: Re: Some pgq table rewrite incompatibility with logical decoding?
Date:
Msg-id: 68498409-2b63-8aa6-4d03-63c85393ed10@2ndquadrant.com
In reply to: Re: Some pgq table rewrite incompatibility with logical decoding?  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: Some pgq table rewrite incompatibility with logical decoding?  (Jeremy Finzel <finzelj@gmail.com>)
List: pgsql-hackers
Hi,

On 08/29/2018 12:01 AM, Tomas Vondra wrote:
> On 08/28/2018 07:41 PM, Jeremy Finzel wrote:
>>         Jeremy, are you able to reproduce the issue locally, using pgq?
>>         That would be very valuable.
>>
>>
>> Tomas et al:
>>
>> We have hit this error again, and we plan to snapshot the database so as
>> to be able to do whatever troubleshooting we can.  If someone could provide
>> me guidance as to what exactly you would like me to do, please let me
>> know.  I am able to provide an xlog dump and also debugging information
>> upon request.
>>
>> This is actually a different database system that also uses skytools,
>> and the exact same table (pgq.event_58_1) is again the cause of the
>> relfilenode error.  I did a point-in-time recovery to a point after this
>> relfilenode appears (located with pg_xlogdump), and verified this was the
>> table that appeared and then disappeared.
>>
> 
> Great!
> 
> Can you attach to the decoding process using gdb, set a breakpoint on
> the elog(ERROR) at reorderbuffer.c:1599, and find out at which LSN /
> record it fails?
> 
> https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/replication/logical/reorderbuffer.c#L1599
> 
> If it fails too fast, making it difficult to attach gdb before the
> crash, adding the LSN to the log message might be easier.
> 
> Once we have the LSN, it would be useful to see the pg_xlogdump
> before/around that position.
> 
> Another interesting piece of information would be to know the contents
> of the relmapper cache (which essentially means stepping through
> RelationMapFilenodeToOid or something like that).
> 

Have you managed to get the backtrace, or to investigate where exactly it
fails (which LSN etc.)? We've managed to get a backtrace for "our"
failure, and it'd be interesting to compare those.
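
In case it helps, a gdb session along these lines should get both the
backtrace and the failing LSN (just a sketch - the breakpoint line matches
REL9_6_STABLE, the PID is an example, and with an optimized build some of
the variables may not be printable):

    # attach to the walsender / backend doing the decoding
    gdb -p 12345
    (gdb) break reorderbuffer.c:1599
    (gdb) continue

    # once the breakpoint triggers
    (gdb) bt
    (gdb) print change->data.tp.relnode
    (gdb) printf "%X/%X\n", (unsigned int) (change->lsn >> 32), (unsigned int) change->lsn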

Attached is a subset of the pg_waldump output for the two relevant
transactions. 2554301859 is the transaction doing VACUUM FULL on the user
table (so essentially the pgq table), and 2554301862 (with a single
subxact, likely due to an exception handled in a plpgsql function) is the
transaction that fails during decoding - on the very first WAL record
after the 2554301859 commit.

In reality there are records from additional transactions intermixed, but
those are irrelevant, I believe. It's 2554301862 that fails, because it
commits first (and thus gets decoded first).
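
If you want to pull out a comparable subset on your side, on 9.6 something
like this should work (the WAL directory and LSN range are placeholders,
and the XIDs here are the two from our failure, so substitute your own):

    pg_xlogdump --path=$PGDATA/pg_xlog \
        --start=AB/12000000 --end=AB/13000000 --xid=2554301859
    pg_xlogdump --path=$PGDATA/pg_xlog \
        --start=AB/12000000 --end=AB/13000000 --xid=2554301862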

At first I thought this might be related to the "could not read block 3"
issue discussed in another thread, but if you say upgrading to 9.6.10
fixes this, that seems unlikely (because that issue is still present in
9.6.10).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments
