Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From Alvaro Herrera
Subject Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Date
Msg-id 20150603190447.GE133018@postgresql.org
In reply to Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Andres Freund wrote:
> On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote:

> > One idea I had was: what if the oldestMulti pointed to another multi
> > earlier in the same 0046 file, so that it is read-as-zeroes (and the
> > file is created), and then a subsequent multixact truncate tries to read
> > a later page in the file.  In SlruPhysicalReadPage() this would give a
> > chance for open() to not fail, and then read() can fail as above.
> > However, we would have an earlier LOG message about "reading as zeroes".
> >
> > Really, the whole question of how this code goes past the open() failure
> > in SlruPhysicalReadPage baffles me.  I don't see any possible way for
> > the file to be created ...
>
> Wouldn't a previous WAL record zeroing another part of that segment
> explain this? A zero sized segment pretty much would lead to this error,
> right? Or were you able to check how things look after the failure?

But why would there be a previous WAL record zeroing another part of
that segment?  Note that this segment is very old -- hasn't been written
in quite a while, it's certainly not in slru buffers anymore.

> > 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
> > 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "00000001000073DD000000AD" from archive
> > 2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
> > 2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at offset 147456: Success.
> > 2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
> > 2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure
>
> From this it's not entirely clear where this error was triggered from.

Well, reading the code, it seems reasonable to assume that replay of
the checkpoint record I mentioned leads to that error message when the
file exists but is not long enough to contain the given offset.  There
are no MultiXact WAL records in the segment.  Also note that there's no
other "restored log file" message after the "entering standby mode"
message.
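
To make the "file exists but is not long enough" reading concrete, here is a
minimal sketch of the arithmetic, not the actual slru.c code.  It assumes the
default build values (8192-byte blocks, 4-byte offsets entries, 32 pages per
segment) and treats the number in the FATAL line as a multixact ID; the helper
names locate_offsets_entry and read_page are made up for illustration.  Under
those assumptions the ID maps to segment 0046 at byte offset 147456, matching
the DETAIL line, and a read at that offset from a zero-length file comes back
short without raising an error, which is consistent with the trailing
"Success".

```python
import os
import tempfile

BLCKSZ = 8192            # assumed default PostgreSQL block size
OFFSET_ENTRY_SIZE = 4    # assumed size of one offsets entry, in bytes
PAGES_PER_SEGMENT = 32   # assumed number of SLRU pages per segment file

ENTRIES_PER_PAGE = BLCKSZ // OFFSET_ENTRY_SIZE


def locate_offsets_entry(multi):
    """Return (segment file name, byte offset of the page) for a multixact ID."""
    page = multi // ENTRIES_PER_PAGE
    segment = page // PAGES_PER_SEGMENT
    page_in_segment = page % PAGES_PER_SEGMENT
    return "%04X" % segment, page_in_segment * BLCKSZ


def read_page(path, offset):
    """Try to read one page at the given offset; return the byte count obtained."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return len(os.pread(fd, BLCKSZ, offset))
    finally:
        os.close(fd)


segment, offset = locate_offsets_entry(4624559)
print(segment, offset)  # 0046 147456

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, segment)
    open(path, "wb").close()        # stand-in for a zero-length segment file
    print(read_page(path, offset))  # 0: short read, no OSError raised
```

The same short read would occur for any file shorter than offset + 8192 bytes,
so the sketch cannot distinguish a zero-length segment from a partially
written one.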

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

