Re: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows

From: Andres Freund
Subject: Re: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows
Msg-id: 20191004200605.yqcmn75otebwcvyj@alap3.anarazel.de
In reply to: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows  (PG Bug reporting form <noreply@postgresql.org>)
Responses: Re: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows  (Andres Freund <andres@anarazel.de>)
Re: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows  (Michael Paquier <michael@paquier.xyz>)
List: pgsql-bugs

Hi,

Thanks for the report!

On 2019-10-04 19:28:28 +0000, PG Bug reporting form wrote:
> We just moved our production cluster from pg 11.5 to pg 12.0 under windows
> using pg_dump/initdb/pg_restore.
>
> When we activated the replication slots by
>
> SELECT * FROM pg_create_physical_replication_slot('sam_repli_3');
>
> and tried restarting the server, we got a PANIC in error log:
>
> CPS PRD 2019-10-04 19:10:07 CEST  00000  1:> LOG:  database system was shut
> down at 2019-10-04 19:10:02 CEST
> CPS PRD 2019-10-04 19:10:07 CEST  XX000  2:> PANIC:  could not fsync file
> "pg_replslot/sam_repli_3/state": Bad file descriptor
> CPS PRD 2019-10-04 19:10:07 CEST  00000  6:> LOG:  startup process (PID
> 4028) was terminated by exception 0xC0000409
> CPS PRD 2019-10-04 19:10:07 CEST  00000  7:> HINT:  See C include file
> "ntstatus.h" for a description of the hexadecimal value.
> CPS PRD 2019-10-04 19:10:07 CEST  00000  8:> LOG:  aborting startup due to
> startup process failure
> CPS PRD 2019-10-04 19:10:07 CEST  00000  9:> LOG:  database system is shut
> down
>
> We use the EDB distribution from the website under Windows Server 2019
> (September 2019 patch level).
>
> select version ();
>                           version
> ------------------------------------------------------------
>  PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
> (1 Zeile)
>
> This seems to me like a fatal bug which makes the streaming replication
> unusable under Windows x64 /pg12.
>
> The same configuration worked flawlessly under pg 11.x until pg 11.5
>
> By searching on google we encountered a similar error from 2015 under pg
> 9.4.1 reported under BUG #13287:
>
> https://www.postgresql.org/message-id/flat/20150514105514.2691.67352%40wrigleys.postgresql.org

Uh, Michael? You just reintroduced this bug in

commit 82a5649fb9dbef12d04cd24799be6bf298d889a6
Author: Michael Paquier <michael@paquier.xyz>
Date:   2019-03-09 08:50:55 +0900

    Tighten use of OpenTransientFile and CloseTransientFile

    This fixes two sets of issues related to the use of transient files in
    the backend:
    1) OpenTransientFile() has been used in some code paths with read-write
    flags while read-only is sufficient, so switch those calls to be
    read-only where necessary.  These have been reported by Joe Conway.

You pretty much entirely reverted:

commit dfbaed459754e71e01bb0cc90a12802bba3f9786
Author: Andres Freund <andres@anarazel.de>
Date:   2015-04-28 00:12:38 +0200

    Use a fd opened for read/write when syncing slots during startup.

    Some operating systems, including the reporter's windows, return EBADFD
    or similar when fsync() is invoked on a O_RDONLY file descriptor.
    Unfortunately RestoreSlotFromDisk() does exactly that; which causes
    failures after restarts in at least some scenarios.

    If you hit the bug the error message will be something like
    ERROR: could not fsync file "pg_replslot/$name/state": Bad file descriptor

    Simply use O_RDWR instead of O_RDONLY when opening the relevant file
    descriptor to fix the bug.  Unfortunately I have no way of verifying the
    fix, but we've seen similar problems in the past.

    This bug goes back to 9.4 where slots were introduced. Backpatch
    accordingly.

    Reported-By: Patrice Drolet
    Bug: #13143:
    Discussion: 20150424101006.2556.60897@wrigleys.postgresql.org

I realize I perhaps should have added a comment explaining this, but
this is far from the only location that knows we have to open fds
r/w to be able to fsync them.
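
To illustrate the rule for readers of the archive, here is a minimal POSIX
sketch. It is not the PostgreSQL code: sync_existing_file() is a
hypothetical helper, whereas the backend goes through OpenTransientFile()
and its own fsync wrappers. It only shows the pattern of opening an existing
file O_RDWR before calling fsync() on it, since some platforms reject
fsync() on a descriptor opened O_RDONLY.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only (not PostgreSQL code).
 *
 * Flush an already existing file to stable storage. The descriptor is
 * opened O_RDWR rather than O_RDONLY, because some operating systems
 * fail fsync() on a read-only descriptor ("Bad file descriptor").
 *
 * Returns 0 on success, -1 on failure with errno set by the failing call.
 */
static int
sync_existing_file(const char *path)
{
    int fd;

    assert(path != NULL);

    /* Read/write access, even though nothing is written here. */
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    if (fsync(fd) != 0)
    {
        int saved_errno = errno;    /* close() may overwrite errno */

        close(fd);
        errno = saved_errno;
        return -1;
    }

    return close(fd);
}
```

On Linux the same call usually also succeeds with O_RDONLY, which is
presumably why a regression like this can go unnoticed on most development
machines.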

What were you even trying to fix by changing this?


Seems also pretty clear that we need a few buildfarm animals running with
fsync enabled. Not sure how best to write test infrastructure to make it
easy to set that for all tests. Guess I best start a thread about it on
-hackers.

Greetings,

Andres Freund


