Re: "PANIC: could not open critical system index 2662" - twice

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: "PANIC: could not open critical system index 2662" - twice
Дата
Msg-id 20230508210400.ke4n5flpaxljtchd@awork3.anarazel.de
обсуждение исходный текст
Ответ на Re: "PANIC: could not open critical system index 2662" - twice  (Evgeny Morozov <postgresql3@realityexists.net>)
Ответы Re: "PANIC: could not open critical system index 2662" - twice
Re: "PANIC: could not open critical system index 2662" - twice
Список pgsql-general
Hi,

On 2023-05-08 20:27:14 +0000, Evgeny Morozov wrote:
> On 8/05/2023 9:47 pm, Andres Freund wrote:
> > Did you have any occasions where CREATE or DROP DATABASE was interrupted?
> > Either due the connection being terminated or a crash?
>
> I've uploaded an edited version of the PG log for the time as
> https://objective.realityexists.net/temp/log-extract-2023-05-02.txt
> (test_behavior_638186279733138190 and test_behavior_638186280406544656
> are the two DBs that got corrupted).
>
> I cannot see any crash in the test logs or the PG logs, but whether it
> was interrupted is less clear. I don't know whether the the tests ran
> successfully up to the point where they tried to drop the DBs (I've
> since added logging to show that next time), but DROP DATABASE did not
> return after 30 seconds and the client library (Npgsql) then tried to
> cancel the requests. We then tried to drop the DB again, with the same
> results in both cases. After the second attempts timed out we closed the
> connections anyway - so maybe that was the interruption?

Yep, that is the interruption.  I suspect that something was not processing
interrupts, which then lead the WaitForProcSignalBarrier() in dropdb() to
block.

Are you using any extensions? Do you have any chance to figure out what
statements were running concurrently with the DROP DATABASE?


Seems we figured out what the problem is... Just need to figure out how to fix
it.

I think the minimal fix will entail moving the commit of the transaction into
dropdb(). We need to commit before executing DropDatabaseBuffers(), as the
database is inconsistent after that point.  Luckily DROP DATABASE already
can't be executed in a transaction, so that's easily possible.

That means we'd loose track of the files for the database in case of a crash,
but that's surely better than a corrupt database. And it's already the case
for CREATE DATABASE.


Additionally we probably need to prevent processing cancellations between
DropDatabaseBuffers() and remove_dbtablespaces(). But we can't just stop
accepting interrupts, because that'd break WaitForProcSignalBarrier(). Ah,
what fun.

Unfortunately QueryCancelHoldoffCount is not sufficient for this purpose - we
don't just need to prevent cancellations, we also can't accept termintions
either...

For a bit I thought we might be able to just reorder operations, moving the
WaitForProcSignalBarrier() earlier. Unfortunately I don't think that works,
because until DropDatabaseBuffers() has executed, backends might write such
buffers back, we need to do WaitForProcSignalBarrier() after that.


I've toyed with the idea of splitting DropDatabaseBuffers() into two. In a
first pass, mark all buffers as IO_IN_PROGRESS (plus perhaps some other
flag). That'd prevent new IO from being started for those buffers. Then
WaitForProcSignalBarrier(), to make sure nobody has an open FD for those
files. Then actually drop all those buffers and unlink the files. That now can
happen with interrupts held, without any chance of being blocked, afaict.  In
case of being cancelled during WaitForProcSignalBarrier(), AbortBufferIO()
would remove IO_IN_PROGRESS, and everything would be fine.

I don't like the idea of WaitForProcSignalBarrier() while having buffers
marked as in-progress. I don't *immediately* see a deadlock danger, but I'd be
unsurprised if there were some.

But perhaps a similar approach could be the solution? My gut says that the
rought direction might allow us to keep dropdb() a single transaction.

Greetings,

Andres Freund



В списке pgsql-general по дате отправления:

Предыдущее
От: Evgeny Morozov
Дата:
Сообщение: Re: "PANIC: could not open critical system index 2662" - twice
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: "PANIC: could not open critical system index 2662" - twice