Accidental removal of a file causing various problems

From: Pavan Deolasee
Hello All,

One of our customers reported a situation where one of the many segments backing a table went missing. They don't know why that happened and we couldn't determine if this could be a PostgreSQL bug or FS bug or simply an operator error.

The very first errors seen in the logs were:

WARNING:  could not write block 27094010 of base/56972584/56980980
DETAIL:  Multiple failures --- write error might be permanent.
ERROR:  could not open file "base/56972584/56980980.69" (target block 27094010): previous segment is only 12641 blocks
CONTEXT:  writing block 27094010 of relation base/56972584/56980980

As I said, while we don't know why the file "base/56972584/56980980.69" went missing, what happens thereafter is also very strange:

1. The user soon found out that they can no longer connect to any database in the cluster. Not just the one to which the affected table belonged, but no other database in the cluster. The affected table is a regular user table (actually a toast table).
2. So they restarted the database server. While that fixed the connection problem, they started seeing toast errors on the table to which the missing file belonged. The missing file was recreated at the database restart, but of course it was filled in with all zeroes, causing data corruption.
3. To make things worse, the corruption then got propagated to the standbys too. We don't know if the original file removal was replicated to the standby, but it seems unlikely.

I have a test case that reproduces all of these effects if a backend file is forcefully removed, but I haven't had enough time to figure out 1) whether the file removal itself is a bug in PostgreSQL and 2) whether any of the observed side effects point to a potential bug.

Is this worth pursuing? Or are these side effects well understood and known? IMHO, even if we accept that we can't do much about a missing file, it seems quite odd that both 1 and 3 happen.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Accidental removal of a file causing various problems

From: Tom Lane
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> 1. The user soon found out that they can no longer connect to any database
> in the cluster. Not just the one to which the affected table belonged, but
> no other database in the cluster. The affected table is a regular user
> table (actually a toast table).

Please define "can no longer connect".  What happened *exactly*?
How long did it take to start failing like that (was this perhaps a
shutdown-because-of-impending-wraparound situation)?

> 2. So they restarted the database server. While that fixed the connection
> problem, they started seeing toast errors on the table to which the missing
> file belonged. The missing file was recreated at the database restart,
> but of course it was filled in with all zeroes, causing data corruption.

Doesn't seem exactly surprising, if some toast data went missing.

> 3. To make things worse, the corruption then got propagated to the standbys
> too. We don't know if the original file removal was replicated to the
> standby, but it seems unlikely.

This is certainly unsurprising.

> I've a test case that reproduce all of these effects if a backend file is
> forcefully removed,

Let's see it.

Note that this:

> WARNING:  could not write block 27094010 of base/56972584/56980980
> DETAIL:  Multiple failures --- write error might be permanent.
> ERROR:  could not open file "base/56972584/56980980.69" (target block
> 27094010): previous segment is only 12641 blocks
> CONTEXT:  writing block 27094010 of relation base/56972584/56980980

does not say that the .69 file is missing.  It says that .68 (or, maybe,
some even-earlier segment) was smaller than 1GB, which is a different
matter.  Still data corruption, but I don't think I believe it was a
stray "rm".

Oh, and what PG version are we talking about?

            regards, tom lane


Re: Accidental removal of a file causing various problems

From: Pavan Deolasee


On Sat, Aug 25, 2018 at 12:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> 1. The user soon found out that they can no longer connect to any database
> in the cluster. Not just the one to which the affected table belonged, but
> no other database in the cluster. The affected table is a regular user
> table (actually a toast table).

Please define "can no longer connect".  What happened *exactly*?
How long did it take to start failing like that (was this perhaps a
shutdown-because-of-impending-wraparound situation)?

The errors were simply about the missing file. See the attached reproduction script that I created while studying this complaint. It will throw errors such as:

psql: FATAL:  could not open file "base/12669/16387": No such file or directory
CONTEXT:  writing block 207724 of relation base/12669/16387

Now of course, the file is really missing. But the user was quite surprised that they couldn't connect to any database, even though the mishap happened to a user table in one of their reporting databases. To add to their misery, while a restart fixed this error and opened their other databases for regular operation, it caused the toast corruption.

(The original report, of course, complained about whichever segment was missing there.)
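
Roughly, the script boils down to this (a sketch only, not the attached script verbatim; it assumes a throwaway cluster with $PGDATA set, default connection options, and an illustrative table name):

# create a table and dirty plenty of its buffers
psql -c "CREATE TABLE blackhole (id int, pad text)"
psql -c "INSERT INTO blackhole SELECT g, repeat('x', 100) FROM generate_series(1, 2000000) g"
psql -c "UPDATE blackhole SET pad = repeat('y', 100)"

# locate the file backing the table (the relfilenode may differ from the table's OID)
RELPATH=$(psql -Atc "SELECT pg_relation_filepath('blackhole')")

# simulate the mishap: remove the backing file while the server is running;
# processes that still hold it open keep writing into the unlinked inode, but
# anything that has to (re)open it, typically the checkpointer, fails
rm "$PGDATA/$RELPATH"

# force write-back; this fails, and the dirty buffers for the missing file
# stay stuck in shared buffers
psql -c "CHECKPOINT" || true

# new connections, to any database, may now die while trying to evict one of
# those dirty buffers, with a FATAL error like the one quoted above; a restart
# "fixes" connections but silently recreates the file zero-filled
psql -d postgres -c "SELECT 1"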
 

> 2. So they restarted the database server. While that fixed the connection
> problem, they started seeing toast errors on the table to which the missing
> file belonged. The missing file was recreated at the database restart,
> but of course it was filled in with all zeroes, causing data corruption.

Doesn't seem exactly surprising, if some toast data went missing.

My concern is about recreating a zero-filled file without even a warning. Is that OK? Is that behaviour necessary to handle some common scenario?
 

> 3. To make things worse, the corruption then got propagated to the standbys
> too. We don't know if the original file removal was replicated to the
> standby, but it seems unlikely.

This is certainly unsurprising.

Again, my worry is that we might have corrupted an otherwise good standby by recreating a zero-filled file and later inserting new data into those blocks. I wonder if we could have prevented that by requiring administrative intervention, instead of happily recreating the file and then overwriting it.
 

> I've a test case that reproduce all of these effects if a backend file is
> forcefully removed,

Let's see it.

Attached.
 

Note that this:

> WARNING:  could not write block 27094010 of base/56972584/56980980
> DETAIL:  Multiple failures --- write error might be permanent.
> ERROR:  could not open file "base/56972584/56980980.69" (target block
> 27094010): previous segment is only 12641 blocks
> CONTEXT:  writing block 27094010 of relation base/56972584/56980980

does not say that the .69 file is missing.  It says that .68 (or, maybe,
some even-earlier segment) was smaller than 1GB, which is a different
matter.  Still data corruption, but I don't think I believe it was a
stray "rm".

Hmm, interesting. It's a somewhat old report, but I remember doing the analysis and concluding that an entire segment went missing, not just some blocks in an intermediate segment. I might be wrong though. Will recheck.
 

Oh, and what PG version are we talking about?

I think this is reproducible on all versions I have tested so far, including master.
 
Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Accidental removal of a file causing various problems

From: Alvaro Herrera
On 2018-Aug-25, Pavan Deolasee wrote:

> The errors were simply about the missing file. See the attached
> reproduction script that I created while studying this complaint. It will
> throw errors such as:
> 
> psql: FATAL:  could not open file "base/12669/16387": No such file or
> directory
> CONTEXT:  writing block 207724 of relation base/12669/16387
> 
> Now of course, the file is really missing. But the user was quite surprised
> that they couldn't connect to any database, even though the mishap happened to
> a user table in one of their reporting databases.

Hmm, that sounds like there's a bunch of dirty pages waiting to be
written to that nonexistent file, and the error prevents the starting
backend from acquiring a free page on which to read something from disk
for another relation.
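
(If that's what is happening, pg_buffercache should show shared buffers dominated by dirty pages of that one relfilenode. A sketch of the check, assuming the extension can be installed and you still have an open superuser session; once the pool is saturated, opening a new connection is exactly the thing that starts failing:)

psql -c "CREATE EXTENSION IF NOT EXISTS pg_buffercache"
psql -c "SELECT reldatabase, relfilenode,
                count(*) AS buffers,
                count(*) FILTER (WHERE isdirty) AS dirty
         FROM pg_buffercache
         GROUP BY 1, 2
         ORDER BY dirty DESC
         LIMIT 10"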

> To add to their misery, while a restart fixed this error and opened
> their other databases for regular operation, it caused the toast
> corruption.

Sounds like the right data was in shared buffers, but it was lost in the
shutdown.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Accidental removal of a file causing various problems

From: Andres Freund
Hi,

On 2018-08-25 00:56:17 +0530, Pavan Deolasee wrote:
> > Oh, and what PG version are we talking about?
> >
> 
> I think this is reproducible on all versions I have tested so far,
> including master.

The point is not where you can cause trouble by explicitly deleting
files - that'll always screw up things - but where you encountered this
in the wild.

Greetings,

Andres Freund


Re: Accidental removal of a file causing various problems

From: Tom Lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2018-Aug-25, Pavan Deolasee wrote:
>> Now of course, the file is really missing. But the user was quite surprised
>> that they couldn't connect to any database, even though the mishap happened to
>> a user table in one of their reporting databases.

> Hmm, that sounds like there's a bunch of dirty pages waiting to be
> written to that nonexistent file, and the error prevents the starting
> backend from acquiring a free page on which to read something from disk
> for another relation.

Perhaps so --- but wouldn't this require that every buffer in shared
buffers now belong to the corrupted file?  Or have we broken the
allocation algorithm such that the same buffer keeps getting handed
out to every request?

I'm starting to wonder if this type of scenario needs to be considered
alongside the truncation corruption issues we're discussing nearby.
What do you do given a persistent failure to write a dirty block?
It's hard to see how you get to an answer that doesn't result in
(a) corrupted data or (b) a stuck database, neither of which is
pleasant.  But I think right now our behavior will lead to (b),
which is what this is reporting --- until you do stop -m immediate,
and then likely you've got (a).

            regards, tom lane


Re: Accidental removal of a file causing various problems

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> The point is not where you can cause trouble by explicitly deleting
> files - that'll always screw up things - but where you encountered this
> in the wild.

Actually, I think the main point is given that we've somehow got into
a situation like that, how do we get out again?

            regards, tom lane


Re: Accidental removal of a file causing various problems

From: Pavan Deolasee


On Sat, Aug 25, 2018 at 1:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Actually, I think the main point is given that we've somehow got into
a situation like that, how do we get out again?

Alvaro and I discussed this off-list a bit and came up with a couple of ideas.

1. Reserve some buffers in shared buffers for system-critical functionality. As this case shows, the failure to write blocks filled the entire shared buffer pool with unwritable dirty blocks, making the database completely inaccessible, even for remedial actions. So the idea is to set aside, say, the first 100 (or some such number) buffers for system catalogs and allocate buffers for user tables from the remaining pool. This will at least help ensure that one bad user table does not bring down the entire cluster. Of course, it may not help if the system catalogs themselves are unwritable, but that's probably a major issue anyway.

2. Provide either an automatic or a manual way to evict unwritable buffers to a spillover file or set of files. The buffer pool can then be rescued from the critical situation, and the DBA can inspect the spillover files and take corrective action, if needed and if feasible. My idea was to create a shadow relfilenode and write the buffers at their logical locations. Alvaro, though, thinks that writing one block per file (relfilenode/fork/block) is a better idea, since that gives the DBA an easy way to act on individual blocks; a sketch of that layout follows below. Irrespective of whether we pick one file per block or one per relfilenode, the more interesting question is: should this be automatic, or should it require administrative action?
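
As an illustration only (the directory name, file naming scheme, and inspection commands here are hypothetical, not an existing interface), the one-file-per-block variant could look something like this on disk:

# hypothetical spillover layout: one 8kB page per file, named so the DBA can
# see exactly which block it was, e.g.
#   $PGDATA/lost_buffers/<dboid>_<relfilenode>_<fork>_<blocknum>
#   $PGDATA/lost_buffers/56972584_56980980_main_27094010
# a page evicted this way can then be inspected with ordinary tools:
od -A d -t x1 "$PGDATA/lost_buffers/56972584_56980980_main_27094010" | head
# or, if pg_filedump is available, decoded into item/tuple details:
pg_filedump -i "$PGDATA/lost_buffers/56972584_56980980_main_27094010"

The DBA could then decide, block by block, whether a page is worth grafting back or should simply be discarded.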

Does either of the ideas sound interesting enough for further work? 

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services