Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)

From: Benedikt Grundmann
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date:
Msg-id: CADbMkNPP0xA2S+b625my3YznBcToZkCaVktTaaP4H5ZX_WLdSw@mail.gmail.com
In response to: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
List: pgsql-hackers
The file is present on the production database and on the streaming replica, but not on the snapshot.

production
-rw------- 1 postgres postgres 312778752 May 13 21:28 /database/postgres/base/16416/291498116.3

streaming replica
-rw------- 1 postgres postgres 312778752 May 13 23:50 /database/postgres/base/16416/291498116.3

Is there a way to find out what the file contains?
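
(We assume a lookup along the following lines would map the filenode back to a relation, as long as the relation has not been rewritten in the meantime; please correct us if that is wrong.)

# Sketch only: map the filenode from the error back to a pg_class entry
# (run in the database that owns OID 16416; in the log below that is proddb_testing).
psql -d proddb_testing -c "
  SELECT oid::regclass AS relation, relfilenode
  FROM pg_class
  WHERE relfilenode = 291498116;"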

We just got some more information.  All of the following was done on, or seen in the logs of, the snapshot database.

After we saw this we ran a VACUUM FULL on the table we suspect is backed by this file.  This happened:


WARNING:  concurrent insert in progress within table "js_equity_daily_diff"

Over and over again.  So we killed the VACUUM FULL by hitting Ctrl-C again and again.  After that the logs contained:

fgrep ERROR postgresql-2013-05-14.csv  | head
2013-05-14 00:23:16.028 EDT,"postgres_prod","proddb_testing",15698,"172.24.65.103:54536",5191bc31.3d52,3,"SELECT",2013-05-14 00:23:13 EDT,2/3330,0,ERROR,58P01,"could not open file ""base/16416/291498116.3"" (target block 431006): No such file or directory",,,,,"SQL statement ""select max(equity_date) from js_equity_daily_diff""

The above was the original error that made us investigate.
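
For what it's worth, assuming the default 8 kB block size and 1 GB (131072-block) segments, block 431006 does fall into the ".3" segment, and the 312778752-byte copies on production and the replica are large enough (38181 blocks) to contain it:

# Back-of-the-envelope check, assuming 8 kB blocks and 1 GB segments:
echo $(( 431006 / 131072 ))     # segment number -> 3, i.e. the ".3" file
echo $(( 431006 % 131072 ))     # block within that segment -> 37790
echo $(( 312778752 / 8192 ))    # blocks in the 312778752-byte segment -> 38181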

2013-05-14 09:21:47.121 EDT,"postgres_prod","proddb_testing",21002,"172.27.41.24:36815",51923a67.520a,38756,"VACUUM",2013-05-14 09:21:43 EDT,2/21611,645995272,ERROR,57014,"canceling statement due to user request",,,,,,"vacuum full js_equity_daily_diff",,,"psql"
2013-05-14 09:22:04.700 EDT,"postgres_prod","proddb_testing",21008,"172.27.41.24:36831",51923a75.5210,115909,"VACUUM",2013-05-14 09:21:57 EDT,2/21618,645995273,ERROR,57014,"canceling statement due to user request",,,,,,"vacuum full js_equity_daily_diff",,,"psql"

That was us cancelling the VACUUM FULL.

2013-05-14 09:22:13.947 EDT,,,30911,,51919d78.78bf,1,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/31266090",,,,,"writing block 0 of relation base/16416/291498116",,,,""
2013-05-14 09:22:14.964 EDT,,,30911,,51919d78.78bf,2,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/31266090",,,,,"writing block 0 of relation base/16416/291498116",,,,""


And after that, these started appearing in the logs (they are now repeated every second):

[root@nyc-dbc-001 pg_log]# fgrep ERROR postgresql-2013-05-14.csv  | tail -n 2
2013-05-14 09:47:43.301 EDT,,,30911,,51919d78.78bf,3010,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
2013-05-14 09:47:44.317 EDT,,,30911,,51919d78.78bf,3012,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""

There are no earlier ERRORs in the logs.  The errors are accompanied by warnings like this one:

2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1868,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1869,,2013-05-13 22:12:08 EDT,,0,WARNING,58030,"could not write block 0 of base/16416/291498116","Multiple failures --- write error might be permanent.",,,,,,,,""
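
If it helps, we can also compare how far the snapshot instance's control file thinks WAL goes against the 1D08/9B57FCD0 LSN in those errors; roughly like this (data directory path as in the ls output above):

# Sketch: inspect the snapshot instance's control file for its checkpoint /
# minimum-recovery WAL locations and compare them with 1D08/9B57FCD0.
pg_controldata /database/postgres | grep -iE 'checkpoint location|redo location|minimum recovery'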

The disk is not full, nor are there any messages in the kernel logs.
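
(The kind of checks we mean are roughly the following; nothing there looks suspicious.)

# Roughly what "disk not full, nothing in the kernel logs" is based on:
df -h /database
dmesg | tail -n 50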

Cheers,

Bene



On Tue, May 14, 2013 at 9:27 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 14.05.2013 14:57, Benedikt Grundmann wrote:
Today we have seen this on our testing database instance:

ERROR:  could not open file "base/16416/291498116.3" (target block 431006):
No such file or directory

That database gets created by rsyncing the LVM snapshot of the standby,
which is a read-only backup of proddb using streaming replication.

We do not put the standby database into backup mode before rsyncing the LVM
snapshot, trusting Postgres's ability to recover from a crash.  Now we are not
sure anymore...  Thoughts?  It's worth noting that we have been using this
method of taking a backup for a long, long time, but only recently (after 9.2
came out) did we switch to a streaming replica and to taking the "frozen
snapshot" backup from the streaming replica.

Let me know if there is more information you need, or if this is a known
problem.

That certainly should work. I'd suggest that you narrow down the problem a bit more. Is the file present in the original system? Anything else that might be related in the logs?

- Heikki
