Re: How to deal with corrupted database?

From: Craig Ringer
Subject: Re: How to deal with corrupted database?
Msg-id: 4EBB38C7.6030707@ringerc.id.au
In reply to: Re: How to deal with corrupted database?  ("Ruslan A. Bondar" <fsat@list.ru>)
Responses: Re: How to deal with corrupted database?  ("Ruslan A. Bondar" <fsat@list.ru>)
List: pgsql-admin
On 09/11/11 21:37, Ruslan A. Bondar wrote:

> This database isn't mission critical, so if you want - I can
> experiment on this.

Brilliant! It's rare for people to be able to investigate issues like
this. Most of the time they just have to get up and running ASAP, and
in the process they often destroy the data needed to investigate usefully.

> First issue was some kind of deadlock (concurrent insert and concurrent delete on a table). I saw them while reindexing
> the database.
> Also messages like this were in dmesg:

> [3681001.529385]  [<c128d717>] ? __mutex_lock_common+0xe8/0x13b
> [3681001.529401]  [<c128d779>] ? __mutex_lock_slowpath+0xf/0x11
> [3681001.529416]  [<c128d80a>] ? mutex_lock+0x17/0x24
> [3681001.529429]  [<c128d80a>] ? mutex_lock+0x17/0x24
> [3681001.529444]  [<c10bc2a3>] ? generic_file_llseek+0x17/0x44
> [3681001.529458]  [<c10bc28c>] ? generic_file_llseek+0x0/0x44
> [3681001.529473]  [<c10bb145>] ? vfs_llseek+0x30/0x34
> [3681001.529487]  [<c10bc1a1>] ? sys_llseek+0x3a/0x7a
> [3681001.529501]  [<c1008efc>] ? syscall_call+0x7/0xb

OK, so it was hung waiting for a lock within the VFS layer. That's
rather odd at best.
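
To make the call chain in that trace easier to read, here is a minimal sketch that pulls the symbol names out of the quoted dmesg lines and prints them innermost first. The regular expression and the function name `parse_frames` are illustrative assumptions that only target the line format shown above; this is not a general kernel-log parser.

```python
import re

# Matches frame lines of the form quoted above, e.g.
#   [3681001.529385]  [<c128d717>] ? __mutex_lock_common+0xe8/0x13b
# (assumed format; other kernels print traces differently)
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+\[<[0-9a-f]+>\]\s+(?:\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_frames(dmesg_text):
    """Return the symbol name of every stack frame line, in log order."""
    symbols = []
    for line in dmesg_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            symbols.append(match.group("symbol"))
    return symbols


trace = """\
> [3681001.529385]  [<c128d717>] ? __mutex_lock_common+0xe8/0x13b
> [3681001.529401]  [<c128d779>] ? __mutex_lock_slowpath+0xf/0x11
> [3681001.529416]  [<c128d80a>] ? mutex_lock+0x17/0x24
> [3681001.529429]  [<c128d80a>] ? mutex_lock+0x17/0x24
> [3681001.529444]  [<c10bc2a3>] ? generic_file_llseek+0x17/0x44
> [3681001.529458]  [<c10bc28c>] ? generic_file_llseek+0x0/0x44
> [3681001.529473]  [<c10bb145>] ? vfs_llseek+0x30/0x34
> [3681001.529487]  [<c10bc1a1>] ? sys_llseek+0x3a/0x7a
> [3681001.529501]  [<c1008efc>] ? syscall_call+0x7/0xb
"""

frames = parse_frames(trace)
# Collapse consecutive duplicates (mutex_lock and generic_file_llseek repeat).
chain = [s for i, s in enumerate(frames) if i == 0 or s != frames[i - 1]]
print(" <- ".join(chain))
```

Read left to right, the output starts at the mutex code and ends at the llseek system call entry, which is the "waiting for a lock within the VFS layer" picture described above.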

Your system details are:

OS: Linux Debian Sid
postgres version: 9.0.4
uname: Linux udb 2.6.32-5-xen-686 #1 SMP Tue Oct 19 17:09:04 UTC 2010
i686 GNU/Linux

so you're running a Xen kernel (guest? Or host? I'm assuming guest) on a
very fresh kernel on the unstable (Sid) branch of Debian. I won't be
entirely shocked if this is a kernel issue.

I don't see any obvious deadlock reports on a search for

  "vfs_llseek" OR "generic_file_llseek" deadlock

but

  "vfs_llseek" OR "generic_file_llseek" mutex_lock

finds:

  https://bugzilla.redhat.com/show_bug.cgi?id=716991 (unrelated?)

and more interestingly:

  http://postgresql.1045698.n5.nabble.com/Load-Increase-td4269457.html

... which unfortunately doesn't give any OS/kernel info, but is another
recent report.

There have been some recent changes in file system locking:

  http://lwn.net/Articles/448038/

so I'd appreciate it if you could pass this report on to the kernel
folks involved in case they want to investigate further.



> So I've stopped the software that caused these inserts and deletes, but reindexing shows the same warnings. I've restarted
> the postgresql server.

How did you restart PostgreSQL?

If there were backends hung in the VFS, did they eventually terminate by
themselves? If not, did you terminate them yourself? How? With a signal
(kill)? Which signal? Some other way?
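
For reference, a process blocked inside the kernel on a mutex normally shows up in state "D" (uninterruptible sleep), and signals are not acted on until it leaves that state. Below is a rough sketch for listing such processes on Linux by reading /proc/<pid>/stat. The helper names `parse_stat` and `uninterruptible_processes` are made up for illustration, and the sketch assumes the usual Linux layout of the stat file (pid, command name in parentheses, then the state letter).

```python
import os


def parse_stat(stat_text):
    """Return (pid, command name, state letter) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces or
    # ')' itself, so split on the first '(' and the last ')'.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, command) pairs for processes currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or unexpected format
        if state == "D":
            found.append((pid, comm))
    return found


if os.path.isdir("/proc"):
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

A postgres backend that stays in this list across several runs would fit the "hung in the VFS" description; a brief appearance is normal during ordinary disk I/O.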

> Postgres restarted successfully, but the database became inaccessible. Filesystem is clean. File base/16387/86057840
> exists but is zero length. File pg_subtrans/00F2 does not exist.

Hm, ok. I'm a bit suspicious of the deadlock in the kernel. It isn't
necessarily a kernel issue, but given the system you're running on I
won't be too surprised if it is either. There's a fairly good chance the
trigger for this was a kernel issue munching your data.
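
To see whether base/16387/86057840 is the only truncated file, one option is to scan the data directory for zero-length files. This is only a sketch: `zero_length_files` is a hypothetical helper, and the demo below builds a throwaway directory rather than touching a real cluster. Bear in mind that a zero-length relation file is perfectly normal for an empty table, so a hit is a lead to follow up, not proof of damage.

```python
import os
import tempfile


def zero_length_files(base_dir):
    """Return sorted paths (relative to base_dir) of all zero-length files."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(base_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) == 0:
                    empty.append(os.path.relpath(path, base_dir))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(empty)


# Demo on a temporary directory that mimics the layout from the report.
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "16387"))
    # One empty file and one non-empty file (file names are illustrative).
    open(os.path.join(demo_dir, "16387", "86057840"), "w").close()
    with open(os.path.join(demo_dir, "16387", "86057841"), "w") as handle:
        handle.write("not empty")
    demo_result = zero_length_files(demo_dir)
    print(demo_result)
```

On a real system this would be pointed at the `base` directory of a copy of the data directory, taken while the server is stopped.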

Are you able to reproduce this issue with another instance of PostgreSQL
running with a freshly created database cluster (initdb)?

> 2011-11-09 16:25:04 MSK FATAL:  xlog flush request 171/1B1374E0 is not satisfied --- flushed only to 171/19C26010
> 2011-11-09 16:25:04 MSK CONTEXT:  writing block 0 of relation base/16385/86064815_vm
> 2011-11-09 16:25:04 MSK LOG:  startup process (PID 3570) exited with exit code 1
> 2011-11-09 16:25:04 MSK LOG:  aborting startup due to startup process failure

I don't know enough about Pg's guts to suggest how to proceed from here.
Maybe a pg_resetxlog might get you up and running (albeit with potential
data damage) but I'm not sure.
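
To put a number on the FATAL message: as I read it, the block being written asks for WAL to be flushed up to 171/1B1374E0, but WAL only goes as far as 171/19C26010. The sketch below works out how far apart those two locations are. It assumes the "logid/offset" hexadecimal notation shown in the log and only handles the case where both locations share the same log id, as they do here; `parse_lsn` and `gap_within_log` are illustrative names, not PostgreSQL functions.

```python
def parse_lsn(text):
    """Split an 'XXX/YYYYYYYY' WAL location into (log id, byte offset)."""
    log_id, offset = text.split("/")
    return int(log_id, 16), int(offset, 16)


def gap_within_log(requested, flushed):
    """Bytes between two WAL locations that share the same log id."""
    req_log, req_off = parse_lsn(requested)
    fl_log, fl_off = parse_lsn(flushed)
    if req_log != fl_log:
        # Crossing log ids needs version-specific arithmetic; out of scope.
        raise ValueError("locations are in different log ids")
    return req_off - fl_off


gap = gap_within_log("171/1B1374E0", "171/19C26010")
print(f"{gap} bytes (~{gap / 2**20:.1f} MiB)")
```

That works out to roughly 21 MiB of WAL that the data page refers to but that is not present, which is consistent with recent writes having been lost.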

--
Craig Ringer
