Обсуждение: [ADMIN] Fixing OID directory names after a fsck

Поиск
Список
Период
Сортировка

[ADMIN] Fixing OID directory names after a fsck

От
Richard Neill
Дата:
Hi all,

Our virtual server (running 9.4) suffered a disk corruption outage due 
to a flaky SAN. As a result, all the contents of /var/lib/postgresql 
ended up (along with lots of other things) in lost+found with names like 
#1835289.

By looking for directories containing PG_VERSION (and checking the 
version number), I found 9  directories named like: 
./#1835987/PG_VERSION. These look similar to other top-level dirs within
main/base, and "strings" shows contents that does indeed look like I've 
identified the right thing.

So, I made a clean install of Postgres 9.4 on a new machine (Ubuntu 
17.04), and copied the directories into 
/var/lib/postgresql/9.4/main/base. I now have:

root@treacle:/var/lib/postgresql/9.4/main/base# ls
1  12172  12177  16384  1828647  1835009  1835289  1835317  1835987 
1843977  1844229  1844901  1958920

where the first 3 directories are the system defaults, 16384 is a test 
that I created, and 1828647 - 1958920 are the recovered top-level 
database directories (but with the wrong names). However, Postgresql 
doesn't recognise them.
From what I've been able to understand, if I could find out what the 
database OIDs used to be, I could just rename the 1828647 etc to the 
correct names, and it would work. But I'm not sure how to do this... all 
the documents are about how to find the file if you know the OID, rather 
than how to set the OID to match the file.  I tried the "obvious" step 
of creating a new database called "rescue", getting 16384, stopping 
postgresql, renaming one of my backups to 16384, and restarting - but 
that doesn't help.

Can anyone here give me a pointer?

Thanks very much for your help,

Richard

P.S. I've just joined, and I think I'm posting this to the right place
but if not, please let me know.


-- 
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: [ADMIN] Fixing OID directory names after a fsck

От
Tom Lane
Дата:
Richard Neill <postgresql@richardneill.org> writes:
> Our virtual server (running 9.4) suffered a disk corruption outage due 
> to a flaky SAN. As a result, all the contents of /var/lib/postgresql 
> ended up (along with lots of other things) in lost+found with names like 
> #1835289.

> By looking for directories containing PG_VERSION (and checking the 
> version number), I found 9  directories named like: 
> ./#1835987/PG_VERSION. These look similar to other top-level dirs within
> main/base, and "strings" shows contents that does indeed look like I've 
> identified the right thing.

> So, I made a clean install of Postgres 9.4 on a new machine (Ubuntu 
> 17.04), and copied the directories into 
> /var/lib/postgresql/9.4/main/base. I now have:

> root@treacle:/var/lib/postgresql/9.4/main/base# ls
> 1  12172  12177  16384  1828647  1835009  1835289  1835317  1835987 
> 1843977  1844229  1844901  1958920

> where the first 3 directories are the system defaults, 16384 is a test 
> that I created, and 1828647 - 1958920 are the recovered top-level 
> database directories (but with the wrong names). However, Postgresql 
> doesn't recognise them.

This is unlikely to work unless you can also recover the pg_clog
contents, which unfortunately might be pretty difficult to identify.

(In any case, putting the cluster back into production is way too
scary.  If you can start it, with autovacuum off, and dump the data,
I'd recommend doing that and reloading.)
        regards, tom lane


-- 
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: [ADMIN] Fixing OID directory names after a fsck

От
Richard Neill
Дата:
Hi Tom,

Thanks for your help - and speedy reply.

>> So, I made a clean install of Postgres 9.4 on a new machine (Ubuntu
>> 17.04), and copied the directories into
>> /var/lib/postgresql/9.4/main/base. I now have:
> 
>> root@treacle:/var/lib/postgresql/9.4/main/base# ls
>> 1  12172  12177  16384  1828647  1835009  1835289  1835317  1835987
>> 1843977  1844229  1844901  1958920
> 
>> where the first 3 directories are the system defaults, 16384 is a test
>> that I created, and 1828647 - 1958920 are the recovered top-level
>> database directories (but with the wrong names). However, Postgresql
>> doesn't recognise them.
> 
> This is unlikely to work unless you can also recover the pg_clog
> contents, which unfortunately might be pretty difficult to identify.

I can't find it. However, the failure was in the middle of the night, 
and I think there's a pretty decent chance that postgres was idle at the 
time - there was "probably" no transaction in progress, and the WAL 
should have committed. The DBs concerned were all for our wikis, so the 
structure will also be comparatively simple (and some risk of 
data-errors would be tolerable in this context). There is ~ 400 MB of 
data in total, almost all of it in one of the directories.

So, I think it's worth a try - but I'm not sure how to proceed next.

> (In any case, putting the cluster back into production is way too
> scary.  If you can start it, with autovacuum off, and dump the data,
> I'd recommend doing that and reloading.)

Definitely! I was going to do that anyway. The point of the new machine 
was just to ensure that Postgres itself is known to be in a sane state. 
But as soon as I can get anything to work, I'll do a dump and restore.

Thanks very much,

Richard

> 
>             regards, tom lane
> 


-- 
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: [ADMIN] Fixing OID directory names after a fsck

От
Tom Lane
Дата:
Richard Neill <postgresql@richardneill.org> writes:
>> This is unlikely to work unless you can also recover the pg_clog
>> contents, which unfortunately might be pretty difficult to identify.

> I can't find it. However, the failure was in the middle of the night, 
> and I think there's a pretty decent chance that postgres was idle at the 
> time - there was "probably" no transaction in progress, and the WAL 
> should have committed.

That's all fine, but it only means that the right transactions would
be marked committed in pg_clog.  There's no very good reason to suppose
that that marking has propagated back to the heap tuples in all cases.

Actually though, I suspect a big chunk of your problem is that even
tuples that are hinted committed will seem to be "in the future".
You'd need to also recover pg_control, or at least reset it to something
approximating the current XID counter, before stuff is likely to look
sane.

If you don't have pg_control, you might be able to dig around in the most
recently modified tables and guess at the current XID by looking for the
largest XIDs you can find.

But, TBH, this is getting way past what you're likely to successfully
pull off if you've never done it before.  There are folks who do PG
data recovery professionally [not me], and if your data is worth money
to you, you'd be best off hiring someone who's done this before.  If
it's not, chalk this up to a learning experience, and set up a better
backup system.
        regards, tom lane


-- 
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin