Thread: Fwd: corrupted files


Fwd: corrupted files

From
Klaus Ita
Date:
Sorry for cross-posting; I read that pg-bug was not the right place for this email.

Hi list!

A depressed me is getting error messages like these:

2013-07-29 20:57:09 UTC <xaxos_mailer%xaxos_de> ERROR:  could not access status of transaction 8393477
2013-07-29 20:57:09 UTC <xaxos_mailer%xaxos_de> DETAIL:  Could not open file "pg_clog/0008": No such file or directory.

together with the error output of the queries that fail.

I looked in pg_clog and, sure enough, 0008 is missing.
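
As a side note on why the error names file 0008: the segment follows from the transaction ID by plain arithmetic. The sketch below assumes the default 8 kB block size and the constants from the 9.1 sources (2 status bits per transaction, 32 pages per segment); `clog_location` is a hypothetical helper name, not a PostgreSQL function.

```python
# Constants as in the PostgreSQL 9.1 sources, assuming the default 8 kB block size.
BLCKSZ = 8192
CLOG_BITS_PER_XACT = 2
CLOG_XACTS_PER_BYTE = 4
SLRU_PAGES_PER_SEGMENT = 32

XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE                # 32768
XACTS_PER_SEGMENT = XACTS_PER_PAGE * SLRU_PAGES_PER_SEGMENT  # 1048576


def clog_location(xid):
    """Return (segment file name, byte offset in file, bit shift in byte)."""
    segment = xid // XACTS_PER_SEGMENT
    within = xid % XACTS_PER_SEGMENT
    byte_offset = within // CLOG_XACTS_PER_BYTE
    bit_shift = (within % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT
    return "%04X" % segment, byte_offset, bit_shift


print(clog_location(8393477))  # ('0008', 1217, 2)
```

So transaction 8393477 lives in segment 0008, which matches the file named in the DETAIL line.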

On this Linux machine (3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 GNU/Linux) I am using XFS on RAID1 on a MegaCLI RAID controller with 16 disks and no battery, which is why write-through is enabled, with no caching.

Before I got the error, I had quite extensively created indexes inside transactions and dropped them again within the same transactions, to do fast deletes (foreign key constraints). Could that be related?

It might be that the memory on the server is faulty. I don't know, but I think it's the only 'cheap' part in the whole setup.

* I tried to bring up one of the warm standbys, but one complains about not belonging to the same pg cluster as the WAL files. The other, a hot standby, won't start for some locale reason.
(It's not that I did not have backups ;) )

The cluster is 'working': I get the error about once per second, but the other clients seem fine, so it's really only a few tables that are corrupted. I cannot really take down the machine, as it's quite a busy cluster handling a few million queries a day.

Before the current error, I got an error that XXXXX.1 was missing, which was (luckily) an index file that I could recreate via 'reindex'. I fear we are now at a table / transaction corruption that I cannot just 'rewrite'.

I would not at all mind just discarding all those transactions that have accumulated in pg_clog:

postgres@pgmaster:~/9.1/main/pg_clog$ ls -alrt | wc -l
180
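
Note that `ls -al` also prints a `total` line plus the `.` and `..` entries, so 180 lines correspond to 177 files, and the count alone does not show which segment is absent. A minimal sketch for spotting gaps follows; `missing_segments` is a hypothetical helper, and it assumes segment files have four-digit uppercase hexadecimal names such as 0008. Segments older than the lowest one present are normally removed by the server itself, so only gaps between the lowest and highest are reported.

```python
HEX_DIGITS = set("0123456789ABCDEF")


def missing_segments(names):
    """Return segment names absent between the lowest and highest one present."""
    present = sorted(
        int(name, 16)
        for name in names
        if len(name) == 4 and set(name) <= HEX_DIGITS
    )
    if not present:
        return []
    have = set(present)
    return ["%04X" % i for i in range(present[0], present[-1] + 1) if i not in have]


# Example with made-up directory contents:
print(missing_segments(["0007", "0009", "000A"]))  # ['0008']
```

On a real system the input could come from `os.listdir()` on the pg_clog directory, which only reads the directory and changes nothing.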



Is there any way, even with data loss, to get rid of those transactions and just let the cluster behave again? It's serving some web apps for users, so some minor data loss will not be an issue.

Quite desperate...



postgres@[local]:5432 [postgres] # select version();
                                           version                                            
----------------------------------------------------------------------------------------------
 PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit
(1 row)



Customized options:

#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------

#custom_variable_classes = ''           # list of custom variable class names

listen_addresses = '*'          # what IP address(es) to listen on;
max_connections = 320                   # (change requires restart)
timezone = 'Etc/UTC'

shared_buffers = 2GB                    # min 128kB
maintenance_work_mem = 250MB
checkpoint_completion_target = 0.9 
effective_cache_size = 20GB
effective_io_concurrency = 6            # 1-1000. 0 disables prefetching

archive_mode    = on

archive_command = '/opt/postgres_archive_command.pl --file_path=%p --file_name=%f --work_dir=/var/tmp/ --destination_hosts=va-pg-backups@dx.ipv6.ex.net --destination_sftp_hosts=u671@ipv6.u71.y --destination_hosts=va-pg-backups@y7.ipv6.ex.net'

max_wal_senders   = 3   # max number of walsender processes
wal_keep_segments = 50  # in logfile segments, 16MB each; 0 disables





thx in advance,

klaus


Re: Fwd: corrupted files

From
raghu ram
Date:

On Tue, Jul 30, 2013 at 4:07 AM, Klaus Ita <klaus@worstofall.com> wrote:
Sorry for cross-posting; I read that pg-bug was not the right place for this email.

Hi list!

A depressed me is getting error messages like these:

2013-07-29 20:57:09 UTC <xaxos_mailer%xaxos_de> ERROR:  could not access status of transaction 8393477
2013-07-29 20:57:09 UTC <xaxos_mailer%xaxos_de> DETAIL:  Could not open file "pg_clog/0008": No such file or directory.

together with the error output of the queries that fail.

I looked in pg_clog and, sure enough, 0008 is missing.



You can recreate a missing "pg_clog" file with the command below:

dd if=/dev/zero of=~/9.1/main/pg_clog/0008 bs=256k count=1

(A zero-filled file makes the transactions in that segment appear as not committed.)

and then try to start the cluster.
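
To illustrate what the zero-filled file means: a segment is 32 pages of 8 kB, i.e. 256 kB, which is where `bs=256k count=1` comes from, and each transaction occupies two status bits. The sketch below works purely in memory and touches no files; it assumes the default 8 kB block size, and `xact_status` is a hypothetical helper, not part of PostgreSQL.

```python
SEGMENT_BYTES = 32 * 8192  # 32 pages of 8 kB = 256 kB, same size as the dd output

# Two-bit status codes as defined in the PostgreSQL sources (clog.h).
STATUS = {
    0b00: "in progress",
    0b01: "committed",
    0b10: "aborted",
    0b11: "sub-committed",
}


def xact_status(segment, xid_in_segment):
    """Decode the status of one transaction from raw segment bytes."""
    byte = segment[xid_in_segment // 4]  # four transactions per byte
    shift = (xid_in_segment % 4) * 2     # two bits per transaction
    return STATUS[(byte >> shift) & 0b11]


zeroed = bytes(SEGMENT_BYTES)     # what dd if=/dev/zero would produce
print(xact_status(zeroed, 4869))  # in progress
```

Here 4869 is the position of transaction 8393477 inside segment 0008 (8393477 minus 8 * 1048576). Every transaction in a zeroed segment reads back with status bits 00, so none of them is seen as committed, including any that really had committed.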

Thanks & Regards
Raghu Ram

Re: Fwd: corrupted files

From
Klaus Ita
Date:
Hi!

Thank you. I actually tried that, and it seems it only led to even more corrupted data. I am currently trying to recover the 'hot-standby' host, which is also unhappy about one of the WAL files. Looking at that WAL file with less, I see only data I do not care about in it (mostly session-logging/statistics data).

I am trying to remember: there was a tool that displayed the contents of the WAL files in a more readable format ...

lg,k




Re: Fwd: corrupted files

From
bricklen
Date:
On Mon, Jul 29, 2013 at 11:50 PM, Klaus Ita <klaus@worstofall.com> wrote:
I am trying to remember: there was a tool that displayed the contents of the WAL files in a more readable format ...

xlogdump?

Re: Fwd: corrupted files

From
Klaus Ita
Date:
Yes, that's it!

Thank you! It turned out that there really was a corruption in the main pg server, which was 'virally' propagated to

1. the streaming replica
2. the replaying WAL receiver
3. an old backup that tried to replay the WALs

I really thought that with a master and 3 backups I'd be safe.

lg,k






Re: Fwd: corrupted files

From
bricklen
Date:

On Tue, Jul 30, 2013 at 8:18 AM, Klaus Ita <klaus@worstofall.com> wrote:

Thank you! It turned out that there really was a corruption in the main pg server, which was 'virally' propagated to

1. the streaming replica
2. the replaying WAL receiver
3. an old backup that tried to replay the WALs

I really thought that with a master and 3 backups I'd be safe.


Physical corruption in the master, or logical?

Re: Fwd: corrupted files

From
Klaus Ita
Date:
I guess logical, caused by whatever. I really cannot say: the WAL files all *look* OK, yet they lead to a situation that's a definite dead end.
We did have a hard-drive failure (one in 13) at the time, but thanks to RAID5 + hot spare no data should have been corrupted. I mean, it's an LSI controller... I'm not fond of it, but it's not bad stuff.

lg,k

