Discussion: 8.3.5 broken after power fail
Now I'm sure that the following configuration can destroy your database:

* Linux host with Xen, XFS filesystem mounted with "nobarrier", RAID controller with battery-backed cache
* Xen VM with an XFS filesystem mounted with "nobarrier", running PostgreSQL
* your 3.5-year-old daughter switching off the power supply of the server

I guess one shouldn't use "nobarrier" on a Xen XFS domU machine.

Now, can somebody help me fix this error? I ran "vacuum analyze verbose" and it stopped here:

INFO: vacuuming "pg_toast.pg_toast_1281127"
ERROR: could not access status of transaction 2299723776
DETAIL: Could not open file "pg_clog/0891": No such file or directory.

What can I do?

mfg zmi
-- 
// Michael Monnerie, Ing.BSc ----- http://it-management.at
// Tel: 0660 / 415 65 31 .network.your.ideas.
// PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38 500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net Key-ID: 1C1209B4
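[Editor's note: the segment name in that DETAIL line can be derived from the transaction id by hand. In the 8.x clog, each byte holds the 2-bit status of 4 transactions, each 8 kB page covers 32768 of them, and each 32-page segment file covers 1,048,576; the file is named after the segment number printed as hex. A quick sanity check against the xid in the error above:]

```shell
# Each pg_clog byte stores 4 transaction statuses, each 8 kB page 32768,
# and each 32-page segment 1048576; the file name is the segment number
# printed as 4 hex digits.
printf '%04X\n' $(( 2299723776 / 1048576 ))   # → 0891
```

The result matches the file name in the error, which confirms the xid really does fall in a clog segment that no longer exists on disk.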
On Tuesday 17 February 2009 10:54:52, Michael Monnerie wrote:
> * your 3.5-year-old daughter switching off the power supply of the
> server

1st line of defense is to set your system to shut down normally when the power button is pressed.
2nd line of defense is to get yourself a decent UPS unit.

My daughter does this all the time on our family FreeBSD box. No probs.

Also at work, on more than 20 tanker vessels running 7.4.2, the captains do that on a constant basis and PgSQL has always survived (more than the rest of the system, anyways..)

> What can I do?
>
> mfg zmi

-- 
Achilleas Mantzios
2009/2/17 Achilleas Mantzios <achill@matrix.gatewaynet.com>:
> On Tuesday 17 February 2009 10:54:52, Michael Monnerie wrote:
>> * your 3.5-year-old daughter switching off the power supply of the
>> server
> 1st line of defense is to set your system to shut down normally when the power button is pressed.
> 2nd line of defense is to get yourself a decent UPS unit
>
> My daughter does this all the time on our family FreeBSD box. No probs.
>
> Also at work, on more than 20 tanker vessels running 7.4.2, the captains do that on a constant
> basis and PgSQL has always survived (more than the rest of the system, anyways..)

Those are all good to have. But no UPS is a replacement for hard drives / RAID controllers / file systems that don't lie about fsync. Nothing makes your database shine like being the only one in the hosting center that survives a sudden catastrophic power failure.

>> What can I do?

Tether your daughter to the other side of the room? I'm not sure which parts of those mount options are dangerous or not. I use ext3 stock with noatime, and a battery-backed RAID. Smaller, slower workgroup / workstation controllers (i.e. 5-year-old server controllers) go for pretty cheap and give pretty good performance with 2 or 4 drives.

As for fixing it, I believe the answer involves creating a clog file full of zeros (a segment is 256 kB in 8.3), and pg_resetxlog. Don't count on all your data being there or all your FK-PK type relationships being correct. Immediately dump it, initdb, and reload your data, fixing it as you go.
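[Editor's note: a sketch of the zero-filled clog trick described above. In 8.3 one clog segment is 256 kB (32 pages of 8 kB). The data directory path below is a stand-in, "0891" is the segment named in the original ERROR message, and the server must be stopped before touching pg_clog. A zeroed segment marks every transaction in it as "in progress", which is why the subsequent dump/initdb/reload is mandatory:]

```shell
# Assumption: DATADIR stands in for the real PostgreSQL data directory,
# and "0891" is the segment named in the error. Stop postgres first.
DATADIR=/tmp/demo_pgdata
mkdir -p "$DATADIR/pg_clog"
# one clog segment = 32 pages x 8 kB = 262144 bytes of zeros
dd if=/dev/zero of="$DATADIR/pg_clog/0891" bs=8192 count=32
ls -l "$DATADIR/pg_clog/0891"   # should show 262144 bytes
```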
> 1st line of defense is to set your system to shut down normally when
> the power button is pressed. 2nd line of defense is to get yourself
> a decent UPS unit

I have both. She managed to switch off behind the UPS :-)

> Also at work, on more than 20 tanker vessels running 7.4.2, the
> captains do that on a constant basis and PgSQL has always survived
> (more than the rest of the system, anyways..)

Even with a Xen VM running PostgreSQL? Both on XFS with the nobarrier mount option? Try it, it's fun ;-)

mfg zmi
> Tether your daughter to the other side of the room? I'm not sure
> which parts of those mount options are dangerous or not. I use ext3
> stock with noatime. And a battery backed RAID. Smaller slower work
> group / station controllers (i.e. 5 year old server controllers) go
> for pretty cheap and give pretty good performance with 2 or 4 drives.

I know the problem is with XFS and the mount option "nobarrier" within a Xen VM. Just last week I discussed with the XFS devs whether that would be safe to use with a battery-backed RAID. They said "depends on the hypervisor". So now I have proved (not that I wanted to!) that within Xen, XFS should always have "barrier" on.

> As for fixing it, I believe the answer involves creating a clog file
> full of zeros, and pg_resetxlog. Don't count on all your data being
> there or all your FK-PK type relationships being correct. Immediately
> dump it, initdb, and reload your data, fixing it as you go.

OK, that's a start. Trying... still an error:

pg_dump: Fehlermeldung vom Server: ERROR: missing chunk number 0 for toast value 1460201 in pg_toast_1281127
pg_dump: Die Anweisung war: COPY public.dbmail_messageblks (messageblk_idnr, physmessage_id, messageblk, blocksize, is_header) TO stdout;

What now?

mfg zmi
"Michael Monnerie" <michael.monnerie@is.it-management.at> wrote:
> pg_dump: Fehlermeldung vom Server: ERROR: missing chunk number 0 for
> toast value 1460201 in pg_toast_1281127
> pg_dump: Die Anweisung war: COPY public.dbmail_messageblks
> (messageblk_idnr, physmessage_id, messageblk, blocksize, is_header) TO
> stdout;

The file is there:

# ls -l base/16386/1281127
-rw------- 1 postgres postgres 417447936 17. Feb 12:14 base/16386/1281127

What can I do about the missing chunk? I need the data in there...

mfg zmi
On Tuesday 17 February 2009 15:53:33, Michael Monnerie wrote:
> "Michael Monnerie" <michael.monnerie@is.it-management.at> wrote:
> > pg_dump: Fehlermeldung vom Server: ERROR: missing chunk number 0 for
> > toast value 1460201 in pg_toast_1281127
> > pg_dump: Die Anweisung war: COPY public.dbmail_messageblks
> > (messageblk_idnr, physmessage_id, messageblk, blocksize, is_header) TO
> > stdout;
>
> The file is there:
> # ls -l base/16386/1281127
> -rw------- 1 postgres postgres 417447936 17. Feb 12:14 base/16386/1281127

did you reindex pg_toast_1281127?
take a look at
http://archives.free.net.ph/message/20080924.191644.d692f468.el.html

> What can I do about the missing chunk? I need the data in there...

-- 
Achilleas Mantzios
"Achilleas Mantzios" <achill@matrix.gatewaynet.com> wrote:
> did you reindex pg_toast_1281127?
> take a look at
> http://archives.free.net.ph/message/20080924.191644.d692f468.el.html

REINDEX TABLE pg_toast.pg_toast_1281127;

That has now been running for 33 CPU minutes, using 100% CPU with no visible I/O happening. top says:

29687 postgres 20 0 605m 187m 3024 R 100 15.4 32:58.65 postmaster

And "strace -p 29687" displays nothing. Could it be the process hangs? It shouldn't take that long, I guess, but I'm not sure how to find out if it's borked or really working... ideas?

mfg zmi
On Tuesday 17 February 2009 17:11:41, Michael Monnerie wrote:
> "Achilleas Mantzios" <achill@matrix.gatewaynet.com> wrote:
> > did you reindex pg_toast_1281127?
> > take a look at
> > http://archives.free.net.ph/message/20080924.191644.d692f468.el.html
>
> REINDEX TABLE pg_toast.pg_toast_1281127;
>
> That runs now since 33 CPU minutes, using 100% with no visible I/O
> happening. top says:
> 29687 postgres 20 0 605m 187m 3024 R 100 15.4 32:58.65 postmaster
>
> And "strace -p 29687" displays nothing. Could it be the process hangs? It
> shouldn't take that long I guess, but I'm not sure how to find out if it's
> borked or really working... ideas?

Do iostat, systat or vmstat report anything?

-- 
Achilleas Mantzios
"Achilleas Mantzios" <achill@matrix.gatewaynet.com> wrote:
>> That runs now since 33 CPU minutes, using 100% with no visible I/O
>> happening. top says:
>> 29687 postgres 20 0 605m 187m 3024 R 100 15.4 32:58.65 postmaster
>>
>> And "strace -p 29687" displays nothing. Could it be the process hangs? It
>> shouldn't take that long I guess, but I'm not sure how to find out if it's
>> borked or really working... ideas?
>
> Do iostat, systat or vmstat report anything?

Sorry, I wasn't detailed enough. With "no visible I/O" I meant I looked with iostat. After 45 CPU minutes I tried to stop it - no chance. Stop postgres - no chance. Wow, a really hard problem. Had to reboot the VM.

Can I somehow dump that table without using an index? Here's the info about it:

\d dbmail_messageblks
           Tabelle »public.dbmail_messageblks«
     Spalte      |   Typ    | Attribute
-----------------+----------+------------------------------------------------------------------
 messageblk_idnr | bigint   | not null default nextval('dbmail_messageblk_idnr_seq'::regclass)
 physmessage_id  | bigint   |
 messageblk      | bytea    | not null
 blocksize       | bigint   | not null default (0)::bigint
 is_header       | smallint | not null default (0)::smallint
Indexe:
    »dbmail_messageblks_pkey« PRIMARY KEY, btree (messageblk_idnr)
    »dbmail_messageblks_1« UNIQUE, btree (physmessage_id, messageblk_idnr) CLUSTER
    »dbmail_messageblks_physmessage_is_header_idx« btree (physmessage_id, is_header)
Fremdschlüssel-Constraints:
    »dbmail_messageblks_physmessage_id_fkey« FOREIGN KEY (physmessage_id) REFERENCES dbmail_physmessage(id) ON UPDATE CASCADE ON DELETE CASCADE

This works:
COPY public.dbmail_messageblks (messageblk_idnr, physmessage_id, blocksize, is_header) TO stdout;

This does NOT work:
COPY public.dbmail_messageblks (messageblk_idnr, physmessage_id, messageblk, blocksize, is_header) TO stdout;

So I'm sure the "messageblk" toast got toasted too much. Now I need a way to get out all values that are readable and piss on the rest.
I looked into the output of the first and second COPY; it should be messageblk_idnr=3904492 that's broken. I tried

select * from dbmail_messageblks where messageblk_idnr not in (3904492);

That helps. Slowly I may be able to find all broken entries. Is there a more automated way?

mfg zmi
"Michael Monnerie" <michael.monnerie@is.it-management.at> wrote:
> select * from dbmail_messageblks where messageblk_idnr not in (3904492);
>
> That helps. Slowly I may be able to find all broken entries. Is there a more
> automated way?

Now I already have:

select * from dbmail_messageblks where messageblk_idnr not in (3904492,3904495,3904496,3904499,3904500,3904501,3904503);
ERROR: unexpected chunk number 1 (expected 0) for toast value 1460210 in pg_toast_1281127

Is there a way to find out what "toast value 1460210" stands for? I'd like to find the connection between messageblk_idnr and that toast value, as it's not easy to find.

mfg zmi
"Michael Monnerie" <michael.monnerie@is.it-management.at> wrote:
> Now I already have
> select * from dbmail_messageblks where messageblk_idnr not in
> (3904492,3904495,3904496,3904499,3904500,3904501,3904503);
> ERROR: unexpected chunk number 1 (expected 0) for toast value 1460210 in
> pg_toast_1281127
>
> Is there a way to find out what "toast value 1460210" stands for? I'd like
> to find the connection between messageblk_idnr and that toast value, as it's
> not easy to find that.

My problem is that there seems to be a jump in messageblk_idnr, and I cannot find the relation to the missing toast value. Is there a way to find this? Otherwise I'd need something like COPY .. WHERE .. to exclude the broken toasts and see where it stops next. SELECT only shows the error with the toast value number; I'd like to see what the last successfully processed row was - then I could find the next one and know its messageblk_idnr. Is that possible?

mfg zmi
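[Editor's note: a hedged sketch of the automated scan being asked for here: visit the table one row at a time, force each toasted messageblk value to be detoasted, and collect the ids that fail. Table and column names are taken from the \d output earlier in this thread; the function name is made up. Caveat, demonstrated later in this thread: if a corrupt page segfaults the backend rather than raising an ERROR, no EXCEPTION block can catch that.]

```sql
-- Sketch (8.3-compatible plpgsql): return the messageblk_idnr of every
-- row whose toasted messageblk cannot be read back.
CREATE OR REPLACE FUNCTION find_broken_messageblks() RETURNS SETOF bigint AS $$
DECLARE
    id bigint;
    dummy integer;
BEGIN
    FOR id IN SELECT messageblk_idnr FROM dbmail_messageblks
              ORDER BY messageblk_idnr LOOP
        BEGIN
            -- length() forces full detoasting of the bytea value
            SELECT length(messageblk) INTO dummy
              FROM dbmail_messageblks WHERE messageblk_idnr = id;
        EXCEPTION WHEN OTHERS THEN
            RETURN NEXT id;   -- this row's toast data is unreadable
        END;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

SELECT * FROM find_broken_messageblks();
```

The ids it returns can then be fed into the `NOT IN (...)` exclusion list used above, instead of discovering them one failure at a time.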
News! The server even crashes when accessing certain messageblks. The "physmessage" table contains mails with date; mails are stored in pieces in "messageblks" (see the ER model at http://www.dbmail.org/dokuwiki/doku.php?id=er-model ). I tried to find the first and last message with a problem, here the last:

select substring(m.messageblk,0,5) from dbmail_messageblks m join dbmail_physmessage p on (p.id=m.physmessage_id) where p.id >= 1831634;

But that crashes the server completely:

2009-02-17 23:48:43 CET pid=5481 1/2 idle: LOG: statement: select substring(m.messageblk,0,5) from dbmail_messageblks m join dbmail_physmessage p on (p.id=m.physmessage_id) where p.id >= 1831634;
2009-02-17 23:48:44 CET pid=5482 /usr/bin/postmaster: LOG: connection received: host=[local]
2009-02-17 23:48:44 CET pid=5413 : LOG: server process (PID 5481) was terminated by signal 11: Segmentation fault
2009-02-17 23:48:44 CET pid=5413 : LOG: terminating any other active server processes
2009-02-17 23:48:44 CET pid=5413 : LOG: all server processes terminated; reinitializing
2009-02-17 23:48:44 CET pid=5483 -1/0 : LOG: database system was interrupted; last known up at 2009-02-17 23:48:32 CET
2009-02-17 23:48:44 CET pid=5483 -1/0 : LOG: database system was not properly shut down; automatic recovery in progress
2009-02-17 23:48:44 CET pid=5483 -1/0 : LOG: record with zero length at E/382481B8
2009-02-17 23:48:44 CET pid=5483 -1/0 : LOG: redo is not required
2009-02-17 23:48:44 CET pid=5483 -1/0 : LOG: checkpoint starting: shutdown immediate
2009-02-17 23:48:44 CET pid=5483 -1/0 : LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.000 s, total=0.013 s
2009-02-17 23:48:44 CET pid=5486 -1/0 : LOG: autovacuum launcher started
2009-02-17 23:48:44 CET pid=5413 : LOG: database system is ready to accept connections

Damn, that crash really busted my db. I'd only like to have the rest of the data from that table...

mfg zmi
I have the impression I'm asking the wrong things. Still I try.
Could I do something about this?

psql:/tmp/x:119: ERROR: invalid page header in block 973698 of relation "pg_toast_1281127"

It seems that toast table got truncated after the host+Xen crash and XFS recovery. Would adding some zeroes to that file help? If yes, how could I find out what the file is named?

mfg zmi
Look into the zero_damaged_pages setting.

On Wed, Feb 18, 2009 at 2:35 PM, Michael Monnerie <michael.monnerie@is.it-management.at> wrote:
> I have the impression I'm asking the wrong things. Still I try.
> Could I do something about this?
>
> psql:/tmp/x:119: ERROR: invalid page header in block 973698 of relation
> "pg_toast_1281127"
>
> It seems that toast table got truncated after host+XEN crash and XFS
> recovery. Would adding some zeroes to that file help? If yes, how could I
> find out what the file is named?
>
> mfg zmi

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

--
When fascism comes to America, it will be the intolerant selling it as diversity.
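[Editor's note: a minimal sketch of how that setting is used, for anyone finding this thread. zero_damaged_pages is a superuser-only developer option; it makes the server zero out pages with invalid headers instead of erroring, which irrevocably destroys whatever was on those pages - take a filesystem-level copy of the data directory first. The table name follows this thread.]

```sql
-- last-resort salvage: zero pages with invalid headers instead of failing
SET zero_damaged_pages = on;
-- now re-run whatever was failing; a full-table read forces every page
-- (and its toast pages) to be visited
SELECT count(*) FROM dbmail_messageblks;
SET zero_damaged_pages = off;
```

Note it only helps with "invalid page header" errors; "missing chunk number" errors come from chunks that are gone entirely, and those rows still have to be excluded by hand.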
Hi Everybody,

I am comparing selection results between postgres clusters on RHEL4 and Windows Server 2003. The query is:

select count(*) from schema.table where route < 'Interstate 12'

On the RHEL machine the result has: 102 records
On the Windows machine the result has: 586 records
The entire table has 687 records.

On both clusters the column route is of type character varying(40). On both clusters the server encoding is UTF8 - which I can see in pgAdmin III or by querying pg_database.

I would like to figure out if the locale on both clusters is the same, because if the locales are different that would explain the different result sets for the same query. Does anyone know how to check the locale of a cluster? The documentation has yielded this: "You can check the LC_COLLATE and LC_CTYPE settings of a database with the contrib/pg_controldata utility program." I do not have a contrib folder on either machine; I suppose that I could go install it, but is there an easier way?

Thank you,
Sincerely,
Kasia
Kasia Tuszynska <ktuszynska@esri.com> writes:
> Does anyone know how to check the locale of a cluster?

"show lc_collate" has worked since 7.4.

> The documentation has yielded this: "You can check the LC_COLLATE and LC_CTYPE settings of a database with the contrib/pg_controldata utility program."

The docs have not said that since 7.3. I really hope that is not what you are running ... especially not on the Windows machine.

regards, tom lane
On Wednesday 18 February 2009, Scott Marlowe wrote:
> Look into the zero_damaged_pages setting.

Thanks for the hint Scott, I tried it and did:

select * into mb from dbmail_messageblks;
WARNING: invalid page header in block 973075 of relation "pg_toast_1281127"; zeroing out page
ERROR: missing chunk number 0 for toast value 1623370 in pg_toast_1281127

Those missing chunks still stop the query. I found out which messageblks are destroyed, and can get out all the rest.

There was something interesting: as the data in that table is about 10 GB and therefore time-consuming to read, I tried reading only small parts of the big "messageblk" field of type "bytea", which is the longest field. So I tried reading every row with substring(messageblk,0,5) and on success noted that block as good. But the rest of the value can still have errors, meaning you must really read the complete messageblk field to test it. Just in case someone else needs to perform such a test.

Also a question: because I must read all the data, the psql client runs out of memory, trying to cache all 10 GB from that table. I circumvented this by selecting only parts of the table at a time. Is there a smart way to do such a select without caching the results in memory? Is that what temporary tables and "select into" are made for? I just want to know the recommended way of doing huge queries.

mfg zmi
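[Editor's note on the memory question: psql buffers the entire result set client-side by default. A sketch of two ways to stream instead - psql's FETCH_COUNT variable (available since psql 8.2), which makes psql use a cursor under the hood, or an explicit cursor. Table and cursor names follow this thread.]

```sql
-- in psql, the one-liner:   \set FETCH_COUNT 1000
-- done by hand, the same thing looks like:
BEGIN;
DECLARE blk NO SCROLL CURSOR FOR
    SELECT messageblk_idnr, physmessage_id, messageblk
      FROM dbmail_messageblks;
FETCH FORWARD 1000 FROM blk;   -- repeat until it returns no rows
CLOSE blk;
COMMIT;
```

Temporary tables and SELECT INTO keep the data server-side, which also avoids the client-memory problem, but a cursor is the direct answer when you just want to read the rows.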
I managed to recover the data that was still readable. About 650 messageblock entries got lost. What makes me a bit nervous is that postgres kept running despite being (partially) destroyed. It should really have shut itself down after the first problem was found. That database is for mails, and I would understand having lost some from the time before the power loss, but I even lost entries from *after* the crash. That means the error happened after the system was up again and happily doing its work, yet it lost entries to a table. Could there be a better way to check the db at crash-recovery startup time?

If someone is interested, I have a full postgres log with every single command done to the database.

mfg zmi
On Sat, Feb 21, 2009 at 1:43 AM, Michael Monnerie <michael.monnerie@is.it-management.at> wrote:
> I managed to recover the data that was still readable. About 650
> messageblock entries got lost. What makes me nervous a bit is that
> postgres kept running despite (partially) being destroyed. It should
> really have shutdown itself after the first problem was found. [...]

We preach this again and again: PostgreSQL can survive a power-outage type failure ONLY if the hardware / OS / filesystem don't lie about fsync. If they do, all bets are off, and this kind of failure means you should really fail over to another machine or restore a backup.

It's why you have to do possibly destructive tests to see if your server stands at least some chance of surviving this kind of failure, plus log shipping for recovery and/or replication of another form (Slony etc...) to have a reliable server.

The recommendations for recovery of data are just that, recovery oriented. They can't fix a broken database at that point. You need to take it offline after this kind of failure if you can't trust your hardware.

Usually when it finds something wrong it just won't start up.
On Sat, Feb 21, 2009 at 1:23 AM, Michael Monnerie <michael.monnerie@is.it-management.at> wrote:
> Also a question: Because I must read all data, the psql client runs out
> of memory, trying to cache all the 10GB from that table. I circumvented
> this with selecting only parts of the table all the time. Is there a
> smart way to do such a select without caching the results in memory? Is
> that what temporary tables and "select into" are made for? I just want
> to know the recommended way for doing huge queries.

You can dump individual tables with pg_dump -t table1 -t table2. That should work without running out of memory. And yeah, temp tables and select into are a good way to get your data ready to be pg_dumped.
On Saturday 21 February 2009, Scott Marlowe wrote:
> You can dump individual tables with pg_dump -t table1 -t table2.
> That should work without running out of memory. And yeah, temp
> tables and select into are a good way to get your data ready to be
> pg_dumped.

Yes, actually I did pg_dump -T broken_table, and got all the others out. Thanks to Murphy's Law, the one broken table was the biggest: the one containing the e-mail data.

mfg zmi
On Saturday 21 February 2009, Scott Marlowe wrote:
> We preach this again and again. PostgreSQL can only survive a power
> outage type failure ONLY if the hardware / OS / filesystem don't lie
> about fsync. If they do, all bets are off, and this kind of failure
> means you should really failover to another machine or restore a
> backup.

The shit thing is, I just discussed with the XFS devs last week whether it is safe to have virtualization like VMware or Xen, and the answer was "depends on the hypervisor". I had such an issue with VMware 2 years ago, and now with Xen, so I would say they are not safe. But there must be something you can configure in order not to have such drastic errors on power failure. It's just that nobody seems to know (or want to tell) how to do that. At least, not to me ;-)

> It's why you have to do possibly destructive tests to see if your
> server stands at least some chance of surviving this kind of failure,
> log shipping for recovery, and / or replication of another form
> (slony etc...) to have a reliable server.

As I need another Postgres setup with a server syncing dbmail to another, I guess I'll do that with WAL, so at least then I can recover to the latest entry.

> The recommendations for recovery of data are just that, recovery
> oriented. They can't fix a broken database at that point. You need
> to take it offline after this kind of failure if you can't trust your
> hardware.
>
> Usually when it finds something wrong it just won't start up.

The problem was I wasn't working this week, and I did just a basic check whether everything was up again. There were e-mails arriving, so I thought it was OK. I was very pissed when some days later I found strange things happening, and then saw that a table was broken and had eaten nearly all e-mails. If at least Postgres had whined and stopped working...
I know it's not Postgres' fault that fsync was messed up, but at least error recovery should have found the problem, at the latest the moment the first transaction touched the problematic table. Instead of throwing the data effectively to /dev/null :-(

mfg zmi
Other than disaster tests, how would I know if I have a system that lies about fsync?

> We preach this again and again. PostgreSQL can only survive a power
> outage type failure ONLY if the hardware / OS / filesystem don't lie
> about fsync. If they do, all bets are off, and this kind of failure
> means you should really failover to another machine or restore a
> backup. [...]

-- 
------------------------------------------------------------------------
Naomi Walker                        Chief Information Officer
Eldorado Computing, Inc             nwalker@eldocomp.com
---An EDS Company                   602-604-3100
------------------------------------------------------------------------
Hard things are put in our way, not to stop us, but to call out our courage and strength. -- Anonymous
------------------------------------------------------------------------
Naomi Walker wrote:
> Other than disaster tests, how would I know if I have a system that
> lies about fsync?

Well, the Linux kernel tries to detect it on bootup and will give messages like this:

%dmesg | grep 'disabling barriers'
JBD: barrier-based sync failed on md1 - disabling barriers
JBD: barrier-based sync failed on hda3 - disabling barriers

when it detects certain types of unreliable fsyncs. The command

%hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT

will give you clues as to whether the hard drive itself can even support a non-lying fsync when its internal cache is enabled.

Sadly some filesystems (ext3) lie even above and beyond what Linux does - by only using the write barriers correctly when the inode itself is modified, not when the data is modified. A test program here:
http://archives.postgresql.org/pgsql-performance/2008-08/msg00159.php
can detect those cases where the kernel & drive don't lie about fsync but ext3 lies in spite of them; with more background info here:
http://article.gmane.org/gmane.linux.file-systems/21373
http://thread.gmane.org/gmane.linux.kernel/646040

Elsewhere in the archives you can find programs that measure how fast fsyncs happen on your hardware, and you can try to see if those numbers approximately match how fast your disks spin. But then you still need to make sure the test program used the same methods for syncing the drive that your postgres configuration files are choosing.

I wonder if the only really safe way is to run a very write-intensive database script and pull and kill your system in a number of ways, including yanking power to the system and to disk arrays, and see if your database died.
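[Editor's note: a crude, non-destructive version of the fsync-rate measurement mentioned above, as a sketch. It uses GNU dd's oflag=dsync to force a flush after every small write; the path is arbitrary - for a real check you would point it at the filesystem the database lives on. A 7200 rpm disk can honor at most roughly 120 flushes per second, so a dramatically higher rate from dd's summary line suggests something in the stack is acknowledging flushes it has not performed.]

```shell
# write 500 x 8 kB blocks, syncing after each write (GNU dd only);
# dd's summary line reports throughput, i.e. the effective flush rate
dd if=/dev/zero of=/tmp/fsync_probe bs=8192 count=500 oflag=dsync
```

As the messages above note, a timing probe is only a hint; pulling the plug under write load remains the only conclusive test.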
On Sat, Feb 21, 2009 at 3:41 PM, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
> Naomi Walker wrote:
>> Other than disaster tests, how would I know if I have a system that
>> lies about fsync?
>
> Well, the linux kernel tries to detect it on bootup [...]
>
> I wonder if the only really safe way is to run a very
> write intensive database script and pull and kill your
> system in a number of ways, including yanking power to
> the system; to disk arrays, etc and see if your database died.

Well, you can't prove it's 100% safe, but you can usually find most of the unsafe systems this way.
I usually set up a big pgbench db, run 500 or so concurrent clients, wait 5 or 10 minutes, run a checkpoint, and halfway through it pull the plug. It's still possible for a system to fail after passing this test, but I feel a lot better knowing I've done it a couple of times and the db came back up without problems.