Re: Row data corruption under 7.3.5
От | Marc Mitchell |
---|---|
Тема | Re: Row data corruption under 7.3.5 |
Дата | |
Msg-id | 001f01c40c3a$b1a95e80$6701050a@MarcM8500 обсуждение исходный текст |
Ответ на | Re: Row data corruption under 7.3.5 (Tom Lane <tgl@sss.pgh.pa.us>) |
Ответы |
Re: Row data corruption under 7.3.5
(Tom Lane <tgl@sss.pgh.pa.us>)
Re: Row data corruption under 7.3.5 ("scott.marlowe" <scott.marlowe@ihs.com>) Re: Row data corruption under 7.3.5 (Radu-Adrian Popescu <radu.popescu@aldratech.com>) |
Список | pgsql-admin |
This is follow-up to a problem first reported on 3/1/04. The problem has continued to occur intermittently and recently we experienced the first occurrence where the first column of a table was the column where the corrupted and thus we could not recover it. Google groups searching have found numerous hits for people reporting the same symptoms. While we've seen some instructions to get things back, we've seen nothing about correcting the root cause. This is becoming a major production problem and starting to cast doubt on the "Postgres in Production" decision. We've observed nothing that would lead us to believe there are any hardware problems. Initially we were using write-caching using battery-backed up cache but we turned that off and are using direct I/O and still experiencing the same problem. Furthermore, the fact that the problems seem isolated to 3 specific tables in a 50+ table database makes us weary of hardware-level issues. As far as matching up correct and corrupted rows, here's more detail on a recent occurrence: [root@cin1 backups]# /usr/local/pgsql/7.3.5/bin/pg_dump -Ft -p 5432 -U postgres solo > solo.dmp pg_dump: ERROR: MemoryContextAlloc: invalid request size 4294967293 pg_dump: lost synchronization with server, resetting connection pg_dump: SQL command to dump the contents of table "freight_track_detail" failed: PQendcopy() failed. pg_dump: Error message from server: pg_dump: The command was: COPY public.freight_track_detail (ftd_uid, ftm_uid, txl_uid, ref_nbr_1, ref_nbr_2, fg_tab_uid, fg_tab_alias, ftd_status_code, scan_timestamp, add_userid, add_timestamp, mod_userid, mod_timestamp) TO stdout; ..<snip>.. 118171 512 2159 00004300854908405208 46366 FGI 2004-01-21 12:25:00 postgres 2004-01-21 15:39:29 OSD 2004-01-22 01:04:40 118153 512 2159 00004304730000071106 46990 FGI 2004-01-21 12:20:00 post 2000-01-01 00:00:00 ..End of output. The second row shows the vchar userid getting lopped off after the first 4 characters. Note that we've experienced this problem with several different vchar-typed columns though, as mentioned before, we have recently seen corruption of integer typed columns. If we issued an update setting that column plus the subsequent 3 columns to "null", everything then was back to normal. This row was right in the middle of the table. Furthermore, we recently found problems reported in the same table from nightly vacuums. See the following cron-generated emails that contain error messages as well as datetimes to show the temporal relationship between these problems:
В списке pgsql-admin по дате отправления: