Discussion: pg_dump causes postgres crash
Fairly new (less than a week) install.
"PostgreSQL 8.2.4 on i386-pc-solaris2.10, compiled by
GCC gcc (GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath)"
database size around 43 gigabytes.
2 attempts at a pg_dump across the network caused the
database to go down...
The first time I thought it was because of mismatched
pg_dump (was version 8.0.X)...but the second time it
was definitely 8.2.4 version of pg_dump.
My first thought was corruption...but this database
has successfully seeded 2 slony subscriber nodes from
scratch, as well as running flawlessly under heavy load
for the past week.
Even more odd is that a LOCAL pg_dump (from on the
box) succeeded just fine tonight (after the second
crash).
Thoughts?
----First Crash-------
backup-srv2 prod_backup # time /usr/bin/pg_dump --format=c --compress=9 --ignore-version --username=backup --host=prod_server prod > x
pg_dump: server version: 8.2.4; pg_dump version: 8.0.13
pg_dump: proceeding despite version mismatch
pg_dump: WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
pg_dump: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
pg_dump: SQL command to dump the contents of table "access_logs" failed: PQendcopy() failed.
pg_dump: Error message from server: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
pg_dump: The command was: COPY public.access_logs (ip, username, "action", date, params) TO stdout;
------Second Crash--------
backup-srv2 ~ # time /usr/bin/pg_dump --format=c --compress=9 --username=backup --host=prod_server prod | wc -l
pg_dump: Dumping the contents of table "audit" failed: PQgetCopyData() failed.
pg_dump: Error message from server: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
pg_dump: The command was: COPY public.audit (audit_id, entity_id, table_name, serial_id, audit_action, when_ts, user_id, user_ip) TO stdout;
From the logs tonight when the second crash occurred..

Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848 local0.info] [6-1] 2007-08-22 20:45:12 CDT LOG: received smart shutdown request
Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848 local0.info] [7-1] 2007-08-22 20:45:12 CDT LOG: server process (PID 20188) was terminated by signal 11
Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848 local0.info] [8-1] 2007-08-22 20:45:12 CDT LOG: terminating any other active server processes

There was a core file created...but I believe I do not have postgresql compiled with debug info.....(well, a pstack provided nothing useful)

pstack core |more
core 'core' of 20188: /usr/local/pgsql/bin/postgres -D /db
 fee8ec23 sk_value (10023d, 105d8b00, d2840f, 1c7f0000, f20f883, 10584) + 33
 0c458b51 ???????? (0, 0, 511f1600, 2000400, ff001c09, 467f71ea)
 00000000 ???????? ()

Once again...a local pg_dump worked just fine 30 minutes later......

We have introduced some new network architecture which is acting odd lately (dell managed switches, netscreen ssgs, etc) and the database itself resides on a zfs partition on a Pillar SAN (connected via fibre channel)

Any thoughts would be appreciated.
Jeff Amiel <becauseimjeff@yahoo.com> writes:
> Even more odd is that a LOCAL pg_dump (from on the
> box) succeeded just fine tonight (after the second
> crash).
That seems to eliminate the theory of a crash due to data corruption
... unless the corruption somehow repaired itself in the intervening
30 minutes, which hardly seems likely.
> ----First Crash-------
> backup-srv2 prod_backup # time /usr/bin/pg_dump
> --format=c --compress=9 --ignore-version
> --username=backup --host=prod_server prod > x
> pg_dump: server version: 8.2.4; pg_dump version:
> 8.0.13
> pg_dump: proceeding despite version mismatch
> pg_dump: WARNING: terminating connection because of
> crash of another server process
> DETAIL: The postmaster has commanded this server
> process to roll back the current transaction and exit,
> because another server process exited abnormally and
> possibly corrupted shared memory.
Notice that pg_dump is showing that the crash was in some OTHER server
process, not the one it was attached to.
> ------Second Crash--------
> backup-srv2 ~ # time /usr/bin/pg_dump --format=c
> --compress=9 --username=backup --host=prod_server
> prod | wc -l
> pg_dump: Dumping the contents of table "audit" failed:
> PQgetCopyData() failed.
> pg_dump: Error message from server: server closed the
> connection unexpectedly
> This probably means the server terminated
> abnormally
> before or while processing the request.
> pg_dump: The command was: COPY public.audit (audit_id,
This one looks more like it might have been the directly connected
server process that crashed. However, your postmaster log from
the other message:
> From the logs tonight when the second crash occurred..
> Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
> local0.info] [6-1] 2007-08-22 20:45:12 CDT LOG:
> received smart shutdown request
> Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
> local0.info] [7-1] 2007-08-22 20:45:12 CDT LOG:
> server process (PID 20188) was terminated by signal 11
> Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
> local0.info] [8-1] 2007-08-22 20:45:12 CDT LOG:
> terminating any other active server processes
raises still more questions: where the heck did the "smart shutdown
request" (that is to say, a SIGTERM interrupt to the postmaster) come
from? It's far too much of a coincidence for that to have occurred
within a second of detecting the server process crash.
> We have introduced some new network architecture which
> is acting odd lately (dell managed switches, netscreen
> ssgs, etc) and the database itself resides on a zfs
> partition on a Pillar SAN (connected via fibre
> channel)
I can't help thinking you are looking at generalized system
instability. Maybe someone knocked a few cables loose while
installing new network hardware?
regards, tom lane
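
The timing question raised above (a smart shutdown request logged in the same second as the signal 11 crash) can be checked mechanically. The following is only a minimal sketch: it assumes the syslog layout shown in the quoted excerpt, and the regular expressions and the function names `parse_events` and `summarize` are hypothetical helpers, not part of PostgreSQL.

```python
import re
import signal
from datetime import datetime

# Log lines copied from the excerpt quoted in this thread.
LOG_LINES = [
    "Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848 local0.info] [6-1] "
    "2007-08-22 20:45:12 CDT LOG: received smart shutdown request",
    "Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848 local0.info] [7-1] "
    "2007-08-22 20:45:12 CDT LOG: server process (PID 20188) was terminated by signal 11",
    "Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848 local0.info] [8-1] "
    "2007-08-22 20:45:12 CDT LOG: terminating any other active server processes",
]

# Assumed pattern for this particular log prefix; other setups will differ.
LINE_RE = re.compile(
    r"\[(?P<seq>\d+)-\d+\] "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \w+ LOG:\s+(?P<msg>.*)$"
)
SIGNAL_RE = re.compile(
    r"server process \(PID (?P<pid>\d+)\) was terminated by signal (?P<sig>\d+)"
)


def parse_events(lines):
    """Return (sequence, timestamp, message) tuples sorted by sequence number."""
    events = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip lines that do not match the assumed layout
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        events.append((int(m.group("seq")), ts, m.group("msg")))
    return sorted(events)


def summarize(events):
    """Report the order and time gap between the shutdown request and the crash."""
    shutdown = next((e for e in events if "smart shutdown request" in e[2]), None)
    crash = None
    for seq, ts, msg in events:
        m = SIGNAL_RE.search(msg)
        if m:
            crash = (seq, ts, int(m.group("pid")), int(m.group("sig")))
            break
    if shutdown is None or crash is None:
        return None
    return {
        "shutdown_seq": shutdown[0],
        "crash_seq": crash[0],
        "crash_pid": crash[2],
        # Signal numbering is platform-dependent; 11 is SIGSEGV on common Unix systems.
        "signal_name": signal.Signals(crash[3]).name,
        "gap_seconds": abs((crash[1] - shutdown[1]).total_seconds()),
    }


if __name__ == "__main__":
    print(summarize(parse_events(LOG_LINES)))
```

The log timestamps only have one-second resolution, so a gap of zero here means "within the same second", which is all the excerpt can show.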
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I can't help thinking you are looking at generalized system
> instability. Maybe someone knocked a few cables loose while
> installing new network hardware?
Database server/storage instability or network instability?
There is no doubt that there is something flaky about the networking between the db server and the box(es) trying to do the pg_dump. We have indeed had issues (timeouts, halts, etc) moving large quantities of data across various segments to and from these boxes...like the db server....but how would this affect something like a pg_dump?
Would a good stack trace (assuming I want to crash my database again) help here?
Jeff Amiel <becauseimjeff@yahoo.com> writes:
> Would a good stack trace (assuming I want to crash my
> database again) help here?
Well, it'd be more information than we have now ...
regards, tom lane
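
For anyone wanting to collect the stack trace discussed above, here is a minimal sketch of pulling a backtrace out of a core file in batch mode. It assumes gdb is installed on the database host (the thread itself only used pstack) and that the server binary was built with debug symbols, for example via PostgreSQL's `--enable-debug` configure option; without symbols the output is unlikely to be more useful than the pstack shown earlier. The function names are hypothetical; the binary path and core file name are taken from the pstack output in this thread.

```python
import shutil
import subprocess


def backtrace_command(binary, core):
    """Build a gdb invocation that prints a backtrace and exits.

    -batch makes gdb exit after running its commands; -ex runs one command.
    """
    return ["gdb", "-batch", "-ex", "bt", binary, core]


def collect_backtrace(binary, core):
    """Run gdb on the core file and return whatever it printed to stdout."""
    if shutil.which("gdb") is None:
        # Assumption: gdb is the debugger in use; it may not be installed.
        raise RuntimeError("gdb not found on PATH")
    result = subprocess.run(
        backtrace_command(binary, core),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero on a damaged core; keep the output anyway
    )
    return result.stdout


if __name__ == "__main__":
    # Paths as reported by pstack in this thread; adjust for the actual core location.
    print(collect_backtrace("/usr/local/pgsql/bin/postgres", "core"))
```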