Discussion: PGError: server closed the connection unexpectedly
Hello list, I've been working a bit today to resolve the error mentioned in the title, and I was hoping you might have some insight. I've managed to semi-reliably reproduce this with two different tasks:

- A pg_dumpall from my backup server will throw this, but not always on the same database. Sometimes it seems to get 'stuck' on one, but later if I try it again it'll throw the error on a different DB. The pg_dumpall appears to work perfectly when run locally.
- A unit test for a Rails app I'm developing will consistently throw this error, though oddly I can run the problem statement in psql, over the network, without a problem.

My biggest problem is the lack of any real error message on the server. I don't see anything wrong in the system logs, and there's no core file in the /var/postgresql directory. I did a 'vacuumdb -afz' just as a shot in the dark, without effect. Pretty much all I see in the logs is this:

<USER%DB x.y.z.a(51478) 487>LOG: unexpected EOF on client connection

Googling turned up a few reports suggesting bad hardware or corrupted indexes, but I don't think that's the case here. Any starting points or ideas would be greatly appreciated.

Basic server info:
- Less than 3 months old
- OpenBSD 3.8, fully patched, x86 w/ a P4
- 2G of RAM, ~1.5G free usually
- The /var/postgresql partition has about 8G free, and is backed by an AMI hardware RAID-1 array.
- PostgreSQL version 8.0.3 on server and clients, built locally from OpenBSD ports
- ruby-postgres adapter version 0.7.1 used in the Rails project

Special OS config bits:

- kern.seminfo sysctl tree:
  kern.seminfo.semmni=256
  kern.seminfo.semmns=2048
  kern.seminfo.semmnu=30
  kern.seminfo.semmsl=60
  kern.seminfo.semopm=100
  kern.seminfo.semume=10
  kern.seminfo.semusz=100
  kern.seminfo.semvmx=32767
  kern.seminfo.semaem=16384

- Bits from _postgresql's ulimit:
  $ ulimit -a
  time(cpu-seconds)     unlimited
  file(blocks)          unlimited
  coredump(blocks)      unlimited
  data(kbytes)          1048576
  stack(kbytes)         8192
  lockedmem(kbytes)     635424
  memory(kbytes)        1905136
  nofiles(descriptors)  768
  processes             532

- postgresql.conf changes:
  max_connections = 200
  syslog_facility = 'LOCAL0'
  syslog_ident = 'postgres'
  log_duration = true
  stats_start_collector = true
  stats_command_string = false
  stats_block_level = false
  stats_row_level = false

Thanks in advance!

PS - My apologies if this is a dup!

--
Dave Steinberg
http://www.geekisp.com/
http://www.steinbergcomputing.com/
Dave Steinberg <dave-pgsql@redterror.net> writes:

> My biggest problem is the lack of any real error message on the server.
> I don't see anything wrong in the system logs, and there's no core
> file in the /var/postgresql directory.

Are you sure core files are enabled, i.e. the server is running with 'ulimit -c unlimited'?

> I did a 'vacuumdb -afz' just as a shot in the dark, without effect.
> Pretty much all I see in the logs is this:
>
> <USER%DB x.y.z.a(51478) 487>LOG: unexpected EOF on client connection

This means a client is dying or closing its connection prematurely, and it would seem to be a different problem. It shouldn't ever cause the server to crash.

> Googling turned up a few reports suggesting bad hardware or corrupted
> indexes, but I don't think that's the case here.
>
> Any starting points or ideas would be greatly appreciated.

Make sure the server is able to dump a core file, and perhaps crank up the logging level.

-Doug
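[Doug's "are core files enabled?" check can be done from inside a process rather than from a login shell; a minimal Ruby sketch (assuming a Ruby recent enough to expose Process.getrlimit), which is worth running from the same environment that launches the daemon:]

```ruby
# Report this process's core-file size limit (RLIMIT_CORE).
# Run from the environment that starts the server: the limit the daemon
# inherits there may differ from an interactive shell's `ulimit -c`.
soft, hard = Process.getrlimit(Process::RLIMIT_CORE)

fmt = lambda { |v| v == Process::RLIM_INFINITY ? 'unlimited' : v.to_s }
puts "core soft limit: #{fmt.call(soft)}"
puts "core hard limit: #{fmt.call(hard)}"
```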
>> My biggest problem is the lack of any real error message on the server.
>> I don't see anything wrong in the system logs, and there's no core
>> file in the /var/postgresql directory.
>
> Are you sure core files are enabled, i.e. the server is running with
> 'ulimit -c unlimited'?

Yes:

$ whoami
_postgresql
$ ulimit -c
unlimited

>> I did a 'vacuumdb -afz' just as a shot in the dark, without effect.
>> Pretty much all I see in the logs is this:
>>
>> <USER%DB x.y.z.a(51478) 487>LOG: unexpected EOF on client connection
>
> This means a client is dying or closing its connection prematurely,
> and it would seem to be a different problem. It shouldn't ever cause
> the server to crash.

As for the server crashing - I have absolutely no info to suggest that, except possibly this: at a later time I did a tcpdump of one of these sessions, for lack of better ideas, and saw this near the end:

<snip lots of tcpdump output>
16:32:19.523842 clam.int.geekisp.com.15245 > morningsun.int.geekisp.com.postgresql: P 17389:18036(647) ack 9323 win 16384 <nop,nop,timestamp 133396667 3702263885> (DF)
16:32:19.524289 morningsun.int.geekisp.com.postgresql > clam.int.geekisp.com.15245: P 9323:9343(20) ack 18036 win 17376 <nop,nop,timestamp 3702263885 133396667> (DF)
16:32:19.562544 clam.int.geekisp.com.15245 > morningsun.int.geekisp.com.postgresql: P 18036:18041(5) ack 9343 win 16384 <nop,nop,timestamp 133396667 3702263885> (DF)
****** 16:32:19.566662 morningsun.int.geekisp.com.postgresql > clam.int.geekisp.com.15245: F 9343:9343(0) ack 18041 win 17376 <nop,nop,timestamp 3702263885 133396667> (DF)
16:32:19.567043 clam.int.geekisp.com.15245 > morningsun.int.geekisp.com.postgresql: . ack 9344 win 16384 <nop,nop,timestamp 133396667 3702263885> (DF)
16:32:19.649445 clam.int.geekisp.com.15245 > morningsun.int.geekisp.com.postgresql: P 18041:18851(810) ack 9344 win 16384 <nop,nop,timestamp 133396667 3702263885> (DF)
16:32:19.649468 morningsun.int.geekisp.com.postgresql > clam.int.geekisp.com.15245: R 545360451:545360451(0) win 0 (DF)
16:32:19.649652 clam.int.geekisp.com.15245 > morningsun.int.geekisp.com.postgresql: F 18851:18851(0) ack 9344 win 16384 <nop,nop,timestamp 133396667 3702263885> (DF)
16:32:19.649665 morningsun.int.geekisp.com.postgresql > clam.int.geekisp.com.15245: R 545360451:545360451(0) win 0 (DF)

The '******' above is my own, to highlight the interesting part. Morningsun is the server; clam is the Ruby client. Based on this, I think the server *might* be dying first (hard to tell), since morningsun is the one that sends the FIN packet and thereby closes the connection. Again, having trouble making sure.

>> Googling turned up a few reports suggesting bad hardware or corrupted
>> indexes, but I don't think that's the case here.
>>
>> Any starting points or ideas would be greatly appreciated.
>
> Make sure the server is able to dump a core file, and perhaps crank up
> the logging level.

Just to verify that a core dump is possible for the postgresql user, I wrote a tiny C program that just calls abort(). Sure enough, running it gives me a core file.
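[The abort() sanity check can also be done without writing C; a rough Ruby equivalent (Unix-only, matching the OpenBSD setup here): fork a child, have it send itself SIGABRT, and let the parent report how it died. Whether a core file actually appears still depends on RLIMIT_CORE and where the OS writes core files.]

```ruby
# Fork a child that aborts itself with SIGABRT, then inspect its exit status.
pid = fork { Process.kill('ABRT', Process.pid) }
Process.wait(pid)

status = $?
puts "terminated by signal: #{status.termsig}"   # SIGABRT (6 on most systems)
puts "core dumped: #{status.coredump?}"          # false if RLIMIT_CORE forbids it
```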
I cranked the debug level up to 5 (man, that's a lot), and here's what I think is the relevant chunk:

<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>STATEMENT: UPDATE job_members SET "start_time" = NULL, "exit_code" = NULL, "create_time" = '2006-01-07 17:10:31', "job_id" = 30, "command" = '--- !ruby/object:GeekISP::ShellCommand args: command: /bin/ls /tmp/ exit_code: logger: !ruby/object:Logger datetime_format: level: 0 logdev: !ruby/object:Logger::LogDevice dev: !ruby/object:File {} filename: /mnt/scratch/dave/control_panel/trunk/config/../log/test.log shift_age: 0 shift_size: 1048576 progname: run_as: stderr: stdin: stdout: ', "finish_time" = NULL, "last_update_time" = NULL, "host_id" = 4, "status" = 'in_progress', "output" = NULL WHERE job_member_id = 105
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>DEBUG: name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 10290432/1/0, nestlvl: 1, children: <>
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>STATEMENT: UPDATE job_members SET "start_time" = NULL, "exit_code" = NULL, "create_time" = '2006-01-07 17:10:31', "job_id" = 30, "command" = '--- !ruby/object:GeekISP::ShellCommand args: command: /bin/ls /tmp/ exit_code: logger: !ruby/object:Logger datetime_format: level: 0 logdev: !ruby/object:Logger::LogDevice dev: !ruby/object:File {} filename: /mnt/scratch/dave/control_panel/trunk/config/../log/test.log shift_age: 0 shift_size: 1048576 progname: run_as: stderr: stdin: stdout: ', "finish_time" = NULL, "last_update_time" = NULL, "host_id" = 4, "status" = 'in_progress', "output" = NULL WHERE job_member_id = 105
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>DEBUG: proc_exit(0)
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>DEBUG: shmem_exit(0)
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>LOG: disconnection: session time: 0:00:00.84 user=geekispv2 database=geekisp-v2-test host=192.168.4.38 port=31992
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>DEBUG: exit(0)
<% 7303>DEBUG: reaping dead processes
<% 7303>DEBUG: server process (PID 9155) exited with exit code 0

That looks to me like a clean and normal exit. This is pointing more and more towards the client in the Ruby case, isn't it?

Regards,
--
Dave Steinberg
http://www.geekisp.com/
http://www.steinbergcomputing.com/
Dave Steinberg <dave@redterror.net> writes:

>>> My biggest problem is the lack of any real error message on the server.
>>> I don't see anything wrong in the system logs, and there's no core
>>> file in the /var/postgresql directory.
>>
>> Are you sure core files are enabled, i.e. the server is running with
>> 'ulimit -c unlimited'?
>
> Yes:
>
> $ whoami
> _postgresql
> $ ulimit -c
> unlimited

But does the startup script for PG set the limits as well? It's quite possible that the PG daemon startup sequence and logging in as the PG user go through different scripts. Also, make sure that you're looking in the right place for core dumps - OpenBSD may put them somewhere weird by default.

[...]

> That looks to me like a clean and normal exit. This is pointing more
> and more towards the client in the Ruby case, isn't it?

Yeah, if the server were crashing, its exit code would be greater than 127. Also, usually when a backend crashes, the postmaster takes the whole server down on the assumption that shared memory may have been corrupted. It doesn't sound like this is happening to you, which again points to a client problem.

You're not using the same PG connection from two different threads, or fork()ing and trying to use the same connection in the parent and the child, or anything like that?

-Doug
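[Doug's fork() question points at the underlying mechanism: after fork, parent and child share the same inherited socket descriptor, so both can write into one connection. A minimal Ruby sketch with a local socket pair (no database involved) shows the interleaved byte stream; a protocol-aware peer such as a PostgreSQL backend sees exactly this kind of mixed stream as garbage and drops the session:]

```ruby
require 'socket'

# One connection, two writers: the child inherits the parent's socket.
client, server = UNIXSocket.pair

pid = fork do
  client.write('child-bytes|')   # child reuses the parent's descriptor
  client.close
  exit!
end
Process.wait(pid)                # let the child write first, for a stable demo

client.write('parent-bytes')
client.close

puts server.read                 # bytes from BOTH processes on one stream
```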
> You're not using the same PG connection from two different threads, or
> fork()ing and trying to use the same connection in the parent and the
> child, or anything like that?

Aha! In the Ruby code, I am forking! I'll make the child reconnect and see if that helps (I'm almost sure this will fix it). This one I should have guessed - I dealt with similar stuff in Perl somewhat recently.

The pg_dumpall problem... I'll crank the debug level and see if there's anything interesting there, and if so I'll post about it again.

Thanks, Doug!

--
Dave Steinberg
http://www.geekisp.com/
http://www.steinbergcomputing.com/
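[The "make the child reconnect" fix has the shape sketched below. To keep the sketch runnable without a database, `open_connection` is a stand-in for the real ruby-postgres call (roughly `PGconn.connect(host, port, opts, tty, db, user, pass)` in the 0.7.x API); the point is only the structure: each process opens its own connection, and the child never touches the parent's.]

```ruby
# Stand-in for PGconn.connect(...); a plain object avoids needing a server.
open_connection = lambda { Object.new }

parent_conn = open_connection.call

pid = fork do
  # Never reuse parent_conn here: it wraps the same socket descriptor.
  child_conn = open_connection.call
  # ... the child runs its queries on child_conn ...
  exit!(child_conn.equal?(parent_conn) ? 1 : 0)
end
Process.wait(pid)

puts "child used a fresh connection: #{$?.exitstatus == 0}"
```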
Dave Steinberg <dave@redterror.net> writes:

>> You're not using the same PG connection from two different threads, or
>> fork()ing and trying to use the same connection in the parent and the
>> child, or anything like that?
>
> Aha! In the Ruby code, I am forking! I'll make the child reconnect
> and see if that helps (I'm almost sure this will fix it). This one I
> should have guessed - I dealt with similar stuff in Perl somewhat
> recently.

Yay! I thought it might be something like that.

> The pg_dumpall problem... I'll crank the debug level and see if
> there's anything interesting there, and if so I'll post about it again.

You could also, if you have to, run the BSD equivalent of 'strace' (ktrace?) against the backend that pg_dump is connected to, and see what might be going on. 'strace' is the gun I pull out when logfiles aren't working for me. :)

-Doug