Обсуждение: PGError: server closed the connection unexpectedly

Поиск
Список
Период
Сортировка

PGError: server closed the connection unexpectedly

От
Dave Steinberg
Дата:
Hello list,

I've been working a bit today to resolve the error I'm seeing mentioned
in the title, and I was hoping you might have some insight.  I've
managed to semi-reliably reproduce this based on two different tasks:

- a pg_dumpall from my backup server will throw this, but not always on
the same database.  Sometimes it seems to get 'stuck' on one, but later
if I try it again it'll throw the error on a different DB.  The
pg_dumpall appears to work perfectly when run locally.

- A unit test for a Rails app I'm developing will consistently throw
this error, though oddly I can run the problem statement in psql, over
the network, without problem.

My biggest problem is the lack of any real error message on the server.
    I don't see anything wrong in the system logs, and there's no core
file in the /var/postgresql directory.  I did a 'vacuumdb -afz' just as
a shot in the dark, without affect.  Pretty much all I see in the logs
is this:

<USER%DB x.y.z.a(51478) 487>LOG:  unexpected EOF on client connection

Googling turned up a few reports suggesting bad hardware, or corrupted
indexes, but I don't think that's the case here.

Any starting points or ideas would be greatly appreciated.

Basic server info:
- Less than 3 months old
- OpenBSD 3.8, patched fully, x86 w/ a P4
- 2G of Ram, ~1.5G free usually
- /var/postgresql partition has about 8G free, and is backed by a AMI
hardware raid-1 array.
- PostgreSQL version 8.0.3 on server and clients, built locally from
OpenBSD ports
- ruby-postgres adapter version 0.7.1 used in the rails project

Special OS Config bits:

- kern.seminfo sysctl tree:
kern.seminfo.semmni=256
kern.seminfo.semmns=2048
kern.seminfo.semmnu=30
kern.seminfo.semmsl=60
kern.seminfo.semopm=100
kern.seminfo.semume=10
kern.seminfo.semusz=100
kern.seminfo.semvmx=32767
kern.seminfo.semaem=16384

- Bits from _postgresql's ulimit:
$ ulimit -a
time(cpu-seconds)    unlimited
file(blocks)         unlimited
coredump(blocks)     unlimited
data(kbytes)         1048576
stack(kbytes)        8192
lockedmem(kbytes)    635424
memory(kbytes)       1905136
nofiles(descriptors) 768
processes            532

- postgresql.conf changes:
max_connections = 200
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'
log_duration = true
stats_start_collector = true
stats_command_string = false
stats_block_level = false
stats_row_level = false

Thanks in advance!
PS - My apologies if this is a dup!
--
Dave Steinberg
http://www.geekisp.com/
http://www.steinbergcomputing.com/

Re: PGError: server closed the connection unexpectedly

От
Doug McNaught
Дата:
Dave Steinberg <dave-pgsql@redterror.net> writes:

> My biggest problem is the lack of any real error message on the server.
>     I don't see anything wrong in the system logs, and there's no core
> file in the /var/postgresql directory.

Are you sure core files are enabled; i.e. the server is running with
'ulimit -c unlimited' ?

>     I did a 'vacuumdb -afz' just as
> a shot in the dark, without affect.  Pretty much all I see in the logs
> is this:
>
> <USER%DB x.y.z.a(51478) 487>LOG:  unexpected EOF on client connection

This means a client is dying or closing its connection prematurely,
and would seem to be a different problem.  It shouldn't ever cause the
server to crash.

> Googling turned up a few reports suggesting bad hardware, or corrupted
> indexes, but I don't think that's the case here.
>
> Any starting points or ideas would be greatly appreciated.

Make sure the server is able to dump a core file, and perhaps crank up
the logging level.

-Doug

Re: PGError: server closed the connection unexpectedly

От
Dave Steinberg
Дата:
>>My biggest problem is the lack of any real error message on the server.
>>    I don't see anything wrong in the system logs, and there's no core
>>file in the /var/postgresql directory.
>
>
> Are you sure core files are enabled; i.e. the server is running with
> 'ulimit -c unlimited' ?

Yes:

$ whoami
_postgresql
$ ulimit -c
unlimited


>>    I did a 'vacuumdb -afz' just as
>>a shot in the dark, without affect.  Pretty much all I see in the logs
>>is this:
>>
>><USER%DB x.y.z.a(51478) 487>LOG:  unexpected EOF on client connection
>
>
> This means a client is dying or closing its connection prematurely,
> and would seem to be a different problem.  It shouldn't ever cause the
> server to crash.

If the server is crashing - I have absolutely no info to suggest that,
except possibly this: at a later time I did a tcpdump of one of these
sessions, for lack of better ideas, and saw this near the end:

<snip lots of tcpdump output>
16:32:19.523842 clam.int.geekisp.com.15245 >
morningsun.int.geekisp.com.postgresql: P 17389:18036(647) ack 9323 win
16384 <nop,nop,timestamp 133396667 3702263885> (DF)
16:32:19.524289 morningsun.int.geekisp.com.postgresql >
clam.int.geekisp.com.15245: P 9323:9343(20) ack 18036 win 17376
<nop,nop,timestamp 3702263885 133396667> (DF)
16:32:19.562544 clam.int.geekisp.com.15245 >
morningsun.int.geekisp.com.postgresql: P 18036:18041(5) ack 9343 win
16384 <nop,nop,timestamp 133396667 3702263885> (DF)
******16:32:19.566662 morningsun.int.geekisp.com.postgresql >
clam.int.geekisp.com.15245: F 9343:9343(0) ack 18041 win 17376
<nop,nop,timestamp 3702263885 133396667> (DF)
16:32:19.567043 clam.int.geekisp.com.15245 >
morningsun.int.geekisp.com.postgresql: . ack 9344 win 16384
<nop,nop,timestamp 133396667 3702263885> (DF)
16:32:19.649445 clam.int.geekisp.com.15245 >
morningsun.int.geekisp.com.postgresql: P 18041:18851(810) ack 9344 win
16384 <nop,nop,timestamp 133396667 3702263885> (DF)
16:32:19.649468 morningsun.int.geekisp.com.postgresql >
clam.int.geekisp.com.15245: R 545360451:545360451(0) win 0 (DF)
16:32:19.649652 clam.int.geekisp.com.15245 >
morningsun.int.geekisp.com.postgresql: F 18851:18851(0) ack 9344 win
16384 <nop,nop,timestamp 133396667 3702263885> (DF)
16:32:19.649665 morningsun.int.geekisp.com.postgresql >
clam.int.geekisp.com.15245: R 545360451:545360451(0) win 0 (DF)

The '*****' above is my own, to highlight the interesting part.
Morningsun is the server, clam is the ruby client.  Based on this, I
think the server *might* be dying first (hard to tell), since morningsun
is the one who sends the FIN packet and thereby closes the connection.
(Sorry for the horrible wrapping).

Again, having trouble making sure.

>>Googling turned up a few reports suggesting bad hardware, or corrupted
>>indexes, but I don't think that's the case here.
>>
>>Any starting points or ideas would be greatly appreciated.
>
>
> Make sure the server is able to dump a core file, and perhaps crank up
> the logging level.

Just to verify that a core dump is possible for the postgresql user, I
wrote a tiny C program that just calls abort().  Sure enough, running it
gives me a core file.

I cranked the debug level up to 5 (man that's a lot), and here's what I
think is the relevent chunk:

<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>STATEMENT:  UPDATE
job_members SET "start_time" = NULL, "exit_code" = NULL, "create_time" =
'2006-01-07 17:10:31', "job_id" = 30, "command" = '---
!ruby/object:GeekISP::ShellCommand
         args:
         command: /bin/ls /tmp/
         exit_code:
         logger: !ruby/object:Logger
           datetime_format:
           level: 0
           logdev: !ruby/object:Logger::LogDevice
             dev: !ruby/object:File {}
             filename:
/mnt/scratch/dave/control_panel/trunk/config/../log/test.log
             shift_age: 0
             shift_size: 1048576
           progname:
         run_as:
         stderr:
         stdin:
         stdout: ', "finish_time" = NULL, "last_update_time" = NULL,
"host_id" = 4, "status" = 'in_progress', "output" = NULL WHERE
job_member_id = 105
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>DEBUG:  name:
unnamed; blockState:       STARTED; state: INPROGR, xid/subid/cid:
10290432/1/0, nestlvl: 1, children: <>
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>STATEMENT:  UPDATE
job_members SET "start_time" = NULL, "exit_code" = NULL, "create_time" =
'2006-01-07 17:10:31', "job_id" = 30, "command" = '---
!ruby/object:GeekISP::ShellCommand
         args:
         command: /bin/ls /tmp/
         exit_code:
         logger: !ruby/object:Logger
           datetime_format:
           level: 0
           logdev: !ruby/object:Logger::LogDevice
             dev: !ruby/object:File {}
             filename:
/mnt/scratch/dave/control_panel/trunk/config/../log/test.log
             shift_age: 0
             shift_size: 1048576
           progname:
         run_as:
         stderr:
         stdin:
         stdout: ', "finish_time" = NULL, "last_update_time" = NULL,
"host_id" = 4, "status" = 'in_progress', "output" = NULL WHERE
job_member_id = 105
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>DEBUG:  proc_exit(0)
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>DEBUG:  shmem_exit(0)
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>LOG:  disconnection:
session time: 0:00:00.84 user=geekispv2 database=geekisp-v2-test
host=192.168.4.38 port=31992
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>DEBUG:  exit(0)
<%  7303>DEBUG:  reaping dead processes
<%  7303>DEBUG:  server process (PID 9155) exited with exit code 0

That looks to me like a clean and normal exit.  This is pointing more
and more towards the client in the ruby case, isn't it?

Regards,
--
Dave Steinberg
http://www.geekisp.com/
http://www.steinbergcomputing.com/

Re: PGError: server closed the connection unexpectedly

От
Doug McNaught
Дата:
Dave Steinberg <dave@redterror.net> writes:

>>>My biggest problem is the lack of any real error message on the server.
>>>    I don't see anything wrong in the system logs, and there's no core
>>> file in the /var/postgresql directory.
>> Are you sure core files are enabled; i.e. the server is running with
>> 'ulimit -c unlimited' ?
>
> Yes:
>
> $ whoami
> _postgresql
> $ ulimit -c
> unlimited

But does the startup script for PG set the limits as well?  It's quite
possible that the PG daemon startup sequence and logging in as the PG user go
through different scripts.

Also, make sure that you're looking in the right place for core
dumps--OpenBSD may put them somewhere weird by default.

[...]

> That looks to me like a clean and normal exit.  This is pointing more
> and more towards the client in the ruby case, isn't it?

Yeah, if the server were crashing its exit code would be greater than
127.

Also, usually when a backend crashes, the postmaster takes the whole
server down on the assumption that shared memory may have been
corrupted.  It doesn't sound like this is happening to you, which
again points to a client problem.

You're not using the same PG connection from two different threads, or
fork()ing and trying to use the same connection in the parent and the
child, or anything like that?

-Doug

Re: PGError: server closed the connection unexpectedly

От
Dave Steinberg
Дата:
> You're not using the same PG connection from two different threads, or
> fork()ing and trying to use the same connection in the parent and the
> child, or anything like that?

Aha!  In the ruby code, I am forking!  I'll make the child reconnect and
see if that helps (I am almost sure this will fix it).  This one I
should have guessed - I dealt with similar stuff in perl somewhat recently.

The pg_dumpall problem...  I'll crank the debug level and see if there's
anything interesting there, and if so I'll post about it again.

Thanks Doug!
--
Dave Steinberg
http://www.geekisp.com/
http://www.steinbergcomputing.com/

Re: PGError: server closed the connection unexpectedly

От
Doug McNaught
Дата:
Dave Steinberg <dave@redterror.net> writes:

>> You're not using the same PG connection from two different threads, or
>> fork()ing and trying to use the same connection in the parent and the
>> child, or anything like that?
>
> Aha!  In the ruby code, I am forking!  I'll make the child reconnect
> and see if that helps (I am almost sure this will fix it).  This one I
> should have guessed - I dealt with similar stuff in perl somewhat
> recently.

Yay!  I thought it might be something like that.

> The pg_dumpall problem...  I'll crank the debug level and see if
> there's anything interesting there, and if so I'll post about it again.

You could also, if you have to, run the BSD equivalent of 'strace'
(ktrace?) against the backend that pg_dump is connected to, and see
what might be going on.  'strace' is the gun I pull out when logfiles
aren't working for me.  :)

-Doug