Discussion: posmaster failed under high load
Today I ran http_load to benchmark my Web-DB application and found that under high load (about 18 postgres processes running) the postmaster failed. Restarting the postmaster produces this error:

IpcMemoryCreate: shmget failed (Identifier removed) key=5432010, size=24588, permission=700
IpcMemoryIdGet: shmget failed (Identifier removed) key=5432010, size=24588, permission=0
IpcMemoryAttach: shmat failed (Invalid argument) id=-2
FATAL 1: AttachSLockMemory: could not attach segment

I checked shared memory:

23:27[zeus]:~>ipcs -a

------ Shared Memory Segments --------
shmid     owner     perms     bytes     nattch    status
10496     postgres  700       24588     5         dest
10497     postgres  600       8852184   5         dest
10498     postgres  600       96804     5         dest

------ Semaphore Arrays --------
semid     owner     perms     nsems     status

------ Message Queues --------
msqid     owner     perms     used-bytes  messages

Does this error mean I need to increase shared memory?

My setup: Linux 2.0.36, Dual PPRO, 256 Mb RAM
Postgres 6.4.2

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
> Does this error means I need to increase shared memory ?
> My setup: Linux 2.0.36, Dual PPRO, 256 Mb RAM
> Postgres 6.4.2

I don't think so, unless you increased the shared buffer size using the -B option. Stock 6.4.2 is very buggy in its shared memory usage, and that is probably the cause. Try Tom Lane's fix or 6.5b. I have tested 6.5b with 128 backends running and it seems very stable.

Another possibility is that you don't have enough file descriptors. What do you get from:

$ cat /proc/sys/kernel/file-max

1024 or so is not enough. You could increase it with:

echo 4096 > /proc/sys/kernel/file-max

Decreasing the number of file descriptors used per backend is also a good idea. Try:

ulimit -n 20

before starting postmaster.
---
Tatsuo Ishii
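The file-descriptor advice above comes down to simple budgeting. Here is a minimal Python sketch of that arithmetic; it is not part of the original thread. `fd_budget` is a hypothetical helper, and the figure of 90 open files per backend is only an illustrative assumption. The 18 backends and the 1024/4096 table sizes are the numbers mentioned in the messages. `resource.getrlimit` reports the per-process limit that `ulimit -n` adjusts.

```python
import resource


def fd_budget(file_max, backends, fds_per_backend):
    """Return how many system file-table entries remain after the backends take theirs.

    A negative result means the system-wide table would be exhausted.
    """
    return file_max - backends * fds_per_backend


def per_process_fd_limit():
    """Return the (soft, hard) open-file limits of this process, as adjusted by `ulimit -n`."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    # 18 backends is the load reported in the thread; 90 files per backend is an assumption.
    for file_max in (1024, 4096):
        print(file_max, fd_budget(file_max, backends=18, fds_per_backend=90))
    print("per-process limit (soft, hard):", per_process_fd_limit())
```

With these assumed numbers, a 1024-entry table falls short by 596 entries, while 4096 leaves 2476 free. The sketch ignores descriptors held by every other process on the machine, so a real budget would be tighter.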
On Tue, 4 May 1999, Tatsuo Ishii wrote:

> I don't think so unless you increased the shared buffer size using -B
> option. Stock 6.4.2 is very buggy with the shared memory
> usage. Probably it's the cause. Try Tom Lane's fix or 6.5b. I have
> tested 6.5b with 128 backends running and it seems very stable.

Yes, I used 6.4.2 + LIMIT patch. I'll try 6.5 from cvs.
I run postmaster with the -B 1024 option - is this too much?

> Another possibility is you don't have enough file descriptors.
> What do you get by:
> $ cat /proc/sys/kernel/file-max
> 1024 or so is not enough. You could increase it by:
> echo 4096 > /proc/sys/kernel/file-max
> Decreasing the usage of file descriptors per backend is also a good idea.
> try:
> ulimit -n 20
> before starting postmaster.

Thanks a lot. I have run into problems with file descriptors several times; it looks like every backend opens about 90 files. I'll try your hints. Why not add your experience of running postgres under high load to the Linux-specific FAQ?

Regards,
Oleg
> Thanks a lot, I got several times a problem with file descriptors,
> it looks like every backend opens about 90 files. I'll try your
> hints. Why not add your experience how to work with postgres under
> high load to Linux specific FAQ ?

Good idea. Do you have time to add this topic yourself? Actually, the general problem is common to all platforms, so we might want to write up something for the Admin Guide too.

- Tom
--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California
> > I don't think so unless you increased the shared buffer size using -B
> > option. Stock 6.4.2 is very buggy with the shared memory
> > usage. Probably it's the cause. Try Tom Lane's fix or 6.5b. I have
> > tested 6.5b with 128 backends running and it seems very stable.
>
> Yes, I used 6.4.2 + LIMIT patch, I'll try 6.5 from cvs

You need Tom Lane's shared memory fix patch if you use 6.4.2. 6.5 has the fix.

> I run postmaster with -B 1024 option - is this too much ?

No. -B 1024 means 8MB of shared memory, which should be OK on x86/Linux (I think most x86-based Linux systems allow 32MB of shared memory).

> Thanks a lot, I got several times a problem with file descriptors,
> it looks like every backend opens about 90 files. I'll try your
> hints.

But be careful: a lower number of file descriptors per backend might cause lower performance because of the file-opening overhead. So you should increase the number of file table entries in the system first.

> Why not add your experience how to work with postgres under high
> load to Linux specific FAQ ?

I'm not good at English, that is the reason :-)

BTW, a FreeBSD box has more serious problems than Linux, since the default kernel has a lower limit on file descriptors (~700). This should be noted somewhere in the docs too.
---
Tatsuo Ishii
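The "-B 1024 means 8MB" figure can be checked with a line of arithmetic. The Python sketch below is not from the thread, and it assumes each shared buffer is one 8192-byte block (the usual PostgreSQL default, which a custom build could change). `shared_buffer_bytes` is a hypothetical helper name.

```python
BLOCK_SIZE = 8192  # bytes per buffer; assumed default PostgreSQL block size


def shared_buffer_bytes(n_buffers, block_size=BLOCK_SIZE):
    """Size in bytes of the buffer pool requested with `postmaster -B n_buffers`."""
    return n_buffers * block_size


if __name__ == "__main__":
    pool = shared_buffer_bytes(1024)
    print(pool, pool / 2**20)  # buffer pool in bytes and in MB
    # The largest segment in the ipcs listing earlier in the thread was
    # 8852184 bytes; the difference is whatever else shares that segment.
    print(8852184 - pool)
```

Under that assumption the pool is 8388608 bytes (8 MB), which lines up with the 8852184-byte segment in the `ipcs -a` output from the first message: the remaining 463576 bytes would be other data kept in the same segment.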
Interesting, I just tried to load my home machine and got a very weird result:

   76 ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
  218 ?  SW   0:02 (postmaster)
  219 ?  SW   0:02 (postmaster)
  220 ?  SW   0:01 (postmaster)
  222 ?  SW   0:02 (postmaster)
  241 ?  SW   0:02 (postmaster)
  242 ?  SW   0:02 (postmaster)
  252 ?  SW   0:01 (postmaster)
  263 ?  SW   0:01 (postmaster)
  372 ?  SW   0:00 (postmaster)
  377 ?  SW   0:00 (postmaster)
  378 ?  SW   0:00 (postmaster)
  379 ?  SW   0:00 (postmaster)
  383 ?  SW   0:00 (postmaster)
  387 ?  SW   0:00 (postmaster)
  388 ?  SW   0:00 (postmaster)
  406 ?  SW   0:00 (postmaster)

The system is still in working condition and psql can connect! The postmasters seem to die off with time, but after 15 minutes I still see 7 postmasters.

This is my scenario and setup:

P166, 64Mb RAM, Linux 2.2.7, PostgreSQL 6.4.2 on i586-pc-linux-gnu, compiled by gcc egcs-2.91.5
(This is a stock 6.4.2 + LIMIT feature patch. Tatsuo suggests using Tom's shared memory patches, but I didn't find them.)

I have apache 1.3.6 + mod_perl 1.19, Apache::DBI to open a persistent connection to the database, and Mason (http://www.masonhq.com) to produce a dynamic document, which I retrieve using http_load from http://www.acme.com/jef/. I found http_load quite useful for testing because it doesn't bog down the client machine (in my case client and server are on the same machine).

So, I run

./http_load -verbose -checksum -rate 25 -fetches 40 TEST

where TEST is a file with the URL of the document.

In case it is interesting, here are the benchmarks:

40 fetches, 117 max parallel, 403192 bytes, in 9 seconds
10079.8 mean bytes/connection
4.44444 fetches/sec, 44799.1 bytes/sec
38 bad checksums

I'm going to test 6.5 cvs.

Regards,
Oleg
Well, I just ran the test with 6.5 cvs and it looks much more stable. I ran

./http_load -rate 20 -verbose -fetches 80 TEST

(notice, this test is much heavier than in the previous post) and got these results:

81 fetches, 393 max parallel, 809028 bytes, in 24 seconds
9988 mean bytes/connection
3.375 fetches/sec, 33709.5 bytes/sec

My machine was very heavily loaded during this test. I saw a peak load of about 65 and a lot of swapping, but the test completed, and after 20 minutes of swapping the system remains usable. I still saw many postmaster (not postgres) processes running, but after about 30-40 minutes they were gone. Actually, pstree -a now shows

 |-postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
 |   |-(postmaster)
 |   `-postmaster

Is this normal behaviour for postmaster? I thought there must be only one postmaster, which forks postgres processes for every connection. Anyway, the system is usable, and postmaster survives and continues working! 6.5 is much more stable in this respect. I run postmaster with the -B 1024 option.

I ran this test under Linux 2.2.7, so tomorrow I'll test on my production server, which runs Linux 2.0.36, SMP, Dual PPRO, 256 Mb RAM. As I wrote, 6.4.2 fails under high load, so I'll test 6.5 cvs to see which is critical: the kernel or the postgres version.

Regards,
Oleg
> My machine was very-very load during this test - I saw peak
> load about 65, a lot of swapping but test completes and system
> after 20 minutes of swapping remains usable. I still saw many
> postmasters (not postgres) processes running but after about
> 30-40 minutes they gone. Actually pstree -a now shows
>
> |-postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
> |   |-(postmaster)
> |   `-postmaster

ps should show our process listing display change. They are postgres processes, but without the exec() call we used to do, it shows this way only on OS's that don't support ps arg display changes from inside the process.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
On Tue, 4 May 1999, Bruce Momjian wrote:

> ps should show our process listing display change. They are postgres
> processes, but without the exec() call we used to do, it shows this way
> only on OS's that don't support ps arg display changes from inside the
> process.

No, it does work on Linux:

 5159 ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
 5168 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
 5169 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
 5170 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
 5171 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle

That's why I noticed the 10 or more (postmaster) processes, which were eventually gone after 30-40 minutes.

Oleg
Oleg Bartunov <oleg@sai.msu.su> writes:
> That's why I noticed 10 or more (postmaster) processes, which eventually
> gone after 30-40 minutes.

Could those be new backends that have been forked off the main postmaster, but haven't yet gotten around to changing their ps info? I'm not sure what would block a new backend for many minutes before it did that, however. Can you attach to one of these processes with a debugger and get a backtrace to show what it's doing?

regards, tom lane
On Wed, 5 May 1999, Tom Lane wrote:

> Could those be new backends that have been forked off the main
> postmaster, but haven't yet gotten around to changing their ps info?
> I'm not sure what would block a new backend for many minutes before
> it did that, however. Can you attach to one of these processes with
> a debugger and get a backtrace to show what it's doing?

Well, here are the results of http_load -r 40 -f 240 MASON-DBI:

244 fetches, 1020 max parallel, 272060 bytes, in 52 seconds
1115 mean bytes/connection
4.69231 fetches/sec, 5231.92 bytes/sec

Below are some output from ps and a backtrace from attaching to one postmaster process.

Regards,
Oleg

PS. We will see what happens with those (postmaster) processes.

18:08[om]:~/app/www/http_load>w
  6:09pm  up  1:44,  3 users,  load average: 44.92, 18.56, 7.08

18:08[om]:/usr/local/etc/httpd/conf>psg post
   76 ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
  602 ?  SW   0:00 (postmaster)
  634 ?  D    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  644 ?  SW   0:00 (postmaster)
  646 ?  SW   0:00 (postmaster)
  648 ?  SW   0:00 (postmaster)
  650 ?  SW   0:00 (postmaster)
  651 ?  SW   0:00 (postmaster)
  652 ?  SW   0:00 (postmaster)
  653 ?  SW   0:00 (postmaster)
  661 ?  SW   0:00 (postmaster)
  662 ?  D    0:00 (postmaster)
  663 ?  SW   0:00 (postmaster)
  664 ?  SW   0:00 (postmaster)
  665 ?  D    0:00 (postmaster)
  666 ?  D    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle

18:08[om]:/usr/local/etc/httpd/conf>psg post
   76 ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
  651 ?  SW   0:00 (postmaster)
  693 ?  SW   0:00 (postmaster)
  694 ?  SW   0:00 (postmaster)
  698 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  699 ?  SW   0:00 (postmaster)
  700 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  701 ?  SW   0:00 (postmaster)
  702 ?  SW   0:00 (postmaster)

18:31[om]:/usr/local/etc/httpd/conf>psg post
   76 ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
  651 ?  SW   0:00 (postmaster)
  693 ?  SW   0:00 (postmaster)
  694 ?  SW   0:00 (postmaster)
  698 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  699 ?  SW   0:00 (postmaster)
  700 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  701 ?  SW   0:00 (postmaster)
  702 ?  SW   0:00 (postmaster)
18:34[om]:/usr/local/etc/httpd/conf>

om:~$ gdb 702
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.16 (i486-slackware-linux), Copyright 1996 Free Software Foundation, Inc...

702: No such file or directory.

(gdb) q
om:~$ gdb /usr/local/pgsql/bin/postmaster 702
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.16 (i486-slackware-linux), Copyright 1996 Free Software Foundation, Inc...

/u/postgres/702: No such file or directory.
Attaching to program `/usr/local/pgsql/bin/postmaster', process 702
Reading symbols from /lib/libdl.so.1...done.
Reading symbols from /lib/libm.so.5...done.
Reading symbols from /lib/libtermcap.so.2...done.
Reading symbols from /lib/libc.so.5...done.
Reading symbols from /lib/ld-linux.so.1...done.
0x40081464 in recv (sockfd=0, buffer=0xbfffa394, len=3221224256, flags=135555648)
(gdb) bt
#0  0x40081464 in recv (sockfd=0, buffer=0xbfffa394, len=3221224256, flags=135555648)
#1  0x400a58e8 in __DTOR_END__ ()
#2  0x80a2585 in pq_getbytes ()
#3  0x80e1ed8 in SocketBackend ()
#4  0x80e1f66 in ReadCommand ()
#5  0x80e350c in PostgresMain ()
#6  0x80ccf2a in DoBackend ()
#7  0x80cca5b in BackendStartup ()
#8  0x80cc1d7 in ServerLoop ()
#9  0x80cbd63 in PostmasterMain ()
#10 0x80a3059 in main ()
#11 0x806121e in _start ()
(gdb)
Oleg Bartunov <oleg@sai.msu.su> writes:
>> I'm not sure what would block a new backend for many minutes before
>> it did that, however. Can you attach to one of these processes with
>> a debugger and get a backtrace to show what it's doing?

> Below some output from ps and attached backtrace of one postmaster
> process.

Hmm, that backend is quite obviously done with initialization and waiting for a client command. So why doesn't it show up as "postgres ... idle" in ps?

I wonder whether we have the ps-info-setting operation in the wrong place, ie at the bottom of the loop instead of the top, so that a backend that hasn't yet received its first client command will never have set the ps data. Will take a look.

regards, tom lane
> Oleg Bartunov <oleg@sai.msu.su> writes:
> > That's why I noticed 10 or more (postmaster) processes, which eventually
> > gone after 30-40 minutes.
>
> Could those be new backends that have been forked off the main
> postmaster, but haven't yet gotten around to changing their ps info?
> I'm not sure what would block a new backend for many minutes before
> it did that, however. Can you attach to one of these processes with
> a debugger and get a backtrace to show what it's doing?

I can't imagine what they would be waiting for, but it seems like a good guess.

--
  Bruce Momjian
On Wed, 5 May 1999, Bruce Momjian wrote:

> I can't imagine what they would be waiting for, but it seems like a good
> guess.

After more than 1 hour the postmaster processes are still running:

   76 ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
  651 ?  SW   0:00 (postmaster)
  693 ?  SW   0:00 (postmaster)
  694 ?  SW   0:00 (postmaster)
  698 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  699 ?  SW   0:00 (postmaster)
  700 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  701 ?  SW   0:00 (postmaster)
  702 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle

18:48[om]:/usr/local/etc/httpd/conf>psg post
   76 ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
  651 ?  SW   0:00 (postmaster)
  693 ?  SW   0:00 (postmaster)
  694 ?  SW   0:00 (postmaster)
  698 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  699 ?  SW   0:00 (postmaster)
  700 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  701 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  702 ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
19:30[om]:/usr/local/etc/httpd/conf>

It's interesting that the process with pid 701 migrates from (postmaster) to postgres with normal ps output! It seems postgres has a life of its own :-)

regards,
Oleg
Oleg Bartunov <oleg@sai.msu.su> writes:
> It's interesting, that process with pid 701 migrates from
> (postmaster) to postgres with normal ps output !

Yes, that's pretty strong evidence in favor of my theory (that these processes are just new backends that haven't received a command yet).

I'm surprised that it takes so long for your test clients to issue their first commands --- is the test load *that* high, or do you have a deliberate delay in there?

regards, tom lane
I wrote:
> Oleg Bartunov <oleg@sai.msu.su> writes:
>> It's interesting, that process with pid 701 migrates from
>> (postmaster) to postgres with normal ps output !

> Yes, that's pretty strong evidence in favor of my theory (that these
> processes are just new backends that haven't received a command yet).

Nope, that theory is all wet --- the backend definitely does PS_SET_STATUS("idle") before it waits for a query. Something is *really* peculiar here, since your backtrace shows that the backend has reached the point of waiting for client input. It is not possible to get there without having done PS_SET_STATUS. So why does the process still show up as "(postmaster)" in ps? Something is flaky about your system's support of ps status setting, I think.

regards, tom lane
On Wed, 5 May 1999, Tom Lane wrote:

> Nope, that theory is all wet --- the backend definitely does
> PS_SET_STATUS("idle") before it waits for a query. Something is
> *really* peculiar here, since your backtrace shows that the backend
> has reached the point of waiting for client input. It is not possible
> to get there without having done PS_SET_STATUS. So why does the process
> still show up as "(postmaster)" in ps? Something is flaky about your
> system's support of ps status setting, I think.

You never altered the task_struct, and so it's still 'postmaster' there. Note the W... the process is paged out, so the argv is not available!

Taral
On Wed, 5 May 1999, Taral wrote:

> You never altered the task_struct, and so it's still 'postmaster' there.
> Note the W... the process is paged out, so the argv is not available!

The system was under very high load; the peak load was about 69 (actually, it could have been higher, I just wasn't able to enter a command :-)
The client (http_load from http://www.acme.com) checks the checksum for every connection, so a command was definitely issued and the backend returned a result.

Oleg
> You never altered the task_struct, and so it's still 'postmaster' there.
> Note the W... the process is paged out, so the argv is not available!

Yes, I remember now. To do ps-args you need to read the process address space. If it is paged out, ps does not bring in the pages just to read the args. This is probably as expected. If someone wants to add a Linux-specific fix for this, I guess they could, though I am not sure it is worth it.

--
  Bruce Momjian
On Thu, 6 May 1999, Bruce Momjian wrote:

> Yes, I remember now. To do ps-args you need to read the process address
> space. If it is paged out, ps does not bring in the pages just to read
> the args. This is probably as expected. If someone wants to add a
> linux-specific fix for this, I guess you could, though I am not sure it
> is worth it.

How do you explain that the process with PID 701, which was shown in the ps output as (postmaster), after some time starts to look like a usual postgres process?

Oleg
On Thu, 6 May 1999, Oleg Bartunov wrote:

> How to explain that process with PID 701 which shown in ps output
> as (postmaster) after some time becomes looks as usual postgres

Because 'postmaster' is written in the kernel task_struct, whereas the task's argv[] says 'postgres'.

The only way to get around this is to do an execv(), at which point the kernel will recopy argv[0].

Taral
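The two sources of a process name that Taral describes can be looked at side by side on a Linux system with /proc. The Python sketch below is an illustration and not something from the thread: it assumes the layout of /proc/<pid>/stat (short kernel name in parentheses) and /proc/<pid>/cmdline (NUL-separated arguments read from the process's own memory) as found on current kernels, and details on the 2.0/2.2 kernels discussed here may differ. `parse_comm`, `parse_cmdline` and `describe` are hypothetical helper names.

```python
def parse_comm(stat_line):
    """Extract the kernel's short command name from a /proc/<pid>/stat line.

    The name sits between the first '(' and the last ')' and may itself
    contain spaces or parentheses.
    """
    start = stat_line.index("(") + 1
    end = stat_line.rindex(")")
    return stat_line[start:end]


def parse_cmdline(raw):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into arguments."""
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def describe(pid):
    """Read both names for a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        comm = parse_comm(f.read())
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        args = parse_cmdline(f.read())
    return comm, args


if __name__ == "__main__":
    import os
    print(describe(os.getpid()))
```

A backend that rewrites its argv area changes only what the second read returns; the kernel-side name stays as it was at the last exec, which would match the behaviour reported for PID 701 once its pages were resident again.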
> Because 'postmaster' is written in the kernel task_struct, whereas the
> task's argv[] says 'postgres'.
>
> The only way to get around this is to do an execv(), at which point the
> kernel will recopy argv[0].

We used to do execv(), but stopped doing it for performance reasons.

--
  Bruce Momjian
Bruce Momjian <maillist@candle.pha.pa.us> writes:
>> The only way to get around this is to do an execv(), at which point the
>> kernel will recopy argv[0].

> We used to do execv(), but stopped doing it for performance reasons.

It's clearly not worth re-introducing the exec call just to make ps status display work (especially since it's only failing when the backend is swapped out). However, I wonder whether there is another answer.

Something that's been on my to-do list since the ps-status-display code got added is to import "sendmail"'s ps-status-display module lock, stock, and barrel. Sendmail's code is kinda ugly, but it's been wrung out and works on a wide variety of Unixes. The code we have doesn't ... (it's never worked on my HPUX box, for instance, whereas sendmail does).

I have no idea at the moment whether sendmail knows how to change the task_struct on Linux; but it might.

regards, tom lane
On Thu, 6 May 1999, Tom Lane wrote:

> Something that's been on my to-do list since the ps-status-display code
> got added is to import "sendmail"'s ps-status-display module lock, stock,
> and barrel. Sendmail's code is kinda ugly, but it's been wrung out and
> works on a wide variety of Unixes. The code we have doesn't ... (it's
> never worked on my HPUX box, for instance, whereas sendmail does).

The wu-ftpd authors once reworked that code. When I thought of stealing it, I found the sources in the wu-ftpd distribution a little cleaner.

Oleg.
----
Oleg Broytmann http://members.xoom.com/phd2/ phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.