Thread: posmaster failed under high load

posmaster failed under high load

From: Oleg Bartunov
Date:
Today I ran http_load to benchmark my Web-DB application
and found that under high load (about 18 postgres processes were running) the postmaster failed.
Restarting the postmaster produces this error:

IpcMemoryCreate: shmget failed (Identifier removed) key=5432010, size=24588, permission=700
IpcMemoryIdGet: shmget failed (Identifier removed) key=5432010, size=24588, permission=0
IpcMemoryAttach: shmat failed (Invalid argument) id=-2
FATAL 1:  AttachSLockMemory: could not attach segment

I checked shared memory:
23:27[zeus]:~>ipcs -a

------ Shared Memory Segments --------
shmid     owner     perms     bytes     nattch    status      
10496     postgres  700       24588     5         dest        
10497     postgres  600       8852184   5         dest        
10498     postgres  600       96804     5         dest        

------ Semaphore Arrays --------
semid     owner     perms     nsems     status      

------ Message Queues --------
msqid     owner     perms     used-bytes  messages    

Does this error mean I need to increase shared memory?
My setup: Linux 2.0.36, Dual PPRO, 256 MB RAM
Postgres 6.4.2
Regards,
    Oleg


_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: [HACKERS] posmaster failed under high load

From: Tatsuo Ishii
Date:
> Today I run http_load to do some benchmark of my Web-DB application
> and found that under high load (it was about 18 postgres running ) postsmaster failed.
> Restarting postmaster produces error:
> 
> IpcMemoryCreate: shmget failed (Identifier removed) key=5432010, size=24588, permission=700
> IpcMemoryIdGet: shmget failed (Identifier removed) key=5432010, size=24588, permission=0
> IpcMemoryAttach: shmat failed (Invalid argument) id=-2
> FATAL 1:  AttachSLockMemory: could not attach segment
> 
> I checked shared memory:
> 23:27[zeus]:~>ipcs -a
> 
> ------ Shared Memory Segments --------
> shmid     owner     perms     bytes     nattch    status      
> 10496     postgres  700       24588     5         dest        
> 10497     postgres  600       8852184   5         dest        
> 10498     postgres  600       96804     5         dest        
> 
> ------ Semaphore Arrays --------
> semid     owner     perms     nsems     status      
> 
> ------ Message Queues --------
> msqid     owner     perms     used-bytes  messages    
> 
> Does this error means I need to increase shared memory ?
> My setup: Linux 2.0.36, Dual PPRO, 256 Mb RAM
> Postgres 6.4.2

I don't think so, unless you increased the shared buffer size with the -B
option. Stock 6.4.2 is very buggy in its shared memory
usage, and that is probably the cause. Try Tom Lane's fix or 6.5b. I have
tested 6.5b with 128 backends running and it seems very stable.

Another possibility is you don't have enough file descriptors.
What do you get by:
$ cat /proc/sys/kernel/file-max
1024 or so is not enough. You could increase it by:
echo 4096 > /proc/sys/kernel/file-max
Reducing the number of file descriptors each backend uses is also a good idea.
Try:
ulimit -n 20
before starting postmaster.
---
Tatsuo Ishii


Re: [HACKERS] posmaster failed under high load

From: Oleg Bartunov
Date:
On Tue, 4 May 1999, Tatsuo Ishii wrote:

> Date: Tue, 04 May 1999 14:05:33 +0900
> From: Tatsuo Ishii <t-ishii@sra.co.jp>
> To: Oleg Bartunov <oleg@sai.msu.su>
> Cc: hackers@postgreSQL.org
> Subject: Re: [HACKERS] posmaster failed under high load 
> 
> > Today I run http_load to do some benchmark of my Web-DB application
> > and found that under high load (it was about 18 postgres running ) postsmaster failed.
> > Restarting postmaster produces error:
> > 
> > IpcMemoryCreate: shmget failed (Identifier removed) key=5432010, size=24588, permission=700
> > IpcMemoryIdGet: shmget failed (Identifier removed) key=5432010, size=24588, permission=0
> > IpcMemoryAttach: shmat failed (Invalid argument) id=-2
> > FATAL 1:  AttachSLockMemory: could not attach segment
> > 
> > I checked shared memory:
> > 23:27[zeus]:~>ipcs -a
> > 
> > ------ Shared Memory Segments --------
> > shmid     owner     perms     bytes     nattch    status      
> > 10496     postgres  700       24588     5         dest        
> > 10497     postgres  600       8852184   5         dest        
> > 10498     postgres  600       96804     5         dest        
> > 
> > ------ Semaphore Arrays --------
> > semid     owner     perms     nsems     status      
> > 
> > ------ Message Queues --------
> > msqid     owner     perms     used-bytes  messages    
> > 
> > Does this error means I need to increase shared memory ?
> > My setup: Linux 2.0.36, Dual PPRO, 256 Mb RAM
> > Postgres 6.4.2
> 
> I don't think so unless you increased the shared buffer size using -B
> option. Stock 6.4.2 is very buggy with the shared memory
> usage. Probably it's the cause. Try Tom Lane's fix or 6.5b. I have
> tested 6.5b with 128 backends running and it seems very stable.

Yes, I used 6.4.2 + the LIMIT patch. I'll try 6.5 from CVS.
I run the postmaster with the -B 1024 option. Is this too much?

> 
> Another possibility is you don't have enough file descriptors.
> What do you get by:
> $ cat /proc/sys/kernel/file-max
> 1024 or so is not enough. You could increase it by:
> echo 4096 > /proc/sys/kernel/file-max
> Decreasing the usage of file descriptors per backend is also a good idea.
> try:
> ulimit -n 20
> before starting postmaster.

Thanks a lot. I have run into problems with file descriptors several times;
it looks like every backend opens about 90 files. I'll try your
hints. Why not add your experience of running postgres under high
load to the Linux-specific FAQ?
Regards,    Oleg
> ---
> Tatsuo Ishii
> 




Re: [HACKERS] posmaster failed under high load

From: Thomas Lockhart
Date:
> Thanks a lot, I got several times a problem with file descriptors,
> it looks like every backend opens abot 90 files. I'll try your
> hints. Why not add your experience how to work with postgres under 
> high load to Linux specific FAQ ?

Good idea. Do you have time to add this topic yourself? Actually, the
general problem is common to all platforms, so we might want to write
up something for the Admin Guide too.
                       - Tom

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] posmaster failed under high load

From: Tatsuo Ishii
Date:
> > I don't think so unless you increased the shared buffer size using -B
> > option. Stock 6.4.2 is very buggy with the shared memory
> > usage. Probably it's the cause. Try Tom Lane's fix or 6.5b. I have
> > tested 6.5b with 128 backends running and it seems very stable.
> 
> Yes, I used 6.4.2 + LIMIT patch, I'll try 6.5 from cvs

You need Tom Lane's shared memory fix patch if you use 6.4.2. 6.5 has the
fix.

> I run postmaster with -B 1024 option - is this too much ?

No. -B 1024 means 8MB of shared memory, which should be OK on x86/Linux (I
think most x86-based Linux systems allow 32MB of shared memory).
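
Editorial sketch of the arithmetic behind the 8MB figure: each shared buffer holds one disk block, so the buffer pool size is the -B value times the block size. The 8192-byte block size is an assumption here (it is PostgreSQL's default `BLCKSZ`), and the helper names are hypothetical.

```python
BLOCK_SIZE = 8192  # assumed: PostgreSQL's default block size (BLCKSZ) in bytes


def shared_buffer_bytes(n_buffers, block_size=BLOCK_SIZE):
    """Bytes taken by the buffer pool itself for a given -B value."""
    return n_buffers * block_size


def max_buffers(limit_bytes, block_size=BLOCK_SIZE):
    """Upper bound on -B under a shared memory limit, ignoring all other
    shared structures (so the practical limit is somewhat lower)."""
    return limit_bytes // block_size


if __name__ == "__main__":
    total = shared_buffer_bytes(1024)
    print(total, "bytes =", total // (1024 * 1024), "MB")
    print("max -B under a 32MB limit:", max_buffers(32 * 1024 * 1024))
```

The 8852184-byte segment in the `ipcs` listing earlier in the thread is a little larger than 1024 × 8192 = 8388608 bytes; presumably the difference is other shared data kept in the same segment, though the thread does not say so.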

> Thanks a lot, I got several times a problem with file descriptors,
> it looks like every backend opens abot 90 files. I'll try your 
> hints.

But be careful: a lower number of file descriptors per backend might
lower performance because of the file-opening overhead. So you should
increase the number of file table entries in the system first.
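
Editorial sketch of why the system-wide file table matters: the worst-case demand is roughly the number of backends times the files each one keeps open. The figures used below come from this thread (about 18 backends, about 90 files per backend, limits of 1024 and 4096); the function names are hypothetical, and the estimate ignores every other process on the machine, such as the web server.

```python
def fd_demand(backends, fds_per_backend):
    """Rough worst-case number of open files across all backends."""
    return backends * fds_per_backend


def fits(backends, fds_per_backend, file_max):
    """True if the backends alone stay within the system-wide file table."""
    return fd_demand(backends, fds_per_backend) <= file_max


if __name__ == "__main__":
    print(fd_demand(18, 90))      # 18 backends, ~90 files each
    print(fits(18, 90, 1024))     # against a 1024-entry file table
    print(fits(18, 90, 4096))     # after raising file-max to 4096
    print(fd_demand(18, 20))      # with "ulimit -n 20" per backend
```

With 18 backends at 90 files each the estimate is 1620 open files, which already exceeds a 1024-entry table but fits within 4096.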

>Why not add your experience how to work with postgres under high
> load to Linux specific FAQ ?

I'm not good at English, that is the reason:-)

BTW, a FreeBSD box has more serious problems than Linux, since the
default kernel has a lower limit on file descriptors (~700). This should
be noted somewhere in the docs too.
---
Tatsuo Ishii


Re: [HACKERS] posmaster failed under high load

From: Oleg Bartunov
Date:
Interesting, I just tried to load my home machine and got a very
weird result:

   76  ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
  218  ?  SW   0:02 (postmaster)
  219  ?  SW   0:02 (postmaster)
  220  ?  SW   0:01 (postmaster)
  222  ?  SW   0:02 (postmaster)
  241  ?  SW   0:02 (postmaster)
  242  ?  SW   0:02 (postmaster)
  252  ?  SW   0:01 (postmaster)
  263  ?  SW   0:01 (postmaster)
  372  ?  SW   0:00 (postmaster)
  377  ?  SW   0:00 (postmaster)
  378  ?  SW   0:00 (postmaster)
  379  ?  SW   0:00 (postmaster)
  383  ?  SW   0:00 (postmaster)
  387  ?  SW   0:00 (postmaster)
  388  ?  SW   0:00 (postmaster)
  406  ?  SW   0:00 (postmaster)

The system is still in working condition and psql can connect!
The postmasters seem to die off with time, but after 15 minutes I still see
7 postmasters.
This is my scenario and setup:
P166, 64 MB RAM,
Linux 2.2.7, PostgreSQL 6.4.2 on i586-pc-linux-gnu, compiled by gcc egcs-2.91.5
(This is stock 6.4.2 + the LIMIT feature patch. Tatsuo suggests using
Tom's shared memory patches, but I didn't find them.)

I have Apache 1.3.6 + mod_perl 1.19, Apache::DBI to open persistent
connections to the database, and Mason (http://www.masonhq.com) to
produce dynamic documents, which I retrieve using
http_load from http://www.acme.com/jef/. I find it quite
useful for testing because it doesn't bog down the client machine
(in my case client and server are on the same machine).
So, I run
./http_load -verbose -checksum -rate 25 -fetches 40 TEST,
where TEST is a file with the URL of the document.
In case it is of interest, here are the benchmark results:
40 fetches, 117 max parallel, 403192 bytes, in 9 seconds
10079.8 mean bytes/connection
4.44444 fetches/sec, 44799.1 bytes/sec
38 bad checksums
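
Editorial sketch: the derived figures in this report follow directly from the three raw numbers (fetches, bytes, seconds). The snippet recomputes them, assuming the elapsed time is exactly the 9 seconds printed; `summarize` is a hypothetical helper, not part of http_load.

```python
def summarize(fetches, total_bytes, seconds):
    """Recompute the per-connection and per-second figures of a run."""
    return {
        "mean_bytes_per_connection": total_bytes / fetches,
        "fetches_per_sec": fetches / seconds,
        "bytes_per_sec": total_bytes / seconds,
    }


if __name__ == "__main__":
    stats = summarize(fetches=40, total_bytes=403192, seconds=9)
    for name, value in stats.items():
        print(f"{name}: {value:.5f}")
```

Note that 38 of the 40 fetches reported bad checksums, so the throughput figures say little about how many responses were actually correct.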

I'm going to test 6.5 cvs.
Regards,
    Oleg


On Wed, 5 May 1999, Tatsuo Ishii wrote:

> Date: Wed, 05 May 1999 00:02:44 +0900
> From: Tatsuo Ishii <t-ishii@sra.co.jp>
> To: Oleg Bartunov <oleg@sai.msu.su>
> Cc: Tatsuo Ishii <t-ishii@sra.co.jp>, hackers@postgreSQL.org
> Subject: Re: [HACKERS] posmaster failed under high load 
> 
> > > I don't think so unless you increased the shared buffer size using -B
> > > option. Stock 6.4.2 is very buggy with the shared memory
> > > usage. Probably it's the cause. Try Tom Lane's fix or 6.5b. I have
> > > tested 6.5b with 128 backends running and it seems very stable.
> > 
> > Yes, I used 6.4.2 + LIMIT patch, I'll try 6.5 from cvs
> 
> You need Tom Lane's share mem fix patch if you use 6.4.2. 6.5 has the
> fix.
> 
> > I run postmaster with -B 1024 option - is this too much ?
> 
> No. -B 1024 means 8MB shared mem that should be ok on x86/Linux (I
> think most x86 based Linux allow 32MB shared mem).
> 
> > Thanks a lot, I got several times a problem with file descriptors,
> > it looks like every backend opens abot 90 files. I'll try your 
> > hints.
> 
> But be carefull lower # of file descriptors per backend might cause
> lower performance because of the file opening overhead. So you should
> increase the file table entries in the system first.
> 
> >Why not add your experience how to work with postgres under high
> > load to Linux specific FAQ ?
> 
> I'm not good at English, that is the reason:-)
> 
> BTW, FreeBSD box has more serious problems than Linux since the
> default kernel has lower limit of file descriptors (~700). This should
> be noted somewhere in the docs too.
> ---
> Tatsuo Ishii
> 




Re: [HACKERS] posmaster failed under high load

From: Oleg Bartunov
Date:
Well,
I just ran the test with 6.5 from CVS and it looks much more stable.
I ran ./http_load -rate 20 -verbose -fetches 80 TEST
(note that this test is much heavier than in my previous post) and got these results:
81 fetches, 393 max parallel, 809028 bytes, in 24 seconds
9988 mean bytes/connection
3.375 fetches/sec, 33709.5 bytes/sec

My machine was very heavily loaded during this test. I saw a peak
load of about 65 and a lot of swapping, but the test completed and the system
remained usable after 20 minutes of swapping. I still saw many
postmaster (not postgres) processes running, but after about
30-40 minutes they were gone. Actually, pstree -a now shows

 |-postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
  |  |-(postmaster)
  |  `-postmaster

Is this normal behaviour for the postmaster?
I thought there must be only one postmaster, which forks a postgres
process for every connection. Anyway, the system is usable and the
postmaster survives and continues working! In this respect 6.5 is much
more stable. I ran the postmaster with the -B 1024 option. I ran this test under
Linux 2.2.7, so tomorrow I'll test on my production server, which
runs Linux 2.0.36, SMP, Dual PPRO, 256 MB RAM. As I wrote, 6.4.2 fails
under high load, so I'll test 6.5 from CVS to find out which is critical:
the kernel or the postgres version.
Regards,
    Oleg





Re: [HACKERS] posmaster failed under high load

From: Bruce Momjian
Date:
> My machine was very-very load during this test - I saw peak
> load about 65, a lot of swapping but test completes and system
> after 20 minutes of swapping remains usable. I still saw many
> postmasters (not postgres) processes running but after about 
> 30-40 minutes they gone. Actually pstree -a now shows
>  
>  |-postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
>   |  |-(postmaster)
>   |  `-postmaster 

ps should show our process listing display change.  They are postgres
processes, but without the exec() call we used to do, it shows this way
only on OS's that don't support ps arg display changes from inside the
process.



-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] posmaster failed under high load

From: Oleg Bartunov
Date:
On Tue, 4 May 1999, Bruce Momjian wrote:

> Date: Tue, 4 May 1999 21:35:56 -0400 (EDT)
> From: Bruce Momjian <maillist@candle.pha.pa.us>
> To: Oleg Bartunov <oleg@sai.msu.su>
> Cc: Tatsuo Ishii <t-ishii@sra.co.jp>, hackers@postgreSQL.org
> Subject: Re: [HACKERS] posmaster failed under high load
> 
> > My machine was very-very load during this test - I saw peak
> > load about 65, a lot of swapping but test completes and system
> > after 20 minutes of swapping remains usable. I still saw many
> > postmasters (not postgres) processes running but after about 
> > 30-40 minutes they gone. Actually pstree -a now shows
> >  
> >  |-postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
> >   |  |-(postmaster)
> >   |  `-postmaster 
> 
> ps should show our process listing display change.  They are postgres
> processes, but without the exec() call we used to do, it shows this way
> only on OS's that don't support ps arg display changes from inside the
> process.

No, it does on Linux.

 5159  ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
 5168  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
 5169  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
 5170  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
 5171  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle

That's why I noticed 10 or more (postmaster) processes, which eventually
went away after 30-40 minutes.
Oleg

> 
> 
> 
> -- 
>   Bruce Momjian                        |  http://www.op.net/~candle
>   maillist@candle.pha.pa.us            |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> 




Re: [HACKERS] posmaster failed under high load

From: Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
> That's why I noticed 10 or more (postmaster) processes, which eventually
> gone after 30-40 minutes.

Could those be new backends that have been forked off the main
postmaster, but haven't yet gotten around to changing their ps info?
I'm not sure what would block a new backend for many minutes before
it did that, however.  Can you attach to one of these processes with
a debugger and get a backtrace to show what it's doing?
        regards, tom lane


Re: [HACKERS] posmaster failed under high load

From: Oleg Bartunov
Date:
On Wed, 5 May 1999, Tom Lane wrote:

> Date: Wed, 05 May 1999 09:33:14 -0400
> From: Tom Lane <tgl@sss.pgh.pa.us>
> To: Oleg Bartunov <oleg@sai.msu.su>
> Cc: hackers@postgreSQL.org
> Subject: Re: [HACKERS] posmaster failed under high load 
> 
> Oleg Bartunov <oleg@sai.msu.su> writes:
> > That's why I noticed 10 or more (postmaster) processes, which eventually
> > gone after 30-40 minutes.
> 
> Could those be new backends that have been forked off the main
> postmaster, but haven't yet gotten around to changing their ps info?
> I'm not sure what would block a new backend for many minutes before
> it did that, however.  Can you attach to one of these processes with
> a debugger and get a backtrace to show what it's doing?

Well,

http_load -r 40 -f 240 MASON-DBI
results:
244 fetches, 1020 max parallel, 272060 bytes, in 52 seconds
1115 mean bytes/connection
4.69231 fetches/sec, 5231.92 bytes/sec

Below is some output from ps and a backtrace of one postmaster
process.
Regards,
    Oleg

PS.
I will see what happens with those (postmaster) processes.

18:08[om]:~/app/www/http_load>w
  6:09pm  up  1:44,  3 users,  load average: 44.92, 18.56, 7.08

18:08[om]:/usr/local/etc/httpd/conf>psg post
   76  ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
  602  ?  SW   0:00 (postmaster)
  634  ?  D    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  644  ?  SW   0:00 (postmaster)
  646  ?  SW   0:00 (postmaster)
  648  ?  SW   0:00 (postmaster)
  650  ?  SW   0:00 (postmaster)
  651  ?  SW   0:00 (postmaster)
  652  ?  SW   0:00 (postmaster)
  653  ?  SW   0:00 (postmaster)
  661  ?  SW   0:00 (postmaster)
  662  ?  D    0:00 (postmaster)
  663  ?  SW   0:00 (postmaster)
  664  ?  SW   0:00 (postmaster)
  665  ?  D    0:00 (postmaster)
  666  ?  D    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle

18:08[om]:/usr/local/etc/httpd/conf>psg post
   76  ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
  651  ?  SW   0:00 (postmaster)
  693  ?  SW   0:00 (postmaster)
  694  ?  SW   0:00 (postmaster)
  698  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  699  ?  SW   0:00 (postmaster)
  700  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  701  ?  SW   0:00 (postmaster)
  702  ?  SW   0:00 (postmaster)

18:31[om]:/usr/local/etc/httpd/conf>psg post
   76  ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
  651  ?  SW   0:00 (postmaster)
  693  ?  SW   0:00 (postmaster)
  694  ?  SW   0:00 (postmaster)
  698  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  699  ?  SW   0:00 (postmaster)
  700  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  701  ?  SW   0:00 (postmaster)
  702  ?  SW   0:00 (postmaster)

18:34[om]:/usr/local/etc/httpd/conf>

om:~$ gdb 702
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.16 (i486-slackware-linux), 
Copyright 1996 Free Software Foundation, Inc...

702: No such file or directory.

(gdb) q
om:~$ gdb /usr/local/pgsql/bin/postmaster 702
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.16 (i486-slackware-linux), 
Copyright 1996 Free Software Foundation, Inc...

/u/postgres/702: No such file or directory.
Attaching to program `/usr/local/pgsql/bin/postmaster', process 702
Reading symbols from /lib/libdl.so.1...done.
Reading symbols from /lib/libm.so.5...done.
Reading symbols from /lib/libtermcap.so.2...done.
Reading symbols from /lib/libc.so.5...done.
Reading symbols from /lib/ld-linux.so.1...done.
0x40081464 in recv (sockfd=0, buffer=0xbfffa394, len=3221224256,    flags=135555648)
(gdb) bt
#0  0x40081464 in recv (sockfd=0, buffer=0xbfffa394, len=3221224256,    flags=135555648)
#1  0x400a58e8 in __DTOR_END__ ()
#2  0x80a2585 in pq_getbytes ()
#3  0x80e1ed8 in SocketBackend ()
#4  0x80e1f66 in ReadCommand ()
#5  0x80e350c in PostgresMain ()
#6  0x80ccf2a in DoBackend ()
#7  0x80cca5b in BackendStartup ()
#8  0x80cc1d7 in ServerLoop ()
#9  0x80cbd63 in PostmasterMain ()
#10 0x80a3059 in main ()
#11 0x806121e in _start ()
(gdb) 


> 
>             regards, tom lane
> 




Re: [HACKERS] posmaster failed under high load

From: Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
>> I'm not sure what would block a new backend for many minutes before
>> it did that, however.  Can you attach to one of these processes with
>> a debugger and get a backtrace to show what it's doing?

> Below some output from ps and attached backtrace of one postmaster
> process.

Hmm, that backend is quite obviously done with initialization and
waiting for a client command.  So why doesn't it show up as
"postgres ... idle" in ps?

I wonder whether we have the ps-info-setting operation in the wrong
place, ie at the bottom of the loop instead of the top, so that a
backend that hasn't yet received its first client command will never
have set the ps data.  Will take a look.
        regards, tom lane


Re: [HACKERS] posmaster failed under high load

From: Bruce Momjian
Date:
> Oleg Bartunov <oleg@sai.msu.su> writes:
> > That's why I noticed 10 or more (postmaster) processes, which eventually
> > gone after 30-40 minutes.
> 
> Could those be new backends that have been forked off the main
> postmaster, but haven't yet gotten around to changing their ps info?
> I'm not sure what would block a new backend for many minutes before
> it did that, however.  Can you attach to one of these processes with
> a debugger and get a backtrace to show what it's doing?

I can't imagine what they would be waiting for, but it seems like a good
guess.




Re: [HACKERS] posmaster failed under high load

From: Oleg Bartunov
Date:
On Wed, 5 May 1999, Bruce Momjian wrote:

> Date: Wed, 5 May 1999 11:17:27 -0400 (EDT)
> From: Bruce Momjian <maillist@candle.pha.pa.us>
> To: Tom Lane <tgl@sss.pgh.pa.us>
> Cc: Oleg Bartunov <oleg@sai.msu.su>, hackers@postgreSQL.org
> Subject: Re: [HACKERS] posmaster failed under high load
> 
> > Oleg Bartunov <oleg@sai.msu.su> writes:
> > > That's why I noticed 10 or more (postmaster) processes, which eventually
> > > gone after 30-40 minutes.
> > 
> > Could those be new backends that have been forked off the main
> > postmaster, but haven't yet gotten around to changing their ps info?
> > I'm not sure what would block a new backend for many minutes before
> > it did that, however.  Can you attach to one of these processes with
> > a debugger and get a backtrace to show what it's doing?
> 
> I can't imagine what they would be waiting for, but it seems like a good
> guess.
> 

After more than 1 hour postmaster processes still running

   76  ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
  651  ?  SW   0:00 (postmaster)
  693  ?  SW   0:00 (postmaster)
  694  ?  SW   0:00 (postmaster)
  698  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  699  ?  SW   0:00 (postmaster)
  700  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  701  ?  SW   0:00 (postmaster)
  702  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle

18:48[om]:/usr/local/etc/httpd/conf>psg post
   76  ?  S    0:00 postmaster -i -B 1024 -S -D/usr/local/pgsql/data/ -o -Fe
  651  ?  SW   0:00 (postmaster)
  693  ?  SW   0:00 (postmaster)
  694  ?  SW   0:00 (postmaster)
  698  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  699  ?  SW   0:00 (postmaster)
  700  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  701  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle
  702  ?  S    0:00 /usr/local/pgsql/bin/postgres localhost httpd apod idle

19:30[om]:/usr/local/etc/httpd/conf>

It's interesting that the process with pid 701 migrates from
(postmaster) to postgres with normal ps output!
It seems postgres lives a life of its own :-)

regards,    Oleg

> 
> -- 
>   Bruce Momjian                        |  http://www.op.net/~candle
>   maillist@candle.pha.pa.us            |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> 




Re: [HACKERS] posmaster failed under high load

From: Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
> It's interesting, that process with pid 701 migrates from 
> (postmaster) to postgres with normal ps output !

Yes, that's pretty strong evidence in favor of my theory (that these
processes are just new backends that haven't received a command yet).

I'm surprised that it takes so long for your test clients to issue their
first commands --- is the test load *that* high, or do you have a
deliberate delay in there?
        regards, tom lane


Re: [HACKERS] posmaster failed under high load

From: Tom Lane
Date:
I wrote:
> Oleg Bartunov <oleg@sai.msu.su> writes:
>> It's interesting, that process with pid 701 migrates from 
>> (postmaster) to postgres with normal ps output !

> Yes, that's pretty strong evidence in favor of my theory (that these
> processes are just new backends that haven't received a command yet).

Nope, that theory is all wet --- the backend definitely does 
PS_SET_STATUS("idle") before it waits for a query.  Something is
*really* peculiar here, since your backtrace shows that the backend
has reached the point of waiting for client input.  It is not possible
to get there without having done PS_SET_STATUS.  So why does the process
still show up as "(postmaster)" in ps?  Something is flaky about your
system's support of ps status setting, I think.
        regards, tom lane


Re: [HACKERS] posmaster failed under high load

From: Taral
Date:
On Wed, 5 May 1999, Tom Lane wrote:

> Nope, that theory is all wet --- the backend definitely does 
> PS_SET_STATUS("idle") before it waits for a query.  Something is
> *really* peculiar here, since your backtrace shows that the backend
> has reached the point of waiting for client input.  It is not possible
> to get there without having done PS_SET_STATUS.  So why does the process
> still show up as "(postmaster)" in ps?  Something is flaky about your
> system's support of ps status setting, I think.

You never altered the task_struct, and so it's still 'postmaster' there.
Note the W... the process is paged out, so the argv is not available!

Taral



Re: [HACKERS] posmaster failed under high load

From: Oleg Bartunov
Date:
On Wed, 5 May 1999, Taral wrote:

> Date: Wed, 5 May 1999 20:50:17 -0500 (CDT)
> From: Taral <taral@taral.net>
> To: Tom Lane <tgl@sss.pgh.pa.us>
> Cc: Oleg Bartunov <oleg@sai.msu.su>, hackers@postgreSQL.org
> Subject: Re: [HACKERS] posmaster failed under high load 
> 
> On Wed, 5 May 1999, Tom Lane wrote:
> 
> > Nope, that theory is all wet --- the backend definitely does 
> > PS_SET_STATUS("idle") before it waits for a query.  Something is
> > *really* peculiar here, since your backtrace shows that the backend
> > has reached the point of waiting for client input.  It is not possible
> > to get there without having done PS_SET_STATUS.  So why does the process
> > still show up as "(postmaster)" in ps?  Something is flaky about your
> > system's support of ps status setting, I think.
> 
> You never altered the task_struct, and so it's still 'postmaster' there.
> Note the W... the process is paged out, so the argv is not available!


The system was under very high load; the peak load was about 69
(actually, it could have been higher, I just wasn't able to enter a command :-)
The client (http_load from http://www.acme.com) tests the checksum for every
connection, so a command was definitely issued and the backend returned a result.
Oleg

> 
> Taral
> 
> 




Re: [HACKERS] posmaster failed under high load

From: Bruce Momjian
Date:
> On Wed, 5 May 1999, Tom Lane wrote:
> 
> > Nope, that theory is all wet --- the backend definitely does 
> > PS_SET_STATUS("idle") before it waits for a query.  Something is
> > *really* peculiar here, since your backtrace shows that the backend
> > has reached the point of waiting for client input.  It is not possible
> > to get there without having done PS_SET_STATUS.  So why does the process
> > still show up as "(postmaster)" in ps?  Something is flaky about your
> > system's support of ps status setting, I think.
> 
> You never altered the task_struct, and so it's still 'postmaster' there.
> Note the W... the process is paged out, so the argv is not available!

Yes, I remember now.  To do ps-args you need to read the process address
space.  If it is paged out, ps does not bring in the pages just to read
the args.  This is probably as expected.  If someone wants to add a
linux-specific fix for this, I guess you could, though I am not sure it
is worth it.
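
The fallback described above can be sketched roughly as follows. This is not the actual ps source: the function names `ps_display_name` and `read_proc_names` are hypothetical, and the sketch assumes a Linux procfs in which `/proc/<pid>/stat` holds the kernel's own copy of the process name (in parentheses) and `/proc/<pid>/cmdline` holds argv as read from the process's memory. When argv cannot be read, the display falls back to the kernel name in parentheses, which would match the "(postmaster)" seen in this thread.

```python
import os


def ps_display_name(comm: str, cmdline: bytes) -> str:
    """Use argv when it is readable, else the kernel's name in parentheses."""
    # argv entries are separated by NUL bytes; drop the empty trailing piece.
    args = [a.decode(errors="replace") for a in cmdline.split(b"\0") if a]
    if args:
        return " ".join(args)
    return f"({comm})"


def read_proc_names(pid: int):
    """Return (comm, cmdline) from procfs, or None where /proc is unavailable."""
    try:
        with open(f"/proc/{pid}/stat", "rb") as f:
            stat = f.read()
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmdline = f.read()
    except OSError:
        return None
    # The name is the second field of stat, wrapped in parentheses; it may
    # contain spaces or ')' itself, so take the outermost pair.
    comm = stat[stat.index(b"(") + 1 : stat.rindex(b")")]
    return comm.decode(errors="replace"), cmdline


if __name__ == "__main__":
    names = read_proc_names(os.getpid())
    if names is not None:
        print(ps_display_name(*names))
    # An unreadable (empty) argv falls back to the kernel name.
    print(ps_display_name("postmaster", b""))  # prints "(postmaster)"
```

The two values are stored independently, so a backend that rewrote its argv but was then swapped out could still be listed under the name it inherited from the postmaster.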

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: [HACKERS] posmaster failed under high load

From
Oleg Bartunov
Date:
On Thu, 6 May 1999, Bruce Momjian wrote:

> Date: Thu, 6 May 1999 01:20:13 -0400 (EDT)
> From: Bruce Momjian <maillist@candle.pha.pa.us>
> To: Taral <taral@taral.net>
> Cc: Tom Lane <tgl@sss.pgh.pa.us>, Oleg Bartunov <oleg@sai.msu.su>,
>     hackers@postgreSQL.org
> Subject: Re: [HACKERS] posmaster failed under high load
> 
> > On Wed, 5 May 1999, Tom Lane wrote:
> > 
> > > Nope, that theory is all wet --- the backend definitely does 
> > > PS_SET_STATUS("idle") before it waits for a query.  Something is
> > > *really* peculiar here, since your backtrace shows that the backend
> > > has reached the point of waiting for client input.  It is not possible
> > > to get there without having done PS_SET_STATUS.  So why does the process
> > > still show up as "(postmaster)" in ps?  Something is flaky about your
> > > system's support of ps status setting, I think.
> > 
> > You never altered the task_struct, and so it's still 'postmaster' there.
> > Note the W... the process is paged out, so the argv is not available!
> 
> Yes, I remember now.  To do ps-args you need to read the process address
> space.  If it is paged out, ps does not bring in the pages just to read
> the args.  This is probably as expected.  If someone wants to add a
> linux-specific fix for this, I guess you could, though I am not sure it
> is worth it.
> 

How do you explain that the process with PID 701, which was shown in the ps
output as (postmaster), after some time looks like a usual postgres process?
Oleg

> -- 
>   Bruce Momjian                        |  http://www.op.net/~candle
>   maillist@candle.pha.pa.us            |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: [HACKERS] posmaster failed under high load

From
Taral
Date:
On Thu, 6 May 1999, Oleg Bartunov wrote:

> How do you explain that the process with PID 701, which was shown in the ps
> output as (postmaster), after some time looks like a usual postgres process?

Because 'postmaster' is written in the kernel task_struct, whereas the
task's argv[] says 'postgres'.

The only way to get around this is to do an execv(), at which point the
kernel will recopy argv[0].

Taral



Re: [HACKERS] posmaster failed under high load

From
Bruce Momjian
Date:
> On Thu, 6 May 1999, Oleg Bartunov wrote:
> 
> > How do you explain that the process with PID 701, which was shown in the ps
> > output as (postmaster), after some time looks like a usual postgres process?
> 
> Because 'postmaster' is written in the kernel task_struct, whereas the
> task's argv[] says 'postgres'.
> 
> The only way to get around this is to do an execv(), at which point the
> kernel will recopy argv[0].

We used to do execv(), but stopped doing it for performance reasons.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: [HACKERS] posmaster failed under high load

From
Tom Lane
Date:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
>> The only way to get around this is to do an execv(), at which point the
>> kernel will recopy argv[0].

> We used to do execv(), but stopped doing it for performance reasons.

It's clearly not worth re-introducing the exec call just to make ps
status display work (especially since it's only failing when the backend
is swapped out).  However, I wonder whether there is another answer.

Something that's been on my to-do list since the ps-status-display code
got added is to import "sendmail"'s ps-status-display module lock, stock,
and barrel.  Sendmail's code is kinda ugly, but it's been wrung out and
works on a wide variety of Unixes.  The code we have doesn't ... (it's
never worked on my HPUX box, for instance, whereas sendmail does).

I have no idea at the moment whether sendmail knows how to change the
task_struct on Linux; but it might.
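
For reference, Linux kernels released well after this thread (2.6.9 and later) let a process change the name stored in its task_struct without an exec, via prctl(PR_SET_NAME). The sketch below is only an illustration under that assumption, not sendmail's or PostgreSQL's code; the helper name `set_kernel_name` is hypothetical, and the constants are the values from <linux/prctl.h>. It sets the name and reads it back with PR_GET_NAME.

```python
import ctypes
import sys

PR_SET_NAME = 15  # from <linux/prctl.h>
PR_GET_NAME = 16


def set_kernel_name(name: str):
    """Set the calling thread's kernel-side name; return the stored name.

    Returns None on platforms other than Linux, where prctl does not exist.
    """
    if not sys.platform.startswith("linux"):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    # The kernel keeps at most 15 bytes plus a terminating NUL.
    buf = ctypes.create_string_buffer(name.encode()[:15], 16)
    if libc.prctl(PR_SET_NAME, buf, 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_NAME) failed")
    out = ctypes.create_string_buffer(16)
    if libc.prctl(PR_GET_NAME, out, 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_GET_NAME) failed")
    return out.value.decode(errors="replace")


if __name__ == "__main__":
    print(set_kernel_name("pg-demo"))
```

Because this name lives in kernel memory and not in the process's own pages, it stays readable even when the process is swapped out, which is the case that fails with argv rewriting.
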
        regards, tom lane


Re: [HACKERS] posmaster failed under high load

From
Oleg Broytmann
Date:
On Thu, 6 May 1999, Tom Lane wrote:
> Something that's been on my to-do list since the ps-status-display code
> got added is to import "sendmail"'s ps-status-display module lock, stock,
> and barrel.  Sendmail's code is kinda ugly, but it's been wrung out and
> works on a wide variety of Unixes.  The code we have doesn't ... (it's
> never worked on my HPUX box, for instance, whereas sendmail does).
  The code was touched by the wu-ftpd authors once. When I thought of stealing
the code, I found the sources in the wu-ftpd distribution a little cleaner.

>             regards, tom lane

Oleg.
----
    Oleg Broytmann     http://members.xoom.com/phd2/     phd2@earthling.net
           Programmers don't die, they just GOSUB without RETURN.