Discussion: Autovacuum seems to block database: WARNING worker took too long to start


Autovacuum seems to block database: WARNING worker took too long to start

From:
Pablo Delgado Díaz-Pache
Date:
Hi all,

I'm having a strange problem with postgres & autovacuum.
Everything is working fine until I start getting the following errors, and postgres stops working shortly after (it stops accepting connections):

2010-11-13 12:34:08.599 CET|1|||7104||4cde77b0.1bc0|2010-11-13 12:34:08 CET|1/44303|0|| LOG:  automatic vacuum of table "mrs.pg_catalog.pg_statistic": index scans: 1
        pages: 0 removed, 189 remain
        tuples: 132 removed, 4587 remain
        system usage: CPU 0.00s/0.00u sec elapsed 0.04 sec
2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-13 13:26:41.254 CET|4|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING:  worker took too long to start; cancelled
.........
.........
infinite errors like this ...
.........
.........

This situation obviously creates serious problems for us!! We have to restart the database to make it work again.

Some info about our system:

OS: Centos 5.5
Kernel: 2.6.18-194.26.1.el5
Postgres version: 8.4.5 (installation out-of-the-box using yum)
Server memory: 28GB

Autovacuum configuration
autovacuum = on
vacuum_cost_delay = 10
log_autovacuum_min_duration = 0
autovacuum_naptime = 60
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_cost_delay = 20ms

Shared memory value
shared_buffers = 917504
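
(For completeness, here is how to check the values the server is actually using; a quick query against pg_settings, assuming a working superuser connection:)

# psql -U postgres -c "SELECT name, setting, unit FROM pg_settings WHERE name LIKE 'autovacuum%' OR name IN ('shared_buffers', 'vacuum_cost_delay')"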

I've read that the problem might be autovacuum using all resources and leaving the system unusable (not accepting any more connections), but I haven't found a way to fix it.
Would upgrading to postgres 9.0 help?

Any help would be appreciated!

Thanks

Pablo

Re: Autovacuum seems to block database: WARNING worker took too long to start

From:
Tom Lane
Date:
Pablo Delgado Díaz-Pache <delgadop@gmail.com> writes:
> I'm having a strange problem with postgres & autovacuum
> Everything is working fine until I start getting the following errors ...
> and postgres stops working shortly after (it stops accepting connections)

> 2010-11-13 12:34:08.599 CET|1|||7104||4cde77b0.1bc0|2010-11-13 12:34:08
> CET|1/44303|0|| LOG:  automatic vacuum of table
> "mrs.pg_catalog.pg_statistic": index scans: 1
>         pages: 0 removed, 189 remain
>         tuples: 132 removed, 4587 remain
>         system usage: CPU 0.00s/0.00u sec elapsed 0.04 sec
> 2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> CET||0|| WARNING:  worker took too long to start; cancelled
> 2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> CET||0|| WARNING:  worker took too long to start; cancelled
> 2010-11-13 13:26:41.254 CET|4|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> CET||0|| WARNING:  worker took too long to start; cancelled

Hm.  The code comment above that warning says

             * The only problems that may cause this code to
             * fire are errors in the earlier sections of AutoVacWorkerMain,
             * before the worker removes the WorkerInfo from the
             * startingWorker pointer.

but it's hard to see what problem there could lead to an issue.  (In
particular, I discount the idea that AutovacuumLock could be stuck,
because we had to acquire it in order to issue this message.)

But it strikes me that the code comment is wrong in one significant way:
if the postmaster were failing to heed SIGUSR1 at all, you could reach
the timeout here, because the fork-failed signal wouldn't get sent.

Given that you say it also stops accepting connections, I'm thinking
this is a postmaster problem not an autovacuum problem.  But you've
not provided any information about that end of it.  Exactly what happens
when you try to make a connection?  Are there any entries at all in the
postmaster log?  What about the kernel log?  Are you sure that new
connections stop working *after* this happens, and not at exactly the
same time?
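
For instance, something along these lines would answer most of that; a
sketch, assuming a local server on the default port and the stock CentOS
log location:

# psql "host=127.0.0.1 port=5432 user=postgres connect_timeout=5" -c 'SELECT 1'
# tail -n 50 /var/lib/pgsql/data/pg_log/postgresql-*.log
# dmesg | tail -n 50
# grep -i 'out of memory' /var/log/messages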

> OS: Centos 5.5
> Kernel: 2.6.18-194.26.1.el5
> Postgres version: 8.4.5 (installation out-of-the-box using yum)

Given that this is a Linux system, I think that an OOM kill on the
postmaster is a not-insignificant possibility.  Or at least I would
think that if there weren't a PostmasterIsAlive check in the autovac
launcher loop.  It's real hard to see how you could get more than
one of these messages if the postmaster were gone entirely.

Could you try strace'ing the postmaster process to see what it's doing
when this is happening?
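
Something like this, assuming the stock data directory (the first line of
postmaster.pid holds the postmaster's PID):

# strace -tt -p $(head -1 /var/lib/pgsql/data/postmaster.pid)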

            regards, tom lane

Re: Autovacuum seems to block database: WARNING worker took too long to start

From:
Pablo Delgado Díaz-Pache
Date:
Hi Tom, thanks for your detailed responses. I really appreciate it.
My comments are inline below ...


> I'm having a strange problem with postgres & autovacuum
> Everything is working fine until I start getting the following errors ...
> and postgres stops working shortly after (it stops accepting connections)

> 2010-11-13 12:34:08.599 CET|1|||7104||4cde77b0.1bc0|2010-11-13 12:34:08
> CET|1/44303|0|| LOG:  automatic vacuum of table
> "mrs.pg_catalog.pg_statistic": index scans: 1
>         pages: 0 removed, 189 remain
>         tuples: 132 removed, 4587 remain
>         system usage: CPU 0.00s/0.00u sec elapsed 0.04 sec
> 2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> CET||0|| WARNING:  worker took too long to start; cancelled
> 2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> CET||0|| WARNING:  worker took too long to start; cancelled
> 2010-11-13 13:26:41.254 CET|4|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> CET||0|| WARNING:  worker took too long to start; cancelled

Hm.  The code comment above that warning says

            * The only problems that may cause this code to
            * fire are errors in the earlier sections of AutoVacWorkerMain,
            * before the worker removes the WorkerInfo from the
            * startingWorker pointer.

but it's hard to see what problem there could lead to an issue.  (In
particular, I discount the idea that AutovacuumLock could be stuck,
because we had to acquire it in order to issue this message.)

But it strikes me that the code comment is wrong in one significant way:
if the postmaster were failing to heed SIGUSR1 at all, you could reach
the timeout here, because the fork-failed signal wouldn't get sent.

Given that you say it also stops accepting connections, I'm thinking
this is a postmaster problem not an autovacuum problem.  But you've
not provided any information about that end of it.  Exactly what happens
when you try to make a connection?  Are there any entries at all in the
postmaster log?  What about the kernel log?  Are you sure that new
connections stop working *after* this happens, and not at exactly the
same time?

When I try to make a connection from the command line I get nothing. I mean, if I execute ...

# psql -U postgres -l

I have to Ctrl-C or Ctrl-Z (followed by kill %1) because otherwise it waits forever.
It's as if no more connections were being accepted. Obviously our applications stop working immediately.
Trying to stop postgres with

# /etc/init.d/postgres stop
Stopping postgresql service:  

I have to stop it with Ctrl-C as it does not work (I waited up to 3 min with no luck). As a consequence, I had to kill the postmaster process manually.

Kernel log says nothing :(

Not sure what started failing first: the connections or the autovacuum. I just went to the log and saw that sequence of events. Searching on Google, I found a few references saying autovacuum and that error message "WARNING:  worker took too long to start; cancelled" were related.
 

> OS: Centos 5.5
> Kernel: 2.6.18-194.26.1.el5
> Postgres version: 8.4.5 (installation out-of-the-box using yum)

Given that this is a Linux system, I think that an OOM kill on the
postmaster is a not-insignificant possibility.  Or at least I would
think that if there weren't a PostmasterIsAlive check in the autovac
launcher loop.  It's real hard to see how you could get more than
one of these messages if the postmaster were gone entirely.

I just noticed the error repeats once every minute, but the content is otherwise the same ...
The first time it appeared:

2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING:  worker took too long to start; cancelled

It kept repeating once every minute until we fixed it:

2010-11-14 09:44:01.889 CET|1208|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-14 09:45:03.105 CET|1209|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING:  worker took too long to start; cancelled

All 1,000+ lines are alike except for the timestamp and session line number (each new line adds one more minute); otherwise the message is the same:

2010-11-14 09:45:03.105 CET|1209|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING:  worker took too long to start; cancelled

According to our postgresql.conf log format ...

log_line_prefix = '%m|%l|%d|%r|%p|%i|%c|%s|%v|%x|%u| '            
                                        #   %u = user name
                                        #   %d = database name
                                        #   %r = remote host and port
                                        #   %h = remote host
                                        #   %p = process ID
                                        #   %t = timestamp without milliseconds
                                        #   %m = timestamp with milliseconds
                                        #   %i = command tag
                                        #   %c = session ID
                                        #   %l = session line number
                                        #   %s = session start timestamp
                                        #   %v = virtual transaction ID
                                        #   %x = transaction ID (0 if none)
                                        #   %q = stop here in non-session
                                        #        processes

Splitting the log ... 

(%m = timestamp with milliseconds) = 2010-11-14 09:45:03.105 CET
(%l = session line number) = 1209
(%d = database name) = 
(%r = remote host and port) = 
(%p = process ID) = 3300
(%i = command tag) = 
(%c = session ID) = 4cdc2ae6.ce4
(%s = session start timestamp) = 2010-11-11 18:41:58 CET
(%v = virtual transaction ID) = 
(%x = transaction ID (0 if none)) = 0
(%u = user name) = 
Message = WARNING:  worker took too long to start; cancelled

I see the following fields are the same in the 1,200+ lines of the log

(%d = database name) = 
(%r = remote host and port) = 
(%c = session ID) 
(%s = session start timestamp) 
(%v = virtual transaction ID) 
(%x = transaction ID (0 if none)) 
(%u = user name) 
Message
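
In case it's useful: since the prefix is '|'-separated, the fields can be pulled apart mechanically. A rough awk sketch, assuming this file name and that the message text itself contains no '|':

# awk -F'|' '/worker took too long/ {print $1, $2, $5, $7}' postgresql.log

That prints the timestamp, session line number, PID and session ID of every occurrence.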
 

Does this info help?
What does the empty database name mean? Could it be the autovacuum process?
I noticed the "%s" value (2010-11-11 18:41:58 CET) matches the time we started postgres. Here is the log:

2010-11-11 18:41:58.840 CET|1|||3268||4cdc2ae6.cc4|2010-11-11 18:41:58 CET||0|| LOG:  database system was shut down at 2010-11-11 18:38:55 CET
2010-11-11 18:41:58.902 CET|1|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| LOG:  autovacuum launcher started
2010-11-11 18:41:58.902 CET|1|||3234||4cdc2ae4.ca2|2010-11-11 18:41:56 CET||0|| LOG:  database system is ready to accept connections



Could you try strace'ing the postmaster process to see what it's doing
when this is happening?

I definitely will.

More information in case it helps. Strangely, we had the same problem last week in a temporary database we used while migrating some data. This time there was no autovacuum entry in the log ...

2010-11-10 20:26:06.441 CET|1|cache|10.19.0.48(20840)|11239|SELECT|4cdaf1cc.2be7|2010-11-10 20:26:04 CET|114/0|0|postgres| LOG:  duration: 1770.500 ms  statement: select id_type,id,id_language_fk,data from cache where id_type=5  and id = '60820269' and id_language_fk = 2
2010-11-10 20:26:07.100 CET|1|xxx|10.19.0.43(14080)|9423|SELECT|4cdaef43.24cf|2010-11-10 20:15:15 CET|80/0|0|postgres| LOG:  duration: 1178.423 ms  statement: SELECT "reviews".id FROM "reviews" WHERE ("reviews"."booking_id" = E'10101690551')  LIMIT 1
2010-11-10 20:27:26.905 CET|3|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:28:27.422 CET|4|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:28:41.641 CET|1|mrs|10.19.0.51(20106)|11272|SELECT|4cdaf1d0.2c08|2010-11-10 20:26:08 CET|6/701|0|postgres| LOG:  duration: 1878.724 ms  execute <unnamed>: select tp.*,t.id_token,uo.* from tokenprovider as tp inner join token as t on (tp.id_tokenprovider = t.id_tokenprovider_fk) inner join tkuo as tu on (t.id_token = tu.id_token_fk) inner join useoccupation as uo on (uo.id_useoccupation = tu.id_useoccupation_fk and uo.id_tokenprovider_fk = tp.id_tokenprovider) where tp.id_tokenprovider in (10203746,10369604,10718768,10001514,10170665,10091612,10829359,10797243,50013934,12536008,90529952,10064394,10769848,50013983,10012115,23951091,25688092,60712161,10455933,10766831,62781261,10560488,10808512,80751062,38735328,50013938,10481737,50013710,10916713,10556915,10999695,75417634,22313851,55471672,88737419,46563417,69789399,20418193,37363697,55232743,10328868,10171688,77959221,10380595,28839589,10439708,10165273,10872146,50356324,15842616,10597119,10943311,43894782,61323020) order by tp.id_tokenprovider,t.id_token,uo.sort
2010-11-10 20:29:28.349 CET|5|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:30:01.940 CET|1|xmas|10.19.0.43(14078)|9422|idle|4cdaef43.24ce|2010-11-10 20:15:15 CET|70/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:01.942 CET|1|xxxx|10.19.0.43(2900)|5793|idle|4cdaebb6.16a1|2010-11-10 20:00:06 CET|60/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:01.948 CET|1|mrs|10.19.0.43(14081)|9424|idle|4cdaef43.24d0|2010-11-10 20:15:15 CET|89/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:01.950 CET|3|xxxx|10.19.0.43(14429)|9840|idle|4cdaefc8.2670|2010-11-10 20:17:28 CET|72/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:01.951 CET|1|xxxx|10.19.0.43(60284)|10509|idle|4cdaf0ba.290d|2010-11-10 20:21:30 CET|83/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:01.961 CET|1|xxxx|10.19.0.43(49864)|10986|idle|4cdaf19f.2aea|2010-11-10 20:25:19 CET|57/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:01.969 CET|1|xxxx|10.19.0.43(50039)|11282|idle|4cdaf1d9.2c12|2010-11-10 20:26:17 CET|65/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:28.875 CET|6|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:31:29.210 CET|7|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:32:29.642 CET|8|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:33:30.271 CET|9|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:34:30.604 CET|10|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:35:31.106 CET|11|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled

Well, thanks for any information. I can provide more logs/information if necessary.

Pablo
 

                       regards, tom lane

Re: Autovacuum seems to block database: WARNING worker took too long to start

From:
Alvaro Herrera
Date:
Excerpts from Pablo Delgado Díaz-Pache's message of Mon Nov 15 04:52:53 -0300 2010:

> > But it strikes me that the code comment is wrong in one significant way:
> > if the postmaster were failing to heed SIGUSR1 at all, you could reach
> > the timeout here, because the fork-failed signal wouldn't get sent.

Hmm, yeah, I guess I wasn't considering the idea that postmaster was so
busy with something else that it wouldn't be able to respond to the
launcher's requests.  As you say, if it went away entirely, autovacuum
would also suicide.

> > Could you try strace'ing the postmaster process to see what it's doing
> > when this is happening?
>
> I definitely will.

Yes, please.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Autovacuum seems to block database: WARNING worker took too long to start

From:
Pablo Delgado Díaz-Pache
Date:
Not sure what you mean.

Once we start getting that error in the postgres log, there is no autovacuum entry. Only that error message once every minute (and as I wrote in my last email, the same message all the time).

The question is what is causing postgres to stop working. Autovacuum or another reason? If it's another reason ... what could it be?

thanks in advance for your help

2010/11/16 Alvaro Herrera <alvherre@commandprompt.com>
Excerpts from Pablo Delgado Díaz-Pache's message of Mon Nov 15 04:52:53 -0300 2010:

> > But it strikes me that the code comment is wrong in one significant way:
> > if the postmaster were failing to heed SIGUSR1 at all, you could reach
> > the timeout here, because the fork-failed signal wouldn't get sent.

Hmm, yeah, I guess I wasn't considering the idea that postmaster was so
busy with something else that it wouldn't be able to respond to the
launcher's requests.  As you say, if it went away entirely, autovacuum
would also suicide.

> > Could you try strace'ing the postmaster process to see what it's doing
> > when this is happening?
>
> I definitely will.

Yes, please.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Autovacuum seems to block database: WARNING worker took too long to start

From:
Alvaro Herrera
Date:
Excerpts from Pablo Delgado Díaz-Pache's message of Tue Nov 16 12:18:09 -0300 2010:
> Not sure what you mean.
>
> Once we start getting that error in the postgres log, there is no autovacuum
> entry. Only that error message once every minute (and as I wrote in my last
> email, the same message all the time).

Yeah.  This means that the autovacuum launcher daemon is still running
and trying to schedule new jobs, but it's not able to for some reason.
When it tries to start a new one, it sees the old one still waiting to
start running; and then it logs this message.

To start a new job, what it does is send a signal to postmaster and wait
for it to create a child process.  This is the part that's not working.
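
That handshake can be watched from the outside, by the way. A sketch, where <launcher pid> is the "autovacuum launcher process" PID from ps; each launch attempt should show up as a kill() delivering SIGUSR1 to the postmaster:

# strace -e trace=kill -p <launcher pid>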

> The question is what is causing postgres to stop working. Autovacuum or
> another reason? If it's another reason ... what could it be?

I have no idea what's going on.  I have never seen this problem.  If you
see it happen again, please get a stacktrace from postmaster; and/or
strace it.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Autovacuum seems to block database: WARNING worker took too long to start

From:
Alvaro Herrera
Date:
Excerpts from Pablo Delgado Díaz-Pache's message of Thu Nov 18 08:57:16 -0300 2010:
> Well, we had the problem again. This is what we did
>
> 1) A "ps" to check what processes were running. Many "postmaster" processes
> in defunct state. A few postgres connections still working. A few "psql" (by
> shell script) queries hung (they are scripts to monitor the state of the database.
> We use big-sister to do so).
>
> The normal state of the server is 1 postmaster pid and many "postgres"
> connections in "idle" state.
>
> This is the result when we were having the problem. I've excluded irrelevant
> processes.
>
> postgres   892  3889  0 09:07 ?        00:01:05 postgres: postgres
> international 10.19.0.51(49264) idle
> postgres   934  3889  0 12:00 ?        00:00:04 [postmaster] <defunct>
> postgres   935  3889  0 12:00 ?        00:00:04 [postmaster] <defunct>

Hmm, these are processes that were forked from postmaster (PID 3889) and
died before they could become something useful -- but postmaster hasn't
reclaimed them yet.

Can you see what process 3889 is doing?  strace it or get a stacktrace
with GDB.
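
Something like this would do, assuming gdb (and ideally the postgresql debuginfo package, so the backtrace has symbols) is installed:

# gdb -batch -ex bt -p 3889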

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Autovacuum seems to block database: WARNING worker took too long to start

From:
Alvaro Herrera
Date:
Excerpts from Pablo Delgado Díaz-Pache's message of Thu Nov 18 08:57:16 -0300 2010:

> 2) We did a strace on the postmaster PID. However, we had two postmaster
> processes that were not defunct:
>
> # ps -fea |grep -i postmaster
> postgres  3889     1  0 Nov16 ?        00:01:24 /usr/bin/postmaster -p 5432
> -D /var/lib/pgsql/data
> postgres  7601  3889  0 12:37 ?        00:00:00 /usr/bin/postmaster -p 5432
> -D /var/lib/pgsql/data
>
> As soon as we did a "strace" to the 3889 pid everything started to work
> again.

Sorry for my previous response -- evidently I failed to scroll down
enough to notice this part.

It seems to me that this process was stuck in an unnatural way.

> Not sure it was a coincidence but that was the way it was.
>
> # strace -p 3889
> Process 3889 attached - interrupt to quit
> select(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be
> restarted)
> --- SIGUSR1 (User defined signal 1) @ 0 (0) ---
> rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0

This seems normal postmaster activity: receiving SIGUSR1, then SIGCHLD,
and doing stuff accordingly.

Rather than a coincidence, I would think that the act of tracing it made
it come back to life.  A kernel bug maybe?  Have you upgraded your
kernel or libc lately?
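
Quick ways to check, assuming an rpm-based install:

# uname -r
# rpm -q --last kernel glibc | head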

> I also straced the other postmaster pid
>
> # strace -p 7601
> Process 7601 attached - interrupt to quit
> recvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL,
> NULL) = 181

This one seems like a regular postmaster child that hadn't gotten around
to changing its ps status yet.  (Note it had PPID 3889 which is
consistent with this idea.)

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Autovacuum seems to block database: WARNING worker took too long to start

From:
Pablo Delgado Díaz-Pache
Date:
Just noticed that the MySQL database on the same server (we use it only for wordpress) was also stuck and we had to restart it as well. Checking "netstat -an" I noticed the following ...

# netstat -an | grep 3306
tcp        0      0 0.0.0.0:3306                0.0.0.0:*                   LISTEN
tcp        0      0 10.19.0.47:3306             10.19.0.46:33691            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33745            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33710            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33743            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33751            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33690            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33758            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33725            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33739            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33730            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33744            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33688            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33741            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33757            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33712            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33734            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33719            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33738            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33682            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33702            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33701            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33742            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33708            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33703            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33731            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33723            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33695            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33724            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33728            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33689            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33729            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33694            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33720            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33754            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33732            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33705            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33748            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33749            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33733            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33753            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33755            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33726            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33727            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33737            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33693            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33700            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33740            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33747            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33716            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33709            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33697            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33717            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33736            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33715            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33704            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33711            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33746            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33718            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33692            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33707            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33696            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33750            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33681            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33699            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33756            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33714            SYN_RECV
tcp        0      0 10.19.0.47:3306             10.19.0.46:33706            SYN_RECV
tcp        1      0 10.19.0.47:3306             10.19.0.46:49437            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49436            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49439            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49438            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49433            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49432            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49435            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49434            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49429            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49428            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49431            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49430            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49427            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49453            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49452            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49455            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49454            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49449            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49448            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49451            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49450            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49445            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49444            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49447            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49446            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49441            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49440            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49443            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49442            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49457            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:49456            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30397            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30396            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30399            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30398            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30393            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30392            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30395            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30394            TIME_WAIT
tcp      100      0 10.19.0.47:3306             10.19.0.46:30391            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30409            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30408            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30411            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30410            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30405            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30404            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30407            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30406            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30401            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30400            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30403            TIME_WAIT
tcp        1      0 10.19.0.47:3306             10.19.0.46:30402            TIME_WAIT

I didn't run netstat -an for port 5432, but I guess all max_connections slots for postgres were busy. Just a guess ...
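
Next time it happens, something like this should confirm or refute that guess, assuming a superuser connection can still be made:

# netstat -an | grep -c ':5432 .*ESTABLISHED'
# psql -U postgres -c 'SHOW max_connections'
# psql -U postgres -c 'SELECT count(*) FROM pg_stat_activity'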

All connections come from internal servers, and there are no firewall rules on that interface. Just in case, here is the output of iptables -L -n -v:

# iptables -L -n -v
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
 219M   56G mirai      all  --  *      *       0.0.0.0/0            0.0.0.0/0

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 170M packets, 392G bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain mirai (1 references)
 pkts bytes target     prot opt in     out     source               destination
  12M 6207M ACCEPT     all  --  lo     *       0.0.0.0/0            0.0.0.0/0
 6186  509K ACCEPT     icmp --  *      *       0.0.0.0/0            0.0.0.0/0           icmp type 255
 205M   50G ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0           state RELATED,ESTABLISHED
   30  1800 ACCEPT     all  --  eth0   *       x.x.x.x        0.0.0.0/0
1023K   61M ACCEPT     all  --  eth1   *       10.0.0.0/8           0.0.0.0/0
 6224  373K ACCEPT     tcp  --  eth0   *       0.0.0.0/0            0.0.0.0/0           tcp multiport dports 80,443
 308K  183M REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0           reject-with icmp-port-unreachable


2010/11/18 Pablo Delgado Díaz-Pache <delgadop@gmail.com>
Well, we had the problem again. This is what we did:

1) A "ps" to check what processes were running. Many "postmaster" processes in defunct state. A few postgres connections still working. A few "psql" queries (run by shell script) hung (they are scripts that monitor the state of the database; we use big-sister to do so).

The normal state of the server is 1 postmaster pid and many "postgres" connections in "idle" state.

This is the result when we were having the problem. I've excluded irrelevant processes.

postgres   892  3889  0 09:07 ?        00:01:05 postgres: postgres international 10.19.0.51(49264) idle
postgres   934  3889  0 12:00 ?        00:00:04 [postmaster] <defunct>
postgres   935  3889  0 12:00 ?        00:00:04 [postmaster] <defunct>
postgres   936  3889  0 12:00 ?        00:00:00 [postmaster] <defunct>
postgres   938  3889  0 12:00 ?        00:00:01 [postmaster] <defunct>
postgres  1244  3889  0 12:01 ?        00:00:01 [postmaster] <defunct>
postgres  1453  3889  0 12:02 ?        00:00:02 [postmaster] <defunct>
postgres  1563  3889  0 12:03 ?        00:00:10 [postmaster] <defunct>
postgres  1862  3889  0 12:04 ?        00:00:04 [postmaster] <defunct>
postgres  2275  3889  0 12:06 ?        00:00:02 [postmaster] <defunct>
root      2287    49  0 Oct29 ?        00:00:00 [kondemand/0]
root      2288    49  0 Oct29 ?        00:00:00 [kondemand/1]
postgres  2485  3889  0 12:07 ?        00:00:05 postgres: postgres international 10.19.0.51(59066) idle
postgres  2490  3889  1 12:07 ?        00:00:26 postgres: postgres international 10.19.0.51(59070) idle
postgres  2568  3889  0 12:07 ?        00:00:00 postgres: postgres international 10.19.0.50(1249) idle
postgres  3544  3889  0 12:12 ?        00:00:07 [postmaster] <defunct>
postgres  3875  3889  0 12:14 ?        00:00:03 [postmaster] <defunct>
postgres  3889     1  0 Nov16 ?        00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres  3896  3889  0 Nov16 ?        00:00:00 postgres: logger process
postgres  3913  3889  0 Nov16 ?        00:00:05 postgres: writer process
postgres  3914  3889  0 Nov16 ?        00:00:00 postgres: wal writer process
postgres  3915  3889  0 Nov16 ?        00:00:01 postgres: autovacuum launcher process
postgres  3916  3889  0 Nov16 ?        00:01:42 postgres: stats collector process
postgres  4169  3889  0 12:15 ?        00:00:02 [postmaster] <defunct>
postgres  4170  3889  0 12:15 ?        00:00:04 [postmaster] <defunct>
postgres  4171  3889  0 12:15 ?        00:00:07 [postmaster] <defunct>
postgres  4281  3889  0 12:15 ?        00:00:04 [postmaster] <defunct>
postgres  4672  3889  0 12:17 ?        00:00:01 [postmaster] <defunct>
postgres  5038  3889  0 12:19 ?        00:00:01 [postmaster] <defunct>
postgres  5203  3889  0 12:20 ?        00:00:00 [postmaster] <defunct>
postgres  5484  3889  0 12:21 ?        00:00:01 [postmaster] <defunct>
postgres  5694  3889  0 12:22 ?        00:00:01 postgres: postgres international 10.19.0.49(29113) idle
postgres  5769  3889  0 12:23 ?        00:00:04 [postmaster] <defunct>
postgres  5832  3889  0 12:23 ?        00:00:03 postgres: postgres international 10.19.0.49(29129) idle
postgres  5835  3889  0 12:23 ?        00:00:02 [postmaster] <defunct>
postgres  5979  3889  0 12:24 ?        00:00:01 [postmaster] <defunct>
postgres  6243  3889  0 12:25 ?        00:00:00 [postmaster] <defunct>
postgres  6424  3889  0 12:26 ?        00:00:01 postgres: postgres international 10.19.0.49(45160) idle
postgres  6425  3889  0 12:26 ?        00:00:01 [postmaster] <defunct>
postgres  6429  3889  0 12:26 ?        00:00:01 [postmaster] <defunct>
postgres  6599  3889  0 12:27 ?        00:00:02 [postmaster] <defunct>
postgres  6635  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6653  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6654  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6655  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6663  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6664  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6665  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6666  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6667  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6668  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6669  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6670  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6671  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6672  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6673  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6773  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6774  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6778  3889  0 12:28 ?        00:00:01 [postmaster] <defunct>
postgres  6779  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6785  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6787  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6788  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6796  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6797  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6837  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6838  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6839  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6840  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6841  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6842  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6843  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6844  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6852  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6859  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6860  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6862  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6872  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6873  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6874  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6910  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6970  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6971  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6972  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6973  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6978  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6990  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6991  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6992  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6993  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7000  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7088  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7089  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7090  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7091  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7094  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7095  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7096  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7122  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7126  3889  0 12:29 ?        00:00:03 postgres: postgres mrs 10.19.0.51(7717) idle
postgres  7127  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
root      7152  7150  0 12:30 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7157  7155  0 12:30 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7168  7166  0 12:31 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7174  7172  0 12:31 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7188  7186  0 12:32 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7196  7194  0 12:32 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7258  7256  0 12:34 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7263  7261  0 12:34 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7308  7306  0 12:35 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7313  7311  0 12:35 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7325  7323  0 12:36 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7330  7328  0 12:36 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7371  7369  0 12:37 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
postgres 22959  3889  0 11:09 ?        00:00:01 postgres: postgres spain 10.19.0.45(57042) idle
postgres 22961  3889  0 11:09 ?        00:00:22 postgres: postgres mirai 10.19.0.45(57043) idle
postgres 23225  3889  0 11:10 ?        00:00:00 postgres: postgres xmas 10.19.0.45(57061) idle
postgres 25844  3889  0 11:23 ?        00:00:01 postgres: postgres spain 10.19.0.51(11291) idle
postgres 31942  3889  0 11:53 ?        00:00:00 [postmaster] <defunct>
postgres 31973  3889  0 11:53 ?        00:00:16 [postmaster] <defunct>

2) We did a strace on the postmaster PID. However, we had two postmaster processes that were not defunct:

# ps -fea |grep -i postmaster
postgres  3889     1  0 Nov16 ?        00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres  7601  3889  0 12:37 ?        00:00:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data

As soon as we ran "strace" on PID 3889, everything started to work again. Not sure if it was a coincidence, but that is what happened.

# strace -p 3889
Process 3889 attached - interrupt to quit
select(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7766
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 7766
wait4(-1, 0x7fffc37259a4, WNOHANG, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7767
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---

I also straced the other postmaster pid

# strace -p 7601
Process 7601 attached - interrupt to quit
recvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL, NULL) = 181
brk(0x14ac4000)                         = 0x14ac4000
lseek(3, 0, SEEK_END)                   = 16130048
lseek(5, 0, SEEK_END)                   = 4661248
lseek(3, 0, SEEK_END)                   = 16130048
brk(0x14ae9000)                         = 0x14ae9000
brk(0x14adb000)                         = 0x14adb000
brk(0x14ad7000)                         = 0x14ad7000
sendto(7, "\2\0\0\0\230\1\0\0H\203\314\0\4\0\0\0\5\0\0\0\0\0\0\0X\203\314\0\0\0\0\0"..., 408, 0, NULL, 0) = 408
sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 74, 0, NULL, 0) = 74
recvfrom(8, "P\0\0\0\213\0select id_key from transla"..., 8192, 0, NULL, NULL) = 175
brk(0x14ac4000)                         = 0x14ac4000
lseek(3, 0, SEEK_END)                   = 16130048
lseek(5, 0, SEEK_END)                   = 4661248
lseek(3, 0, SEEK_END)                   = 16130048
brk(0x14ae9000)                         = 0x14ae9000
brk(0x14adb000)                         = 0x14adb000
brk(0x14ad7000)                         = 0x14ad7000
sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 73, 0, NULL, 0) = 73
recvfrom(8,  <unfinished ...>
Process 7601 detached

3) Checked the logs. We could see the same error lines we had before, but they disappeared after just a few minutes.

2010-11-18 12:30:02.830 CET|1|mirai|10.19.0.43(4452)|3875|idle|4ce50a7c.f23|2010-11-18 12:14:04 CET|67/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-18 12:30:02.841 CET|1|mirai|10.19.0.43(13943)|2275|idle|4ce508b4.8e3|2010-11-18 12:06:28 CET|75/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-18 12:30:47.628 CET|2|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:31:48.056 CET|3|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:32:48.184 CET|4|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:33:48.614 CET|5|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:34:48.747 CET|6|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:35:49.277 CET|7|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:36:49.706 CET|8|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:37:30.462 CET|1|spain|10.19.0.48(11276)|7414|idle|4ce50ffa.1cf6|2010-11-18 12:37:30 CET|2/0|0|spainadmin| LOG:  unexpected EOF on client connection

Any ideas?

Thanks for your help and feedback!

Pablo

2010/11/16 Alvaro Herrera <alvherre@commandprompt.com>
Excerpts from Pablo Delgado Díaz-Pache's message of Tue Nov 16 12:18:09 -0300 2010:

> Not sure what you mean.
>
> Once we start getting that error in the postgres log, there is no autovacuum
> entry. Only that error message once every minute (and as I wrote in my last
> email, the same message all the time).

Yeah.  This means that the autovacuum launcher daemon is still running
and trying to schedule new jobs, but it's not able to for some reason.
When it tries to start a new one, it sees the old one still waiting to
start running; and then it logs this message.

To start a new job, what it does is send a signal to postmaster and wait
for it to create a child process.  This is the part that's not working.

> The question is what is causing postgres to stop working. Autovacuum or
> another reason? If it's another reason ... what could it be?

I have no idea what's going on.  I have never seen this problem.  If you
see it happen again, please get a stacktrace from postmaster; and/or
strace it.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


root      7330  7328  0 12:36 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7371  7369  0 12:37 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
postgres 22959  3889  0 11:09 ?        00:00:01 postgres: postgres spain 10.19.0.45(57042) idle
postgres 22961  3889  0 11:09 ?        00:00:22 postgres: postgres mirai 10.19.0.45(57043) idle
postgres 23225  3889  0 11:10 ?        00:00:00 postgres: postgres xmas 10.19.0.45(57061) idle
postgres 25844  3889  0 11:23 ?        00:00:01 postgres: postgres spain 10.19.0.51(11291) idle
postgres 31942  3889  0 11:53 ?        00:00:00 [postmaster] <defunct>
postgres 31973  3889  0 11:53 ?        00:00:16 [postmaster] <defunct>
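
By the way, to count the zombie children we used a quick ad-hoc one-liner (3889 is the postmaster PID from the listing above; adjust as needed):

# ps --ppid 3889 -o stat= | grep -c '^Z'

Zombies show up with a 'Z' status, so this counts the [postmaster] <defunct> children.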

2) We ran strace on the postmaster PID. However, there were two postmaster processes that were not defunct:

# ps -fea |grep -i postmaster
postgres  3889     1  0 Nov16 ?        00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres  7601  3889  0 12:37 ?        00:00:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data

As soon as we attached strace to PID 3889, everything started working again. Not sure whether it was just a coincidence (maybe attaching the tracer interrupted whatever call the postmaster was stuck in and let pending signals through?), but that is how it went.

# strace -p 3889
Process 3889 attached - interrupt to quit
select(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7766
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 7766
wait4(-1, 0x7fffc37259a4, WNOHANG, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7767
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---

I also straced the other postmaster PID. Judging from the recvfrom/sendto calls below, it seems to be an ordinary client backend serving queries, even though ps shows it with the postmaster command line:

# strace -p 7601
Process 7601 attached - interrupt to quit
recvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL, NULL) = 181
brk(0x14ac4000)                         = 0x14ac4000
lseek(3, 0, SEEK_END)                   = 16130048
lseek(5, 0, SEEK_END)                   = 4661248
lseek(3, 0, SEEK_END)                   = 16130048
brk(0x14ae9000)                         = 0x14ae9000
brk(0x14adb000)                         = 0x14adb000
brk(0x14ad7000)                         = 0x14ad7000
sendto(7, "\2\0\0\0\230\1\0\0H\203\314\0\4\0\0\0\5\0\0\0\0\0\0\0X\203\314\0\0\0\0\0"..., 408, 0, NULL, 0) = 408
sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 74, 0, NULL, 0) = 74
recvfrom(8, "P\0\0\0\213\0select id_key from transla"..., 8192, 0, NULL, NULL) = 175
brk(0x14ac4000)                         = 0x14ac4000
lseek(3, 0, SEEK_END)                   = 16130048
lseek(5, 0, SEEK_END)                   = 4661248
lseek(3, 0, SEEK_END)                   = 16130048
brk(0x14ae9000)                         = 0x14ae9000
brk(0x14adb000)                         = 0x14adb000
brk(0x14ad7000)                         = 0x14ad7000
sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 73, 0, NULL, 0) = 73
recvfrom(8,  <unfinished ...>
Process 7601 detached

3) We checked the logs. We could see the same error lines we had before, but they disappeared after just a few minutes:

2010-11-18 12:30:02.830 CET|1|mirai|10.19.0.43(4452)|3875|idle|4ce50a7c.f23|2010-11-18 12:14:04 CET|67/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-18 12:30:02.841 CET|1|mirai|10.19.0.43(13943)|2275|idle|4ce508b4.8e3|2010-11-18 12:06:28 CET|75/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-18 12:30:47.628 CET|2|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:31:48.056 CET|3|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:32:48.184 CET|4|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:33:48.614 CET|5|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:34:48.747 CET|6|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:35:49.277 CET|7|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:36:49.706 CET|8|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:37:30.462 CET|1|spain|10.19.0.48(11276)|7414|idle|4ce50ffa.1cf6|2010-11-18 12:37:30 CET|2/0|0|spainadmin| LOG:  unexpected EOF on client connection

Any ideas?

Thanks for your help and feedback!

Pablo

2010/11/16 Alvaro Herrera <alvherre@commandprompt.com>
Excerpts from Pablo Delgado Díaz-Pache's message of Tue Nov 16 12:18:09 -0300 2010:
> Not sure what you mean.
>
> Once we start getting that error in the postgres log, there is no autovacuum
> entry. Only that error message once every minute (and as I wrote in my last
> email, the same message all the time).

Yeah.  This means that the autovacuum launcher daemon is still running
and trying to schedule new jobs, but for some reason it can't.  When it
tries to start a new worker, it sees the previous one still waiting to
begin running, and then it logs this message.

To start a new job, the launcher sends a signal to the postmaster and
waits for it to fork a child process.  That fork is the part that's not
working.
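
In code terms the handshake looks roughly like this (paraphrasing the
8.4 sources from memory, not verbatim):

    /* launcher side (autovacuum.c): grab a free WorkerInfo, publish it
     * as the worker-to-be-started, and poke the postmaster */
    worker->wi_dboid = avdb->adw_datid;
    worker->wi_launchtime = GetCurrentTimestamp();
    AutoVacuumShmem->av_startingWorker = worker;
    LWLockRelease(AutovacuumLock);
    SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER);

    /* postmaster side (postmaster.c, SIGUSR1 handler): fork the worker */
    if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER))
        StartAutovacuumWorker();

    /* launcher again: if av_startingWorker is still set after
     * Min(autovacuum_naptime, 60) seconds, give up on that worker */
    ereport(WARNING,
            (errmsg("worker took too long to start; cancelled")));

If the postmaster never gets around to forking (or the new worker dies
before clearing av_startingWorker), the launcher keeps emitting that
warning once per naptime, which is exactly the pattern in your log.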

> The question is what is causing postgres to stop working. Autovacuum or
> another reason? If it's another reason ... what could it be?

I have no idea what's going on.  I have never seen this problem.  If you
see it happen again, please get a stacktrace from postmaster; and/or
strace it.
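
Something like this should be enough for the stacktrace (assuming gdb
is installed; the postgresql debuginfo package makes the backtrace much
more readable; 3889 being the postmaster PID from your ps listing):

# gdb --batch -ex 'bt' -p 3889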

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support