Re: Autovacuum seems to block database: WARNING worker took too long to start

Поиск
Список
Период
Сортировка
От Pablo Delgado Díaz-Pache
Тема Re: Autovacuum seems to block database: WARNING worker took too long to start
Дата
Msg-id AANLkTinGUsJ31au72wZEbQeWkXbcCGd6MsugMnPQ0hX7@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Autovacuum seems to block database: WARNING worker took too long to start  (Alvaro Herrera <alvherre@commandprompt.com>)
Список pgsql-admin
Well, we had the problem again. This is what we did

1) A "ps" to check what processes were running. Many "postmaster" processes in defunct state. A few postgres connections still working. A few "psql" (by shell script) queries hanged (they are scripts to monitor state of database. We use big-sister to do so).

The normal state of the server is 1 postmaster pid and many "postgres" connections in "idle" state.

This the result when we were having the problem. I've excluded irrelevant processes.

postgres   892  3889  0 09:07 ?        00:01:05 postgres: postgres international 10.19.0.51(49264) idle
postgres   934  3889  0 12:00 ?        00:00:04 [postmaster] <defunct>
postgres   935  3889  0 12:00 ?        00:00:04 [postmaster] <defunct>
postgres   936  3889  0 12:00 ?        00:00:00 [postmaster] <defunct>
postgres   938  3889  0 12:00 ?        00:00:01 [postmaster] <defunct>
postgres  1244  3889  0 12:01 ?        00:00:01 [postmaster] <defunct>
postgres  1453  3889  0 12:02 ?        00:00:02 [postmaster] <defunct>
postgres  1563  3889  0 12:03 ?        00:00:10 [postmaster] <defunct>
postgres  1862  3889  0 12:04 ?        00:00:04 [postmaster] <defunct>
postgres  2275  3889  0 12:06 ?        00:00:02 [postmaster] <defunct>
root      2287    49  0 Oct29 ?        00:00:00 [kondemand/0]
root      2288    49  0 Oct29 ?        00:00:00 [kondemand/1]
postgres  2485  3889  0 12:07 ?        00:00:05 postgres: postgres international 10.19.0.51(59066) idle
postgres  2490  3889  1 12:07 ?        00:00:26 postgres: postgres international 10.19.0.51(59070) idle
postgres  2568  3889  0 12:07 ?        00:00:00 postgres: postgres international 10.19.0.50(1249) idle
postgres  3544  3889  0 12:12 ?        00:00:07 [postmaster] <defunct>
postgres  3875  3889  0 12:14 ?        00:00:03 [postmaster] <defunct>
postgres  3889     1  0 Nov16 ?        00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres  3896  3889  0 Nov16 ?        00:00:00 postgres: logger process
postgres  3913  3889  0 Nov16 ?        00:00:05 postgres: writer process
postgres  3914  3889  0 Nov16 ?        00:00:00 postgres: wal writer process
postgres  3915  3889  0 Nov16 ?        00:00:01 postgres: autovacuum launcher process
postgres  3916  3889  0 Nov16 ?        00:01:42 postgres: stats collector process
postgres  4169  3889  0 12:15 ?        00:00:02 [postmaster] <defunct>
postgres  4170  3889  0 12:15 ?        00:00:04 [postmaster] <defunct>
postgres  4171  3889  0 12:15 ?        00:00:07 [postmaster] <defunct>
postgres  4281  3889  0 12:15 ?        00:00:04 [postmaster] <defunct>
postgres  4672  3889  0 12:17 ?        00:00:01 [postmaster] <defunct>
postgres  5038  3889  0 12:19 ?        00:00:01 [postmaster] <defunct>
postgres  5203  3889  0 12:20 ?        00:00:00 [postmaster] <defunct>
postgres  5484  3889  0 12:21 ?        00:00:01 [postmaster] <defunct>
postgres  5694  3889  0 12:22 ?        00:00:01 postgres: postgres international 10.19.0.49(29113) idle
postgres  5769  3889  0 12:23 ?        00:00:04 [postmaster] <defunct>
postgres  5832  3889  0 12:23 ?        00:00:03 postgres: postgres international 10.19.0.49(29129) idle
postgres  5835  3889  0 12:23 ?        00:00:02 [postmaster] <defunct>
postgres  5979  3889  0 12:24 ?        00:00:01 [postmaster] <defunct>
postgres  6243  3889  0 12:25 ?        00:00:00 [postmaster] <defunct>
postgres  6424  3889  0 12:26 ?        00:00:01 postgres: postgres international 10.19.0.49(45160) idle
postgres  6425  3889  0 12:26 ?        00:00:01 [postmaster] <defunct>
postgres  6429  3889  0 12:26 ?        00:00:01 [postmaster] <defunct>
postgres  6599  3889  0 12:27 ?        00:00:02 [postmaster] <defunct>
postgres  6635  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6653  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6654  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6655  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6663  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6664  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6665  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6666  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6667  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6668  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6669  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6670  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6671  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6672  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6673  3889  0 12:27 ?        00:00:00 [postmaster] <defunct>
postgres  6773  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6774  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6778  3889  0 12:28 ?        00:00:01 [postmaster] <defunct>
postgres  6779  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6785  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6787  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6788  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6796  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6797  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6837  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6838  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6839  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6840  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6841  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6842  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6843  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6844  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6852  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6859  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6860  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6862  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6872  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6873  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6874  3889  0 12:28 ?        00:00:00 [postmaster] <defunct>
postgres  6910  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6970  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6971  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6972  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6973  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6978  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6990  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6991  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6992  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  6993  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7000  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7088  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7089  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7090  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7091  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7094  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7095  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7096  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7122  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
postgres  7126  3889  0 12:29 ?        00:00:03 postgres: postgres mrs 10.19.0.51(7717) idle
postgres  7127  3889  0 12:29 ?        00:00:00 [postmaster] <defunct>
root      7152  7150  0 12:30 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7157  7155  0 12:30 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7168  7166  0 12:31 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7174  7172  0 12:31 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7188  7186  0 12:32 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7196  7194  0 12:32 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7258  7256  0 12:34 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7263  7261  0 12:34 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7308  7306  0 12:35 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7313  7311  0 12:35 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7325  7323  0 12:36 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7330  7328  0 12:36 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root      7371  7369  0 12:37 ?        00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
postgres 22959  3889  0 11:09 ?        00:00:01 postgres: postgres spain 10.19.0.45(57042) idle
postgres 22961  3889  0 11:09 ?        00:00:22 postgres: postgres mirai 10.19.0.45(57043) idle
postgres 23225  3889  0 11:10 ?        00:00:00 postgres: postgres xmas 10.19.0.45(57061) idle
postgres 25844  3889  0 11:23 ?        00:00:01 postgres: postgres spain 10.19.0.51(11291) idle
postgres 31942  3889  0 11:53 ?        00:00:00 [postmaster] <defunct>
postgres 31973  3889  0 11:53 ?        00:00:16 [postmaster] <defunct>

2) We did a strace to the postmaster pid. However we had 2 postmasters not dead

# ps -fea |grep -i postmaster
postgres  3889     1  0 Nov16 ?        00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres  7601  3889  0 12:37 ?        00:00:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data

As soon as we did a "strace" to the 3889 pid everything started to work again. Not sure it was a coincidence but that was the way it was.

# strace -p 3889
Process 3889 attached - interrupt to quit
select(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7766
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 7766
wait4(-1, 0x7fffc37259a4, WNOHANG, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7767
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---

I also straced the other postmaster pid

# strace -p 7601
Process 7601 attached - interrupt to quit
recvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL, NULL) = 181
brk(0x14ac4000)                         = 0x14ac4000
lseek(3, 0, SEEK_END)                   = 16130048
lseek(5, 0, SEEK_END)                   = 4661248
lseek(3, 0, SEEK_END)                   = 16130048
brk(0x14ae9000)                         = 0x14ae9000
brk(0x14adb000)                         = 0x14adb000
brk(0x14ad7000)                         = 0x14ad7000
sendto(7, "\2\0\0\0\230\1\0\0H\203\314\0\4\0\0\0\5\0\0\0\0\0\0\0X\203\314\0\0\0\0\0"..., 408, 0, NULL, 0) = 408
sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 74, 0, NULL, 0) = 74
recvfrom(8, "P\0\0\0\213\0select id_key from transla"..., 8192, 0, NULL, NULL) = 175
brk(0x14ac4000)                         = 0x14ac4000
lseek(3, 0, SEEK_END)                   = 16130048
lseek(5, 0, SEEK_END)                   = 4661248
lseek(3, 0, SEEK_END)                   = 16130048
brk(0x14ae9000)                         = 0x14ae9000
brk(0x14adb000)                         = 0x14adb000
brk(0x14ad7000)                         = 0x14ad7000
sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 73, 0, NULL, 0) = 73
recvfrom(8,  <unfinished ...>
Process 7601 detached

3) Checked the logs. We could see the same error lines we had before, but they dissapeared after just a few minutes.

2010-11-18 12:30:02.830 CET|1|mirai|10.19.0.43(4452)|3875|idle|4ce50a7c.f23|2010-11-18 12:14:04 CET|67/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-18 12:30:02.841 CET|1|mirai|10.19.0.43(13943)|2275|idle|4ce508b4.8e3|2010-11-18 12:06:28 CET|75/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-18 12:30:47.628 CET|2|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:31:48.056 CET|3|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:32:48.184 CET|4|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:33:48.614 CET|5|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:34:48.747 CET|6|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:35:49.277 CET|7|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:36:49.706 CET|8|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-18 12:37:30.462 CET|1|spain|10.19.0.48(11276)|7414|idle|4ce50ffa.1cf6|2010-11-18 12:37:30 CET|2/0|0|spainadmin| LOG:  unexpected EOF on client connection

Any Ideas?

Thanks for your help and feedback!

Pablo

2010/11/16 Alvaro Herrera <alvherre@commandprompt.com>
Excerpts from Pablo Delgado Díaz-Pache's message of mar nov 16 12:18:09 -0300 2010:
> Not sure what you mean.
>
> Once we start getting that error in the postgres log, there is no autovacuum
> entry. Only that error message once every minute (and as i wrote in my last
> email, the same message all the time).

Yeah.  This means that the autovacuum launcher daemon is still running
and trying to schedule new jobs, but it's not able to for some reason.
When it tries to start a new one, it sees the old one still waiting to
start running; and then it logs this message.

To start a new job, what it does is send a signal to postmaster and wait
for it to create a child process.  This is the part that's not working.

> The question is what is causing postgres to stop working. Autovacuum or
> another reason? If it's another reason ... what could it be?

I have no idea what's going on.  I have never seen this problem.  If you
see it happen again, please get a stacktrace from postmaster; and/or
strace it.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

В списке pgsql-admin по дате отправления:

Предыдущее
От: Pablo Delgado Díaz-Pache
Дата:
Сообщение: Re: Autovacuum seems to block database: WARNING worker took too long to start
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: Unable to connect. Log : PDTWARNING: worker took too long to start; canceled