Обсуждение: rebellious pg stats collector (reopened case)

Поиск
Список
Период
Сортировка

rebellious pg stats collector (reopened case)

От
Laszlo Nagy
Дата:
PostgreSQL 8.3.5, the system is now stable (uptime > 10 days).
PostgreSQL stats collector uses 100% CPU forever:

On Thursday:

last pid: 29509;  load averages:  2.36,  2.01,
2.03
up 5+17:28:56  04:02:53
196 processes: 3 running, 184 sleeping, 9 zombie
CPU states:  5.3% user,  0.0% nice, 15.3% system,  0.0% interrupt, 79.4%
idle
Mem: 1009M Active, 5995M Inact, 528M Wired, 354M Cache, 214M Buf, 12M Free
Swap: 16G Total, 500K Used, 16G Free

 PID USERNAME       THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU
COMMAND
26709 pgsql            1 106    0 22400K  6832K CPU6   6 973:52 99.02%
postgres

Today:

last pid: 71326;  load averages:  2.69,  3.15,
2.92
up 10+20:13:44  06:47:41
176 processes: 3 running, 166 sleeping, 7 zombie
CPU states:     % user,     % nice,     % system,     % interrupt,     %
idle
Mem: 928M Active, 5868M Inact, 557M Wired, 380M Cache, 214M Buf, 172M Free
Swap: 16G Total, 620K Used, 16G Free

  PID USERNAME       THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU
COMMAND
44689 pgsql            1 107    0 22400K  7060K CPU6   6 748:07 99.02%
postgres
64221 www              1  96    0   158M 27628K select 5   0:18  4.20% httpd
68567 www              1  20    0   151M 23092K lockf  0   0:03  2.39% httpd

...

shopzeus# uptime
 6:48AM  up 10 days, 20:14, 1 user, load averages: 2.21, 3.01, 2.87

More than 10 hours on a dual-quad core Xeon 5420??? We have two
databases, total database size is about 15GB.
(The stats collector also uses significant disk I/O.)

Thursday:

# date
Thu Dec 11 04:05:00 EST 2008
# ls -l ~pgsql/data/
# ls -l ~pgsql/data/global/pgstat.stat
-rw-------  1 pgsql  pgsql  231673 Dec 10 12:01
/usr/local/pgsql/data/global/pgstat.stat

Today:

#date
Tue Dec 16 06:48:27 EST 2008
# cd ~pgsql/data
# ls -l global/pgstat.stat
-rw-------  1 pgsql  pgsql  232358 Dec 15 18:45 global/pgstat.stat

Looks like the pgstat.stat was not updated since the pg stats collector
(re)started.

#uname -a
FreeBSD shopzeus.com 7.0-RELEASE-p5 FreeBSD 7.0-RELEASE-p5 #0: Mon Nov
17 21:37:25 EST 2008
root@shopzeus.chello.hu:/usr/obj/usr/src/sys/SHOPZEUS  amd64

After restarting the postmaster, the process disappeares for a while
(some hours, sometimes for one day), then it start updating the stat
file correctly.

Please advise.

Thanks,

  Laszlo


Re: rebellious pg stats collector (reopened case)

От
Alvaro Herrera
Дата:
Laszlo Nagy wrote:
> PostgreSQL 8.3.5, the system is now stable (uptime > 10 days).
> PostgreSQL stats collector uses 100% CPU forever:

Could you grab a few stack traces from it and post them?  Also possibly
useful, leave strace running on the pgstat process for a while and post
the output somewhere.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: rebellious pg stats collector (reopened case)

От
Laszlo Nagy
Дата:
Alvaro Herrera wrote:
> Laszlo Nagy wrote:
>
>> PostgreSQL 8.3.5, the system is now stable (uptime > 10 days).
>> PostgreSQL stats collector uses 100% CPU forever:
>>
>
> Could you grab a few stack traces from it and post them?  Also possibly
> useful, leave strace running on the pgstat process for a while and post
> the output somewhere.
>
Very interesting results. Before I used 'truss' on the process, it was:

78816 pgsql            1 106    0 22400K  7100K CPU6   6  24.2H 99.02%
postgres

After I started "truss -p 78815" the first message I got was:

SIGNAL 17 (SIGSTOP)

and the it waited for some seconds. Then my screen was filled with
messages like this:

poll({7/POLLIN|POLLERR},1,2000)             = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-X\^C\0\0\M^T\f\r\0\n"...,1000,0x0,NULL,0x0) =
984 (0x3d8)
poll({7/POLLIN|POLLERR},1,2000)             = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-X\^C\0\0\M^T\f\r\0\n"...,1000,0x0,NULL,0x0) =
984 (0x3d8)
poll({7/POLLIN|POLLERR},1,2000)             = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M^X\^A\0\0\M^T\f\r\0"...,1000,0x0,NULL,0x0) = 408
(0x198)
poll({7/POLLIN|POLLERR},1,2000)             = 1 (0x1)
recvfrom(7,"\b\0\0\0@\0\0\0\0\0\0\0\0\0\0\0"...,1000,0x0,NULL,0x0) = 64
(0x40)
poll({7/POLLIN|POLLERR},1,2000)             = 1 (0x1)
recvfrom(7,"\b\0\0\0@\0\0\0\0\0\0\0\0\0\0\0"...,1000,0x0,NULL,0x0) = 64
(0x40)
poll({7/POLLIN|POLLERR},1,2000)             = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-X\0\0\0\M^O\v\b\0\^B"...,1000,0x0,NULL,0x0) =
216 (0xd8)
poll({7/POLLIN|POLLERR},1,2000)             = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-x\^A\0\0\0\0\0\0\^E"...,1000,0x0,NULL,0x0) = 504
(0x1f8)
poll({7/POLLIN|POLLERR},1,2000)             = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-X\0\0\0\M^T\f\r\0\^B"...,1000,0x0,NULL,0x0) =
216 (0xd8)
poll({7/POLLIN|POLLERR},1,2000)             = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-x\^A\0\0\0\0\0\0\^E"...,1000,0x0,NULL,0x0) = 504
(0x1f8)
poll({7/POLLIN|POLLERR},1,2000)             = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-X\0\0\0\M^T\f\r\0\^B"...,1000,0x0,NULL,0x0) =
216 (0xd8)
poll({7/POLLIN|POLLERR},1,2000)             = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-x\^A\0\0\0\0\0\0\^E"...,1000,0x0,NULL,0x0) = 504
(0x1f8)
poll({7/POLLIN|POLLERR},1,2000)             = 1 (0x1)
recvfrom(7,"\b\0\0\0@\0\0\0\0\0\0\0\0\0\0\0"...,1000,0x0,NULL,0x0) = 64
(0x40)
poll({7/POLLIN|POLLERR},1,2000)             = 1 (0x1)

I terminated the debugger with Ctrl+C. Then I started it again:


#truss -p 78815
SIGNAL 17 (SIGSTOP)
poll({7/POLLIN|POLLERR},1,2000)                  = 0 (0x0)
getppid(0x1,0x1,0x7d0,0x8014606cc,0xffffffff80a579c0,0x7fffffffceb8) =
95860 (0x17674)
open("global/pgstat.tmp",O_WRONLY|O_CREAT|O_TRUNC,0666) = 4 (0x4)
fstat(4,{mode=-rw------- ,inode=73223196,size=0,blksize=4096}) = 0 (0x0)
write(4,"\M^W\M-<\M-%\^A\^F\n\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\M-Z\^C\0\0\0\0\0\0\M-d\M^E\M-0u"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0+\0\0\0\0\0\0\0\^N\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\^C\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0T\^T\^N\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\^R\M-(\M^[\v\M-^\M-0A\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\^N\0\0\0\0\0\0\0\\n\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0T\M-A\n\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0~\M-d\M-fp\n\M-^\M-0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\^E\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"$\0\0\0\0\0\0+\^V\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0Z\M-F%"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0|\^B\0\0\0\0\0\0<\M-I\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0T\M-f\t\^V\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\M-H'\M-4\M-N\n\M-^\M-0A"...,4096) = 4096 (0x1000)
write(4,"\0\0\^N\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\b\0\0\0\0\0L\^B\0\0\0\0\0\0\M^^"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0dD\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\09\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0>\M-h\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0T%;\r\0\0\0\0\0 \0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\M-I\\j\n\M-^\M-0A\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0Y\240\^A\0\0\0\0\0\^C\M-3"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\^N\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\^A\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0P\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\^A\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\^P\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\M-P\^R\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,3119) = 3119 (0xc2f)
close(4)                     = 0 (0x0)
rename("global/pgstat.tmp","global/pgstat.stat") = 0 (0x0)
process exit, rval = 0

I did NOT send the SIGSTOP signal to the process. The pgstat collector
was running at 100%CPU for 24 hours, when I attached the debugger, and
attaching the debugger somehow caused the process to get a SIGSTOP, but
the process did not exit. When I started the debugger for the second
time, it got a SIGSTOP again, and this made the collector exiting the
endless loop (instead of stopping it).

What is the next step?

   Laszlo


Re: rebellious pg stats collector (reopened case)

От
Alvaro Herrera
Дата:
Laszlo Nagy wrote:
> Alvaro Herrera wrote:
>> Laszlo Nagy wrote:
>>
>>> PostgreSQL 8.3.5, the system is now stable (uptime > 10 days).
>>> PostgreSQL stats collector uses 100% CPU forever:
>>>
>>
>> Could you grab a few stack traces from it and post them?  Also possibly
>> useful, leave strace running on the pgstat process for a while and post
>> the output somewhere.
>>
> Very interesting results. Before I used 'truss' on the process, it was:
>
> 78816 pgsql            1 106    0 22400K  7100K CPU6   6  24.2H 99.02%
> postgres
>
> After I started "truss -p 78815" the first message I got was:

It was 78816 and you traced 78815?  Are you sure the process with 24h of
CPU was pgstat?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: rebellious pg stats collector (reopened case)

От
Laszlo Nagy
Дата:
> It was 78816 and you traced 78815?  Are you sure the process with 24h of
> CPU was pgstat?
>
I'm sorry that was a typo. Of course I traced the good process (proof is
that at the end it renamed a file to "global/pgstat.stat".

And yes, "top" showed 24H in the TIME column and 99% in the WCPU column.
Also this process used 50-90% of disk I/O continuously, slowing down the
computer.

I'm sure that it will happen again, then I'm going to post another trace
(and make sure that I trace the good process).

Best,

   Laszlo



Re: rebellious pg stats collector (reopened case)

От
Laszlo Nagy
Дата:
OK, here is the new test.

shopzeus# ps l -p 39766 -p 39767 -p 39769
  UID   PID  PPID CPU PRI NI   VSZ   RSS MWCHAN STAT  TT       TIME COMMAND
   70 39766 78806   0  96  0 451960 423896 select Ss    ??    8:41.85
postgres: writer process    (postgres)
   70 39767 78806   0  96  0 451960  7184 select Ss    ??    0:58.75
postgres: wal writer process    (postgres)
   70 39769 78806   0  96  0 22400  7008 select Ss    ??   14:42.39
postgres: stats collector process    (postgres)


top -otime 10


last pid:  7984;  load averages:  1.38,  1.51,
1.49
up 16+18:18:22  05:02:20
170 processes: 3 running, 161 sleeping, 1 stopped, 5 zombie
CPU states:     % user,     % nice,     % system,     % interrupt,     %
idle
Mem: 959M Active, 5947M Inact, 562M Wired, 275M Cache, 214M Buf, 164M Free
Swap: 16G Total, 1088K Used, 16G Free

  PID USERNAME       THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU
COMMAND
78806 pgsql            1  98    0   441M  6960K select 4  67:14  0.39%
postgres
39769 pgsql            1  96    0 22400K  7008K select 7  14:43  0.00%
postgres
39766 pgsql            1  96    0   441M   414M select 3   8:42  0.00%
postgres
10602 root             1  96    0   149M 18916K select 4   5:30  0.00% httpd
67535 root             1   4    0  3720K  1336K kqread 0   3:54  0.00%
dovecot
  795 root             1   4    0  8188K  1452K kqread 4   3:21  0.00%
master
95791 shopzeus         3 116   20 59892K 17440K select 3   2:22  0.00%
python
32397 postfix          1   4    0  8280K  1980K kqread 6   1:23  0.00% qmgr
67536 root             1   4    0  5996K  1448K kqread 4   1:07  0.00%
dovecot-auth
  525 root             1  96    0  4684K  1156K select 3   1:03  0.00%
syslogd

In other words, the stats collector is running since 14 hours.

shopzeus# date
Mon Dec 22 05:03:24 EST 2008
shopzeus# ls -l ~pgsql/data/global/pgstat.stat
-rw-------  1 pgsql  pgsql  232084 Dec 22 05:03
/usr/local/pgsql/data/global/pgstat.stat

Looks like it is updating the stats file continuously. Now I try to trace:


shopzeus# truss -p 39769
poll({7/POLLIN|POLLERR},1,2000)                  = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-X\^C\0\0\M^T\f\r\0\n"...,1000,0x0,NULL,0x0) =
984 (0x3d8)
poll({7/POLLIN|POLLERR},1,2000)                  = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-X\^C\0\0\M^T\f\r\0\n"...,1000,0x0,NULL,0x0) =
984 (0x3d8)
poll({7/POLLIN|POLLERR},1,2000)                  = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-X\^C\0\0\M^T\f\r\0\n"...,1000,0x0,NULL,0x0) =
984 (0x3d8)
poll({7/POLLIN|POLLERR},1,2000)                  = 1 (0x1)
recvfrom(7,"\^A\0\0\0X\^B\0\0\M^T\f\r\0\^F\0"...,1000,0x0,NULL,0x0) =
600 (0x258)
poll({7/POLLIN|POLLERR},1,2000)                  ERR#4 'Interrupted
system call'
SIGNAL 14 (SIGALRM)
sigreturn(0x7fffffffca70,0x0,0x7fffffffca70,0xffffffff809fb5c0,0x7fffffffffc0,0x7fffffffca60)
ERR#4 'Interrupted system call'
getppid(0x1,0x1,0x801901120,0x4,0x7fffffffd39f,0x6) = 11204 (0x2bc4)
open("global/pgstat.tmp",O_WRONLY|O_CREAT|O_TRUNC,0666) = 4 (0x4)
fstat(4,{mode=-rw------- ,inode=73223197,size=0,blksize=4096}) = 0 (0x0)
write(4,"\M^W\M-<\M-%\^A`\^B\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"l\0\0\0\0\0\0\0\M-T\M-s.\M-`\M-r"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\t\0\0\0\0\0\0\0\^C"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\^C\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0T\^T\^N\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\M-aKo\M-s\M-a\M-0A\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\^C\0\0\0\0\0\0\0\M-&"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0T\M-A\n\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\M-_o\M-T\M-u\M-p"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\^E\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\^A\0\0\0\0\0\0\^?\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0X\M-<"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\b\^D@f\0\0\0\0\0\0\0F \0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0T\M-2\f\b\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0;\0\0\0\0\0\0\0\^R\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\n\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\r\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0T_\n\0\0\0\0\0\0\M-*R\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"r\M^W\M-`\M-p\M-a\M-0A\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\M-a\M-p\M-a\M-0A\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0c\^Q\^A\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\^F\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0Z\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\^D\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,2708) = 2708 (0xa94)
close(4)                     = 0 (0x0)
rename("global/pgstat.tmp","global/pgstat.stat") = 0 (0x0)
process exit, rval = 0

And the stats collector exited! It cannot be by accident. Last time I
checked, it had the same problem: tried to poll something. It was in an
inifinte loop, polling something but always got POLERR. Then because I
started to trace it, it exited with rval=0 or rval=1.

I do not understand why it exited after I started to trace it? Is this
some kind of bug?

Thanks,

   Laszlo


Re: rebellious pg stats collector (reopened case)

От
Tom Lane
Дата:
Laszlo Nagy <gandalf@shopzeus.com> writes:
> And the stats collector exited! It cannot be by accident. Last time I
> checked, it had the same problem: tried to poll something. It was in an
> inifinte loop, polling something but always got POLERR. Then because I
> started to trace it, it exited with rval=0 or rval=1.

> I do not understand why it exited after I started to trace it? Is this
> some kind of bug?

AFAICS, the only ways to reach the exit(0) in PgstatCollectorMain are
(1) need_exit becomes set (and the only way for that to happen is
to receive a SIGQUIT signal); or (2) PostmasterIsAlive(true) fails,
which implies that getppid() is returning something different than it
used to.

I wonder whether your tracing tool is affecting the result of
getppid().  Most people would consider that a bug in the tracing tool.

            regards, tom lane

Re: rebellious pg stats collector (reopened case)

От
Alvaro Herrera
Дата:
Tom Lane wrote:

> I wonder whether your tracing tool is affecting the result of
> getppid().  Most people would consider that a bug in the tracing tool.

Probably having a close look at the PPID column in ps or top during the
truss run would prove this.

Still, the actual problem being debugged is something else.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: rebellious pg stats collector (reopened case)

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I wonder whether your tracing tool is affecting the result of
>> getppid().  Most people would consider that a bug in the tracing tool.

> Probably having a close look at the PPID column in ps or top during the
> truss run would prove this.

It's at least conceivable that the result being returned inside the
process is different from what the rest of the world sees.  What'd prove
this one way or the other is to write a trivial test program along the
lines of

    while(1) {
        sleep(1);
        printf("ppid = %d\n", getppid());
    }

and see if its output changes when you start to trace it.

> Still, the actual problem being debugged is something else.

Agreed, but we need to understand what the tools being used to
investigate the problem are doing ...

            regards, tom lane

Re: rebellious pg stats collector (reopened case)

От
Laszlo Nagy
Дата:
> and see if its output changes when you start to trace it.
>
%cat test.c
#include <stdio.h>

int main() {
    while(1) {
        sleep(5);
        printf("ppid = %d\n", getppid());
    }
}

%gcc -o test test.c
%./test
ppid = 47653
ppid = 47653
ppid = 47653 # Started "truss -p 48864" here!
ppid = 49073
ppid = 49073
ppid = 49073

> Agreed, but we need to understand what the tools being used to
> investigate the problem are doing ...
>
Unfortunately, I'm not able to install strace:

# pwd
/usr/ports/devel/strace
# make
===>  strace-4.5.7 is only for i386, while you are running amd64.
*** Error code 1

Stop in /usr/ports/devel/strace.

I'll happily install any trace tool, but have no clue which one would help.




Re: rebellious pg stats collector (reopened case)

От
Laszlo Nagy
Дата:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>
>> Tom Lane wrote:
>>
>>> I wonder whether your tracing tool is affecting the result of
>>> getppid().  Most people would consider that a bug in the tracing tool.
>>>
>
>
I wrote to an official the FreeBSD list about this getppid() problem but
got no answer other than that "this behaviour is documented". :-(

The problem is still there:


  PID USERNAME       THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU
COMMAND
11205 pgsql            1 104    0 22400K  7112K CPU5   5 159.7H 99.02%
postgres


100% CPU since 159 hours! What can I do? Instead of tracing system
calls, is there a way to start the stats collector in debug mode? Or
maybe is it possible to change the source code, and disable the "is
postmaster alive" check for testing?

Thanks


Re: rebellious pg stats collector (reopened case)

От
Tom Lane
Дата:
Laszlo Nagy <gandalf@shopzeus.com> writes:
> maybe is it possible to change the source code, and disable the "is
> postmaster alive" check for testing?

Rather than disabling it, it'd probably be more convenient to make
any getppid value except 1 (the init process) be treated as "it's
alive".  Otherwise you'll have trouble with the collector not stopping
when you want it to.  Look into src/backend/storage/ipc/pmsignal.c:

-        return (getppid() == PostmasterPid);
+        return (getppid() != 1);

(Obviously this is a hack, but it should work on any Unixish system)

            regards, tom lane

Re: rebellious pg stats collector (reopened case)

От
Laszlo Nagy
Дата:
> alive".  Otherwise you'll have trouble with the collector not stopping
> when you want it to.  Look into src/backend/storage/ipc/pmsignal.c:
>
> -        return (getppid() == PostmasterPid);
> +        return (getppid() != 1);
>
> (Obviously this is a hack, but it should work on any Unixish system)
>
Thanks. I'm going to try this next weekend (when nobody works on the
database) and come back with the results.

Re: rebellious pg stats collector (reopened case)

От
Laszlo Nagy
Дата:
Laszlo Nagy írta:
>
>> alive".  Otherwise you'll have trouble with the collector not stopping
>> when you want it to.  Look into src/backend/storage/ipc/pmsignal.c:
>>
>> -        return (getppid() == PostmasterPid);
>> +        return (getppid() != 1);
>>
>> (Obviously this is a hack, but it should work on any Unixish system)
>>
Sorry for coming back so late. I changed this, and reinstalled
postgresql but it won't start. Some messages from the log:


Jan  8 00:40:29 shopzeus postgres[80522]: [3814-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:30 shopzeus postgres[80559]: [3849-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:30 shopzeus postgres[80632]: [3917-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:31 shopzeus postgres[80662]: [3946-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:31 shopzeus postgres[80700]: [3981-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:31 shopzeus postgres[80734]: [4014-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:31 shopzeus postgres[80738]: [4017-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:32 shopzeus postgres[80768]: [4046-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:32 shopzeus postgres[80802]: [4078-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:32 shopzeus postgres[80806]: [4079-1] WARNING:  terminating
connection because of crash of another server process
Jan  8 00:40:32 shopzeus postgres[80806]: [4079-2] DETAIL:  The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server
Jan  8 00:40:32 shopzeus postgres[80806]: [4079-3]  process exited
abnormally and possibly corrupted shared memory.
Jan  8 00:40:32 shopzeus postgres[80806]: [4079-4] HINT:  In a moment
you should be able to reconnect to the database and repeat your command.
Jan  8 00:40:32 shopzeus postgres[80835]: [4109-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:32 shopzeus postgres[80837]: [4110-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:33 shopzeus postgres[80870]: [4141-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:33 shopzeus postgres[80900]: [4170-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:34 shopzeus postgres[80938]: [4205-1] FATAL:  the database
system is in recovery mode
Jan  8 00:40:34 shopzeus postgres[80967]: [4233-1] FATAL:  the database
system is in recovery mode

(I had to reinstall the original version because it will be used today.)