Thread: barnacle (running CLOBBER_CACHE_RECURSIVELY) seems stuck since November
Hi,
I've been checking progress on barnacle, one of the buildfarm animals
running with CLOBBER_CACHE_RECURSIVELY. It's been running for ~170 days
(250k minutes).
This time, however, I checked the log, and what caught my eye is that
the last log message is from November. There were regular messages
until then (a few messages per day), but since Nov 29 there's been
nothing.
The last few messages are these:
2014-11-28 22:19:24 CET 7953 [540097d4.1f11:532] LOG: statement: CREATE
FUNCTION city_insert() RETURNS trigger LANGUAGE plpgsql AS $$
2014-11-28 22:19:27 CET 7953 [540097d4.1f11:533] LOG: statement: CREATE
TRIGGER city_insert_trig INSTEAD OF INSERT ON city_view
2014-11-28 22:25:13 CET 7953 [540097d4.1f11:534] LOG: statement: CREATE
FUNCTION city_delete() RETURNS trigger LANGUAGE plpgsql AS $$
2014-11-28 22:25:15 CET 7953 [540097d4.1f11:535] LOG: statement: CREATE
TRIGGER city_delete_trig INSTEAD OF DELETE ON city_view
2014-11-29 03:10:01 CET 7953 [540097d4.1f11:536] LOG: statement: CREATE
FUNCTION city_update() RETURNS trigger LANGUAGE plpgsql AS $$
2014-11-29 03:10:03 CET 7953 [540097d4.1f11:537] LOG: statement: CREATE
TRIGGER city_update_trig INSTEAD OF UPDATE ON city_view
2014-11-29 07:44:50 CET 7953 [540097d4.1f11:538] LOG: statement: INSERT
INTO city_view(city_name) VALUES('Tokyo') RETURNING *;
which matches commands about halfway through the 'triggers' test.
There are currently two queries running - one from 'triggers', one from
'updatable_views':
-[ RECORD 1 ]----+----------------------------------------------
datid | 16384
datname | regression
pid | 7953
usesysid | 10
usename | pgbuild
application_name | pg_regress
client_addr |
client_hostname |
client_port | -1
backend_start | 2014-08-29 17:10:12.243979+02
xact_start | 2014-11-29 07:44:50.452678+01
query_start | 2014-11-29 07:44:50.452678+01
state_change | 2014-11-29 07:44:50.45268+01
waiting | f
state | active
backend_xid |
backend_xmin | 3989
query | INSERT INTO city_view(city_name) VALUES('Tokyo')
RETURNING *;
-[ RECORD 2 ]----+----------------------------------------------
datid | 16384
datname | regression
pid | 7956
usesysid | 10
usename | pgbuild
application_name | pg_regress
client_addr |
client_hostname |
client_port | -1
backend_start | 2014-08-29 17:10:12.272223+02
xact_start | 2014-10-06 15:35:25.822488+02
query_start | 2014-10-06 15:35:25.822488+02
state_change | 2014-10-06 15:35:25.822491+02
waiting | f
state | active
backend_xid |
backend_xmin | 3862
query | UPDATE rw_view2 SET b='Row three' WHERE a=3 RETURNING *;
Top currently looks like this:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7956 pgbuild 20 0 242m 13m 10m R 98.8 0.2 251152:27 postgres:
pgbuild regression [local] UPDATE
7953 pgbuild 20 0 1023m 785m 11m R 98.4 10.0 248806:18 postgres:
pgbuild regression [local] INSERT
Both backends started on August 29, but the 'updatable_views' query has
been running since October 6. Five months seems like a rather long
runtime, even with CLOBBER_CACHE_RECURSIVELY.
Not sure if it's relevant, but this is the machine with the Intel C
compiler (2013).
Attached is the memory context info for the 7953 backend (the one with
more than 1GB VIRT). Not sure if that's relevant either, so it's
included just in case.
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Mar 22, 2015 at 6:26 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> This time, however, I checked the log, and what caught my eye is that
> the last log message is from November. There were regular messages
> until then (a few messages per day), but since Nov 29 there's been
> nothing.

I'd try strace on the backends, to see if they are issuing any system
calls.

I'd also try attaching gdb and pulling a back trace, then continue,
then hit ^C, and pull another backtrace, repeat about 10 times, and
see if you can figure out where it's stuck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 23.3.2015 16:02, Robert Haas wrote:
> On Sun, Mar 22, 2015 at 6:26 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> This time, however, I checked the log, and what caught my eye is that
>> the last log message is from November. There were regular messages
>> until then (a few messages per day), but since Nov 29 there's been
>> nothing.
>
> I'd try strace on the backends, to see if they are issuing any system calls.
Good idea. Both backends do a bunch of open/lseek/read/close calls, and
there seems to be a pattern:
open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"...,
512) = 512
close(10) = 0
close(9) = 0
open("base/16384/28061", O_RDWR) = 9
lseek(9, 0, SEEK_END) = 155648
lseek(9, 0, SEEK_END) = 155648
open("global/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\33\0\0\0\356\4\0\0\243m\0\0\224\v\0\0T2\0\0\275\4\0\0Z2\0\0"..., 512)
= 512
close(10) = 0
open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"...,
512) = 512
close(10) = 0
close(9) = 0
open("base/16384/28061", O_RDWR) = 9
lseek(9, 0, SEEK_END) = 155648
lseek(9, 0, SEEK_END) = 155648
open("base/16384/28056", O_RDWR) = 10
lseek(10, 0, SEEK_END) = 8192
lseek(10, 0, SEEK_END) = 8192
lseek(10, 0, SEEK_END) = 8192
It seems to vary a bit over time (or maybe the pattern is just longer
than the window I captured). It looks like a consequence of the
recursive cache resets, I guess.
Attached are a few seconds' worth of strace output from each backend.
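To decode that pattern a bit, here's a minimal stand-alone C sketch
(plain POSIX calls, not PostgreSQL source) of what those syscalls
accomplish: re-reading the fixed 512-byte relation map and probing a
relation file's size with lseek(SEEK_END), which is presumably roughly
what the storage layer has to redo once cached relation state gets
thrown away. The file paths are command-line arguments here; in the
trace they are the actual pg_filenode.map and relation files.

==================================================================
/*
 * Minimal illustration (not PostgreSQL code) of the strace pattern:
 * re-read a 512-byte map file, then open a relation file and probe
 * its size with lseek(SEEK_END).
 */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    char    buf[512];           /* pg_filenode.map is a fixed 512 bytes */
    off_t   size;
    int     fd;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <map-file> <relation-file>\n", argv[0]);
        return 1;
    }

    /* open/read/close of the relation map, as in the trace */
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || read(fd, buf, sizeof(buf)) < 0)
        return 1;
    close(fd);

    /*
     * Open the relation file and probe its size - the backend opens it
     * O_RDWR, but read-only is enough for a size check here.
     */
    fd = open(argv[2], O_RDONLY);
    if (fd < 0)
        return 1;
    size = lseek(fd, 0, SEEK_END);
    printf("%s: %ld bytes (%ld blocks of 8kB)\n",
           argv[2], (long) size, (long) (size / 8192));
    close(fd);

    return 0;
}
==================================================================

In other words, nothing seems blocked at the syscall level - the
backends just keep redoing this dance over and over.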
> I'd also try attaching gdb and pulling a back trace, then continue,
> then hit ^C, and pull another backtrace, repeat about 10 times, and
> see if you can figure out where it's stuck.
Hmmm, this seems a bit more interesting - see the backtraces attached.
The first thing that caught my eye is that the backtraces are quite
long, and the top part is a sequence of alternating heap_open() and
AcceptInvalidationMessages() frames. Moreover, it always seems to be
the same relation, judging by the relid:
==================================================================
#7 AcceptInvalidationMessages () at inval.c:691
#8 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
#10 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
#16 AcceptInvalidationMessages () at inval.c:691
#17 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#26 AcceptInvalidationMessages () at inval.c:691
#27 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#36 AcceptInvalidationMessages () at inval.c:691
#37 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#47 AcceptInvalidationMessages () at inval.c:691
#48 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#50 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#56 AcceptInvalidationMessages () at inval.c:691
#57 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#59 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#65 AcceptInvalidationMessages () at inval.c:691
#66 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#75 AcceptInvalidationMessages () at inval.c:691
#76 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#78 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#84 AcceptInvalidationMessages () at inval.c:691
#85 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
==================================================================
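To spell out why that pattern repeats, here's a tiny stand-alone C
sketch - emphatically not PostgreSQL source, the two functions are just
simplified stand-ins - of the mutual recursion visible above: acquiring
a relation lock processes pending invalidation messages, with
CLOBBER_CACHE_RECURSIVELY that forces cache entries to be rebuilt,
rebuilding them opens (and locks) catalog relations, and each of those
locks processes invalidation messages again. IIRC the real code
eventually bottoms out (for one thing, a lock we already hold in the
same mode skips the invalidation processing); the depth limit below is
just a stand-in so the toy program terminates.

==================================================================
/*
 * Toy model of the backtrace above (not PostgreSQL source): "locking"
 * a relation processes invalidation messages, and with recursive cache
 * clobbering each round of invalidation processing rebuilds cache
 * entries, which "opens" (locks) a catalog relation again.
 */
#include <stdio.h>

#define MAX_DEPTH 8             /* stand-in for the bound the real code has */

static void lock_relation_oid(unsigned int relid, int depth);

/* stand-in for AcceptInvalidationMessages() with recursive clobbering */
static void
accept_invalidation_messages(unsigned int relid, int depth)
{
    if (depth >= MAX_DEPTH)
        return;
    printf("%*sAcceptInvalidationMessages (depth %d)\n", depth * 2, "", depth);
    /* forced cache rebuild -> open a catalog relation */
    lock_relation_oid(relid, depth + 1);
}

/* stand-in for heap_open() -> LockRelationOid() */
static void
lock_relation_oid(unsigned int relid, int depth)
{
    printf("%*sheap_open/LockRelationOid(relid=%u)\n", depth * 2, "", relid);
    /* right after acquiring a not-already-held lock, process invals */
    accept_invalidation_messages(relid, depth);
}

int
main(void)
{
    lock_relation_oid(31196760, 0);     /* relid taken from the backtrace */
    return 0;
}
==================================================================

Even with a modest bound on the nesting, the amount of work multiplies
at every level, which would go a long way towards explaining the ~250k
minutes of CPU time.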
The other thing is that a fair number of pointers are resolved like
this:
#157 0x00000000007d00ed in PostgresMain (argc=31196760, argv=0x1,
dbname=0x1c86a40 "regression", username=0x3 <Address 0x3 out of bounds>)
at postgres.c:4010
while in the next backtrace from the same backend it looks like this:
#117 0x00000000007d00ed in PostgresMain (argc=12, argv=0xf32ae0,
dbname=0x1c86a40 "regression", username=0xffffffffffffffff <Address
0xffffffffffffffff out of bounds>) at postgres.c:4010
That seems a bit strange - I would not expect these values to change
once the backend has started, but maybe I'm missing something.
And finally, a fair number of pointers are resolved as 0x7f71e74fd770,
which seems odd too.
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services