Discussion: query slow; strace output worrisome
The SELECT shown below has been running for 30+ mins and the strace output is alarming:

[root@dione ~]# strace -p 10083
Process 10083 attached - interrupt to quit
creat(umovestr: Input/output error
0x2, 037777777777)                      = 1025220608
creat(umovestr: Input/output error
0x2, 037777777777)                      = 1025220608
creat(umovestr: Input/output error
0x2, 037777777777)                      = 1025220608
creat(NULL, 037777777777)               = 216203264
restart_syscall(<... resuming interrupted call ...>) = 8192
creat(umovestr: Input/output error
0x2, 037777777777)                      = 1025220608
creat(umovestr: Input/output error
0x2, 037777777777)                      = 1025220608
creat(umovestr: Input/output error
0x2, 037777777777)                      = 1025220608
creat(umovestr: Input/output error
0x2, 037777777777)                      = 1025220608

However, I can find no indication of I/O errors in the postgres log. Anyone have any idea what's going on here?

postgres 8.3.7

[root@dione ~]# uname -a
Linux dione.ca.com 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Thanks,
Brian

top:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10083 postgres  25   0 2288m 822m 817m R 100.0  1.7  28:08.79 postgres

cemdb=# select procpid,query_start,current_query from pg_stat_activity;
10083 | 2010-04-05 17:18:34.989022-07 | select b.ts_id from
ts_stats_transetgroup_usergroup_daily b, ts_stats_transet_user_interval c,
ts_transetgroup_transets_map tm, ts_usergroup_users_map um where
b.ts_transet_group_id = tm.ts_transet_group_id and
tm.ts_transet_incarnation_id = c.ts_transet_incarnation_id and
b.ts_usergroup_id = um.ts_usergroup_id and
um.ts_user_incarnation_id = c.ts_user_incarnation_id and
c.ts_interval_start_time >= '2010-03-29 21:00' and
c.ts_interval_start_time < '2010-03-29 22:00' and
b.ts_interval_start_time >= '2010-03-29 00:00' and
b.ts_interval_start_time < '2010-03-30 00:00'

cemdb=# explain select b.ts_id from ts_stats_transetgroup_usergroup_daily b,
ts_stats_transet_user_interval c, ts_transetgroup_transets_map tm,
ts_usergroup_users_map um where
b.ts_transet_group_id = tm.ts_transet_group_id and
tm.ts_transet_incarnation_id = c.ts_transet_incarnation_id and
b.ts_usergroup_id = um.ts_usergroup_id and
um.ts_user_incarnation_id = c.ts_user_incarnation_id and
c.ts_interval_start_time >= '2010-03-29 21:00' and
c.ts_interval_start_time < '2010-03-29 22:00' and
b.ts_interval_start_time >= '2010-03-29 00:00' and
b.ts_interval_start_time < '2010-03-30 00:00';

                                   QUERY PLAN
---------------------------------------------------------------------------
 Hash Join  (cost=1169.95..7046.23 rows=1 width=8)
   Hash Cond: ((b.ts_transet_group_id = tm.ts_transet_group_id) AND (c.ts_transet_incarnation_id = tm.ts_transet_incarnation_id))
   ->  Nested Loop  (cost=1159.95..7036.15 rows=10 width=24)
         ->  Nested Loop  (cost=0.00..28.16 rows=6 width=24)
               ->  Index Scan using ts_stats_transetgroup_usergroup_daily_starttimeindex on ts_stats_transetgroup_usergroup_daily b  (cost=0.00..8.86 rows=1 width=24)
                     Index Cond: ((ts_interval_start_time >= '2010-03-29 00:00:00-07'::timestamp with time zone) AND (ts_interval_start_time < '2010-03-30 00:00:00-07'::timestamp with time zone))
               ->  Index Scan using ts_usergroup_users_map_usergroupindex on ts_usergroup_users_map um  (cost=0.00..19.23 rows=6 width=16)
                     Index Cond: (um.ts_usergroup_id = b.ts_usergroup_id)
         ->  Bitmap Heap Scan on ts_stats_transet_user_interval c  (cost=1159.95..1167.97 rows=2 width=16)
               Recheck Cond: ((c.ts_user_incarnation_id = um.ts_user_incarnation_id) AND (c.ts_interval_start_time >= '2010-03-29 21:00:00-07'::timestamp with time zone) AND (c.ts_interval_start_time < '2010-03-29 22:00:00-07'::timestamp with time zone))
               ->  BitmapAnd  (cost=1159.95..1159.95 rows=2 width=0)
                     ->  Bitmap Index Scan on ts_stats_transet_user_interval_userincarnationidindex  (cost=0.00..14.40 rows=438 width=0)
                           Index Cond: (c.ts_user_incarnation_id = um.ts_user_incarnation_id)
                     ->  Bitmap Index Scan on ts_stats_transet_user_interval_starttime  (cost=0.00..1134.09 rows=44856 width=0)
                           Index Cond: ((c.ts_interval_start_time >= '2010-03-29 21:00:00-07'::timestamp with time zone) AND (c.ts_interval_start_time < '2010-03-29 22:00:00-07'::timestamp with time zone))
   ->  Hash  (cost=7.00..7.00 rows=200 width=16)
         ->  Seq Scan on ts_transetgroup_transets_map tm  (cost=0.00..7.00 rows=200 width=16)
(17 rows)
On 6/04/2010 9:01 AM, Brian Cox wrote:
> The SELECT show below has been running for 30+mins and the strace output
> is alarming:
>
> [root@dione ~]# strace -p 10083
> Process 10083 attached - interrupt to quit
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
> creat(NULL, 037777777777) = 216203264
> restart_syscall(<... resuming interrupted call ...>) = 8192
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
>
> however, I can find no indication of I/O errors in the postgres log.
> Anyone have any idea what's going on here?

Anything in `dmesg' (command) or /var/log/syslog ?

--
Craig Ringer
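[Editor's note: on a RHEL 5 box like this one the system log is /var/log/messages rather than /var/log/syslog. A sketch of the checks Craig suggests (the grep patterns and the /dev/sda device name are illustrative assumptions, not from the thread):

```shell
# Recent kernel-level I/O, SCSI/ATA, or filesystem complaints
dmesg | grep -iE 'i/o error|scsi|ata[0-9]|ext3' | tail -n 20

# Same search over the persistent system log (RHEL 5 path)
grep -i 'i/o error' /var/log/messages | tail -n 20
```

If both come back empty, the disk subsystem is probably fine and the "I/O error" text in the strace output needs another explanation.]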
On 04/05/2010 09:53 PM, Craig Ringer [craig@postnewspapers.com.au] wrote:
> Anything in `dmesg' (command) or /var/log/syslog ?

Nothing out of the ordinary.

Brian
Brian Cox wrote:
> On 04/05/2010 09:53 PM, Craig Ringer [craig@postnewspapers.com.au] wrote:
>> Anything in `dmesg' (command) or /var/log/syslog ?
> nothing out of the ordinary. Brian

I'm wondering if the issue is with strace rather than Pg. That is to say, strace is trying to print:

  creat("/path/to/file", 0x2, 037777777777) = 1025220608

... but to get "/path/to/file" it's calling umovestr(), which is returning an I/O error, perhaps due to some kind of security framework like AppArmor or SELinux in place on your system.

Yep, umovestr() is in util.c:857 in the strace 4.5.18 sources. It looks like it might read from the target's memory via /proc or using ptrace, depending on build configuration. Either is an obvious target for security framework limitations.

So what you're probably seeing is actually strace being prevented from getting some information out of the postgres backend's memory by system security policy.

As for what Pg is doing: creat() returns -1 on error and a file descriptor otherwise, so the return value appears to indicate success. I'm kind of wondering what the Pg backend is actually _doing_ though - if it were using sort temp files you'd expect open()/write()/read()/close(), not just creat() calls.

--
Craig Ringer
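[Editor's note: one way to test Craig's hypothesis on a RHEL 5 host. A diagnostic sketch, assuming SELinux (RHEL's framework) and the backend PID 10083 from the original post; ausearch requires auditd to be running:

```shell
# Is SELinux enforcing policy?
getenforce

# Any recent AVC denials that mention ptrace (which strace relies on)?
ausearch -m avc -ts recent 2>/dev/null | grep -i ptrace

# Re-attach with verbose decoding and larger string buffers, to see
# whether only the string fetch (umovestr) is failing while the
# syscall itself is decoded
strace -f -s 256 -p 10083
```

If getenforce reports "Enforcing" and an AVC denial shows up, the garbled creat() lines are a strace artifact, not a database problem.]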
On 04/06/2010 01:18 AM, Craig Ringer [craig@postnewspapers.com.au] wrote:
> I'm wondering if the issue is with strace rather than Pg. That is to
> say, that strace is trying to print:

Thanks, Craig: I do think that this is a strace issue.

> As for what Pg is doing: creat() returns -1 on error and a file
> descriptor otherwise, so the return value appears to indicate success.
> I'm kind of wondering what the Pg backend is actually _doing_ though -
> if it was using sort temp files you'd expect
> open()/write()/read()/close() not just creat() calls.

My question exactly: what is the backend doing calling creat() over and over again? Note that this query does complete -- in 30-40 mins.

Brian
On 7/04/2010 12:24 AM, Brian Cox wrote:
> On 04/06/2010 01:18 AM, Craig Ringer [craig@postnewspapers.com.au] wrote:
>> I'm wondering if the issue is with strace rather than Pg. That is to
>> say, that strace is trying to print:
>
> Thanks, Craig: I do think that this is a strace issue.
>
>> As for what Pg is doing: creat() returns -1 on error and a file
>> descriptor otherwise, so the return value appears to indicate success.
>> I'm kind of wondering what the Pg backend is actually _doing_ though -
>> if it was using sort temp files you'd expect
>> open()/write()/read()/close() not just creat() calls.
>
> My quesiton exactly: what is the backend doing calling creat() over and
> over again? Note that this query does complete -- in 30-40 mins.

I'd assume it was tempfile creation, but for the fact that there's nothing but creat() calls.

However, we can't trust strace. There may be more going on that is being hidden from strace via limitations on the ptrace syscall imposed by SELinux / AppArmor / whatever.

I suggest turning on the logging options in Pg that report use of tempfiles and disk-spilled sorts, then have a look and see if Pg is in fact using on-disk temp files for sorts or materialized data sets.

If it is, you might find that increasing work_mem helps your query out.

--
Craig Ringer
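[Editor's note: the logging options Craig refers to exist in 8.3. A sketch of the relevant settings; the values are illustrative, not recommendations for this system:

```
# postgresql.conf
log_temp_files = 0     # log every temp file created (0 = no size threshold)
trace_sort = on        # developer option: log sort begin/end and memory use
```

Then reload the config and, for the work_mem experiment, raise it for just the problem session rather than globally:

```
cemdb=# SET work_mem = '256MB';   -- illustrative value; then re-run the query
```

If the log fills with "temporary file" lines while the query runs, the backend is spilling to disk and the creat()-like calls in the strace output make sense.]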
On Tue, Apr 6, 2010 at 10:32 PM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
> On 7/04/2010 12:24 AM, Brian Cox wrote:
>> On 04/06/2010 01:18 AM, Craig Ringer [craig@postnewspapers.com.au] wrote:
>>> I'm wondering if the issue is with strace rather than Pg. That is to
>>> say, that strace is trying to print:
>>
>> Thanks, Craig: I do think that this is a strace issue.
>>
>>> As for what Pg is doing: creat() returns -1 on error and a file
>>> descriptor otherwise, so the return value appears to indicate success.
>>> I'm kind of wondering what the Pg backend is actually _doing_ though -
>>> if it was using sort temp files you'd expect
>>> open()/write()/read()/close() not just creat() calls.
>>
>> My quesiton exactly: what is the backend doing calling creat() over and
>> over again? Note that this query does complete -- in 30-40 mins.
>
> I'd assume it was tempfile creation, but for the fact that there's nothing
> but creat() calls.
>
> However, we can't trust strace. There may be more going on that is being
> hidden from strace via limitations on the ptrace syscall imposed by SELinux
> / AppArmor / whatever.
>
> I suggest turning on the logging options in Pg that report use of tempfiles
> and disk-spilled sorts, then have a look and see if Pg is in fact using
> on-disk temp files for sorts or materialized data sets.
>
> If it is, you might find that increasing work_mem helps your query out.

Yeah, I'd start with EXPLAIN and then, if you can wait long enough, EXPLAIN ANALYZE. You'll probably find it's doing a big sort or a big hash join.

...Robert
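[Editor's note: Robert's suggestion in psql form. A sketch; note that EXPLAIN ANALYZE actually executes the query, so expect the same 30-40 minute wait as the real run:

```
-- Compare the planner's estimates against reality. The plan shown earlier
-- estimates rows=1 at the top; if ANALYZE reports "actual ... rows" in the
-- thousands at the same nodes, a misestimate is driving the bad plan.
cemdb=# EXPLAIN ANALYZE
cemdb-# select b.ts_id from ts_stats_transetgroup_usergroup_daily b, ...;
        -- (the full query from the original post)
```

Nodes where "actual rows" dwarfs the estimate are the ones worth fixing, typically with fresh ANALYZE statistics on the tables involved.]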