Discussion: odd intermittent query hanging issue

odd intermittent query hanging issue

From: Aaron Burnett
Date:
Greetings,

I run a handful of queries overnight when traffic is at its lowest on our
system. One particular query will run perfectly fine (around 5 seconds)
for several weeks, then suddenly decide to hang indefinitely and never
finish. It needs to be killed manually after several hours (I've
intentionally let it run to see if it would ever finish) in order for it
to die.

The fix _seems to be_ to drop and rebuild the index on
xrefchannelmember.member_id. The query then goes back to running in about 5
seconds and has no problem again for weeks... until it happens again.
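
For concreteness, the rebuild amounts to roughly the following (the index
name here is a placeholder; substitute whatever \di shows for that column):

-- placeholder index name; use the real one from \di
drop index xrefchannelmember_member_id_idx;
create index xrefchannelmember_member_id_idx on xrefchannelmember (member_id);
-- or, in a single step (takes a stronger lock on the table):
-- reindex index xrefchannelmember_member_id_idx;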

Has anyone heard of such a thing? And if anyone can maybe point me in
directions to investigate, it would be much appreciated.

PostgreSQL v 9.1.3
Ubuntu v 11.10
Linux db 3.0.0-16-generic #29-Ubuntu SMP Tue Feb 14 12:48:51 UTC 2012
x86_64 x86_64 x86_64 GNU/Linux

Query, explain plan (when it works) and table structure below:

(Thanking you in advance for any help)

select distinct(id) from member left join xrefchannelmember xrcm on
id=xrcm.member_id where id not in (Select memberid from
xrefcampaignmember) and xrcm.channel_id in (1)  order by id asc;

explain plan:
------------------------------
 Unique  (cost=237234.66..239267.33 rows=406533 width=4) (actual time=4790.823..4922.621 rows=418843 loops=1)
   ->  Sort  (cost=237234.66..238251.00 rows=406533 width=4) (actual time=4790.818..4826.201 rows=418879 loops=1)
         Sort Key: member.id
         Sort Method: quicksort  Memory: 31923kB
         ->  Hash Join  (cost=167890.31..199359.99 rows=406533 width=4) (actual time=3357.406..4532.952 rows=418879 loops=1)
               Hash Cond: (xrcm.member_id = member.id)
               ->  Seq Scan on xrefchannelmember xrcm  (cost=0.00..19273.69 rows=813066 width=4) (actual time=0.015..219.259 rows=814421 loops=1)
                     Filter: (channel_id = 1)
               ->  Hash  (cost=162586.21..162586.21 rows=424328 width=4) (actual time=3357.001..3357.001 rows=444626 loops=1)
                     Buckets: 65536  Batches: 1  Memory Usage: 15632kB
                     ->  Seq Scan on member  (cost=66114.02..162586.21 rows=424328 width=4) (actual time=2357.280..3216.076 rows=444626 loops=1)
                           Filter: (NOT (hashed SubPlan 1))
                           SubPlan 1
                             ->  Seq Scan on xrefcampaignmember  (cost=0.00..57931.82 rows=3272882 width=4) (actual time=0.021..670.086 rows=3272870 loops=1)
 Total runtime: 4963.134 ms

Table structure:

                             Table "public.xrefcampaignmember"
    Column    |           Type           |                      Modifiers
--------------+--------------------------+-----------------------------------------------------
 campaignid   | integer                  | not null
 memberid     | integer                  | not null
 joined       | timestamp with time zone | default ('now'::text)::timestamp without time zone
 reservedslot | integer                  | default 0
Indexes:
    "XrefCampaignMember_pkey" PRIMARY KEY, btree (campaignid, memberid)
    "xcm_campaignid_idx" btree (campaignid)
    "xcm_joined_idx" btree (joined)
    "xcm_memberid_idx" btree (memberid)
    "xcm_reservedslot_idx" btree (reservedslot)

Re: odd intermittent query hanging issue

From: Steve Crawford
Date:
On 05/18/2012 09:17 AM, Aaron Burnett wrote:
> Greetings,
>
> I run a handful of queries overnight when traffic is at its lowest on our
> system. One particular query will run perfectly fine (around 5 seconds)
> for several weeks, then suddenly decide to hang indefinitely and never
> finish. It needs to be killed manually after several hours (I've
> intentionally let it run to see if it would ever finish) in order for it
> to die.
>
> The fix _seems to be_ to drop and rebuild the index on
> xrefchannelmember.member_id. The query then goes back to running in the 5
> seconds and has no problem again for weeks.... until it happens again.
>
> Has anyone heard of such a thing? And if anyone can maybe point me in
> directions to investigate, it would be much appreciated.

Few answers but several questions...

Is the machine busy processing the query or is it idle?

Does it start happening consistently, or is it an individual query? I.e.,
if you run the query again, will it hang or complete? Can you get a query
plan when the query is failing?

Are there any other queries that may be locking your tables (check
pg_locks)?
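
For example, something along these lines (untested) will show any lock
requests that are stuck waiting:

-- ungranted lock requests and the relations they are waiting on
select pid, locktype, relation::regclass, mode, granted
from pg_locks
where not granted;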

Anything of interest in postgresql or system logs?

Autovacuum running properly? Are you running any large
updates/deletes/etc. just prior to the query, such that the statistics
the planner is using do not reflect reality?
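
A quick sanity check on the vacuum/analyze side, using the standard stats
view:

-- when were the tables in this query last (auto)vacuumed and analyzed?
select relname, last_autovacuum, last_autoanalyze
from pg_stat_user_tables
where relname in ('member', 'xrefchannelmember', 'xrefcampaignmember');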

Does the network connection to the server still exist?

Any other weirdness happening on the machine?


> select distinct(id) from member left join xrefchannelmember xrcm on
> id=xrcm.member_id where id not in (Select memberid from
> xrefcampaignmember) and xrcm.channel_id in (1)  order by id asc;
BTW, I have an unsubstantiated gut feeling that this query has room for
improvement. I would experiment with substituting "...where not exists
(select..." for the "...not in (select...". You could also try using
"select... except select memberid from xrefcampaignmember..."

I'm also not sure I understand the left join, since you have the
"xrcm.channel_id in (1)" condition, so you won't return records in
"member" without a corresponding record in xrefchannelmember anyway.

I also don't understand the "xrcm.channel_id in (1)" instead of
"xrcm.channel_id = 1", unless this is a generated query and there could
be multiple ids in that condition.

Will one of the following (untested) queries return the results you want
and have a better query plan?:

select distinct(id) from member
join xrefchannelmember xrcm on id=xrcm.member_id
where xrcm.channel_id=1
except
select memberid from xrefcampaignmember
order by id asc;

(actually with except the distinct is probably superfluous as well)

or:

select distinct(id) from member
join xrefchannelmember xrcm on id=xrcm.member_id
where xrcm.channel_id=1
and not exists (select 1 from xrefcampaignmember x where x.memberid =
member.id)
order by id asc;

or:

select distinct(id) from member
where
exists (select 1 from xrefchannelmember where member_id = member.id and
channel_id=1)
and not exists (select 1 from xrefcampaignmember where memberid = member.id)
order by id asc;

Cheers,
Steve


Re: odd intermittent query hanging issue

From: Aaron Burnett
Date:
Thanks Steve,

Answers are inserted below:


On 5/18/12 11:09 AM, "Steve Crawford" <scrawford@pinpointresearch.com>
wrote:

>On 05/18/2012 09:17 AM, Aaron Burnett wrote:
>> Greetings,
>>
>> I run a handful of queries overnight when traffic is at its lowest on
>>our
>> system. One particular query will run perfectly fine (around 5 seconds)
>> for several weeks, then suddenly decide to hang indefinitely and never
>> finish. It needs to be killed manually after several hours (I've
>> intentionally let it run to see if it would ever finish) in order for it
>> to die.
>>
>> The fix _seems to be_ to drop and rebuild the index on
>> xrefchannelmember.member_id. The query then goes back to running in the
>>5
>> seconds and has no problem again for weeks.... until it happens again.
>>
>> Has anyone heard of such a thing? And if anyone can maybe point me in
>> directions to investigate, it would be much appreciated.
>
>Few answers but several questions...
>
>Is the machine busy processing the query or is it idle?


It is processing and in fact drives the load up a bit.

>
>Does it start happening consistently, or is it an individual query? I.e.,
>if you run the query again, will it hang or complete? Can you get a query
>plan when the query is failing?

The query will run fine many times a night for many nights, then it will
hang. Killing that query and running it again results in the same
indefinite hang. The only fix so far to get the query to run again is to
drop and rebuild the xrefcampaignmember.memberid index (I misquoted which
index earlier) and then it will once again run fine for many, many days.


>
>Are there any other queries that may be locking your tables (check
>pg_locks)?

I did check, and no, no locks.


>
>Anything of interest in postgresql or system logs?

Nope.

>
>Autovacuum running properly? Are you running any large
>updates/deletes/etc. just prior to the query, such that the statistics
>the planner is using do not reflect reality?

No large updates or deletes to any of the tables involved in the query.
Statistics seem just fine.


>
>Does the network connection to the server still exist?

Yes.


>
>Any other weirdness happening on the machine?

Nothing at all.
>
>
>> select distinct(id) from member left join xrefchannelmember xrcm on
>> id=xrcm.member_id where id not in (Select memberid from
>> xrefcampaignmember) and xrcm.channel_id in (1)  order by id asc;
>BTW, I have an unsubstantiated gut feeling that this query has room for
>improvement. I would experiment with substituting "...where not exists
>(select..." for the "...not in (select...". You could also try using
>"select... except select memberid from xrefcampaignmember..."
>
>I'm also not sure I understand the left join, since you have the
>"xrcm.channel_id in (1)" condition, so you won't return records in
>"member" without a corresponding record in xrefchannelmember anyway.
>
>I also don't understand the "xrcm.channel_id in (1)" instead of
>"xrcm.channel_id = 1", unless this is a generated query and there could
>be multiple ids in that condition.
>
>Will one of the following (untested) queries return the results you want
>and have a better query plan?:
>
>select distinct(id) from member
>join xrefchannelmember xrcm on id=xrcm.member_id
>where xrcm.channel_id=1
>except
>select memberid from xrefcampaignmember
>order by id asc;
>
>(actually with except the distinct is probably superfluous as well)
>
>or:
>
>select distinct(id) from member
>join xrefchannelmember xrcm on id=xrcm.member_id
>where xrcm.channel_id=1
>and not exists (select 1 from xrefcampaignmember x where x.memberid =
>member.id)
>order by id asc;
>
>or:
>
>select distinct(id) from member
>where
>exists (select 1 from xrefchannelmember where member_id = member.id and
>channel_id=1)
>and not exists (select 1 from xrefcampaignmember where memberid =
>member.id)
>order by id asc;

Yeah, the query is poo... autogenerated... the LEFT JOIN is not needed, as
I have pointed out to the person responsible for the code many times, and
the 'in(1)' may indeed have many categories in there. But the ONLY one that
hangs is the 'in(1)'.


>
>Cheers,
>Steve
>


Re: odd intermittent query hanging issue

From: Steve Crawford
Date:
On 05/18/2012 11:01 AM, Aaron Burnett wrote:
>>> ... One particular query will run perfectly fine (around 5 seconds)
>>> for several weeks, then suddenly decide to hang indefinitely and never
>>> finish....
>> Is the machine busy processing the query or is it idle?
>
> It is processing and in fact drives the load up a bit.
What CPU, disk and memory is it using? It would be very interesting to
see the query plan when things go south.

> Yeah, the query is poo... autogenerated... the LEFT JOIN is not needed, as
> I have pointed out to the person responsible for the code many times, and
> the 'in(1)' may indeed have many categories in there. But the ONLY one that
> hangs is the 'in(1)'.

Is "1" the largest category? Also, how have you tuned work_mem (show
work_mem;)?

When the query gets bad, do you see PostgreSQL swapping to temp files
(watch files in PGDATA/base/DB_OID/pgsql_tmp)? Note that work_mem can be
set per-connection, so if it is too small for your nighttime maintenance
you can adjust it for those operations only.
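
For example, something like this around the nightly run (values are
illustrative; setting log_temp_files requires superuser):

-- log every temp-file spill to the server log
set log_temp_files = 0;
-- give just this session more room for sorts/hashes
set work_mem = '256MB';
-- ... run the nightly query here ...
reset work_mem;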

Cheers,
Steve

Re: odd intermittent query hanging issue

From: "Albe Laurenz"
Date:
Aaron Burnett wrote:
>>> I run a handful of queries overnight when traffic is at its lowest
>>> on our system. One particular query will run perfectly fine (around
>>> 5 seconds) for several weeks, then suddenly decide to hang
>>> indefinitely and never finish. It needs to be killed manually after
>>> several hours (I've intentionally let it run to see if it would ever
>>> finish) in order for it to die.
>>>
>>> The fix _seems to be_ to drop and rebuild the index on
>>> xrefchannelmember.member_id. The query then goes back to running in
>>> about 5 seconds and has no problem again for weeks... until it
>>> happens again.

>> Is the machine busy processing the query or is it idle?

> It is processing and in fact drives the load up a bit.

>> Are there any other queries that may be locking your tables (check
>> pg_locks)?

> I did check, and no, no locks.

Try to get the execution plan of the query when it takes long and
see how it compares to the plan when it runs ok.
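
Since the bad runs never finish, plain EXPLAIN (without ANALYZE) is
enough; it prints the plan the planner chooses without executing the
query:

explain
select distinct(id) from member left join xrefchannelmember xrcm on
id=xrcm.member_id where id not in (select memberid from xrefcampaignmember)
and xrcm.channel_id in (1) order by id asc;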

Yours,
Laurenz Albe

Re: odd intermittent query hanging issue

From: Steve Crawford
Date:
On 05/20/2012 07:45 AM, Aaron Burnett wrote:
> Hey Steve,
>
> Just wanted to send a note of thanks for pointing me in a few new
> directions on this.
>
> Turns out that the query would swap, but not all the time. When it
> swapped, it wouldn't finish; if it didn't swap, it would finish in the
> expected time. Not sure why this isn't consistent...
>
> The work_mem was set at 100MB so as a test I increased it to 200MB on the
> session and so far it has consistently finished every time it has run. I'm
> not sure why this has reared its head now as we run far more complex and
> data intensive queries than this one and they don't swap... so I guess I
> have some forensics to do.

You're welcome. Assuming that changing work_mem is a confirmed solution,
it sounds like the query is running right on the edge of running out of
work_mem, and when nudged over that threshold it starts to swap.
PostgreSQL, like most other things in computers, does not degrade
linearly: things work great till you run out of some resource, then
degrade dramatically.
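
One way to see that threshold effect directly (the work_mem value below is
deliberately tiny, just to provoke the slow plan shape): PostgreSQL only
uses the fast "hashed SubPlan" for the NOT IN when the subquery result is
estimated to fit in work_mem; otherwise it falls back to re-running the
subplan per row, which may be exactly the flip happening here.

-- artificially small work_mem to provoke the slow plan shape
set work_mem = '1MB';
explain  -- same query as above
select distinct(id) from member left join xrefchannelmember xrcm on
id=xrcm.member_id where id not in (select memberid from xrefcampaignmember)
and xrcm.channel_id in (1) order by id asc;
reset work_mem;
-- compare "Filter: (NOT (SubPlan 1))" vs. "Filter: (NOT (hashed SubPlan 1))"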

The work_mem setting specifies the memory available for internal sorts
and hash operations. Looking at the original explain plan you posted, I
see lots of such operations; perhaps your other queries don't need them:

explain plan:
------------------------------
 Unique  (cost=237234.66..239267.33 rows=406533 width=4) (actual time=4790.823..4922.621 rows=418843 loops=1)
   ->  Sort  (cost=237234.66..238251.00 rows=406533 width=4) (actual time=4790.818..4826.201 rows=418879 loops=1)
         Sort Key: member.id
         Sort Method: quicksort  Memory: 31923kB
         ->  Hash Join  (cost=167890.31..199359.99 rows=406533 width=4) (actual time=3357.406..4532.952 rows=418879 loops=1)
               Hash Cond: (xrcm.member_id = member.id)
               ->  Seq Scan on xrefchannelmember xrcm  (cost=0.00..19273.69 rows=813066 width=4) (actual time=0.015..219.259 rows=814421 loops=1)
                     Filter: (channel_id = 1)
               ->  Hash  (cost=162586.21..162586.21 rows=424328 width=4) (actual time=3357.001..3357.001 rows=444626 loops=1)
                     Buckets: 65536  Batches: 1  Memory Usage: 15632kB
                     ->  Seq Scan on member  (cost=66114.02..162586.21 rows=424328 width=4) (actual time=2357.280..3216.076 rows=444626 loops=1)
                           Filter: (NOT (hashed SubPlan 1))
                           SubPlan 1
                             ->  Seq Scan on xrefcampaignmember  (cost=0.00..57931.82 rows=3272882 width=4) (actual time=0.021..670.086 rows=3272870 loops=1)
 Total runtime: 4963.134 ms

I'm still bothered by a couple of things, but I don't have enough
understanding of internals to provide definitive advice:

1. Why does reindexing seem to temporarily solve the problem?... (one
cheap check: the index-size sketch below)

2. ...especially when I don't see indexes being used in the query?

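For question 1, on the guess that index bloat is involved (the index name
is taken from the table definition you posted), record the index size
right before and after the rebuild; if it shrinks dramatically each time,
bloat is at least part of the story:

-- if this shrinks dramatically on each rebuild, bloat is implicated
select pg_size_pretty(pg_relation_size('xcm_memberid_idx'));
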
I would definitely run explain on some of the alternative versions of
the query posted earlier to see if they do a better job of using
available indexes or reducing the amount of internal hashing/sorting. If
they do, you will have some "ammo" to take back to your developer.
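
For example, for the last variant suggested earlier:

explain analyze
select distinct(id) from member
where exists (select 1 from xrefchannelmember where member_id = member.id
and channel_id=1)
and not exists (select 1 from xrefcampaignmember where memberid = member.id)
order by id asc;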

Cheers,
Steve