Обсуждение: Interesting Query Performance Question

Поиск
Список
Период
Сортировка

Interesting Query Performance Question

От
"Jonathan Hoover"
Дата:
Any explanation for this is appreciated...

I had a query which just seemed to peg the processor. I decided to let it go all night, and after 16 hours, it was
stillstuck at 100% processor usage. I figured something must be wrong, so I tried a few things. Using the following
logic,I was able to get the results I was looking for in about 1 minute (instead of 16+ hours of no result). It all
camedown to getting around the use of "NOT IN (subquery)" where "subquery" has 20,000,000 rows. "IN (subquery)" was
quickas could be. Here's what I found.  

Tables:
------------------------------
|       a     |      b       |
--------------|---------------
|guid vc(32)  | guid vc(32)  |
|unitid vc(15)| unitid vc(15)|
|ftime ts     |              |
|source vc(10)|              |
------------------------------
|65,000,000 rw| 500,000 rows |
------------------------------

Here is the query that hung up for 16 hours:

A:
SELECT unitid,count(guid) AS total FROM b
WHERE guid NOT IN (SELECT DISTINCT guid FROM a) AND
 unitid IN (SELECT DISTINCT unitid FROM a)
GROUP BY unitid
ORDER BY total DESC;

I thought I'd narrow down a.guid and a.unitid and substitute these for the subqueries.
B: SELECT DISTINCT unitid INTO a_u_unitid FROM a; (     1,082 rows)
C: SELECT DISTINCT guid   INTO a_u_guid   FROM a; (20,000,000 rows)

That didn't help on its own. So I decided to try the query without the "NOT". It finished in a minute or two. Granted,
theresult is the opposite of what I wanted, but at least I knew the major source of the problem. I still had the
substitutedqueries B & C in, so it looked like this: 

D:
SELECT unitid,count(guid) AS total FROM b
WHERE guid IN (SELECT guid FROM a_u_guid) AND
 unitid IN (SELECT unitid FROM a_u_unitid)
GROUP BY unitid
ORDER BY total DESC;

Next, I decided to just get my "NOT" set of guid's from b itself. If my logic isn't correct here, someone let me know.
Idid this, and got results very quickly (a minute perhaps): 

E:
SELECT unitid,count(guid) AS total FROM b
WHERE guid NOT IN (SELECT DISTINCT guid FROM b WHERE guid IN (SELECT guid FROM a_u_guid)) AND
 unitid IN (SELECT unitid FROM a_u_unitid)
GROUP BY unitid
ORDER BY total DESC;

Keep in mind a has 65,000,000 rows and b only has 500,000. The above query (E) gave me the original results I was
lookingfor (E gives the expected results from A, if A had ever finished, assuming my logic is correct). I then decided
toget rid of the intermediate tables B & C, so I was back to: 

F:
SELECT unitid,count(guid) AS total FROM b
WHERE guid NOT IN (SELECT DISTINCT guid FROM b WHERE guid IN (SELECT DISTINCT guid FROM a) AND
 Unitid IN (SELECT DISTINCT unitid FROM a)
GROUP BY unitid
ORDER BY total DESC;

While F was running, I started typing this email. It still has not finished, but I have a feeling it will within
anotherhalf hour or so. What I am taking away from this is that the use of "NOT IN (subquery)" where subquery has a
largenumber of results will literally shut down performance and never give a result. Further, I am taking away that
creatingseparate tables with just the DISTINCT values from a table with millions of rows also saves time. The problem
herebeing that the separate table is not auto-updating when values in the master table change. 

Query E then is apparently the way to go, but shouldn't there be a way to get the query planner to take these steps
itself?If A had ever finished, I'd sure like to have seen an EXPLAIN ANALYZE on it. If F does finish, I made run an
EXPLAINANALYZE on it versus E (I kind of expected F to finish while writing this email). 

Thanks,
Jon

Re: Interesting Query Performance Question

От
Tom Lane
Дата:
"Jonathan  Hoover" <jhoover@yahoo-inc.com> writes:
> [ poor performance with NOT IN ]

> Query E then is apparently the way to go, but shouldn't there be a way
> to get the query planner to take these steps itself? If A had ever
> finished, I'd sure like to have seen an EXPLAIN ANALYZE on it.

Well, just an EXPLAIN would have told you what the plan was like.
What I suspect was happening was that your manipulations of the query
altered the planner's estimate of the number of rows in the NOT IN's
subquery, causing it to pick (or not) a hash-table-based implementation
of NOT IN.  The hashed approach is a lot faster but requires the
subquery's output to fit in work_mem.

In general, NOT IN is hard to optimize because of its weird behavior
for NULLs.  I'd suggest looking into converting the query to use an
EXISTS instead.

            regards, tom lane

Re: Interesting Query Performance Question

От
David W Noon
Дата:
On Sat, 06 Nov 2010 11:59:48 -0400, Tom Lane wrote about Re: [ADMIN]
Interesting Query Performance Question:

>In general, NOT IN is hard to optimize because of its weird behavior
>for NULLs.  I'd suggest looking into converting the query to use an
>EXISTS instead.

I have know about that issue for some years, so I always use a column
that has been declared NOT NULL for the sub-query.  Does PostgreSQL not
use that property to bypass the vagaries of NULL in an IN predicate?
--
Regards,

Dave  [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
dwnoon@ntlworld.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

Вложения

Re: Interesting Query Performance Question

От
Tom Lane
Дата:
David W Noon <dwnoon@ntlworld.com> writes:
> On Sat, 06 Nov 2010 11:59:48 -0400, Tom Lane wrote about Re: [ADMIN]
>> In general, NOT IN is hard to optimize because of its weird behavior
>> for NULLs.  I'd suggest looking into converting the query to use an
>> EXISTS instead.

> I have know about that issue for some years, so I always use a column
> that has been declared NOT NULL for the sub-query.  Does PostgreSQL not
> use that property to bypass the vagaries of NULL in an IN predicate?

No.  It's a reasonable TODO item though.  Up till fairly recently, we
didn't have infrastructure that would allow building plans that depended
for correctness on the existence of a constraint.  (In particular,
I'd want such a plan to get thrown away and redone if someone dropped
the NOT NULL constraint.)  We have most of that now, although it only
works for constraints that have entries in pg_constraint, which NOT NULL
doesn't yet.  Maybe in another few months ...

            regards, tom lane