Interesting Query Performance Question

Поиск
Список
Период
Сортировка
От Jonathan Hoover
Тема Interesting Query Performance Question
Дата
Msg-id 68C107706843A54BA7D9CAEF123ADF05010346CAE6@SP2-EX07VS03.ds.corp.yahoo.com
обсуждение исходный текст
Ответы Re: Interesting Query Performance Question
Список pgsql-admin
Any explanation for this is appreciated...

I had a query which just seemed to peg the processor. I decided to let it go all night, and after 16 hours, it was
stillstuck at 100% processor usage. I figured something must be wrong, so I tried a few things. Using the following
logic,I was able to get the results I was looking for in about 1 minute (instead of 16+ hours of no result). It all
camedown to getting around the use of "NOT IN (subquery)" where "subquery" has 20,000,000 rows. "IN (subquery)" was
quickas could be. Here's what I found.  

Tables:
------------------------------
|       a     |      b       |
--------------|---------------
|guid vc(32)  | guid vc(32)  |
|unitid vc(15)| unitid vc(15)|
|ftime ts     |              |
|source vc(10)|              |
------------------------------
|65,000,000 rw| 500,000 rows |
------------------------------

Here is the query that hung up for 16 hours:

A:
SELECT unitid,count(guid) AS total FROM b
WHERE guid NOT IN (SELECT DISTINCT guid FROM a) AND
 unitid IN (SELECT DISTINCT unitid FROM a)
GROUP BY unitid
ORDER BY total DESC;

I thought I'd narrow down a.guid and a.unitid and substitute these for the subqueries.
B: SELECT DISTINCT unitid INTO a_u_unitid FROM a; (     1,082 rows)
C: SELECT DISTINCT guid   INTO a_u_guid   FROM a; (20,000,000 rows)

That didn't help on its own. So I decided to try the query without the "NOT". It finished in a minute or two. Granted,
theresult is the opposite of what I wanted, but at least I knew the major source of the problem. I still had the
substitutedqueries B & C in, so it looked like this: 

D:
SELECT unitid,count(guid) AS total FROM b
WHERE guid IN (SELECT guid FROM a_u_guid) AND
 unitid IN (SELECT unitid FROM a_u_unitid)
GROUP BY unitid
ORDER BY total DESC;

Next, I decided to just get my "NOT" set of guid's from b itself. If my logic isn't correct here, someone let me know.
Idid this, and got results very quickly (a minute perhaps): 

E:
SELECT unitid,count(guid) AS total FROM b
WHERE guid NOT IN (SELECT DISTINCT guid FROM b WHERE guid IN (SELECT guid FROM a_u_guid)) AND
 unitid IN (SELECT unitid FROM a_u_unitid)
GROUP BY unitid
ORDER BY total DESC;

Keep in mind a has 65,000,000 rows and b only has 500,000. The above query (E) gave me the original results I was
lookingfor (E gives the expected results from A, if A had ever finished, assuming my logic is correct). I then decided
toget rid of the intermediate tables B & C, so I was back to: 

F:
SELECT unitid,count(guid) AS total FROM b
WHERE guid NOT IN (SELECT DISTINCT guid FROM b WHERE guid IN (SELECT DISTINCT guid FROM a) AND
 Unitid IN (SELECT DISTINCT unitid FROM a)
GROUP BY unitid
ORDER BY total DESC;

While F was running, I started typing this email. It still has not finished, but I have a feeling it will within
anotherhalf hour or so. What I am taking away from this is that the use of "NOT IN (subquery)" where subquery has a
largenumber of results will literally shut down performance and never give a result. Further, I am taking away that
creatingseparate tables with just the DISTINCT values from a table with millions of rows also saves time. The problem
herebeing that the separate table is not auto-updating when values in the master table change. 

Query E then is apparently the way to go, but shouldn't there be a way to get the query planner to take these steps
itself?If A had ever finished, I'd sure like to have seen an EXPLAIN ANALYZE on it. If F does finish, I made run an
EXPLAINANALYZE on it versus E (I kind of expected F to finish while writing this email). 

Thanks,
Jon

В списке pgsql-admin по дате отправления:

Предыдущее
От: "Kevin Grittner"
Дата:
Сообщение: Re: Get Postgres to use multiple proc cores?
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Interesting Query Performance Question