Any explanation for this is appreciated...
I had a query which just seemed to peg the processor. I decided to let it go all night, and after 16 hours, it was
stillstuck at 100% processor usage. I figured something must be wrong, so I tried a few things. Using the following
logic,I was able to get the results I was looking for in about 1 minute (instead of 16+ hours of no result). It all
camedown to getting around the use of "NOT IN (subquery)" where "subquery" has 20,000,000 rows. "IN (subquery)" was
quickas could be. Here's what I found.
Tables:
------------------------------
| a | b |
--------------|---------------
|guid vc(32) | guid vc(32) |
|unitid vc(15)| unitid vc(15)|
|ftime ts | |
|source vc(10)| |
------------------------------
|65,000,000 rw| 500,000 rows |
------------------------------
Here is the query that hung up for 16 hours:
A:
SELECT unitid,count(guid) AS total FROM b
WHERE guid NOT IN (SELECT DISTINCT guid FROM a) AND
unitid IN (SELECT DISTINCT unitid FROM a)
GROUP BY unitid
ORDER BY total DESC;
I thought I'd narrow down a.guid and a.unitid and substitute these for the subqueries.
B: SELECT DISTINCT unitid INTO a_u_unitid FROM a; ( 1,082 rows)
C: SELECT DISTINCT guid INTO a_u_guid FROM a; (20,000,000 rows)
That didn't help on its own. So I decided to try the query without the "NOT". It finished in a minute or two. Granted,
theresult is the opposite of what I wanted, but at least I knew the major source of the problem. I still had the
substitutedqueries B & C in, so it looked like this:
D:
SELECT unitid,count(guid) AS total FROM b
WHERE guid IN (SELECT guid FROM a_u_guid) AND
unitid IN (SELECT unitid FROM a_u_unitid)
GROUP BY unitid
ORDER BY total DESC;
Next, I decided to just get my "NOT" set of guid's from b itself. If my logic isn't correct here, someone let me know.
Idid this, and got results very quickly (a minute perhaps):
E:
SELECT unitid,count(guid) AS total FROM b
WHERE guid NOT IN (SELECT DISTINCT guid FROM b WHERE guid IN (SELECT guid FROM a_u_guid)) AND
unitid IN (SELECT unitid FROM a_u_unitid)
GROUP BY unitid
ORDER BY total DESC;
Keep in mind a has 65,000,000 rows and b only has 500,000. The above query (E) gave me the original results I was
lookingfor (E gives the expected results from A, if A had ever finished, assuming my logic is correct). I then decided
toget rid of the intermediate tables B & C, so I was back to:
F:
SELECT unitid,count(guid) AS total FROM b
WHERE guid NOT IN (SELECT DISTINCT guid FROM b WHERE guid IN (SELECT DISTINCT guid FROM a) AND
Unitid IN (SELECT DISTINCT unitid FROM a)
GROUP BY unitid
ORDER BY total DESC;
While F was running, I started typing this email. It still has not finished, but I have a feeling it will within
anotherhalf hour or so. What I am taking away from this is that the use of "NOT IN (subquery)" where subquery has a
largenumber of results will literally shut down performance and never give a result. Further, I am taking away that
creatingseparate tables with just the DISTINCT values from a table with millions of rows also saves time. The problem
herebeing that the separate table is not auto-updating when values in the master table change.
Query E then is apparently the way to go, but shouldn't there be a way to get the query planner to take these steps
itself?If A had ever finished, I'd sure like to have seen an EXPLAIN ANALYZE on it. If F does finish, I made run an
EXPLAINANALYZE on it versus E (I kind of expected F to finish while writing this email).
Thanks,
Jon