Обсуждение: "SELECT .. WHERE NOT IN" query running for hours
Hi all,
I'm using postgres 8.4.2 on a Ubuntu Linux machine.
I have several tables, one of which is named Document, which of course
represents information I need about my documents. I also have another
table, similar to the first one, called Doc2. The schema of both tables is
the following:
CREATE TABLE "Document"
(
docid integer NOT NULL DEFAULT nextval('doc_id_seq'::regclass),
hwdocid character varying(511) NOT NULL,
pubdate bigint,
finished boolean DEFAULT false,
"location" character varying(200),
title tsvector,
description tsvector,
"content" text,
CONSTRAINT pk_docid PRIMARY KEY (docid),
CONSTRAINT hwdocid_uniq UNIQUE (hwdocid)
)
WITH (
OIDS=FALSE
);
(
docid integer NOT NULL DEFAULT nextval('doc_id_seq'::regclass),
hwdocid character varying(511) NOT NULL,
pubdate bigint,
finished boolean DEFAULT false,
"location" character varying(200),
title tsvector,
description tsvector,
"content" text,
CONSTRAINT pk_docid PRIMARY KEY (docid),
CONSTRAINT hwdocid_uniq UNIQUE (hwdocid)
)
WITH (
OIDS=FALSE
);
The hwdocid in this occasion is no longer than 12 characters. The reason for
being 511 max, is because the same schema is used by other applications.
What i wish to do is dump contents from Doc2 to Document, provided that
the hwdocid from Doc2 is not present in Document (as the entries will be
similar). Doc2 contains ~100000 rows while Document contains ~1000000.
Now, I wrote a simple query to do this, which is the following:
INSERT INTO "Document" ( hwdocid, pubdate, finished, "location", title, description, "content" )
SELECT hwdocid, pubdate, finished, "location", title, description, "content"
FROM "Doc2" d2
WHERE d2.hwdocid NOT IN (
SELECT d.hwdocid
FROM "Document" d
)
SELECT hwdocid, pubdate, finished, "location", title, description, "content"
FROM "Doc2" d2
WHERE d2.hwdocid NOT IN (
SELECT d.hwdocid
FROM "Document" d
)
After running for about half an hour in pgadmin3, I stopped the execution, since I saw that
what I was doing was pretty dumb, as with every insert the Document would increase (and I
know beforehand that data from Doc2 contain unique hwdocid values). At first I thought that each
INSERT creates a new transaction, which is why it was taking so long. So I though I should do
something else..
So, I though that I should dump the documents I want to a temp table and then simply insert them in
the Document table. Before that, I wanted to see however, how many documents I was trying to
insert (as an indication of why it took so long). So I simply did the select part for those documents.
SELECT *
FROM "Doc2" d2
WHERE d2.hwdocid NOT IN (
SELECT d.hwdocid
FROM "Document" d
)
FROM "Doc2" d2
WHERE d2.hwdocid NOT IN (
SELECT d.hwdocid
FROM "Document" d
)
I submitted the query again and let it run. After running for 5 hours, I stopped the query and submitted
the "explain query". After running for ~10 minutes, I also stopped the query explanation phase. So I
re-wrote the query as:
SELECT hwdocid, pubdate, finished, "location", title, description, "content"
FROM "Doc2" d2
WHERE NOT EXISTS (
SELECT d.hwdocid
FROM "Document" d
FROM "Doc2" d2
WHERE NOT EXISTS (
SELECT d.hwdocid
FROM "Document" d
WHERE d.hwdocid = d2.hwdocid
)
)
and asked for the explanation, which was:
Hash Anti Join (cost=72484.24..90988.89 rows=1 width=317) (actual time=3815.471..9063.184 rows=63836 loops=1)
Hash Cond: ((d2.hwdocid)::text = (d.hwdocid)::text)
-> Seq Scan on "Doc2" d2 (cost=0.00..5142.54 rows=96454 width=317) (actual time=0.016..186.781 rows=96454 loops=1)
-> Hash (cost=56435.22..56435.22 rows=949922 width=12) (actual time=3814.968..3814.968 rows=948336 loops=1)
-> Seq Scan on "Document" d (cost=0.00..56435.22 rows=949922 width=12) (actual time=0.008..1926.191 rows=948336 loops=1)
Total runtime: 9159.050 ms
I then submitted it normally and got a result back in ~5-6 seconds.
Hash Cond: ((d2.hwdocid)::text = (d.hwdocid)::text)
-> Seq Scan on "Doc2" d2 (cost=0.00..5142.54 rows=96454 width=317) (actual time=0.016..186.781 rows=96454 loops=1)
-> Hash (cost=56435.22..56435.22 rows=949922 width=12) (actual time=3814.968..3814.968 rows=948336 loops=1)
-> Seq Scan on "Document" d (cost=0.00..56435.22 rows=949922 width=12) (actual time=0.008..1926.191 rows=948336 loops=1)
Total runtime: 9159.050 ms
I then submitted it normally and got a result back in ~5-6 seconds.
So my questions are:
1) Why is it taking *so* long for the first query (with the "NOT IN" ) to do even the simple select?
2) The result between the two queries should be the same. Since I am not even returned an explanation, could someone
make a (wild) guess on what is the "NOT IN" statement doing (trying to do) that is taking so long?
3) My intuition would be that, since there exists a unique constraint on hwdocid, which implies the existence of an index,
this index would be used. Isn't that so? I mean, since it is a unique field, shouldn't it just do a sequential scan on Doc2
and then simply query the index if the value exists? What am I getting wrong?
Thank you very much in advance!
Regards,
George Valkanas
On 1/6/2011 9:36 PM, Γιωργος Βαλκανας wrote: > > 1) Why is it taking *so* long for the first query (with the "NOT IN" ) > to do even the simple select? Because NOT IN has to execute the correlated subquery for every row and then check whether the requested value is in the result set, usually by doing sequential comparison. The NOT EXIST plan is also bad because there is no index but at least it can use very fast and efficient hash algorithm. Indexing the "hwdocid" column on the "Document" table or, ideally, making it a primary key, should provide an additional boost to your query. If you already do have an index, you may consider using enable_seqscan=false for this session, so that the "hwdocid" index will be used. It's a common wisdom that in the most cases NOT EXISTS will beat NOT IN. That is so all over the database world. I've seen that in Oracle applications, MS SQL applications and, of course MySQL applications. Optimizing queries is far from trivial. Μλαδεν Γογαλα -- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 www.vmsinfo.com
Fair enough!
I also turned seqscan off, so the new plan (for the NOT EXISTS) is:
Merge Anti Join (cost=0.00..212686.89 rows=1 width=313) (actual time=0.426..14921.344 rows=63836 loops=1)
Merge Cond: ((d2.hwdocid)::text = (d.hwdocid)::text)
-> Index Scan using hwdocid2_uniq on "Doc2" d2 (cost=0.00..19442.87 rows=96454 width=313) (actual time=0.130..1248.783 rows=96454 loops=1)
-> Index Scan using hwdocid_uniq on "Document" d (cost=0.00..189665.17 rows=949272 width=12) (actual time=0.085..11158.740 rows=948336 loops=1)
Total runtime: 15062.925 ms
Merge Cond: ((d2.hwdocid)::text = (d.hwdocid)::text)
-> Index Scan using hwdocid2_uniq on "Doc2" d2 (cost=0.00..19442.87 rows=96454 width=313) (actual time=0.130..1248.783 rows=96454 loops=1)
-> Index Scan using hwdocid_uniq on "Document" d (cost=0.00..189665.17 rows=949272 width=12) (actual time=0.085..11158.740 rows=948336 loops=1)
Total runtime: 15062.925 ms
Hmm.. doesn't really seem to be such a great boost on performance. But i guess I'll be sticking to this one.
So my follow-up question on the subject is this:
Are there any particular semantics for the "NOT IN" statement that cause the correlated query to execute for every row of the outter query, as opposed to the "NOT EXISTS" ? Or are there any other practical reasons, related to "IN / NOT IN", for this to be happening? Or is it simply due to implementation details of each RDBMS? I guess the former (or the 2nd one), since, as you say, this is common in most databases, but I would most appreciate an answer to clarify this.
Thanks again!
Best regards,
George
2011/1/7 Mladen Gogala <mladen.gogala@vmsinfo.com>
On 1/6/2011 9:36 PM, Γιωργος Βαλκανας wrote:Because NOT IN has to execute the correlated subquery for every row and then check whether the requested value is in the result set, usually by doing sequential comparison. The NOT EXIST plan is also bad because there is no index but at least it can use very fast and efficient hash algorithm. Indexing the "hwdocid" column on the "Document" table or, ideally, making it a primary key, should provide an additional boost to your query. If you already do have an index, you may consider using enable_seqscan=false for this session, so that the "hwdocid" index will be used. It's a common wisdom that in the most cases NOT EXISTS will beat NOT IN. That is so all over the database world. I've seen that in Oracle applications, MS SQL applications and, of course MySQL applications. Optimizing queries is far from trivial.
1) Why is it taking *so* long for the first query (with the "NOT IN" ) to do even the simple select?
Μλαδεν Γογαλα
--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com
Γιωργος Βαλκανας wrote: > > Are there any particular semantics for the "NOT IN" statement that > cause the correlated query to execute for every row of the outter > query, as opposed to the "NOT EXISTS" ? Or are there any other > practical reasons, related to "IN / NOT IN", for this to be happening? > Or is it simply due to implementation details of each RDBMS? I guess > the former (or the 2nd one), since, as you say, this is common in most > databases, but I would most appreciate an answer to clarify this. > > Thanks again! > > Best regards, > George > > > Well, I really hoped that Bruce, Robert or Greg would take on this one, but since there are no more qualified takers, I'll take a shot at this one. For the "NOT IN (result of a correlated sub-query)", the sub-query needs to be executed for every row matching the conditions on the driving table, while the !EXISTS is just a complement of join. It's all in the basic set theory which serves as a model for the relational databases. -- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 http://www.vmsinfo.com The Leader in Integrated Media Intelligence Solutions
On 1/7/11 1:29 AM, "??????? ????????" <lebiathan@gmail.com> wrote: > >So my follow-up question on the subject is this: > >Are there any particular semantics for the "NOT IN" statement that cause >the correlated query to execute for every row of the outter query, as >opposed to the "NOT EXISTS" ? => select * from explode_array(ARRAY[1,2,3,4,5]) where explode_array not in (0, 1, 2); explode_array --------------- 3 4 5 => select * from explode_array(ARRAY[1,2,3,4,5]) where explode_array not in (0, 1, 2, null); explode_array --------------- (0 rows) The existence of a single NULL in the "not in" segment causes no results to be returned. Postgres isn't smart enough to analyze whether the contents of the NOT IN() can contain null and choose a more optimal plan, and so it always scans ALL rows. Even if the NOT IN() is on a not null primary key. NOT IN is generally dangerous because of this behavior -- it results from the fact that '1 = null' is null, and 'not null' is equal to 'null': => select (1 = 1); ?column? ---------- t select NOT (1 = 1); ?column? ---------- f => select (1 = null); ?column? ---------- (1 row) => select NOT (1 = null); ?column? ---------- (1 row) NOT EXISTS doesn't have this problem, since NOT EXISTS essentially treats the existence of null as false, where NOT IN treats the existence of null as true. rr=> select * from (select * from explode_array(ARRAY[1,2,3,4,5])) foo where not exists (select 1 where explode_array in (0, 1, 2, null)); explode_array --------------- 3 4 5 (3 rows) Often, the best query plans result from 'LEFT JOIN WHERE right side is NULL' rather than NOT EXISTS however. I often get performance gains by switching NOT EXISTS queries to LEFT JOIN form. Though sometimes it is less performant.
Scott Carey <scott@richrelevance.com> wrote: > Often, the best query plans result from 'LEFT JOIN WHERE right > side is NULL' rather than NOT EXISTS however. I often get > performance gains by switching NOT EXISTS queries to LEFT JOIN > form. Even in 8.4 and later? I would think that the anti-join that Tom added in 8.4 would always perform at least as well as the LEFT JOIN technique you describe. -Kevin
On 1/10/11 12:37 PM, "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: >Scott Carey <scott@richrelevance.com> wrote: > >> Often, the best query plans result from 'LEFT JOIN WHERE right >> side is NULL' rather than NOT EXISTS however. I often get >> performance gains by switching NOT EXISTS queries to LEFT JOIN >> form. > >Even in 8.4 and later? I would think that the anti-join that Tom >added in 8.4 would always perform at least as well as the LEFT JOIN >technique you describe. > >-Kevin Yes, in 8.4. The query planner definitely does not treat the two as equivalent. I don't have a concrete example at hand right now, but I've been working exclusively on 8.4 since a month after it was released. It does often use an anti-join for NOT EXISTS, but does not seem to explore all avenues there. Or perhaps the statistics it has differ for some reason at that point. All I know, is that the resulting query plan differs sometimes and I'd say 3 out of 4 times the LEFT JOIN variant is more optimal when they differ. >
2011/1/10 Mladen Gogala <mladen.gogala@vmsinfo.com>: > Well, I really hoped that Bruce, Robert or Greg would take on this one, but > since there are no more qualified takers, I'll take a shot at this one. For > the "NOT IN (result of a correlated sub-query)", the sub-query needs to be > executed for every row matching the conditions on the driving table, while > the !EXISTS is just a complement of join. It's all in the basic set theory > which serves as a model for the relational databases. As Scott says, the real problem is the NULL handling. The semantics are otherwise similar. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company