Обсуждение: "SELECT .. WHERE NOT IN" query running for hours

Поиск
Список
Период
Сортировка

"SELECT .. WHERE NOT IN" query running for hours

От
Γιωργος Βαλκανας
Дата:
Hi all,

I'm using postgres 8.4.2 on a Ubuntu Linux machine.

I have several tables, one of which is named Document, which of course
represents information I need about my documents. I also have another
table, similar to the first one, called Doc2. The schema of both tables is
the following:

CREATE TABLE "Document"
(
docid integer NOT NULL DEFAULT nextval('doc_id_seq'::regclass),
hwdocid character varying(511) NOT NULL,
pubdate bigint,
finished boolean DEFAULT false,
"location" character varying(200),
title tsvector,
description tsvector,
"content" text,
CONSTRAINT pk_docid PRIMARY KEY (docid),
CONSTRAINT hwdocid_uniq UNIQUE (hwdocid)
)
WITH (
OIDS=FALSE
);

The hwdocid in this occasion is no longer than 12 characters. The reason for
being 511 max, is because the same schema is used by other applications.

What i wish to do is dump contents from Doc2 to Document, provided that
the hwdocid from Doc2 is not present in Document (as the entries will be
similar). Doc2 contains ~100000 rows while Document contains  ~1000000.

Now, I wrote a simple query to do this, which is the following:

INSERT INTO "Document" ( hwdocid, pubdate, finished, "location", title, description, "content" ) 
SELECT hwdocid, pubdate, finished, "location", title, description, "content"
FROM "Doc2" d2
WHERE d2.hwdocid NOT IN (
     SELECT d.hwdocid 
     FROM "Document" d
)

After running for about half an hour in pgadmin3, I stopped the execution, since I saw that
what I was doing was pretty dumb, as with every insert the Document would increase (and I
know beforehand that data from Doc2 contain unique hwdocid values). At first I thought that each
INSERT creates a new transaction, which is why it was taking so long. So I though I should do
something else..

So, I though that I should dump the documents I want to a temp table and then simply insert them in
the Document table. Before that, I wanted to see however, how many documents I was trying to
insert (as an indication of why it took so long). So I simply did the select part for those documents.

SELECT *
FROM "Doc2" d2
WHERE d2.hwdocid NOT IN (
     SELECT d.hwdocid 
     FROM "Document" d
)

I submitted the query again and let it run. After running for 5 hours, I stopped the query and submitted
the "explain query". After running for ~10 minutes, I also stopped the query explanation phase. So I
re-wrote the query as:

SELECT hwdocid, pubdate, finished, "location", title, description, "content"
FROM "Doc2" d2
WHERE NOT EXISTS (
     SELECT d.hwdocid 
     FROM "Document" d
     WHERE d.hwdocid = d2.hwdocid
)

and asked for the explanation, which was: 

Hash Anti Join (cost=72484.24..90988.89 rows=1 width=317) (actual time=3815.471..9063.184 rows=63836 loops=1)
 Hash Cond: ((d2.hwdocid)::text = (d.hwdocid)::text)
 -> Seq Scan on "Doc2" d2 (cost=0.00..5142.54 rows=96454 width=317) (actual time=0.016..186.781 rows=96454 loops=1)
 -> Hash (cost=56435.22..56435.22 rows=949922 width=12) (actual time=3814.968..3814.968 rows=948336 loops=1)
 -> Seq Scan on "Document" d (cost=0.00..56435.22 rows=949922 width=12) (actual time=0.008..1926.191 rows=948336 loops=1)
Total runtime: 9159.050 ms

I then submitted it normally and got a result back in ~5-6 seconds.

So my questions are:

1) Why is it taking *so* long for the first query (with the "NOT IN" ) to do even the simple select?

2) The result between the two queries should be the same. Since I am not even returned an explanation, could someone
make a (wild) guess on what is the "NOT IN" statement doing (trying to do) that is taking so long?

3) My intuition would be that, since there exists a unique constraint on hwdocid, which implies the existence of an index,
this index would be used. Isn't that so? I mean, since it is a unique field, shouldn't it just do a sequential scan on Doc2
and then simply query the index if the value exists? What am I getting wrong?

Thank you very much in advance!

Regards,
George Valkanas

Re: "SELECT .. WHERE NOT IN" query running for hours

От
Mladen Gogala
Дата:
On 1/6/2011 9:36 PM, Γιωργος Βαλκανας wrote:
>
> 1) Why is it taking *so* long for the first query (with the "NOT IN" )
> to do even the simple select?
Because NOT IN has to execute the correlated subquery for every row and
then check whether the requested value is in the result set, usually by
doing sequential comparison. The NOT EXIST plan is also bad because
there is no index but at least it can use very fast and efficient hash
algorithm. Indexing the "hwdocid" column on the "Document" table or,
ideally, making it a primary key, should provide an additional boost to
your query. If you already do have an index, you may consider using
enable_seqscan=false for this session, so that the "hwdocid" index will
be used. It's a common wisdom that in the most cases NOT EXISTS will
beat NOT IN. That is so all over the database world. I've seen that in
Oracle applications, MS SQL applications and, of course MySQL
applications. Optimizing queries is far from trivial.

Μλαδεν Γογαλα

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: "SELECT .. WHERE NOT IN" query running for hours

От
Γιωργος Βαλκανας
Дата:
Fair enough!

I also turned seqscan off, so the new plan (for the NOT EXISTS) is:

Merge Anti Join (cost=0.00..212686.89 rows=1 width=313) (actual time=0.426..14921.344 rows=63836 loops=1)
 Merge Cond: ((d2.hwdocid)::text = (d.hwdocid)::text)
 -> Index Scan using hwdocid2_uniq on "Doc2" d2 (cost=0.00..19442.87 rows=96454 width=313) (actual time=0.130..1248.783 rows=96454 loops=1)
 -> Index Scan using hwdocid_uniq on "Document" d (cost=0.00..189665.17 rows=949272 width=12) (actual time=0.085..11158.740 rows=948336 loops=1)
Total runtime: 15062.925 ms

Hmm.. doesn't really seem to be such a great boost on performance. But i guess I'll be sticking to this one.

So my follow-up question on the subject is this:

Are there any particular semantics for the "NOT IN" statement that cause the correlated query to execute for every row of the outter query, as opposed to the "NOT EXISTS" ? Or are there any other practical reasons, related to "IN / NOT IN", for this to be happening? Or is it simply due to implementation details of each RDBMS? I guess the former (or the 2nd one), since, as you say, this is common in most databases, but I would most appreciate an answer to clarify this.

Thanks again!

Best regards,
George



2011/1/7 Mladen Gogala <mladen.gogala@vmsinfo.com>
On 1/6/2011 9:36 PM, Γιωργος Βαλκανας wrote:

1) Why is it taking *so* long for the first query (with the "NOT IN" ) to do even the simple select?
Because NOT IN has to execute the correlated subquery for every row and then check whether the requested value is in the result set, usually by doing sequential comparison. The NOT EXIST plan is also bad because there is no index but at least it can use very fast and efficient hash algorithm. Indexing the "hwdocid" column on the "Document" table or, ideally, making it a primary key, should provide an additional boost to your query. If you already do have an index, you may consider using enable_seqscan=false for this session, so that the "hwdocid" index will be used. It's a common wisdom that in the most cases NOT EXISTS will beat NOT IN. That is so all over the database world. I've seen that in Oracle applications, MS SQL applications and, of course MySQL applications. Optimizing queries is far from trivial.

Μλαδεν Γογαλα

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: "SELECT .. WHERE NOT IN" query running for hours

От
Mladen Gogala
Дата:
Γιωργος Βαλκανας wrote:
>
> Are there any particular semantics for the "NOT IN" statement that
> cause the correlated query to execute for every row of the outter
> query, as opposed to the "NOT EXISTS" ? Or are there any other
> practical reasons, related to "IN / NOT IN", for this to be happening?
> Or is it simply due to implementation details of each RDBMS? I guess
> the former (or the 2nd one), since, as you say, this is common in most
> databases, but I would most appreciate an answer to clarify this.
>
> Thanks again!
>
> Best regards,
> George
>
>
>

Well, I really hoped that Bruce, Robert or Greg would take on this one,
but since there are no more qualified takers, I'll take a shot at this
one. For the "NOT IN (result of a correlated sub-query)", the sub-query
needs to be executed for every row matching the conditions on the
driving table, while the   !EXISTS is just a complement of join. It's
all in the basic set theory which serves as a model for the relational
databases.

--

Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions




Re: "SELECT .. WHERE NOT IN" query running for hours

От
Scott Carey
Дата:

On 1/7/11 1:29 AM, "??????? ????????" <lebiathan@gmail.com> wrote:


>
>So my follow-up question on the subject is this:
>
>Are there any particular semantics for the "NOT IN" statement that cause
>the correlated query to execute for every row of the outter query, as
>opposed to the "NOT EXISTS" ?


=> select * from explode_array(ARRAY[1,2,3,4,5]) where explode_array not
in (0, 1, 2);
 explode_array
---------------
             3
             4
             5



=> select * from explode_array(ARRAY[1,2,3,4,5]) where explode_array not
in (0, 1, 2, null);
 explode_array
---------------
(0 rows)



The existence of a single NULL in the "not in" segment causes no results
to be returned.  Postgres isn't smart enough to analyze whether the
contents of the NOT IN() can contain null and choose a more optimal plan,
and so it always scans ALL rows.  Even if the NOT IN() is on a not null
primary key.   NOT IN is generally dangerous because of this behavior --
it results from the fact that '1 = null' is null, and 'not null' is equal
to  'null':

=> select (1 = 1);
 ?column?
----------
 t



select NOT (1 = 1);
 ?column?
----------
 f


=> select (1 = null);
 ?column?
----------

(1 row)


=> select NOT (1 = null);
 ?column?
----------

(1 row)





NOT EXISTS doesn't have this problem, since NOT EXISTS essentially treats
the existence of null as false, where NOT IN treats the existence of null
as true.

rr=> select * from (select * from explode_array(ARRAY[1,2,3,4,5])) foo
where not exists (select 1  where explode_array in (0, 1, 2, null));
 explode_array
---------------
             3
             4
             5
(3 rows)



Often, the best query plans result from 'LEFT JOIN WHERE right side is
NULL' rather than NOT EXISTS however.  I often get performance gains by
switching NOT EXISTS queries to LEFT JOIN form.  Though sometimes it is
less performant.






Re: "SELECT .. WHERE NOT IN" query running for hours

От
"Kevin Grittner"
Дата:
Scott Carey <scott@richrelevance.com> wrote:

> Often, the best query plans result from 'LEFT JOIN WHERE right
> side is NULL' rather than NOT EXISTS however.  I often get
> performance gains by switching NOT EXISTS queries to LEFT JOIN
> form.

Even in 8.4 and later?  I would think that the anti-join that Tom
added in 8.4 would always perform at least as well as the LEFT JOIN
technique you describe.

-Kevin

Re: "SELECT .. WHERE NOT IN" query running for hours

От
Scott Carey
Дата:

On 1/10/11 12:37 PM, "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

>Scott Carey <scott@richrelevance.com> wrote:
>
>> Often, the best query plans result from 'LEFT JOIN WHERE right
>> side is NULL' rather than NOT EXISTS however.  I often get
>> performance gains by switching NOT EXISTS queries to LEFT JOIN
>> form.
>
>Even in 8.4 and later?  I would think that the anti-join that Tom
>added in 8.4 would always perform at least as well as the LEFT JOIN
>technique you describe.
>
>-Kevin

Yes, in 8.4.  The query planner definitely does not treat the two as
equivalent.  I don't have a concrete example at hand right now, but I've
been working exclusively on 8.4 since a month after it was released.  It
does often use an anti-join for NOT EXISTS, but does not seem to explore
all avenues there.  Or perhaps the statistics  it has differ for some
reason at that point.  All I know, is that the resulting query plan
differs sometimes and I'd say 3 out of 4 times the LEFT JOIN variant is
more optimal when they differ.

>


Re: "SELECT .. WHERE NOT IN" query running for hours

От
Robert Haas
Дата:
2011/1/10 Mladen Gogala <mladen.gogala@vmsinfo.com>:
> Well, I really hoped that Bruce, Robert or Greg would take on this one, but
> since there are no more qualified takers, I'll take a shot at this one. For
> the "NOT IN (result of a correlated sub-query)", the sub-query needs to be
> executed for every row matching the conditions on the driving table, while
> the   !EXISTS is just a complement of join. It's all in the basic set theory
> which serves as a model for the relational databases.

As Scott says, the real problem is the NULL handling.  The semantics
are otherwise similar.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company