Poor query plan across OR operator

Поиск
Список
Период
Сортировка
От Mark Hills
Тема Poor query plan across OR operator
Дата
Msg-id alpine.LFD.2.01.1001261532460.29382@sys880.ldn.framestore.com
обсуждение исходный текст
Ответы Re: Poor query plan across OR operator
Re: Poor query plan across OR operator
Список pgsql-performance
One of our most-used queries performs poorly (taking over 2 seconds) and a
tiny amount of refactoring shows it can be fast (less than 1ms) by
transforming the OR case (which spans two tables) into a UNION.

I have created a simple test case (below) which shows the difference we
are seeing in query plans before and after refactoring.

Is it beyond the ability of the query planner to optimise this query
without refactoring? Or is the appropriate index missing, and if so, what
would it be?

Perhaps the refactored query is, in fact, different and could produce
different data in certain corner-cases; I can't see where this could be
though.

Your suggestions are appreciated and I hope the information is useful.
Many thanks.

Mark


-- The plans below are from PostgreSQL 8.5alpha3. Also tested with
-- similar results on PostgreSQL 8.4.2

-- Data structure where a container contains multiple items

CREATE TABLE container (
    id integer PRIMARY KEY,
    selected bool NOT NULL DEFAULT false
);

CREATE TABLE item (
    container_id integer NOT NULL
        REFERENCES container(id) ON DELETE CASCADE,
    n integer NOT NULL,
    selected bool NOT NULL DEFAULT false,
    PRIMARY KEY (container_id, n)
);

-- Partial indexes to find selected containers or selected items

CREATE INDEX container_selected
    ON container (selected)
    WHERE selected IS true;

CREATE INDEX item_selected
    ON item (selected)
    WHERE selected IS true;

-- Populate the data; for a small minority of items and containers,
-- 'selected' is true

CREATE LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION populate()
RETURNS VOID
AS $$
DECLARE
    i integer;
    j integer;
BEGIN
    FOR i IN 0..999 LOOP

        INSERT INTO container (id, selected)
            VALUES (i, RANDOM() < 0.01);

        FOR j IN 0..999 LOOP
            INSERT INTO item (container_id, n, selected)
                VALUES (i, j, RANDOM() < 0.001);
        END LOOP;

    END LOOP;
END
$$ LANGUAGE plpgsql;

SELECT populate();
VACUUM ANALYZE;

SELECT COUNT(*) FROM container; -- 1000
SELECT COUNT(*) FROM container WHERE selected IS true; -- 9
SELECT COUNT(*) FROM item; -- 1000000
SELECT COUNT(*) FROM item WHERE selected IS true; -- 1004

-- A query to find all items where the item or container is selected

EXPLAIN ANALYZE
    SELECT container_id, n
    FROM item
        INNER JOIN container ON item.container_id = container.id
    WHERE item.selected IS true
        OR container.selected IS true;

-- Resulting query plan
--
-- Hash Join  (cost=28.50..92591.11 rows=10016 width=8) (actual time=372.659..1269.207 rows=9996 loops=1)
--   Hash Cond: (item.container_id = container.id)
--   Join Filter: ((item.selected IS TRUE) OR (container.selected IS TRUE))
--   ->  Seq Scan on item  (cost=0.00..78778.68 rows=1002468 width=9) (actual time=370.590..663.764 rows=1000000
loops=1)
--   ->  Hash  (cost=16.00..16.00 rows=1000 width=5) (actual time=0.805..0.805 rows=1000 loops=1)
--         ->  Seq Scan on container  (cost=0.00..16.00 rows=1000 width=5) (actual time=0.007..0.296 rows=1000 loops=1)
-- Total runtime: 1271.676 ms
-- (7 rows)

-- The refactored SQL, which queries the same data but is fast

EXPLAIN ANALYZE
        SELECT container_id, n
        FROM item
            INNER JOIN container ON item.container_id = container.id
        WHERE item.selected IS true
    UNION
        SELECT container_id, n
        FROM item
            INNER JOIN container ON item.container_id = container.id
        WHERE container.selected IS true;

-- Resulting query plan:
--
-- HashAggregate  (cost=18018.43..18120.33 rows=10190 width=8) (actual time=22.784..26.341 rows=9996 loops=1)
--   ->  Append  (cost=28.50..17967.48 rows=10190 width=8) (actual time=0.908..16.676 rows=10004 loops=1)
--         ->  Hash Join  (cost=28.50..90.05 rows=1002 width=8) (actual time=0.907..3.113 rows=1004 loops=1)
--               Hash Cond: (public.item.container_id = public.container.id)
--               ->  Index Scan using item_selected on item  (cost=0.00..47.77 rows=1002 width=8) (actual
time=0.036..1.425rows=1004 loops=1) 
--                     Index Cond: (selected = true)
--               ->  Hash  (cost=16.00..16.00 rows=1000 width=4) (actual time=0.856..0.856 rows=1000 loops=1)
--                     ->  Seq Scan on container  (cost=0.00..16.00 rows=1000 width=4) (actual time=0.006..0.379
rows=1000loops=1) 
--         ->  Nested Loop  (cost=0.00..17775.53 rows=9188 width=8) (actual time=0.024..9.175 rows=9000 loops=1)
--               ->  Index Scan using container_selected on container  (cost=0.00..12.33 rows=9 width=4) (actual
time=0.005..0.012rows=9 loops=1) 
--                     Index Cond: (selected = true)
--               ->  Index Scan using item_pkey on item  (cost=0.00..1960.93 rows=1021 width=8) (actual
time=0.014..0.460rows=1000 loops=9) 
--                     Index Cond: (public.item.container_id = public.container.id)
-- Total runtime: 28.617 ms
-- (14 rows)

В списке pgsql-performance по дате отправления:

Предыдущее
От: nair rajiv
Дата:
Сообщение: Re: splitting data into multiple tables
Следующее
От: Grzegorz Jaśkiewicz
Дата:
Сообщение: Re: Poor query plan across OR operator