Re: [GENERAL] Perfomance of IN-clause with many elements andpossible solutions

Поиск

Список

Период

Сортировка

От	PT
Тема	Re: [GENERAL] Perfomance of IN-clause with many elements andpossible solutions
Дата	24 июля 2017 г. 01:40:31
Msg-id	20170723184031.8c97f2bedf350f4132112cde@potentialtech.com обсуждение
Ответ на	[GENERAL] Perfomance of IN-clause with many elements and possible solutions ("dilaz03 ." <dilaz03@gmail.com>)
Ответы	Re: [GENERAL] Perfomance of IN-clause with many elements and possiblesolutions Re: [GENERAL] Perfomance of IN-clause with many elements and possiblesolutions
Список	pgsql-general

Дерево обсуждения

On Sun, 23 Jul 2017 14:35:24 +0300
"dilaz03 ." <dilaz03@gmail.com> wrote:

> Hello.
>
> I have database with events with type from different souces identified by
> id. I have query which filters events by IN-clause with many ids (1-500
> ids). I see poor perfomance of IN-clause and try to investigate this
> problem.
>
> SELECT version();
>                                                       version
> -------------------------------------------------------------------------------------------------------------------
> PostgreSQL 10beta2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
> 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit
>
>
> -- Full table can fit in memory
> show shared_buffers;
>  shared_buffers
> ----------------
> 2GB
>
>
> show work_mem;
>  work_mem
> ----------
>  16MB
>
>
> SET max_parallel_workers_per_gather TO 0;
> SET max_parallel_workers TO 0;
>
> -- Create table with 10 000 000 rows with 500 bigints
> CREATE TABLE ids AS SELECT trunc(random() * 500)::bigint as id from
> generate_series(1, 10000000);
>
>
> -- IN (...)
> SELECT ('(' || string_agg(id::text, ',') || ')') AS in_clause
> FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset
>
> EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN :in_clause;
>
>   Aggregate  (cost=2654265.02..2654265.03 rows=1 width=8) (actual
> time=17268.831..17268.831 rows=1 loops=1)
>    Buffers: shared hit=44248
>    ->  Seq Scan on ids  (cost=0.00..2644260.48 rows=4001815 width=0)
> (actual time=0.066..16722.072 rows=3998646 loops=1)
>          Filter: (id = ANY
>
('{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199}'::bigint[]))
>          Rows Removed by Filter: 6001354
>          Buffers: shared hit=44248
>  Planning time: 3.324 ms
>  Execution time: 17268.907 ms

In this example you count approximately 40,000,000 values, which is
about 40% of the table. That's going to take time, and most of it
is going to be CPU (since the table fits in memory). If your table
must be structured this way and if these are the queries you're going
to run, the only thing I can expect will speed them up is a faster
processer (note, NOT more CPU cores, but faster). You don't mention
what kind of CPU you have, but I'm guessing it's already pretty fast
and you can't really get anything but marginally faster.

If you really need these queries to be faster, I would suggest
materializing the data, i.e. create a table like:

CREATE TABLE id_counts (
 id BIGINT PRIMARY KEY,
 num BIGINT
)

Then use a trigger or similar technique to keep id_counts in sync
with the id table. You can then run queries of the form:

SELECT sum(num) FROM id_counts WHERE id IN :values:

which I would wager houseboats will be significantly faster.

> -- IN (VALUES ...)
> SELECT ('(VALUES ' || string_agg('(' || id::text || ')', ',') || ')') AS
> values_clause
> FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset
>
> EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN
> :values_clause;
>   Aggregate  (cost=245006.46..245006.47 rows=1 width=8) (actual
> time=4086.188..4086.188 rows=1 loops=1)
>    Buffers: shared hit=44248
>    ->  Hash Join  (cost=7.50..235006.42 rows=4000019 width=0) (actual
> time=0.978..3557.037 rows=3998646 loops=1)
>          Hash Cond: (ids.id = "*VALUES*".column1)
>          Buffers: shared hit=44248
>          ->  Seq Scan on ids  (cost=0.00..144248.48 rows=10000048 width=8)
> (actual time=0.031..1138.542 rows=10000000 loops=1)
>                Buffers: shared hit=44248
>          ->  Hash  (cost=5.00..5.00 rows=200 width=4) (actual
> time=0.923..0.923 rows=200 loops=1)
>                Buckets: 1024  Batches: 1  Memory Usage: 16kB
>                ->  HashAggregate  (cost=3.00..5.00 rows=200 width=4)
> (actual time=0.606..0.759 rows=200 loops=1)
>                      Group Key: "*VALUES*".column1
>                      ->  Values Scan on "*VALUES*"  (cost=0.00..2.50
> rows=200 width=4) (actual time=0.003..0.330 rows=200 loops=1)
>  Planning time: 1.094 ms
>  Execution time: 4086.333 ms
>
>
> -- '...'::hstore ? id
> SELECT ('''' || string_agg(id::text || '=>NULL', ',') || '''::hstore') AS
> hstore_clause
> FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset
>
> EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE :hstore_clause ?
> id::text;
>  Planning time: 0.206 ms
>  Execution time: 5032.794 ms
>
>
> -- '...'::jsonb ? id
> SELECT ('''{' || string_agg('"' || id::text || '": null', ',') ||
> '}''::jsonb') AS jsonb_clause
> FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset
>
> EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE :jsonb_clause ?
> id::text;
>  Planning time: 0.114 ms
>  Execution time: 9277.307 ms
>
>
> IN-VALUES clause has the bestest perfomance. So I have some questions:
>
> - May be exist better solution?
> - Does PostgreSQL have support of hashset structure? Extension (I don't
> found)?
> - IN-VALUES clause adds new node to plan. Has additional node big overhead?
> How about filter by two or more IN-VALUES clause?
>
> Thanks.


--
PT <wmoran@potentialtech.com>

В списке pgsql-general по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: [GENERAL] Perfomance of IN-clause with many elements andpossible solutions