Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Michael Lewis
Subject: Re: How bad is using queries with thousands of values for operators IN or ANY?
Date:
Msg-id: CAHOFxGq+1OdxQZ0prKEgoGYTL6sfrz3aHB6xdQKFOxeJsDRA2w@mail.gmail.com
In response to: Re: How bad is using queries with thousands of values for operators IN or ANY?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List: pgsql-general
On Tue, Sep 1, 2020 at 1:22 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
 FWIW, the attached is the dusted-off version of a part of a stalled
 development of mine, which unconditionally(!) creates on-the-fly
 statistics on VALUES list. It seems to work for certain cases,
 although the planning time increases significantly.

=$ CREATE TABLE t1 AS SELECT a, a * 2 AS b FROM generate_series(0, 99999) a;
=$ CREATE INDEX ON t1 (a);
> perl q.pl(*) | psql

*: q.pl:
> print "explain analyze select b from t1 join (values ";
> foreach $i (0..10000) {
>       print ", " if ($i > 0);
>       printf("(%d)", $i/10 + 1000);
> }
> print ") as v(v) on (v.v = t1.a);";


patched:

 Merge Join  (cost=824.25..1005.19 rows=10001 width=4) (actual time=13.513..24.285 rows=10001 loops=1)
   Merge Cond: (t1.a = "*VALUES*".column1)
   ->  Index Scan using t1_a_idx on t1  (cost=0.29..3050.29 rows=100000 width=8) (actual time=0.033..1.629 rows=2002 loops=1)
   ->  Sort  (cost=789.47..814.47 rows=10001 width=4) (actual time=12.557..14.546 rows=10001 loops=1)
         Sort Key: "*VALUES*".column1
         Sort Method: quicksort  Memory: 931kB
         ->  Values Scan on "*VALUES*"  (cost=0.00..125.01 rows=10001 width=4) (actual time=0.002..8.271 rows=10001 loops=1)
 Planning Time: 17.290 ms
 Execution Time: 26.344 ms
(9 rows)

master:
 Hash Join  (cost=250.03..2168.03 rows=10001 width=4) (actual time=14.482..77.205 rows=10001 loops=1)
   Hash Cond: (t1.a = "*VALUES*".column1)
   ->  Seq Scan on t1  (cost=0.00..1443.00 rows=100000 width=8) (actual time=0.017..23.540 rows=100000 loops=1)
   ->  Hash  (cost=125.01..125.01 rows=10001 width=4) (actual time=13.786..13.788 rows=10001 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 480kB
         ->  Values Scan on "*VALUES*"  (cost=0.00..125.01 rows=10001 width=4) (actual time=0.002..8.503 rows=10001 loops=1)
 Planning Time: 12.365 ms
 Execution Time: 78.567 ms
(8 rows)

We end up resorting to creating temp tables and analyzing them to work around queries going off the rails because of bad or missing statistics on the VALUES list. I believe most/all of the core team, and perhaps most contributors, are against query hints in general (with some very good reasons), but it would be amazing to have the option to incur that planning-time cost, at least in some cases.
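For context, the temp-table workaround mentioned above can be sketched as follows (the temp table name is made up; t1 is the example table from the quoted test case):

```sql
-- Sketch of the temp-table workaround: instead of a 10,000-element
-- IN/VALUES list, load the keys into a temporary table and ANALYZE it
-- so the planner has real statistics (n_distinct, MCVs, histogram).
CREATE TEMP TABLE lookup_keys (k integer);

INSERT INTO lookup_keys
SELECT i / 10 + 1000 FROM generate_series(0, 10000) i;

ANALYZE lookup_keys;

EXPLAIN ANALYZE
SELECT t1.b
FROM t1
JOIN lookup_keys lk ON lk.k = t1.a;
```

With accurate statistics on lookup_keys, the planner can pick a plan similar to the patched merge-join above without any on-the-fly stats machinery, at the cost of the extra DDL and ANALYZE round trips.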

For my case, I think the changes coming in PG v13 or maybe v14 that let joins benefit from multivariate statistics will mitigate the pain point.
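For reference, multivariate (extended) statistics are declared with CREATE STATISTICS; a minimal sketch on the example table t1 from above (the statistics object name is made up):

```sql
-- Extended statistics on (a, b) of t1. CREATE STATISTICS has existed
-- since PG 10; the 'mcv' statistics kind was added in PG 12.
CREATE STATISTICS t1_ab_stats (ndistinct, dependencies, mcv)
    ON a, b FROM t1;

ANALYZE t1;  -- extended statistics are only populated by ANALYZE
```

Whether the planner consults these for a given join depends on the server version, which is why the mitigation is hedged to v13/v14 above.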
