Re: POC, WIP: OR-clause support for indexes

Поиск
Список
Период
Сортировка
От a.rybakina
Тема Re: POC, WIP: OR-clause support for indexes
Дата
Msg-id 3c539d4b-1c3c-4119-a3c9-e335b81c83cf@postgrespro.ru
обсуждение исходный текст
Ответ на Re: POC, WIP: OR-clause support for indexes  (Alexander Korotkov <aekorotkov@gmail.com>)
Список pgsql-hackers
Hi!

On 15.10.2023 01:34, Alexander Korotkov wrote:
> Hi, Alena!
>
> Thank you for your work on the subject.
>
> On Wed, Oct 4, 2023 at 10:21 PM a.rybakina <a.rybakina@postgrespro.ru> wrote:
>> I fixed the kernel dump issue and all the regression tests were successful, but I discovered another problem when I
addedmy own regression tests.
 
>> Some queries that contain "or" expressions do not convert to "ANY". I have described this in more detail using diff
asexpected and real results:
 
>>
>> diff -U3 /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out
/home/alena/postgrespro__copy6/src/test/regress/results/create_index.out
>> --- /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out 2023-10-04 21:54:12.496282667 +0300
>> +++ /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out  2023-10-04 21:55:41.665422459 +0300
>> @@ -1925,17 +1925,20 @@
>>   EXPLAIN (COSTS OFF)
>>   SELECT count(*) FROM tenk1
>>     WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3) OR thousand = 41;
>> -                                               QUERY PLAN
>> ---------------------------------------------------------------------------------------------------------
>> +                                                        QUERY PLAN
>>
+---------------------------------------------------------------------------------------------------------------------------
>>    Aggregate
>>      ->  Bitmap Heap Scan on tenk1
>> -         Recheck Cond: (((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) OR (thousand = 41))
>> +         Recheck Cond: ((((thousand = 42) AND (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3))) OR (thousand
=41))
 
>>            ->  BitmapOr
>> -               ->  Bitmap Index Scan on tenk1_thous_tenthous
>> -                     Index Cond: ((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[])))
>> +               ->  BitmapOr
>> +                     ->  Bitmap Index Scan on tenk1_thous_tenthous
>> +                           Index Cond: ((thousand = 42) AND (tenthous = 1))
>> +                     ->  Bitmap Index Scan on tenk1_thous_tenthous
>> +                           Index Cond: ((thousand = 42) AND (tenthous = 3))
>>                  ->  Bitmap Index Scan on tenk1_thous_tenthous
>>                        Index Cond: (thousand = 41)
>> -(8 rows)
>> +(11 rows)
> I think this query is not converted, because you only convert
> top-level ORs in the transform_ors() function.  But in the example
> given, the target OR lays under AND, which in turn lays under another
> OR.  I think you need to make transform_ors() recursive to handle
> cases like this.
>
> I wonder about the default value of the parameter or_transform_limit
> of 500. In [1] and [2] you show the execution time degradation from 0
> to ~500 OR clauses.  I made a simple SQL script with the query "SELECT
> * FROM pgbench_accounts a WHERE  aid = 1 OR aid = 2 OR ... OR aid =
> 100;". The pgbench results for a single connection in prepared mode
> are the following.
> master: 936 tps
> patched (or_transform_limit == 0) :1414 tps
> So, transformation to ANY obviously accelerates the execution.
>
> I think it's important to identify the cases where this patch causes
> the degradation.  Generally, I don't see why ANY could be executed
> slower than the equivalent OR clause.  So, the possible degradation
> cases are slower plan generation and worse plans.  I managed to find
> both.
>
> As you stated before, currently the OR transformation has a quadratic
> complexity depending on the number of or-clause-groups.  I made a
> simple test to evaluate this. containing 10000 or-clause-groups.
> SELECT * FROM pgbench_accounts a WHERE aid + 1 * bid = 1 OR aid + 2 *
> bid = 1 OR ... OR aid + 10000 * bid = 1;
> master: 316ms
> patched: 7142ms
> Note, that the current or_transform_limit GUC parameter is not capable
> of cutting such cases, because it cuts cases lower than the limit not
> higher than the limit.  In the comment, you mention that we could
> invent something like hash to handle this.  Hash should be nice, but
> the problem is that we currently don't have a generic facility to hash
> nodes (or even order them).  It would be nice to add this facility,
> that would be quite a piece of work.  I would propose to limit this
> patch for now to handle just a single Var node as a non-const side of
> the clause and implement a simple hash for Vars.
>
> Another problem is the possible generation of worse plans.  I made an
> example table with two partial indexes.
> create table test as (select (random()*10)::int x, (random()*1000) y
> from generate_series(1,1000000) i);
> create index test_x_1_y on test (y) where x = 1;
> create index test_x_2_y on test (y) where x = 2;
> vacuum analyze test;
>
> Without the transformation of ORs to ANY, our planner manages to use
> both indexes with a Bitmap scan.
> # explain select * from test where (x = 1 or x = 2) and y = 100;
>                                                    QUERY PLAN
> --------------------------------------------------------------------------------------------------------------
>   Bitmap Heap Scan on test  (cost=8.60..12.62 rows=1 width=12)
>     Recheck Cond: (((y = '100'::double precision) AND (x = 1)) OR ((y =
> '100'::double precision) AND (x = 2)))
>     ->  BitmapOr  (cost=8.60..8.60 rows=1 width=0)
>           ->  Bitmap Index Scan on test_x_1_y  (cost=0.00..4.30 rows=1 width=0)
>                 Index Cond: (y = '100'::double precision)
>           ->  Bitmap Index Scan on test_x_2_y  (cost=0.00..4.30 rows=1 width=0)
>                 Index Cond: (y = '100'::double precision)
> (7 rows)
>
> With transformation, the planner can't use indexes.
> # explain select * from test where (x = 1 or x = 2) and y = 100;
>                                   QUERY PLAN
> -----------------------------------------------------------------------------
>   Gather  (cost=1000.00..12690.10 rows=1 width=12)
>     Workers Planned: 2
>     ->  Parallel Seq Scan on test  (cost=0.00..11690.00 rows=1 width=12)
>           Filter: ((x = ANY (ARRAY[1, 2])) AND (y = '100'::double precision))
> (4 rows)
>
> The solution I see would be to tech Bitmap scan to handle ANY clause
> in the same way as the OR clause.  I think the entry point for the
> relevant logic is the choose_bitmap_and() function.
>
> Regarding the GUC parameter, I don't see we need a limit.  It's not
> yet clear whether a small number or a large number of OR clauses are
> more favorable for transformation.  I propose to have just a boolean
> enable_or_transformation GUC.
>
I removed the limit from the hook, left the option to enable it or not.

I replaced the data structure so that the groups were formed not in a 
list, but in a hash table. It seems to work fine, but I haven't figured 
out yet why in some cases the regression test results are different and 
the function doesn't work.

So far, I have formed a patch for the version where the conversion takes 
place in parsing, since so far this patch looks the most reliable for me

For convenience, I have formed a patch for the very first version so far.

I have a suspicion that the problem is in the part where we form a hash 
from a string. I'm still figuring it out.

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: José Neves
Дата:
Сообщение: RE: CDC/ETL system on top of logical replication with pgoutput, custom client
Следующее
От: Ashutosh Bapat
Дата:
Сообщение: Re: CDC/ETL system on top of logical replication with pgoutput, custom client