Re: POC, WIP: OR-clause support for indexes

Поиск
Список
Период
Сортировка
От Ranier Vilela
Тема Re: POC, WIP: OR-clause support for indexes
Дата
Msg-id CAEudQAqW3p0R5NMjOjTuCbs=HTQR5YO8WDgCjcSuArbLdDJVtw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: POC, WIP: OR-clause support for indexes  (Alena Rybakina <lena.ribackina@yandex.ru>)
Ответы Re: POC, WIP: OR-clause support for indexes  (Alena Rybakina <lena.ribackina@yandex.ru>)
Список pgsql-hackers
Em qui., 29 de jun. de 2023 às 02:50, Alena Rybakina <lena.ribackina@yandex.ru> escreveu:

Hi!

On 29.06.2023 04:36, Ranier Vilela wrote:
I don't want to be rude, but this doesn't seem very helpful.
Sorry, It was not my intention to cause interruptions.


- You made some changes, but you don't even attempt to explain what you
changed or why you changed it.
1. Reduce scope
2. Eliminate unnecessary variables
3. Eliminate unnecessary expressions
 

- You haven't even tried to compile the code, nor tested it. If it
happens to compile, wow could others even know it actually behaves the
way you wanted?

Sorry I didn't answer right away. I will try not to do this in the future thank you for your participation and help.

There's no need to apologize.
 

Yes, the scope of this patch may be small, but I am sure that it will solve the worst case of memory consumption with large numbers of "or" expressions or reduce execution and planning time.

Yeah, I also believe it will help performance.

As I have already said, I conducted a launch on a database with 20 billion data generated using a benchmark. Unfortunately, at that time I sent a not quite correct picture: the execution time, not the planning time, increases with the number of "or" expressions (execution_time.png). x is the number of or expressions, y is the execution/scheduling time.I also throw memory consumption at 50,000 "or" expressions collected by HeapTrack (where memory consumption was recorded already at the initialization stage of the 1.27GB pic3.png). I think such a transformation will allow just the same to avoid such a worst case, since in comparison with ANY memory is much less and takes little time.

SELECT FORMAT('prepare x %s AS SELECT * FROM pgbench_accounts a WHERE %s', '(' || string_agg('int', ',') || ')', string_agg(FORMAT('aid = $%s', g.id), ' or ') ) AS cmd FROM generate_series(1, 50000) AS g(id)
\gexec

SELECT FORMAT('execute x %s;', '(' || string_agg(g.id::text, ',') || ')') AS cmd FROM generate_series(1, 50000) AS g(id)
\gexec

I tried to add a transformation at the path formation stage before we form indexes (set_plain_rel_pathlist function) and at the stage when we have preprocessing of "or" expressions (getting rid of duplicates or useless conditions), but everywhere there was a problem of incorrect selectivity estimation.

CREATE TABLE tenk1 (unique1 int, unique2 int, ten int, hundred int);
insert into tenk1 SELECT x,x,x,x FROM generate_series(1,50000) as x;
CREATE INDEX a_idx1 ON tenk1(unique1);
CREATE INDEX a_idx2 ON tenk1(unique2);
CREATE INDEX a_hundred ON tenk1(hundred);

postgres=# explain analyze
select * from tenk1 a join tenk1 b on  ((a.unique2 = 3 or a.unique2 = 7)) or (a.unique1 = 1);                                                      QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------- Nested Loop  (cost=0.00..140632434.50 rows=11250150000 width=32) (actual time=0.077..373.279 rows=1350000 loops=1)   ->  Seq Scan on tenk1 b  (cost=0.00..2311.00 rows=150000 width=16) (actual time=0.037..13.941 rows=150000 loops=1)   ->  Materialize  (cost=0.00..3436.01 rows=75001 width=16) (actual time=0.000..0.001 rows=9 loops=150000)         ->  Seq Scan on tenk1 a  (cost=0.00..3061.00 rows=75001 width=16) (actual time=0.027..59.174 rows=9 loops=1)               Filter: ((unique2 = ANY (ARRAY[3, 7])) OR (unique1 = 1))               Rows Removed by Filter: 149991 Planning Time: 0.438 ms Execution Time: 407.144 ms
(8 rows)

Only by converting the expression at this stage, we do not encounter this problem.

postgres=# set enable_bitmapscan ='off';
SET
postgres=# explain analyze
select * from tenk1 a join tenk1 b on  a.unique2 = 3 or a.unique2 = 7 or a.unique1 = 1;                                                      QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------- Nested Loop  (cost=0.00..22247.02 rows=1350000 width=32) (actual time=0.094..373.627 rows=1350000 loops=1)   ->  Seq Scan on tenk1 b  (cost=0.00..2311.00 rows=150000 width=16) (actual time=0.051..14.667 rows=150000 loops=1)   ->  Materialize  (cost=0.00..3061.05 rows=9 width=16) (actual time=0.000..0.001 rows=9 loops=150000)         ->  Seq Scan on tenk1 a  (cost=0.00..3061.00 rows=9 width=16) (actual time=0.026..42.389 rows=9 loops=1)               Filter: ((unique2 = ANY ('{3,7}'::integer[])) OR (unique1 = 1))               Rows Removed by Filter: 149991 Planning Time: 0.414 ms Execution Time: 409.154 ms
(8 rows)

I compiled my original patch and there were no problems with regression tests. The only time there was a problem when I set the const_transform_or_limit variable to 0 (I have 15), as you have in the patch. To be honest, diff appears there because you had a different plan, specifically the expressions "or" are replaced by ANY (see regression.diffs).

You are right. The v3 attached shows the same diff.

Unfortunately, your patch version did not apply immediately, I did not understand the reasons, I applied it manually.

Sorry.
 

At the moment, I'm not sure that the constant is the right number for applying transformations, so I'm in search of it, to be honest. I will post my observations on this issue later. If you don't mind, I'll leave the constant equal to 15 for now.

It's hard to predict. Perhaps accounting for time on each benchmark could help decide.
 

Sorry, I don't understand well enough what is meant by points "Eliminate unnecessary variables" and "Eliminate unnecessary expressions". Can you explain in more detail?

One example is array_type.
As you can see in v2 and v3 it no longer exists.
 


Regarding the patch, there was a Warning at the compilation stage. 

In file included from ../../../src/include/nodes/bitmapset.h:21,

                 from ../../../src/include/nodes/parsenodes.h:26,

                 from ../../../src/include/catalog/objectaddress.h:17,

                 from ../../../src/include/catalog/pg_aggregate.h:24,

                 from parse_expr.c:18:

parse_expr.c: In function ‘transformBoolExprOr’:

../../../src/include/nodes/nodes.h:133:66: warning: ‘expr’ is used uninitialized [-Wuninitialized]

  133 | #define nodeTag(nodeptr)                (((const Node*)(nodeptr))->type)

      |                                                                  ^~

parse_expr.c:116:29: note: ‘expr’ was declared here

  116 |         BoolExpr           *expr;

      |                             ^~~~
Sorry, this error did not appear in my builds 

I couldn't figure out how to fix it and went back to my original version. To be honest, I don't think anything needs to be changed here.

Unfortunately, I didn't understand the reasons why, with the available or expressions, you don't even try to convert to ANY by calling transformBoolExpr, as I saw. I went back to my version.

I think it's worth checking whether the or_statement variable is positive.

I think it's worth leaving the use of the or_statement variable in its original form.

  switch (expr->boolop)
  {
    case AND_EXPR:
      opname = "AND";
      break;
    case OR_EXPR:
      opname = "OR";
      or_statement = true;
      break;
    case NOT_EXPR:
      opname = "NOT";
      break;
    default:
      elog(ERROR, "unrecognized boolop: %d", (int) expr->boolop);
      opname = NULL;    /* keep compiler quiet */
      break;
  }

  if (!or_statement || list_length(expr->args) < const_transform_or_limit)
    return transformBoolExpr(pstate, (BoolExpr *)expr_orig);

You are right, the v3 this way.

As I said earlier, these are just suggestions.
But thinking about it now, I think they can be classified as bad early optimizations.

regards,
Ranier Vilela
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: jian he
Дата:
Сообщение: Re: Incremental View Maintenance, take 2
Следующее
От: "Tristan Partin"
Дата:
Сообщение: Re: Meson build updates