Re: POC, WIP: OR-clause support for indexes
От | Ranier Vilela |
---|---|
Тема | Re: POC, WIP: OR-clause support for indexes |
Дата | |
Msg-id | CAEudQAqW3p0R5NMjOjTuCbs=HTQR5YO8WDgCjcSuArbLdDJVtw@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: POC, WIP: OR-clause support for indexes (Alena Rybakina <lena.ribackina@yandex.ru>) |
Ответы |
Re: POC, WIP: OR-clause support for indexes
(Alena Rybakina <lena.ribackina@yandex.ru>)
|
Список | pgsql-hackers |
Hi!
On 29.06.2023 04:36, Ranier Vilela wrote:I don't want to be rude, but this doesn't seem very helpful.Sorry, It was not my intention to cause interruptions.
- You made some changes, but you don't even attempt to explain what you
changed or why you changed it.1. Reduce scope2. Eliminate unnecessary variables3. Eliminate unnecessary expressions
- You haven't even tried to compile the code, nor tested it. If it
happens to compile, wow could others even know it actually behaves the
way you wanted?Sorry I didn't answer right away. I will try not to do this in the future thank you for your participation and help.
Yes, the scope of this patch may be small, but I am sure that it will solve the worst case of memory consumption with large numbers of "or" expressions or reduce execution and planning time.
As I have already said, I conducted a launch on a database with 20 billion data generated using a benchmark. Unfortunately, at that time I sent a not quite correct picture: the execution time, not the planning time, increases with the number of "or" expressions (execution_time.png). x is the number of or expressions, y is the execution/scheduling time.I also throw memory consumption at 50,000 "or" expressions collected by HeapTrack (where memory consumption was recorded already at the initialization stage of the 1.27GB pic3.png). I think such a transformation will allow just the same to avoid such a worst case, since in comparison with ANY memory is much less and takes little time.
SELECT FORMAT('prepare x %s AS SELECT * FROM pgbench_accounts a WHERE %s', '(' || string_agg('int', ',') || ')', string_agg(FORMAT('aid = $%s', g.id), ' or ') ) AS cmd FROM generate_series(1, 50000) AS g(id) \gexec SELECT FORMAT('execute x %s;', '(' || string_agg(g.id::text, ',') || ')') AS cmd FROM generate_series(1, 50000) AS g(id) \gexecI tried to add a transformation at the path formation stage before we form indexes (set_plain_rel_pathlist function) and at the stage when we have preprocessing of "or" expressions (getting rid of duplicates or useless conditions), but everywhere there was a problem of incorrect selectivity estimation.
CREATE TABLE tenk1 (unique1 int, unique2 int, ten int, hundred int); insert into tenk1 SELECT x,x,x,x FROM generate_series(1,50000) as x; CREATE INDEX a_idx1 ON tenk1(unique1); CREATE INDEX a_idx2 ON tenk1(unique2); CREATE INDEX a_hundred ON tenk1(hundred); postgres=# explain analyze select * from tenk1 a join tenk1 b on ((a.unique2 = 3 or a.unique2 = 7)) or (a.unique1 = 1); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=0.00..140632434.50 rows=11250150000 width=32) (actual time=0.077..373.279 rows=1350000 loops=1) -> Seq Scan on tenk1 b (cost=0.00..2311.00 rows=150000 width=16) (actual time=0.037..13.941 rows=150000 loops=1) -> Materialize (cost=0.00..3436.01 rows=75001 width=16) (actual time=0.000..0.001 rows=9 loops=150000) -> Seq Scan on tenk1 a (cost=0.00..3061.00 rows=75001 width=16) (actual time=0.027..59.174 rows=9 loops=1) Filter: ((unique2 = ANY (ARRAY[3, 7])) OR (unique1 = 1)) Rows Removed by Filter: 149991 Planning Time: 0.438 ms Execution Time: 407.144 ms (8 rows)Only by converting the expression at this stage, we do not encounter this problem.
postgres=# set enable_bitmapscan ='off'; SET postgres=# explain analyze select * from tenk1 a join tenk1 b on a.unique2 = 3 or a.unique2 = 7 or a.unique1 = 1; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=0.00..22247.02 rows=1350000 width=32) (actual time=0.094..373.627 rows=1350000 loops=1) -> Seq Scan on tenk1 b (cost=0.00..2311.00 rows=150000 width=16) (actual time=0.051..14.667 rows=150000 loops=1) -> Materialize (cost=0.00..3061.05 rows=9 width=16) (actual time=0.000..0.001 rows=9 loops=150000) -> Seq Scan on tenk1 a (cost=0.00..3061.00 rows=9 width=16) (actual time=0.026..42.389 rows=9 loops=1) Filter: ((unique2 = ANY ('{3,7}'::integer[])) OR (unique1 = 1)) Rows Removed by Filter: 149991 Planning Time: 0.414 ms Execution Time: 409.154 ms (8 rows)I compiled my original patch and there were no problems with regression tests. The only time there was a problem when I set the const_transform_or_limit variable to 0 (I have 15), as you have in the patch. To be honest, diff appears there because you had a different plan, specifically the expressions "or" are replaced by ANY (see regression.diffs).
Unfortunately, your patch version did not apply immediately, I did not understand the reasons, I applied it manually.
At the moment, I'm not sure that the constant is the right number for applying transformations, so I'm in search of it, to be honest. I will post my observations on this issue later. If you don't mind, I'll leave the constant equal to 15 for now.
Sorry, I don't understand well enough what is meant by points "Eliminate unnecessary variables" and "Eliminate unnecessary expressions". Can you explain in more detail?
Regarding the patch, there was a Warning at the compilation stage.In file included from ../../../src/include/nodes/bitmapset.h:21, from ../../../src/include/nodes/parsenodes.h:26, from ../../../src/include/catalog/objectaddress.h:17, from ../../../src/include/catalog/pg_aggregate.h:24, from parse_expr.c:18: parse_expr.c: In function ‘transformBoolExprOr’: ../../../src/include/nodes/nodes.h:133:66: warning: ‘expr’ is used uninitialized [-Wuninitialized] 133 | #define nodeTag(nodeptr) (((const Node*)(nodeptr))->type) | ^~ parse_expr.c:116:29: note: ‘expr’ was declared here 116 | BoolExpr *expr; | ^~~~
I couldn't figure out how to fix it and went back to my original version. To be honest, I don't think anything needs to be changed here.
Unfortunately, I didn't understand the reasons why, with the available or expressions, you don't even try to convert to ANY by calling transformBoolExpr, as I saw. I went back to my version.
I think it's worth checking whether the or_statement variable is positive.
I think it's worth leaving the use of the or_statement variable in its original form.
switch (expr->boolop)
{
case AND_EXPR:
opname = "AND";
break;
case OR_EXPR:
opname = "OR";
or_statement = true;
break;
case NOT_EXPR:
opname = "NOT";
break;
default:
elog(ERROR, "unrecognized boolop: %d", (int) expr->boolop);
opname = NULL; /* keep compiler quiet */
break;
}
if (!or_statement || list_length(expr->args) < const_transform_or_limit)
return transformBoolExpr(pstate, (BoolExpr *)expr_orig);
Вложения
В списке pgsql-hackers по дате отправления: