Re: Row pattern recognition

Поиск
Список
Период
Сортировка
От Tatsuo Ishii
Тема Re: Row pattern recognition
Дата
Msg-id 20230908.125447.478562939918621603.t-ishii@sranhm.sra.co.jp
обсуждение исходный текст
Ответ на Re: Row pattern recognition  (Jacob Champion <jchampion@timescale.com>)
Ответы Re: Row pattern recognition  (Jacob Champion <jchampion@timescale.com>)
Список pgsql-hackers
Hi,

> Hello!
> 
>> (1) I completely changed the pattern matching engine so that it
>> performs backtracking. Now the engine evaluates all pattern elements
>> defined in PATTER against each row, saving matched pattern variables
>> in a string per row. For example if the pattern element A and B
>> evaluated to true, a string "AB" is created for current row.
> 
> If I understand correctly, this strategy assumes that one row's
> membership in a pattern variable is independent of the other rows'
> membership. But I don't think that's necessarily true:
> 
>     DEFINE
>       A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A',
>       ...

But:

UP AS price > PREV(price)

also depends on previous row, no? Can you please elaborate how your
example could break current implementation? I cannot test it because
CLASSIFIER is not implemented yet.

>> See row_is_in_reduced_frame, search_str_set and search_str_set_recurse
>> in nodeWindowAggs.c for more details. For now I use a naive depth
>> first search and probably there is a lot of rooms for optimization
>> (for example rewriting it without using
>> recursion).
> 
> The depth-first match is doing a lot of subtle work here. For example, with
> 
>     PATTERN ( A+ B+ )
>     DEFINE A AS TRUE,
>            B AS TRUE
> 
> (i.e. all rows match both variables), and three rows in the partition,
> our candidates will be tried in the order
> 
>     aaa
>     aab
>     aba
>     abb
>     ...
>     bbb
> 
> The only possible matches against our regex `^a+b+` are "aab" and "abb",
> and that order matches the preferment order, so it's fine. But it's easy
> to come up with a pattern where that's the wrong order, like
> 
>     PATTERN ( A+ (B|A)+ )
> 
> Now "aaa" will be considered before "aab", which isn't correct.

Can you explain a little bit more? I think 'aaa' matches a regular
expression 'a+(b|a)+' and should be no problem before "aab" is
considered.

> Similarly, the assumption that we want to match the longest string only
> works because we don't allow alternation yet.

Can you please clarify more on this?

> Cool, I will give this piece some more thought. Do you mind if I try to
> add some more complicated pattern quantifiers to stress the
> architecture, or would you prefer to tackle that later? Just alternation
> by itself will open up a world of corner cases.

Do you mean you want to provide a better patch for the pattern
matching part? That will be helpfull. Because I am currently working
on the aggregation part and have no time to do it. However, the
aggregation work affects the v5 patch: it needs a refactoring. So can
you wait until I release v6 patch? I hope it will be released in two
weeks or so.

> On 8/9/23 01:41, Tatsuo Ishii wrote:
>> - I found a bug with pattern matching code. It creates a string for
>>   subsequent regular expression matching. It uses the initial letter
>>   of each define variable name. For example, if the varname is "foo",
>>   then "f" is used. Obviously this makes trouble if we have two or
>>   more variables starting with same "f" (e.g. "food"). To fix this, I
>>   assign [a-z] to each variable instead of its initial letter. However
>>   this way limits us not to have more than 26 variables. I hope 26 is
>>   enough for most use cases.
> 
> There are still plenty of alphanumerics left that could be assigned...
> 
> But I'm wondering if we might want to just implement the NFA directly?
> The current implementation's Cartesian explosion can probably be pruned
> aggressively, but replaying the entire regex match once for every
> backtracked step will still duplicate a lot of work.

Not sure if you mean implementing new regular expression engine
besides src/backend/regexp. I am afraid it's not a trivial work. The
current regexp code consists of over 10k lines. What do you think?

> I've attached another test case; it looks like last_value() is depending
> on some sort of side effect from either first_value() or nth_value(). I
> know the window frame itself is still under construction, so apologies
> if this is an expected failure.

Thanks. Fortunately current code which I am working passes the new
test. I will include it in the next (v6) patch.

Best reagards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Re: old_snapshot_threshold bottleneck on replica
Следующее
От: Andres Freund
Дата:
Сообщение: Re: Eager page freeze criteria clarification