Re: row filtering for logical replication

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: row filtering for logical replication
Дата
Msg-id 20210128022032.eq2qqc6zxkqn5syt@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: row filtering for logical replication  (Önder Kalacı <onderkalaci@gmail.com>)
Список pgsql-hackers
Hi,

On 2020-12-17 09:43:30 +0300, Önder Kalacı wrote:
> The above part can be considered the core of the logic, executed per tuple.
> As far as I can see, it has two downsides.
> 
> First, calling `expression_planner()` for every tuple can be quite
> expensive. I created a sample table, loaded data and ran a quick benchmark
> to see its effect. I attached the very simple script that I used to
> reproduce the issue on my laptop. I'm pretty sure you can find nicer ways
> of doing similar perf tests, just sharing as a reference.
> 
> The idea of the test is to add a WHERE clause to a table, but none of the
> tuples are filtered out. They just go through this code-path and send it to
> the remote node.
> 
> #rows       Patched    | Master
> 1M  00:00:25.067536    | 00:00:16.633988
> 10M  00:04:50.770791    | 00:02:40.945358
> 
> 
> So, it seems a significant overhead to me. What do you think?

That seems almost prohibitively expensive. I think at the very least
some of this work would need to be done in a cached manner, e.g. via
get_rel_sync_entry().


> Secondly, probably more importantly, allowing any operator is as dangerous
> as allowing any function as users can create/overload operator(s).

That's not safe, indeed. It's not even just create/overloading
operators, as far as I can tell the expression can contain just plain
function calls.

The issue also isn't primarily that the user can overload functions,
it's that logical decoding is a limited environment, and not everything
is safe to do within. You e.g. only catalog tables can be
accessed. Therefore I don't think we can allow arbitrary expressions.


> The other problematic area was the performance, as calling
> `expression_planner()` for every tuple can be very expensive. To avoid
> that, it might be considered to ask users to provide a function instead of
> a free form WHERE clause, such that if the function returns true, the tuple
> is sent. The allowed functions need to be immutable SQL functions with bool
> return type. As we can parse the  SQL functions, we should be able to allow
> only functions that rely on the above mentioned procs. We can apply as many
> restrictions (such as no modification query) as possible. For example, see
> below:
> ```

I don't think that would get us very far.

From a safety aspect: A function's body can be changed by the user at
any time, therefore we cannot rely on analyses of the function's body.

From a performance POV: SQL functions are planned at every invocation,
so that'd not buy us much either.


I think what you would have to do instead is to ensure that the
expression is "simple enough", and then process it into a cheaply
executable format in get_rel_sync_entry(). I'd suggest that in the first
version you just allow a simple ANDed list of 'foo.bar op constant'
expressions.

Does that make sense?

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Geoghegan
Дата:
Сообщение: Re: Thoughts on "killed tuples" index hint bits support on standby
Следующее
От: "Tang, Haiying"
Дата:
Сообщение: RE: [POC] Fast COPY FROM command for the table with foreign partitions