Re: COPY FROM WHEN condition

From: Tomas Vondra
Subject: Re: COPY FROM WHEN condition
Date:
Msg-id: c1f302f0-d806-6c64-fe6c-3febe03a4eb1@2ndquadrant.com
In reply to: Re: COPY FROM WHEN condition  (Surafel Temesgen <surafel3000@gmail.com>)
Responses: Re: COPY FROM WHEN condition  (Surafel Temesgen <surafel3000@gmail.com>)
           Re: COPY FROM WHEN condition  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 11/26/18 2:25 PM, Surafel Temesgen wrote:
> 
> 
> On Sat, Nov 24, 2018 at 5:09 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
> 
> 
>     1) While comparing this to the FILTER clause we already have for
>     aggregates, I've noticed the aggregate version is
> 
>         FILTER '(' WHERE a_expr ')'
> 
>     while here we have
> 
>         FILTER '(' a_expr ')'
> 
>     For a while I was thinking that maybe we should use the same syntax
>     here, but I don't think so. The WHERE bit seems rather unnecessary and
>     we probably implemented it only because it's required by SQL standard,
>     which does not apply to COPY. So I think this is fine.
> 
>  
> ok
> 
> 
>     2) The various parser checks emit errors like this:
> 
>         case EXPR_KIND_COPY_FILTER:
>             err = _("cannot use subquery in copy from FILTER condition");
>             break;
> 
>     I think the "copy from" should be capitalized, to make it clear that it
>     refers to a COPY FROM command and not a copy of something.
> 
> 
>     3) I think there should be regression tests for these prohibited things,
>     i.e. for a set-returning function, for a non-existent column, etc.
> 
> 
> fixed
> 
> 
>     4) What might be somewhat confusing for users is that the filter uses a
>     single snapshot to evaluate the conditions for all rows. That is, one
>     might do this
> 
>         create or replace function f() returns int as $$
>             select count(*)::int from t;
>         $$ language sql;
> 
>     and hope that
> 
>         copy t from '/...' filter (f() <= 100);
> 
>     only ever imports the first 100 rows - but that's not true, of course,
>     because f() uses the snapshot acquired at the very beginning. For
>     example INSERT SELECT does behave differently:
> 
>         test=# copy t from '/home/user/t.data' filter (f() < 100);
>         COPY 81
>         test=# insert into t select * from t where f() < 100;
>         INSERT 0 19
> 
>     Obviously, this is not an issue when the filter clause references only
>     the input row (which I assume will be the primary use case).
> 
>     Not sure if this is expected / appropriate behavior, and if the patch
>     needs to do something else here.
> 
> Considering the overhead of setting a snapshot before evaluating every
> row, and that this kind of usage is not frequent, the behavior seems
> acceptable to me.
> 

I'm not really buying the argument that this behavior is acceptable
simply because using the feature like this will be uncommon. That seems
like a rather weak reason to accept it.

I however agree we don't want to make COPY less efficient, at least not
unless absolutely necessary. But I think we can handle this simply by
restricting what's allowed to appear in the FILTER clause.

It should be fine to allow IMMUTABLE and STABLE functions, but not
VOLATILE ones. That should fix the example I shared, because f() would
not be allowed.
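A minimal sketch of such a check (hypothetical placement, e.g. somewhere
in the COPY FROM setup code; contain_volatile_functions() is the
existing helper from src/backend/optimizer/util/clauses.c, and
filter_expr stands for wherever the parsed FILTER expression ends up):

    /* reject FILTER expressions containing VOLATILE functions */
    if (contain_volatile_functions((Node *) filter_expr))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("VOLATILE functions are not allowed in the COPY FROM FILTER clause")));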

Of course, one could mark f() as STABLE explicitly, which would make it
usable in the FILTER clause. But STABLE promises "same result for all
calls within the statement," so the behavior would then be expected and
essentially legal (in a way it'd be mislabeling, but we trust such
labels in other places too).
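For instance, the function from the earlier example could simply be
declared with the STABLE keyword:

    create or replace function f() returns int as $$
        select count(*)::int from t;
    $$ language sql stable;

A VOLATILE check would then let it through, and the single-snapshot
evaluation is exactly what the STABLE contract describes.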

So I think there are three options:

(a) accept the current behavior (single snapshot, same result for all
function calls)

(b) prohibit VOLATILE functions in the FILTER clause altogether

(c) allow VOLATILE functions in the FILTER clause, but change the
behavior so that it is sane


I definitely vote -1 on (a) unless someone presents a much better
argument for allowing it than "it's uncommon".

Which leaves us with (b) and (c). Clearly, (b) is simpler to implement,
because (c) needs to do the same detection and then some additional work
on top. I'm not sure how much more complex (c) is, compared to (b).

After eyeballing the copy.c code for a bit, I think it would need to use
CIM_SINGLE when there are volatile functions, similarly to volatile
default values (see volatile_defexprs), and then increment the command
ID after each insert. Of course, we don't want to do this by default,
but only when actually needed (with VOLATILE functions), because the
multi-inserts are quite a bit more efficient.
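Roughly, something like this sketch (field names are hypothetical;
CIM_SINGLE / CIM_MULTI and CommandCounterIncrement() are existing pieces
of copy.c and the executor):

    /* when picking the insert method in CopyFrom() */
    if (volatile_defexprs ||
        contain_volatile_functions((Node *) filter_expr))
        insertMethod = CIM_SINGLE;
    else
        insertMethod = CIM_MULTI;

    ...

    /* after each row inserted through the single-row path, so the
     * next FILTER evaluation sees the rows inserted so far */
    CommandCounterIncrement();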


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

