Discussion: COPY FROM WHEN condition
Hello,
Currently we cannot move data from a file to a table based on a condition on a certain column, but I think there are many use cases worth supporting. Attached is a patch that skips any row that does not satisfy the WHEN condition when inserting into a table; it works on top of commit b68ff3ea672c06,
and the syntax is
COPY table_name [ ( column_name [, ...] ) ]
FROM { 'filename' | PROGRAM 'command' | STDIN }
[ [ WITH ] ( option [, ...] ) ]
[ WHEN condition ]
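For illustration, a hypothetical session using the proposed syntax (the table, file, and column names are invented; the exact behavior is still up for review):

```sql
-- Hypothetical example of the proposed syntax; names are made up
-- for illustration and are not part of the patch itself.
CREATE TABLE measurements (city text, temp int, taken date);

-- Load only rows whose temp column is above freezing; rows failing
-- the WHEN condition would be skipped rather than raising an error.
COPY measurements FROM '/tmp/measurements.csv'
    WITH (FORMAT csv)
    WHEN (temp > 0);
```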
Comments?
Regards
Surafel
Attachments
## Surafel Temesgen (surafel3000@gmail.com):

> Currently we can not moves data from a file to a table based on some
> condition on a certain column

You can:
COPY ( query ) TO 'filename';

There's even an example in the documentation:
https://www.postgresql.org/docs/10/static/sql-copy.html
"To copy into a file just the countries whose names start with 'A':
COPY (SELECT * FROM country WHERE country_name LIKE 'A%') TO '/usr1/proj/bray/sql/a_list_countries.copy';"

Regards,
Christoph

--
Spare Space.
You can:
COPY ( query ) TO 'filename';
On Thu, Oct 11, 2018 at 12:00 PM Christoph Moench-Tegeder <cmt@burggraben.net> wrote:

> You can:
> COPY ( query ) TO 'filename';

it is for COPY FROM

regards
Surafel
https://commitfest.postgresql.org/12/869/
On Thu, Oct 11, 2018 at 05:12:48AM -0400, Corey Huinker wrote:
> On Thu, Oct 11, 2018 at 5:04 AM Surafel Temesgen <surafel3000@gmail.com> wrote:
> > On Thu, Oct 11, 2018 at 12:00 PM Christoph Moench-Tegeder <cmt@burggraben.net> wrote:
> >
> >> You can:
> >> COPY ( query ) TO 'filename';
> >
> > it is for COPY FROM
> >
> > regards
> > Surafel
>
> It didn't get far, but you may want to take a look at a rejected patch for
> copy_srf() (set returning function)
> https://www.postgresql.org/message-id/CADkLM%3DdoeiWQX4AGtDNG4PsWfSXz3ai7kY%3DPZm3sUhsUeev9Bg%40mail.gmail.com
> https://commitfest.postgresql.org/12/869/
>
> Having a set returning function gives you the full expressiveness of SQL,
> at the cost of an extra materialization step.

I wonder whether something JIT-like could elide this. A very
interesting subset of such WHEN clauses could be pretty
straight-forward to implement in a pretty efficient way.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Hello,

I want to test this patch, but can't apply it.

(master)$ git apply copy_from_when_con_v1.patch
error: patch failed: src/backend/commands/copy.c:849
error: src/backend/commands/copy.c: patch does not apply

A fix or another way, let me know.

Best regards,
Myungkyu, Lim
Hello,
I want to test this patch, but can't apply it.
(master)$ git apply copy_from_when_con_v1.patch
try:

(master)$ patch -p1 < copy_from_when_con_v1.patch
regards
Surafel
Hi,

I've taken a quick look at this on the way back from pgconf.eu, and it
seems like a nice COPY improvement in a fairly good shape.

Firstly, I think it's valuable because it allows efficiently importing a
subset of data. Currently, we either have to create an intermediate table,
copy all the data into that, do the filtering, and then insert the subset
into the actual target table. Another common approach that I see in
practice is using file_fdw to do this, but it's not particularly
straight-forward (having to create the FDW servers etc. first) nor
efficient (particularly for large amounts of data). This feature allows
skipping this extra step (at least in simpler ETL cases).

So, I like the idea and I think it makes sense. A couple of comments
regarding the code/docs:

1) I think this deserves at least some regression tests. Plenty of tests
already use COPY, but there's no coverage for the new piece. So let's
add a new test suite, or maybe add a couple of tests into copy2.sql.

2) In copy.sgml, the new item is defined like this

<term><literal>WHEN Clause</literal></term>

I suggest we use just <term><literal>WHEN</literal></term>, that's what
the other items do (see ENCODING).

The other thing is that this does not say what expressions are allowed
in the WHEN clause. It seems pretty close to the WHEN clause for triggers,
which e.g. mentions that subselects are not allowed. I'm pretty sure
that's true here too, because it fails like this (118 = T_SubLink):

test=# copy t(a,b,c) from '/tmp/t.data' when ((select 1) < 10);
ERROR: unrecognized node type: 118

So, the patch needs to detect this, produce a reasonable error message
and document the limitations in copy.sgml, just like we do for CREATE
TRIGGER.

3) For COPY TO, the WHEN clause is accepted but ignored, leading to
confusing cases like this:

test=# copy t(a,b,c) to '/tmp/t.data' when ((select 100) < 10);
COPY 151690

So, it contains a subselect, but unlike COPY FROM it does not fail
(because we never execute it). The fun part is that the expression is
logically false, so a user might expect it to filter rows, yet we copy
everything.

IMHO we need to either error-out in these cases, complaining about WHEN
not being supported for COPY TO, or make it work (effectively treating
it as a simpler alternative to COPY (subselect) TO).

4) There are some minor code style issues in copy.c - the variable is
misnamed as when_cluase, there are no spaces after 'if' etc. See the
attached patch fixing this.

AFAICS we could just get rid of the extra when_cluase variable and mess
with the cstate->whenClause directly, depending on how (3) gets fixed.

5) As I mentioned, CREATE TRIGGER already has a WHEN clause, but it
requires it to be 'WHEN (expr)'. I suggest we do the same thing here,
requiring the parentheses.

6) The skip logic in CopyFrom() seems to be slightly wrong. It does
work, but the next_record label is defined after CHECK_FOR_INTERRUPTS()
so a COPY will not respond to Ctrl-C unless it finds a row matching the
WHEN condition. If you have a highly selective condition, that's a bit
inconvenient.

It also skips

MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));

so I wonder what the heap_form_tuple() right after the next_record label
will use for tuples right after a skipped one. I'd bet it'll use the
oldcontext (essentially the long-lived context), essentially making it
a memory leak.

So I suggest to get rid of the next_record label, and use 'continue'
instead of the 'goto next_record'.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
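A sketch of what such regression tests (point 1 above) might look like in copy2.sql — hypothetical, since the final syntax and error wording were still open at this point in the thread:

```sql
-- Hypothetical additions to copy2.sql; table name and expected error
-- text are invented for illustration.
CREATE TABLE copywhen_test (a int, b text);

-- Only rows satisfying the condition should be inserted.
COPY copywhen_test FROM stdin WHEN (a > 1);
1	skipped
2	kept
\.

SELECT * FROM copywhen_test;  -- expect only the row with a = 2

-- A subselect in the condition should fail with a sensible error,
-- not "unrecognized node type".
COPY copywhen_test FROM stdin WHEN ((SELECT 1) > 0);
```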
Attachments
Hello,

Basically, this patch worked very well in my tests.

> 3) For COPY TO, the WHEN clause is accepted but ignored, leading to
> confusing cases like this:

I found the same issue.

postgres=# copy t1 to '/home/lmk/t1.data' when c1 < 1;

In the 'COPY TO' statement, the 'WHEN clause' does not do any extra work
(no error or hint).

> 4) There are some minor code style issues in copy.c - the variable is
> misnamed as when_cluase, there are no spaces after 'if' etc. See the
> attached patch fixing this.

It is recommended to run pgindent when you finish development:

src/tools/pgindent/pgindent

pgindent fixes the source code style to conform to pg standards.

Best regards,
Myungkyu, Lim
1) I think this deserves at least some regression tests. Plenty of tests
already use COPY, but there's no coverage for the new piece. So let's
add a new test suite, or maybe add a couple of tests into copy2.sql.
2) In copy.sgml, the new item is defined like this
<term><literal>WHEN Clause</literal></term>
I suggest we use just <term><literal>WHEN</literal></term>, that's what
the other items do (see ENCODING).
The other thing is that this does not say what expressions are allowed
in the WHEN clause. It seems pretty close to WHEN clause for triggers,
which e.g. mentions that subselects are not allowed. I'm pretty sure
that's true here too, because it fails like this (118 = T_SubLink):
test=# copy t(a,b,c) from '/tmp/t.data' when ((select 1) < 10);
ERROR: unrecognized node type: 118
So, the patch needs to detect this, produce a reasonable error message
and document the limitations in copy.sgml, just like we do for CREATE
TRIGGER.
3) For COPY TO, the WHEN clause is accepted but ignored, leading to
confusing cases like this:
test=# copy t(a,b,c) to '/tmp/t.data' when ((select 100) < 10);
COPY 151690
So, it contains subselect, but unlike COPY FROM it does not fail
(because we never execute it). The fun part is that the expression is
logically false, so a user might expect it to filter rows, yet we copy
everything.
IMHO we need to either error-out in these cases, complaining about WHEN
not being supported for COPY TO, or make it work (effectively treating
it as a simpler alternative to COPY (subselect) TO).
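For concreteness, the two options might look like this (the error text is hypothetical, not from the patch):

```sql
-- Option 1: error out when WHEN is used with COPY TO
COPY t(a,b,c) TO '/tmp/t.data' WHEN (a < 10);
-- ERROR:  WHEN condition is not supported for COPY TO

-- Option 2 already exists in another form: filtering via a subselect
COPY (SELECT a, b, c FROM t WHERE a < 10) TO '/tmp/t.data';
```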
English is not my first language, but I chose error-out because a WHEN condition for COPY TO seems semantically incorrect to me.
AFAICS we could just get rid of the extra when_cluase variable and mess
with the cstate->whenClause directly, depending on how (3) gets fixed.
I did it this way because the CopyState structure is allocated and initialized in BeginCopyFrom, but the analysis is done before that.
5) As I mentioned, CREATE TRIGGER already has a WHEN clause, but it
requires it to be 'WHEN (expr)'. I suggest we do the same thing here,
requiring the parentheses.
6) The skip logic in CopyFrom() seems to be slightly wrong. It does
work, but the next_record label is defined after CHECK_FOR_INTERRUPTS()
so a COPY will not respond to Ctrl-C unless it finds a row matching the
WHEN condition. If you have a highly selective condition, that's a bit
inconvenient.
It also skips
MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
so I wonder what the heap_form_tuple() right after the next_record label
will use for tuples right after a skipped one. I'd bet it'll use the
oldcontext (essentially the long-lived context), essentially making it
a memory leak.
So I suggest to get rid of the next_record label, and use 'continue'
instead of the 'goto next_record'.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
On Tue, Oct 30, 2018 at 11:47 PM Surafel Temesgen <surafel3000@gmail.com> wrote:
>
> Hi,
> Thank you for looking at it.
>
> On Sun, Oct 28, 2018 at 7:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> 1) I think this deserves at least some regression tests. Plenty of tests
>> already use COPY, but there's no coverage for the new piece. So let's
>> add a new test suite, or maybe add a couple of tests into copy2.sql.
>>
>> 2) In copy.sgml, the new item is defined like this
>>
>> <term><literal>WHEN Clause</literal></term>
>>
>> I suggest we use just <term><literal>WHEN</literal></term>, that's what
>> the other items do (see ENCODING).
>>
>> The other thing is that this does not say what expressions are allowed
>> in the WHEN clause. It seems pretty close to WHEN clause for triggers,
>> which e.g. mentions that subselects are not allowed. I'm pretty sure
>> that's true here too, because it fails like this (118 = T_SubLink):
>>
>> test=# copy t(a,b,c) from '/tmp/t.data' when ((select 1) < 10);
>> ERROR: unrecognized node type: 118
>>
>> So, the patch needs to detect this, produce a reasonable error message
>> and document the limitations in copy.sgml, just like we do for CREATE
>> TRIGGER.
>
> fixed
>
>> 3) For COPY TO, the WHEN clause is accepted but ignored, leading to
>> confusing cases like this:
>>
>> test=# copy t(a,b,c) to '/tmp/t.data' when ((select 100) < 10);
>> COPY 151690
>>
>> So, it contains subselect, but unlike COPY FROM it does not fail
>> (because we never execute it). The fun part is that the expression is
>> logically false, so a user might expect it to filter rows, yet we copy
>> everything.
>>
>> IMHO we need to either error-out in these cases, complaining about WHEN
>> not being supported for COPY TO, or make it work (effectively treating
>> it as a simpler alternative to COPY (subselect) TO).
>
> English is not my first language but I chose error-out because WHEN
> condition for COPY TO seems to me semantically incorrect
>
>> AFAICS we could just get rid of the extra when_cluase variable and mess
>> with the cstate->whenClause directly, depending on how (3) gets fixed.
>
> I did it this way because CopyState structure memory allocate and
> initialize in BeginCopyFrom but the analysis done before it
>
>> 5) As I mentioned, the CREATE TRIGGER already has WHEN clause, but it
>> requires it to be 'WHEN (expr)'. I suggest we do the same thing here,
>> requiring the parentheses.
>>
>> 6) The skip logic in CopyFrom() seems to be slightly wrong. It does
>> work, but the next_record label is defined after CHECK_FOR_INTERRUPTS()
>> so a COPY will not respond to Ctrl-C unless it finds a row matching the
>> WHEN condition. If you have a highly selective condition, that's a bit
>> inconvenient.
>>
>> It also skips
>>
>> MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
>>
>> so I wonder what the heap_form_tuple() right after the next_record label
>> will use for tuples right after a skipped one. I'd bet it'll use the
>> oldcontext (essentially the long-lived context), essentially making it
>> a memory leak.
>>
>> So I suggest to get rid of the next_record label, and use 'continue'
>> instead of the 'goto next_record'.
>
> fixed

I've looked at this patch and tested it. When I use a function returning
a string in the WHEN clause I got the following error:

=# copy test from '/tmp/aaa.csv' (format 'csv') when (lower(t) = 'hello');
ERROR: could not determine which collation to use for lower() function
HINT: Use the COLLATE clause to set the collation explicitly.
CONTEXT: COPY hoge, line 1: "1,hoge,2018-01-01"

And then although I specified COLLATE I got another error (127 =
T_CollateExpr):

=# copy test from '/tmp/aaa.csv' (format 'csv') when (lower(t) collate
"en_US" = 'hello');
ERROR: unrecognized node type: 127

This error doesn't happen if I put a similar condition in the WHEN
clause for triggers. I think the patch needs to produce a reasonable
error message.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Oct 11, 2018, at 10:35 AM, David Fetter <david@fetter.org> wrote:
>
>> It didn't get far, but you may want to take a look at a rejected patch for
>> copy_srf() (set returning function)
>> https://www.postgresql.org/message-id/CADkLM%3DdoeiWQX4AGtDNG4PsWfSXz3ai7kY%3DPZm3sUhsUeev9Bg%40mail.gmail.com
>> https://commitfest.postgresql.org/12/869/
>>
>> Having a set returning function gives you the full expressiveness of SQL,
>> at the cost of an extra materialization step.
>
> I wonder whether something JIT-like could elide this. A very
> interesting subset of such WHEN clauses could be pretty
> straight-forward to implement in a pretty efficient way.

Are you thinking something like having a COPY command that provides
results in such a way that they could be referenced in a FROM clause
(perhaps a COPY that defines a cursor…)?
On Wed, Oct 31, 2018 at 11:21:33PM +0000, Nasby, Jim wrote:
> On Oct 11, 2018, at 10:35 AM, David Fetter <david@fetter.org> wrote:
> >
> >> It didn't get far, but you may want to take a look at a rejected patch for
> >> copy_srf() (set returning function)
> >> https://www.postgresql.org/message-id/CADkLM%3DdoeiWQX4AGtDNG4PsWfSXz3ai7kY%3DPZm3sUhsUeev9Bg%40mail.gmail.com
> >> https://commitfest.postgresql.org/12/869/
> >>
> >> Having a set returning function gives you the full expressiveness of SQL,
> >> at the cost of an extra materialization step.
> >
> > I wonder whether something JIT-like could elide this. A very
> > interesting subset of such WHEN clauses could be pretty
> > straight-forward to implement in a pretty efficient way.
>
> Are you thinking something like having a COPY command that provides
> results in such a way that they could be referenced in a FROM clause
> (perhaps a COPY that defines a cursor…)?

That would also be nice, but what I was thinking of was that some
highly restricted subset of cases of SQL in general could lend
themselves to levels of optimization that would be impractical in
other contexts.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
> Are you thinking something like having a COPY command that provides
> results in such a way that they could be referenced in a FROM clause
> (perhaps a COPY that defines a cursor…)?
That would also be nice, but what I was thinking of was that some
highly restricted subset of cases of SQL in general could lend
themselves to levels of optimization that would be impractical in
other contexts.
Also, what would we be saving computationally? The whole file (or program output) has to be consumed no matter what, and the columns have to be parsed no matter what. At least some of the columns have to be converted to their assigned datatypes enough to know whether or not to filter the row, but we might be able to push that logic inside a copy. I'm thinking of something like this:
SELECT x.a, sum(x.b)
FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer, b numeric, c text, d date, e json)
WHERE x.d >= '2018-11-01'
If we go the COPY-WHEN route, then we have to make up new syntax for every possible future optimization.
> Are you thinking something like having a COPY command that provides
> results in such a way that they could be referenced in a FROM clause
> (perhaps a COPY that defines a cursor…)?
That would also be nice, but what I was thinking of was that some
highly restricted subset of cases of SQL in general could lend
themselves to levels of optimization that would be impractical in
other contexts.If COPY (or a syntactical equivalent) can return a result set, then the whole of SQL is available to filter and aggregate the results and we don't have to invent new syntax, or endure confusion whenCOPY-WHEN syntax behaves subtly different from a similar FROM-WHERE.
Also, what would we be saving computationally? The whole file (or program output) has to be consumed no matter what, the columns have to be parsed no matter what. At least some of the columns have to be converted to their assigned datatypes enough to know whether or not to filter the row, but we might be able push that logic inside a copy. I'm thinking of something like this:SELECT x.a, sum(x.b)FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer, b numeric, c text, d date, e json) )WHERE x.d >= '2018-11-01'
In this case, there is the opportunity to see the following optimizations:- columns c and e are never referenced, and need never be turned into a datum (though we might do so just to confirm that they conform to the data type)- if column d is converted first, we can filter on it and avoid converting columns a,b- whatever optimizations we can infer from knowing that the two surviving columns will go directly into an aggregateIf we go this route, we can train the planner to notice other optimizations and add those mechanisms at that time, and then existing code gets faster.
If we go the COPY-WHEN route, then we have to make up new syntax for every possible future optimization.
On Tue, Oct 30, 2018 at 11:47 PM Surafel Temesgen <surafel3000@gmail.com> wrote:
I've looked at this patch and tested it.
When I use a function returning a string in the WHEN clause I got the following error:
=# copy test from '/tmp/aaa.csv' (format 'csv') when (lower(t) = 'hello');
ERROR: could not determine which collation to use for lower() function
HINT: Use the COLLATE clause to set the collation explicitly.
CONTEXT: COPY hoge, line 1: "1,hoge,2018-01-01"
And then although I specified COLLATE I got another error (127 =
T_CollateExpr):
=# copy test from '/tmp/aaa.csv' (format 'csv') when (lower(t) collate
"en_US" = 'hello');
ERROR: unrecognized node type: 127
This error doesn't happen if I put a similar condition in the WHEN
clause for triggers. I think the patch needs to produce a reasonable
error message.
Attachments
Pavel Stehule wrote:

> > SELECT x.a, sum(x.b)
> > FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer, b
> > numeric, c text, d date, e json) )
> > WHERE x.d >= '2018-11-01'
>
> Without some special feature this example is not extra useful. It is based
> on copy on server that can use only super user with full rights.

And if superuser, one might use the file data wrapper [1] to get
the same results without the need for the explicit COPY in the query.

BTW, this brings into question whether the [WHEN condition] clause
should be included in the options of file_fdw, as it supports pretty
much all COPY options.

Also, psql's \copy should gain the option too?

[1] https://www.postgresql.org/docs/current/static/file-fdw.html

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On 11/02/2018 03:57 AM, Corey Huinker wrote:
> > Are you thinking something like having a COPY command that provides
> > results in such a way that they could be referenced in a FROM clause
> > (perhaps a COPY that defines a cursor…)?
>
> That would also be nice, but what I was thinking of was that some
> highly restricted subset of cases of SQL in general could lend
> themselves to levels of optimization that would be impractical in
> other contexts.
>
> If COPY (or a syntactical equivalent) can return a result set, then the
> whole of SQL is available to filter and aggregate the results and we
> don't have to invent new syntax, or endure confusion when COPY-WHEN
> syntax behaves subtly different from a similar FROM-WHERE.
>
> Also, what would we be saving computationally? The whole file (or
> program output) has to be consumed no matter what, the columns have to
> be parsed no matter what. At least some of the columns have to be
> converted to their assigned datatypes enough to know whether or not to
> filter the row, but we might be able push that logic inside a copy. I'm
> thinking of something like this:
>
> SELECT x.a, sum(x.b)
> FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer,
> b numeric, c text, d date, e json) )
> WHERE x.d >= '2018-11-01'
>
> In this case, there is the /opportunity/ to see the following optimizations:
> - columns c and e are never referenced, and need never be turned into a
> datum (though we might do so just to confirm that they conform to the
> data type)
> - if column d is converted first, we can filter on it and avoid
> converting columns a,b
> - whatever optimizations we can infer from knowing that the two
> surviving columns will go directly into an aggregate
>
> If we go this route, we can train the planner to notice other
> optimizations and add those mechanisms at that time, and then existing
> code gets faster.
>
> If we go the COPY-WHEN route, then we have to make up new syntax for
> every possible future optimization.

IMHO those two things address vastly different use-cases. The COPY WHEN
case deals with filtering data while importing them into a database,
while what you're describing seems to be more about querying data stored
in a CSV file.

But we already do have a solution for that - FDW, and I'd say it's
working pretty well. And AFAIK it does give you tools to implement most
of what you're asking for. I don't see why we should bolt this on top of
COPY, or how it is an alternative to COPY WHEN.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 01, 2018 at 10:57:25PM -0400, Corey Huinker wrote:
> > > Are you thinking something like having a COPY command that provides
> > > results in such a way that they could be referenced in a FROM clause
> > > (perhaps a COPY that defines a cursor…)?
> >
> > That would also be nice, but what I was thinking of was that some
> > highly restricted subset of cases of SQL in general could lend
> > themselves to levels of optimization that would be impractical in
> > other contexts.
>
> If COPY (or a syntactical equivalent) can return a result set, then the
> whole of SQL is available to filter and aggregate the results and we don't
> have to invent new syntax, or endure confusion when COPY-WHEN syntax behaves
> subtly different from a similar FROM-WHERE.

That's an excellent point.

> Also, what would we be saving computationally? The whole file (or program
> output) has to be consumed no matter what, the columns have to be parsed no
> matter what. At least some of the columns have to be converted to their
> assigned datatypes enough to know whether or not to filter the row, but we
> might be able push that logic inside a copy. I'm thinking of something like
> this:
>
> SELECT x.a, sum(x.b)
> FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer, b numeric, c text, d date, e json) )

Apologies for bike-shedding, but wouldn't the following be a better
fit with the current COPY?

COPY t(a integer, b numeric, c text, d date, e json)
FROM '/path/to/foo.txt'
WITH (FORMAT CSV, INLINE)

> WHERE x.d >= '2018-11-01'
>
> In this case, there is the *opportunity* to see the following optimizations:
> - columns c and e are never referenced, and need never be turned into a
> datum (though we might do so just to confirm that they conform to the data
> type)

That sounds like something that could go inside the WITH extension I'm
proposing above. [STRICT_TYPE boolean DEFAULT true]? This might not be
something that has to be in version 1.

Best,
David.
On Fri, Nov 02, 2018 at 12:58:12PM +0100, Daniel Verite wrote:
> Pavel Stehule wrote:
> > > SELECT x.a, sum(x.b)
> > > FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer, b
> > > numeric, c text, d date, e json) )
> > > WHERE x.d >= '2018-11-01'
> >
> > Without some special feature this example is not extra useful. It is based
> > on copy on server that can use only super user with full rights.
>
> And if superuser, one might use the file data wrapper [1] to get
> the same results without the need for the explicit COPY in the query.
>
> BTW, this brings into question whether the [WHEN condition] clause
> should be included in the options of file_fdw, as it supports pretty
> much all COPY options.
>
> Also, psql's \copy should gain the option too?

tl;dr: \copy support is a very large can of worms.

psql's \copy is something which should probably be handled separately
from COPY, as it's both a way to access the filesystem without
superuser permission and an interface to the COPY part of the protocol.

It seems like poor design to add grammar to support a single client, so
we'd need to think about this in terms of what we want to support on
the client side independent of specific clients. It also seems like a
violation of separation of concerns to couple FEBE to grammar, so
there'd need to be some way to do those things separately, too.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
> SELECT x.a, sum(x.b)
> FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer, b numeric, c text, d date, e json) )
Apologies for bike-shedding, but wouldn't the following be a better
fit with the current COPY?
COPY t(a integer, b numeric, c text, d date, e json) FROM '/path/to/foo.txt' WITH (FORMAT CSV, INLINE)
David Fetter wrote:

> It also seems like a violation of separation of concerns to couple
> FEBE to grammar, so there'd need to be some way to do those things
> separately, too.

After re-reading psql/copy.c, I withdraw what I said upthread: it
doesn't appear necessary to add anything to support the WHEN condition
with \copy.

\copy does have a dedicated mini-parser, but it doesn't try to
recognize every option: it's only concerned with getting the bits of
information that are needed to perform the client-side work:

- whether it's a copy from or to
- what exact form and value the 'filename' argument has immediately
after from or to: '<file path>' | PROGRAM '<command>' | stdin | stdout |
pstdin | pstdout

It doesn't really care what the options are, just where they are in the
buffer, so they can be copied into the COPY SQL statement. From the code:

* The documented syntax is:
* \copy tablename [(columnlist)] from|to filename [options]
* \copy ( query stmt ) to filename [options]

The WHEN clause would be part of the [options], which are handled as
simply as this in parse_slash_copy():

/* Collect the rest of the line (COPY options) */
token = strtokx(NULL, "", NULL, NULL, 0, false, false, pset.encoding);
if (token)
    result->after_tofrom = pg_strdup(token);

So unless there's something particular in the WHEN clause expression
that could make this strtokx() invocation error out, this should work
directly.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
As a newcomer to this patch, when I read this example:

COPY table_name WHEN (some_condition)

.. I expect COPY to only be run when the condition is true, and I do
not expect the WHEN clause to filter rows. I'm curious what you think
about:

COPY table_name WHERE (some_condition)

Users should already be familiar with the idea that WHERE performs a
filter.
On 11/9/18 4:51 PM, Adam Berlin wrote:
> As a newcomer to this patch, when I read this example:
>
> COPY table_name WHEN (some_condition)
>
> .. I expect COPY to only be run when the condition is true, and I do
> not expect the WHEN clause to filter rows. I'm curious what you think
> about:

Hmmm. Currently we have a WHEN clause in three places:

1) triggers, where it specifies an arbitrary expression on rows,
determining whether the trigger should be invoked at all

2) event triggers, where it does about the same thing as for regular
triggers, but only expressions "column IN (...)" are allowed

3) CASE WHEN ... THEN ...

I'd say 3 is rather unrelated to this discussion, and 1+2 seem to
support your objection to using WHEN for COPY, as it kinda specifies
whether the command itself should be executed.

> COPY table_name WHERE (some_condition)
>
> Users should already be familiar with the idea that WHERE performs a filter.

So, what about using FILTER here? We already use it for aggregates when
filtering rows to process.

That being said, I have no strong feelings either way. I'd be OK with
both WHEN and WHERE.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>> COPY table_name WHERE (some_condition)
>>
>> Users should already be familiar with the idea that WHERE performs a filter.
>
> So, what about using FILTER here? We already use it for aggregates when
> filtering rows to process.
>
> That being said, I have no strong feelings either way. I'd be OK with
> both WHEN and WHERE.

I don't think it's an important point, but in gram.y:

where_clause:
    WHERE a_expr { $$ = $2; }
    | /*EMPTY*/ { $$ = NULL; }
    ;

This is similar to the 'opt_when_clause' in this patch. So, I think
'WHERE' is the better form.

BTW, the 3rd patch worked very well in my tests. However, some wrong
code style still exists:

Node *whenClause= NULL;
cstate->whenClause=whenClause;

Best regards,
Myungkyu, Lim
So, what about using FILTER here? We already use it for aggregates when
filtering rows to process.
Attachments
On 11/23/18 12:14 PM, Surafel Temesgen wrote:
> On Sun, Nov 11, 2018 at 11:59 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> So, what about using FILTER here? We already use it for aggregates when
>> filtering rows to process.
>
> i think its good idea and describe its purpose more. Attache is a
> patch that use FILTER instead

Thanks, looks good to me. A couple of minor points:

1) While comparing this to the FILTER clause we already have for
aggregates, I've noticed the aggregate version is

    FILTER '(' WHERE a_expr ')'

while here we have

    FILTER '(' a_expr ')'

For a while I was thinking that maybe we should use the same syntax
here, but I don't think so. The WHERE bit seems rather unnecessary and
we probably implemented it only because it's required by the SQL
standard, which does not apply to COPY. So I think this is fine.

2) The various parser checks emit errors like this:

    case EXPR_KIND_COPY_FILTER:
        err = _("cannot use subquery in copy from FILTER condition");
        break;

I think the "copy from" should be capitalized, to make it clear that it
refers to a COPY FROM command and not a copy of something.

3) I think there should be regression tests for these prohibited things,
i.e. for a set-returning function, for a non-existent column, etc.

4) What might be somewhat confusing for users is that the filter uses a
single snapshot to evaluate the conditions for all rows. That is, one
might do this

    create or replace function f() returns int as $$
        select count(*)::int from t;
    $$ language sql;

and hope that

    copy t from '/...' filter (f() <= 100);

only ever imports the first 100 rows - but that's not true, of course,
because f() uses the snapshot acquired at the very beginning. For
example INSERT SELECT does behave differently:

    test=# copy t from '/home/user/t.data' filter (f() < 100);
    COPY 81
    test=# insert into t select * from t where f() < 100;
    INSERT 0 19

Obviously, this is not an issue when the filter clause references only
the input row (which I assume will be the primary use case).

Not sure if this is expected / appropriate behavior, and if the patch
needs to do something else here.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
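The snapshot asymmetry described above can be mimicked with a toy model (Python, purely illustrative — the function names and the 19-row/200-row figures are invented for the sketch, not taken from the actual test): a predicate evaluated against a count frozen at statement start gives the same answer for every row, while per-row re-evaluation sees its own inserts and stops at the limit.

```python
# Toy model (not PostgreSQL code): contrast a predicate evaluated
# against a snapshot taken once at statement start with one that is
# re-evaluated against the live table after every insert.

def copy_with_frozen_snapshot(table, rows, limit):
    """COPY-style: f() sees the table as it was when the command began."""
    frozen_count = len(table)          # snapshot acquired once
    for row in rows:
        if frozen_count < limit:       # same answer for every row
            table.append(row)
    return table

def insert_with_live_count(table, rows, limit):
    """Per-row re-evaluation: the predicate sees earlier inserts."""
    for row in rows:
        if len(table) < limit:         # re-checked after each insert
            table.append(row)
    return table

t1 = copy_with_frozen_snapshot(list(range(19)), list(range(200)), 100)
t2 = insert_with_live_count(list(range(19)), list(range(200)), 100)
print(len(t1))  # 219: 19 < 100 held at snapshot time, so all 200 rows load
print(len(t2))  # 100: inserting stops once the live count reaches the limit
```

This is only a model of the visibility rules, not of how COPY evaluates quals; the point is that under a single frozen snapshot the "limit" is never re-checked.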
On Sat, 24 Nov 2018 at 02:09, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 11/23/18 12:14 PM, Surafel Temesgen wrote:
>> i think its good idea and describe its purpose more. Attache is a
>> patch that use FILTER instead

FWIW, I vote for just using WHERE here.

Right now we have 2 syntaxes for filtering rows in queries, both of
which use WHERE immediately before the condition:

1). SELECT ... FROM ... WHERE condition
2). SELECT agg_fn FILTER (WHERE condition) FROM ...

I'm not a huge fan of (2), but that's the SQL standard, so we're stuck
with it. There's a certain consistency in its use of WHERE to
introduce the condition, and preceding that with FILTER helps to
distinguish it from any later WHERE clause. But what you'd be adding
here would be a 3rd syntax:

3). COPY ... FROM ... FILTER condition

which IMO will just lead to confusion.

Regards,
Dean
> Right now we have 2 syntaxes for filtering rows in queries, both of
> which use WHERE immediately before the condition:
>
> 1). SELECT ... FROM ... WHERE condition
> 2). SELECT agg_fn FILTER (WHERE condition) FROM ...
>
> I'm not a huge fan of (2), but that's the SQL standard, so we're stuck
> with it. There's a certain consistency in its use of WHERE to
> introduce the condition, and preceding that with FILTER helps to
> distinguish it from any later WHERE clause. But what you'd be adding
> here would be a 3rd syntax:
>
> 3). COPY ... FROM ... FILTER condition
>
> which IMO will just lead to confusion.

Your case is for retrieving data, but this is for deciding which data
to insert; I think the word FILTER describes that better and will not
lead to confusion.
> 1) While comparing this to the FILTER clause we already have for
> aggregates, I've noticed the aggregate version is
>
>     FILTER '(' WHERE a_expr ')'
>
> while here we have
>
>     FILTER '(' a_expr ')'
>
> For a while I was thinking that maybe we should use the same syntax
> here, but I don't think so. The WHERE bit seems rather unnecessary and
> we probably implemented it only because it's required by the SQL
> standard, which does not apply to COPY. So I think this is fine.

ok

> 2) The various parser checks emit errors like this:
>
>     case EXPR_KIND_COPY_FILTER:
>         err = _("cannot use subquery in copy from FILTER condition");
>         break;
>
> I think the "copy from" should be capitalized, to make it clear that it
> refers to a COPY FROM command and not a copy of something.
>
> 3) I think there should be regression tests for these prohibited things,
> i.e. for a set-returning function, for a non-existent column, etc.

fixed

> 4) What might be somewhat confusing for users is that the filter uses a
> single snapshot to evaluate the conditions for all rows. That is, one
> might do this
>
>     create or replace function f() returns int as $$
>         select count(*)::int from t;
>     $$ language sql;
>
> and hope that
>
>     copy t from '/...' filter (f() <= 100);
>
> only ever imports the first 100 rows - but that's not true, of course,
> because f() uses the snapshot acquired at the very beginning. For
> example INSERT SELECT does behave differently:
>
>     test=# copy t from '/home/user/t.data' filter (f() < 100);
>     COPY 81
>     test=# insert into t select * from t where f() < 100;
>     INSERT 0 19
>
> Obviously, this is not an issue when the filter clause references only
> the input row (which I assume will be the primary use case).
>
> Not sure if this is expected / appropriate behavior, and if the patch
> needs to do something else here.

Comparing with the overhead of setting a snapshot before evaluating
every row, and considering that this kind of usage is not frequent, it
seems to me the behavior is acceptable.

Attachments
On 11/26/18 2:25 PM, Surafel Temesgen wrote:
> Comparing with the overhead of setting a snapshot before evaluating
> every row, and considering that this kind of usage is not frequent, it
> seems to me the behavior is acceptable.

I'm not really buying the argument that this behavior is acceptable
simply because using the feature like this will be uncommon. That seems
like a rather weak reason to accept it.

I however agree we don't want to make COPY less efficient, at least not
unless absolutely necessary. But I think we can handle this simply by
restricting what's allowed to appear in the FILTER clause.

It should be fine to allow IMMUTABLE and STABLE functions, but not
VOLATILE ones. That should fix the example I shared, because f() would
not be allowed. We could mark it as STABLE explicitly, which would make
it usable in the FILTER clause, but STABLE says "same result for all
calls in the statement" so the behavior would be expected and
essentially legal (in a way it'd be mislabeling, but we trust it in
other places too).

So I think there are three options:

(a) accept the current behavior (single snapshot, same result for all
    function calls)

(b) prohibit VOLATILE functions in the FILTER clause altogether

(c) allow VOLATILE functions in the FILTER clause, but change the
    behavior to make it sane

I definitely vote -1 on (a) unless someone presents a much better
argument for allowing it than "it's uncommon".

Which leaves us with (b) and (c). Clearly, (b) is simpler to implement,
because (c) needs to do the detection too, and then some additional
stuff. I'm not sure how much more complex (c) is, compared to (b).

After eyeballing the copy.c code for a bit, I think it would need to
use CIM_SINGLE when there are volatile functions, similarly to volatile
default values (see volatile_defexprs), and then increment the command
ID after each insert.

Of course, we don't want to do this by default, but only when actually
needed (with VOLATILE functions), because the multi-inserts are quite a
bit more efficient.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> (c) allow VOLATILE functions in the FILTER clause, but change the
> behavior to make it sane

Did changing the behavior mean getting a new snapshot before evaluating
each tuple, to ensure the function sees the results of any previously
executed queries, or is there another mechanism that can make the
behavior sane?

> Which leaves us with (b) and (c). Clearly, (b) is simpler to implement,
> because (c) needs to do the detection too, and then some additional
> stuff. I'm not sure how much more complex (c) is, compared to (b).

The attached patch implements option (b), prohibiting VOLATILE
functions, but I am open to change.

Attachments
On 11/30/18 2:00 PM, Surafel Temesgen wrote:
> On Thu, Nov 29, 2018 at 2:17 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> (c) allow VOLATILE functions in the FILTER clause, but change the
>> behavior to make it sane
>
> Did changing the behavior means getting new snapshot before evaluating
> a tuple to ensure the function sees results of any previously executed
> queries or there are other mechanism that can make the behavior sane?

I think it should be enough just to switch to CIM_SINGLE and increment
the command counter after each inserted row.

> The attache patch implement option b prohibit VOLATILE functions but i
> am open to change

OK. I'll take a look.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
After reading this thread, I think I like WHERE better than FILTER.
Tally:

WHERE: Adam Berlin, Lim Myungkyu, Dean Rasheed, yours truly
FILTER: Tomas Vondra, Surafel Temesgen

Couldn't find others expressing an opinion in this regard.

On 2018-Nov-30, Tomas Vondra wrote:
> I think it should be enough just to switch to CIM_SINGLE and increment
> the command counter after each inserted row.

Do we apply a command counter increment per row with some other COPY
option?

Per-row CCI makes me a bit uncomfortable because you'd get in trouble
with a large copy. I think it's particularly nasty here, precisely
because you may want to filter out some rows of a very large file, and
the CCI may prevent that from working.

I'm not convinced by the example case of reading how many tuples you've
imported so far in the WHERE/WHEN/FILTER clause each time (that'd
become incrementally slower as it progresses).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/4/18 10:44 AM, Alvaro Herrera wrote:
> After reading this thread, I think I like WHERE better than FILTER.
> Tally:
>
> WHERE: Adam Berlin, Lim Myungkyu, Dean Rasheed, yours truly
> FILTER: Tomas Vondra, Surafel Temesgen
>
> Couldn't find others expressing an opinion in this regard.

While I still like FILTER more, I won't object to using WHERE if others
think it's a better choice.

> On 2018-Nov-30, Tomas Vondra wrote:
>> I think it should be enough just to switch to CIM_SINGLE and
>> increment the command counter after each inserted row.
>
> Do we apply a command counter increment per row with some other COPY
> option?

I don't think we increment the command counter anywhere, most likely
because COPY is not allowed to run any queries directly so far.

> Per-row CCI makes me a bit uncomfortable because you'd get in trouble
> with a large copy. I think it's particularly nasty here, precisely
> because you may want to filter out some rows of a very large file, and
> the CCI may prevent that from working.

Sure.

> I'm not convinced by the example case of reading how many tuples
> you've imported so far in the WHERE/WHEN/FILTER clause each time
> (that'd become incrementally slower as it progresses).

Well, not sure how else am I supposed to convince you? It's an example
of a behavior that's IMHO surprising and inconsistent with things that
might reasonably be expected to behave similarly. It may not be a
perfect example, but that's the price for simplicity.

FWIW, another way to achieve mostly the same filtering feature is a
BEFORE INSERT trigger:

    create or replace function copy_filter() returns trigger as $$
    declare
        v_c int;
    begin
        select count(*) into v_c from t;
        if v_c >= 100 then
            return null;
        end if;
        return NEW;
    end;
    $$ language plpgsql;

    create trigger filter before insert on t
        for each row execute procedure copy_filter();

This behaves consistently with INSERT, i.e. it enforces the total count
constraint the same way. And the COPY FILTER behaves differently.

FWIW I do realize this is not a particularly great check - for example,
it will not see effects of concurrent transactions etc. All I'm saying
is I find it annoying/strange that it behaves differently.

Also, considering the trigger does the right thing, maybe I spoke too
early about the command counter not being incremented?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Nov 28, 2018 at 6:17 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I however agree we don't want to make COPY less efficient, at least not
> unless absolutely necessary. But I think we can handle this simply by
> restricting what's allowed to appear in the FILTER clause.
>
> It should be fine to allow IMMUTABLE and STABLE functions, but not
> VOLATILE ones. That should fix the example I shared, because f() would
> not be allowed.

I don't think that's a very good solution. It's perfectly sensible for
someone to want to do WHERE/FILTER random() < 0.01 to load a smattering
of rows, and this would rule that out for no very good reason.

I think it would be fine to just document that if the filter condition
examines the state of the database, it will not see the results of the
COPY which is in progress. I'm pretty sure there are other cases - for
example with triggers - where you can get code to run that can't see
the results of the command currently in progress, so I don't really buy
the idea that having a feature which works that way is categorically
unacceptable.

I agree that we can't justify flagrantly wrong behavior on the grounds
that correct behavior is expensive or on the grounds that the incorrect
cases will be rare. However, when there's more than one sensible
behavior, it's not unreasonable to pick one that is easier to implement
or cheaper or whatever, and that's the category into which I would put
this decision.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
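The sampling use case above can be sketched outside SQL (Python, illustrative only — `copy_with_filter` is an invented stand-in, not a PostgreSQL API): a volatile predicate like `random() < 0.01` examines no database state at all, so forbidding volatile functions would rule it out needlessly.

```python
import random

def copy_with_filter(rows, predicate):
    """Toy COPY ... WHERE: keep only the input rows passing the predicate."""
    return [row for row in rows if predicate(row)]

random.seed(42)  # deterministic for the example
rows = list(range(100_000))

# Volatile predicate: depends on nothing but an RNG draw, analogous to
#   COPY t FROM '...' WHERE random() < 0.01
sample = copy_with_filter(rows, lambda row: random.random() < 0.01)
print(len(sample))  # roughly 1% of 100,000 rows
```

Because such a predicate never consults the table, the single-snapshot question simply does not arise for it; this is presumably why it is the more plausible use case.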
On 12/6/18 4:52 PM, Robert Haas wrote:
> I don't think that's a very good solution. It's perfectly sensible for
> someone to want to do WHERE/FILTER random() < 0.01 to load a smattering
> of rows, and this would rule that out for no very good reason.

Good point. I agree that's a much more plausible use case for this
feature, and forbidding volatile functions would break it.

> I think it would be fine to just document that if the filter condition
> examines the state of the database, it will not see the results of the
> COPY which is in progress. [...]
>
> I agree that we can't justify flagrantly wrong behavior on the grounds
> that correct behavior is expensive or on the grounds that the incorrect
> cases will be rare. However, when there's more than one sensible
> behavior, it's not unreasonable to pick one that is easier to implement
> or cheaper or whatever, and that's the category into which I would put
> this decision.

OK, makes sense. I withdraw my objections to the original behavior, and
agree it's acceptable if it's reasonably documented.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 4, 2018 at 12:44 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> After reading this thread, I think I like WHERE better than FILTER.
> Tally:
>
> WHERE: Adam Berlin, Lim Myungkyu, Dean Rasheed, yours truly
> FILTER: Tomas Vondra, Surafel Temesgen

accepting tally result i change the keyword to WHERE condition and
allow volatile function with single insertion method

Attachments
On 12/12/18 7:05 AM, Surafel Temesgen wrote:
> On Tue, Dec 4, 2018 at 12:44 PM Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> After reading this thread, I think I like WHERE better than FILTER.
>> Tally:
>>
>> WHERE: Adam Berlin, Lim Myungkyu, Dean Rasheed, yours truly
>> FILTER: Tomas Vondra, Surafel Temesgen
>
> accepting tally result i change the keyword to WHERE condition and
> allow volatile function with single insertion method

Can you also update the docs to mention that the functions called from
the WHERE clause do not see the effects of the COPY itself?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> Can you also update the docs to mention that the functions called from
> the WHERE clause do not see the effects of the COPY itself?

Of course, i also add same comment to insertion method selection

Attachments
On 12/13/18 8:09 AM, Surafel Temesgen wrote:
> On Wed, Dec 12, 2018 at 9:28 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Can you also update the docs to mention that the functions called from
>> the WHERE clause do not see the effects of the COPY itself?
>
> Of course, i also add same comment to insertion method selection

FWIW I've marked this as RFC and plan to get it committed this week.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/14/19 10:25 PM, Tomas Vondra wrote:
> FWIW I've marked this as RFC and plan to get it committed this week.

Pushed, thanks for the patch.

cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2019-01-20 00:24:05 +0100, Tomas Vondra wrote:
> Pushed, thanks for the patch.

While rebasing the pluggable storage patch ontop of this I noticed that
the qual appears to be evaluated in query context. Isn't that a bad
idea? ISTM it should have been evaluated a few lines above, before the:

    /* Triggers and stuff need to be invoked in query context. */
    MemoryContextSwitchTo(oldcontext);

Yes, that'd require moving the ExecStoreHeapTuple(), but that seems ok?

Greetings,

Andres Freund
On 1/20/19 8:24 PM, Andres Freund wrote:
> While rebasing the pluggable storage patch ontop of this I noticed that
> the qual appears to be evaluated in query context. Isn't that a bad
> idea? ISTM it should have been evaluated a few lines above, before the:
>
>     /* Triggers and stuff need to be invoked in query context. */
>     MemoryContextSwitchTo(oldcontext);
>
> Yes, that'd require moving the ExecStoreHeapTuple(), but that seems ok?

Yes, I agree. It's a bit too late for me to hack and push stuff, but
I'll fix that tomorrow.

cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-01-20 21:00:21 -0500, Tomas Vondra wrote:
> Yes, I agree. It's a bit too late for me to hack and push stuff, but
> I'll fix that tomorrow.

NP. On second thought, the problem is probably smaller than I thought at
first, because ExecQual() switches to the econtext's per-tuple memory
context. But it's only reset once for each batch, so there's some
wastage. At least worth a comment.

Greetings,

Andres Freund
On 2019-01-20 18:08:05 -0800, Andres Freund wrote:
> NP. On second thought, the problem is probably smaller than I thought at
> first, because ExecQual() switches to the econtext's per-tuple memory
> context. But it's only reset once for each batch, so there's some
> wastage. At least worth a comment.

I'm tired, but perhaps it's actually worse - what's being reset
currently is the EState's per-tuple context:

    if (nBufferedTuples == 0)
    {
        /*
         * Reset the per-tuple exprcontext. We can only do this if the
         * tuple buffer is empty. (Calling the context the per-tuple
         * memory context is a bit of a misnomer now.)
         */
        ResetPerTupleExprContext(estate);
    }

but the quals are evaluated in the ExprContext's:

    ExecQual(ExprState *state, ExprContext *econtext)
    ...
        ret = ExecEvalExprSwitchContext(state, econtext, &isnull);

which is created with:

    /* Get an EState's per-output-tuple exprcontext, making it if first use */
    #define GetPerTupleExprContext(estate) \
        ((estate)->es_per_tuple_exprcontext ? \
         (estate)->es_per_tuple_exprcontext : \
         MakePerTupleExprContext(estate))

and creates its own context:

    /*
     * Create working memory for expression evaluation in this context.
     */
    econtext->ecxt_per_tuple_memory =
        AllocSetContextCreate(estate->es_query_cxt,
                              "ExprContext",
                              ALLOCSET_DEFAULT_SIZES);

so this is currently just never reset.

Seems just using ExecQualAndReset() ought to be sufficient?

Greetings,

Andres Freund
On 1/21/19 3:12 AM, Andres Freund wrote:
> I'm tired, but perhaps it's actually worse - what's being reset
> currently is the EState's per-tuple context:
> [...]
> so this is currently just never reset.

So context, much simple. Wow.

> Seems just using ExecQualAndReset() ought to be sufficient?

Seems like it.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/21/19 3:12 AM, Andres Freund wrote:
> I'm tired, but perhaps it's actually worse - what's being reset
> currently is the EState's per-tuple context:
> [...]
> so this is currently just never reset.

Actually, no. The ResetPerTupleExprContext boils down to

    MemoryContextReset((econtext)->ecxt_per_tuple_memory)

and ExecEvalExprSwitchContext does this

    MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);

So it's resetting the right context, although only on batch boundary.
But now I see 31f38174 does this:

    else if (cstate->whereClause != NULL ||
             contain_volatile_functions(cstate->whereClause))
    {
        ...
        insertMethod = CIM_SINGLE;
    }

so it does not do batching with WHERE. But the condition seems wrong, I
guess it should be && instead of ||. Will investigate in the morning.

> Seems just using ExecQualAndReset() ought to be sufficient?

That may still be the right thing to do.

cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
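The reset-granularity question above can be modeled abstractly (Python, illustrative only — `Arena` is an invented stand-in for a memory context, not a PostgreSQL structure): per-tuple scratch allocated in a context that is only reset per batch, or never, accumulates, while resetting after each qual evaluation keeps peak usage flat.

```python
class Arena:
    """Toy stand-in for a memory context: tracks bytes 'allocated'."""
    def __init__(self):
        self.allocated = 0
    def alloc(self, n):
        self.allocated += n
    def reset(self):
        self.allocated = 0

def eval_quals(n_tuples, per_tuple_bytes, reset_each_tuple):
    """Return peak arena usage while evaluating a qual for n_tuples rows."""
    ctx = Arena()
    peak = 0
    for _ in range(n_tuples):
        ctx.alloc(per_tuple_bytes)   # scratch for one qual evaluation
        peak = max(peak, ctx.allocated)
        if reset_each_tuple:
            ctx.reset()              # ExecQualAndReset-style per-row reset

    return peak

print(eval_quals(1000, 64, reset_each_tuple=True))   # 64: usage stays flat
print(eval_quals(1000, 64, reset_each_tuple=False))  # 64000: grows with rows
```

This is only an accounting model; the real question in copy.c is which context is reset and when, but the cost shape of "never/rarely reset" versus "reset per tuple" is the same.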
On 1/21/19 4:33 AM, Tomas Vondra wrote: > > > On 1/21/19 3:12 AM, Andres Freund wrote: >> On 2019-01-20 18:08:05 -0800, Andres Freund wrote: >>> On 2019-01-20 21:00:21 -0500, Tomas Vondra wrote: >>>> >>>> >>>> On 1/20/19 8:24 PM, Andres Freund wrote: >>>>> Hi, >>>>> >>>>> On 2019-01-20 00:24:05 +0100, Tomas Vondra wrote: >>>>>> On 1/14/19 10:25 PM, Tomas Vondra wrote: >>>>>>> On 12/13/18 8:09 AM, Surafel Temesgen wrote: >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Dec 12, 2018 at 9:28 PM Tomas Vondra >>>>>>>> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> Can you also update the docs to mention that the functions called from >>>>>>>> the WHERE clause does not see effects of the COPY itself? >>>>>>>> >>>>>>>> >>>>>>>> /Of course, i also add same comment to insertion method selection >>>>>>>> / >>>>>>> >>>>>>> FWIW I've marked this as RFC and plan to get it committed this week. >>>>>>> >>>>>> >>>>>> Pushed, thanks for the patch. >>>>> >>>>> While rebasing the pluggable storage patch ontop of this I noticed that >>>>> the qual appears to be evaluated in query context. Isn't that a bad >>>>> idea? ISMT it should have been evaluated a few lines above, before the: >>>>> >>>>> /* Triggers and stuff need to be invoked in query context. */ >>>>> MemoryContextSwitchTo(oldcontext); >>>>> >>>>> Yes, that'd require moving the ExecStoreHeapTuple(), but that seems ok? >>>>> >>>> >>>> Yes, I agree. It's a bit too late for me to hack and push stuff, but I'll >>>> fix that tomorrow. >>> >>> NP. On second thought, the problem is probably smaller than I thought at >>> first, because ExecQual() switches to the econtext's per-tuple memory >>> context. But it's only reset once for each batch, so there's some >>> wastage. At least worth a comment. >> >> I'm tired, but perhaps its actually worse - what's being reset currently >> is the ESTate's per-tuple context: >> >> if (nBufferedTuples == 0) >> { >> /* >> * Reset the per-tuple exprcontext. 
We can only do this if the >> * tuple buffer is empty. (Calling the context the per-tuple >> * memory context is a bit of a misnomer now.) >> */ >> ResetPerTupleExprContext(estate); >> } >> >> but the quals are evaluated in the ExprContext's: >> >> ExecQual(ExprState *state, ExprContext *econtext) >> ... >> ret = ExecEvalExprSwitchContext(state, econtext, &isnull); >> >> >> which is created with: >> >> /* Get an EState's per-output-tuple exprcontext, making it if first use */ >> #define GetPerTupleExprContext(estate) \ >> ((estate)->es_per_tuple_exprcontext ? \ >> (estate)->es_per_tuple_exprcontext : \ >> MakePerTupleExprContext(estate)) >> >> and creates its own context: >> /* >> * Create working memory for expression evaluation in this context. >> */ >> econtext->ecxt_per_tuple_memory = >> AllocSetContextCreate(estate->es_query_cxt, >> "ExprContext", >> ALLOCSET_DEFAULT_SIZES); >> >> so this is currently just never reset. > > Actually, no. The ResetPerTupleExprContext boils down to > > MemoryContextReset((econtext)->ecxt_per_tuple_memory) > > and ExecEvalExprSwitchContext does this > > MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory); > > So it's resetting the right context, although only on batch boundary. > But now I see 31f38174 does this: > > else if (cstate->whereClause != NULL || > contain_volatile_functions(cstate->whereClause)) > { > ... > insertMethod = CIM_SINGLE; > } > > so it does not do batching with WHERE. But the condition seems wrong, I > guess it should be && instead of ||. Will investigate in the morning. > I think the condition can be just if (contain_volatile_functions(cstate->whereClause)) { ... } Per the attached patch. Surafel, do you agree? >> Seems just using ExecQualAndReset() ought to be sufficient? >> > > That may still be the right thing to do. > Actually, no, because that would reset the context far too early (and it's easy to trigger segfaults). So the reset would have to happen after processing the row, not this early. 
But I think the current behavior is actually OK, as it matches what we do for defexprs. And the comment before ResetPerTupleExprContext says this: /* * Reset the per-tuple exprcontext. We can only do this if the * tuple buffer is empty. (Calling the context the per-tuple * memory context is a bit of a misnomer now.) */ So the per-tuple context is not quite per-tuple anyway. Sure, we might rework that but I don't think that's an issue in this patch. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения

Hi, On 2019-01-21 16:22:11 +0100, Tomas Vondra wrote: > > > On 1/21/19 4:33 AM, Tomas Vondra wrote: > > > > > > On 1/21/19 3:12 AM, Andres Freund wrote: > >> On 2019-01-20 18:08:05 -0800, Andres Freund wrote: > >>> On 2019-01-20 21:00:21 -0500, Tomas Vondra wrote: > >>>> > >>>> > >>>> On 1/20/19 8:24 PM, Andres Freund wrote: > >>>>> Hi, > >>>>> > >>>>> On 2019-01-20 00:24:05 +0100, Tomas Vondra wrote: > >>>>>> On 1/14/19 10:25 PM, Tomas Vondra wrote: > >>>>>>> On 12/13/18 8:09 AM, Surafel Temesgen wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> On Wed, Dec 12, 2018 at 9:28 PM Tomas Vondra > >>>>>>>> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> Can you also update the docs to mention that the functions called from > >>>>>>>> the WHERE clause does not see effects of the COPY itself? > >>>>>>>> > >>>>>>>> > >>>>>>>> /Of course, i also add same comment to insertion method selection > >>>>>>>> / > >>>>>>> > >>>>>>> FWIW I've marked this as RFC and plan to get it committed this week. > >>>>>>> > >>>>>> > >>>>>> Pushed, thanks for the patch. > >>>>> > >>>>> While rebasing the pluggable storage patch ontop of this I noticed that > >>>>> the qual appears to be evaluated in query context. Isn't that a bad > >>>>> idea? ISMT it should have been evaluated a few lines above, before the: > >>>>> > >>>>> /* Triggers and stuff need to be invoked in query context. */ > >>>>> MemoryContextSwitchTo(oldcontext); > >>>>> > >>>>> Yes, that'd require moving the ExecStoreHeapTuple(), but that seems ok? > >>>>> > >>>> > >>>> Yes, I agree. It's a bit too late for me to hack and push stuff, but I'll > >>>> fix that tomorrow. > >>> > >>> NP. On second thought, the problem is probably smaller than I thought at > >>> first, because ExecQual() switches to the econtext's per-tuple memory > >>> context. But it's only reset once for each batch, so there's some > >>> wastage. At least worth a comment. 
> >> > >> I'm tired, but perhaps its actually worse - what's being reset currently > >> is the ESTate's per-tuple context: > >> > >> if (nBufferedTuples == 0) > >> { > >> /* > >> * Reset the per-tuple exprcontext. We can only do this if the > >> * tuple buffer is empty. (Calling the context the per-tuple > >> * memory context is a bit of a misnomer now.) > >> */ > >> ResetPerTupleExprContext(estate); > >> } > >> > >> but the quals are evaluated in the ExprContext's: > >> > >> ExecQual(ExprState *state, ExprContext *econtext) > >> ... > >> ret = ExecEvalExprSwitchContext(state, econtext, &isnull); > >> > >> > >> which is created with: > >> > >> /* Get an EState's per-output-tuple exprcontext, making it if first use */ > >> #define GetPerTupleExprContext(estate) \ > >> ((estate)->es_per_tuple_exprcontext ? \ > >> (estate)->es_per_tuple_exprcontext : \ > >> MakePerTupleExprContext(estate)) > >> > >> and creates its own context: > >> /* > >> * Create working memory for expression evaluation in this context. > >> */ > >> econtext->ecxt_per_tuple_memory = > >> AllocSetContextCreate(estate->es_query_cxt, > >> "ExprContext", > >> ALLOCSET_DEFAULT_SIZES); > >> > >> so this is currently just never reset. > > > > Actually, no. The ResetPerTupleExprContext boils down to > > > > MemoryContextReset((econtext)->ecxt_per_tuple_memory) > > > > and ExecEvalExprSwitchContext does this > > > > MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory); > > > > So it's resetting the right context, although only on batch boundary. > >> Seems just using ExecQualAndReset() ought to be sufficient? > >> > > > > That may still be the right thing to do. > > > > Actually, no, because that would reset the context far too early (and > it's easy to trigger segfaults). So the reset would have to happen after > processing the row, not this early. Yea, sorry, I was too tired yesterday evening. I'd spent 10h splitting up the pluggable storage patch into individual pieces... 
> But I think the current behavior is actually OK, as it matches what we > do for defexprs. And the comment before ResetPerTupleExprContext says this: > > /* > * Reset the per-tuple exprcontext. We can only do this if the > * tuple buffer is empty. (Calling the context the per-tuple > * memory context is a bit of a misnomer now.) > */ > > So the per-tuple context is not quite per-tuple anyway. Sure, we might > rework that but I don't think that's an issue in this patch. I'm *not* convinced by this. I think it's bad enough that we do this for normal COPY, but for WHEN, we could end up *never* resetting before the end. Consider a case where a single tuple is inserted, and then *all* rows are filtered. I think this needs a separate econtext that's reset every round. Or alternatively you could fix the code not to rely on per-tuple not being reset when tuples are buffered - that actually ought to be fairly simple. Greetings, Andres Freund
On 1/21/19 7:51 PM, Andres Freund wrote: > Hi, > > On 2019-01-21 16:22:11 +0100, Tomas Vondra wrote: >> >> >> On 1/21/19 4:33 AM, Tomas Vondra wrote: >>> >>> >>> On 1/21/19 3:12 AM, Andres Freund wrote: >>>> On 2019-01-20 18:08:05 -0800, Andres Freund wrote: >>>>> On 2019-01-20 21:00:21 -0500, Tomas Vondra wrote: >>>>>> >>>>>> >>>>>> On 1/20/19 8:24 PM, Andres Freund wrote: >>>>>>> Hi, >>>>>>> >>>>>>> On 2019-01-20 00:24:05 +0100, Tomas Vondra wrote: >>>>>>>> On 1/14/19 10:25 PM, Tomas Vondra wrote: >>>>>>>>> On 12/13/18 8:09 AM, Surafel Temesgen wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Dec 12, 2018 at 9:28 PM Tomas Vondra >>>>>>>>>> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Can you also update the docs to mention that the functions called from >>>>>>>>>> the WHERE clause does not see effects of the COPY itself? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> /Of course, i also add same comment to insertion method selection >>>>>>>>>> / >>>>>>>>> >>>>>>>>> FWIW I've marked this as RFC and plan to get it committed this week. >>>>>>>>> >>>>>>>> >>>>>>>> Pushed, thanks for the patch. >>>>>>> >>>>>>> While rebasing the pluggable storage patch ontop of this I noticed that >>>>>>> the qual appears to be evaluated in query context. Isn't that a bad >>>>>>> idea? ISMT it should have been evaluated a few lines above, before the: >>>>>>> >>>>>>> /* Triggers and stuff need to be invoked in query context. */ >>>>>>> MemoryContextSwitchTo(oldcontext); >>>>>>> >>>>>>> Yes, that'd require moving the ExecStoreHeapTuple(), but that seems ok? >>>>>>> >>>>>> >>>>>> Yes, I agree. It's a bit too late for me to hack and push stuff, but I'll >>>>>> fix that tomorrow. >>>>> >>>>> NP. On second thought, the problem is probably smaller than I thought at >>>>> first, because ExecQual() switches to the econtext's per-tuple memory >>>>> context. But it's only reset once for each batch, so there's some >>>>> wastage. 
At least worth a comment. >>>> >>>> I'm tired, but perhaps its actually worse - what's being reset currently >>>> is the ESTate's per-tuple context: >>>> >>>> if (nBufferedTuples == 0) >>>> { >>>> /* >>>> * Reset the per-tuple exprcontext. We can only do this if the >>>> * tuple buffer is empty. (Calling the context the per-tuple >>>> * memory context is a bit of a misnomer now.) >>>> */ >>>> ResetPerTupleExprContext(estate); >>>> } >>>> >>>> but the quals are evaluated in the ExprContext's: >>>> >>>> ExecQual(ExprState *state, ExprContext *econtext) >>>> ... >>>> ret = ExecEvalExprSwitchContext(state, econtext, &isnull); >>>> >>>> >>>> which is created with: >>>> >>>> /* Get an EState's per-output-tuple exprcontext, making it if first use */ >>>> #define GetPerTupleExprContext(estate) \ >>>> ((estate)->es_per_tuple_exprcontext ? \ >>>> (estate)->es_per_tuple_exprcontext : \ >>>> MakePerTupleExprContext(estate)) >>>> >>>> and creates its own context: >>>> /* >>>> * Create working memory for expression evaluation in this context. >>>> */ >>>> econtext->ecxt_per_tuple_memory = >>>> AllocSetContextCreate(estate->es_query_cxt, >>>> "ExprContext", >>>> ALLOCSET_DEFAULT_SIZES); >>>> >>>> so this is currently just never reset. >>> >>> Actually, no. The ResetPerTupleExprContext boils down to >>> >>> MemoryContextReset((econtext)->ecxt_per_tuple_memory) >>> >>> and ExecEvalExprSwitchContext does this >>> >>> MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory); >>> >>> So it's resetting the right context, although only on batch boundary. > >>>> Seems just using ExecQualAndReset() ought to be sufficient? >>>> >>> >>> That may still be the right thing to do. >>> >> >> Actually, no, because that would reset the context far too early (and >> it's easy to trigger segfaults). So the reset would have to happen after >> processing the row, not this early. > > Yea, sorry, I was too tired yesterday evening. 
I'd spent 10h splitting > up the pluggable storage patch into individual pieces... > > >> But I think the current behavior is actually OK, as it matches what we >> do for defexprs. And the comment before ResetPerTupleExprContext says this: >> >> /* >> * Reset the per-tuple exprcontext. We can only do this if the >> * tuple buffer is empty. (Calling the context the per-tuple >> * memory context is a bit of a misnomer now.) >> */ >> >> So the per-tuple context is not quite per-tuple anyway. Sure, we might >> rework that but I don't think that's an issue in this patch. > > I'm *not* convinced by this. I think it's bad enough that we do this for > normal COPY, but for WHEN, we could end up *never* resetting before the > end. Consider a case where a single tuple is inserted, and then *all* > rows are filtered. I think this needs a separate econtext that's reset > every round. Or alternatively you could fix the code not to rely on > per-tuple not being reset when tuples are buffered - that actually ought > to be fairly simple. > I think separating the per-tuple and per-batch contexts is the right thing to do, here. It seems the batching was added somewhat later and using the per-tuple context is rather confusing. cheers -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/21/19 11:15 PM, Tomas Vondra wrote: > > > On 1/21/19 7:51 PM, Andres Freund wrote: >> Hi, >> >> On 2019-01-21 16:22:11 +0100, Tomas Vondra wrote: >>> >>> >>> On 1/21/19 4:33 AM, Tomas Vondra wrote: >>>> >>>> >>>> On 1/21/19 3:12 AM, Andres Freund wrote: >>>>> On 2019-01-20 18:08:05 -0800, Andres Freund wrote: >>>>>> On 2019-01-20 21:00:21 -0500, Tomas Vondra wrote: >>>>>>> >>>>>>> >>>>>>> On 1/20/19 8:24 PM, Andres Freund wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> On 2019-01-20 00:24:05 +0100, Tomas Vondra wrote: >>>>>>>>> On 1/14/19 10:25 PM, Tomas Vondra wrote: >>>>>>>>>> On 12/13/18 8:09 AM, Surafel Temesgen wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Dec 12, 2018 at 9:28 PM Tomas Vondra >>>>>>>>>>> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Can you also update the docs to mention that the functions called from >>>>>>>>>>> the WHERE clause does not see effects of the COPY itself? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> /Of course, i also add same comment to insertion method selection >>>>>>>>>>> / >>>>>>>>>> >>>>>>>>>> FWIW I've marked this as RFC and plan to get it committed this week. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Pushed, thanks for the patch. >>>>>>>> >>>>>>>> While rebasing the pluggable storage patch ontop of this I noticed that >>>>>>>> the qual appears to be evaluated in query context. Isn't that a bad >>>>>>>> idea? ISMT it should have been evaluated a few lines above, before the: >>>>>>>> >>>>>>>> /* Triggers and stuff need to be invoked in query context. */ >>>>>>>> MemoryContextSwitchTo(oldcontext); >>>>>>>> >>>>>>>> Yes, that'd require moving the ExecStoreHeapTuple(), but that seems ok? >>>>>>>> >>>>>>> >>>>>>> Yes, I agree. It's a bit too late for me to hack and push stuff, but I'll >>>>>>> fix that tomorrow. >>>>>> >>>>>> NP. 
On second thought, the problem is probably smaller than I thought at >>>>>> first, because ExecQual() switches to the econtext's per-tuple memory >>>>>> context. But it's only reset once for each batch, so there's some >>>>>> wastage. At least worth a comment. >>>>> >>>>> I'm tired, but perhaps its actually worse - what's being reset currently >>>>> is the ESTate's per-tuple context: >>>>> >>>>> if (nBufferedTuples == 0) >>>>> { >>>>> /* >>>>> * Reset the per-tuple exprcontext. We can only do this if the >>>>> * tuple buffer is empty. (Calling the context the per-tuple >>>>> * memory context is a bit of a misnomer now.) >>>>> */ >>>>> ResetPerTupleExprContext(estate); >>>>> } >>>>> >>>>> but the quals are evaluated in the ExprContext's: >>>>> >>>>> ExecQual(ExprState *state, ExprContext *econtext) >>>>> ... >>>>> ret = ExecEvalExprSwitchContext(state, econtext, &isnull); >>>>> >>>>> >>>>> which is created with: >>>>> >>>>> /* Get an EState's per-output-tuple exprcontext, making it if first use */ >>>>> #define GetPerTupleExprContext(estate) \ >>>>> ((estate)->es_per_tuple_exprcontext ? \ >>>>> (estate)->es_per_tuple_exprcontext : \ >>>>> MakePerTupleExprContext(estate)) >>>>> >>>>> and creates its own context: >>>>> /* >>>>> * Create working memory for expression evaluation in this context. >>>>> */ >>>>> econtext->ecxt_per_tuple_memory = >>>>> AllocSetContextCreate(estate->es_query_cxt, >>>>> "ExprContext", >>>>> ALLOCSET_DEFAULT_SIZES); >>>>> >>>>> so this is currently just never reset. >>>> >>>> Actually, no. The ResetPerTupleExprContext boils down to >>>> >>>> MemoryContextReset((econtext)->ecxt_per_tuple_memory) >>>> >>>> and ExecEvalExprSwitchContext does this >>>> >>>> MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory); >>>> >>>> So it's resetting the right context, although only on batch boundary. >> >>>>> Seems just using ExecQualAndReset() ought to be sufficient? >>>>> >>>> >>>> That may still be the right thing to do. 
>>>> >>> >>> Actually, no, because that would reset the context far too early (and >>> it's easy to trigger segfaults). So the reset would have to happen after >>> processing the row, not this early. >> >> Yea, sorry, I was too tired yesterday evening. I'd spent 10h splitting >> up the pluggable storage patch into individual pieces... >> >> >>> But I think the current behavior is actually OK, as it matches what we >>> do for defexprs. And the comment before ResetPerTupleExprContext says this: >>> >>> /* >>> * Reset the per-tuple exprcontext. We can only do this if the >>> * tuple buffer is empty. (Calling the context the per-tuple >>> * memory context is a bit of a misnomer now.) >>> */ >>> >>> So the per-tuple context is not quite per-tuple anyway. Sure, we might >>> rework that but I don't think that's an issue in this patch. >> >> I'm *not* convinced by this. I think it's bad enough that we do this for >> normal COPY, but for WHEN, we could end up *never* resetting before the >> end. Consider a case where a single tuple is inserted, and then *all* >> rows are filtered. I think this needs a separate econtext that's reset >> every round. Or alternatively you could fix the code not to rely on >> per-tuple not being reset when tuples are buffered - that actually ought >> to be fairly simple. >> > > I think separating the per-tuple and per-batch contexts is the right > thing to do, here. It seems the batching was added somewhat later and > using the per-tuple context is rather confusing. > OK, here is a WIP patch doing that. It creates a new "batch" context, and allocates tuples in it (instead of the per-tuple context). The per-tuple context is now reset always, irrespectedly of nBufferedTuples. And the batch context is reset every time the batch is emptied. It turned out to be a tad more complex due to partitioning, because when we find the partitions do not match, the tuple is already allocated in the "current" context (be it per-tuple or batch). 
So we can't just free the whole context at that point. The old code worked around this by alternating two contexts, but that seems a bit too cumbersome to me, so the patch simply copies the tuple to the new context. That allows us to reset the batch context always, right after emptying the buffer. I need to do some benchmarking to see if the extra copy causes any regression. Overall, separating the contexts makes it quite a bit clearer. I'm not entirely happy about the per-tuple context being "implicit" (hidden in executor context) while the batch context being explicitly created, but there's not much I can do about that. The patch also includes the fix correcting the volatility check on WHERE clause, although that shall be committed separately. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения

Hi, On 2019-01-22 18:35:21 +0100, Tomas Vondra wrote: > On 1/21/19 11:15 PM, Tomas Vondra wrote: > > On 1/21/19 7:51 PM, Andres Freund wrote: > >> I'm *not* convinced by this. I think it's bad enough that we do this for > >> normal COPY, but for WHEN, we could end up *never* resetting before the > >> end. Consider a case where a single tuple is inserted, and then *all* > >> rows are filtered. I think this needs a separate econtext that's reset > >> every round. Or alternatively you could fix the code not to rely on > >> per-tuple not being reset when tuples are buffered - that actually ought > >> to be fairly simple. > >> > > > > I think separating the per-tuple and per-batch contexts is the right > > thing to do, here. It seems the batching was added somewhat later and > > using the per-tuple context is rather confusing. > > > > OK, here is a WIP patch doing that. It creates a new "batch" context, > and allocates tuples in it (instead of the per-tuple context). The > per-tuple context is now reset always, irrespectedly of nBufferedTuples. > And the batch context is reset every time the batch is emptied. > > It turned out to be a tad more complex due to partitioning, because when > we find the partitions do not match, the tuple is already allocated in > the "current" context (be it per-tuple or batch). So we can't just free > the whole context at that point. The old code worked around this by > alternating two contexts, but that seems a bit too cumbersome to me, so > the patch simply copies the tuple to the new context. That allows us to > reset the batch context always, right after emptying the buffer. I need > to do some benchmarking to see if the extra copy causes any regression. > > Overall, separating the contexts makes it quite a bit clearer. I'm not > entirely happy about the per-tuple context being "implicit" (hidden in > executor context) while the batch context being explicitly created, but > there's not much I can do about that. 
I think the extra copy is probably OK for now - as part of the pluggable storage series I've converted COPY to use slots, which should make that less of an issue - I've not done that, but I actually assume we could remove the whole batch context afterwards. Greetings, Andres Freund
On 1/22/19 10:00 AM, Surafel Temesgen wrote: > > > On Mon, Jan 21, 2019 at 6:22 PM Tomas Vondra > <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote: > > > I think the condition can be just > > if (contain_volatile_functions(cstate->whereClause)) { ... } > > I've pushed a fix for the volatility check. Attached is a patch for the other issue, creating a separate batch context along the lines outlined in the previous email. It's a bit too late for me to push it now, especially right before a couple of days off. So I'll push that in a couple of days. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения

On Wed, 23 Jan 2019 at 06:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > It turned out to be a tad more complex due to partitioning, because when > we find the partitions do not match, the tuple is already allocated in > the "current" context (be it per-tuple or batch). So we can't just free > the whole context at that point. The old code worked around this by > alternating two contexts, but that seems a bit too cumbersome to me, so > the patch simply copies the tuple to the new context. That allows us to > reset the batch context always, right after emptying the buffer. I need > to do some benchmarking to see if the extra copy causes any regression. I agree that fixing the problem mentioned by separating out tuple and batch contexts is a good idea, but I disagree with removing the alternating batch context idea. FWIW the reason the alternating contexts went in wasn't just for fun, it was on performance grounds. There's a lengthy discussion in [1] about not adding any new performance regressions to COPY. To be more specific, Peter complains about a regression of 5% in [2]. It's pretty disheartening to see the work there being partially undone. Here are my performance tests of with and without your change to the memory contexts (I missed where you posted your results). $ cat bench.pl for (my $i=0; $i < 8912891; $i++) { print "1\n1\n2\n2\n"; } 36a1281f86c: (with your change) postgres=# copy listp from program $$perl ~/bench.pl$$ delimiter '|'; COPY 35651564 Time: 26825.142 ms (00:26.825) Time: 27202.117 ms (00:27.202) Time: 26266.705 ms (00:26.267) 4be058fe9ec: (before your change) postgres=# copy listp from program $$perl ~/bench.pl$$ delimiter '|'; COPY 35651564 Time: 25645.460 ms (00:25.645) Time: 25698.193 ms (00:25.698) Time: 25737.251 ms (00:25.737) So looks like your change slows this code down 4% for this test case. 
That's about twice as bad as the 2.2% regression that I had to solve for the multi-insert partition patch (of which you've removed much of the code from) I'd really like to see the alternating batch context code being put back in to fix this. Of course, if you have a better idea, then we can look into that, but just removing code that was carefully written and benchmarked without any new benchmarks to prove that it's okay to do so seems wrong to me. [1] https://www.postgresql.org/message-id/flat/CAKJS1f93DeHN%2B9RrD9jYn0iF_o89w2B%2BU8-Ao5V1kd8Cf7oSGQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/2b40468d-6be0-e795-c363-0015bc9fa747%402ndquadrant.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 1/29/19 8:18 AM, David Rowley wrote: > On Wed, 23 Jan 2019 at 06:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> It turned out to be a tad more complex due to partitioning, because when >> we find the partitions do not match, the tuple is already allocated in >> the "current" context (be it per-tuple or batch). So we can't just free >> the whole context at that point. The old code worked around this by >> alternating two contexts, but that seems a bit too cumbersome to me, so >> the patch simply copies the tuple to the new context. That allows us to >> reset the batch context always, right after emptying the buffer. I need >> to do some benchmarking to see if the extra copy causes any regression. > > I agree that fixing the problem mentioned by separating out tuple and > batch contexts is a good idea, but I disagree with removing the > alternating batch context idea. FWIW the reason the alternating > contexts went in wasn't just for fun, it was on performance grounds. > There's a lengthy discussion in [1] about not adding any new > performance regressions to COPY. To be more specific, Peter complains > about a regression of 5% in [2]. > > It's pretty disheartening to see the work there being partially undone. > Whoops :-( > Here are my performance tests of with and without your change to the > memory contexts (I missed where you posted your results). > > $ cat bench.pl > for (my $i=0; $i < 8912891; $i++) { > print "1\n1\n2\n2\n"; > } > 36a1281f86c: (with your change) > > postgres=# copy listp from program $$perl ~/bench.pl$$ delimiter '|'; > COPY 35651564 > Time: 26825.142 ms (00:26.825) > Time: 27202.117 ms (00:27.202) > Time: 26266.705 ms (00:26.267) > > 4be058fe9ec: (before your change) > > postgres=# copy listp from program $$perl ~/bench.pl$$ delimiter '|'; > COPY 35651564 > Time: 25645.460 ms (00:25.645) > Time: 25698.193 ms (00:25.698) > Time: 25737.251 ms (00:25.737) > > So looks like your change slows this code down 4% for this test case. 
> That's about twice as bad as the 2.2% regression that I had to solve > for the multi-insert partition patch (of which you've removed much of > the code from) > > I'd really like to see the alternating batch context code being put > back in to fix this. Of course, if you have a better idea, then we can > look into that, but just removing code that was carefully written and > benchmarked without any new benchmarks to prove that it's okay to do > so seems wrong to me. > Yeah, that's not very nice. Do you suggest to revert 36a1281f86c0f, or shall we try to massage it a bit first? ISTM we could simply create two batch memory contexts and alternate those, just like with the expression contexts before. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/29/19 8:18 AM, David Rowley wrote: > ... > Here are my performance tests of with and without your change to the > memory contexts (I missed where you posted your results). > > $ cat bench.pl > for (my $i=0; $i < 8912891; $i++) { > print "1\n1\n2\n2\n"; > } > 36a1281f86c: (with your change) > > postgres=# copy listp from program $$perl ~/bench.pl$$ delimiter '|'; > COPY 35651564 > Time: 26825.142 ms (00:26.825) > Time: 27202.117 ms (00:27.202) > Time: 26266.705 ms (00:26.267) > > 4be058fe9ec: (before your change) > > postgres=# copy listp from program $$perl ~/bench.pl$$ delimiter '|'; > COPY 35651564 > Time: 25645.460 ms (00:25.645) > Time: 25698.193 ms (00:25.698) > Time: 25737.251 ms (00:25.737) > How do I reproduce this? I don't see this test in the thread you linked, so I'm not sure how many partitions you were using, what's the structure of the table etc. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On January 29, 2019 6:25:59 AM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >On 1/29/19 8:18 AM, David Rowley wrote: >> ... >> Here are my performance tests of with and without your change to the >> memory contexts (I missed where you posted your results). >> >> $ cat bench.pl >> for (my $i=0; $i < 8912891; $i++) { >> print "1\n1\n2\n2\n"; >> } >> 36a1281f86c: (with your change) >> >> postgres=# copy listp from program $$perl ~/bench.pl$$ delimiter '|'; >> COPY 35651564 >> Time: 26825.142 ms (00:26.825) >> Time: 27202.117 ms (00:27.202) >> Time: 26266.705 ms (00:26.267) >> >> 4be058fe9ec: (before your change) >> >> postgres=# copy listp from program $$perl ~/bench.pl$$ delimiter '|'; >> COPY 35651564 >> Time: 25645.460 ms (00:25.645) >> Time: 25698.193 ms (00:25.698) >> Time: 25737.251 ms (00:25.737) >> > >How do I reproduce this? I don't see this test in the thread you >linked, >so I'm not sure how many partitions you were using, what's the >structure >of the table etc. I think I might have a patch addressing the problem incidentally. For pluggable storage I slotified copy.c, which also removes the first heap_form_tuple. Quite possible that nothing more is needed. I've removed the batch context altogether in yesterday's rebase, there was no need anymore. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, 30 Jan 2019 at 03:26, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On 1/29/19 8:18 AM, David Rowley wrote: > > (details about performance regression) > > How do I reproduce this? I don't see this test in the thread you linked, > so I'm not sure how many partitions you were using, what's the structure > of the table etc. The setup is: create table listp (a int) partition by list(a); create table listp1 partition of listp for values in(1); create table listp2 partition of listp for values in(2); -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
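For reference, combining this setup with the bench.pl generator quoted at the top of the thread gives roughly the following end-to-end reproduction (the script path and the '|' delimiter are taken verbatim from the quoted benchmark commands; adjust paths to taste):

```sql
-- list-partitioned table with one partition per distinct value
create table listp (a int) partition by list (a);
create table listp1 partition of listp for values in (1);
create table listp2 partition of listp for values in (2);

-- bench.pl emits 1,1,2,2 repeatedly, so the target partition
-- changes every second tuple (the worst case being discussed)
\timing on
copy listp from program 'perl ~/bench.pl' delimiter '|';
```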
On Wed, 30 Jan 2019 at 04:22, Andres Freund <andres@anarazel.de> wrote: > I think I might have a patch addressing the problem incidentally. For pluggable storage I slotified copy.c, which also removes the first heap_form_tuple. Quite possible that nothing more is needed. I've removed the batch context altogether in yesterday's rebase, there was no need anymore. In your patch, where do the batched tuples get stored before the heap insert is done? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, 29 Jan 2019 at 22:51, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On 1/29/19 8:18 AM, David Rowley wrote: > > So looks like your change slows this code down 4% for this test case. > > That's about twice as bad as the 2.2% regression that I had to solve > > for the multi-insert partition patch (of which you've removed much of > > the code from) > > > > I'd really like to see the alternating batch context code being put > > back in to fix this. Of course, if you have a better idea, then we can > > look into that, but just removing code that was carefully written and > > benchmarked without any new benchmarks to prove that it's okay to do > > so seems wrong to me. > > > > Yeah, that's not very nice. Do you suggest to revert 36a1281f86c0f, or > shall we try to massage it a bit first? ISTM we could simply create two > batch memory contexts and alternate those, just like with the expression > contexts before. I don't think a revert should be done. I didn't check in detail, but it seems what you did fixed the memory leak issue for when many tuples are filtered out. I'd like to have more detail on what Andres is proposing, but otherwise, I'd imagine we can just have two batch contexts and alternate between them, as before, but also keep the per-tuple context too. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2019-01-30 10:05:35 +1300, David Rowley wrote: > On Wed, 30 Jan 2019 at 04:22, Andres Freund <andres@anarazel.de> wrote: > > I think I might have a patch addressing the problem incidentally. For pluggable storage I slotified copy.c, which also removes the first heap_form_tuple. Quite possible that nothing more is needed. I've removed the batch context altogether in yesterday's rebase, there was no need anymore. > > In your patch, where do the batched tuples get stored before the heap > insert is done? There's one slot for each batched tuple (they are reused). Before materialization the tuples solely exist in tts_isnull/values into which NextCopyFrom() directly parses the values. Tuples never get extracted from the slot in copy.c itself anymore, table_multi_insert() accepts slots. Not quite sure whether I've answered your question? Greetings, Andres Freund
On Wed, 30 Jan 2019 at 10:12, Andres Freund <andres@anarazel.de> wrote: > > On 2019-01-30 10:05:35 +1300, David Rowley wrote: > > On Wed, 30 Jan 2019 at 04:22, Andres Freund <andres@anarazel.de> wrote: > > > I think I might have a patch addressing the problem incidentally. For pluggable storage I slotified copy.c, which also removes the first heap_form_tuple. Quite possible that nothing more is needed. I've removed the batch context altogether in yesterday's rebase, there was no need anymore. > > > > In your patch, where do the batched tuples get stored before the heap > > insert is done? > > There's one slot for each batched tuple (they are reused). Before > materialization the tuples solely exist in tts_isnull/values into which > NextCopyFrom() directly parses the values. Tuples never get extracted > from the slot in copy.c itself anymore, table_multi_insert() accepts > slots. Not quite sure whether I've answered your question? I think so. I imagine that should also speed up COPY WHERE too as it'll no longer form a tuple before possibly discarding it. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2019-01-30 10:33:30 +1300, David Rowley wrote: > On Wed, 30 Jan 2019 at 10:12, Andres Freund <andres@anarazel.de> wrote: > > > > On 2019-01-30 10:05:35 +1300, David Rowley wrote: > > > On Wed, 30 Jan 2019 at 04:22, Andres Freund <andres@anarazel.de> wrote: > > > > I think I might have a patch addressing the problem incidentally. For pluggable storage I slotified copy.c, which also removes the first heap_form_tuple. Quite possible that nothing more is needed. I've removed the batch context altogether in yesterday's rebase, there was no need anymore. > > > > > > In your patch, where do the batched tuples get stored before the heap > > > insert is done? > > > > There's one slot for each batched tuple (they are reused). Before > > materialization the tuples solely exist in tts_isnull/values into which > > NextCopyFrom() directly parses the values. Tuples never get extracted > > from the slot in copy.c itself anymore, table_multi_insert() accepts > > slots. Not quite sure whether I've answered your question? > > I think so. I imagine that should also speed up COPY WHERE too as > it'll no longer form a tuple before possibly discarding it. Right. I found some issues in my patch (stupid implementation of copying from one slot to the other), but after fixing that I get: master: Time: 16013.509 ms (00:16.014) Time: 16836.110 ms (00:16.836) Time: 16636.796 ms (00:16.637) pluggable storage: Time: 15974.243 ms (00:15.974) Time: 16183.442 ms (00:16.183) Time: 16055.192 ms (00:16.055) (with a truncate between each run) So that seems a bit better. Albeit at the cost of having a few, on-demand created, empty slots for each encountered partition. I'm pretty sure we can optimize that further... Greetings, Andres Freund
On Wed, 27 Mar 2019 at 18:49, Andres Freund <andres@anarazel.de> wrote: > Here's a version that applies onto HEAD. It needs a bit more cleanup > (I'm not sure I like the addition of ri_batchInsertSlots to store slots > for each partition, there's some duplicated code around slot array and > slot creation, docs for the multi insert callback). > > When measuring inserting into unlogged partitions (to remove the > overhead of WAL logging, which makes it harder to see the raw > performance difference), I see a few percent gains of the slotified > version over the current code. Same with insertions that have fewer > partition switches. > > Sorry for posting this here - David, it'd be cool if you could take a > look anyway. You've played a lot with this code. Hi, I had a look at this and performance has improved again, thanks. However, I'm not sure if the patch is exactly what we need, let me explain. When I was working on 0d5f05cde01, I think my original idea was just to use the bufferedTuples[] array and always multi insert into partitioned tables unless something like on insert triggers stopped that. The reason I ended up with all this CIM_MULTI and CIM_MULTI_CONDITIONAL stuff is due to Melanie finding a regression when the target partition changed on each tuple. At the time I wondered if it might be worth trying to have multiple buffers or to partition the existing buffer and store tuples belonging to different partitions, then just flush them when they're full. Looking at the patch it seems that you have added an array of slots per partition, so does that not kinda make all that CIM_MULTI / CIM_MULTI_CONDITIONAL code redundant? ... we could just flush each partition's buffer when it gets full. There's not much of a need to flush the buffer when the partition changes since we're using a different buffer for the next partition anyway. I also wonder about memory usage here. 
If we're copying to a partitioned table with 10k partitions and we get over MAX_BUFFERED_TUPLES in a row for each partition, we'll end up with quite a lot of slots. So, in short. I think the patch either is going to cause possible memory problems during COPY FROM, or it leaves code that is pretty much redundant. ... Or I've just misunderstood the patch :) Does this make sense? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2019-03-28 20:48:47 +1300, David Rowley wrote: > I had a look at this and performance has improved again, thanks. > However, I'm not sure if the patch is exactly what we need, let me > explain. I'm not entirely sure either, I just haven't really seen an alternative that's convincing. > When I was working on 0d5f05cde01, I think my original idea was just > to use the bufferedTuples[] array and always multi insert into > partitioned tables unless something like on insert triggers stopped > that. The reason I ended up with all this CIM_MULTI and > CIM_MULTI_CONDITIONAL stuff is due to Melanie finding a regression > when the target partition changed on each tuple. At the time I > wondered if it might be worth trying to have multiple buffers or to > partition the existing buffer and store tuples belonging to different > partitions, then just flush them when they're full. Looking at the > patch it seems that you have added an array of slots per partition, so > does that not kinda make all that CIM_MULTI / CIM_MULTI_CONDITIONAL > code redundant? ... we could just flush each partition's buffer when > it gets full. There's not much of a need to flush the buffer when the > partition changes since we're using a different buffer for the next > partition anyway. Well, there is in a way - that'd lead to pretty significant memory usage if the tuples are a bit wider. I think there's a separate optimization, where we keep track of the partitions with unflushed changes and track the buffer size across all partitions. But that seems like a followup patch. > I also wonder about memory usage here. If we're copying to a > partitioned table with 10k partitions and we get over > MAX_BUFFERED_TUPLES in a row for each partition, we'll end up with > quite a lot of slots. We do - but they're not actually that large. In comparison to the catcache, relcache, partition routing etc information per partition it's not a huge amount of memory. 
I'd earlier just destroyed the slots on partition change, but that's not exactly free either... Greetings, Andres Freund
On Fri, 29 Mar 2019 at 01:15, Andres Freund <andres@anarazel.de> wrote: > On 2019-03-28 20:48:47 +1300, David Rowley wrote: > > I had a look at this and performance has improved again, thanks. > > However, I'm not sure if the patch is exactly what we need, let me > > explain. > > I'm not entirely sure either, I just haven't really seen an alternative > that's convincing. I wonder if, instead of having the array of slots in ResultRelInfo, we could have a struct that's local to copy.c containing the array and the number of tuples stored so far. For partitioned tables, we could store this struct in a hashtable by partition Oid. When the partition changes, check if we've got this partition Oid in the hash table and keep adding tuples until the buffer fills. We could keep a global count of the number of tuples stored in all the slot arrays and flush all of them when it gets full. The trade-off here would be that instead of flushing on each partition change, we'd do a hash table lookup on each partition change and possibly create a new array of slots. This would allow us to get rid of the code that conditionally switches on/off the batching based on how often the partition is changing. The key to it being better would hinge on the hash lookup + multi-row-inserts being faster than single-row-inserts. I'm just not too sure about how to handle getting rid of the slots when we flush all the tuples. Getting rid of them might be a waste, but it might also stop the code creating tens of millions of slots in the worst case. Maybe to fix that we could get rid of the slots in arrays that didn't get any use at all when we flush the tuples, as indicated by a 0 tuple count. This would require a hash seq scan, but maybe we could keep that cheap by flushing early if we get too many distinct partitions. That would save the hash table from getting bloated if there happened to be a point in the copy stream where we saw high numbers of distinct partitions with just a few tuples each. 
Multi-inserts won't help much in that case anyway. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2019-04-01 02:00:26 +1300, David Rowley wrote: > On Fri, 29 Mar 2019 at 01:15, Andres Freund <andres@anarazel.de> wrote: > > On 2019-03-28 20:48:47 +1300, David Rowley wrote: > > > I had a look at this and performance has improved again, thanks. > > > However, I'm not sure if the patch is exactly what we need, let me > > > explain. > > > > I'm not entirely sure either, I just haven't really seen an alternative > > that's convincing. > > I wonder if instead of having the array of slots in ResultRelInfo, > have a struct that's local to copy.c containing the array and the > number of tuples stored so far. For partitioned tables, we could > store this struct in a hashtable by partition Oid. When the partition > changes check if we've got this partition Oid in the hash table and > keep adding tuples until the buffer fills. We could keep a global > count of the number of tuple stored in all the slot arrays and flush > all of them when it gets full. > > The trade-off here would be that instead of flushing on each partition > change, we'd do a hash table lookup on each partition change and > possibly create a new array of slots. This would allow us to get rid > of the code that conditionally switches on/off the batching based on > how often the partition is changing. The key to it being better would > hang on the hash lookup + multi-row-inserts being faster than > single-row-inserts. It's not clear to me why this separate hashtable is useful / necessary. We're essentially already doing such a lookup for the partition routing - what's the point of having a repetition of the same mapping? I don't see any benefit of storing the slots separately from that. > I'm just not too sure about how to handle getting rid of the slots > when we flush all the tuples. Getting rid of them might be a waste, > but it might also stop the code creating tens of millions of slots in > the worst case. 
Maybe to fix that we could get rid of the slots in > arrays that didn't get any use at all when we flush the tuples, as > indicated by a 0 tuple count. This would require a hash seq scan, but > maybe we could keep that cheap by flushing early if we get too many > distinct partitions. I'd suspect a better approach to this might be to just have a linked list of partitions with slots in them, that'd avoid unnecessarily going through all the partitions that got copied into them. I'm not convinced this is necessary, but if I were to implement this, I'd leave the slots in the ResultRelInfo, but additionally keep track of how many slots have been created. When creating new slots, and some limit has been reached, I'd just go to the front of that list and delete those slots (probably delaying doing so to the point of flushing pending insertions). That seems like it'd not actually be that much changed code over my current patch, and would avoid deleting slots in the common case. I'll work on pushing all the other pending tableam patches today - leaving COPY the last non pluggable part. You'd written in a private email that you might try to work on this on Monday, so I think I'll give this a shot on Tuesday if you've not gotten around till then? I'd like to push this sooner than the exact end of the freeze... > That would save the table from getting bloated if > there happened to be a point in the copy stream where we saw high > numbers of distinct partitions with just a few tuples each. > Multi-inserts won't help much in that case anyway. I thought your benchmarking showed that you saw benefits after even two tuples? Greetings, Andres Freund
On Mon, 1 Apr 2019 at 08:05, Andres Freund <andres@anarazel.de> wrote: > I'll work on pushing all the other pending tableam patches today - > leaving COPY the last non pluggable part. You'd written in a private > email that you might try to work on this on Monday, so I think I'll give > this a shot on Tuesday if you've not gotten around till then? I'd like > to push this sooner than the exact end of the freeze... I worked on this, but I've not got the patch finished yet. Things not done: 1. Fails regression tests. Possibly due to a bug in tuple conversion. 2. Thought about ERROR line numbers in copy stream. It's possible we may need an array of uint64s to store the line number per slot. 3. Testing. 4. Code comments. 5. Code sanity check. 6. Thought about if it's a problem if we ERROR during the copy after having already inserted tuples that come after the tuple causing the error. (It's possible that the errors would become out of order.) However, the performance looks pretty good. $ cat bench.pl for (my $i=0; $i < 8912891; $i++) { print "1\n1\n2\n2\n"; } $ cat bench_same.pl for (my $i=0; $i < 8912891; $i++) { print "1\n1\n1\n1\n"; } create table listp(a int) partition by list(a); create table listp1 partition of listp for values in(1); create table listp2 partition of listp for values in(2); -- Test 1: Change partition every 2nd tuple. master + v23-0001-tableam-multi_insert-and-slotify-COPY.patch # copy listp from program $$perl ~/bench.pl$$ delimiter '|'; Time: 17894.625 ms (00:17.895) master + attached # copy listp from program $$perl ~/bench.pl$$ delimiter '|'; Time: 10615.761 ms (00:10.616) -- Test 2: Same partition each time. 
master + v23-0001-tableam-multi_insert-and-slotify-COPY.patch # copy listp from program $$perl ~/bench_same.pl$$ delimiter '|'; Time: 19234.960 ms (00:19.235) master + attached # copy listp from program $$perl ~/bench_same.pl$$ delimiter '|'; Time: 9064.802 ms (00:09.065) Of course, it is possible that some of the bugs account for some of the improved time... but the rows did seem to be in the table afterwards. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachments
Hi, On 2019-04-02 04:23:20 +1300, David Rowley wrote: > On Mon, 1 Apr 2019 at 08:05, Andres Freund <andres@anarazel.de> wrote: > > I'll work on pushing all the other pending tableam patches today - > > leaving COPY the last non pluggable part. You'd written in a private > > email that you might try to work on this on Monday, so I think I'll give > > this a shot on Tuesday if you've not gotten around till then? I'd like > > to push this sooner than the exact end of the freeze... > > I worked on this, but I've not got the patch finished yet. Thanks! I'm not quite clear whether you planning to continue working on this, or whether this is a handoff? Either is fine with me, just trying to avoid unnecessary work / delay. > Of course, it is possible that some of the bugs account for some of > the improved time... but the rows did seem to be in the table > afterwards. The speedup likely seems to be from having much larger batches, right? Unless we insert into enough partitions that batching doesn't ever encounter two tuples, we'll now efficiently use multi_insert even if there's no consecutive tuples. > if (cstate->rel) > { > - Datum *values; > - bool *nulls; > + TupleTableSlot *slot; > TableScanDesc scandesc; > - HeapTuple tuple; > - > - values = (Datum *) palloc(num_phys_attrs * sizeof(Datum)); > - nulls = (bool *) palloc(num_phys_attrs * sizeof(bool)); > > scandesc = table_beginscan(cstate->rel, GetActiveSnapshot(), 0, NULL); > + slot = table_slot_create(cstate->rel, NULL); > > processed = 0; > - while ((tuple = heap_getnext(scandesc, ForwardScanDirection)) != NULL) > + while (table_scan_getnextslot(scandesc, ForwardScanDirection, slot)) > { > CHECK_FOR_INTERRUPTS(); > > - /* Deconstruct the tuple ... faster than repeated heap_getattr */ > - heap_deform_tuple(tuple, tupDesc, values, nulls); > + /* Deconstruct the tuple ... 
*/ > + slot_getallattrs(slot); > > /* Format and send the data */ > - CopyOneRowTo(cstate, values, nulls); > + CopyOneRowTo(cstate, slot); > processed++; > } > > + ExecDropSingleTupleTableSlot(slot); > table_endscan(scandesc); > - > - pfree(values); > - pfree(nulls); > } Hm, I should just have committed this part separately. Not sure why I didn't... > +#define MAX_BUFFERED_TUPLES 1000 > +#define MAX_BUFFERED_BYTES 65535 > +#define MAX_PARTITION_BUFFERS 16 > + > +typedef struct { > + Oid relid; > + TupleTableSlot **slots; > + ResultRelInfo *resultRelInfo; > + int nused; > +} CopyMultiInsertBuffer; > + > +typedef struct { > + HTAB *multiInsertBufferTab; > + CopyMultiInsertBuffer *buffer; > + int bufferedTuples; > + int bufferedBytes; > + int nbuffers; > +} CopyMultiInsertInfo; To me it seems a bit better to have this as generic facility - we really should use multi_insert in more places (like inserting the pg_attribute rows). But admittedly that can just be done later. > +static inline void > +CopyMultiInsertBuffer_RemoveBuffer(CopyMultiInsertInfo *miinfo, CopyMultiInsertBuffer *buffer) > +{ > + Oid relid = buffer->relid; > + int i = 0; > + > + while (buffer->slots[i] != NULL) > + ExecClearTuple(buffer->slots[i++]); > + pfree(buffer->slots); > + > + hash_search(miinfo->multiInsertBufferTab, (void *) &relid, HASH_REMOVE, > + NULL); > + miinfo->nbuffers--; > +} Hm, this leaves dangling slots, right? 
> +static inline void > +CopyMultiInsertInfo_SetCurrentBuffer(CopyMultiInsertInfo *miinfo, > + ResultRelInfo *rri) > +{ > + CopyMultiInsertBuffer *buffer; > + Oid relid = RelationGetRelid(rri->ri_RelationDesc); > + bool found; > + > + Assert(miinfo->multiInsertBufferTab != NULL); > + > + buffer = hash_search(miinfo->multiInsertBufferTab, &relid, HASH_ENTER, &found); > + > + if (!found) > + { > + buffer->relid = relid; > + buffer->slots = palloc0(MAX_BUFFERED_TUPLES * sizeof(TupleTableSlot *)); > + buffer->resultRelInfo = rri; > + buffer->nused = 0; > + miinfo->nbuffers++; > + } > + > + miinfo->buffer = buffer; > +} It still seems wrong to me to just perform a second hashtable search here, given that we've already done the partition dispatch. > if (proute) > insertMethod = CIM_MULTI_CONDITIONAL; > else > insertMethod = CIM_MULTI; Hm, why do we still have CIM_MULTI and CIM_MULTI_CONDITIONAL? > /* > - * Tests have shown that using multi-inserts when the > - * partition changes on every tuple slightly decreases the > - * performance, however, there are benefits even when only > - * some batches have just 2 tuples, so let's enable > - * multi-inserts even when the average is quite low. > + * Enable multi-inserts when the partition has BEFORE/INSTEAD > + * OF triggers, or if the partition is a foreign partition. > */ > leafpart_use_multi_insert = insertMethod == CIM_MULTI_CONDITIONAL && > - avgTuplesPerPartChange >= 1.3 && > - !has_before_insert_row_trig && > - !has_instead_insert_row_trig && > - resultRelInfo->ri_FdwRoutine == NULL; > + !has_before_insert_row_trig && > + !has_instead_insert_row_trig && > + resultRelInfo->ri_FdwRoutine == NULL; Not for now / this commit, but I kinda feel like we should determine whether it's safe to multi-insert at a more centralized location. Greetings, Andres Freund
On Tue, 2 Apr 2019 at 05:19, Andres Freund <andres@anarazel.de> wrote: > > On 2019-04-02 04:23:20 +1300, David Rowley wrote: > > On Mon, 1 Apr 2019 at 08:05, Andres Freund <andres@anarazel.de> wrote: > > > I'll work on pushing all the other pending tableam patches today - > > > leaving COPY the last non pluggable part. You'd written in a private > > > email that you might try to work on this on Monday, so I think I'll give > > > this a shot on Tuesday if you've not gotten around till then? I'd like > > > to push this sooner than the exact end of the freeze... > > > > I worked on this, but I've not got the patch finished yet. > > Thanks! I'm not quite clear whether you planning to continue working on > this, or whether this is a handoff? Either is fine with me, just trying > to avoid unnecessary work / delay. I can, if you've not. I was hoping to gauge if you thought the approach was worth pursuing. > > Of course, it is possible that some of the bugs account for some of > > the improved time... but the rows did seem to be in the table > > afterwards. > > The speedup likely seems to be from having much larger batches, right? > Unless we insert into enough partitions that batching doesn't ever > encounter two tuples, we'll now efficiently use multi_insert even if > there's no consecutive tuples. I imagine that's part of it, but I was surprised that the test that inserts into the same partition also was improved fairly significantly. > > +static inline void > > +CopyMultiInsertBuffer_RemoveBuffer(CopyMultiInsertInfo *miinfo, CopyMultiInsertBuffer *buffer) > > +{ > > + Oid relid = buffer->relid; > > + int i = 0; > > + > > + while (buffer->slots[i] != NULL) > > + ExecClearTuple(buffer->slots[i++]); > > + pfree(buffer->slots); > > + > > + hash_search(miinfo->multiInsertBufferTab, (void *) &relid, HASH_REMOVE, > > + NULL); > > + miinfo->nbuffers--; > > +} > > Hm, this leaves dangling slots, right? Probably. 
I didn't quite finish figuring out how a slot should have all its memory released. > It still seems wrong to me to just perform a second hashtable search > here, givent that we've already done the partition dispatch. The reason I thought this was a good idea is that if we use the ResultRelInfo to buffer the tuples then there's no end to how many tuple slots can exist as the code in copy.c has no control over how many ResultRelInfos are created. > > if (proute) > > insertMethod = CIM_MULTI_CONDITIONAL; > > else > > insertMethod = CIM_MULTI; > > Hm, why do we still have CIM_MULTI and CIM_MULTI_CONDITIONAL? We don't know what partitions will be foreign or have triggers that don't allow us to multi-insert. Multi-inserts are still conditional based on that. > Not for now / this commit, but I kinda feel like we should determine > whether it's safe to multi-insert at a more centralized location. Yeah, I was thinking that might be nice but teaching those Multi-insert functions about that means they'd need to handle non-multi inserts too. That might only be a problem that highlights that the functions I added are not named very well. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2019-04-02 13:41:57 +1300, David Rowley wrote: > On Tue, 2 Apr 2019 at 05:19, Andres Freund <andres@anarazel.de> wrote: > > > > On 2019-04-02 04:23:20 +1300, David Rowley wrote: > > > On Mon, 1 Apr 2019 at 08:05, Andres Freund <andres@anarazel.de> wrote: > > > > I'll work on pushing all the other pending tableam patches today - > > > > leaving COPY the last non pluggable part. You'd written in a private > > > > email that you might try to work on this on Monday, so I think I'll give > > > > this a shot on Tuesday if you've not gotten around till then? I'd like > > > > to push this sooner than the exact end of the freeze... > > > > > > I worked on this, but I've not got the patch finished yet. > > > > Thanks! I'm not quite clear whether you planning to continue working on > > this, or whether this is a handoff? Either is fine with me, just trying > > to avoid unnecessary work / delay. > > I can, if you've not. I was hoping to gauge if you thought the > approach was worth pursuing. I think it's worth pursuing, with the caveats below. I'm going to focus on docs the not-very-long rest of today, but I definitely could work on this afterwards. But I also would welcome any help. Let me know... > > > +static inline void > > > +CopyMultiInsertBuffer_RemoveBuffer(CopyMultiInsertInfo *miinfo, CopyMultiInsertBuffer *buffer) > > > +{ > > > + Oid relid = buffer->relid; > > > + int i = 0; > > > + > > > + while (buffer->slots[i] != NULL) > > > + ExecClearTuple(buffer->slots[i++]); > > > + pfree(buffer->slots); > > > + > > > + hash_search(miinfo->multiInsertBufferTab, (void *) &relid, HASH_REMOVE, > > > + NULL); > > > + miinfo->nbuffers--; > > > +} > > > > Hm, this leaves dangling slots, right? > > Probably. I didn't quite finish figuring out how a slot should have > all its memory released. ExecDropSingleTupleTableSlot() ought to do the job (if not added to a tuptable/estate - which we shouldn't as currently one-by-one removal from therein is expensive). 
> > It still seems wrong to me to just perform a second hashtable search > > here, givent that we've already done the partition dispatch. > > The reason I thought this was a good idea is that if we use the > ResultRelInfo to buffer the tuples then there's no end to how many > tuple slots can exist as the code in copy.c has no control over how > many ResultRelInfos are created. To me those aren't contradictory - we're going to have a ResultRelInfo for each partition either way, but there's nothing preventing copy.c from cleaning up subsidiary data in it. What I was thinking is that we'd just keep track of a list of ResultRelInfos with bulk insert slots, and occasionally clean them up. That way we avoid the secondary lookup, while also managing the amount of slots. Greetings, Andres Freund
On Tue, 2 Apr 2019 at 13:59, Andres Freund <andres@anarazel.de> wrote: > > On 2019-04-02 13:41:57 +1300, David Rowley wrote: > > On Tue, 2 Apr 2019 at 05:19, Andres Freund <andres@anarazel.de> wrote: > > > Thanks! I'm not quite clear whether you planning to continue working on > > > this, or whether this is a handoff? Either is fine with me, just trying > > > to avoid unnecessary work / delay. > > > > I can, if you've not. I was hoping to gauge if you thought the > > approach was worth pursuing. > > I think it's worth pursuing, with the caveats below. I'm going to focus > on docs the not-very-long rest of today, but I definitely could work on > this afterwards. But I also would welcome any help. Let me know... I'm looking now. I'll post something when I get it into some better shape than it is now. > > > It still seems wrong to me to just perform a second hashtable search > > > here, givent that we've already done the partition dispatch. > > > > The reason I thought this was a good idea is that if we use the > > ResultRelInfo to buffer the tuples then there's no end to how many > > tuple slots can exist as the code in copy.c has no control over how > > many ResultRelInfos are created. > > To me those aren't contradictory - we're going to have a ResultRelInfo > for each partition either way, but there's nothing preventing copy.c > from cleaning up subsidiary data in it. What I was thinking is that > we'd just keep track of a list of ResultRelInfos with bulk insert slots, > and occasionally clean them up. That way we avoid the secondary lookup, > while also managing the amount of slots. The problem that I see with that is you can't just add to that list when the partition changes. You must check if the ResultRelInfo is already in the list or not since we could change partitions and change back again. 
For a list with just a few elements, checking list_member_ptr should be pretty cheap, but I somewhat arbitrarily chose to keep just the last 16 partitions' worth of buffers. I don't think checking list_member_ptr in a 16-element list is likely to be faster than a hash table lookup, do you? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2019-04-02 14:06:52 +1300, David Rowley wrote: > On Tue, 2 Apr 2019 at 13:59, Andres Freund <andres@anarazel.de> wrote: > > > > On 2019-04-02 13:41:57 +1300, David Rowley wrote: > > > On Tue, 2 Apr 2019 at 05:19, Andres Freund <andres@anarazel.de> wrote: > > > > Thanks! I'm not quite clear whether you planning to continue working on > > > > this, or whether this is a handoff? Either is fine with me, just trying > > > > to avoid unnecessary work / delay. > > > > > > I can, if you've not. I was hoping to gauge if you thought the > > > approach was worth pursuing. > > > > I think it's worth pursuing, with the caveats below. I'm going to focus > > on docs the not-very-long rest of today, but I definitely could work on > > this afterwards. But I also would welcome any help. Let me know... > > I'm looking now. I'll post something when I get it into some better > shape than it us now. Cool. > > > > It still seems wrong to me to just perform a second hashtable search > > > > here, givent that we've already done the partition dispatch. > > > > > > The reason I thought this was a good idea is that if we use the > > > ResultRelInfo to buffer the tuples then there's no end to how many > > > tuple slots can exist as the code in copy.c has no control over how > > > many ResultRelInfos are created. > > > > To me those aren't contradictory - we're going to have a ResultRelInfo > > for each partition either way, but there's nothing preventing copy.c > > from cleaning up subsidiary data in it. What I was thinking is that > > we'd just keep track of a list of ResultRelInfos with bulk insert slots, > > and occasionally clean them up. That way we avoid the secondary lookup, > > while also managing the amount of slots. > > The problem that I see with that is you can't just add to that list > when the partition changes. You must check if the ResultRelInfo is > already in the list or not since we could change partitions and change > back again. 
For a list with just a few elements checking > list_member_ptr should be pretty cheap, but I randomly did choose that > we try to keep just the last 16 partitions worth of buffers. I don't > think checking list_member_ptr in a 16 element list is likely to be > faster than a hash table lookup, do you? Why do we need that list membership check? If you append the ResultRelInfo to the list when creating the ResultRelInfo's slots array, you don't need to touch the list after a partition lookup - you know it's a member if the ResultRelInfo has a slot array. Then you only need to iterate the list when you want to drop slots to avoid using too much memory - and that's also a sequential scan of the hash table in your approach, right? Greetings, Andres Freund
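The scheme Andres describes can be sketched in isolation: a relation is appended to the list only when its slot array is first allocated, so the hot path after a partition lookup never searches the list; the list is walked only when trimming or flushing buffers. The types below are hypothetical simplifications for illustration, not the real ResultRelInfo.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_BUFFERED_TUPLES 1000

typedef struct RelInfo
{
    void          **slots;  /* NULL until this relation first buffers a tuple */
    int             nused;
    struct RelInfo *next;   /* link in the list of relations owning buffers */
} RelInfo;

static RelInfo *buffered_rels;  /* walked only when dropping/flushing buffers */
static int      nbuffered_rels;

/*
 * Called after every partition lookup.  Membership in the list is implied
 * by slots != NULL, so no list_member_ptr()-style search is needed: we
 * append exactly once, when the slot array is created.
 */
static void
ensure_buffers(RelInfo *rri)
{
    if (rri->slots == NULL)
    {
        rri->slots = calloc(MAX_BUFFERED_TUPLES, sizeof(void *));
        rri->next = buffered_rels;
        buffered_rels = rri;
        nbuffered_rels++;
    }
}
```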
On Tue, 2 Apr 2019 at 14:11, Andres Freund <andres@anarazel.de> wrote: > Why do we need that list membership check? If you append the > ResultRelInfo to the list when creating the ResultRelInfo's slots array, > you don't need to touch the list after a partition lookup - you know > it's a member if the ResultRelInfo has a slot array. Then you only need > to iterate the list when you want to drop slots to avoid using too much > memory - and that's also a sequential scan of the hash table in your > approach, right? Okay, you're right, we could just determine if we've got a new ResultRelInfo by the lack of any bulk insert slots being set in it. However, I've ended up not doing it that way as the patch requires more than just an array of TupleTableSlots to be stored in the ResultRelInfo, it'll pretty much need all of what I have in CopyMultiInsertBuffer, which includes line numbers for error reporting, a BulkInsertState and a counter to track how many of the slots are used. I had thoughts that we could just tag the whole CopyMultiInsertBuffer in ResultRelInfo, but that requires moving it somewhere a bit more generic. Another thought would be that we have something like "void *extra;" in ResultRelInfo that we can just borrow for copy.c. It may be interesting to try this out to see if it saves much in the way of performance. I've got the patch into a working state now and wrote a bunch of comments. I've not really done a thorough read back of the code again. I normally wait until the light is coming from a different angle before doing that, but there does not seem to be much time to wait for that in this case, so here's v2. Performance seems to be about the same as what I reported yesterday. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachments
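The buffering described above can be modelled roughly as follows: tuples and their line numbers accumulate per buffer, and a flush happens when either the tuple count or the byte count crosses its threshold. This is a simplified standalone sketch (tiny thresholds, plain strings instead of TupleTableSlots, no BulkInsertState), not the actual patch code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_BUFFERED_TUPLES 4       /* tiny for illustration; the patch uses 1000 */
#define MAX_BUFFERED_BYTES  65535

typedef struct MultiInsertBuffer
{
    const char *tuples[MAX_BUFFERED_TUPLES];
    uint64_t    linenos[MAX_BUFFERED_TUPLES];   /* kept for error reporting */
    int         nused;
    uint64_t    nbytes;
} MultiInsertBuffer;

static int flush_count;             /* number of batched inserts performed */

static void
flush(MultiInsertBuffer *buf)
{
    /* stand-in for table_multi_insert() plus index/trigger work */
    flush_count++;
    buf->nused = 0;
    buf->nbytes = 0;
}

static void
store_tuple(MultiInsertBuffer *buf, const char *tuple, uint64_t lineno)
{
    buf->tuples[buf->nused] = tuple;
    buf->linenos[buf->nused] = lineno;
    buf->nused++;
    buf->nbytes += strlen(tuple);

    /* Flush when full, or when buffered bytes grow too large */
    if (buf->nused == MAX_BUFFERED_TUPLES || buf->nbytes > MAX_BUFFERED_BYTES)
        flush(buf);
}
```

The byte-count limit matters for exceptionally wide tuples, which would otherwise pin a lot of memory before the tuple-count limit is reached.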
On 2019-04-03 06:41:49 +1300, David Rowley wrote: > However, I've ended up not doing it that way as the patch requires > more than just an array of TupleTableSlots to be stored in the > ResultRelInfo, it'll pretty much need all of what I have in > CopyMultiInsertBuffer, which includes line numbers for error > reporting, a BulkInsertState and a counter to track how many of the > slots are used. I had thoughts that we could just tag the whole > CopyMultiInsertBuffer in ResultRelInfo, but that requires moving it > somewhere a bit more generic. Another thought would be that we have > something like "void *extra;" in ResultRelInfo that we can just borrow > for copy.c. It may be interesting to try this out to see if it saves > much in the way of performance. Hm, we could just forward declare struct CopyMultiInsertBuffer in a header, and only define it in copy.c. That doesn't sound insane to me. > I've got the patch into a working state now and wrote a bunch of > comments. I've not really done a thorough read back of the code again. > I normally wait until the light is coming from a different angle > before doing that, but there does not seem to be much time to wait for > that in this case, so here's v2. Performance seems to be about the > same as what I reported yesterday. Cool. > +/* > + * Multi-Insert handling code for COPY FROM. Inserts are performed in > + * batches of up to MAX_BUFFERED_TUPLES. This should probably be moved to the top of the file. > + * When COPY FROM is used on a partitioned table, we attempt to maintain > + * only MAX_PARTITION_BUFFERS buffers at once. Buffers that are completely > + * unused in each batch are removed so that we end up not keeping buffers for > + * partitions that we might not insert into again. > + */ > +#define MAX_BUFFERED_TUPLES 1000 > +#define MAX_BUFFERED_BYTES 65535 > +#define MAX_PARTITION_BUFFERS 15 > +/* > + * CopyMultiInsertBuffer > + * Stores multi-insert data related to a single relation in CopyFrom. 
> + */ > +typedef struct Please don't create anonymous structs - they can't be forward declared, and some debugging tools handle them worse than named structs. And it makes naming CopyMultiInsertBuffer in the comment less necessary. > +{ > + Oid relid; /* Relation ID this insert data is for */ > + TupleTableSlot **slots; /* Array of MAX_BUFFERED_TUPLES to store > + * tuples */ > + uint64 *linenos; /* Line # of tuple in copy stream */ It could make sense to allocate those two together, or even as part of the CopyMultiInsertBuffer itself, to reduce allocator overhead. > + else > + { > + CopyMultiInsertBuffer *buffer = (CopyMultiInsertBuffer *) palloc(sizeof(CopyMultiInsertBuffer)); > + > + /* Non-partitioned table. Just setup the buffer for the table. */ > + buffer->relid = RelationGetRelid(rri->ri_RelationDesc); > + buffer->slots = palloc0(MAX_BUFFERED_TUPLES * sizeof(TupleTableSlot *)); > + buffer->linenos = palloc(MAX_BUFFERED_TUPLES * sizeof(uint64)); > + buffer->resultRelInfo = rri; > + buffer->bistate = GetBulkInsertState(); > + buffer->nused = 0; > + miinfo->multiInsertBufferTab = NULL; > + miinfo->buffer = buffer; > + miinfo->nbuffers = 1; > + } Can this be moved into a helper function? > + /* > + * heap_multi_insert leaks memory, so switch to short-lived memory context > + * before calling it. > + */ s/heap_multi_insert/table_multi_insert/ > + /* > + * If there are any indexes, update them for all the inserted tuples, and > + * run AFTER ROW INSERT triggers. 
> + */ > + if (resultRelInfo->ri_NumIndices > 0) > + { > + for (i = 0; i < nBufferedTuples; i++) > + { > + List *recheckIndexes; > + > + cstate->cur_lineno = buffer->linenos[i]; > + recheckIndexes = > + ExecInsertIndexTuples(buffer->slots[i], estate, false, NULL, > + NIL); > + ExecARInsertTriggers(estate, resultRelInfo, > + buffer->slots[i], > + recheckIndexes, cstate->transition_capture); > + list_free(recheckIndexes); > + } > + } > + > + /* > + * There's no indexes, but see if we need to run AFTER ROW INSERT triggers > + * anyway. > + */ > + else if (resultRelInfo->ri_TrigDesc != NULL && > + (resultRelInfo->ri_TrigDesc->trig_insert_after_row || > + resultRelInfo->ri_TrigDesc->trig_insert_new_table)) > + { > + for (i = 0; i < nBufferedTuples; i++) > + { > + cstate->cur_lineno = buffer->linenos[i]; > + ExecARInsertTriggers(estate, resultRelInfo, > + bufferedSlots[i], > + NIL, cstate->transition_capture); > + } > + } > + for (i = 0; i < nBufferedTuples; i++) > + ExecClearTuple(bufferedSlots[i]); I wonder about combining these loops somehow. But it's probably ok. > +/* > + * CopyMultiInsertBuffer_Cleanup > + * Drop used slots and free member for this buffer. The buffer > + * must be flushed before cleanup. > + */ > +static inline void > +CopyMultiInsertBuffer_Cleanup(CopyMultiInsertBuffer *buffer) > +{ > + int i; > + > + ReleaseBulkInsertStatePin(buffer->bistate); Shouldn't this FreeBulkInsertState() rather than ReleaseBulkInsertStatePin()? > + > +/* > + * CopyMultiInsertBuffer_RemoveBuffer > + * Remove a buffer from being tracked by miinfo > + */ > +static inline void > +CopyMultiInsertBuffer_RemoveBuffer(CopyMultiInsertInfo *miinfo, > + CopyMultiInsertBuffer *buffer) > +{ > + Oid relid = buffer->relid; > + > + CopyMultiInsertBuffer_Cleanup(buffer); > + > + hash_search(miinfo->multiInsertBufferTab, (void *) &relid, HASH_REMOVE, > + NULL); > + miinfo->nbuffers--; Aren't we leaking the CopyMultiInsertBuffer itself here? 
> +} > + > +/* > + * CopyMultiInsertInfo_Flush > + * Write out all stored tuples in all buffers out to the tables. > + * > + * To save us from ending up with buffers for 1000s of partitions we remove > + * buffers belonging to partitions that we've seen no tuples for in this batch > + */ > +static inline void > +CopyMultiInsertInfo_Flush(CopyMultiInsertInfo *miinfo, CopyState cstate, > + EState *estate, CommandId mycid, int ti_options) > +{ > + CopyMultiInsertBuffer *buffer; > + > + /* > + * If just storing buffers for a non-partitioned table, then just flush > + * that buffer. > + */ > + if (miinfo->multiInsertBufferTab == NULL) > + { > + buffer = miinfo->buffer; > + > + CopyMultiInsertInfo_FlushSingleBuffer(buffer, cstate, estate, mycid, > + ti_options); > + } > + else > + { > + HASH_SEQ_STATUS status; > + > + /* > + * Otherwise make a pass over the hash table and flush all buffers > + * that have any tuples stored in them. > + */ > + hash_seq_init(&status, miinfo->multiInsertBufferTab); > + > + while ((buffer = (CopyMultiInsertBuffer *) hash_seq_search(&status)) != NULL) > + { > + if (buffer->nused > 0) > + { > + /* Flush the buffer if it was used */ > + CopyMultiInsertInfo_FlushSingleBuffer(buffer, cstate, estate, > + mycid, ti_options); > + } > + else > + { > + /* > + * Otherwise just remove it. If we saw no tuples for it this > + * batch, then likely its best to make way for buffers for > + * other partitions. 
> + */ > + CopyMultiInsertBuffer_RemoveBuffer(miinfo, buffer); > + } > + } > + } > + > + miinfo->bufferedTuples = 0; > + miinfo->bufferedBytes = 0; > +} > + > +/* > + * CopyMultiInsertInfo_Cleanup > + * Cleanup allocated buffers and free memory > + */ > +static inline void > +CopyMultiInsertInfo_Cleanup(CopyMultiInsertInfo *miinfo) > +{ > + if (miinfo->multiInsertBufferTab == NULL) > + CopyMultiInsertBuffer_Cleanup(miinfo->buffer); > + else > + { > + HASH_SEQ_STATUS status; > + CopyMultiInsertBuffer *buffer; > + > + hash_seq_init(&status, miinfo->multiInsertBufferTab); > + > + while ((buffer = (CopyMultiInsertBuffer *) hash_seq_search(&status)) != NULL) > + { > + Assert(buffer->nused == 0); > + CopyMultiInsertBuffer_Cleanup(buffer); > + } > + > + hash_destroy(miinfo->multiInsertBufferTab); > + } > +} > + > +/* > + * CopyMultiInsertInfo_NextFreeSlot > + * Get the next TupleTableSlot that the next tuple should be stored in. > + * > + * Callers must ensure that the buffer is not full. > + */ > +static inline TupleTableSlot * > +CopyMultiInsertInfo_NextFreeSlot(CopyMultiInsertInfo *miinfo, > + ResultRelInfo *rri) > +{ > + CopyMultiInsertBuffer *buffer = miinfo->buffer; > + int nused = buffer->nused; > + > + Assert(nused < MAX_BUFFERED_TUPLES); > + > + if (buffer->slots[nused] == NULL) > + buffer->slots[nused] = table_slot_create(rri->ri_RelationDesc, NULL); > + return buffer->slots[nused]; > +} > + > +/* > + * CopyMultiInsertInfo_Store > + * Consume the previously reserved TupleTableSlot that was reserved by > + * CopyMultiInsertInfo_NextFreeSlot. 
> + */ > +static inline void > +CopyMultiInsertInfo_Store(CopyMultiInsertInfo *miinfo, TupleTableSlot *slot, > + int tuplen, uint64 lineno) > +{ > + CopyMultiInsertBuffer *buffer = miinfo->buffer; > + > + Assert(slot == buffer->slots[buffer->nused]); > + > + /* Store the line number so we can properly report any errors later */ > + buffer->linenos[buffer->nused] = lineno; > + > + /* Record this slot as being used */ > + buffer->nused++; > + > + /* Update how many tuples are stored and their size */ > + miinfo->bufferedTuples++; > + miinfo->bufferedBytes += tuplen; > +} > + > /* > * Copy FROM file to relation. > */ > uint64 > CopyFrom(CopyState cstate) > { > - HeapTuple tuple; > - TupleDesc tupDesc; > - Datum *values; > - bool *nulls; > ResultRelInfo *resultRelInfo; > ResultRelInfo *target_resultRelInfo; > ResultRelInfo *prevResultRelInfo = NULL; > EState *estate = CreateExecutorState(); /* for ExecConstraints() */ > ModifyTableState *mtstate; > ExprContext *econtext; > - TupleTableSlot *myslot; > + TupleTableSlot *singleslot = NULL; > MemoryContext oldcontext = CurrentMemoryContext; > - MemoryContext batchcontext; > > PartitionTupleRouting *proute = NULL; > ErrorContextCallback errcallback; > CommandId mycid = GetCurrentCommandId(true); > int ti_options = 0; /* start with default table_insert options */ > - BulkInsertState bistate; > + BulkInsertState bistate = NULL; > CopyInsertMethod insertMethod; > + CopyMultiInsertInfo multiInsertInfo; > uint64 processed = 0; > - int nBufferedTuples = 0; > bool has_before_insert_row_trig; > bool has_instead_insert_row_trig; > bool leafpart_use_multi_insert = false; > > -#define MAX_BUFFERED_TUPLES 1000 > -#define RECHECK_MULTI_INSERT_THRESHOLD 1000 > - HeapTuple *bufferedTuples = NULL; /* initialize to silence warning */ > - Size bufferedTuplesSize = 0; > - uint64 firstBufferedLineNo = 0; > - uint64 lastPartitionSampleLineNo = 0; > - uint64 nPartitionChanges = 0; > - double avgTuplesPerPartChange = 0; > - > Assert(cstate->rel); 
> > + memset(&multiInsertInfo, 0, sizeof(CopyMultiInsertInfo)); > + > /* > * The target must be a plain, foreign, or partitioned relation, or have > * an INSTEAD OF INSERT row trigger. (Currently, such triggers are only > @@ -2382,8 +2767,6 @@ CopyFrom(CopyState cstate) > RelationGetRelationName(cstate->rel)))); > } > > - tupDesc = RelationGetDescr(cstate->rel); > - > /*---------- > * Check to see if we can avoid writing WAL > * > @@ -2467,8 +2850,8 @@ CopyFrom(CopyState cstate) > if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) > { > ereport(ERROR, > - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > - errmsg("cannot perform FREEZE on a partitioned table"))); > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot perform FREEZE on a partitioned table"))); > } > > /* > @@ -2518,10 +2901,6 @@ CopyFrom(CopyState cstate) > > ExecInitRangeTable(estate, cstate->range_table); > > - /* Set up a tuple slot too */ > - myslot = ExecInitExtraTupleSlot(estate, tupDesc, > - &TTSOpsHeapTuple); > - > /* > * Set up a ModifyTableState so we can let FDW(s) init themselves for > * foreign-table result relation(s). > @@ -2565,10 +2944,11 @@ CopyFrom(CopyState cstate) > &mtstate->ps); > > /* > - * It's more efficient to prepare a bunch of tuples for insertion, and > - * insert them in one heap_multi_insert() call, than call heap_insert() > - * separately for every tuple. However, there are a number of reasons why > - * we might not be able to do this. These are explained below. > + * It's generally more efficient to prepare a bunch of tuples for > + * insertion, and insert them in one table_multi_insert() call, than call > + * table_insert() separately for every tuple. However, there are a number > + * of reasons why we might not be able to do this. These are explained > + * below. 
> */ > if (resultRelInfo->ri_TrigDesc != NULL && > (resultRelInfo->ri_TrigDesc->trig_insert_before_row || > @@ -2589,8 +2969,8 @@ CopyFrom(CopyState cstate) > * For partitioned tables we can't support multi-inserts when there > * are any statement level insert triggers. It might be possible to > * allow partitioned tables with such triggers in the future, but for > - * now, CopyFromInsertBatch expects that any before row insert and > - * statement level insert triggers are on the same relation. > + * now, CopyMultiInsertInfo_Flush expects that any before row insert > + * and statement level insert triggers are on the same relation. > */ > insertMethod = CIM_SINGLE; > } > @@ -2622,8 +3002,7 @@ CopyFrom(CopyState cstate) > { > /* > * For partitioned tables, we may still be able to perform bulk > - * inserts for sets of consecutive tuples which belong to the same > - * partition. However, the possibility of this depends on which types > + * inserts. However, the possibility of this depends on which types > * of triggers exist on the partition. We must disable bulk inserts > * if the partition is a foreign table or it has any before row insert > * or insert instead triggers (same as we checked above for the parent > @@ -2632,18 +3011,27 @@ CopyFrom(CopyState cstate) > * have the intermediate insert method of CIM_MULTI_CONDITIONAL to > * flag that we must later determine if we can use bulk-inserts for > * the partition being inserted into. > - * > - * Normally, when performing bulk inserts we just flush the insert > - * buffer whenever it becomes full, but for the partitioned table > - * case, we flush it whenever the current tuple does not belong to the > - * same partition as the previous tuple. 
> */ > if (proute) > insertMethod = CIM_MULTI_CONDITIONAL; > else > insertMethod = CIM_MULTI; > > - bufferedTuples = palloc(MAX_BUFFERED_TUPLES * sizeof(HeapTuple)); > + CopyMultiInsertInfo_Init(&multiInsertInfo, resultRelInfo, > + proute != NULL); > + } > + > + /* > + * If not using batch mode (which allocates slots as needed) set up a > + * tuple slot too. When inserting into a partitioned table, we also need > + * one, even if we might batch insert, to read the tuple in the root > + * partition's form. > + */ > + if (insertMethod == CIM_SINGLE || insertMethod == CIM_MULTI_CONDITIONAL) > + { > + singleslot = table_slot_create(resultRelInfo->ri_RelationDesc, > + &estate->es_tupleTable); > + bistate = GetBulkInsertState(); > } > > has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc && > @@ -2660,10 +3048,6 @@ CopyFrom(CopyState cstate) > */ > ExecBSInsertTriggers(estate, resultRelInfo); > > - values = (Datum *) palloc(tupDesc->natts * sizeof(Datum)); > - nulls = (bool *) palloc(tupDesc->natts * sizeof(bool)); > - > - bistate = GetBulkInsertState(); > econtext = GetPerTupleExprContext(estate); > > /* Set up callback to identify error line number */ > @@ -2672,17 +3056,9 @@ CopyFrom(CopyState cstate) > errcallback.previous = error_context_stack; > error_context_stack = &errcallback; > > - /* > - * Set up memory context for batches. For cases without batching we could > - * use the per-tuple context, but it's simpler to just use it every time. 
> - */ > - batchcontext = AllocSetContextCreate(CurrentMemoryContext, > - "batch context", > - ALLOCSET_DEFAULT_SIZES); > - > for (;;) > { > - TupleTableSlot *slot; > + TupleTableSlot *myslot; > bool skip_tuple; > > CHECK_FOR_INTERRUPTS(); > @@ -2693,20 +3069,33 @@ CopyFrom(CopyState cstate) > */ > ResetPerTupleExprContext(estate); > > + if (insertMethod == CIM_SINGLE || proute) > + { > + myslot = singleslot; > + Assert(myslot != NULL); > + } > + else > + { > + Assert(resultRelInfo == target_resultRelInfo); > + Assert(insertMethod == CIM_MULTI); > + > + myslot = CopyMultiInsertInfo_NextFreeSlot(&multiInsertInfo, > + resultRelInfo); > + } > + > /* > * Switch to per-tuple context before calling NextCopyFrom, which does > * evaluate default expressions etc. and requires per-tuple context. > */ > MemoryContextSwitchTo(GetPerTupleMemoryContext(estate)); > > - if (!NextCopyFrom(cstate, econtext, values, nulls)) > - break; > + ExecClearTuple(myslot); > > - /* Switch into per-batch memory context before forming the tuple. */ > - MemoryContextSwitchTo(batchcontext); > + /* Directly store the values/nulls array in the slot */ > + if (!NextCopyFrom(cstate, econtext, myslot->tts_values, myslot->tts_isnull)) > + break; > > - /* And now we can form the input tuple. */ > - tuple = heap_form_tuple(tupDesc, values, nulls); > + ExecStoreVirtualTuple(myslot); > > /* > * Constraints might reference the tableoid column, so (re-)initialize > @@ -2717,18 +3106,15 @@ CopyFrom(CopyState cstate) > /* Triggers and stuff need to be invoked in query context. 
*/ > MemoryContextSwitchTo(oldcontext); > > - /* Place tuple in tuple slot --- but slot shouldn't free it */ > - slot = myslot; > - ExecStoreHeapTuple(tuple, slot, false); > - > if (cstate->whereClause) > { > econtext->ecxt_scantuple = myslot; > + /* Skip items that don't match the COPY's WHERE clause */ > if (!ExecQual(cstate->qualexpr, econtext)) > continue; > } > > - /* Determine the partition to heap_insert the tuple into */ > + /* Determine the partition to table_insert the tuple into */ > if (proute) > { > TupleConversionMap *map; > @@ -2739,80 +3125,10 @@ CopyFrom(CopyState cstate) > * if the found partition is not suitable for INSERTs. > */ > resultRelInfo = ExecFindPartition(mtstate, target_resultRelInfo, > - proute, slot, estate); > + proute, myslot, estate); > > if (prevResultRelInfo != resultRelInfo) > { > - /* Check if we can multi-insert into this partition */ > - if (insertMethod == CIM_MULTI_CONDITIONAL) > - { > - /* > - * When performing bulk-inserts into partitioned tables we > - * must insert the tuples seen so far to the heap whenever > - * the partition changes. > - */ > - if (nBufferedTuples > 0) > - { > - MemoryContext oldcontext; > - > - CopyFromInsertBatch(cstate, estate, mycid, ti_options, > - prevResultRelInfo, myslot, bistate, > - nBufferedTuples, bufferedTuples, > - firstBufferedLineNo); > - nBufferedTuples = 0; > - bufferedTuplesSize = 0; > - > - /* > - * The tuple is already allocated in the batch context, which > - * we want to reset. So to keep the tuple we copy it into the > - * short-lived (per-tuple) context, reset the batch context > - * and then copy it back into the per-batch one. 
> - */ > - oldcontext = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate)); > - tuple = heap_copytuple(tuple); > - MemoryContextSwitchTo(oldcontext); > - > - /* cleanup the old batch */ > - MemoryContextReset(batchcontext); > - > - /* copy the tuple back to the per-batch context */ > - oldcontext = MemoryContextSwitchTo(batchcontext); > - tuple = heap_copytuple(tuple); > - MemoryContextSwitchTo(oldcontext); > - > - /* > - * Also push the tuple copy to the slot (resetting the context > - * invalidated the slot contents). > - */ > - ExecStoreHeapTuple(tuple, slot, false); > - } > - > - nPartitionChanges++; > - > - /* > - * Here we adaptively enable multi-inserts based on the > - * average number of tuples from recent multi-insert > - * batches. We recalculate the average every > - * RECHECK_MULTI_INSERT_THRESHOLD tuples instead of taking > - * the average over the whole copy. This allows us to > - * enable multi-inserts when we get periods in the copy > - * stream that have tuples commonly belonging to the same > - * partition, and disable when the partition is changing > - * too often. 
> - */ > - if (unlikely(lastPartitionSampleLineNo <= (cstate->cur_lineno - > - RECHECK_MULTI_INSERT_THRESHOLD) > - && cstate->cur_lineno >= RECHECK_MULTI_INSERT_THRESHOLD)) > - { > - avgTuplesPerPartChange = > - (cstate->cur_lineno - lastPartitionSampleLineNo) / > - (double) nPartitionChanges; > - > - lastPartitionSampleLineNo = cstate->cur_lineno; > - nPartitionChanges = 0; > - } > - } > - > /* Determine which triggers exist on this partition */ > has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc && > resultRelInfo->ri_TrigDesc->trig_insert_before_row); > @@ -2821,22 +3137,19 @@ CopyFrom(CopyState cstate) > resultRelInfo->ri_TrigDesc->trig_insert_instead_row); > > /* > - * Tests have shown that using multi-inserts when the > - * partition changes on every tuple slightly decreases the > - * performance, however, there are benefits even when only > - * some batches have just 2 tuples, so let's enable > - * multi-inserts even when the average is quite low. > + * Enable multi-inserts when the partition has BEFORE/INSTEAD > + * OF triggers, or if the partition is a foreign partition. > */ > leafpart_use_multi_insert = insertMethod == CIM_MULTI_CONDITIONAL && > - avgTuplesPerPartChange >= 1.3 && > !has_before_insert_row_trig && > !has_instead_insert_row_trig && > resultRelInfo->ri_FdwRoutine == NULL; > > - /* > - * We'd better make the bulk insert mechanism gets a new > - * buffer when the partition being inserted into changes. > - */ > + /* Set the multi-insert buffer to use for this partition. */ > + if (leafpart_use_multi_insert) > + CopyMultiInsertInfo_SetCurrentBuffer(&multiInsertInfo, > + resultRelInfo); > + > ReleaseBulkInsertStatePin(bistate); > prevResultRelInfo = resultRelInfo; > } > @@ -2879,26 +3192,48 @@ CopyFrom(CopyState cstate) > * rowtype. 
> */ > map = resultRelInfo->ri_PartitionInfo->pi_RootToPartitionMap; > - if (map != NULL) > + if (insertMethod == CIM_SINGLE || !leafpart_use_multi_insert) > { > - TupleTableSlot *new_slot; > - MemoryContext oldcontext; > - > - new_slot = resultRelInfo->ri_PartitionInfo->pi_PartitionTupleSlot; > - Assert(new_slot != NULL); > - > - slot = execute_attr_map_slot(map->attrMap, slot, new_slot); > + /* non batch insert */ > + if (map != NULL) > + { > + TupleTableSlot *new_slot; > > + new_slot = resultRelInfo->ri_PartitionInfo->pi_PartitionTupleSlot; > + myslot = execute_attr_map_slot(map->attrMap, myslot, new_slot); > + } > + } > + else > + { > /* > - * Get the tuple in the per-batch context, so that it will be > - * freed after each batch insert. > + * Batch insert into partitioned table. > */ > - oldcontext = MemoryContextSwitchTo(batchcontext); > - tuple = ExecCopySlotHeapTuple(slot); > - MemoryContextSwitchTo(oldcontext); > + TupleTableSlot *nextslot; > + > + /* no other path available for partitioned table */ > + Assert(insertMethod == CIM_MULTI_CONDITIONAL); > + > + nextslot = CopyMultiInsertInfo_NextFreeSlot(&multiInsertInfo, > + resultRelInfo); > + > + if (map != NULL) > + myslot = execute_attr_map_slot(map->attrMap, myslot, nextslot); > + else > + { > + /* > + * This looks more expensive than it is (Believe me, I > + * optimized it away. Twice). The input is in virtual > + * form, and we'll materialize the slot below - for most > + * slot types the copy performs the work materialization > + * would later require anyway. 
> + */ > + ExecCopySlot(nextslot, myslot); > + myslot = nextslot; > + } > } > > - slot->tts_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc); > + /* ensure that triggers etc see the right relation */ > + myslot->tts_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc); > } > > skip_tuple = false; > @@ -2906,7 +3241,7 @@ CopyFrom(CopyState cstate) > /* BEFORE ROW INSERT Triggers */ > if (has_before_insert_row_trig) > { > - if (!ExecBRInsertTriggers(estate, resultRelInfo, slot)) > + if (!ExecBRInsertTriggers(estate, resultRelInfo, myslot)) > skip_tuple = true; /* "do nothing" */ > } > > @@ -2919,7 +3254,7 @@ CopyFrom(CopyState cstate) > */ > if (has_instead_insert_row_trig) > { > - ExecIRInsertTriggers(estate, resultRelInfo, slot); > + ExecIRInsertTriggers(estate, resultRelInfo, myslot); > } > else > { > @@ -2931,12 +3266,7 @@ CopyFrom(CopyState cstate) > */ > if (resultRelInfo->ri_RelationDesc->rd_att->constr && > resultRelInfo->ri_RelationDesc->rd_att->constr->has_generated_stored) > - { > - ExecComputeStoredGenerated(estate, slot); > - MemoryContextSwitchTo(batchcontext); > - tuple = ExecCopySlotHeapTuple(slot); > - MemoryContextSwitchTo(oldcontext); > - } > + ExecComputeStoredGenerated(estate, myslot); > > /* > * If the target is a plain table, check the constraints of > @@ -2944,7 +3274,7 @@ CopyFrom(CopyState cstate) > */ > if (resultRelInfo->ri_FdwRoutine == NULL && > resultRelInfo->ri_RelationDesc->rd_att->constr) > - ExecConstraints(resultRelInfo, slot, estate); > + ExecConstraints(resultRelInfo, myslot, estate); > > /* > * Also check the tuple against the partition constraint, if > @@ -2954,7 +3284,7 @@ CopyFrom(CopyState cstate) > */ > if (resultRelInfo->ri_PartitionCheck && > (proute == NULL || has_before_insert_row_trig)) > - ExecPartitionCheck(resultRelInfo, slot, estate, true); > + ExecPartitionCheck(resultRelInfo, myslot, estate, true); > > /* > * Perform multi-inserts when enabled, or when loading a > @@ -2963,31 +3293,21 @@ 
CopyFrom(CopyState cstate) > */ > if (insertMethod == CIM_MULTI || leafpart_use_multi_insert) > { > - /* Add this tuple to the tuple buffer */ > - if (nBufferedTuples == 0) > - firstBufferedLineNo = cstate->cur_lineno; > - bufferedTuples[nBufferedTuples++] = tuple; > - bufferedTuplesSize += tuple->t_len; > - > /* > - * If the buffer filled up, flush it. Also flush if the > - * total size of all the tuples in the buffer becomes > - * large, to avoid using large amounts of memory for the > - * buffer when the tuples are exceptionally wide. > + * The slot previously might point into the per-tuple > + * context. For batching it needs to be longer lived. > */ > - if (nBufferedTuples == MAX_BUFFERED_TUPLES || > - bufferedTuplesSize > 65535) > - { > - CopyFromInsertBatch(cstate, estate, mycid, ti_options, > - resultRelInfo, myslot, bistate, > - nBufferedTuples, bufferedTuples, > - firstBufferedLineNo); > - nBufferedTuples = 0; > - bufferedTuplesSize = 0; > - > - /* free memory occupied by tuples from the batch */ > - MemoryContextReset(batchcontext); > - } > + ExecMaterializeSlot(myslot); > + > + /* Add this tuple to the tuple buffer */ > + CopyMultiInsertInfo_Store(&multiInsertInfo, myslot, > + cstate->line_buf.len, > + cstate->cur_lineno); > + > + /* If the buffer filled up, flush it. 
*/ > + if (CopyMultiInsertInfo_IsFull(&multiInsertInfo)) > + CopyMultiInsertInfo_Flush(&multiInsertInfo, cstate, > + estate, mycid, ti_options); > } > else > { > @@ -2996,12 +3316,12 @@ CopyFrom(CopyState cstate) > /* OK, store the tuple */ > if (resultRelInfo->ri_FdwRoutine != NULL) > { > - slot = resultRelInfo->ri_FdwRoutine->ExecForeignInsert(estate, > - resultRelInfo, > - slot, > - NULL); > + myslot = resultRelInfo->ri_FdwRoutine->ExecForeignInsert(estate, > + resultRelInfo, > + myslot, > + NULL); > > - if (slot == NULL) /* "do nothing" */ > + if (myslot == NULL) /* "do nothing" */ > continue; /* next tuple please */ > > /* > @@ -3009,27 +3329,26 @@ CopyFrom(CopyState cstate) > * column, so (re-)initialize tts_tableOid before > * evaluating them. > */ > - slot->tts_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc); > + myslot->tts_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc); > } > else > { > - tuple = ExecFetchSlotHeapTuple(slot, true, NULL); > - heap_insert(resultRelInfo->ri_RelationDesc, tuple, > - mycid, ti_options, bistate); > - ItemPointerCopy(&tuple->t_self, &slot->tts_tid); > - slot->tts_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc); > + /* OK, store the tuple and create index entries for it */ > + table_insert(resultRelInfo->ri_RelationDesc, myslot, > + mycid, ti_options, bistate); > } > > + > /* And create index entries for it */ > if (resultRelInfo->ri_NumIndices > 0) > - recheckIndexes = ExecInsertIndexTuples(slot, > + recheckIndexes = ExecInsertIndexTuples(myslot, > estate, > false, > NULL, > NIL); > > /* AFTER ROW INSERT Triggers */ > - ExecARInsertTriggers(estate, resultRelInfo, slot, > + ExecARInsertTriggers(estate, resultRelInfo, myslot, > recheckIndexes, cstate->transition_capture); > > list_free(recheckIndexes); > @@ -3045,32 +3364,25 @@ CopyFrom(CopyState cstate) > } > } > > - /* Flush any remaining buffered tuples */ > - if (nBufferedTuples > 0) > + if (insertMethod != CIM_SINGLE) > { > - if 
(insertMethod == CIM_MULTI_CONDITIONAL) > - { > - CopyFromInsertBatch(cstate, estate, mycid, ti_options, > - prevResultRelInfo, myslot, bistate, > - nBufferedTuples, bufferedTuples, > - firstBufferedLineNo); > - } > - else > - CopyFromInsertBatch(cstate, estate, mycid, ti_options, > - resultRelInfo, myslot, bistate, > - nBufferedTuples, bufferedTuples, > - firstBufferedLineNo); > + /* Flush any remaining buffered tuples */ > + if (!CopyMultiInsertInfo_IsEmpty(&multiInsertInfo)) > + CopyMultiInsertInfo_Flush(&multiInsertInfo, cstate, estate, mycid, > + ti_options); > + > + /* Tear down the multi-insert data */ > + CopyMultiInsertInfo_Cleanup(&multiInsertInfo); > } > > /* Done, clean up */ > error_context_stack = errcallback.previous; > > - FreeBulkInsertState(bistate); > + if (bistate != NULL) > + ReleaseBulkInsertStatePin(bistate); > > MemoryContextSwitchTo(oldcontext); > > - MemoryContextDelete(batchcontext); > - > /* > * In the old protocol, tell pqcomm that we can process normal protocol > * messages again. > @@ -3084,9 +3396,6 @@ CopyFrom(CopyState cstate) > /* Handle queued AFTER triggers */ > AfterTriggerEndQuery(estate); > > - pfree(values); > - pfree(nulls); > - > ExecResetTupleTable(estate->es_tupleTable, false); > > /* Allow the FDW to shut down */ > @@ -3111,88 +3420,6 @@ CopyFrom(CopyState cstate) > return processed; > } > > -/* > - * A subroutine of CopyFrom, to write the current batch of buffered heap > - * tuples to the heap. Also updates indexes and runs AFTER ROW INSERT > - * triggers. 
> - */ > -static void > -CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid, > - int ti_options, ResultRelInfo *resultRelInfo, > - TupleTableSlot *myslot, BulkInsertState bistate, > - int nBufferedTuples, HeapTuple *bufferedTuples, > - uint64 firstBufferedLineNo) > -{ > - MemoryContext oldcontext; > - int i; > - uint64 save_cur_lineno; > - bool line_buf_valid = cstate->line_buf_valid; > - > - /* > - * Print error context information correctly, if one of the operations > - * below fail. > - */ > - cstate->line_buf_valid = false; > - save_cur_lineno = cstate->cur_lineno; > - > - /* > - * heap_multi_insert leaks memory, so switch to short-lived memory context > - * before calling it. > - */ > - oldcontext = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate)); > - heap_multi_insert(resultRelInfo->ri_RelationDesc, > - bufferedTuples, > - nBufferedTuples, > - mycid, > - ti_options, > - bistate); > - MemoryContextSwitchTo(oldcontext); > - > - /* > - * If there are any indexes, update them for all the inserted tuples, and > - * run AFTER ROW INSERT triggers. > - */ > - if (resultRelInfo->ri_NumIndices > 0) > - { > - for (i = 0; i < nBufferedTuples; i++) > - { > - List *recheckIndexes; > - > - cstate->cur_lineno = firstBufferedLineNo + i; > - ExecStoreHeapTuple(bufferedTuples[i], myslot, false); > - recheckIndexes = > - ExecInsertIndexTuples(myslot, > - estate, false, NULL, NIL); > - ExecARInsertTriggers(estate, resultRelInfo, > - myslot, > - recheckIndexes, cstate->transition_capture); > - list_free(recheckIndexes); > - } > - } > - > - /* > - * There's no indexes, but see if we need to run AFTER ROW INSERT triggers > - * anyway. 
> - */ > - else if (resultRelInfo->ri_TrigDesc != NULL && > - (resultRelInfo->ri_TrigDesc->trig_insert_after_row || > - resultRelInfo->ri_TrigDesc->trig_insert_new_table)) > - { > - for (i = 0; i < nBufferedTuples; i++) > - { > - cstate->cur_lineno = firstBufferedLineNo + i; > - ExecStoreHeapTuple(bufferedTuples[i], myslot, false); > - ExecARInsertTriggers(estate, resultRelInfo, > - myslot, > - NIL, cstate->transition_capture); > - } > - } > - > - /* reset cur_lineno and line_buf_valid to what they were */ > - cstate->line_buf_valid = line_buf_valid; > - cstate->cur_lineno = save_cur_lineno; > -} > - > /* > * Setup to read tuples from a file for COPY FROM. > * > @@ -4990,11 +5217,8 @@ copy_dest_receive(TupleTableSlot *slot, DestReceiver *self) > DR_copy *myState = (DR_copy *) self; > CopyState cstate = myState->cstate; > > - /* Make sure the tuple is fully deconstructed */ > - slot_getallattrs(slot); > - > - /* And send the data */ > - CopyOneRowTo(cstate, slot->tts_values, slot->tts_isnull); > + /* Send the data */ > + CopyOneRowTo(cstate, slot); > myState->processed++; > > return true; > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h > index 4c077755d5..ed0e2de144 100644 > --- a/src/include/access/heapam.h > +++ b/src/include/access/heapam.h > @@ -36,6 +36,7 @@ > #define HEAP_INSERT_SPECULATIVE 0x0010 > > typedef struct BulkInsertStateData *BulkInsertState; > +struct TupleTableSlot; > > #define MaxLockTupleMode LockTupleExclusive > > @@ -143,7 +144,7 @@ extern void ReleaseBulkInsertStatePin(BulkInsertState bistate); > > extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid, > int options, BulkInsertState bistate); > -extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples, > +extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots, int ntuples, > CommandId cid, int options, BulkInsertState bistate); > extern TM_Result heap_delete(Relation relation, ItemPointer tid, > 
CommandId cid, Snapshot crosscheck, bool wait, > diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h > index 4efe178ed1..c2fdedc551 100644 > --- a/src/include/access/tableam.h > +++ b/src/include/access/tableam.h > @@ -328,6 +328,9 @@ typedef struct TableAmRoutine > * ------------------------------------------------------------------------ > */ > > + void (*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots, > + CommandId cid, int options, struct BulkInsertStateData *bistate); > + > /* see table_insert() for reference about parameters */ > void (*tuple_insert) (Relation rel, TupleTableSlot *slot, > CommandId cid, int options, > @@ -1157,6 +1160,17 @@ table_update(Relation rel, ItemPointer otid, TupleTableSlot *slot, > lockmode, update_indexes); > } > > +/* > + * table_multi_insert - insert multiple tuple into a table > + */ > +static inline void > +table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots, > + CommandId cid, int options, struct BulkInsertStateData *bistate) > +{ > + rel->rd_tableam->multi_insert(rel, slots, nslots, > + cid, options, bistate); > +} > + > /* > * Lock a tuple in the specified mode. > * Greetings, Andres Freund
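The `table_multi_insert()` wrapper in the hunk above follows the table access method dispatch pattern: the relation carries a pointer to a table of callbacks, and a static inline wrapper forwards the call through it. A minimal single-file sketch of that pattern — all `Demo*` names are hypothetical stand-ins, not PostgreSQL APIs:

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoSlot DemoSlot;   /* opaque stand-in for TupleTableSlot */

/* stand-in for TableAmRoutine: a struct of function pointers */
typedef struct DemoAmRoutine
{
    void        (*multi_insert) (DemoSlot **slots, int nslots, int *ninserted);
} DemoAmRoutine;

/* stand-in for RelationData, which points at its access method's routine */
typedef struct DemoRelation
{
    const DemoAmRoutine *rd_tableam;
} DemoRelation;

/* one concrete AM implementation, counting how many tuples it was handed */
static void
demo_heap_multi_insert(DemoSlot **slots, int nslots, int *ninserted)
{
    (void) slots;               /* a real AM would write the tuples here */
    *ninserted += nslots;
}

static const DemoAmRoutine demo_heap_routine = {
    .multi_insert = demo_heap_multi_insert
};

/* mirrors table_multi_insert(): dispatch through the relation's AM */
static inline void
demo_table_multi_insert(DemoRelation *rel, DemoSlot **slots, int nslots,
                        int *ninserted)
{
    rel->rd_tableam->multi_insert(slots, nslots, ninserted);
}
```

The callers never name a concrete AM: swapping `rd_tableam` is enough to route the same call to a different storage engine.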
On 2019-04-03 06:41:49 +1300, David Rowley wrote: > However, I've ended up not doing it that way as the patch requires > more than just an array of TupleTableSlots to be stored in the > ResultRelInfo, it'll pretty much need all of what I have in > CopyMultiInsertBuffer, which includes line numbers for error > reporting, a BulkInsertState and a counter to track how many of the > slots are used. I had thoughts that we could just tag the whole > CopyMultiInsertBuffer in ResultRelInfo, but that requires moving it > somewhere a bit more generic. Another thought would be that we have > something like "void *extra;" in ResultRelInfo that we can just borrow > for copy.c. It may be interesting to try this out to see if it saves > much in the way of performance. Hm, we could just forward declare struct CopyMultiInsertBuffer in a header, and only define it in copy.c. That doesn't sound insane to me. > I've got the patch into a working state now and wrote a bunch of > comments. I've not really done a thorough read back of the code again. > I normally wait until the light is coming from a different angle > before doing that, but there does not seem to be much time to wait for > that in this case, so here's v2. Performance seems to be about the > same as what I reported yesterday. Cool. > +/* > + * Multi-Insert handling code for COPY FROM. Inserts are performed in > + * batches of up to MAX_BUFFERED_TUPLES. This should probably be moved to the top of the file. > + * When COPY FROM is used on a partitioned table, we attempt to maintain > + * only MAX_PARTITION_BUFFERS buffers at once. Buffers that are completely > + * unused in each batch are removed so that we end up not keeping buffers for > + * partitions that we might not insert into again. > + */ > +#define MAX_BUFFERED_TUPLES 1000 > +#define MAX_BUFFERED_BYTES 65535 > +#define MAX_PARTITION_BUFFERS 15 > +/* > + * CopyMultiInsertBuffer > + * Stores multi-insert data related to a single relation in CopyFrom.
> + */ > +typedef struct Please don't create anonymous structs - they can't be forward declared, and some debugging tools handle them worse than named structs. And it makes naming CopyMultiInsertBuffer in the comment less necessary. > +{ > + Oid relid; /* Relation ID this insert data is for */ > + TupleTableSlot **slots; /* Array of MAX_BUFFERED_TUPLES to store > + * tuples */ > + uint64 *linenos; /* Line # of tuple in copy stream */ It could make sense to allocate those two together, or even as part of the CopyMultiInsertBuffer itself, to reduce allocator overhead. > + else > + { > + CopyMultiInsertBuffer *buffer = (CopyMultiInsertBuffer *) palloc(sizeof(CopyMultiInsertBuffer)); > + > + /* Non-partitioned table. Just setup the buffer for the table. */ > + buffer->relid = RelationGetRelid(rri->ri_RelationDesc); > + buffer->slots = palloc0(MAX_BUFFERED_TUPLES * sizeof(TupleTableSlot *)); > + buffer->linenos = palloc(MAX_BUFFERED_TUPLES * sizeof(uint64)); > + buffer->resultRelInfo = rri; > + buffer->bistate = GetBulkInsertState(); > + buffer->nused = 0; > + miinfo->multiInsertBufferTab = NULL; > + miinfo->buffer = buffer; > + miinfo->nbuffers = 1; > + } Can this be moved into a helper function? > + /* > + * heap_multi_insert leaks memory, so switch to short-lived memory context > + * before calling it. > + */ s/heap_multi_insert/table_multi_insert/ > + /* > + * If there are any indexes, update them for all the inserted tuples, and > + * run AFTER ROW INSERT triggers. 
> + */ > + if (resultRelInfo->ri_NumIndices > 0) > + { > + for (i = 0; i < nBufferedTuples; i++) > + { > + List *recheckIndexes; > + > + cstate->cur_lineno = buffer->linenos[i]; > + recheckIndexes = > + ExecInsertIndexTuples(buffer->slots[i], estate, false, NULL, > + NIL); > + ExecARInsertTriggers(estate, resultRelInfo, > + buffer->slots[i], > + recheckIndexes, cstate->transition_capture); > + list_free(recheckIndexes); > + } > + } > + > + /* > + * There's no indexes, but see if we need to run AFTER ROW INSERT triggers > + * anyway. > + */ > + else if (resultRelInfo->ri_TrigDesc != NULL && > + (resultRelInfo->ri_TrigDesc->trig_insert_after_row || > + resultRelInfo->ri_TrigDesc->trig_insert_new_table)) > + { > + for (i = 0; i < nBufferedTuples; i++) > + { > + cstate->cur_lineno = buffer->linenos[i]; > + ExecARInsertTriggers(estate, resultRelInfo, > + bufferedSlots[i], > + NIL, cstate->transition_capture); > + } > + } > + for (i = 0; i < nBufferedTuples; i++) > + ExecClearTuple(bufferedSlots[i]); I wonder about combining these loops somehow. But it's probably ok. > +/* > + * CopyMultiInsertBuffer_Cleanup > + * Drop used slots and free member for this buffer. The buffer > + * must be flushed before cleanup. > + */ > +static inline void > +CopyMultiInsertBuffer_Cleanup(CopyMultiInsertBuffer *buffer) > +{ > + int i; > + > + ReleaseBulkInsertStatePin(buffer->bistate); Shouldn't this FreeBulkInsertState() rather than ReleaseBulkInsertStatePin()? > + > +/* > + * CopyMultiInsertBuffer_RemoveBuffer > + * Remove a buffer from being tracked by miinfo > + */ > +static inline void > +CopyMultiInsertBuffer_RemoveBuffer(CopyMultiInsertInfo *miinfo, > + CopyMultiInsertBuffer *buffer) > +{ > + Oid relid = buffer->relid; > + > + CopyMultiInsertBuffer_Cleanup(buffer); > + > + hash_search(miinfo->multiInsertBufferTab, (void *) &relid, HASH_REMOVE, > + NULL); > + miinfo->nbuffers--; Aren't we leaking the CopyMultiInsertBuffer itself here? 
> +/* > + * CopyMultiInsertInfo_Flush > + * Write out all stored tuples in all buffers out to the tables. > + * > + * To save us from ending up with buffers for 1000s of partitions we remove > + * buffers belonging to partitions that we've seen no tuples for in this batch That seems a little naive (imagine you have like 5 partitions, and we constantly cycle through 2-3 of them per batch). It's probably OK for this version. I guess I'd only do that cleanup if we're actually flushing due to the number of partitions. > if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) > { > ereport(ERROR, > - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > - errmsg("cannot perform FREEZE on a partitioned table"))); > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot perform FREEZE on a partitioned table"))); > } accidental hunk? Greetings, Andres Freund
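The forward-declaration approach mentioned at the top of this review — declare `struct CopyMultiInsertBuffer` in a header and define it only in copy.c — is the standard opaque-pointer idiom in C. A single-file sketch, with the header and implementation halves marked by comments (`Demo*` names are hypothetical):

```c
#include <assert.h>
#include <stdlib.h>

/* ---- header part: only a forward declaration is exported ---- */
struct DemoBuffer;              /* layout is opaque to other translation units */

typedef struct DemoResultRelInfo
{
    struct DemoBuffer *ri_buffer;   /* a pointer is fine without the definition */
} DemoResultRelInfo;

struct DemoBuffer *demo_buffer_create(void);
int         demo_buffer_nused(const struct DemoBuffer *buf);

/* ---- implementation part: the real definition stays private ---- */
struct DemoBuffer
{
    int         nused;          /* number of buffered tuples */
};

struct DemoBuffer *
demo_buffer_create(void)
{
    return calloc(1, sizeof(struct DemoBuffer));
}

int
demo_buffer_nused(const struct DemoBuffer *buf)
{
    return buf->nused;
}
```

Other files can store and pass `struct DemoBuffer *` freely but cannot touch its fields, so the struct layout can change without recompiling them.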
Gah, ignore this mail - I've somehow accidentally sent this early. I'm not quite sure, but I miskeyed, and emacs sent it early, without going through mutt...
(Fixed all of what you mentioned) On Wed, 3 Apr 2019 at 06:57, Andres Freund <andres@anarazel.de> wrote: > > +/* > > + * CopyMultiInsertInfo_Flush > > + * Write out all stored tuples in all buffers out to the tables. > > + * > > + * To save us from ending up with buffers for 1000s of partitions we remove > > + * buffers belonging to partitions that we've seen no tuples for in this batch > > That seems a little naive (imagine you have like 5 partitions, and we > constantly cycle through 2-3 of them per batch). It's probably OK for > this version. I guess I'd only do that cleanup if we're actually > flushing due to the number of partitions. hmm good point. It seems like being smarter there would be a good thing. I've ended up getting rid of the hash table in favour of the List that you mentioned and storing the buffer in ResultRelInfo. I also changed the logic that removes buffers once we reach the limit. Instead of getting rid of buffers that were not used on this run, I've changed it so it just gets rid of the buffers starting with the oldest one first, but stops once the number of buffers is at the maximum again. This can mean that we end up with MAX_BUFFERED_TUPLES buffers instead of MAX_PARTITION_BUFFERS if there is only 1 tuple per buffer. My current thinking is that this does not matter since only 1 slot will be allocated per buffer. We'll remove all of the excess buffers during the flush and keep just MAX_PARTITION_BUFFERS of the newest buffers. Also, after changing CopyMultiInsertBuffer to use fixed sized arrays instead of allocating them with another palloc the performance has improved a bit more. 
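The eviction described above — after a flush, drop buffers oldest-first until the count is back at the limit — can be sketched with a plain linked list standing in for the patch's List (all names hypothetical):

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_BUFFERS 4      /* stand-in for MAX_PARTITION_BUFFERS */

typedef struct DemoBuf
{
    int         relid;          /* partition this buffer belongs to */
    struct DemoBuf *next;
} DemoBuf;

/*
 * Append a buffer at the tail, so the head is always the oldest entry,
 * mirroring how new buffers are appended to the List.
 */
static void
demo_append(DemoBuf **head, DemoBuf **tail, int *count, int relid)
{
    DemoBuf    *buf = calloc(1, sizeof(DemoBuf));

    buf->relid = relid;
    if (*tail)
        (*tail)->next = buf;
    else
        *head = buf;
    *tail = buf;
    (*count)++;
}

/*
 * After a flush, drop buffers oldest-first until we're back at the limit.
 */
static void
demo_trim(DemoBuf **head, DemoBuf **tail, int *count)
{
    while (*count > DEMO_MAX_BUFFERS)
    {
        DemoBuf    *victim = *head;

        *head = victim->next;
        if (*head == NULL)
            *tail = NULL;
        free(victim);
        (*count)--;
    }
}
```

Between flushes the list is allowed to grow past the limit, which is why a run with one tuple per buffer can briefly hold many more buffers than `DEMO_MAX_BUFFERS`.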
Using the perl files mentioned in [1] Master + Patched: # copy listp from program $$perl ~/bench_same.pl$$ delimiter '|'; COPY 35651564 Time: 9106.776 ms (00:09.107) # truncate table listp; TRUNCATE TABLE # copy listp from program $$perl ~/bench.pl$$ delimiter '|'; COPY 35651564 Time: 10154.196 ms (00:10.154) Master only: # copy listp from program $$perl ~/bench_same.pl$$ delimiter '|'; COPY 35651564 Time: 22200.535 ms (00:22.201) # truncate table listp; TRUNCATE TABLE # copy listp from program $$perl ~/bench.pl$$ delimiter '|'; COPY 35651564 Time: 18592.107 ms (00:18.592) > > if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) > > { > > ereport(ERROR, > > - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > - errmsg("cannot perform FREEZE on a partitioned table"))); > > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > + errmsg("cannot perform FREEZE on a partitioned table"))); > > } > > accidental hunk? It was left over from a pgindent run. Now removed. [1] https://www.postgresql.org/message-id/CAKJS1f98Fa+QRTGKwqbtz0M=Cy1EHYR8Q-W08cpA78tOy4euKQ@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachments
On 2019-04-03 18:44:32 +1300, David Rowley wrote: > (Fixed all of what you mentioned) > > On Wed, 3 Apr 2019 at 06:57, Andres Freund <andres@anarazel.de> wrote: > > > +/* > > > + * CopyMultiInsertInfo_Flush > > > + * Write out all stored tuples in all buffers out to the tables. > > > + * > > > + * To save us from ending up with buffers for 1000s of partitions we remove > > > + * buffers belonging to partitions that we've seen no tuples for in this batch > > > > That seems a little naive (imagine you have like 5 partitions, and we > > constantly cycle through 2-3 of them per batch). It's probably OK for > > this version. I guess I'd only do that cleanup if we're actually > > flushing due to the number of partitions. > > hmm good point. It seems like being smarter there would be a good thing. > > I've ended up getting rid of the hash table in favour of the List that > you mentioned and storing the buffer in ResultRelInfo. Cool. > I also changed > the logic that removes buffers once we reach the limit. Instead of > getting rid of buffers that were not used on this run, I've changed it > so it just gets rid of the buffers starting with the oldest one first, > but stops once the number of buffers is at the maximum again. This > can mean that we end up with MAX_BUFFERED_TUPLES buffers instead of > MAX_PARTITION_BUFFERS if there is only 1 tuple per buffer. My current > thinking is that this does not matter since only 1 slot will be > allocated per buffer. We'll remove all of the excess buffers during > the flush and keep just MAX_PARTITION_BUFFERS of the newest buffers. Yea, that makes sense to me. > Also, after changing CopyMultiInsertBuffer to use fixed sized arrays > instead of allocating them with another palloc the performance has > improved a bit more. 
> > Using the perl files mentioned in [1] > > Master + Patched: > # copy listp from program $$perl ~/bench_same.pl$$ delimiter '|'; > COPY 35651564 > Time: 9106.776 ms (00:09.107) > # truncate table listp; > TRUNCATE TABLE > # copy listp from program $$perl ~/bench.pl$$ delimiter '|'; > COPY 35651564 > Time: 10154.196 ms (00:10.154) > > > Master only: > # copy listp from program $$perl ~/bench_same.pl$$ delimiter '|'; > COPY 35651564 > Time: 22200.535 ms (00:22.201) > # truncate table listp; > TRUNCATE TABLE > # copy listp from program $$perl ~/bench.pl$$ delimiter '|'; > COPY 35651564 > Time: 18592.107 ms (00:18.592) Awesome. I'm probably too tired to give you meaningful feedback tonight. But I'll do a quick readthrough. > +/* Class the buffer full if there are >= this many bytes of tuples stored */ > +#define MAX_BUFFERED_BYTES 65535 Random aside: This seems pretty small (but should be changed separately. > +/* Trim the list of buffers back down to this number after flushing */ > +#define MAX_PARTITION_BUFFERS 32 > + > +/* Stores multi-insert data related to a single relation in CopyFrom. */ > +typedef struct CopyMultiInsertBuffer > +{ > + TupleTableSlot *slots[MAX_BUFFERED_TUPLES]; /* Array to store tuples */ > + ResultRelInfo *resultRelInfo; /* ResultRelInfo for 'relid' */ > + BulkInsertState bistate; /* BulkInsertState for this rel */ > + int nused; /* number of 'slots' containing tuples */ > + uint64 linenos[MAX_BUFFERED_TUPLES]; /* Line # of tuple in copy > + * stream */ > +} CopyMultiInsertBuffer; I don't think it's needed now, but you'd probably achieve a bit better locality by storing slots + linenos in a single array of (slot,lineno). > +/* > + * CopyMultiInsertBuffer_Init > + * Allocate memory and initialize a new CopyMultiInsertBuffer for this > + * ResultRelInfo. 
> + */ > +static CopyMultiInsertBuffer * > +CopyMultiInsertBuffer_Init(ResultRelInfo *rri) > +{ > + CopyMultiInsertBuffer *buffer; > + > + buffer = (CopyMultiInsertBuffer *) palloc(sizeof(CopyMultiInsertBuffer)); > + memset(buffer->slots, 0, sizeof(TupleTableSlot *) * MAX_BUFFERED_TUPLES); Is there a reason to not just directly palloc0? > +/* > + * CopyMultiInsertBuffer_Cleanup > + * Drop used slots and free member for this buffer. The buffer > + * must be flushed before cleanup. > + */ > +static inline void > +CopyMultiInsertBuffer_Cleanup(CopyMultiInsertBuffer *buffer) > +{ > + int i; > + > + /* Ensure buffer was flushed */ > + Assert(buffer->nused == 0); > + > + /* Remove back-link to ourself */ > + buffer->resultRelInfo->ri_CopyMultiInsertBuffer = NULL; > + > + ReleaseBulkInsertStatePin(buffer->bistate); Hm, afaict this still leaks the bistate itself? Greetings, Andres Freund
On Wed, 3 Apr 2019 at 18:56, Andres Freund <andres@anarazel.de> wrote: > > +/* Class the buffer full if there are >= this many bytes of tuples stored */ > > +#define MAX_BUFFERED_BYTES 65535 > > Random aside: This seems pretty small (but should be changed separately. Yeah, my fingers were hovering over this. I was close to making it 1MB, but thought I'd better not. > > +typedef struct CopyMultiInsertBuffer > > +{ > > + TupleTableSlot *slots[MAX_BUFFERED_TUPLES]; /* Array to store tuples */ > > + ResultRelInfo *resultRelInfo; /* ResultRelInfo for 'relid' */ > > + BulkInsertState bistate; /* BulkInsertState for this rel */ > > + int nused; /* number of 'slots' containing tuples */ > > + uint64 linenos[MAX_BUFFERED_TUPLES]; /* Line # of tuple in copy > > + * stream */ > > +} CopyMultiInsertBuffer; > > I don't think it's needed now, but you'd probably achieve a bit better > locality by storing slots + linenos in a single array of (slot,lineno). You're right, but right now I only zero the slots array and leave the linenos uninitialised. Merging them to one would double the area of memory to zero. I'm not sure if that's worse or not, but I do know if the number of partitions exceeds MAX_PARTITION_BUFFERS and we change partitions quite a lot during the copy then we could end up initialising buffers quite often. I wanted to keep that as cheap as possible. > > +/* > > + * CopyMultiInsertBuffer_Init > > + * Allocate memory and initialize a new CopyMultiInsertBuffer for this > > + * ResultRelInfo. > > + */ > > +static CopyMultiInsertBuffer * > > +CopyMultiInsertBuffer_Init(ResultRelInfo *rri) > > +{ > > + CopyMultiInsertBuffer *buffer; > > + > > + buffer = (CopyMultiInsertBuffer *) palloc(sizeof(CopyMultiInsertBuffer)); > > + memset(buffer->slots, 0, sizeof(TupleTableSlot *) * MAX_BUFFERED_TUPLES); > > Is there a reason to not just directly palloc0? Yeah, I'm too cheap to zero the entire thing, per what I mentioned above. 
linenos does not really need anything meaningful put there during init. It'll get set when we consume the slots. > > + > > + /* Remove back-link to ourself */ > > + buffer->resultRelInfo->ri_CopyMultiInsertBuffer = NULL; > > + > > + ReleaseBulkInsertStatePin(buffer->bistate); > > Hm, afaict this still leaks the bistate itself? Oops, I forgot about that one. v4 attached. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
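The layout being discussed — fixed-size `slots` and `linenos` arrays inside the buffer struct, with only the slot pointers zeroed at init — can be sketched as follows (`Demo*` names are hypothetical; in copy.c the slot type is `TupleTableSlot` and the allocation uses palloc):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BUFFERED_TUPLES 1000   /* stand-in for MAX_BUFFERED_TUPLES */

typedef struct DemoSlot DemoSlot;   /* opaque stand-in for TupleTableSlot */

typedef struct DemoMultiInsertBuffer
{
    DemoSlot   *slots[DEMO_BUFFERED_TUPLES];    /* must start out NULL */
    uint64_t    linenos[DEMO_BUFFERED_TUPLES];  /* written when a slot is
                                                 * consumed, so deliberately
                                                 * left uninitialized */
    int         nused;          /* number of 'slots' containing tuples */
} DemoMultiInsertBuffer;

static DemoMultiInsertBuffer *
demo_buffer_init(void)
{
    DemoMultiInsertBuffer *buffer = malloc(sizeof(DemoMultiInsertBuffer));

    /* Zero only what has to be zero: the slot pointers and the counter. */
    memset(buffer->slots, 0, sizeof(buffer->slots));
    buffer->nused = 0;
    return buffer;
}
```

Keeping the two arrays separate is what makes this partial zeroing possible; an array of (slot, lineno) pairs would force zeroing twice the memory on every init, which matters when buffers are created and destroyed frequently.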
Attachments
Hi, On 2019-04-03 20:00:09 +1300, David Rowley wrote: > Oops, I forgot about that one. v4 attached. I'm pretty happy with this. I'm doing some minor changes (e.g. don't like the function comment formatting that much, the tableam callback needs docs, stuff like that), and then I'm going to push it tomorrow. I'm planning to attribute it to you, me, and Haribabu Kommi. Oh, btw, is there a reason you're memset(0)'ing multiInsertInfo? Seems unnecessary, given that in all cases we're using it we're going to do CopyMultiInsertInfo_Init(). And IME valgrind and the compiler are more helpful when you don't just default initialize, because then they can tell when you forgot to initialize a field (say like CopyMultiInsertInfo_Init not initializing nbuffers). Greetings, Andres Freund
On Thu, 4 Apr 2019 at 18:20, Andres Freund <andres@anarazel.de> wrote: > I'm pretty happy with this. I'm doing some minor changes (e.g. don't > like the function comment formatting that much, the tableam callback > needs docs, stuff like that), and then I'm going to push it tomorrow. > > I'm planning to attribute it to you, me, and Haribabu Kommi. Sounds good. > > Oh, btw, is there a reason you're memset(0)'ing multiInsertInfo? Seems > unnecessary, given that in all cases we're using it we're going to do > CopyMultiInsertInfo_Init(). And IME valgrind and the compiler are more > helpful when you don't just default initialize, because then they can > tell when you forgot to initialize a field (say like > CopyMultiInsertInfo_Init not initializing nbuffers). At some point when I was hacking at it, I was getting a warning about it being possibly uninitialised. I don't recall exactly how the code was then, but I just tried removing it and I don't get any warning now. So probably fine to remove. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2019-04-03 22:20:00 -0700, Andres Freund wrote: > Hi, > > On 2019-04-03 20:00:09 +1300, David Rowley wrote: > > Oops, I forgot about that one. v4 attached. > > I'm pretty happy with this. I'm doing some minor changes (e.g. don't > like the function comment formatting that much, the tableam callback > needs docs, stuff like that), and then I'm going to push it tomorrow. After another read, I also think I'm going to rename the functions a bit. CopyMultiInsertInfo_SetupBuffer, with its mix of camel-case and underscores, just doesn't seem to mix well with copy.c and also just generally other postgres code. I feel we already have too many different naming conventions... Greetings, Andres Freund
Hi, On 2019-04-04 12:04:28 -0700, Andres Freund wrote: > On 2019-04-03 22:20:00 -0700, Andres Freund wrote: > > On 2019-04-03 20:00:09 +1300, David Rowley wrote: > > > Oops, I forgot about that one. v4 attached. > > > > I'm pretty happy with this. I'm doing some minor changes (e.g. don't > > like the function comment formatting that much, the tableam callback > > needs docs, stuff like that), and then I'm going to push it tomorrow. > > After another read, I also think I'm going to rename the functions a > bit. CopyMultiInsertInfo_SetupBuffer, with its mix of camel-case and > underscores, just doesn't seem to mix well with copy.c and also just generally other postgres code. I feel we already have too many different > naming conventions... I've pushed this now. Besides those naming changes, I had to re-add the zero initialization (I did end up seeing compiler warnings when compiling with optimizations). Also some comment fixes. I added one more flush location, when inserting into a partition that cannot use batching - there might e.g. be triggers looking at rows in other partitions or such. I found some pretty minor slowdown for COPYing narrow rows into an unlogged unpartitioned table. That got better by combining the loops in CopyMultiInsertBufferFlush() (previously there were stalls due to the indirect function calls for ExecClearTuple()). I think there's still a tiny slowdown left in that scenario, but given that it's unlogged and very narrow rows, I think that's ok. Greetings, Andres Freund
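The combined-loop fix mentioned in this mail — clearing each slot in the same pass that does the per-tuple work, rather than a second loop of indirect clear calls over the whole array — can be sketched like this (hypothetical `demo_*` names; in copy.c the clear is a call through the slot's ops table, which is where the indirect-call stalls came from):

```c
#include <assert.h>

typedef struct DemoSlot
{
    int         filled;
} DemoSlot;

/* stand-in for the per-tuple work (index inserts, AFTER ROW triggers) */
static void
demo_after_row_work(DemoSlot *slot)
{
    (void) slot;
}

/* stand-in for the per-slot clear (ExecClearTuple in the real code) */
static void
demo_clear(DemoSlot *slot)
{
    slot->filled = 0;
}

/*
 * One pass over the buffer: the per-tuple work and the clear run back to
 * back, instead of a separate clear-only loop over the whole array.
 */
static void
demo_flush(DemoSlot *slots, int nused)
{
    for (int i = 0; i < nused; i++)
    {
        demo_after_row_work(&slots[i]);
        demo_clear(&slots[i]);
    }
}
```

Fusing the loops keeps each slot hot in cache across both operations and avoids a tight loop whose body is nothing but an indirect call.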
On Fri, 5 Apr 2019 at 12:32, Andres Freund <andres@anarazel.de> wrote: > I've pushed this now. Besides those naming changes, I'd to re-add the > zero initialization (I did end up seing compiler warnings when compiling > with optimizations). Also some comment fixes. Thanks for pushing. > I added one more flush location, when inserting into a partition that > cannot use batching - there might e.g. be triggers looking at rows in > other partitions or such. Good catch. I read through the final commit to see what had changed and I noticed that I had forgotten to remove nbuffers from CopyMultiInsertInfo when changing from a hash table to a List. Patch to fix is attached. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachments
Hi, On 2019-04-05 12:59:06 +1300, David Rowley wrote: > I read through the final commit to see what had changed and I noticed > that I had forgotten to remove nbuffers from CopyMultiInsertInfo when > changing from a hash table to a List. > > Patch to fix is attached. Pushed, thanks.