Обсуждение: [PATCH] Performance Improvement For Copy From Binary Files

Поиск

Список

Период

Сортировка

[PATCH] Performance Improvement For Copy From Binary Files

От

Bharath Rupireddy

Дата:

29 июня 2020 г., 08:20:59

Hi Hackers,

For Copy From Binary files, there exists below information for each tuple/row.

1. field count(number of columns)

2. for every field, field size(column data length)

3. field data of field size(actual column data)

Currently, all the above data required at each step is read directly from file using fread() and this happens for all the tuples/rows.

One observation is that in the total execution time of a copy from binary file, the fread() call is taking upto 20% of time and the fread() function call count is also too high.

For instance, with a dataset of size 5.3GB, 10million tuples with 10 columns,

total exec time in sec	total time taken for fread()	fread() function call count
101.193	21.33	210000005
101.345	21.436	210000005

The total time taken for fread() and the corresponding function call count may increase if we have more number of columns for instance 1000.

One solution to this problem is to read data from binary file in RAW_BUF_SIZE(64KB) chunks to avoid repeatedly calling fread()(thus possibly avoiding few disk IOs). This is similar to the approach followed for csv/text files.

Attaching a patch, implementing the above solution for binary format files.

Below is the improvement gained.

total exec time in sec	total time taken for fread()	fread() function call count
75.757	2.73	160884
75.351	2.742	160884

Execution is 1.36X times faster, fread() time is reduced by 87%, fread() call count is reduced by 99%.

Request the community to take this patch for review if this approach and improvement seem beneficial.

Any suggestions to improve further are most welcome.

Attached also is the config file used for testing the above use case.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Bharath Rupireddy

Дата:

01 июля 2020 г., 12:33:06

Hi,

Added this to commitfest incase this is useful - https://commitfest.postgresql.org/28/

With Regards,

Bharath Rupireddy.

EnterpriseDB: http://www.enterprisedb.com

On Mon, Jun 29, 2020 at 10:50 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

Hi Hackers,

For Copy From Binary files, there exists below information for each tuple/row.
1. field count(number of columns)
2. for every field, field size(column data length)
3. field data of field size(actual column data)

Currently, all the above data required at each step is read directly from file using fread() and this happens for all the tuples/rows.

One observation is that in the total execution time of a copy from binary file, the fread() call is taking upto 20% of time and the fread() function call count is also too high.

For instance, with a dataset of size 5.3GB, 10million tuples with 10 columns,
total exec time in sec total time taken for fread() fread() function call count
101.193 21.33 210000005
101.345 21.436 210000005

The total time taken for fread() and the corresponding function call count may increase if we have more number of columns for instance 1000.

One solution to this problem is to read data from binary file in RAW_BUF_SIZE(64KB) chunks to avoid repeatedly calling fread()(thus possibly avoiding few disk IOs). This is similar to the approach followed for csv/text files.

Attaching a patch, implementing the above solution for binary format files.

Below is the improvement gained.
total exec time in sec total time taken for fread() fread() function call count
75.757 2.73 160884
75.351 2.742 160884

Execution is 1.36X times faster, fread() time is reduced by 87%, fread() call count is reduced by 99%.

Request the community to take this patch for review if this approach and improvement seem beneficial.

Any suggestions to improve further are most welcome.

Attached also is the config file used for testing the above use case.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Amit Langote

Дата:

07 июля 2020 г., 10:28:20

Hi Bharath,

On Mon, Jun 29, 2020 at 2:21 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> For Copy From Binary files, there exists below information for each tuple/row.
> 1. field count(number of columns)
> 2. for every field, field size(column data length)
> 3. field data of field size(actual column data)
>
> Currently, all the above data required at each step is read directly from file using fread() and this happens for all
thetuples/rows.
 
>
> One observation is that in the total execution time of a copy from binary file, the fread() call is taking upto 20%
oftime and the fread() function call count is also too high.
 
>
> For instance, with a dataset of size 5.3GB, 10million tuples with 10 columns,
> total exec time in sec total time taken for fread() fread() function call count
> 101.193 21.33 210000005
> 101.345 21.436 210000005
>
> The total time taken for fread() and the corresponding function call count may increase if we have more number of
columnsfor instance 1000.
 
>
> One solution to this problem is to read data from binary file in RAW_BUF_SIZE(64KB) chunks to avoid repeatedly
callingfread()(thus possibly avoiding few disk IOs). This is similar to the approach followed for csv/text files.
 

I agree that having the buffer in front of the file makes sense,
although we do now have an extra memcpy, that is, from raw_buf to
attribute_buf.data.  Currently, fread() reads directly into
attribute_buf.data.  But maybe that's okay as I don't see the new copy
being all that bad.

> Attaching a patch, implementing the above solution for binary format files.
>
> Below is the improvement gained.
> total exec time in sec total time taken for fread() fread() function call count
> 75.757 2.73 160884
> 75.351 2.742 160884
>
> Execution is 1.36X times faster, fread() time is reduced by 87%, fread() call count is reduced by 99%.
>
> Request the community to take this patch for review if this approach and improvement seem beneficial.
>
> Any suggestions to improve further are most welcome.

Noticed the following misbehaviors when trying to test the patch:

create table foo5 (a text, b text, c text, d text, e text);
insert into foo5 select repeat('a', (random()*100)::int), 'bbb', 'cc',
'd', 'eee' from generate_series(1, 10000000);
copy foo5 to '/tmp/foo5.bin' binary;
truncate foo5;
copy foo5 from '/tmp/foo5.bin' binary;
ERROR:  unexpected EOF in COPY data
CONTEXT:  COPY foo5, line 33, column a

create table bar (a numeric);
insert into bar select sqrt(a) from generate_series(1, 10000) a;
copy bar to '/tmp/bar.bin' binary;
copy bar from '/tmp/bar.bin' binary;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Trying to figure what was going wrong in each of these cases, I found
the new code a bit hard to navigate and debug :(. Here are a couple of
points that I think could have made things a bit easier:

* Avoid spreading the new buffering logic in multiple existing
functions, with similar pieces of code repeated in multiple places.  I
would add a single new function that takes care of the various
buffering details and call it where CopyGetData() is being used
currently.

* You could've reused CopyLoadRawBuffer()'s functionality instead of
reimplementing it.  I also see multiple instances of special case
handling, which often suggests that bugs are lurking.

Considering these points, I came up with the attached patch with a
much smaller footprint.  As far as I can see, it implements the same
basic idea as your patch.  With it, I was able to see an improvement
in loading time consistent with your numbers.  I measured the time of
loading 10 million rows into tables with 5, 10, 20 text columns as
follows:

create table foo5 (a text, b text, c text, d text, e text);
insert into foo5 select repeat('a', (random()*100)::int), 'bbb', 'cc',
'd', 'eee' from generate_series(1, 10000000);
copy foo5 to '/tmp/foo5.bin' binary;
truncate foo5;
copy foo5 from '/tmp/foo5.bin' binary;

create table foo10 (a text, b text, c text, d text, e text, f text, g
text, h text, i text, j text);
insert into foo10 select repeat('a', (random()*100)::int), 'bbb',
'cc', 'd', 'eee', 'f', 'gg', 'hh', 'i', 'jjj' from generate_series(1,
10000000);
copy foo10 to '/tmp/foo10.bin' binary;
truncate foo10;
copy foo10 from '/tmp/foo10.bin' binary;

create table foo20 (a text, b text, c text, d text, e text, f numeric,
g text, h text, i text, j text, k text, l text, m text, n text, o
text, p text, q text, r text, s text, t text);
insert into foo20 select repeat('a', (random()*100)::int), 'bbb',
'cc', 'd', 'eee', '123.456', 'gg', 'hh', 'ii', 'jjjj', 'kkk', 'llll',
'mm', 'n', 'ooooo', 'pppp', 'q', 'rrrrr', 'ss', 'tttttttttttt' from
generate_series(1, 10000000);
copy foo20 to '/tmp/foo20.bin' binary;
truncate foo20;
copy foo20 from '/tmp/foo20.bin' binary;

The median times for the COPY FROM commands above, with and without
the patch, are as follows:

            HEAD    patched
foo5        8.5     6.5
foo10       14      10
foo20       25      18

 A few more points to remember in the future:

* Commenting style:

+       /* If readbytes are lesser than the requested bytes, then initialize the
+       * remaining bytes in the raw_buf to 0. This will be useful for checking
+       * error "received copy data after EOF marker".
+       */

Multi-line comments are started like this:

    /*
     * <Start here>
    */

* As also mentioned above, it's a good idea in general to avoid having
special cases like these in the code:

+       if (cstate->cur_lineno == 1)
        {
-           /* EOF detected (end of file, or protocol-level EOF) */
-           return false;
+           /* This is for the first time, so read in buff size amount
+            * of data from file.
+            */

...

+
+           /* Move bytes can either be 0, 1, or 2. */
+           movebytes = RAW_BUF_SIZE - cstate->raw_buf_index;

...

+       uint8       movebytes = 0;
+
+       /* Move bytes can either be 0, 1, 2, 3 or 4. */
+       movebytes = RAW_BUF_SIZE - cstate->raw_buf_index;

* Please try to make variable names short if you can or follow the
guidelines around long names:

+       int32 remainingbytesincurrdatablock = RAW_BUF_SIZE -
cstate->raw_buf_index;

Maybe, remaining_bytes would've sufficed here, because "in the current
data block" might be clear to most readers by looking at the
surrounding code.

* The above point also helps avoid long code lines that don't fit
within 78 characters, like these:

+       memcpy(&cstate->attribute_buf.data[0],
&cstate->raw_buf[cstate->raw_buf_index],
remainingbytesincurrdatablock);
+
+       if (CopyGetData(cstate,
&cstate->attribute_buf.data[remainingbytesincurrdatablock],
+                   (fld_size - remainingbytesincurrdatablock),
(fld_size - remainingbytesincurrdatablock)) != (fld_size -
remainingbytesincurrdatablock))

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Вложения

CopyFrom-binary-buffering.patch

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Amit Langote

Дата:

07 июля 2020 г., 11:16:12

On Tue, Jul 7, 2020 at 4:28 PM Amit Langote <amitlangote09@gmail.com> wrote:
> The median times for the COPY FROM commands above, with and without
> the patch, are as follows:
>
>             HEAD    patched
> foo5        8.5     6.5
> foo10       14      10
> foo20       25      18

Sorry, I forgot to mention that these times are in seconds.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Bharath Rupireddy

Дата:

09 июля 2020 г., 13:33:42

>
> Considering these points, I came up with the attached patch with a
> much smaller footprint.  As far as I can see, it implements the same
> basic idea as your patch.  With it, I was able to see an improvement
> in loading time consistent with your numbers.  I measured the time of
> loading 10 million rows into tables with 5, 10, 20 text columns as
> follows:
>

Thanks Amit for buying into the idea. I agree that your patch looks
clean and simple compared to mine and I'm okay with your patch.

I reviewed and tested your patch, below are few comments:

I think we can remove(and delete the function from the code) the
CopyGetInt16() and have the code directly to save the function call
cost. It gets called for each attribute/column for each row/tuple to
just call CopyReadFromRawBuf() and set the byte order. From a
readability perspective it's okay to have this function, but cost wise
I feel no need for that function at all. In one of our earlier
work(parallel copy), we observed that having a new function or few
extra statements in this copy from path which gets hit for each row,
incurs noticeable execution cost.

The same way, we can also avoid using CopyGetInt32() function call in
CopyReadBinaryAttribute() for the same reason stated above.

In CopyReadFromRawBuf(), can the "saveTo" parameter be named "dest"
and use that with (char *) typecast directly, instead of having a
local variable? Though it may/may not be a standard practice, let's
have the parameter name all lower case to keep it consistent with
other function's parameters in the copy.c file.

Seems like there's a bug in below part of the code. Use case is
simple, have some junk value at the end of the binary file, then with
your patch the query succeeds, but it should report the below error.
Here, on fld_count == -1 instead of reading from file, we must be
reading it from the buffer, as we would have already read all the data
from the file into the buffer.
           if (cstate->copy_dest != COPY_OLD_FE &&
                CopyGetData(cstate, &dummy, 1, 1) > 0)
                ereport(ERROR,
                        (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
                         errmsg("received copy data after EOF marker")));

I also tried with some intentionally corrupted binary datasets, (apart
from above issue) patch works fine.

For the case where required nbytes may not fit into the buffer in
CopyReadFromRawBuf, I'm sure this can happen only for field data,
(field count , and field size are of fixed length and can fit in the
buffer), instead of reading them in parts of buff size into the buffer
(using CopyLoadRawBuf) and then DRAIN_COPY_RAW_BUF() to the
destination, I think we can detect this condition using requested
bytes and the buffer size and directly read from the file to the
destination buffer and then reload the raw_buffer for further
processing. I think this way, it will be good.

I have few synthesized test cases where fields can be of larger size.
I executed them on your patch, but didn't debug to see whether
actually we hit the code where required nbytes can't fit in the entire
buffer. I will try this on the next version of the patch.

>
>             HEAD    patched
> foo5        8.5     6.5
> foo10       14      10
> foo20       25      18
>

Numbers might improve a bit, if we remove the extra function calls as
stated above.

Overall, thanks for your suggestions in the previous mail, my patch
was prepared in a bit hurried manner, anyways, will take care next
time.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Amit Langote

Дата:

10 июля 2020 г., 06:21:12

Hi Bharath,

On Thu, Jul 9, 2020 at 7:33 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> Thanks Amit for buying into the idea. I agree that your patch looks
> clean and simple compared to mine and I'm okay with your patch.
>
> I reviewed and tested your patch, below are few comments:

Thanks for checking it out.

> I think we can remove(and delete the function from the code) the
> CopyGetInt16() and have the code directly to save the function call
> cost. It gets called for each attribute/column for each row/tuple to
> just call CopyReadFromRawBuf() and set the byte order. From a
> readability perspective it's okay to have this function, but cost wise
> I feel no need for that function at all. In one of our earlier
> work(parallel copy), we observed that having a new function or few
> extra statements in this copy from path which gets hit for each row,
> incurs noticeable execution cost.
>
> The same way, we can also avoid using CopyGetInt32() function call in
> CopyReadBinaryAttribute() for the same reason stated above.

I agree that removing the function call overhead in this case is worth
the slight loss of readability.

> In CopyReadFromRawBuf(), can the "saveTo" parameter be named "dest"
> and use that with (char *) typecast directly, instead of having a
> local variable? Though it may/may not be a standard practice, let's
> have the parameter name all lower case to keep it consistent with
> other function's parameters in the copy.c file.

Agreed.

> Seems like there's a bug in below part of the code. Use case is
> simple, have some junk value at the end of the binary file, then with
> your patch the query succeeds, but it should report the below error.
> Here, on fld_count == -1 instead of reading from file, we must be
> reading it from the buffer, as we would have already read all the data
> from the file into the buffer.
>            if (cstate->copy_dest != COPY_OLD_FE &&
>                 CopyGetData(cstate, &dummy, 1, 1) > 0)
>                 ereport(ERROR,
>                         (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
>                          errmsg("received copy data after EOF marker")));
>
> I also tried with some intentionally corrupted binary datasets, (apart
> from above issue) patch works fine.

Yeah, I see the bug.  I should've checked all the call sites of
CopyGetData() and made sure there is only one left, that is,
CopyLoadRawBuffer().

> For the case where required nbytes may not fit into the buffer in
> CopyReadFromRawBuf, I'm sure this can happen only for field data,
> (field count , and field size are of fixed length and can fit in the
> buffer), instead of reading them in parts of buff size into the buffer
> (using CopyLoadRawBuf) and then DRAIN_COPY_RAW_BUF() to the
> destination, I think we can detect this condition using requested
> bytes and the buffer size and directly read from the file to the
> destination buffer and then reload the raw_buffer for further
> processing. I think this way, it will be good.

Hmm, I'm afraid that this will make the code more complex for
apparently small benefit.  Is this really that much of a problem
performance wise?

> I have few synthesized test cases where fields can be of larger size.
> I executed them on your patch, but didn't debug to see whether
> actually we hit the code where required nbytes can't fit in the entire
> buffer. I will try this on the next version of the patch.
>
> >
> >             HEAD    patched
> > foo5        8.5     6.5
> > foo10       14      10
> > foo20       25      18
> >
>
> Numbers might improve a bit, if we remove the extra function calls as
> stated above.

Here the numbers with the updated patch:

            HEAD    patched (v2)
foo5         8.5        6.1
foo10         14        9.4
foo20         25       16.7

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Вложения

CopyFrom-binary-buffering_v2.patch

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

vignesh C

Дата:

12 июля 2020 г., 16:13:31

On Fri, Jul 10, 2020 at 8:51 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Hi Bharath,
> Here the numbers with the updated patch:
>
>             HEAD    patched (v2)
> foo5         8.5        6.1
> foo10         14        9.4
> foo20         25       16.7
>

Patch applies cleanly, make check & make check-world passes.
I had reviewed the changes. I felt one minor change required:
+ * CopyReadFromRawBuf
+ *             Reads 'nbytes' bytes from cstate->copy_file via
cstate->raw_buf and
+ *             writes then to 'saveTo'
+ *
+ * Useful when reading binary data from the file.
Should "writes then to 'saveTo'" be "writes them to 'dest'"?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Thomas Munro

Дата:

13 июля 2020 г., 01:36:02

On Mon, Jul 13, 2020 at 1:13 AM vignesh C <vignesh21@gmail.com> wrote:
> On Fri, Jul 10, 2020 at 8:51 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Here the numbers with the updated patch:
> >
> >             HEAD    patched (v2)
> > foo5         8.5        6.1
> > foo10         14        9.4
> > foo20         25       16.7
> >
>
> Patch applies cleanly, make check & make check-world passes.

This error showed up when cfbot tried it:

 COPY BINARY stud_emp FROM
'/home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/stud_emp.data';
+ERROR:  could not read from COPY file: Bad address

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Bharath Rupireddy

Дата:

13 июля 2020 г., 04:19:17

Thanks Thomas for checking this feature.

> On Mon, Jul 13, 2020 at 4:06 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> This error showed up when cfbot tried it:
>
>  COPY BINARY stud_emp FROM
> '/home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/stud_emp.data';
> +ERROR:  could not read from COPY file: Bad address

This is due to the recent commit
cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb, in which we restricted the
raw_buf and line_buf allocations for binary files. Since we are using
raw_buf for this performance improvement feature, now, it's enough to
restrict only line_buf for binary files. I made the changes
accordingly in the v3 patch attached here.

Regression tests(make check & make check-world) ran cleanly with the v3 patch.

Please also find my responses for:

Vignesh's comment:

> On Sun, Jul 12, 2020 at 6:43 PM vignesh C <vignesh21@gmail.com> wrote:
> I had reviewed the changes. I felt one minor change required:
> + * CopyReadFromRawBuf
> + *             Reads 'nbytes' bytes from cstate->copy_file via
> cstate->raw_buf and
> + *             writes then to 'saveTo'
> + *
> + * Useful when reading binary data from the file.
> Should "writes then to 'saveTo'" be "writes them to 'dest'"?
>

Thanks Vignesh for reviewing the patch. Modified 'saveTo' to 'dest' in v3 patch.

Amit's comment:

>
> > For the case where required nbytes may not fit into the buffer in
> > CopyReadFromRawBuf, I'm sure this can happen only for field data,
> > (field count , and field size are of fixed length and can fit in the
> > buffer), instead of reading them in parts of buff size into the buffer
> > (using CopyLoadRawBuf) and then DRAIN_COPY_RAW_BUF() to the
> > destination, I think we can detect this condition using requested
> > bytes and the buffer size and directly read from the file to the
> > destination buffer and then reload the raw_buffer for further
> > processing. I think this way, it will be good.
>
> Hmm, I'm afraid that this will make the code more complex for
> apparently small benefit.  Is this really that much of a problem
> performance wise?
>

Yes it makes CopyReadFromRawBuf(), code a bit complex from a
readability perspective. I'm convinced not to have the
abovementioned(by me) change, due to 3 reasons,1)  the
readability/understandability 2)  how many use cases can we have where
requested field size greater than RAW_BUF_SIZE(64KB)? I think very few
cases. I may be wrong here. 3) Performance wise it may not be much as
we do one extra memcpy only in situations where field sizes are
greater than 64KB(as we have already seen and observed by you as well
in one of the response [1]) that memcpy cost for this case may be
negligible.

Considering all of above, I'm okay to have CopyReadFromRawBuf()
function, the way it is currently.

[1]
> >
> > One solution to this problem is to read data from binary file in RAW_BUF_SIZE(64KB) chunks to avoid repeatedly
callingfread()(thus possibly avoiding few disk IOs). This is similar to the approach followed for csv/text files.

>
> I agree that having the buffer in front of the file makes sense,
> although we do now have an extra memcpy, that is, from raw_buf to
> attribute_buf.data.  Currently, fread() reads directly into
> attribute_buf.data.  But maybe that's okay as I don't see the new copy
> being all that bad.
>

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Вложения

CopyFrom-binary-buffering_v3.patch

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Amit Langote

Дата:

13 июля 2020 г., 05:32:30

On Mon, Jul 13, 2020 at 10:19 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> > On Mon, Jul 13, 2020 at 4:06 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > This error showed up when cfbot tried it:
> >
> >  COPY BINARY stud_emp FROM
> > '/home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/stud_emp.data';
> > +ERROR:  could not read from COPY file: Bad address
>
> This is due to the recent commit
> cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb, in which we restricted the
> raw_buf and line_buf allocations for binary files. Since we are using
> raw_buf for this performance improvement feature, now, it's enough to
> restrict only line_buf for binary files. I made the changes
> accordingly in the v3 patch attached here.
>
> Regression tests(make check & make check-world) ran cleanly with the v3 patch.

Thank you Bharath.  I was a bit surprised that you had also submitted
a patch to NOT allocate raw_buf for COPY FROM ... BINARY. :-)

> Please also find my responses for:
>
> Vignesh's comment:
>
> > On Sun, Jul 12, 2020 at 6:43 PM vignesh C <vignesh21@gmail.com> wrote:
> > I had reviewed the changes. I felt one minor change required:
> > + * CopyReadFromRawBuf
> > + *             Reads 'nbytes' bytes from cstate->copy_file via
> > cstate->raw_buf and
> > + *             writes then to 'saveTo'
> > + *
> > + * Useful when reading binary data from the file.
> > Should "writes then to 'saveTo'" be "writes them to 'dest'"?
> >
>
> Thanks Vignesh for reviewing the patch. Modified 'saveTo' to 'dest' in v3 patch.

My bad.

> Amit's comment:
>
> >
> > > For the case where required nbytes may not fit into the buffer in
> > > CopyReadFromRawBuf, I'm sure this can happen only for field data,
> > > (field count , and field size are of fixed length and can fit in the
> > > buffer), instead of reading them in parts of buff size into the buffer
> > > (using CopyLoadRawBuf) and then DRAIN_COPY_RAW_BUF() to the
> > > destination, I think we can detect this condition using requested
> > > bytes and the buffer size and directly read from the file to the
> > > destination buffer and then reload the raw_buffer for further
> > > processing. I think this way, it will be good.
> >
> > Hmm, I'm afraid that this will make the code more complex for
> > apparently small benefit.  Is this really that much of a problem
> > performance wise?
> >
>
> Yes it makes CopyReadFromRawBuf(), code a bit complex from a
> readability perspective. I'm convinced not to have the
> abovementioned(by me) change, due to 3 reasons,1)  the
> readability/understandability 2)  how many use cases can we have where
> requested field size greater than RAW_BUF_SIZE(64KB)? I think very few
> cases. I may be wrong here. 3) Performance wise it may not be much as
> we do one extra memcpy only in situations where field sizes are
> greater than 64KB(as we have already seen and observed by you as well
> in one of the response [1]) that memcpy cost for this case may be
> negligible.

Actually, an extra memcpy is incurred on every call of
CopyReadFromRawBuf(), but I haven't seen it to be very problematic.

By the way, considering the rebase over cd22d3cdb9b, it seemed to me
that we needed to update the comments in CopyStateData struct
definition a bit more.  While doing that, I realized
CopyReadFromRawBuf as a name for the new function might be misleading
as long as we are only using it for binary data.  Maybe
CopyReadBinaryData is more appropriate?  See attached v4 with these
and a few other cosmetic changes.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Вложения

CopyFrom-binary-buffering_v4.patch

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Bharath Rupireddy

Дата:

13 июля 2020 г., 06:16:57

>
> > This is due to the recent commit
> > cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb, in which we restricted the
> > raw_buf and line_buf allocations for binary files. Since we are using
> > raw_buf for this performance improvement feature, now, it's enough to
> > restrict only line_buf for binary files. I made the changes
> > accordingly in the v3 patch attached here.
> >
> > Regression tests(make check & make check-world) ran cleanly with the v3 patch.
>
> Thank you Bharath.  I was a bit surprised that you had also submitted
> a patch to NOT allocate raw_buf for COPY FROM ... BINARY. :-)
>

Yes that was by me, before I started to work on this feature. I think
we can backpatch that change(assuming we don't backpatch this
feature), I will make the request accordingly.

Anyways, now we don't allow line_buf allocation for binary files,
which is also a good thing.

>
> > > > For the case where required nbytes may not fit into the buffer in
> > > > CopyReadFromRawBuf, I'm sure this can happen only for field data,
> > > > (field count , and field size are of fixed length and can fit in the
> > > > buffer), instead of reading them in parts of buff size into the buffer
> > > > (using CopyLoadRawBuf) and then DRAIN_COPY_RAW_BUF() to the
> > > > destination, I think we can detect this condition using requested
> > > > bytes and the buffer size and directly read from the file to the
> > > > destination buffer and then reload the raw_buffer for further
> > > > processing. I think this way, it will be good.
> > >
> > > Hmm, I'm afraid that this will make the code more complex for
> > > apparently small benefit.  Is this really that much of a problem
> > > performance wise?
> > >
> >
> > Yes it makes CopyReadFromRawBuf(), code a bit complex from a
> > readability perspective. I'm convinced not to have the
> > abovementioned(by me) change, due to 3 reasons,1)  the
> > readability/understandability 2)  how many use cases can we have where
> > requested field size greater than RAW_BUF_SIZE(64KB)? I think very few
> > cases. I may be wrong here. 3) Performance wise it may not be much as
> > we do one extra memcpy only in situations where field sizes are
> > greater than 64KB(as we have already seen and observed by you as well
> > in one of the response [1]) that memcpy cost for this case may be
> > negligible.
>
> Actually, an extra memcpy is incurred on every call of
> CopyReadFromRawBuf(), but I haven't seen it to be very problematic.
>

Yes.

>
> CopyReadFromRawBuf as a name for the new function might be misleading
> as long as we are only using it for binary data.  Maybe
> CopyReadBinaryData is more appropriate?  See attached v4 with these
> and a few other cosmetic changes.
>

CopyReadBinaryData() looks meaningful. +1.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Amit Langote

Дата:

13 июля 2020 г., 10:34:46

On Mon, Jul 13, 2020 at 12:17 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> > CopyReadFromRawBuf as a name for the new function might be misleading
> > as long as we are only using it for binary data.  Maybe
> > CopyReadBinaryData is more appropriate?  See attached v4 with these
> > and a few other cosmetic changes.
> >
>
> CopyReadBinaryData() looks meaningful. +1.

Okay, thanks.   Let's have a committer take a look at this then?

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Bharath Rupireddy

Дата:

13 июля 2020 г., 14:34:45

>
> > > CopyReadFromRawBuf as a name for the new function might be misleading
> > > as long as we are only using it for binary data.  Maybe
> > > CopyReadBinaryData is more appropriate?  See attached v4 with these
> > > and a few other cosmetic changes.
> > >
> >
> > CopyReadBinaryData() looks meaningful. +1.
>
> Okay, thanks.   Let's have a committer take a look at this then?
>

I think yes, unless someone has any more points/review comments.
Accordingly the status in the commitfest can be changed.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

vignesh C

Дата:

13 июля 2020 г., 16:20:24

On Mon, Jul 13, 2020 at 8:02 AM Amit Langote <amitlangote09@gmail.com> wrote:
> By the way, considering the rebase over cd22d3cdb9b, it seemed to me
> that we needed to update the comments in CopyStateData struct
> definition a bit more.  While doing that, I realized
> CopyReadFromRawBuf as a name for the new function might be misleading
> as long as we are only using it for binary data.  Maybe
> CopyReadBinaryData is more appropriate?  See attached v4 with these
> and a few other cosmetic changes.
>

I had one small comment:
+{
+       int             copied_bytes = 0;
+
+#define BUF_BYTES      (cstate->raw_buf_len - cstate->raw_buf_index)
+#define DRAIN_COPY_RAW_BUF(cstate, dest, nbytes)\
+       do {\
+               memcpy((dest), (cstate)->raw_buf +
(cstate)->raw_buf_index, (nbytes));\
+               (cstate)->raw_buf_index += (nbytes);\
+       } while(0)

BUF_BYTES could be used in CopyLoadRawBuf function also.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Bharath Rupireddy

Дата:

13 июля 2020 г., 17:58:03

>
> I had one small comment:
> +{
> +       int             copied_bytes = 0;
> +
> +#define BUF_BYTES      (cstate->raw_buf_len - cstate->raw_buf_index)
> +#define DRAIN_COPY_RAW_BUF(cstate, dest, nbytes)\
> +       do {\
> +               memcpy((dest), (cstate)->raw_buf +
> (cstate)->raw_buf_index, (nbytes));\
> +               (cstate)->raw_buf_index += (nbytes);\
> +       } while(0)
>
> BUF_BYTES could be used in CopyLoadRawBuf function also.
>

Thanks Vignesh for the find out. I changed and attached the v5 patch.
The regression tests(make check and make check-world) ran
successfully.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Вложения

CopyFrom-binary-buffering_v5.patch

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Amit Langote

Дата:

14 июля 2020 г., 04:56:15

On Mon, Jul 13, 2020 at 11:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> > I had one small comment:
> > +{
> > +       int             copied_bytes = 0;
> > +
> > +#define BUF_BYTES      (cstate->raw_buf_len - cstate->raw_buf_index)
> > +#define DRAIN_COPY_RAW_BUF(cstate, dest, nbytes)\
> > +       do {\
> > +               memcpy((dest), (cstate)->raw_buf +
> > (cstate)->raw_buf_index, (nbytes));\
> > +               (cstate)->raw_buf_index += (nbytes);\
> > +       } while(0)
> >
> > BUF_BYTES could be used in CopyLoadRawBuf function also.
> >
>
> Thanks Vignesh for the find out. I changed and attached the v5 patch.
> The regression tests(make check and make check-world) ran
> successfully.

Good idea, thanks.

In CopyLoadRawBuf(), we could also change the condition if
(cstate->raw_buf_index < cstate->raw_buf_len) to if (BUF_BYTES > 0),
which looks clearer.

Also, if we are going to use the macro more generally, let's make it
look less localized.  For example, rename it to RAW_BUF_BYTES similar
to RAW_BUF_SIZE and place their definitions close by.  It also seems
like a good idea to make 'cstate' a parameter for clarity.

Attached v6.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Вложения

CopyFrom-binary-buffering_v6.patch

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

vignesh C

Дата:

14 июля 2020 г., 08:01:58

On Tue, Jul 14, 2020 at 7:26 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Good idea, thanks.
>
> In CopyLoadRawBuf(), we could also change the condition if
> (cstate->raw_buf_index < cstate->raw_buf_len) to if (BUF_BYTES > 0),
> which looks clearer.
>
> Also, if we are going to use the macro more generally, let's make it
> look less localized.  For example, rename it to RAW_BUF_BYTES similar
> to RAW_BUF_SIZE and place their definitions close by.  It also seems
> like a good idea to make 'cstate' a parameter for clarity.
>
> Attached v6.
>

Thanks for making the changes.

-       if (cstate->raw_buf_index < cstate->raw_buf_len)
+       if (RAW_BUF_BYTES(cstate) > 0)
        {
                /* Copy down the unprocessed data */
-               nbytes = cstate->raw_buf_len - cstate->raw_buf_index;
+               nbytes = RAW_BUF_BYTES(cstate);
                memmove(cstate->raw_buf, cstate->raw_buf +
cstate->raw_buf_index,
                                nbytes);
        }

One small improvement could be to change it like below to reduce few
more instructions:
static bool
CopyLoadRawBuf(CopyState cstate)
{
int nbytes = RAW_BUF_BYTES(cstate);
int inbytes;

/* Copy down the unprocessed data */
if (nbytes > 0)
memmove(cstate->raw_buf, cstate->raw_buf + cstate->raw_buf_index,
nbytes);

inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
  1, RAW_BUF_SIZE - nbytes);
nbytes += inbytes;
cstate->raw_buf[nbytes] = '\0';
cstate->raw_buf_index = 0;
cstate->raw_buf_len = nbytes;
return (inbytes > 0);
}

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Amit Langote

Дата:

14 июля 2020 г., 08:49:30

On Tue, Jul 14, 2020 at 2:02 PM vignesh C <vignesh21@gmail.com> wrote:
> On Tue, Jul 14, 2020 at 7:26 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > In CopyLoadRawBuf(), we could also change the condition if
> > (cstate->raw_buf_index < cstate->raw_buf_len) to if (BUF_BYTES > 0),
> > which looks clearer.
> >
> > Also, if we are going to use the macro more generally, let's make it
> > look less localized.  For example, rename it to RAW_BUF_BYTES similar
> > to RAW_BUF_SIZE and place their definitions close by.  It also seems
> > like a good idea to make 'cstate' a parameter for clarity.
> >
> > Attached v6.
> >
>
> Thanks for making the changes.
>
> -       if (cstate->raw_buf_index < cstate->raw_buf_len)
> +       if (RAW_BUF_BYTES(cstate) > 0)
>         {
>                 /* Copy down the unprocessed data */
> -               nbytes = cstate->raw_buf_len - cstate->raw_buf_index;
> +               nbytes = RAW_BUF_BYTES(cstate);
>                 memmove(cstate->raw_buf, cstate->raw_buf +
> cstate->raw_buf_index,
>                                 nbytes);
>         }
>
> One small improvement could be to change it like below to reduce few
> more instructions:
> static bool
> CopyLoadRawBuf(CopyState cstate)
> {
> int nbytes = RAW_BUF_BYTES(cstate);
> int inbytes;
>
> /* Copy down the unprocessed data */
> if (nbytes > 0)
> memmove(cstate->raw_buf, cstate->raw_buf + cstate->raw_buf_index,
> nbytes);
>
> inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
>   1, RAW_BUF_SIZE - nbytes);
> nbytes += inbytes;
> cstate->raw_buf[nbytes] = '\0';
> cstate->raw_buf_index = 0;
> cstate->raw_buf_len = nbytes;
> return (inbytes > 0);
> }

Sounds fine to me.  Although CopyLoadRawBuf() does not seem to a
candidate for rigorous code optimization as it does not get called
that often.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

vignesh C

Дата:

14 июля 2020 г., 16:22:58

On Tue, Jul 14, 2020 at 11:19 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
>
> Sounds fine to me.  Although CopyLoadRawBuf() does not seem to a
> candidate for rigorous code optimization as it does not get called
> that often.
>

I thought we could include that change as we are making changes around
that code. Rest of the changes looked fine to me. Also I noticed that
commit message was missing in the patch.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Amit Langote

Дата:

15 июля 2020 г., 05:32:58

Hi Vignesh,

On Tue, Jul 14, 2020 at 10:23 PM vignesh C <vignesh21@gmail.com> wrote:
> On Tue, Jul 14, 2020 at 11:19 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Sounds fine to me.  Although CopyLoadRawBuf() does not seem to a
> > candidate for rigorous code optimization as it does not get called
> > that often.
>
> I thought we could include that change as we are making changes around
> that code.

Sure, done.

> Rest of the changes looked fine to me. Also I noticed that
> commit message was missing in the patch.

Please see the attached v7.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Вложения

v7-0001-Improve-performance-of-binary-COPY-FROM-with-buff.patch

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

vignesh C

Дата:

15 июля 2020 г., 07:06:23

On Wed, Jul 15, 2020 at 8:03 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Hi Vignesh,
>
> On Tue, Jul 14, 2020 at 10:23 PM vignesh C <vignesh21@gmail.com> wrote:
> > On Tue, Jul 14, 2020 at 11:19 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > > Sounds fine to me.  Although CopyLoadRawBuf() does not seem to a
> > > candidate for rigorous code optimization as it does not get called
> > > that often.
> >
> > I thought we could include that change as we are making changes around
> > that code.
>
> Sure, done.
>
> > Rest of the changes looked fine to me. Also I noticed that
> > commit message was missing in the patch.
>
> Please see the attached v7.
>

Thanks for fixing the comments.
Patch applies cleanly, make check & make check-world passes.
The changes looks fine to me.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Amit Langote

Дата:

16 июля 2020 г., 17:14:40

On Wed, Jul 15, 2020 at 1:06 PM vignesh C <vignesh21@gmail.com> wrote:
> On Wed, Jul 15, 2020 at 8:03 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Tue, Jul 14, 2020 at 10:23 PM vignesh C <vignesh21@gmail.com> wrote:
> > > Rest of the changes looked fine to me. Also I noticed that
> > > commit message was missing in the patch.
>>
> > Please see the attached v7.
>
> Thanks for fixing the comments.
> Patch applies cleanly, make check & make check-world passes.
> The changes looks fine to me.

Thanks for checking.  Sorry, I hadn't credited Bharath as an author in
the commit message, so here's v7 again.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Вложения

v7-0001-Improve-performance-of-binary-COPY-FROM-with-buff.patch

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Bharath Rupireddy

Дата:

16 июля 2020 г., 18:22:22

On Thu, Jul 16, 2020 at 7:44 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 1:06 PM vignesh C <vignesh21@gmail.com> wrote:
> > On Wed, Jul 15, 2020 at 8:03 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > > On Tue, Jul 14, 2020 at 10:23 PM vignesh C <vignesh21@gmail.com> wrote:
> > > > Rest of the changes looked fine to me. Also I noticed that
> > > > commit message was missing in the patch.
> >>
> > > Please see the attached v7.
> >
> > Thanks for fixing the comments.
> > Patch applies cleanly, make check & make check-world passes.
> > The changes looks fine to me.
>
> Thanks for checking.  Sorry, I hadn't credited Bharath as an author in
> the commit message, so here's v7 again.
>

Patch looks good. It applies on latest commit
932f9fb504a57f296cf698d15bd93462ddfe2776 and make check, make
check-world were run successfully.

I will change the status to "ready for committer" in commitfest
tomorrow. Hope that's fine.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

vignesh C

Дата:

16 июля 2020 г., 20:13:20

On Thu, Jul 16, 2020 at 8:52 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, Jul 16, 2020 at 7:44 PM Amit Langote <amitlangote09@gmail.com> wrote:
> >
> > On Wed, Jul 15, 2020 at 1:06 PM vignesh C <vignesh21@gmail.com> wrote:
> > > On Wed, Jul 15, 2020 at 8:03 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > > > On Tue, Jul 14, 2020 at 10:23 PM vignesh C <vignesh21@gmail.com> wrote:
> > > > > Rest of the changes looked fine to me. Also I noticed that
> > > > > commit message was missing in the patch.
> > >>
> > > > Please see the attached v7.
> > >
> > > Thanks for fixing the comments.
> > > Patch applies cleanly, make check & make check-world passes.
> > > The changes looks fine to me.
> >
> > Thanks for checking.  Sorry, I hadn't credited Bharath as an author in
> > the commit message, so here's v7 again.
> >
>
> Patch looks good. It applies on latest commit
> 932f9fb504a57f296cf698d15bd93462ddfe2776 and make check, make
> check-world were run successfully.
>
> I will change the status to "ready for committer" in commitfest
> tomorrow. Hope that's fine.

I agree, a committer can have a look at this.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Bharath Rupireddy

Дата:

17 июля 2020 г., 14:56:36

> >
> > I will change the status to "ready for committer" in commitfest
> > tomorrow. Hope that's fine.
>
> I agree, a committer can have a look at this.
>

I changed the status in the commit fest to "Ready for Committer".

https://commitfest.postgresql.org/28/2632/

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Tom Lane

Дата:

26 июля 2020 г., 00:06:03

Amit Langote <amitlangote09@gmail.com> writes:
> [ v7-0001-Improve-performance-of-binary-COPY-FROM-with-buff.patch ]

Pushed with cosmetic changes.

I'd always supposed that stdio does enough internal buffering that short
fread()s shouldn't be much worse than memcpy().  But I reproduced your
result of ~30% speedup for data with a lot of narrow text columns, using
RHEL 8.2.  Thinking this an indictment of glibc, I also tried it on
current macOS, and saw an even bigger speedup, approaching 50%.  So
there's definitely something to this.  I wonder if we ought to check
other I/O-constrained users of fread and fwrite, like pg_dump/pg_restore.

A point that I did not see addressed in the thread is whether this
has any negative impact on the copy-from-frontend code path, where
there's no fread() to avoid; short reads from CopyGetData() are
already going to be satisfied by memcpy'ing from the fe_msgbuf.
However, a quick check suggests that this patch is still a small
win for that case too --- apparently the control overhead in
CopyGetData() is not negligible.

So the patch seems fine functionally, but there were some cosmetic
things I didn't like:

* Removing CopyGetInt32 and CopyGetInt16 seemed like a pretty bad
idea, because it made the callers much uglier and more error-prone.
This is a particularly bad example:

         /* Header extension length */
-        if (!CopyGetInt32(cstate, &tmp) ||
-            tmp < 0)
+        if (CopyReadBinaryData(cstate, (char *) &tmp, sizeof(tmp)) !=
+            sizeof(tmp) || (tmp = (int32) pg_ntoh32(tmp)) < 0)

Putting side-effects into late stages of an if-condition is just
awful coding practice.  They're easy for a reader to miss and they
are magnets for bugs, because of the possibility that control doesn't
reach that part of the condition.

You can get the exact same speedup without any of those disadvantages
by marking these two functions "inline", so that's what I did.

* I dropped the DRAIN_COPY_RAW_BUF macro too, as in my estimation it was
a net negative for readability.  With only two use-cases, having it made
the code longer not shorter; I was also pretty unconvinced about the
wisdom of having some of the loop's control logic inside the macro and
some outside.

* BTW, the macro definitions weren't particularly per project style
anyway.  We generally put at least one space before line-ending
backslashes.  I don't think pgindent will fix this for you; IME
it doesn't touch macro definitions at all.

* Did some more work on the comments.

            regards, tom lane

Re: [PATCH] Performance Improvement For Copy From Binary Files

От

Amit Langote

Дата:

27 июля 2020 г., 05:29:54

On Sun, Jul 26, 2020 at 6:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > [ v7-0001-Improve-performance-of-binary-COPY-FROM-with-buff.patch ]
>
> Pushed with cosmetic changes.

Thanks for that.

> I'd always supposed that stdio does enough internal buffering that short
> fread()s shouldn't be much worse than memcpy().  But I reproduced your
> result of ~30% speedup for data with a lot of narrow text columns, using
> RHEL 8.2.  Thinking this an indictment of glibc, I also tried it on
> current macOS, and saw an even bigger speedup, approaching 50%.  So
> there's definitely something to this.  I wonder if we ought to check
> other I/O-constrained users of fread and fwrite, like pg_dump/pg_restore.

Ah, maybe a good idea to check that.

> A point that I did not see addressed in the thread is whether this
> has any negative impact on the copy-from-frontend code path, where
> there's no fread() to avoid; short reads from CopyGetData() are
> already going to be satisfied by memcpy'ing from the fe_msgbuf.
> However, a quick check suggests that this patch is still a small
> win for that case too --- apparently the control overhead in
> CopyGetData() is not negligible.

Indeed.

> So the patch seems fine functionally, but there were some cosmetic
> things I didn't like:
>
> * Removing CopyGetInt32 and CopyGetInt16 seemed like a pretty bad
> idea, because it made the callers much uglier and more error-prone.
> This is a particularly bad example:
>
>                 /* Header extension length */
> -               if (!CopyGetInt32(cstate, &tmp) ||
> -                       tmp < 0)
> +               if (CopyReadBinaryData(cstate, (char *) &tmp, sizeof(tmp)) !=
> +                       sizeof(tmp) || (tmp = (int32) pg_ntoh32(tmp)) < 0)
>
> Putting side-effects into late stages of an if-condition is just
> awful coding practice.  They're easy for a reader to miss and they
> are magnets for bugs, because of the possibility that control doesn't
> reach that part of the condition.
>
> You can get the exact same speedup without any of those disadvantages
> by marking these two functions "inline", so that's what I did.
>
> * I dropped the DRAIN_COPY_RAW_BUF macro too, as in my estimation it was
> a net negative for readability.  With only two use-cases, having it made
> the code longer not shorter; I was also pretty unconvinced about the
> wisdom of having some of the loop's control logic inside the macro and
> some outside.
>
> * BTW, the macro definitions weren't particularly per project style
> anyway.  We generally put at least one space before line-ending
> backslashes.  I don't think pgindent will fix this for you; IME
> it doesn't touch macro definitions at all.
>
> * Did some more work on the comments.

Thanks for these changes.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: [PATCH] Performance Improvement For Copy From Binary Files

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения