Thread: Speed up COPY FROM text/CSV parsing using SIMD


Speed up COPY FROM text/CSV parsing using SIMD

From
Shinya Kato
Date:
Hi hackers,

I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
text}) command and observed approximately a 5% performance
improvement. Please see the detailed test results below.

Idea
====
The current text/CSV parser processes input byte-by-byte, checking
whether each byte is a special character (\n, \r, quote, escape) or a
regular character, and transitions states in a state machine. This
sequential processing is inefficient and likely causes frequent branch
mispredictions due to the many if statements.

I thought this problem could be addressed by leveraging SIMD and
vectorized operations for faster processing.

Implementation Overview
=======================
1. Create a vector of special characters (e.g., Vector8 nl =
vector8_broadcast('\n');).
2. Load the input buffer into a Vector8 variable called chunk.
3. Perform vectorized operations between chunk and the special
character vectors to check if the buffer contains any special
characters.
4-1. If no special characters are found, advance the input_buf_ptr by
sizeof(Vector8).
4-2. If special characters are found, advance the input_buf_ptr as far
as possible, then fall back to the original text/CSV parser for
byte-by-byte processing.
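
To make the above concrete, here is a rough sketch of the scanning loop
in C, assuming the Vector8 helpers in src/include/port/simd.h
(vector8_broadcast, vector8_eq, vector8_or, vector8_highbit_mask) and
pg_rightmost_one_pos32 from pg_bitutils.h. Variable names mirror
CopyReadLineText; treat this as an illustration of the idea, not the
patch itself:

#ifndef USE_NO_SIMD
	const Vector8 nl = vector8_broadcast('\n');
	const Vector8 cr = vector8_broadcast('\r');
	const Vector8 quote = vector8_broadcast('"');	/* real set depends on format */

	while (copy_buf_len - input_buf_ptr >= sizeof(Vector8))
	{
		Vector8		chunk;
		Vector8		special;
		uint32		mask;

		/* step 2: load the next chunk of the input buffer */
		vector8_load(&chunk, (const uint8 *) &cstate->input_buf[input_buf_ptr]);

		/* step 3: compare against each special character in parallel */
		special = vector8_or(vector8_eq(chunk, nl),
							 vector8_or(vector8_eq(chunk, cr),
										vector8_eq(chunk, quote)));
		mask = vector8_highbit_mask(special);

		if (mask == 0)
			input_buf_ptr += sizeof(Vector8);	/* step 4-1: skip whole chunk */
		else
		{
			/* step 4-2: advance to the first special byte, then fall back */
			input_buf_ptr += pg_rightmost_one_pos32(mask);
			break;
		}
	}
#endif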

Test
====
I tested the performance by measuring the time it takes to load a CSV
file created using the attached SQL script with the following COPY
command:
=# COPY t FROM '/tmp/t.csv' (FORMAT csv);

Environment
-----------
OS: Rocky Linux 9.6
CPU: Intel Core i7-10710U (6 Cores / 12 Threads, 1.1 GHz Base / 4.7
GHz Boost, AVX2 & FMA supported)

Time
----
master: 02:44.943
patch applied: 02:36.878 (about 5% faster)

Perf
----
The call graphs are attached, and the rates for CopyReadLineText are:
master: 12.15%
patch applied: 8.04%

Thoughts?
=========
I would appreciate feedback on the implementation and any suggestions
for further improvement.

-- 
Best regards,
Shinya Kato
NTT OSS Center

Attachments

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nazir Bilal Yavuz
Date:
Hi,

Thank you for working on this!

On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
>
> Hi hackers,
>
> I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
> text}) command and observed approximately a 5% performance
> improvement. Please see the detailed test results below.

I have been working on the same idea. I was not moving input_buf_ptr
as far as possible, so I think your approach is better.

Also, I did a benchmark on text format. I created a benchmark for line
length in a table being from 1 byte to 1 megabyte. The peak improvement
is at a line length of 4096, and the improvement is more than 20% [1];
I saw no regression with your patch.

> Idea
> ====
> The current text/CSV parser processes input byte-by-byte, checking
> whether each byte is a special character (\n, \r, quote, escape) or a
> regular character, and transitions states in a state machine. This
> sequential processing is inefficient and likely causes frequent branch
> mispredictions due to the many if statements.
>
> I thought this problem could be addressed by leveraging SIMD and
> vectorized operations for faster processing.
>
> Implementation Overview
> =======================
> 1. Create a vector of special characters (e.g., Vector8 nl =
> vector8_broadcast('\n');).
> 2. Load the input buffer into a Vector8 variable called chunk.
> 3. Perform vectorized operations between chunk and the special
> character vectors to check if the buffer contains any special
> characters.
> 4-1. If no special characters are found, advance the input_buf_ptr by
> sizeof(Vector8).
> 4-2. If special characters are found, advance the input_buf_ptr as far
> as possible, then fall back to the original text/CSV parser for
> byte-by-byte processing.
>
...
> Thought?
> I would appreciate feedback on the implementation and any suggestions
> for further improvement.


I have a couple of ideas that I was working on:
---

+         * However, SIMD optimization cannot be applied in the following cases:
+         * - Inside quoted fields, where escape sequences and closing quotes
+         *   require sequential processing to handle correctly.

I think you can continue SIMD inside quoted fields. Only important
thing is you need to set last_was_esc to false when SIMD skipped the
chunk.
---

+         * - When the remaining buffer size is smaller than the size of a SIMD
+         *   vector register, as SIMD operations require processing data in
+         *   fixed-size chunks.

You run SIMD when 'copy_buf_len - input_buf_ptr >= sizeof(Vector8)',
but you only call CopyLoadInputBuf() when 'input_buf_ptr >=
copy_buf_len || need_data', so basically you need to wait for at least
sizeof(Vector8) characters to pass before the next SIMD. And in the
worst case, if CopyLoadInputBuf() puts one character less than
sizeof(Vector8), then you can't ever run SIMD. I think we need to make
sure that CopyLoadInputBuf() loads at least sizeof(Vector8) characters
into the input_buf so we do not encounter that problem.
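
Just to illustrate the point, a hand-wavy sketch (not the real control
flow in copyfromparse.c; raw_reached_eof stands in for whatever
end-of-input condition applies, and the refill is assumed to append
more bytes):

#ifndef USE_NO_SIMD
	/*
	 * Sketch: keep topping the buffer up until at least sizeof(Vector8)
	 * bytes are available (or EOF), so the SIMD fast path stays reachable.
	 */
	while (copy_buf_len - input_buf_ptr < sizeof(Vector8) &&
		   !cstate->raw_reached_eof)
		CopyLoadInputBuf(cstate);
#endif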
---

What do you think about adding SIMD to CopyReadAttributesText() and
CopyReadAttributesCSV() functions? When I add your SIMD approach to
CopyReadAttributesText() function, the improvement on the 4096 byte
line length input [1] goes from 20% to 30%.
---

I shared my ideas as a Feedback.txt file (.txt to stay off CFBot's
radar for this thread). I hope these help, please let me know if you
have any questions.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nazir Bilal Yavuz
Date:
Hi,

On Thu, 7 Aug 2025 at 14:15, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
> >
> > I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
> > text}) command and observed approximately a 5% performance
> > improvement. Please see the detailed test results below.
>
> Also, I did a benchmark on text format. I created a benchmark for line
> length in a table being from 1 byte to 1 megabyte. The peak improvement
> is at a line length of 4096, and the improvement is more than 20% [1];
> I saw no regression with your patch.

I did the same benchmark for the CSV format. The peak improvement is
at a line length of 4096, and the improvement is more than 25% [1]. I
saw a 5% regression on the 1-byte benchmark; there are no other
regressions.

> What do you think about adding SIMD to CopyReadAttributesText() and
> CopyReadAttributesCSV() functions? When I add your SIMD approach to
> CopyReadAttributesText() function, the improvement on the 4096 byte
> line length input [1] goes from 20% to 30%.

I wanted to try using SIMD in CopyReadAttributesCSV() as well. The
improvement on the 4096 byte line length input [1] goes from 25% to
35%, the regression on the 1 byte input is the same.

CopyReadAttributesCSV() changes are attached as feedback v2.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Shinya Kato
Date:
On Thu, Aug 7, 2025 at 8:15 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> Thank you for working on this!
>
> On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
> >
> > Hi hackers,
> >
> > I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
> > text}) command and observed approximately a 5% performance
> > improvement. Please see the detailed test results below.
>
> I have been working on the same idea. I was not moving input_buf_ptr
> as far as possible, so I think your approach is better.

Great. I'm looking forward to working with you on this feature implementation.

> Also, I did a benchmark on text format. I created a benchmark for line
> length in a table being from 1 byte to 1 megabyte. The peak improvement
> is at a line length of 4096, and the improvement is more than 20% [1];
> I saw no regression with your patch.

Thank you for the additional benchmarks.

> I have a couple of ideas that I was working on:
> ---
>
> +         * However, SIMD optimization cannot be applied in the following cases:
> +         * - Inside quoted fields, where escape sequences and closing quotes
> +         *   require sequential processing to handle correctly.
>
> I think you can continue SIMD inside quoted fields. Only important
> thing is you need to set last_was_esc to false when SIMD skipped the
> chunk.

That's a clever point that last_was_esc should be reset to false when
a SIMD chunk is skipped. You're right about that specific case.

However, the core challenge is not what happens when we skip a chunk,
but what happens when a chunk contains special characters like quotes
or escapes. The main reason we avoid SIMD inside quoted fields is that
the parsing logic becomes fundamentally sequential and
context-dependent.

To correctly parse a "" as a single literal quote, we must perform a
lookahead to check the next character. This is an inherently
sequential operation that doesn't map well to SIMD's parallel nature.

Trying to handle this stateful logic with SIMD would lead to
significant implementation complexity, especially with edge cases like
an escape character falling on the last byte of a chunk.
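
For concreteness, the scalar rule we must reproduce looks roughly like
this (an illustrative fragment with made-up variable names, not the
actual copyfromparse.c code):

	/* Inside a quoted CSV field, a quote is ambiguous until we peek ahead */
	if (c == quote_char && in_quote)
	{
		if (lookahead == quote_char)
		{
			/* "" -> one literal quote character; consume both bytes */
			p += 2;
		}
		else
		{
			/* a lone quote closes the field */
			in_quote = false;
			p += 1;
		}
	}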

> +         * - When the remaining buffer size is smaller than the size of a SIMD
> +         *   vector register, as SIMD operations require processing data in
> +         *   fixed-size chunks.
>
> You run SIMD when 'copy_buf_len - input_buf_ptr >= sizeof(Vector8)',
> but you only call CopyLoadInputBuf() when 'input_buf_ptr >=
> copy_buf_len || need_data', so basically you need to wait for at least
> sizeof(Vector8) characters to pass before the next SIMD. And in the
> worst case, if CopyLoadInputBuf() puts one character less than
> sizeof(Vector8), then you can't ever run SIMD. I think we need to make
> sure that CopyLoadInputBuf() loads at least sizeof(Vector8) characters
> into the input_buf so we do not encounter that problem.

I think you're probably right, but we only need to account for
sizeof(Vector8) when USE_NO_SIMD is not defined.

> What do you think about adding SIMD to CopyReadAttributesText() and
> CopyReadAttributesCSV() functions? When I add your SIMD approach to
> CopyReadAttributesText() function, the improvement on the 4096 byte
> line length input [1] goes from 20% to 30%.

Agreed, I will.

> I shared my ideas as a Feedback.txt file (.txt to stay off CFBot's
> radar for this thread). I hope these help, please let me know if you
> have any questions.

Thanks a lot!


On Mon, Aug 11, 2025 at 5:52 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Thu, 7 Aug 2025 at 14:15, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
> > >
> > > I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
> > > text}) command and observed approximately a 5% performance
> > > improvement. Please see the detailed test results below.
> >
> > Also, I did a benchmark on text format. I created a benchmark for line
> > length in a table being from 1 byte to 1 megabyte. The peak improvement
> > is at a line length of 4096, and the improvement is more than 20% [1];
> > I saw no regression with your patch.
>
> I did the same benchmark for the CSV format. The peak improvement is
> at a line length of 4096, and the improvement is more than 25% [1]. I
> saw a 5% regression on the 1-byte benchmark; there are no other
> regressions.

Thank you. I'm not too concerned about a regression when there's only
one byte per line.

> > What do you think about adding SIMD to CopyReadAttributesText() and
> > CopyReadAttributesCSV() functions? When I add your SIMD approach to
> > CopyReadAttributesText() function, the improvement on the 4096 byte
> > line length input [1] goes from 20% to 30%.
>
> I wanted to try using SIMD in CopyReadAttributesCSV() as well. The
> improvement on the 4096 byte line length input [1] goes from 25% to
> 35%, the regression on the 1 byte input is the same.

Yes, I'm on it. I'm currently adding the SIMD logic to
CopyReadAttributesCSV() as you suggested. I'll share the new version
of the patch soon.


--
Best regards,
Shinya Kato
NTT OSS Center



Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Shinya Kato
Date:
On Tue, Aug 12, 2025 at 4:25 PM Shinya Kato <shinya11.kato@gmail.com> wrote:

> > +         * However, SIMD optimization cannot be applied in the following cases:
> > +         * - Inside quoted fields, where escape sequences and closing quotes
> > +         *   require sequential processing to handle correctly.
> >
> > I think you can continue SIMD inside quoted fields. Only important
> > thing is you need to set last_was_esc to false when SIMD skipped the
> > chunk.
>
> That's a clever point that last_was_esc should be reset to false when
> a SIMD chunk is skipped. You're right about that specific case.
>
> However, the core challenge is not what happens when we skip a chunk,
> but what happens when a chunk contains special characters like quotes
> or escapes. The main reason we avoid SIMD inside quoted fields is that
> the parsing logic becomes fundamentally sequential and
> context-dependent.
>
> To correctly parse a "" as a single literal quote, we must perform a
> lookahead to check the next character. This is an inherently
> sequential operation that doesn't map well to SIMD's parallel nature.
>
> Trying to handle this stateful logic with SIMD would lead to
> significant implementation complexity, especially with edge cases like
> an escape character falling on the last byte of a chunk.

Ah, you're right. My apologies, I misunderstood the implementation. It
appears that SIMD can be used even within quoted strings.

I think it would be better not to use the SIMD path when last_was_esc
is true. The next character is likely to be a special character, and
handling this case outside the SIMD loop would also improve
readability by consolidating the last_was_esc toggle logic in one
place.

Furthermore, when inside a quote (in_quote) in CSV mode, the detection
of \n and \r can be disabled.
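
Sketched out (illustrative names only; quote_vec, escape_vec, nl_vec,
and cr_vec stand for the broadcast vectors), the mask construction
could become:

	/*
	 * Inside a quoted CSV field only the quote/escape characters need to
	 * stop the fast path, so \n and \r are left out of the special set.
	 */
	Vector8		special = vector8_eq(chunk, quote_vec);

	special = vector8_or(special, vector8_eq(chunk, escape_vec));
	if (!in_quote)
	{
		special = vector8_or(special, vector8_eq(chunk, nl_vec));
		special = vector8_or(special, vector8_eq(chunk, cr_vec));
	}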

+               last_was_esc = false;

Regarding the implementation, I believe we must set last_was_esc to
false when advancing input_buf_ptr, as shown in the code below. For
this reason, I think it’s best to keep the current logic for toggling
last_was_esc.

+               int advance = pg_rightmost_one_pos32(mask);
+               input_buf_ptr += advance;

I've attached a new patch that includes these changes. Further
modifications are still in progress.

--
Best regards,
Shinya Kato
NTT OSS Center

Attachments

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
KAZAR Ayoub
Date:
Following Nazir's findings about 4096 bytes being the best-performing line length, I did more benchmarks from my side on both TEXT and CSV formats with two different cases of normal data (no special characters) and data with many special characters.

Results are good as expected and similar to previous benchmarks:
 ~30.9% faster copy in TEXT format
 ~32.4% faster copy in CSV format
20%-30% reduction in cycles per instruction

In the case of lines containing a lot of special characters (e.g., tables with large numbers of columns, maybe), we obviously expect regressions here because of the overhead of many fallbacks to scalar processing.
Results when special characters make up 1/3 of the line length:
~43.9% slower copy in TEXT format
~16.7% slower copy in CSV format
So even with fewer occurrences of special characters, or wider gaps between them, there might still be some regressions. This is maybe an insignificant case, but it can be treated in other patches if we consider not using the SIMD path sometimes.

I hope this helps more and confirms the patch.

Regards,
Ayoub Kazar

On Thu, Aug 14, 2025 at 01:55, Shinya Kato <shinya11.kato@gmail.com> wrote:
> On Tue, Aug 12, 2025 at 4:25 PM Shinya Kato <shinya11.kato@gmail.com> wrote:
>
> > > +         * However, SIMD optimization cannot be applied in the following cases:
> > > +         * - Inside quoted fields, where escape sequences and closing quotes
> > > +         *   require sequential processing to handle correctly.
> > >
> > > I think you can continue SIMD inside quoted fields. Only important
> > > thing is you need to set last_was_esc to false when SIMD skipped the
> > > chunk.
> >
> > That's a clever point that last_was_esc should be reset to false when
> > a SIMD chunk is skipped. You're right about that specific case.
> >
> > However, the core challenge is not what happens when we skip a chunk,
> > but what happens when a chunk contains special characters like quotes
> > or escapes. The main reason we avoid SIMD inside quoted fields is that
> > the parsing logic becomes fundamentally sequential and
> > context-dependent.
> >
> > To correctly parse a "" as a single literal quote, we must perform a
> > lookahead to check the next character. This is an inherently
> > sequential operation that doesn't map well to SIMD's parallel nature.
> >
> > Trying to handle this stateful logic with SIMD would lead to
> > significant implementation complexity, especially with edge cases like
> > an escape character falling on the last byte of a chunk.
>
> Ah, you're right. My apologies, I misunderstood the implementation. It
> appears that SIMD can be used even within quoted strings.
>
> I think it would be better not to use the SIMD path when last_was_esc
> is true. The next character is likely to be a special character, and
> handling this case outside the SIMD loop would also improve
> readability by consolidating the last_was_esc toggle logic in one
> place.
>
> Furthermore, when inside a quote (in_quote) in CSV mode, the detection
> of \n and \r can be disabled.
>
> +               last_was_esc = false;
>
> Regarding the implementation, I believe we must set last_was_esc to
> false when advancing input_buf_ptr, as shown in the code below. For
> this reason, I think it’s best to keep the current logic for toggling
> last_was_esc.
>
> +               int advance = pg_rightmost_one_pos32(mask);
> +               input_buf_ptr += advance;
>
> I've attached a new patch that includes these changes. Further
> modifications are still in progress.
>
> --
> Best regards,
> Shinya Kato
> NTT OSS Center

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nazir Bilal Yavuz
Date:
Hi,

On Thu, 14 Aug 2025 at 05:25, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
>
> Following Nazir's findings about 4096 bytes being the best-performing line length, I did more benchmarks from my side on both TEXT and CSV formats with two different cases of normal data (no special characters) and data with many special characters.
>
> Results are good as expected and similar to previous benchmarks:
>  ~30.9% faster copy in TEXT format
>  ~32.4% faster copy in CSV format
> 20%-30% reduction in cycles per instruction
>
> In the case of lines containing a lot of special characters (e.g., tables with large numbers of columns, maybe), we obviously expect regressions here because of the overhead of many fallbacks to scalar processing.
> Results when special characters make up 1/3 of the line length:
> ~43.9% slower copy in TEXT format
> ~16.7% slower copy in CSV format
> So even with fewer occurrences of special characters, or wider gaps between them, there might still be some regressions. This is maybe an insignificant case, but it can be treated in other patches if we consider not using the SIMD path sometimes.
>
> I hope this helps more and confirms the patch.

Thanks for running that benchmark! Would you mind sharing a reproducer
for the regression you observed?

--
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Speed up COPY FROM text/CSV parsing using SIMD

From
KAZAR Ayoub
Date:

> Hi,
>
> On Thu, 14 Aug 2025 at 05:25, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
> >
> > Following Nazir's findings about 4096 bytes being the best-performing line length, I did more benchmarks from my side on both TEXT and CSV formats with two different cases of normal data (no special characters) and data with many special characters.
> >
> > Results are good as expected and similar to previous benchmarks:
> >  ~30.9% faster copy in TEXT format
> >  ~32.4% faster copy in CSV format
> > 20%-30% reduction in cycles per instruction
> >
> > In the case of lines containing a lot of special characters (e.g., tables with large numbers of columns, maybe), we obviously expect regressions here because of the overhead of many fallbacks to scalar processing.
> > Results when special characters make up 1/3 of the line length:
> > ~43.9% slower copy in TEXT format
> > ~16.7% slower copy in CSV format
> > So even with fewer occurrences of special characters, or wider gaps between them, there might still be some regressions. This is maybe an insignificant case, but it can be treated in other patches if we consider not using the SIMD path sometimes.
> >
> > I hope this helps more and confirms the patch.
>
> Thanks for running that benchmark! Would you mind sharing a reproducer
> for the regression you observed?
>
> --
> Regards,
> Nazir Bilal Yavuz
> Microsoft

Of course, I attached the sql to generate the text and csv test files.
If having 1/3 of the line length as special characters is an exaggeration, something lower might still reproduce some regressions, of course, for the same idea.

Best regards,
Ayoub Kazar
Attachments

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Ants Aasma
Date:
On Thu, 7 Aug 2025 at 14:15, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> I have a couple of ideas that I was working on:
> ---
>
> +         * However, SIMD optimization cannot be applied in the following cases:
> +         * - Inside quoted fields, where escape sequences and closing quotes
> +         *   require sequential processing to handle correctly.
>
> I think you can continue SIMD inside quoted fields. Only important
> thing is you need to set last_was_esc to false when SIMD skipped the
> chunk.

There is a trick with doing carryless multiplication with -1 that can
be used to SIMD process transitions between quoted/not-quoted. [1]
This is able to convert a bitmask of unescaped quote character
positions to a quote mask in a single operation. I last looked at it 5
years ago, but I remember coming to the conclusion that it would work
for implementing PostgreSQL's interpretation of CSV.

[1] https://github.com/geofflangdale/simdcsv/blob/master/src/main.cpp#L76
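
For readers who haven't seen the trick: carryless-multiplying the mask
of unescaped quote positions by an all-ones word computes a prefix XOR,
which directly yields the between-quotes regions. Here is a portable C
sketch of the same computation using shifts (the PCLMULQDQ version
replaces the shift cascade with a single instruction); the function
names are mine, purely for illustration:

#include <stdint.h>

/*
 * prefix_xor(x): bit i of the result is the XOR of bits 0..i of x,
 * i.e. a carryless multiplication of x by an all-ones word.
 */
static inline uint64_t
prefix_xor(uint64_t x)
{
	x ^= x << 1;
	x ^= x << 2;
	x ^= x << 4;
	x ^= x << 8;
	x ^= x << 16;
	x ^= x << 32;
	return x;
}

/*
 * Given a bitmask of unescaped quote positions in a 64-byte chunk,
 * return a mask of the bytes lying inside quotes; *prev carries the
 * quote state (0 or ~0) across chunk boundaries.
 */
static inline uint64_t
quote_region_mask(uint64_t quote_bits, uint64_t *prev)
{
	uint64_t	inside = prefix_xor(quote_bits) ^ *prev;

	*prev = (uint64_t) ((int64_t) inside >> 63);	/* all-ones if still open */
	return inside;
}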

--
Ants



Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nazir Bilal Yavuz
Date:
Hi,

On Thu, 14 Aug 2025 at 18:00, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
>> Thanks for running that benchmark! Would you mind sharing a reproducer
>> for the regression you observed?
>
> Of course, I attached the sql to generate the text and csv test files.
> If having 1/3 of the line length as special characters is an exaggeration, something lower might still reproduce some regressions, of course, for the same idea.

Thank you so much!

I am able to reproduce the regression you mentioned, but both
regressions are 20% on my end. I found (by experimenting) that SIMD
causes a regression if it advances less than 5 characters.

So, I implemented a small heuristic. It works like that:

- If advance < 5 -> insert a sleep penalty (n cycles).
- Each time advance < 5, n is doubled.
- Each time advance ≥ 5, n is halved.

I am sharing a POC patch to show the heuristic; it can be applied on top
of v1-0001. The heuristic version has the same performance improvements
as v1-0001, but the regression is 5% instead of 20% compared to master.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nazir Bilal Yavuz
Date:
Hi,

On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> I am able to reproduce the regression you mentioned, but both
> regressions are 20% on my end. I found (by experimenting) that SIMD
> causes a regression if it advances less than 5 characters.
>
> So, I implemented a small heuristic. It works like that:
>
> - If advance < 5 -> insert a sleep penalty (n cycles).

'sleep' might be a poor word choice here. I meant skipping SIMD n times.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Andrew Dunstan
Date:
On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
> Hi,
>
> On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>> I am able to reproduce the regression you mentioned, but both
>> regressions are 20% on my end. I found (by experimenting) that SIMD
>> causes a regression if it advances less than 5 characters.
>>
>> So, I implemented a small heuristic. It works like that:
>>
>> - If advance < 5 -> insert a sleep penalty (n cycles).
> 'sleep' might be a poor word choice here. I meant skipping SIMD n times.
>

I was thinking a bit about that this morning. I wonder if, instead of
having a constantly applied heuristic like this, it might be better to
do a little extra accounting in the first, say, 1000 lines of an input
file, and if less than some portion of the input is found to be special
characters, then switch to the SIMD code. What that portion should be
would need to be determined by some experimentation with a variety of
typical workloads, but given your findings 20% seems like a good
starting point.


cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Speed up COPY FROM text/CSV parsing using SIMD

From
KAZAR Ayoub
Date:

> On Thu, 14 Aug 2025 at 18:00, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
> >> Thanks for running that benchmark! Would you mind sharing a reproducer
> >> for the regression you observed?
> >
> > Of course, I attached the sql to generate the text and csv test files.
> > If having 1/3 of the line length as special characters is an exaggeration, something lower might still reproduce some regressions, of course, for the same idea.
>
> Thank you so much!
>
> I am able to reproduce the regression you mentioned, but both
> regressions are 20% on my end. I found (by experimenting) that SIMD
> causes a regression if it advances less than 5 characters.
>
> So, I implemented a small heuristic. It works like that:
>
> - If advance < 5 -> insert a sleep penalty (n cycles).
> - Each time advance < 5, n is doubled.
> - Each time advance ≥ 5, n is halved.
>
> I am sharing a POC patch to show the heuristic; it can be applied on top
> of v1-0001. The heuristic version has the same performance improvements
> as v1-0001, but the regression is 5% instead of 20% compared to master.
>
> --
> Regards,
> Nazir Bilal Yavuz
> Microsoft
Yes, this is good; I'm also getting only about a 5% regression now.



Regards,
Ayoub Kazar

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nazir Bilal Yavuz
Date:
Hi,

On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
> On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >> I am able to reproduce the regression you mentioned, but both
> >> regressions are 20% on my end. I found (by experimenting) that SIMD
> >> causes a regression if it advances less than 5 characters.
> >>
> >> So, I implemented a small heuristic. It works like that:
> >>
> >> - If advance < 5 -> insert a sleep penalty (n cycles).
> > 'sleep' might be a poor word choice here. I meant skipping SIMD n times.
> >
>
> I was thinking a bit about that this morning. I wonder if, instead of
> having a constantly applied heuristic like this, it might be better to
> do a little extra accounting in the first, say, 1000 lines of an input
> file, and if less than some portion of the input is found to be special
> characters, then switch to the SIMD code. What that portion should be
> would need to be determined by some experimentation with a variety of
> typical workloads, but given your findings 20% seems like a good
> starting point.

I implemented a heuristic similar to this. It is a mix of the
previous heuristic and your idea; it works like this:

Overall logic is that we will not run SIMD for the entire line and we
decide if it is worth it to run SIMD for the next lines.

1 - We will try SIMD and decide if it is worth it to run SIMD.
1.1 - If it is worth it, we will continue to run SIMD and we will
halve the simd_last_sleep_cycle variable.
1.2 - If it is not worth it, we will double the simd_last_sleep_cycle
and we will not run SIMD for that many lines.
1.3 - After skipping simd_last_sleep_cycle lines, we will go back to #1.
Note: simd_last_sleep_cycle cannot exceed 1024, so we will run SIMD at
least once every 1024 lines.

With this heuristic the regression is limited to 2% in the worst case.
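
In code form the idea is roughly this (an illustrative sketch with
made-up names; the real state would live in the COPY state struct
rather than in statics, and Min/Max are PostgreSQL's macros from c.h):

#define SIMD_MIN_USEFUL_ADVANCE 5		/* below this, SIMD didn't pay off */
#define SIMD_MAX_SLEEP_LINES	1024	/* upper bound on the penalty */

static int	simd_sleep_lines = 1;	/* current penalty, doubled/halved */
static int	simd_lines_to_skip = 0; /* countdown of lines without SIMD */

static bool
simd_worth_trying(int avg_advance)
{
	if (simd_lines_to_skip > 0)
	{
		simd_lines_to_skip--;	/* still sleeping: stay on the scalar path */
		return false;
	}
	if (avg_advance < SIMD_MIN_USEFUL_ADVANCE)
	{
		/* not worth it: double the penalty and skip that many lines */
		simd_sleep_lines = Min(simd_sleep_lines * 2, SIMD_MAX_SLEEP_LINES);
		simd_lines_to_skip = simd_sleep_lines;
		return false;
	}
	simd_sleep_lines = Max(simd_sleep_lines / 2, 1);	/* worth it: decay */
	return true;
}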

Patches are attached; the first patch is v2-0001 from Shinya with the
'-Werror=maybe-uninitialized' fixes and the pgindent changes. 0002 is
the actual heuristic patch.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
KAZAR Ayoub
Date:

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nazir Bilal Yavuz
Date:
Hi,

On Thu, 16 Oct 2025 at 17:29, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Overall logic is that we will not run SIMD for the entire line and we
> decide if it is worth it to run SIMD for the next lines.

I had a typo there; the correct sentence is:

"Overall logic is that we *will* run SIMD for the entire line and we
decide if it is worth it to run SIMD for the next lines."

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nazir Bilal Yavuz
Date:
Hi,

On Sat, 18 Oct 2025 at 21:46, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
>
> Hello,
>
> I've rebenchmarked the new heuristic patch. We still have the previous improvements, ranging from 15% to 30%. For regressions I see at maximum 3% or 4% in the worst case, so this is solid.

Thank you so much for doing this! The results look nice; do you think
there are any other benchmarks that might be interesting to try?

> I'm also trying the idea of doing SIMD inside quotes with prefix XOR using carry-less multiplication, avoiding the slow path in all cases even with weird-looking input, but it needs to take into consideration the availability of the PCLMULQDQ instruction set with <wmmintrin.h> and here we go, it quickly starts to become dirty OR we can wait for the decision to start requiring x86-64-v2 or v3, which has SSE4.2 and AVX2.

I cannot quite picture this; would you mind sharing a few examples or patches?

--
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Andrew Dunstan
Date:


On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
> Hi,
>
> On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew@dunslane.net> wrote:
> > On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
> > > Hi,
> > >
> > > On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > > > I am able to reproduce the regression you mentioned, but both
> > > > regressions are 20% on my end. I found (by experimenting) that SIMD
> > > > causes a regression if it advances less than 5 characters.
> > > >
> > > > So, I implemented a small heuristic. It works like that:
> > > >
> > > > - If advance < 5 -> insert a sleep penalty (n cycles).
> > > 'sleep' might be a poor word choice here. I meant skipping SIMD n times.
> >
> > I was thinking a bit about that this morning. I wonder if, instead of
> > having a constantly applied heuristic like this, it might be better to
> > do a little extra accounting in the first, say, 1000 lines of an input
> > file, and if less than some portion of the input is found to be special
> > characters, then switch to the SIMD code. What that portion should be
> > would need to be determined by some experimentation with a variety of
> > typical workloads, but given your findings 20% seems like a good
> > starting point.
>
> I implemented a heuristic similar to this. It is a mix of the previous
> heuristic and your idea; it works like this:
>
> Overall logic is that we will not run SIMD for the entire line and we
> decide if it is worth it to run SIMD for the next lines.
>
> 1 - We will try SIMD and decide if it is worth it to run SIMD.
> 1.1 - If it is worth it, we will continue to run SIMD and we will
> halve the simd_last_sleep_cycle variable.
> 1.2 - If it is not worth it, we will double the simd_last_sleep_cycle
> and we will not run SIMD for that many lines.
> 1.3 - After skipping simd_last_sleep_cycle lines, we will go back to #1.
> Note: simd_last_sleep_cycle cannot exceed 1024, so we will run SIMD at
> least once every 1024 lines.
>
> With this heuristic the regression is limited to 2% in the worst case.


My worry is that the worst case is actually quite common. Sparse data sets dominated by a lot of null values (and hence lots of special characters) are very common. Are people prepared to accept a 2% regression on load times for such data sets?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nathan Bossart
Date:
On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
> On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
>> With this heuristic the regression is limited to 2% in the worst case.
> 
> My worry is that the worst case is actually quite common. Sparse data sets
> dominated by a lot of null values (and hence lots of special characters) are
> very common. Are people prepared to accept a 2% regression on load times for
> such data sets?

Without knowing how common it is, I think it's difficult to judge whether
2% is a reasonable trade-off.  If <5% of workloads might see a small
regression while the other >95% see double-digit percentage improvements,
then I might argue that it's fine.  But I'm not sure we have any way to
know those sorts of details at the moment.

I'm also at least a little skeptical about the 2% number.  IME that's
generally within the noise range and can vary greatly between machines and
test runs.

-- 
nathan



Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Andrew Dunstan
Date:


On 2025-10-20 Mo 1:04 PM, Nathan Bossart wrote:
> On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
> > On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
> > > With this heuristic the regression is limited to 2% in the worst case.
> >
> > My worry is that the worst case is actually quite common. Sparse data sets
> > dominated by a lot of null values (and hence lots of special characters) are
> > very common. Are people prepared to accept a 2% regression on load times for
> > such data sets?
>
> Without knowing how common it is, I think it's difficult to judge whether
> 2% is a reasonable trade-off.  If <5% of workloads might see a small
> regression while the other >95% see double-digit percentage improvements,
> then I might argue that it's fine.  But I'm not sure we have any way to
> know those sorts of details at the moment.

I guess what I don't understand is why we actually need to do the test
continuously, even using an adaptive algorithm. Data files in my experience
usually have lines with fairly similar shapes. It's highly unlikely that you
will get the first 1000 (say) lines of a file that are rich in special
characters and then some later significant section that isn't, or vice versa.
Therefore, doing the test once should yield the correct answer that can be
applied to the rest of the file. That should reduce the worst case regression
to ~0% without sacrificing any of the performance gains. I appreciate the
elegance of what Bilal has done here, but it does seem like overkill.

> I'm also at least a little skeptical about the 2% number.  IME that's
> generally within the noise range and can vary greatly between machines and
> test runs.

Fair point.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nazir Bilal Yavuz
Date:
Hi,

On Mon, 20 Oct 2025 at 23:32, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
> On 2025-10-20 Mo 1:04 PM, Nathan Bossart wrote:
>
> On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
>
> On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
>
> With this heuristic the regression is limited to 2% in the worst case.
>
> My worry is that the worst case is actually quite common. Sparse data sets
> dominated by a lot of null values (and hence lots of special characters) are
> very common. Are people prepared to accept a 2% regression on load times for
> such data sets?
>
> Without knowing how common it is, I think it's difficult to judge whether
> 2% is a reasonable trade-off.  If <5% of workloads might see a small
> regression while the other >95% see double-digit percentage improvements,
> then I might argue that it's fine.  But I'm not sure we have any way to
> know those sorts of details at the moment.
>
>
> I guess what I don't understand is why we actually need to do the test
> continuously, even using an adaptive algorithm. Data files in my experience
> usually have lines with fairly similar shapes. It's highly unlikely that you
> will get the first 1000 (say) lines of a file that are rich in special
> characters and then some later significant section that isn't, or vice versa.
> Therefore, doing the test once should yield the correct answer that can be
> applied to the rest of the file. That should reduce the worst case regression
> to ~0% without sacrificing any of the performance gains. I appreciate the
> elegance of what Bilal has done here, but it does seem like overkill.

I think the problem is deciding how many lines to process before
deciding for the rest. 1000 lines could work for the small sized data
but it might not work for the big sized data. Also, it might cause
worse regressions for the small sized data. Because of this reason, I
tried to implement a heuristic that will work regardless of the size
of the data. The last heuristic I suggested will run SIMD for
approximately (#number_of_lines / 1024 [1024 is the max number of
lines to sleep before running SIMD again]) lines if all characters in
the data are special characters.

--
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Speed up COPY FROM text/CSV parsing using SIMD

From
KAZAR Ayoub
Date:


On Sat, Oct 18, 2025 at 10:01 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> Thank you so much for doing this! The results look nice; do you think
> there are any other benchmarks that might be interesting to try?
>
> > I'm also trying the idea of doing SIMD inside quotes with prefix XOR using carry-less multiplication, avoiding the slow path in all cases even with weird-looking input, but it needs to take into consideration the availability of the PCLMULQDQ instruction set with <wmmintrin.h> and here we go, it quickly starts to become dirty OR we can wait for the decision to start requiring x86-64-v2 or v3, which has SSE4.2 and AVX2.
>
> I cannot quite picture this; would you mind sharing a few examples or patches?
The idea aims to avoid stopping at characters that are not actually special in their position (inside quotes, escaped, etc.).
This is done by creating a number of masks from the original chunk, masks like quote_mask, escape_mask, and an odd-escape-sequences mask; from these we can deduce which quotes are not special to stop at.
Then, for inside quotes, we aim to know which characters in our chunk are inside quotes (also keeping track of the previous chunk's quote state), and there's a clever/fast way to do it [1].
After this you start to match with LF and CR, etc., all this while maintaining the state of what you've seen (the annoying part).
At the end you only reach the scalar path, advancing by the position of the first real special character that requires special treatment.

However, after trying to implement this on the existing pipeline of the COPY command [2] (a broken, hopeless try, but it has the idea), it becomes very unreasonable for a lot of reasons:
- It is very challenging to correctly handle commas inside quoted fields, and to track quoted vs. unquoted state (especially across chunk boundaries, or with escaped quotes).
- Using carry-less multiplication (CLMUL) for prefix XOR on a 16-byte chunk is overkill for some architectures where PCLMULQDQ latency is high [3][4], to the point where it performs worse than unrolled shifts + XOR (5 cycles).
- It starts to feel that handling these cases is inherently scalar; doing all that work for a 16-byte chunk would be unreasonable since it's not free, compared to a simple SIMD assist with Nazir's heuristic, which is way nicer in general.

Currently we are at 200-400Mbps, which isn't that terrible compared to production and non-production-grade parsers (of course we don't only parse in our case); also we are using SSE2 only, so theoretically if we add support for AVX later on we'll have even better numbers.
Maybe more micro-optimizations to the current heuristic can squeeze it more.


Regards,
Ayoub Kazar.

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
KAZAR Ayoub
Date:


On Tue, Oct 21, 2025, 8:17 AM KAZAR Ayoub <ma_kazar@esi.dz> wrote:

> Currently we are at 200-400Mbps, which isn't that terrible compared to production and non-production-grade parsers (of course we don't only parse in our case); also we are using SSE2 only, so theoretically if we add support for AVX later on we'll have even better numbers.
> Maybe more micro-optimizations to the current heuristic can squeeze it more.
>
> Regards,
> Ayoub Kazar.
Sorry, I meant 200-400MB/s.


Regards.
Ayoub Kazar.

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nathan Bossart
Date:
On Tue, Oct 21, 2025 at 12:09:27AM +0300, Nazir Bilal Yavuz wrote:
> I think the problem is deciding how many lines to process before
> deciding for the rest. 1000 lines could work for the small sized data
> but it might not work for the big sized data. Also, it might cause
> worse regressions for the small sized data.

IMHO we have some leeway with smaller amounts of data.  If COPY FROM for
1000 rows takes 19 milliseconds as opposed to 11 milliseconds, it seems
unlikely users would be inconvenienced all that much.  (Those numbers are
completely made up in order to illustrate my point.)

> Because of this reason, I
> tried to implement a heuristic that will work regardless of the size
> of the data. The last heuristic I suggested will run SIMD for
> approximately (#number_of_lines / 1024 [1024 is the max number of
> lines to sleep before running SIMD again]) lines if all characters in
> the data are special characters.

I wonder if we could mitigate the regression further by spacing out the
checks a bit more.  It could be worth comparing a variety of values to
identify what works best with the test data.

-- 
nathan



Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nathan Bossart
Date:
On Tue, Oct 21, 2025 at 08:17:01AM +0200, KAZAR Ayoub wrote:
>>> I'm also trying the idea of doing SIMD inside quotes with prefix XOR
>>> using carry-less multiplication, avoiding the slow path in all cases even
>>> with weird-looking input, but it needs to take into consideration the
>>> availability of the PCLMULQDQ instruction set with <wmmintrin.h> and here we
>>> go, it quickly starts to become dirty OR we can wait for the decision to
>>> start requiring x86-64-v2 or v3, which has SSE4.2 and AVX2.
>
> [...]
>
> Currently we are at 200-400Mbps, which isn't that terrible compared to
> production and non-production-grade parsers (of course we don't only parse
> in our case); also we are using SSE2 only, so theoretically if we add
> support for AVX later on we'll have even better numbers.
> Maybe more micro-optimizations to the current heuristic can squeeze it more.

I'd greatly prefer that we stick with SSE2/Neon (i.e., simd.h) unless the
gains are extraordinary.  Beyond the inherent complexity of using
architecture-specific intrinsics, you also have to deal with configure-time
checks, runtime checks, and function pointer overhead juggling.  That tends
to be a lot of work for the amount of gain.

-- 
nathan



Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nazir Bilal Yavuz
Date:
Hi,

On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Oct 21, 2025 at 12:09:27AM +0300, Nazir Bilal Yavuz wrote:
> > I think the problem is deciding how many lines to process before
> > deciding for the rest. 1000 lines could work for the small sized data
> > but it might not work for the big sized data. Also, it might cause
> > worse regressions for the small sized data.
>
> IMHO we have some leeway with smaller amounts of data.  If COPY FROM for
> 1000 rows takes 19 milliseconds as opposed to 11 milliseconds, it seems
> unlikely users would be inconvenienced all that much.  (Those numbers are
> completely made up in order to illustrate my point.)
>
> > Because of this reason, I
> > tried to implement a heuristic that will work regardless of the size
> > of the data. The last heuristic I suggested will run SIMD for
> > approximately (#number_of_lines / 1024 [1024 is the max number of
> > lines to sleep before running SIMD again]) lines if all characters in
> > the data are special characters.
>
> I wonder if we could mitigate the regression further by spacing out the
> checks a bit more.  It could be worth comparing a variety of values to
> identify what works best with the test data.

Do you mean that instead of doubling the SIMD sleep, we should
multiply it by 3 (or another factor)? Or are you referring to
increasing the maximum sleep from 1024? Or possibly both?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Nathan Bossart
Date:
On Wed, Oct 22, 2025 at 03:33:37PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I wonder if we could mitigate the regression further by spacing out the
>> checks a bit more.  It could be worth comparing a variety of values to
>> identify what works best with the test data.
> 
> Do you mean that instead of doubling the SIMD sleep, we should
> multiply it by 3 (or another factor)? Or are you referring to
> increasing the maximum sleep from 1024? Or possibly both?

I'm not sure of the precise details, but the main thrust of my suggestion
is to assume that whatever sampling you do to determine whether to use SIMD
is good for a larger chunk of data.  That is, if you are sampling 1K lines
and then using the result to choose whether to use SIMD for the next 100K
lines, we could instead bump the latter number to 1M lines (or something).
That way we minimize the regression for relatively uniform data sets while
retaining some ability to adapt in case things change halfway through a
large table.

-- 
nathan



Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Andrew Dunstan
Date:
On 2025-10-22 We 3:24 PM, Nathan Bossart wrote:
> On Wed, Oct 22, 2025 at 03:33:37PM +0300, Nazir Bilal Yavuz wrote:
>> On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
>>> I wonder if we could mitigate the regression further by spacing out the
>>> checks a bit more.  It could be worth comparing a variety of values to
>>> identify what works best with the test data.
>> Do you mean that instead of doubling the SIMD sleep, we should
>> multiply it by 3 (or another factor)? Or are you referring to
>> increasing the maximum sleep from 1024? Or possibly both?
> I'm not sure of the precise details, but the main thrust of my suggestion
> is to assume that whatever sampling you do to determine whether to use SIMD
> is good for a larger chunk of data.  That is, if you are sampling 1K lines
> and then using the result to choose whether to use SIMD for the next 100K
> lines, we could instead bump the latter number to 1M lines (or something).
> That way we minimize the regression for relatively uniform data sets while
> retaining some ability to adapt in case things change halfway through a
> large table.
>


I'd be ok with numbers like this, although I suspect the number of
cases where we see shape shifts like this in the middle of a data set
would be vanishingly small.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Manni Wood
Date:


On Wed, Oct 29, 2025 at 5:23 PM Andrew Dunstan <andrew@dunslane.net> wrote:

> On 2025-10-22 We 3:24 PM, Nathan Bossart wrote:
> > On Wed, Oct 22, 2025 at 03:33:37PM +0300, Nazir Bilal Yavuz wrote:
> >> On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
> >>> I wonder if we could mitigate the regression further by spacing out the
> >>> checks a bit more.  It could be worth comparing a variety of values to
> >>> identify what works best with the test data.
> >> Do you mean that instead of doubling the SIMD sleep, we should
> >> multiply it by 3 (or another factor)? Or are you referring to
> >> increasing the maximum sleep from 1024? Or possibly both?
> > I'm not sure of the precise details, but the main thrust of my suggestion
> > is to assume that whatever sampling you do to determine whether to use SIMD
> > is good for a larger chunk of data.  That is, if you are sampling 1K lines
> > and then using the result to choose whether to use SIMD for the next 100K
> > lines, we could instead bump the latter number to 1M lines (or something).
> > That way we minimize the regression for relatively uniform data sets while
> > retaining some ability to adapt in case things change halfway through a
> > large table.
>
> I'd be ok with numbers like this, although I suspect the number of
> cases where we see shape shifts like this in the middle of a data set
> would be vanishingly small.
>
> cheers
>
> andrew
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com




Hello!

I wanted to reproduce the results using the files attached by Shinya Kato and Ayoub Kazar. I installed a postgres compiled from master, and then a postgres built from master plus Nazir Bilal Yavuz's v3 patches applied.

The master+v3patches postgres naturally performed better on copying into the database: anywhere from 11% better for the t.csv file produced by Shinya's test.sql, to 35% better copying in the t_4096_none.csv file created by Ayoub Kazar's simd-copy-from-bench.sql.

But here's where it gets weird. The two files created by Ayoub Kazar's simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt, and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5% respectively.

This seems impossible.

A few things I should note:

I timed the commands using the Unix time command, like so:

time psql -X -U mwood -h localhost -d postgres -c '\copy t from /tmp/t_4096_escape.txt'

For each file, I timed the copy 6 times and took the average.

This was done on my work Linux machine while also running Chrome and an Open Office spreadsheet; not a dedicated machine only running postgres.

All of the copy results took from 4.5 seconds (Shinya's t.csv copied into postgres compiled from master) to 2 seconds (Ayoub Kazar's t_4096_none.csv copied into postgres compiled from master plus Nazir's v3 patches).

Perhaps I need to fiddle with the provided SQL to produce larger files to get longer run times? Maybe sub-second differences won't tell as interesting a story as minutes-long copy commands?

Thanks for reading this.
--
-- Manni Wood EDB: https://www.enterprisedb.com

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
KAZAR Ayoub
Date:
On Tue, Nov 11, 2025 at 11:23 PM Manni Wood <manni.wood@enterprisedb.com> wrote:
> Hello!
>
> I wanted to reproduce the results using the files attached by Shinya Kato and Ayoub Kazar. I installed a postgres compiled from master, and then a postgres built from master plus Nazir Bilal Yavuz's v3 patches applied.
>
> The master+v3patches postgres naturally performed better on copying into the database: anywhere from 11% better for the t.csv file produced by Shinya's test.sql, to 35% better copying in the t_4096_none.csv file created by Ayoub Kazar's simd-copy-from-bench.sql.
>
> But here's where it gets weird. The two files created by Ayoub Kazar's simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt, and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5% respectively.
>
> This seems impossible.
>
> A few things I should note:
>
> I timed the commands using the Unix time command, like so:
>
> time psql -X -U mwood -h localhost -d postgres -c '\copy t from /tmp/t_4096_escape.txt'
>
> For each file, I timed the copy 6 times and took the average.
>
> This was done on my work Linux machine while also running Chrome and an Open Office spreadsheet; not a dedicated machine only running postgres.

Hello,
I think if you do a perf benchmark (if it still reproduces), it would probably be possible to explain why it's performing like that by looking at the CPI and other metrics and comparing them to my findings.
What I also suggest is to make the data even closer to the worst case, i.e. more special characters where the switching between SIMD and scalar processing hurts (in the simd-copy-from-bench.sql file); if it still does a good job then there's something to look at.

> All of the copy results took from 4.5 seconds (Shinya's t.csv copied into postgres compiled from master) to 2 seconds (Ayoub Kazar's t_4096_none.csv copied into postgres compiled from master plus Nazir's v3 patches).
>
> Perhaps I need to fiddle with the provided SQL to produce larger files to get longer run times? Maybe sub-second differences won't tell as interesting a story as minutes-long copy commands?

I did try it on some GBs (around 2-5GB only) and the differences were not that much, but if you can run this on more GBs (at least 10GB) it would be good to look at, although I don't suspect anything interesting since the shape of the data is the same for the totality of the COPY.

> Thanks for reading this.
> --
> -- Manni Wood EDB: https://www.enterprisedb.com
Thanks for the info.


Regards,
Ayoub Kazar. 

Re: Speed up COPY FROM text/CSV parsing using SIMD

From
Manni Wood
Date:


On Wed, Nov 12, 2025 at 8:44 AM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
> On Tue, Nov 11, 2025 at 11:23 PM Manni Wood <manni.wood@enterprisedb.com> wrote:
> > Hello!
> >
> > I wanted to reproduce the results using the files attached by Shinya Kato and Ayoub Kazar. I installed a postgres compiled from master, and then a postgres built from master plus Nazir Bilal Yavuz's v3 patches applied.
> >
> > The master+v3patches postgres naturally performed better on copying into the database: anywhere from 11% better for the t.csv file produced by Shinya's test.sql, to 35% better copying in the t_4096_none.csv file created by Ayoub Kazar's simd-copy-from-bench.sql.
> >
> > But here's where it gets weird. The two files created by Ayoub Kazar's simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt, and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5% respectively.
> >
> > This seems impossible.
> >
> > A few things I should note:
> >
> > I timed the commands using the Unix time command, like so:
> >
> > time psql -X -U mwood -h localhost -d postgres -c '\copy t from /tmp/t_4096_escape.txt'
> >
> > For each file, I timed the copy 6 times and took the average.
> >
> > This was done on my work Linux machine while also running Chrome and an Open Office spreadsheet; not a dedicated machine only running postgres.
>
> Hello,
> I think if you do a perf benchmark (if it still reproduces), it would probably be possible to explain why it's performing like that by looking at the CPI and other metrics and comparing them to my findings.
> What I also suggest is to make the data even closer to the worst case, i.e. more special characters where the switching between SIMD and scalar processing hurts (in the simd-copy-from-bench.sql file); if it still does a good job then there's something to look at.
>
> > All of the copy results took from 4.5 seconds (Shinya's t.csv copied into postgres compiled from master) to 2 seconds (Ayoub Kazar's t_4096_none.csv copied into postgres compiled from master plus Nazir's v3 patches).
> >
> > Perhaps I need to fiddle with the provided SQL to produce larger files to get longer run times? Maybe sub-second differences won't tell as interesting a story as minutes-long copy commands?
>
> I did try it on some GBs (around 2-5GB only) and the differences were not that much, but if you can run this on more GBs (at least 10GB) it would be good to look at, although I don't suspect anything interesting since the shape of the data is the same for the totality of the COPY.
>
> > Thanks for reading this.
> > --
> > -- Manni Wood EDB: https://www.enterprisedb.com
>
> Thanks for the info.
>
> Regards,
> Ayoub Kazar.

Hello again!

It looks like using 10 times the data removed the apparent speedup in the SIMD code when it has to deal with t_4096_escape.txt and t_4096_quote.csv. When both files contain 1,000,000 lines each, postgres master+v3patch imports 0.63% slower and 0.54% slower, respectively. For 1,000,000 lines of t_4096_none.txt, the v3 patch yields a 30% speedup. For 1,000,000 lines of t_4096_none.csv, the v3 patch yields a 33% speedup.

I got these numbers just via simple timing, though this time I used psql's \timing feature. I left psql running rather than launching it each time, as I did when I used the Unix "time" command. I ran the copy command 5 times for each file and averaged the results. Again, this happened on a Linux machine that was also running Chrome and Open Office's spreadsheet.

I should probably try to construct some .txt or .csv files that would trip up the SIMD on/off heuristic in the v3 patch.

If data "in the wild" tend to be roughly the same "shape" from row to row, as Andrew's experience has shown, I imagine these million-row results bode well for the v3 patch...
--
-- Manni Wood EDB: https://www.enterprisedb.com