Thread: Speed up COPY FROM text/CSV parsing using SIMD
Hi hackers,
I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
text}) command and observed approximately a 5% performance
improvement. Please see the detailed test results below.
Idea
====
The current text/CSV parser processes input byte-by-byte, checking
whether each byte is a special character (\n, \r, quote, escape) or a
regular character, and transitions states in a state machine. This
sequential processing is inefficient and likely causes frequent branch
mispredictions due to the many if statements.
I thought this problem could be addressed by leveraging SIMD and
vectorized operations for faster processing.
Implementation Overview
=======================
1. Create a vector of special characters (e.g., Vector8 nl =
vector8_broadcast('\n');).
2. Load the input buffer into a Vector8 variable called chunk.
3. Perform vectorized operations between chunk and the special
character vectors to check if the buffer contains any special
characters.
4-1. If no special characters are found, advance the input_buf_ptr by
sizeof(Vector8).
4-2. If special characters are found, advance the input_buf_ptr as far
as possible, then fall back to the original text/CSV parser for
byte-by-byte processing.
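To make this concrete, here is a minimal sketch of the scan loop written
against the Vector8 helpers in src/include/port/simd.h. It is an
illustration rather than the attached patch: the function name and
arguments are placeholders, the set of special characters is simplified,
and the actual patch additionally advances to the first special byte
before falling back.

#include "c.h"
#include "port/simd.h"

/*
 * Illustration only: return how many leading bytes of buf are guaranteed
 * to contain no parser-relevant special character, advancing one whole
 * SIMD register at a time.  The byte-by-byte state machine takes over
 * from wherever this stops.
 */
static int
count_plain_prefix(const char *buf, int len, char quotec, char escapec)
{
    int     i = 0;

    while (len - i >= (int) sizeof(Vector8))
    {
        Vector8 chunk;

        vector8_load(&chunk, (const uint8 *) &buf[i]);

        /* Does this chunk contain any byte the state machine cares about? */
        if (vector8_has(chunk, '\n') ||
            vector8_has(chunk, '\r') ||
            vector8_has(chunk, (uint8) quotec) ||
            vector8_has(chunk, (uint8) escapec))
            break;              /* fall back to byte-by-byte parsing */

        i += sizeof(Vector8);   /* whole chunk is ordinary data */
    }

    return i;
}

A nice property of simd.h is that with USE_NO_SIMD the Vector8 type
degrades to a plain uint64, so the same loop still handles 8 bytes per
iteration on platforms without SSE2/Neon.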
Test
====
I tested the performance by measuring the time it takes to load a CSV
file created using the attached SQL script with the following COPY
command:
=# COPY t FROM '/tmp/t.csv' (FORMAT csv);
Environment
-----------
OS: Rocky Linux 9.6
CPU: Intel Core i7-10710U (6 Cores / 12 Threads, 1.1 GHz Base / 4.7
GHz Boost, AVX2 & FMA supported)
Time
----
master: 02:44.943
patch applied: 02:36.878 (about 5% faster)
Perf
----
Call graphs for each run are attached, and the CopyReadLineText rates are:
master: 12.15%
patch applied: 8.04%
Thoughts?
I would appreciate feedback on the implementation and any suggestions
for further improvement.
--
Best regards,
Shinya Kato
NTT OSS Center
Hi,
Thank you for working on this!
On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
>
> Hi hackers,
>
> I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
> text}) command and observed approximately a 5% performance
> improvement. Please see the detailed test results below.
I have been working on the same idea. I was not moving input_buf_ptr
as far as possible, so I think your approach is better.
Also, I did a benchmark on text format. I created a benchmark for line
length in a table being from 1 byte to 1 megabyte. The peak improvement
is line length being 4096 and the improvement is more than 20% [1], I
saw no regression on your patch.
> Idea
> ====
> The current text/CSV parser processes input byte-by-byte, checking
> whether each byte is a special character (\n, \r, quote, escape) or a
> regular character, and transitions states in a state machine. This
> sequential processing is inefficient and likely causes frequent branch
> mispredictions due to the many if statements.
>
> I thought this problem could be addressed by leveraging SIMD and
> vectorized operations for faster processing.
>
> Implementation Overview
> =======================
> 1. Create a vector of special characters (e.g., Vector8 nl =
> vector8_broadcast('\n');).
> 2. Load the input buffer into a Vector8 variable called chunk.
> 3. Perform vectorized operations between chunk and the special
> character vectors to check if the buffer contains any special
> characters.
> 4-1. If no special characters are found, advance the input_buf_ptr by
> sizeof(Vector8).
> 4-2. If special characters are found, advance the input_buf_ptr as far
> as possible, then fall back to the original text/CSV parser for
> byte-by-byte processing.
>
...
> Thought?
> I would appreciate feedback on the implementation and any suggestions
> for further improvement.
I have a couple of ideas that I was working on:
---
+ * However, SIMD optimization cannot be applied in the following cases:
+ * - Inside quoted fields, where escape sequences and closing quotes
+ * require sequential processing to handle correctly.
I think you can continue SIMD inside quoted fields. Only important
thing is you need to set last_was_esc to false when SIMD skipped the
chunk.
---
+ * - When the remaining buffer size is smaller than the size of a SIMD
+ * vector register, as SIMD operations require processing data in
+ * fixed-size chunks.
You run SIMD when 'copy_buf_len - input_buf_ptr >= sizeof(Vector8)'
but you only call CopyLoadInputBuf() when 'input_buf_ptr >=
copy_buf_len || need_data' so basically you need to wait at least the
sizeof(Vector8) character to pass for the next SIMD. And in the worst
case; if CopyLoadInputBuf() puts one character less than
sizeof(Vector8), then you can't ever run SIMD. I think we need to make
sure that CopyLoadInputBuf() loads at least the sizeof(Vector8)
character to the input_buf so we do not encounter that problem.
---
What do you think about adding SIMD to CopyReadAttributesText() and
CopyReadAttributesCSV() functions? When I add your SIMD approach to
CopyReadAttributesText() function, the improvement on the 4096 byte
line length input [1] goes from 20% to 30%.
---
I shared my ideas as a Feedback.txt file (.txt to stay off CFBot's
radar for this thread). I hope these help, please let me know if you
have any questions.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,
On Thu, 7 Aug 2025 at 14:15, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
> >
> > I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
> > text}) command and observed approximately a 5% performance
> > improvement. Please see the detailed test results below.
>
> Also, I did a benchmark on text format. I created a benchmark for line
> length in a table being from 1 byte to 1 megabyte. The peak improvement
> is line length being 4096 and the improvement is more than 20% [1], I
> saw no regression on your patch.
I did the same benchmark for the CSV format. The peak improvement is
line length being 4096 and the improvement is more than 25% [1]. I saw
a 5% regression on the 1 byte benchmark, there are no other
regressions.
> What do you think about adding SIMD to CopyReadAttributesText() and
> CopyReadAttributesCSV() functions? When I add your SIMD approach to
> CopyReadAttributesText() function, the improvement on the 4096 byte
> line length input [1] goes from 20% to 30%.
I wanted to try using SIMD in CopyReadAttributesCSV() as well. The
improvement on the 4096 byte line length input [1] goes from 25% to
35%, the regression on the 1 byte input is the same.
CopyReadAttributesCSV() changes are attached as feedback v2.
--
Regards,
Nazir Bilal Yavuz
Microsoft
On Thu, Aug 7, 2025 at 8:15 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> Thank you for working on this!
>
> On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
> >
> > Hi hackers,
> >
> > I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
> > text}) command and observed approximately a 5% performance
> > improvement. Please see the detailed test results below.
>
> I have been working on the same idea. I was not moving input_buf_ptr
> as far as possible, so I think your approach is better.
Great. I'm looking forward to working with you on this feature implementation.
> Also, I did a benchmark on text format. I created a benchmark for line
> length in a table being from 1 byte to 1 megabyte. The peak improvement
> is line length being 4096 and the improvement is more than 20% [1], I
> saw no regression on your patch.
Thank you for the additional benchmarks.
> I have a couple of ideas that I was working on:
> ---
>
> + * However, SIMD optimization cannot be applied in the following cases:
> + * - Inside quoted fields, where escape sequences and closing quotes
> + * require sequential processing to handle correctly.
>
> I think you can continue SIMD inside quoted fields. Only important
> thing is you need to set last_was_esc to false when SIMD skipped the
> chunk.
That's a clever point that last_was_esc should be reset to false when
a SIMD chunk is skipped. You're right about that specific case.
However, the core challenge is not what happens when we skip a chunk,
but what happens when a chunk contains special characters like quotes
or escapes. The main reason we avoid SIMD inside quoted fields is that
the parsing logic becomes fundamentally sequential and
context-dependent.
To correctly parse a "" as a single literal quote, we must perform a
lookahead to check the next character. This is an inherently
sequential operation that doesn't map well to SIMD's parallel nature.
Trying to handle this stateful logic with SIMD would lead to
significant implementation complexity, especially with edge cases like
an escape character falling on the last byte of a chunk.
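For illustration, the sequential lookahead being described looks roughly
like this (a hypothetical helper, not code from copyfromparse.c):

/*
 * Inside a quoted CSV field a quote character either doubles up
 * ("" -> one literal quote) or closes the field, and we cannot tell
 * which without peeking at the next byte.
 */
static const char *
skip_quoted_field(const char *p, const char *end, char quotec)
{
    /* p points just past the opening quote */
    while (p < end)
    {
        if (*p == quotec)
        {
            if (p + 1 < end && p[1] == quotec)
                p += 2;         /* escaped quote, stay inside the field */
            else
                return p + 1;   /* closing quote, the field ends here */
        }
        else
            p++;
    }
    return end;                 /* field continues in the next buffer load */
}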
> + * - When the remaining buffer size is smaller than the size of a SIMD
> + * vector register, as SIMD operations require processing data in
> + * fixed-size chunks.
>
> You run SIMD when 'copy_buf_len - input_buf_ptr >= sizeof(Vector8)'
> but you only call CopyLoadInputBuf() when 'input_buf_ptr >=
> copy_buf_len || need_data' so basically you need to wait at least the
> sizeof(Vector8) character to pass for the next SIMD. And in the worst
> case; if CopyLoadInputBuf() puts one character less than
> sizeof(Vector8), then you can't ever run SIMD. I think we need to make
> sure that CopyLoadInputBuf() loads at least the sizeof(Vector8)
> character to the input_buf so we do not encounter that problem.
I think you're probably right, but we only need to account for
sizeof(Vector8) when USE_NO_SIMD is not defined.
> What do you think about adding SIMD to CopyReadAttributesText() and
> CopyReadAttributesCSV() functions? When I add your SIMD approach to
> CopyReadAttributesText() function, the improvement on the 4096 byte
> line length input [1] goes from 20% to 30%.
Agreed, I will.
> I shared my ideas as a Feedback.txt file (.txt to stay off CFBot's
> radar for this thread). I hope these help, please let me know if you
> have any questions.
Thanks a lot!
On Mon, Aug 11, 2025 at 5:52 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Thu, 7 Aug 2025 at 14:15, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
> > >
> > > I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
> > > text}) command and observed approximately a 5% performance
> > > improvement. Please see the detailed test results below.
> >
> > Also, I did a benchmark on text format. I created a benchmark for line
> > length in a table being from 1 byte to 1 megabyte. The peak improvement
> > is line length being 4096 and the improvement is more than 20% [1], I
> > saw no regression on your patch.
>
> I did the same benchmark for the CSV format. The peak improvement is
> line length being 4096 and the improvement is more than 25% [1]. I saw
> a 5% regression on the 1 byte benchmark, there are no other
> regressions.
Thank you. I'm not too concerned about a regression when there's only
one byte per line.
> > What do you think about adding SIMD to CopyReadAttributesText() and
> > CopyReadAttributesCSV() functions? When I add your SIMD approach to
> > CopyReadAttributesText() function, the improvement on the 4096 byte
> > line length input [1] goes from 20% to 30%.
>
> I wanted to try using SIMD in CopyReadAttributesCSV() as well. The
> improvement on the 4096 byte line length input [1] goes from 25% to
> 35%, the regression on the 1 byte input is the same.
Yes, I'm on it. I'm currently adding the SIMD logic to
CopyReadAttributesCSV() as you suggested. I'll share the new version
of the patch soon.
--
Best regards,
Shinya Kato
NTT OSS Center
On Tue, Aug 12, 2025 at 4:25 PM Shinya Kato <shinya11.kato@gmail.com> wrote:
> > + * However, SIMD optimization cannot be applied in the following cases:
> > + * - Inside quoted fields, where escape sequences and closing quotes
> > + * require sequential processing to handle correctly.
> >
> > I think you can continue SIMD inside quoted fields. Only important
> > thing is you need to set last_was_esc to false when SIMD skipped the
> > chunk.
>
> That's a clever point that last_was_esc should be reset to false when
> a SIMD chunk is skipped. You're right about that specific case.
>
> However, the core challenge is not what happens when we skip a chunk,
> but what happens when a chunk contains special characters like quotes
> or escapes. The main reason we avoid SIMD inside quoted fields is that
> the parsing logic becomes fundamentally sequential and
> context-dependent.
>
> To correctly parse a "" as a single literal quote, we must perform a
> lookahead to check the next character. This is an inherently
> sequential operation that doesn't map well to SIMD's parallel nature.
>
> Trying to handle this stateful logic with SIMD would lead to
> significant implementation complexity, especially with edge cases like
> an escape character falling on the last byte of a chunk.
Ah, you're right. My apologies, I misunderstood the implementation. It
appears that SIMD can be used even within quoted strings.
I think it would be better not to use the SIMD path when last_was_esc
is true. The next character is likely to be a special character, and
handling this case outside the SIMD loop would also improve
readability by consolidating the last_was_esc toggle logic in one
place.
Furthermore, when inside a quote (in_quote) in CSV mode, the detection
of \n and \r can be disabled.
+ last_was_esc = false;
Regarding the implementation, I believe we must set last_was_esc to
false when advancing input_buf_ptr, as shown in the code below. For
this reason, I think it’s best to keep the current logic for toggling
last_was_esc.
+ int advance = pg_rightmost_one_pos32(mask);
+ input_buf_ptr += advance;
I've attached a new patch that includes these changes. Further
modifications are still in progress.
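For readers following along, the advance logic under discussion could be
sketched roughly as below. This is an illustration only: it assumes a
SIMD build (not USE_NO_SIMD), and vector8_highbit_mask() producing a
per-byte match bitmask is an assumption on my part rather than something
taken from the attached patch.

#include "c.h"
#include "port/simd.h"
#include "port/pg_bitutils.h"

/*
 * Illustration: advance *input_buf_ptr past bytes the state machine does
 * not need to inspect.  Callers would skip this path entirely while
 * last_was_esc is true.  Inside a quoted CSV field only the quote
 * character matters; \n and \r are ordinary data there.  Any byte skipped
 * over cannot have been an escape, so last_was_esc is cleared.
 */
static void
advance_plain(const char *buf, int *input_buf_ptr, bool in_quote,
              bool *last_was_esc, char quotec)
{
    Vector8 chunk;
    uint32  mask;

    vector8_load(&chunk, (const uint8 *) &buf[*input_buf_ptr]);

    mask = vector8_highbit_mask(vector8_eq(chunk,
                                           vector8_broadcast((uint8) quotec)));
    if (!in_quote)
    {
        mask |= vector8_highbit_mask(vector8_eq(chunk,
                                                vector8_broadcast('\n')));
        mask |= vector8_highbit_mask(vector8_eq(chunk,
                                                vector8_broadcast('\r')));
    }

    if (mask == 0)
        *input_buf_ptr += sizeof(Vector8);  /* no special byte in the chunk */
    else
        *input_buf_ptr += pg_rightmost_one_pos32(mask); /* stop at first match */

    *last_was_esc = false;
}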
--
Best regards,
Shinya Kato
NTT OSS Center
Hi,
On Thu, 14 Aug 2025 at 05:25, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
>
> Following Nazir's findings about 4096 bytes being the performant line length, I did more benchmarks from my side on both TEXT and CSV formats with two different cases of normal data (no special characters) and data with many special characters.
>
> Results are good as expected and similar to previous benchmarks
> ~30.9% faster copy in TEXT format
> ~32.4% faster copy in CSV format
> 20%-30% reduces cycles per instructions
>
> In the case of doing a lot of special characters in the lines (e.g., tables with large numbers of columns maybe), we obviously expect regressions here because of the overhead of many fallbacks to scalar processing.
> Results for a 1/3 of line length of special characters:
> ~43.9% slower copy in TEXT format
> ~16.7% slower copy in CSV format
> So for even less occurrences of special characters or wider distance between there might still be some regressions in this case, a non-significant case maybe, but can be treated in other patches if we consider to not use SIMD path sometimes.
>
> I hope this helps more and confirms the patch.
Thanks for running that benchmark! Would you mind sharing a reproducer
for the regression you observed?
--
Regards,
Nazir Bilal Yavuz
Microsoft
On Thu, 7 Aug 2025 at 14:15, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> I have a couple of ideas that I was working on:
> ---
>
> + * However, SIMD optimization cannot be applied in the following cases:
> + * - Inside quoted fields, where escape sequences and closing quotes
> + * require sequential processing to handle correctly.
>
> I think you can continue SIMD inside quoted fields. Only important
> thing is you need to set last_was_esc to false when SIMD skipped the
> chunk.

There is a trick with doing carryless multiplication with -1 that can be
used to SIMD process transitions between quoted/not-quoted. [1] This is
able to convert a bitmask of unescaped quote character positions to a
quote mask in a single operation. I last looked at it 5 years ago, but I
remember coming to the conclusion that it would work for implementing
PostgreSQL's interpretation of CSV.

[1] https://github.com/geofflangdale/simdcsv/blob/master/src/main.cpp#L76

--
Ants
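For reference, the trick Ants mentions can be sketched like this
(illustration only, not part of the patches in this thread; it requires
PCLMULQDQ, e.g. building with -mpclmul):

#include <stdint.h>
#include <emmintrin.h>
#include <wmmintrin.h>          /* _mm_clmulepi64_si128 */

/*
 * Given a bitmask with one bit set per unescaped quote character, return
 * a mask marking every position that lies inside a quoted region.
 * Carry-less multiplication by an all-ones ("-1") operand computes the
 * prefix XOR of the input bits in a single instruction.
 */
static uint64_t
quote_region_mask(uint64_t quote_bits)
{
    __m128i q = _mm_set_epi64x(0, (long long) quote_bits);
    __m128i ones = _mm_set1_epi8((char) 0xFF);
    __m128i prod = _mm_clmulepi64_si128(q, ones, 0);

    return (uint64_t) _mm_cvtsi128_si64(prod);
}

Each result bit is the XOR of all quote bits at or below that position,
so runs of set bits cover the spans between an opening quote and its
closing quote.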
Hi,

On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> I am able to reproduce the regression you mentioned but both
> regressions are %20 on my end. I found that (by experimenting) SIMD
> causes a regression if it advances less than 5 characters.
>
> So, I implemented a small heuristic. It works like that:
>
> - If advance < 5 -> insert a sleep penalty (n cycles).

'sleep' might be a poor word choice here. I meant skipping SIMD for n
number of times.

--
Regards,
Nazir Bilal Yavuz
Microsoft
On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
> Hi,
>
> On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>> I am able to reproduce the regression you mentioned but both
>> regressions are %20 on my end. I found that (by experimenting) SIMD
>> causes a regression if it advances less than 5 characters.
>>
>> So, I implemented a small heuristic. It works like that:
>>
>> - If advance < 5 -> insert a sleep penalty (n cycles).
> 'sleep' might be a poor word choice here. I meant skipping SIMD for n
> number of times.

I was thinking a bit about that this morning. I wonder if it might be
better instead of having a constantly applied heuristic like this, it
might be better to do a little extra accounting in the first, say, 1000
lines of an input file, and if less than some portion of the input is
found to be special characters then switch to the SIMD code. What that
portion should be would need to be determined by some experimentation
with a variety of typical workloads, but given your findings 20% seems
like a good starting point.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Thu, 14 Aug 2025 at 18:00, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
>> Thanks for running that benchmark! Would you mind sharing a reproducer
>> for the regression you observed?
>
> Of course, I attached the sql to generate the text and csv test files.
> If having a 1/3 of line length of special characters can be an exaggeration, something lower might still reproduce some regressions of course for the same idea.
Thank you so much!
I am able to reproduce the regression you mentioned but both
regressions are %20 on my end. I found that (by experimenting) SIMD
causes a regression if it advances less than 5 characters.
So, I implemented a small heuristic. It works like that:
- If advance < 5 -> insert a sleep penalty (n cycles).
- Each time advance < 5, n is doubled.
- Each time advance ≥ 5, n is halved.
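As an illustration of that penalty scheme (names and structure are my
guesses, not the POC patch itself):

/*
 * After a SIMD attempt that advanced fewer than 5 bytes, skip the SIMD
 * path for the next "penalty" attempts; the penalty doubles on each poor
 * attempt and halves on each good one.
 */
static int  simd_skip_remaining = 0;
static int  simd_penalty = 1;

static bool
simd_worth_trying(void)
{
    if (simd_skip_remaining > 0)
    {
        simd_skip_remaining--;
        return false;           /* still paying the penalty: scalar path */
    }
    return true;
}

static void
simd_report_advance(int advance)
{
    if (advance < 5)
    {
        simd_penalty *= 2;
        simd_skip_remaining = simd_penalty;
    }
    else if (simd_penalty > 1)
        simd_penalty /= 2;
}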
I am sharing a POC patch to show heuristic, it can be applied on top
of v1-0001. Heuristic version has the same performance improvements
with the v1-0001 but the regression is %5 instead of %20 compared to
the master.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,

On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >> I am able to reproduce the regression you mentioned but both
> >> regressions are %20 on my end. I found that (by experimenting) SIMD
> >> causes a regression if it advances less than 5 characters.
> >>
> >> So, I implemented a small heuristic. It works like that:
> >>
> >> - If advance < 5 -> insert a sleep penalty (n cycles).
> > 'sleep' might be a poor word choice here. I meant skipping SIMD for n
> > number of times.
>
> I was thinking a bit about that this morning. I wonder if it might be
> better instead of having a constantly applied heuristic like this, it
> might be better to do a little extra accounting in the first, say, 1000
> lines of an input file, and if less than some portion of the input is
> found to be special characters then switch to the SIMD code. What that
> portion should be would need to be determined by some experimentation
> with a variety of typical workloads, but given your findings 20% seems
> like a good starting point.

I implemented a heuristic something similar to this. It is a mix of the
previous heuristic and your idea, it works like that:

Overall logic is that we will not run SIMD for the entire line and we
decide if it is worth it to run SIMD for the next lines.

1 - We will try SIMD and decide if it is worth it to run SIMD.
1.1 - If it is worth it, we will continue to run SIMD and we will halve
the simd_last_sleep_cycle variable.
1.2 - If it is not worth it, we will double the simd_last_sleep_cycle and
we will not run SIMD for these many lines.
1.3 - After skipping simd_last_sleep_cycle lines, we will go back to #1.

Note: simd_last_sleep_cycle can not pass 1024, so we will run SIMD for
each 1024 lines at max.

With this heuristic the regression is limited by %2 in the worst case.

Patches are attached. The first patch is v2-0001 from Shinya with the
'-Werror=maybe-uninitialized' fixes and the pgindent changes; 0002 is the
actual heuristic patch.

--
Regards,
Nazir Bilal Yavuz
Microsoft
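A sketch of the per-line version described above, again purely as an
illustration ("worth it" stands in for whatever profitability test the
attached patch applies to the sampled line):

/*
 * Lines are either sampled with SIMD or skipped for a while; the skip
 * length adapts between 1 and 1024 depending on whether the last sampled
 * line benefited from SIMD.
 */
static int  simd_last_sleep_cycle = 1;  /* lines to skip after a poor sample */
static int  simd_sleep_remaining = 0;

static bool
line_uses_simd(void)
{
    if (simd_sleep_remaining > 0)
    {
        simd_sleep_remaining--;
        return false;           /* sleeping: parse this line byte-by-byte */
    }
    return true;                /* sample this line with SIMD */
}

static void
line_report_result(bool simd_was_worth_it)
{
    if (simd_was_worth_it)
    {
        if (simd_last_sleep_cycle > 1)
            simd_last_sleep_cycle /= 2;
    }
    else
    {
        if (simd_last_sleep_cycle < 1024)
            simd_last_sleep_cycle *= 2;
        simd_sleep_remaining = simd_last_sleep_cycle;
    }
}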
I've rebenchmarked the new heuristic patch. We still have the previous
improvements, ranging from 15% to 30%. For regressions I see at maximum 3%
or 4% in the worst case, so this is solid.
I'm also trying the idea of doing SIMD inside quotes with prefix XOR using carry less multiplication avoiding the slow path in all cases even with weird looking input, but it needs to take into consideration the availability of PCLMULQDQ instruction set with <wmmintrin.h> and here we go, it quickly starts to become dirty OR we can wait for the decision to start requiring x86-64-v2 or v3 which has SSE4.2 and AVX2.
Regards,
Ayoub Kazar
Hi,

On Thu, 16 Oct 2025 at 17:29, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Overall logic is that we will not run SIMD for the entire line and we
> decide if it is worth it to run SIMD for the next lines.

I had a typo there, the correct sentence is: "Overall logic is that we
*will* run SIMD for the entire line and we decide if it is worth it to
run SIMD for the next lines."

--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,

On Sat, 18 Oct 2025 at 21:46, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
>
> Hello,
>
> I've rebenchmarked the new heuristic patch. We still have the previous
> improvements, ranging from 15% to 30%. For regressions I see at maximum
> 3% or 4% in the worst case, so this is solid.

Thank you so much for doing this! The results look nice, do you think
there are any other benchmarks that might be interesting to try?

> I'm also trying the idea of doing SIMD inside quotes with prefix XOR
> using carry less multiplication avoiding the slow path in all cases even
> with weird looking input, but it needs to take into consideration the
> availability of PCLMULQDQ instruction set with <wmmintrin.h> and here we
> go, it quickly starts to become dirty OR we can wait for the decision to
> start requiring x86-64-v2 or v3 which has SSE4.2 and AVX2.

I can not quite picture this, would you mind sharing a few examples or
patches?

--
Regards,
Nazir Bilal Yavuz
Microsoft
On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
> Hi,
>
> On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew@dunslane.net> wrote:
>> On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
>>> Hi,
>>>
>>> On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>>>> I am able to reproduce the regression you mentioned but both
>>>> regressions are %20 on my end. I found that (by experimenting) SIMD
>>>> causes a regression if it advances less than 5 characters.
>>>>
>>>> So, I implemented a small heuristic. It works like that:
>>>>
>>>> - If advance < 5 -> insert a sleep penalty (n cycles).
>>> 'sleep' might be a poor word choice here. I meant skipping SIMD for n
>>> number of times.
>>
>> I was thinking a bit about that this morning. I wonder if it might be
>> better instead of having a constantly applied heuristic like this, it
>> might be better to do a little extra accounting in the first, say, 1000
>> lines of an input file, and if less than some portion of the input is
>> found to be special characters then switch to the SIMD code. What that
>> portion should be would need to be determined by some experimentation
>> with a variety of typical workloads, but given your findings 20% seems
>> like a good starting point.
>
> I implemented a heuristic something similar to this. It is a mix of the
> previous heuristic and your idea, it works like that:
>
> Overall logic is that we will not run SIMD for the entire line and we
> decide if it is worth it to run SIMD for the next lines.
>
> 1 - We will try SIMD and decide if it is worth it to run SIMD.
> 1.1 - If it is worth it, we will continue to run SIMD and we will halve
> the simd_last_sleep_cycle variable.
> 1.2 - If it is not worth it, we will double the simd_last_sleep_cycle and
> we will not run SIMD for these many lines.
> 1.3 - After skipping simd_last_sleep_cycle lines, we will go back to #1.
>
> Note: simd_last_sleep_cycle can not pass 1024, so we will run SIMD for
> each 1024 lines at max.
>
> With this heuristic the regression is limited by %2 in the worst case.
My worry is that the worst case is actually quite common. Sparse data sets dominated by a lot of null values (and hence lots of special characters) are very common. Are people prepared to accept a 2% regression on load times for such data sets?
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
> On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
>> With this heuristic the regression is limited by %2 in the worst case.
>
> My worry is that the worst case is actually quite common. Sparse data sets
> dominated by a lot of null values (and hence lots of special characters) are
> very common. Are people prepared to accept a 2% regression on load times for
> such data sets?

Without knowing how common it is, I think it's difficult to judge whether
2% is a reasonable trade-off. If <5% of workloads might see a small
regression while the other >95% see double-digit percentage improvements,
then I might argue that it's fine. But I'm not sure we have any way to
know those sorts of details at the moment.

I'm also at least a little skeptical about the 2% number. IME that's
generally within the noise range and can vary greatly between machines and
test runs.

--
nathan
On 2025-10-20 Mo 1:04 PM, Nathan Bossart wrote:
> On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
>> On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
>>> With this heuristic the regression is limited by %2 in the worst case.
>>
>> My worry is that the worst case is actually quite common. Sparse data sets
>> dominated by a lot of null values (and hence lots of special characters)
>> are very common. Are people prepared to accept a 2% regression on load
>> times for such data sets?
>
> Without knowing how common it is, I think it's difficult to judge whether
> 2% is a reasonable trade-off. If <5% of workloads might see a small
> regression while the other >95% see double-digit percentage improvements,
> then I might argue that it's fine. But I'm not sure we have any way to
> know those sorts of details at the moment.
I guess what I don't understand is why we actually need to do the test continuously, even using an adaptive algorithm. Data files in my experience usually have lines with fairly similar shapes. It's highly unlikely that you will get the first 1000 (say) lines of a file that are rich in special characters and then some later significant section that isn't, or vice versa. Therefore, doing the test once should yield the correct answer that can be applied to the rest of the file. That should reduce the worst case regression to ~0% without sacrificing any of the performance gains. I appreciate the elegance of what Bilal has done here, but it does seem like overkill.
> I'm also at least a little skeptical about the 2% number. IME that's
> generally within the noise range and can vary greatly between machines and
> test runs.
Fair point.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hi,

On Mon, 20 Oct 2025 at 23:32, Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2025-10-20 Mo 1:04 PM, Nathan Bossart wrote:
> > On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
> >> On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
> >>> With this heuristic the regression is limited by %2 in the worst case.
> >>
> >> My worry is that the worst case is actually quite common. Sparse data
> >> sets dominated by a lot of null values (and hence lots of special
> >> characters) are very common. Are people prepared to accept a 2%
> >> regression on load times for such data sets?
> >
> > Without knowing how common it is, I think it's difficult to judge whether
> > 2% is a reasonable trade-off. If <5% of workloads might see a small
> > regression while the other >95% see double-digit percentage improvements,
> > then I might argue that it's fine. But I'm not sure we have any way to
> > know those sorts of details at the moment.
>
> I guess what I don't understand is why we actually need to do the test
> continuously, even using an adaptive algorithm. Data files in my
> experience usually have lines with fairly similar shapes. It's highly
> unlikely that you will get the first 1000 (say) lines of a file that are
> rich in special characters and then some later significant section that
> isn't, or vice versa. Therefore, doing the test once should yield the
> correct answer that can be applied to the rest of the file. That should
> reduce the worst case regression to ~0% without sacrificing any of the
> performance gains. I appreciate the elegance of what Bilal has done here,
> but it does seem like overkill.

I think the problem is deciding how many lines to process before deciding
for the rest. 1000 lines could work for the small sized data but it might
not work for the big sized data. Also, it might cause worse regressions
for the small sized data. Because of this reason, I tried to implement a
heuristic that will work regardless of the size of the data. The last
heuristic I suggested will run SIMD for approximately (#number_of_lines /
1024 [1024 is the max number of lines to sleep before running SIMD again])
lines if all characters in the data are special characters.

--
Regards,
Nazir Bilal Yavuz
Microsoft
> Thank you so much for doing this! The results look nice, do you think
> there are any other benchmarks that might be interesting to try?

>> I'm also trying the idea of doing SIMD inside quotes with prefix XOR
>> using carry less multiplication avoiding the slow path in all cases even
>> with weird looking input, but it needs to take into consideration the
>> availability of PCLMULQDQ instruction set with <wmmintrin.h> and here we
>> go, it quickly starts to become dirty OR we can wait for the decision to
>> start requiring x86-64-v2 or v3 which has SSE4.2 and AVX2.

> I can not quite picture this, would you mind sharing a few examples or
> patches?
[1] https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/
[2] https://github.com/AyoubKaz07/postgres/commit/73c6ecfedae4cce5c3f375fd6074b1ca9dfe1daf
Currently we are at 200-400Mbps, which isn't that terrible compared to
production and non-production grade parsers (of course we don't only parse
in our case); also we are using SSE2 only, so theoretically if we add
support for AVX later on we'll have even better numbers. Maybe more micro
optimizations to the current heuristic can squeeze it more.
[1] https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/
[2] https://github.com/AyoubKaz07/postgres/commit/73c6ecfedae4cce5c3f375fd6074b1ca9dfe1daf

Regards,
Ayoub Kazar.
On Tue, Oct 21, 2025 at 12:09:27AM +0300, Nazir Bilal Yavuz wrote:
> I think the problem is deciding how many lines to process before
> deciding for the rest. 1000 lines could work for the small sized data
> but it might not work for the big sized data. Also, it might cause a
> worse regressions for the small sized data.

IMHO we have some leeway with smaller amounts of data. If COPY FROM for
1000 rows takes 19 milliseconds as opposed to 11 milliseconds, it seems
unlikely users would be inconvenienced all that much. (Those numbers are
completely made up in order to illustrate my point.)

> Because of this reason, I
> tried to implement a heuristic that will work regardless of the size
> of the data. The last heuristic I suggested will run SIMD for
> approximately (#number_of_lines / 1024 [1024 is the max number of
> lines to sleep before running SIMD again]) lines if all characters in
> the data are special characters.

I wonder if we could mitigate the regression further by spacing out the
checks a bit more. It could be worth comparing a variety of values to
identify what works best with the test data.

--
nathan
On Tue, Oct 21, 2025 at 08:17:01AM +0200, KAZAR Ayoub wrote:
>>> I'm also trying the idea of doing SIMD inside quotes with prefix XOR
>>> using carry less multiplication avoiding the slow path in all cases even
>>> with weird looking input, but it needs to take into consideration the
>>> availability of PCLMULQDQ instruction set with <wmmintrin.h> and here we
>>> go, it quickly starts to become dirty OR we can wait for the decision to
>>> start requiring x86-64-v2 or v3 which has SSE4.2 and AVX2.
>
> [...]
>
> Currently we are at 200-400Mbps which isn't that terrible compared to
> production and non production grade parsers (of course we don't only parse
> in our case), also we are using SSE2 only so theoretically if we add
> support for avx later on we'll have even better numbers.
> Maybe more micro optimizations to the current heuristic can squeeze it more.

I'd greatly prefer that we stick with SSE2/Neon (i.e., simd.h) unless the
gains are extraordinary. Beyond the inherent complexity of using
architecture-specific intrinsics, you also have to deal with configure-time
checks, runtime checks, and function pointer juggling. That tends to be a
lot of work for the amount of gain.

--
nathan
Hi,

On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Oct 21, 2025 at 12:09:27AM +0300, Nazir Bilal Yavuz wrote:
> > I think the problem is deciding how many lines to process before
> > deciding for the rest. 1000 lines could work for the small sized data
> > but it might not work for the big sized data. Also, it might cause a
> > worse regressions for the small sized data.
>
> IMHO we have some leeway with smaller amounts of data. If COPY FROM for
> 1000 rows takes 19 milliseconds as opposed to 11 milliseconds, it seems
> unlikely users would be inconvenienced all that much. (Those numbers are
> completely made up in order to illustrate my point.)
>
> > Because of this reason, I
> > tried to implement a heuristic that will work regardless of the size
> > of the data. The last heuristic I suggested will run SIMD for
> > approximately (#number_of_lines / 1024 [1024 is the max number of
> > lines to sleep before running SIMD again]) lines if all characters in
> > the data are special characters.
>
> I wonder if we could mitigate the regression further by spacing out the
> checks a bit more. It could be worth comparing a variety of values to
> identify what works best with the test data.

Do you mean that instead of doubling the SIMD sleep, we should multiply it
by 3 (or another factor)? Or are you referring to increasing the maximum
sleep from 1024? Or possibly both?

--
Regards,
Nazir Bilal Yavuz
Microsoft
On Wed, Oct 22, 2025 at 03:33:37PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I wonder if we could mitigate the regression further by spacing out the
>> checks a bit more. It could be worth comparing a variety of values to
>> identify what works best with the test data.
>
> Do you mean that instead of doubling the SIMD sleep, we should
> multiply it by 3 (or another factor)? Or are you referring to
> increasing the maximum sleep from 1024? Or possibly both?

I'm not sure of the precise details, but the main thrust of my suggestion
is to assume that whatever sampling you do to determine whether to use SIMD
is good for a larger chunk of data. That is, if you are sampling 1K lines
and then using the result to choose whether to use SIMD for the next 100K
lines, we could instead bump the latter number to 1M lines (or something).
That way we minimize the regression for relatively uniform data sets while
retaining some ability to adapt in case things change halfway through a
large table.

--
nathan
On 2025-10-22 We 3:24 PM, Nathan Bossart wrote:
> On Wed, Oct 22, 2025 at 03:33:37PM +0300, Nazir Bilal Yavuz wrote:
>> On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
>>> I wonder if we could mitigate the regression further by spacing out the
>>> checks a bit more. It could be worth comparing a variety of values to
>>> identify what works best with the test data.
>> Do you mean that instead of doubling the SIMD sleep, we should
>> multiply it by 3 (or another factor)? Or are you referring to
>> increasing the maximum sleep from 1024? Or possibly both?
> I'm not sure of the precise details, but the main thrust of my suggestion
> is to assume that whatever sampling you do to determine whether to use SIMD
> is good for a larger chunk of data. That is, if you are sampling 1K lines
> and then using the result to choose whether to use SIMD for the next 100K
> lines, we could instead bump the latter number to 1M lines (or something).
> That way we minimize the regression for relatively uniform data sets while
> retaining some ability to adapt in case things change halfway through a
> large table.
>
I'd be ok with numbers like this, although I suspect the numbers of
cases where we see shape shifts like this in the middle of a data set
would be vanishingly small.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hello!

I wanted to reproduce the results using the files attached by Shinya Kato
and Ayoub Kazar. I installed a postgres compiled from master, and then I
installed a postgres built from master plus Nazir Bilal Yavuz's v3 patches
applied.

The master+v3patches postgres naturally performed better on copying into
the database: anywhere from 11% better for the t.csv file produced by
Shinya's test.sql, to 35% better copying in the t_4096_none.csv file
created by Ayoub Kazar's simd-copy-from-bench.sql.

But here's where it gets weird. The two files created by Ayoub Kazar's
simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt
and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5%
respectively. This seems impossible.

A few things I should note:

I timed the commands using the Unix time command, like so:

time psql -X -U mwood -h localhost -d postgres -c '\copy t from /tmp/t_4096_escape.txt'

For each file, I timed the copy 6 times and took the average.

This was done on my work Linux machine while also running Chrome and an
Open Office spreadsheet; not a dedicated machine only running postgres.

All of the copy results took between 4.5 seconds (Shinya's t.csv copied
into postgres compiled from master) and 2 seconds (Ayoub Kazar's
t_4096_none.csv copied into postgres compiled from master plus Nazir's v3
patches).

Perhaps I need to fiddle with the provided SQL to produce larger files to
get longer run times? Maybe sub-second differences won't tell as
interesting a story as minutes-long copy commands?
On Tue, Nov 11, 2025 at 11:23 PM Manni Wood <manni.wood@enterprisedb.com> wrote:
> Hello!
>
> I wanted to reproduce the results using the files attached by Shinya Kato
> and Ayoub Kazar. I installed a postgres compiled from master, and then I
> installed a postgres built from master plus Nazir Bilal Yavuz's v3 patches
> applied.
>
> The master+v3patches postgres naturally performed better on copying into
> the database: anywhere from 11% better for the t.csv file produced by
> Shinya's test.sql, to 35% better copying in the t_4096_none.csv file
> created by Ayoub Kazar's simd-copy-from-bench.sql.
>
> But here's where it gets weird. The two files created by Ayoub Kazar's
> simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt
> and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5%
> respectively. This seems impossible.
>
> A few things I should note:
>
> I timed the commands using the Unix time command, like so:
>
> time psql -X -U mwood -h localhost -d postgres -c '\copy t from /tmp/t_4096_escape.txt'
>
> For each file, I timed the copy 6 times and took the average.
>
> This was done on my work Linux machine while also running Chrome and an
> Open Office spreadsheet; not a dedicated machine only running postgres.

Hello,

I think if you do a perf benchmark (if it still reproduces) it would
probably be possible to explain why it's performing like that by looking
at the CPI and other metrics and comparing it to my findings.

What I also suggest is to make the data even closer to the worst case,
i.e. more special characters, where the switching between SIMD and scalar
processing hurts (in the simd-copy-from-bench.sql file); if it still does
a good job then there's something to look at.

> All of the copy results took between 4.5 seconds (Shinya's t.csv copied
> into postgres compiled from master) and 2 seconds (Ayoub Kazar's
> t_4096_none.csv copied into postgres compiled from master plus Nazir's v3
> patches).
>
> Perhaps I need to fiddle with the provided SQL to produce larger files to
> get longer run times? Maybe sub-second differences won't tell as
> interesting a story as minutes-long copy commands?

I did try it on some GBs (around 2-5 GB only); the differences were not
that much, but if you can run this on more GBs (at least 10 GB) it would
be good to look at, although I don't suspect anything interesting since
the shape of data is the same for the totality of the COPY.

Thanks for the info.

Regards,
Ayoub Kazar.