Discussion: Speed up COPY TO text/CSV parsing using SIMD
Hello,
Following Nazir's recommendation to move this to a different thread so it can be looked at separately.
On Thu, Jan 8, 2026 at 2:49 PM Manni Wood <manni.wood@enterprisedb.com> wrote:
On Wed, 24 Dec 2025 at 18:08, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
>
> Hello,
> Following the same path of optimizing COPY FROM using SIMD, I found that COPY TO can also benefit from this.
>
> I attached a small patch that uses SIMD to skip over data, advancing until the first special character is found, then falls back to scalar processing for that character and re-enters the SIMD path...
> There are two ways to do this:
> 1) Essentially we do SIMD until we find a special character, then continue scalar path without re-entering SIMD again.
> - This gives 10% to 30% speedups depending on the weight of special characters in the attribute; we don't lose anything here since it advances with SIMD until it can't (using the previous scripts: 1/3, 2/3 special chars).
>
> 2) Do SIMD path, then use scalar path when we hit a special character, keep re-entering the SIMD path each time.
> - This is equivalent to the COPY FROM story; we'll need to find the same heuristic to use for both COPY FROM/TO to reduce the regressions (same regressions: around 20% to 30% with 1/3, 2/3 special chars).
>
> Something else to note is that the scalar path for COPY TO isn't as heavy as the state machine in COPY FROM.
>
> So if we find the sweet spot for the heuristic, doing the same for COPY TO will be trivial and always beneficial.
> Attached is 0004 which is option 1 (SIMD without re-entering); 0005 is the second one.

Ayoub Kazar, I tested your v4 "copy to" patch, doing everything in RAM, and using the cpupower tips from above. (I wanted to test your v5, but `git apply --check` gave me an error, so I can look at that another day.)

The results look great:

master: (forgot to get commit hash)
text, no special: 8165
text, 1/3 special: 22662
csv, no special: 9619
csv, 1/3 special: 23213
v4 (copy to)
text, no special: 4577 (43.9% speedup)
text, 1/3 special: 22847 (0.8% regression)
csv, no special: 4720 (50.9% speedup)
csv, 1/3 special: 23195 (0.07% regression)

Seems like a very clear win to me!

--
Manni Wood
EDB: https://www.enterprisedb.com
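For readers following along, the scanning idea behind option 1 can be sketched in portable C. This is not the patch's code (the actual patch would build on PostgreSQL's vector helpers in simd.h); here a 64-bit SWAR word stands in for a SIMD register, and the delimiter/quote/control-character set is illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Broadcast a byte value into every lane of a 64-bit word. */
static inline uint64_t
repeat_byte(uint8_t b)
{
	return (uint64_t) b * 0x0101010101010101ULL;
}

/* Nonzero iff some byte of v equals b (classic "haszero" bit trick). */
static inline uint64_t
bytes_match(uint64_t v, uint8_t b)
{
	uint64_t	x = v ^ repeat_byte(b);

	return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

/* Nonzero iff some byte of v is less than n (valid for n <= 128). */
static inline uint64_t
bytes_less(uint64_t v, uint8_t n)
{
	return (v - repeat_byte(n)) & ~v & 0x8080808080808080ULL;
}

/*
 * Option 1 in miniature: advance 8 bytes at a time while no byte needs
 * special handling, then let the scalar loop locate the first special
 * byte.  Returns the offset of the first special byte, or len if none.
 */
static size_t
scan_special(const char *s, size_t len, char delim, char quote)
{
	size_t		i = 0;

	for (; i + 8 <= len; i += 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, s + i, 8);
		if (bytes_match(chunk, (uint8_t) delim) |
			bytes_match(chunk, (uint8_t) quote) |
			bytes_less(chunk, 0x20))	/* control chars need escaping */
			break;				/* scalar loop pinpoints the exact byte */
	}

	for (; i < len; i++)
	{
		unsigned char c = (unsigned char) s[i];

		if (c == (unsigned char) delim || c == (unsigned char) quote ||
			c < 0x20)
			return i;
	}
	return len;
}
```

Option 2 differs only in the caller: after the scalar code handles the special byte, it jumps back into the chunked loop instead of staying scalar for the rest of the attribute.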
The COPY FROM SIMD optimization is still under review, but applying the same ideas to COPY TO turned out to be trivial; the attached patch gives very nice speedups, as confirmed by Manni's benchmarks.
Regards,
Ayoub
Attachments
Hi,

On 2026-02-12 22:07:52 +0100, KAZAR Ayoub wrote:
> Currently optimizing COPY FROM using SIMD is still under review, but for
> the case of COPY TO using the same ideas, we found that the problem is
> trivial, the attached patch gives very nice speedups as confirmed by
> Manni's benchmarks.

I have a hard time believing that adding a strlen() to the handling of a
short column won't be a measurable overhead with lots of short attributes.
Particularly because the patch afaict will call it repeatedly if there are
any to-be-escaped characters.

I also don't think it's good how much code this repeats. I think you'd have
to start with preparatory moving the existing code into static inline helper
functions and then introduce SIMD into those.

Greetings,

Andres Freund
Hi,
On Thu, Feb 12, 2026 at 10:25 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2026-02-12 22:07:52 +0100, KAZAR Ayoub wrote:
> Currently optimizing COPY FROM using SIMD is still under review, but for
> the case of COPY TO using the same ideas, we found that the problem is
> trivial, the attached patch gives very nice speedups as confirmed by
> Manni's benchmarks.
> I have a hard time believing that adding a strlen() to the handling of a short
> column won't be a measurable overhead with lots of short attributes.
> Particularly because the patch afaict will call it repeatedly if there are any
> to-be-escaped characters.
Thanks for pointing that out. Here's what I did:
1) In the previous patch, strlen() was called twice if a CSV attribute needed quoting; the attached patch computes the length at the beginning and reuses it for both SIMD paths, so basically one call.
2) If an attribute needs encoding conversion, we must recalculate the string length because it can grow (so two calls at most, in all cases).
3) To simulate the worst case, I benchmarked this against master on tables with 100, 500, and 1000 columns, all integers only, where one would want to process the whole row in a single pass rather than compute the length of such short attributes:
1000 columns:
TEXT: 17% regression
CSV: 3.4% regression
500 columns:
TEXT: 17.7% regression
CSV: 3.1% regression
100 columns:
TEXT: 17.3% regression
CSV: 3% regression
The results are a bit unstable, but the overhead in worst cases like this is really significant. I can't say whether the trade-off is worth it, so thoughts on this?
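One possible way to attack the short-attribute regression (purely a sketch, not from the patch; the 16-byte cutoff is invented for illustration) is to keep master's strlen-free scalar walk for the first few bytes, and only pay for strlen() plus the SIMD setup once the value has proven itself long:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative cutoff: below this, strlen() plus vector setup likely
 * costs more than it saves on values like short integers. */
#define SHORT_ATTR_CUTOFF 16

/*
 * Hybrid length scan: walk the first SHORT_ATTR_CUTOFF bytes the way the
 * existing scalar path does (no strlen() needed); only if the value turns
 * out to be longer, call strlen() once and let the SIMD path take over.
 * *used_simd records which path ran, so the decision is observable.
 */
static size_t
attr_len_hybrid(const char *value, int *used_simd)
{
	for (size_t i = 0; i < SHORT_ATTR_CUTOFF; i++)
	{
		if (value[i] == '\0')
		{
			*used_simd = 0;		/* short attribute: scalar path, no strlen() */
			return i;
		}
	}
	*used_simd = 1;				/* long attribute: one strlen(), then SIMD */
	return strlen(value);
}
```

Whether such a cutoff recovers the all-integer cases above without costing the long-attribute cases is exactly the heuristic question the COPY FROM thread is wrestling with.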
> I also don't think it's good how much code this repeats. I think you'd have to
> start with preparatory moving the existing code into static inline helper
> functions and then introduce SIMD into those.
Done, though I'm not sure this is the right place to put the helpers; let me know.
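To illustrate the shape of the refactoring Andres describes (names here are invented, not taken from the patch): the per-character test moves into a static inline helper, the scalar loop becomes a small function built on it, and a later patch can add a SIMD fast path that delegates to the same scalar function, so the escaping rules live in exactly one place:

```c
#include <assert.h>
#include <stddef.h>

/* Shared per-character predicate: one definition used by both paths.
 * The character set here is illustrative (text-format-style escaping). */
static inline int
char_needs_special_handling(unsigned char c, char delim)
{
	return c == (unsigned char) delim || c == '\\' || c < 0x20;
}

/* Scalar scan, factored out so a SIMD fast path can reuse it. */
static size_t
scan_scalar(const char *s, size_t start, size_t len, char delim)
{
	for (size_t i = start; i < len; i++)
	{
		if (char_needs_special_handling((unsigned char) s[i], delim))
			return i;
	}
	return len;
}

/*
 * Where the SIMD version would slot in: a vector loop advances i over
 * clean chunks (stubbed out here), then scan_scalar() handles the tail
 * and pinpoints any special byte, with no duplicated escaping logic.
 */
static size_t
scan_with_fast_path(const char *s, size_t len, char delim)
{
	size_t		i = 0;

	/* ... vector loop advancing i past clean 16-byte chunks goes here ... */
	return scan_scalar(s, i, len, delim);
}
```

With this split done first as a preparatory patch, the SIMD commit itself stays small and easy to review.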
Regards,
Ayoub