Re: Speed up COPY FROM text/CSV parsing using SIMD
| From | Nazir Bilal Yavuz |
|---|---|
| Subject | Re: Speed up COPY FROM text/CSV parsing using SIMD |
| Date | |
| Msg-id | CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com |
| In response to | Re: Speed up COPY FROM text/CSV parsing using SIMD (Andrew Dunstan <andrew@dunslane.net>) |
| Responses | Re: Speed up COPY FROM text/CSV parsing using SIMD (3 replies) |
| List | pgsql-hackers |
Hi,

On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
> On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >> I am able to reproduce the regression you mentioned but both
> >> regressions are %20 on my end. I found that (by experimenting) SIMD
> >> causes a regression if it advances less than 5 characters.
> >>
> >> So, I implemented a small heuristic. It works like that:
> >>
> >> - If advance < 5 -> insert a sleep penalty (n cycles).
> > 'sleep' might be a poor word choice here. I meant skipping SIMD for n
> > number of times.
> >
>
> I was thinking a bit about that this morning. I wonder if it might be
> better instead of having a constantly applied heuristic like this, it
> might be better to do a little extra accounting in the first, say, 1000
> lines of an input file, and if less than some portion of the input is
> found to be special characters then switch to the SIMD code. What that
> portion should be would need to be determined by some experimentation
> with a variety of typical workloads, but given your findings 20% seems
> like a good starting point.

I implemented a heuristic similar to this. It is a mix of the previous
heuristic and your idea, and it works like this:

The overall logic is that we do not run SIMD for the entire line, and we
decide whether it is worth running SIMD for the following lines.

1 - We try SIMD and decide whether it is worth running SIMD.
1.1 - If it is worth it, we continue to run SIMD and halve the
simd_last_sleep_cycle variable.
1.2 - If it is not worth it, we double simd_last_sleep_cycle and do not
run SIMD for that many lines.
1.3 - After skipping simd_last_sleep_cycle lines, we go back to #1.

Note: simd_last_sleep_cycle cannot exceed 1024, so SIMD is retried at
least once every 1024 lines.

With this heuristic, the regression is limited to 2% in the worst case.
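For illustration, the back-off described in steps 1 to 1.3 can be sketched roughly as below. This is a minimal sketch and not the code from the attached patch: only the simd_last_sleep_cycle name and the 1024 cap come from the mail. The struct, the function names, the lines_to_skip counter, the starting value of 0 and the "0 doubles to 1" rule are assumptions made for the example, and the test for whether SIMD was "worth it" is left to the caller as a boolean.

```c
#include <assert.h>
#include <stdbool.h>

#define SIMD_SLEEP_MAX 1024     /* cap mentioned in the mail */

/* Hypothetical state holder; only simd_last_sleep_cycle is named in the mail. */
typedef struct SimdHeuristic
{
    int simd_last_sleep_cycle;  /* current back-off length, in lines */
    int lines_to_skip;          /* assumed counter: lines left before retrying SIMD */
} SimdHeuristic;

/* Step 1 / 1.3: returns true if SIMD should be tried on the next line. */
static bool
simd_should_try(SimdHeuristic *h)
{
    if (h->lines_to_skip > 0)
    {
        h->lines_to_skip--;     /* still inside the skip window */
        return false;
    }
    return true;
}

/* Steps 1.1 and 1.2: record whether the SIMD attempt paid off. */
static void
simd_report(SimdHeuristic *h, bool worth_it)
{
    if (worth_it)
        h->simd_last_sleep_cycle /= 2;          /* 1.1: halve the back-off */
    else
    {
        /* 1.2: double the back-off; assumed here that 0 grows to 1 */
        if (h->simd_last_sleep_cycle == 0)
            h->simd_last_sleep_cycle = 1;
        else
            h->simd_last_sleep_cycle *= 2;

        if (h->simd_last_sleep_cycle > SIMD_SLEEP_MAX)
            h->simd_last_sleep_cycle = SIMD_SLEEP_MAX;

        h->lines_to_skip = h->simd_last_sleep_cycle;
    }

    assert(h->simd_last_sleep_cycle >= 0 &&
           h->simd_last_sleep_cycle <= SIMD_SLEEP_MAX);
}

/* Simulates nlines input lines and returns how many of them tried SIMD. */
static int
simd_count_attempts(int nlines, bool worth_it)
{
    SimdHeuristic h = {0, 0};
    int attempts = 0;

    for (int i = 0; i < nlines; i++)
    {
        if (simd_should_try(&h))
        {
            attempts++;
            simd_report(&h, worth_it);
        }
    }
    return attempts;
}
```

In this sketch, input where SIMD never pays off is probed less and less often until the gap between attempts reaches 1024 lines, while input where it always pays off is parsed with SIMD on every line.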

Patches are attached. The first patch is v2-0001 from Shinya with the
'-Werror=maybe-uninitialized' fixes and the pgindent changes. 0002 is
the actual heuristic patch.

--
Regards,
Nazir Bilal Yavuz
Microsoft