Re: Speed up COPY FROM text/CSV parsing using SIMD
| From | Manni Wood |
|---|---|
| Subject | Re: Speed up COPY FROM text/CSV parsing using SIMD |
| Date | |
| Msg-id | CAKWEB6pev=pNVi4qDYWS50N=YFrKRbjH1h=5F1bXpnK7WR5CYg@mail.gmail.com |
| In reply to | Re: Speed up COPY FROM text/CSV parsing using SIMD (KAZAR Ayoub <ma_kazar@esi.dz>) |
| List | pgsql-hackers |
On Wed, Nov 12, 2025 at 8:44 AM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
> On Tue, Nov 11, 2025 at 11:23 PM Manni Wood <manni.wood@enterprisedb.com> wrote:
>
> > Hello!
> >
> > I wanted to reproduce the results using the files attached by Shinya Kato and Ayoub Kazar. I installed a postgres compiled from master, and then I installed a postgres built from master with Nazir Bilal Yavuz's v3 patches applied.
> >
> > The master+v3patches postgres naturally performed better on copying into the database: anywhere from 11% better for the t.csv file produced by Shinya's test.sql, to 35% better copying in the t_4096_none.csv file created by Ayoub Kazar's simd-copy-from-bench.sql.
> >
> > But here's where it gets weird. The two files created by Ayoub Kazar's simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5% respectively.
> >
> > This seems impossible.
> >
> > A few things I should note:
> >
> > I timed the commands using the Unix time command, like so:
> >
> > time psql -X -U mwood -h localhost -d postgres -c '\copy t from /tmp/t_4096_escape.txt'
> >
> > For each file, I timed the copy 6 times and took the average.
> >
> > This was done on my work Linux machine while also running Chrome and an Open Office spreadsheet; not a dedicated machine only running postgres.
>
> Hello,
>
> I think if you do a perf benchmark (if it still reproduces) it would probably be possible to explain why it's performing like that by looking at the CPI and other metrics and comparing them to my findings.
>
> What I also suggest is to make the data even closer to the worst case, i.e. more special characters where they hurt the switching between SIMD and scalar processing (in the simd-copy-from-bench.sql file); if it still does a good job, then there's something to look at.
>
> > All of the copy results took between 4.5 seconds (Shinya's t.csv copied into postgres compiled from master) and 2 seconds (Ayoub Kazar's t_4096_none.csv copied into postgres compiled from master plus Nazir's v3 patches).
> >
> > Perhaps I need to fiddle with the provided SQL to produce larger files to get longer run times? Maybe sub-second differences won't tell as interesting a story as minutes-long copy commands?
>
> I did try it on some GBs (around 2-5GB only); the differences were not that much, but if you can run this on more GBs (at least 10GB) it would be good to look at, although I don't suspect anything interesting since the shape of the data is the same for the totality of the COPY.
>
> Thanks for the info.
>
> Regards,
> Ayoub Kazar.
Hello again!
It looks like using 10 times the data removed the apparent speedup in the simd code when the simd code has to deal with t_4096_escape.txt and t_4096_quote.csv. When both files contain 1,000,000 lines each, postgres master+v3patch imports 0.63% slower and 0.54% slower respectively. For 1,000,000 lines of t_4096_none.txt, the v3 patch yields a 30% speedup. For 1,000,000 lines of t_4096_none.csv, the v3 patch yields a 33% speedup.
I got these numbers just via simple timing, though this time I used psql's \timing feature. I left psql running rather than launching it each time as I did when I used the unix "time" command. I ran the copy command 5 times for each file and averaged the results. Again, this happened on a Linux machine that also happened to be running Chrome and Open Office's spreadsheet.
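For anyone who wants to repeat the averaging, here is a minimal Python sketch of the arithmetic. It assumes the timings are collected from psql's \timing output (lines of the form "Time: 2034.512 ms") and that the percentages are relative changes in mean elapsed time against master; the helper names parse_timings and percent_slower are hypothetical, and this is not the script that produced the numbers above.

```python
import re
from statistics import mean

# psql's \timing prints lines such as "Time: 2034.512 ms (00:02.035)".
TIME_RE = re.compile(r"^Time: (\d+(?:\.\d+)?) ms", re.MULTILINE)


def parse_timings(psql_output: str) -> list[float]:
    """Return every timing value (in milliseconds) found in psql output."""
    return [float(value) for value in TIME_RE.findall(psql_output)]


def percent_slower(master_ms: list[float], patched_ms: list[float]) -> float:
    """Relative change of the patched mean against the master mean.

    Positive means the patched build is slower, negative means it is faster.
    """
    base = mean(master_ms)
    return (mean(patched_ms) - base) / base * 100.0
```

Feeding it the captured output of five runs per build gives one percentage per file. With a spread of only a fraction of a percent on a desktop that is also running a browser, a result in that range may well be within run-to-run noise.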
I should probably try to construct some .txt or .csv files that would trip up the simd on/off heuristic in the v3 patch.
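One possible shape for such a file, as a rough Python sketch: rows alternate in blocks between a plain field and a field where every other character is a double quote, so the parser keeps moving between input with no special characters and input full of them. This assumes the heuristic reacts to how often special characters show up in recent input, which I have not confirmed against the v3 patch; the single-column layout, the 4096-character width, the block size, and the names make_rows and write_csv are all assumptions for illustration.

```python
import csv
import io


def make_rows(n_rows: int, width: int = 4096, block: int = 64):
    """Yield single-field rows, switching shape every `block` rows."""
    plain = "a" * width
    # Every other character is a double quote, so csv.writer has to
    # quote the field and double each embedded quote.
    noisy = 'a"' * (width // 2)
    for i in range(n_rows):
        yield [noisy if (i // block) % 2 else plain]


def write_csv(out, n_rows: int, width: int = 4096, block: int = 64) -> None:
    """Write the rows to a text file object using doubled-quote CSV escaping."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(make_rows(n_rows, width, block))


if __name__ == "__main__":
    # Small demonstration written to memory instead of a file.
    demo = io.StringIO()
    write_csv(demo, n_rows=4, width=8, block=2)
    print(demo.getvalue(), end="")
```

Passing an open file instead of the StringIO, with n_rows=1_000_000, would give a file comparable in row count to the ones above. Varying block from 1 upward should show whether the switching cost depends on how long each run of similar rows is.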
If data "in the wild" tend to be roughly the same "shape" from row to row, as Andrew's experience has shown, I imagine these million row results bode well for the v3 patch...
-- 
Manni Wood
EDB: https://www.enterprisedb.com