Re: AIO v2.5
От | Andres Freund |
---|---|
Тема | Re: AIO v2.5 |
Дата | |
Msg-id | dhohimismkcohflmvd6hkrfwgy2c7v5g4gcg5wnfrd3rtxpzxy@aqdwtknayaub обсуждение исходный текст |
Ответ на | Re: AIO v2.5 (Tomas Vondra <tomas@vondra.me>) |
Ответы |
Re: AIO v2.5
|
Список | pgsql-hackers |
On 2025-07-13 20:04:51 +0200, Tomas Vondra wrote: > On 7/11/25 23:03, Tomas Vondra wrote: > > ... > > > > e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png) > > > > There's an interesting difference difference I noticed in the run with > > checksums on PG17. The full PDF is available here: > > > > https://github.com/tvondra/iomethod-tests/blob/run2-17-checksums-on/ryzen-rows-cold-32GB-16-unscaled.pdf > > > > The interesting thing is that PG17 indexscans on uniform dataset got a > > little bit faster. In the attached PDF it's exactly on par with PG18, > > but here it got a bit faster. Which makes no sense, if it has to also > > verify checksums. I haven't had time to investigate this yet. > > I was intrigued by this, so I looked into this today. > > TL;DR I believe it was caused by something in the filesystem or even the > storage devices, making the "PG17" data directory (or maybe even just > the "uniform" table) a bit faster. > > I started by reproducing the behavior with an indexscan matching 10% of > the rows, and it was very easy to reproduce the difference shows on the > chart (all timings in milliseconds): > > PG17: 14112.800 ms > PG18: 21612.090 ms That's a decidedly nontrivial difference. Did you keep any metrics from those runs? E.g. whether there were larger IOs or such? > This was perfectly reproducible, affecting the whole table (not just one > part of it), etc. At some point I recalled that I might have initialized > the databases in slightly different ways - one by running the SQL, the > other one by pg_dump/pg_restore (likely with multiple jobs). I guess that's an INSERT ... SELECT vs COPY? Which one was the faster one? If this ever re-occurs, it might be interesting to look at the fragmentation of the underlying files with filefrag. > I couldn't think of any other difference between the data directories, > so I simply reloaded them by pg_restore (from the same dump). Which > however made them both slow :O So that suggests that COPY is the slow case, interesting. One potentially relevant factor could be that parallel COPY into logged tables currently leads to really sub-optimal write patterns, due to us writing back buffers one-by-one, interspersed by WAL writes and file extensions. I know how that affects write speed, but it's not entirely obvious how it would affect read speed... > And it didn't matter how many jobs are used, or anything else I tried. > But every now and then an instance (17 or 18) happened to be fast > (~14000 ms). Consistently, for all queries on the table, not randomly. > > In the end I recreated the (ext4) filesystem, loaded the databases and > now both instances are fast. I have no idea what the root cause was, and > I assume recreating the filesystem destroyed all the evidence. Besides differences in filesystem level fragmentation, another potential theory is that the SSDs were internally more fragmented. Occasionally dumping/restoring the data could allow the drive to do internal wear leveling before the new data is loaded, leading to a better layout. I found that I get more consistent benchmark performance if I delete as much of the data as possible, run fstrim -v -a and then load the data. And do another round of fstrim. Greetings, Andres Freund
В списке pgsql-hackers по дате отправления: