Re: AIO v2.5

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: AIO v2.5
Дата
Msg-id dhohimismkcohflmvd6hkrfwgy2c7v5g4gcg5wnfrd3rtxpzxy@aqdwtknayaub
обсуждение исходный текст
Ответ на Re: AIO v2.5  (Tomas Vondra <tomas@vondra.me>)
Ответы Re: AIO v2.5
Список pgsql-hackers
On 2025-07-13 20:04:51 +0200, Tomas Vondra wrote:
> On 7/11/25 23:03, Tomas Vondra wrote:
> > ...
> >
> > e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png)
> >
> > There's an interesting difference difference I noticed in the run with
> > checksums on PG17. The full PDF is available here:
> >
> > https://github.com/tvondra/iomethod-tests/blob/run2-17-checksums-on/ryzen-rows-cold-32GB-16-unscaled.pdf
> >
> > The interesting thing is that PG17 indexscans on uniform dataset got a
> > little bit faster. In the attached PDF it's exactly on par with PG18,
> > but here it got a bit faster. Which makes no sense, if it has to also
> > verify checksums. I haven't had time to investigate this yet.
>
> I was intrigued by this, so I looked into this today.
>
> TL;DR I believe it was caused by something in the filesystem or even the
> storage devices, making the "PG17" data directory (or maybe even just
> the "uniform" table) a bit faster.
>
> I started by reproducing the behavior with an indexscan matching 10% of
> the rows, and it was very easy to reproduce the difference shows on the
> chart (all timings in milliseconds):
>
> PG17: 14112.800 ms
> PG18: 21612.090 ms

That's a decidedly nontrivial difference.

Did you keep any metrics from those runs? E.g. whether there were larger IOs
or such?



> This was perfectly reproducible, affecting the whole table (not just one
> part of it), etc. At some point I recalled that I might have initialized
> the databases in slightly different ways - one by running the SQL, the
> other one by pg_dump/pg_restore (likely with multiple jobs).

I guess that's an INSERT ...  SELECT vs COPY?

Which one was the faster one?

If this ever re-occurs, it might be interesting to look at the fragmentation
of the underlying files with filefrag.


> I couldn't think of any other difference between the data directories,
> so I simply reloaded them by pg_restore (from the same dump). Which
> however made them both slow :O

So that suggests that COPY is the slow case, interesting.

One potentially relevant factor could be that parallel COPY into logged tables
currently leads to really sub-optimal write patterns, due to us writing back
buffers one-by-one, interspersed by WAL writes and file extensions. I know how
that affects write speed, but it's not entirely obvious how it would affect
read speed...


> And it didn't matter how many jobs are used, or anything else I tried.
> But every now and then an instance (17 or 18) happened to be fast
> (~14000 ms). Consistently, for all queries on the table, not randomly.
>
> In the end I recreated the (ext4) filesystem, loaded the databases and
> now both instances are fast. I have no idea what the root cause was, and
> I assume recreating the filesystem destroyed all the evidence.

Besides differences in filesystem level fragmentation, another potential
theory is that the SSDs were internally more fragmented. Occasionally
dumping/restoring the data could allow the drive to do internal wear
leveling before the new data is loaded, leading to a better layout.

I found that I get more consistent benchmark performance if I delete as much
of the data as possible, run fstrim -v -a and then load the data. And do
another round of fstrim.

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления: