Re: effective_io_concurrency and NVMe devices

From: Tomas Vondra
Subject: Re: effective_io_concurrency and NVMe devices
Date:
Msg-id: 5b050fd1-7813-a223-466d-852d1c62a257@enterprisedb.com
In reply to: Re: effective_io_concurrency and NVMe devices  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
Hi,

I've been looking at this a bit more, investigating the regression. I
was wondering how come no one noticed/reported this issue before, since
we've had "1" as the default value since 9.5.

So either this behaves very differently on modern flash/NVMe storage, or
maybe it somehow depends on the dataset / access pattern.

Note: I don't have access to a machine with NVMe at the moment, so I did
all the tests on my usual machine with SATA SSDs. I plan to run the same
tests on NVMe once the bigger machine is available, but I think that'll
lead mostly to the same conclusions. So let me present the results now,
including the scripts so David can run those tests on their machine.


From now on, I'll refer to two storage devices:

1) SSD RAID - 6x Intel S3600 100GB SATA, in RAID0

2) SSD SINGLE - Intel Series 320, 120GB

The machine is pretty small, with just 8GB of RAM and i5-2500k (4C) CPU.


Firstly, I remembered there were some prefetching benchmarks [1], so I
repeated those. I don't have the same SSD as Merlin, but the general
behavior should be similar.

   e_i_c       1     2     4     8    16    32    64   128   256
   -------------------------------------------------------------
   timing   46.3  49.3  29.1  23.2  22.1  20.7  20.0  19.3  19.2
     diff   100%  106%   63%   50%   48%   45%   43%   42%   41%

The second line is simply the timing relative to the first column.
Merlin did not include timing for e_i_c=0 (I think that was a valid
value, meaning "disabled", even back then).

In any case, those results show significant improvements compared to
e_i_c=1 as the prefetch setting increases.

When I run the same query on scale 3000, including eic=0:

  e_i_c          0     1     2     4      8    16    32    64   128   256
  -----------------------------------------------------------------------
  ssd timing  29.4  49.4  33.9  25.2   31.9  27.2  28.0  29.3  27.6  27.6
  ssd diff    100%  168%  115%   86%   108%   92%   95%  100%   94%   94%
  -----------------------------------------------------------------------
  raid timing 10.7  74.2  51.2  30.6   24.0  13.8  14.6  14.3  14.1  14.0
  raid diff   100%  691%  477%  285%   224%  129%  137%  134%  132%  131%

Notice that ignoring the eic=0 value (no prefetch), the behavior is
pretty similar to what Merlin reported - consistent improvements as the
eic value increases. Ultimately it gets close to eic=0, but not faster
(at least not significantly).

FWIW I actually tried running this on 9.3, and the behavior is the same.

So I guess the behavior is the same, but what that misses is that eic=1
may actually be making things much worse (compared to eic=0). The last
paragraph in [1] actually says:

  > Interesting that at setting of '2' (the lowest possible setting with
  > the feature actually working) is pessimal.

which sounds a bit like '1' does nothing (no prefetch). But that's not
(and was not) the case, I think - we just don't have the results for
eic=0, unfortunately.

Note: We stopped using the complex prefetch distance calculation since
then, but I think we can ignore that here.


The other problem with reproducing/interpreting those results is that
it's unclear whether the query was executed immediately after
"pgbench -i" or sometime later (after a bunch of transactions were
done). Consider that the query is:

   select * from pgbench_accounts
    where aid between 1000 and 50000000 and abalance != 0;

and right after initialization the accounts will be almost perfectly
sequential. So the query will match a continuous range of pages
(roughly 1/6 of the whole table). But updates may be shuffling rows
around, making the I/O access pattern more random (though I'm not sure
how much - I'd expect most updates to fit on the same page).

This might explain the poor results (compared to eic=0). Sequential
access is great for read-ahead (in the OS, and also internally in the
SSD), which makes our prefetching pretty unnecessary, or perhaps even
actively harmful.
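
FWIW one way to check how sequential the table still is after a pgbench
run (not part of the benchmark, just a quick sanity check) is the
planner's correlation statistic for the indexed column:

   -- correlation close to 1.0 means "aid" order matches the physical
   -- row order, i.e. the bitmap heap scan touches an almost
   -- sequential range of pages (run ANALYZE first)
   SELECT correlation
     FROM pg_stats
    WHERE tablename = 'pgbench_accounts' AND attname = 'aid';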

And the same explanation applies to David's query - that's also almost
perfectly sequential, AFAICS.

But that just raises the question - how does the prefetch work for
other access patterns, with pages not this sequential, but spread more
randomly through the table?

So I constructed a couple of data sets with different patterns,
generated by the attached bash script. The table has this structure:

   CREATE TABLE t (a int, padding text)

and "a" has values between 0 and 1000, and the script generates data so
that each page contains 27 rows with the same "a" value. This allows us
to write queries matching arbitrary fraction of the table. For example
we can say "a BETWEEN 10 AND 20" which matches 1%, etc.

Furthermore, the pages are either independent (each with a different
value) or with longer streaks of the same value.

The script generates these data sets:

 random:   each page gets a random "a" value
 random-8: each sequence of 8 pages gets a random value
 random-32: each sequence of 32 pages gets a random value
 sequential: split into 1000 sequences, values 0, 1, 2, ...
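
For illustration, here's a minimal SQL-only sketch of the "random" case
(the attached bash script is what actually generated the data; the
padding width and row counts here are rough guesses aimed at fitting
roughly 27 rows per 8kB page):

   CREATE TABLE t (a int, padding text);

   WITH pages AS MATERIALIZED (
       SELECT g AS pageno, (random() * 1000)::int AS a  -- one value per page
         FROM generate_series(1, 100000) g
   )
   INSERT INTO t
   SELECT p.a, repeat('x', 250)       -- ~27 copies of each page's value
     FROM pages p, generate_series(1, 27)
    ORDER BY p.pageno;                -- keep each page's rows together

The random-8 / random-32 / sequential variants simply pick the value
per group of 8 / 32 pages (or sequentially) instead of per page.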

And then the script runs queries matching a random subset of the table,
with fractions 1%, 5%, 10%, 25% and 50% (queries with different
selectivity). The ranges are generated at random; it's just the length
of the range that matters.
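
An example of one such query (with hypothetical values - the actual
ranges come from the script), assuming an index on "a" so the planner
can pick a bitmap heap scan; a range of ~50 values matches roughly 5%
of the table:

   CREATE INDEX ON t (a);

   EXPLAIN (ANALYZE, BUFFERS)
   SELECT * FROM t WHERE a BETWEEN 123 AND 172;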

The script also restarts the database and drops caches, so that the
prefetch actually does something.

Attached are CSV files with a complete run from the two SSD devices, if
you want to dig in. But the two PDFs are a better "visualization" of
performance compared to "no prefetch" (eic=0).

The "tables" PDF shows timing compared to eic=0, so 100% means "the
same" and 200% "twice slower". Or by color - red is "slower" (bad) while
green is "faster" (good).

The "charts" PDF shows essentially the same thing (duration compared to
eic=0), but as chart with "eic" on x-axis. In principle, we want all the
values to be "below" 100% line.
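
In case you want to recompute this from the CSV, the relative numbers
are simply each duration divided by the eic=0 duration for the same
data set and fraction. For example (with hypothetical column names,
assuming the CSV was loaded into results(dataset, fraction, eic,
duration)):

   SELECT dataset, fraction, eic,
          round(100.0 * duration /
                first_value(duration) OVER (PARTITION BY dataset, fraction
                                            ORDER BY eic)) AS pct_vs_eic0
     FROM results
    ORDER BY dataset, fraction, eic;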


I think there are three obvious observations we can make from the tables
and charts:

1) The higher the selectivity, the worse.

2) The more sequential the data, the worse.

3) These two things "combine".


For example on the "random" data, prefetching works perfectly fine for
queries matching 1%, 5% and 10% even for eic=1. But queries matching 25%
and 50% get much slower with eic=1 and need much higher values to even
break even.

The less random data sets make it worse and worse. With random-32, all
queries (even the 1% ones) require at least eic=4 to break even, and
with "sequential" it never happens.

I'd bet the NVMe devices will behave mostly the same way - after all,
David showed the same issue for prefetching on sequential data. I'm not
sure about the "more random" cases, because one of the supposed
advantages of modern NVMe devices is that they need a lower queue depth.

This may also explain why we haven't received any reports - most
queries probably match either a tiny fraction of the data, or the data
is mostly random. So prefetching either helps, or at least is not too
harmful.


I think this can be explained mostly by OS read-ahead and/or internal
caching on the SSD devices, which work pretty well for sequential
accesses. "Our" prefetching may then be either unnecessary (essentially
a little bit of extra overhead) or actively interfering with it -
changing the access pattern so that the OS does not recognize/trigger
the read-ahead, or maybe evicting the interesting pages from the
internal device cache.


What can we do about this? AFAICS it shouldn't be difficult to look at
the bitmap generated by the bitmap index scan and analyze it - that
will tell us what fraction of pages match, and also how sequential the
pattern is. Based on that we can either adjust the prefetch distance,
or maybe even disable prefetching for cases matching too many pages or
pages that are "too sequential". Of course, that'll require some
heuristics or a simple "cost model".



regards


[1]
https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH%3DHWh1WbLRsioe%3DmzRJTHwtr%3D2azsTdQ%40mail.gmail.com


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company