Re: BitmapHeapScan streaming read user and prelim refactoring

From: Tomas Vondra
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date:
Msg-id: 3ac67b68-47ff-4edb-bb9e-d86504acd573@enterprisedb.com
In reply to: Re: BitmapHeapScan streaming read user and prelim refactoring  (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: BitmapHeapScan streaming read user and prelim refactoring  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
On 3/29/24 14:36, Thomas Munro wrote:
> On Sat, Mar 30, 2024 at 12:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> eic unpatched patched
>> 0        4172    9572
>> 1       30846   10376
>> 2       18435    5562
>> 4       18980    3503
>> 8       18980    2680
>> 16      18976    3233
> 
> ... but the patched version gets down to a low number for eic=0 too if
> you turn up the blockdev --setra so that it also gets Linux RA
> treatment, making it the clear winner on all eic settings.  Patched
> doesn't improve.  So, for low IOPS storage at least, when you're on
> the borderline between random and sequential, ie bitmap with a lot of
> 1s in it, it seems there are cases where patched doesn't trigger Linux
> RA but unpatched does, and you can tune your way out of that, and then
> there are cases where the IOPS limit is reached due to small reads,
> but patched does better because of larger I/Os that are likely under
> the same circumstances.  Does that make sense?

I think you meant "unpatched version gets down" in the first sentence,
right? Still, it seems clear this changes the interaction with readahead
done by the kernel.
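For context, the knob Thomas tuned with `blockdev --setra` is exposed per device under sysfs. A small sketch (Linux-specific paths; device names will vary, and the result may be empty in containers) to inspect the current kernel readahead settings:

```python
import glob

def readahead_settings():
    """Report per-device kernel readahead in kB, i.e. the same knob
    that `blockdev --setra` adjusts (in 512-byte sectors). The
    /sys/block/*/queue/read_ahead_kb paths are Linux-specific."""
    out = {}
    for p in glob.glob("/sys/block/*/queue/read_ahead_kb"):
        dev = p.split("/")[3]
        with open(p) as f:
            out[dev] = int(f.read().strip())
    return out

print(readahead_settings())   # e.g. {'nvme0n1': 128, 'sda': 128}
```

Setting `read_ahead_kb` to 0 is what "readahead disabled" means in the benchmarks below.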

However, you seem to focus only on the eic=0/eic=1 cases, but IIRC that
was just an example. There are regressions with higher eic values too.

I do have some early results from the benchmarks - it's just from the
NVMe machine, with 1M tables (~300MB), and it's just one incomplete run
(so there might be some noise etc.).

Attached is a PDF with charts for different subsets of the runs:

- optimal (would the optimizer pick bitmapscan or not)
- readahead (yes/no)
- workers (serial vs. 4 workers)
- combine limit (8kB / 128kB)

The most interesting cases are first two rows, i.e. optimal plans.
Either with readahead enabled (first row) or disabled (second row).

Two observations:

* The combine limit seems to have negligible impact. There's no visible
difference between combine_limit=8kB and 128kB.

* Parallel queries seem to work about the same as master (especially for
optimal cases, but even for not optimal ones).


The optimal plans with kernel readahead (two charts in the first row)
look fairly good. There are a couple regressed cases, but a bunch of
faster ones too.

The optimal plans without kernel readahead (two charts in the second
row) perform pretty poorly - there are massive regressions. But I think
the obvious reason is that the streaming read API skips prefetches for
sequential access patterns, relying on the kernel to do the readahead.
But if kernel readahead is disabled for the device, that obviously can't
happen ...
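The behavior being described can be illustrated with a toy model (this is not the actual streaming read API code, just a sketch of the heuristic): issue prefetch advice only when the next block does not immediately follow the current one, leaving sequential runs to kernel readahead.

```python
def advice_for_blocks(blocks):
    """Toy sketch (not PostgreSQL code) of skipping explicit prefetch
    advice for sequential access: advise the kernel only on a random
    jump, and rely on kernel readahead for sequential runs."""
    advice = []
    for i, blk in enumerate(blocks):
        nxt = blocks[i + 1] if i + 1 < len(blocks) else None
        if nxt is not None and nxt != blk + 1:
            advice.append(nxt)   # non-sequential jump: issue advice
    return advice

# Fully sequential pattern: no advice is ever issued, so with kernel
# readahead disabled nothing prefetches - matching the regressions.
print(advice_for_blocks([10, 11, 12, 13]))   # []
print(advice_for_blocks([10, 40, 41, 90]))   # [40, 90]
```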

I think the question is how much we can (want to) rely on the readahead
to be done by the kernel. Maybe there should be some flag to force
issuing fadvise even for sequential patterns, perhaps at the tablespace
level? I don't recall seeing a system with disabled readahead, but I'm
sure there are cases where it may not really work - it clearly can't
work with direct I/O, but I've also not been very successful with
prefetching on ZFS.
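For reference, the advisory call a "force fadvise" flag would issue is posix_fadvise(POSIX_FADV_WILLNEED); PostgreSQL does this from C, but the same syscall can be demonstrated from Python on a scratch file (Linux/Unix only; the file name here is just a temp file):

```python
import os, tempfile

# POSIX_FADV_WILLNEED hints the kernel to read the given range ahead
# of the actual read. This only demonstrates the syscall itself on a
# throwaway file; it raises OSError if the advice is rejected.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 8192)                          # one 8kB "page"
    os.posix_fadvise(fd, 0, 8192, os.POSIX_FADV_WILLNEED)
    print("advised")
finally:
    os.close(fd)
    os.unlink(path)
```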

The non-optimal plans (second half of the charts) show about the same
behavior, but the regressions are more frequent / significant.

I'm also attaching results for the 5k "optimal" runs, showing the timing
for master and patched build, sorted by (patched/master). The most
significant regressions are with readahead=0, but if you filter that out
you'll see the regressions affect a mix of data sets, not just the
uniformly random data used as example before.

On 3/29/24 12:17, Thomas Munro wrote:
> ...
> On the other hand this might be a pretty unusual data distribution.
> People who store random numbers or hashes or whatever probably don't
> really search for ranges of them (unless they're trying to mine
> bitcoins in SQL).  I dunno.  Maybe we need more realistic tests, or
> maybe we're just discovering all the things that are bad about the
> pre-existing code.

I certainly admit the data sets are synthetic and perhaps adversarial.
My intent was to cover a wide range of data sets, to trigger even less
common cases. It's certainly up to debate how serious the regressions on
those data sets are in practice, I'm not suggesting "this strange data
set makes it slower than master, so we can't commit this".

But I'd also point out that what matters is the access pattern, not the
exact query generating it. I agree people probably don't do range
conditions on random numbers or hashes, but that's irrelevant - it's all
about the page access pattern. If you have IPv4 addresses and run range
queries on them, for example, the resulting accesses are likely going to
be pretty random too.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company