Re: BitmapHeapScan streaming read user and prelim refactoring
From | Thomas Munro |
---|---|
Subject | Re: BitmapHeapScan streaming read user and prelim refactoring |
Date | |
Msg-id | CA+hUKGKxPEh4oUtz_9fij=DdrJRWejbf1r0qWyqpSxUe2tu3VA@mail.gmail.com |
In response to | Re: BitmapHeapScan streaming read user and prelim refactoring (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses | Re: BitmapHeapScan streaming read user and prelim refactoring (Thomas Munro <thomas.munro@gmail.com>); Re: BitmapHeapScan streaming read user and prelim refactoring (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
List | pgsql-hackers |
On Sat, Mar 30, 2024 at 4:53 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Two observations:
>
> * The combine limit seems to have negligible impact. There's no visible
> difference between combine_limit=8kB and 128kB.
>
> * Parallel queries seem to work about the same as master (especially for
> optimal cases, but even for not optimal ones).
>
>
> The optimal plans with kernel readahead (two charts in the first row)
> look fairly good. There are a couple regressed cases, but a bunch of
> faster ones too.

Thanks for doing this!

> The optimal plans without kernel read ahead (two charts in the second
> row) perform pretty poorly - there are massive regressions. But I think
> the obvious reason is that the streaming read API skips prefetches for
> sequential access patterns, relying on kernel to do the readahead. But
> if the kernel readahead is disabled for the device, that obviously can't
> happen ...

Right, it does seem that this whole concept is sensitive to the
'borderline' between sequential and random, and this patch changes
that a bit and we lose some.

It's becoming much clearer to me that master is already exposing weird
kinks, and the streaming version is mostly better, certainly on low
IOPS systems. I suspect that there must be queries in the wild that
would run much faster with eic=0 than eic=1 today due to that, and
while the streaming version also loses in some cases, it seems that it
mostly loses because of not triggering RA, which can at least be
improved by increasing the RA window. On the flip side, master is
more prone to running out of IOPS and there is no way to tune your way
out of that.

> I think the question is how much we can (want to) rely on the readahead
> to be done by the kernel. ...

We already rely on it everywhere, for basic things like sequential scan.

> ... Maybe there should be some flag to force
> issuing fadvise even for sequential patterns, perhaps at the tablespace
> level? ...

Yeah, I've wondered about trying harder to "second guess" the Linux
RA. At the moment, read_stream.c detects *exactly* sequential reads
(see seq_blocknum) to suppress advice, but if we knew/guessed the RA
window size, we could (1) detect it with the same window that Linux
will use to detect it, and (2) [new realisation from yesterday's
testing] we could even "tickle" it to wake it up in certain cases
where it otherwise wouldn't, by temporarily using a smaller
io_combine_limit if certain patterns come along. I think that sounds
like madness (I suspect that any place where the latter would help is
a place where you could turn RA up a bit higher for the same effect
without weird kludges), or another way to put it would be to call it
"overfitting" to the pre-existing quirks; but maybe it's a future
research idea...

> I don't recall seeing a system with disabled readahead, but I'm
> sure there are cases where it may not really work - it clearly can't
> work with direct I/O, ...

Right, for direct I/O everything is slow right now, including seq
scan. We need to start asynchronous reads in the background (imagine
literally just a bunch of background "I/O workers" running preadv() on
your behalf to get your future buffers ready for you, or equivalently
Linux io_uring). That's the real goal of this project: restructuring
so we have the information we need to do that, i.e. teach every part of
PostgreSQL to predict the future in a standard and centralised way.
Should work out better than RA heuristics, because we're not just
driving in a straight line, we can turn corners too.

> ... but I've also not been very successful with
> prefetching on ZFS.

posix_fadvise() did not do anything in OpenZFS before 2.2, maybe you
have an older version?

> I certainly admit the data sets are synthetic and perhaps adversarial.
> My intent was to cover a wide range of data sets, to trigger even less
> common cases. It's certainly up to debate how serious the regressions on
> those data sets are in practice, I'm not suggesting "this strange data
> set makes it slower than master, so we can't commit this".

Right, yeah. Thanks! Your initial results seemed discouraging, but
looking closer I'm starting to feel a lot more positive about
streaming BHS.
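
For anyone following along, the advice-suppression rule mentioned above (read_stream.c tracking seq_blocknum and skipping fadvise for *exactly* sequential reads) can be illustrated with a toy model. This is only a sketch in Python, not the actual C code: the function name blocks_needing_advice is made up for illustration, and it works on single block numbers, whereas the real code, as far as I understand it, works on combined multi-block reads.

```python
from typing import Iterable, List, Optional


def blocks_needing_advice(blocks: Iterable[int]) -> List[int]:
    """Toy model: return the block numbers that would get a prefetch hint.

    Assumption (illustrative only): a hint is issued for a block unless it
    is exactly the block that would continue the previous read, in which
    case we stay quiet and leave the work to kernel readahead.
    """
    advised: List[int] = []
    # Block number that would continue the current sequential run;
    # None means "no previous read yet".
    seq_blocknum: Optional[int] = None

    for blocknum in blocks:
        if blocknum != seq_blocknum:
            # Not exactly sequential: this is where a real implementation
            # would issue something like posix_fadvise(POSIX_FADV_WILLNEED).
            advised.append(blocknum)
        # Whatever we just read, the next sequential block is this one + 1.
        seq_blocknum = blocknum + 1

    return advised


if __name__ == "__main__":
    print(blocks_needing_advice([0, 1, 2, 3]))           # fully sequential
    print(blocks_needing_advice([0, 2, 4, 6]))           # every other block
    print(blocks_needing_advice([0, 1, 2, 10, 11, 12]))  # two sequential runs
```

Under this rule a fully sequential run gets a single hint for its first block, while a pattern that skips every other block gets a hint for every block. That is the 'borderline' sensitivity discussed above: a nearly sequential pattern either relies entirely on kernel RA or issues advice constantly, depending on small gaps.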