Re: BitmapHeapScan streaming read user and prelim refactoring

From: Tomas Vondra
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date:
Msg-id: 86ddff18-175b-4ae7-a039-129eaf4abe8f@enterprisedb.com
In reply to: Re: BitmapHeapScan streaming read user and prelim refactoring (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers
On 3/3/24 00:39, Melanie Plageman wrote:
> On Sat, Mar 2, 2024 at 5:51 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 3/2/24 23:11, Melanie Plageman wrote:
>>> On Fri, Mar 1, 2024 at 2:31 PM Melanie Plageman
>>> <melanieplageman@gmail.com> wrote:
>>>>
>>>> ...
>>>>
>>>> Hold the phone on this one. I realized why I moved
>>>> BitmapAdjustPrefetchIterator after table_scan_bitmap_next_block() in
>>>> the first place -- master calls BitmapAdjustPrefetchIterator after the
>>>> tbm_iterate() for the current block -- otherwise with eic = 1, it
>>>> considers the prefetch iterator behind the current block iterator. I'm
>>>> going to go through and figure out what order this must be done in and
>>>> fix it.
>>>
>>> So, I investigated this further, and, as far as I can tell, for
>>> parallel bitmapheapscan the timing around when workers decrement
>>> prefetch_pages causes the performance differences with patch 0010
>>> applied. It makes very little sense to me, but some of the queries I
>>> borrowed from your regression examples are up to 30% slower when this
>>> code from BitmapAdjustPrefetchIterator() is after
>>> table_scan_bitmap_next_block() instead of before it.
>>>
>>>         SpinLockAcquire(&pstate->mutex);
>>>         if (pstate->prefetch_pages > 0)
>>>             pstate->prefetch_pages--;
>>>         SpinLockRelease(&pstate->mutex);
>>>
>>> I did some stracing and did see much more time spent in futex/wait
>>> with this code after the call to table_scan_bitmap_next_block() vs
>>> before it. (table_scan_bitmap_next_block() calls ReadBuffer().)
>>>
>>> In my branch, I've now moved only the parallel prefetch_pages-- code
>>> to before table_scan_bitmap_next_block().
>>> https://github.com/melanieplageman/postgres/tree/bhs_pgsr
>>> I'd be interested to know if you see the regressions go away with 0010
>>> applied (commit message "Make table_scan_bitmap_next_block() async
>>> friendly" and sha bfdcbfee7be8e2c461).
>>>
>>
>> I'll give this a try once the runs with MAX_BUFFERS_PER_TRANSFER=1
>> complete. But it seems really bizarre that simply moving this code a
>> little bit would cause such a regression ...
> 
> Yes, it is bizarre. It also might not be a reproducible performance
> difference for cases other than the one I was testing (cyclic dataset,
> uncached, eic=8, matches 16+, distinct=100, rows=100000000, 4 parallel
> workers). But even if it only affects that one case, moving those 5
> lines from before to after table_scan_bitmap_next_block() still had a
> major, reproducible performance impact.
> 
> The same number of reads and fadvises is being issued overall.
> However, I did notice that the pread calls are skewed when those
> lines of code are after table_scan_bitmap_next_block() -- fewer of
> the workers are doing more of the reads. Perhaps this explains what is
> taking longer. Why those workers would end up doing more of the reads,
> I don't quite know.
> 
> - Melanie

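To make it easier to see what is actually being moved around in the quoted
discussion, here is a minimal, compilable sketch of the ordering in question.
The names are stand-ins, not the actual nodeBitmapHeapscan.c code; only the
relative placement of the parallel prefetch_pages-- decrement and the read of
the current block is meant to match the thread.

/*
 * Simplified, self-contained sketch of the per-block loop ordering
 * discussed above -- NOT the actual nodeBitmapHeapscan.c code.  The
 * three steps are:
 *
 *   (a) advance the main iterator to the next block (tbm_iterate),
 *   (b) adjust the prefetch state, including the spinlock-protected
 *       prefetch_pages-- quoted above, and
 *   (c) read the current block (table_scan_bitmap_next_block(), which
 *       ends up in ReadBuffer()).
 *
 * As I read the thread, master does (a), (b), (c); the variant that
 * regressed does (a), (c), (b), i.e. the decrement happens only after
 * the synchronous read of the current block.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct ParallelState
{
    /* protected by pstate->mutex in the real code */
    int         prefetch_pages;
} ParallelState;

/* stand-in for tbm_iterate(): returns false when the bitmap is exhausted */
static bool
next_block_from_bitmap(int *blockno)
{
    static int  next = 0;

    if (next >= 8)
        return false;
    *blockno = next++;
    return true;
}

/* stand-in for the prefetch_pages-- branch of BitmapAdjustPrefetchIterator() */
static void
adjust_prefetch_iterator(ParallelState *pstate)
{
    if (pstate->prefetch_pages > 0)
        pstate->prefetch_pages--;   /* under SpinLockAcquire() in the real code */
}

/* stand-in for table_scan_bitmap_next_block(), i.e. the ReadBuffer() call */
static void
read_current_block(int blockno)
{
    printf("reading block %d\n", blockno);
}

int
main(void)
{
    ParallelState pstate = {.prefetch_pages = 4};
    int         blockno;

    while (next_block_from_bitmap(&blockno))    /* (a) */
    {
        adjust_prefetch_iterator(&pstate);      /* (b) -- "before", as on master */
        read_current_block(blockno);            /* (c) */
        /*
         * The variant being benchmarked instead calls
         * adjust_prefetch_iterator() here, after the read, so in a
         * parallel scan the prefetch_pages-- is delayed until the
         * synchronous I/O for the current block has completed.
         */
    }
    return 0;
}
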

I do have some numbers with e44505ce179e442bd50664c85a31a1805e13514a,
and I don't see any such effect - it performs pretty much exactly like
the v6 patches.

I used a slightly different visualization, plotting the timings on a
scatter plot, so values on the diagonal mean "same performance" while
values above/below mean speedup/slowdown.

This is a bit more compact than the tables with color scales, but it
makes it harder (or impossible) to see patterns (e.g. changes depending
on eic). For evaluating whether there's an overall shift it still works,
though, and it also shows clusters. So it's more of a complementary,
simpler visualization.

There are three charts:

1) master-patched.png - master vs. v6 patches
2) master-locks.png - master vs. e44505ce
3) patched-locks.png - v6 patches vs. e44505ce

There's virtually no difference between (1) and (2) - the same pattern
of regressions and speedups, almost like a copy. That's confirmed by (3),
where pretty much all values are exactly on the diagonal, with only a
couple of outliers.

I'm not sure why you see a 30% difference with the change. I wonder if
that might be due to some issue in the environment? Are you running in a
VM, or something like that?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
