Re: BitmapHeapScan streaming read user and prelim refactoring
From: Tomas Vondra
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Msg-id: 1d9902c0-43b9-4ae7-8c19-70bab3802bd1@enterprisedb.com
In reply to: Re: BitmapHeapScan streaming read user and prelim refactoring (Andres Freund <andres@anarazel.de>)
Responses: Re: BitmapHeapScan streaming read user and prelim refactoring
List: pgsql-hackers
On 3/17/24 17:38, Andres Freund wrote:
> Hi,
>
> On 2024-03-16 21:25:18 +0100, Tomas Vondra wrote:
>> On 3/16/24 20:12, Andres Freund wrote:
>>> That would address some of the worst behaviour, but it doesn't
>>> really seem to address the underlying problem of the two iterators
>>> being modified independently. ISTM the proper fix would be to
>>> protect the state of the iterators with a single lock, rather than
>>> pushing down the locking into the bitmap code. OTOH, we'll only
>>> need one lock going forward, so being economic in the effort of
>>> fixing this is also important.
>>
>> Can you share some details about how you identified the problem,
>> counted the prefetches that happen too late, etc? I'd like to try to
>> reproduce this to understand the issue better.
>
> There are two aspects. Originally I couldn't reliably reproduce the
> regression with Melanie's repro on my laptop. I finally was able to
> do so after I
>
> a) changed the block device's read_ahead_kb to 0
> b) used effective_io_concurrency=1
>
> That made the difference between the BitmapAdjustPrefetchIterator()
> locations very significant, something like 2.3s vs 12s.

Interesting. I hadn't thought about read_ahead_kb, but in hindsight it
makes sense that it affects these cases. OTOH I did not set it to 0 on
either machine (the 6xSATA RAID0 has it at 12288, for example), and yet
that's how we found the regressions.

For eic it makes perfect sense that setting it to 1 is particularly
vulnerable to this issue - it only takes a small "desynchronization" of
the two iterators for the prefetch to "fall behind" and frequently
prefetch blocks we have already read.

> Besides a lot of other things, I finally added debugging fprintfs
> printing the pid, (prefetch, read), block number. Even looking at
> tiny excerpts of the large amount of output that generates shows
> that the two iterators were out of sync.

Thanks. I did experiment with fprintf, but it's quite cumbersome, so I
was hoping you came up with some smart way to trace this kind of
stuff. For example, I was wondering if ebpf would be a more convenient
way.
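For reference, I imagine the instrumentation you describe boils down
to something like this (a minimal sketch only - bhs_trace() and its
call sites are made up; MyProcPid is the backend's pid from
miscadmin.h):

    /*
     * Minimal sketch of the kind of tracing discussed above, not a
     * patch. One line per event, so the output can be grepped and
     * sorted per pid afterwards.
     */
    #include "postgres.h"

    #include "miscadmin.h"          /* MyProcPid */
    #include "storage/block.h"      /* BlockNumber */

    static void
    bhs_trace(const char *event, BlockNumber blockno)
    {
        fprintf(stderr, "BHS_TRACE pid=%d %s block=%u\n",
                MyProcPid, event, blockno);
    }

Calling that as bhs_trace("prefetch", ...) next to the posix_fadvise
and bhs_trace("read", ...) where the block is actually read, then
splitting the output per pid, should make the kind of interleaving you
show below directly visible.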
>> If I understand correctly, what may happen is that a worker reads
>> blocks from the "prefetch" iterator, but before it manages to issue
>> the posix_fadvise, some other worker already did pread. Or can the
>> iterators get "out of sync" in a more fundamental way?
>
> I agree that the current scheme of two shared iterators being used
> has some fairly fundamental raciness. But I suspect there's more
> than that going on right now.
>
> Moving BitmapAdjustPrefetchIterator() to later drastically increases
> the raciness because it means table_scan_bitmap_next_block() happens
> between increasing the "real" and the "prefetch" iterators.
>
> An example scenario that, I think, leads to the iterators being out
> of sync, without there being races between iterator advancement and
> completing prefetching:
>
> start:
>   real -> block 0
>   prefetch -> block 0
>   prefetch_pages = 0
>   prefetch_target = 1
>
> W1: tbm_shared_iterate(real) -> block 0
> W2: tbm_shared_iterate(real) -> block 1
> W1: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 0
> W2: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 1
> W1: read block 0
> W2: read block 1
> W1: BitmapPrefetch() -> prefetch_pages++ -> 1,
>     tbm_shared_iterate(prefetch) -> 2, prefetch block 2
> W2: BitmapPrefetch() -> nothing, as prefetch_pages == prefetch_target
>
> W1: tbm_shared_iterate(real) -> block 2
> W2: tbm_shared_iterate(real) -> block 3
>
> W2: BitmapAdjustPrefetchIterator() -> prefetch_pages--
> W2: read block 3
> W2: BitmapPrefetch() -> prefetch_pages++,
>     tbm_shared_iterate(prefetch) -> 3, prefetch block 3
>
> So afaict here we end up prefetching a block that the *same process*
> just read.

Uh, that's very weird. I'd understand it if there were some
cross-process issue, but if this happens within a single process ...
strange.

> ISTM that the idea of somehow "catching up" in
> BitmapAdjustPrefetchIterator(), separately from advancing the "real"
> iterator, is pretty ugly for non-parallel BHS and just straight up
> broken in the parallel case.

Yeah, I agree with the feeling that it's an ugly fix. It definitely
seems more like fixing the symptoms than the actual problem.
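Just to make the "single lock" idea concrete, I'd imagine something
roughly like this (only a sketch - all the names are hypothetical, and
since TBMSharedIterator is backend-local in the actual code, a real
patch would have to hang the iterators' shared state off the parallel
scan state instead):

    /*
     * Sketch of the "single lock" idea - hypothetical names, not a
     * patch. The point is just that both iterators only ever advance
     * while the same lock is held, so their relative positions cannot
     * diverge the way they do in the scenario above.
     */
    #include "postgres.h"

    #include "nodes/tidbitmap.h"    /* TBMSharedIterator etc. */
    #include "storage/lwlock.h"

    typedef struct BitmapScanSyncState
    {
        LWLock      lock;           /* serializes *both* iterators;
                                     * set up with LWLockInitialize() */
        TBMSharedIterator *real;    /* next block to actually read */
        TBMSharedIterator *prefetch;    /* next block to prefetch */
        int         prefetch_pages;     /* prefetched, not yet read */
        int         prefetch_target;    /* current prefetch distance */
    } BitmapScanSyncState;

    static TBMIterateResult *
    next_block_synced(BitmapScanSyncState *st)
    {
        TBMIterateResult *tbmres;

        LWLockAcquire(&st->lock, LW_EXCLUSIVE);

        tbmres = tbm_shared_iterate(st->real);

        /*
         * Do the prefetch accounting in the same critical section, so
         * no worker can observe "real" ahead of "prefetch" and decide
         * to prefetch a block that some worker (possibly itself) is
         * already reading.
         */
        if (tbmres != NULL)
        {
            if (st->prefetch_pages > 0)
                st->prefetch_pages--;   /* block was prefetched */
            else
                (void) tbm_shared_iterate(st->prefetch);    /* catch up */
        }

        LWLockRelease(&st->lock);

        return tbmres;
    }

Whether funneling both iterators through one lock is acceptable is of
course the performance question, but as you say we'll only need one
lock going forward anyway.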
regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company