Re: Handing off SLRU fsyncs to the checkpointer

From: Jakub Wartak
Subject: Re: Handing off SLRU fsyncs to the checkpointer
Date:
Msg-id: VI1PR0701MB69603BF4F1EC2AF6E6ECF5C1F6550@VI1PR0701MB6960.eurprd07.prod.outlook.com
In response to: Re: Handing off SLRU fsyncs to the checkpointer  (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: Handing off SLRU fsyncs to the checkpointer  (Thomas Munro <thomas.munro@gmail.com>)
           Re: Handing off SLRU fsyncs to the checkpointer  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
Hi Thomas / hackers,

>> The append-only bottleneck appears to be limited by syscalls/s due to small block size even with everything in FS
>> cache (but not in shared buffers; please compare with TEST1 as there was no such bottleneck at all):
>>
>>     29.62%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
>>             ---copy_user_enhanced_fast_string
>>                |--17.98%--copyin
>> [..]
>>                |          __pwrite_nocancel
>>                |          FileWrite
>>                |          mdwrite
>>                |          FlushBuffer
>>                |          ReadBuffer_common
>>                |          ReadBufferWithoutRelcache
>>                |          XLogReadBufferExtended
>>                |          XLogReadBufferForRedoExtended
>>                |           --17.57%--btree_xlog_insert
>
> To move these writes out of recovery's way, we should probably just
> run the bgwriter process during crash recovery.  I'm going to look
> into that.

Sounds awesome. Also, as this thread is starting to drift away from the SLRU fsync topic, maybe we should change the
subject? However, to add some data on the separate bgwriter idea: on 14master (already with lseek() caching, SLRU
fsyncs out of the way, and better sorting), I've measured the same configuration as last time, with the same
append-only WAL workload on NVMe, and compared various shared_buffers settings (the buffer descriptors sizing comes
from pg_shmem_allocations, which, as you stated, is wrongly reported(?); I'm including it only for reference just in
case -- the query is sketched after the table):

shared_buffers=128MB buffers_desc=1024kB 96.778, 0.438 [a]
shared_buffers=256MB buffers_desc=2048kB 62.755, 0.577 [a]
shared_buffers=512MB buffers_desc=4096kB 33.167, 0.62 [a]
shared_buffers=1GB buffers_desc=8192kB 27.303, 0.929
shared_buffers=4GB buffers_desc=32MB 27.185, 1.166
shared_buffers=8GB buffers_desc=64MB 27.649, 1.088
shared_buffers=16GB buffers_desc=128MB 27.584, 1.201 [b]
shared_buffers=32GB buffers_desc=256MB 32.314, 1.171 [b]
shared_buffers=48GB buffers_desc=384MB 31.95, 1.217
shared_buffers=64GB buffers_desc=512MB 31.276, 1.349
shared_buffers=72GB buffers_desc=576MB 31.925, 1.284
shared_buffers=80GB buffers_desc=640MB 31.809, 1.413
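
Just for reference, the buffer descriptors sizing above can be pulled with something along these lines (a sketch;
connection options are placeholders):

    psql -X -c "SELECT name, size, allocated_size
                FROM pg_shmem_allocations
                WHERE name = 'Buffer Descriptors';"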

The amount of WAL to be replayed was ~2.8GB. To me it looks like:
a) flushing dirty buffers from StartupXLog might be a real problem, but please read on;
b) there is some impact from the L2/L3 hypothesis you mention (?), but it's not that big and it doesn't degrade
linearly as one would expect (??). The L1d/L1i/L2/L3 cache sizes on this machine are 32K/32K/256K/46080K. Maybe the
DB size is too small.

I've double-checked that in scenario [a] (shared_buffers=128MB) there were a lot of FlushBuffer() invocations per
second (perf stat -e probe_postgres:FlushBuffer -I 1000; the probe setup is sketched a bit further down), e.g.:
#           time             counts unit events
     1.000485217             79,494      probe_postgres:FlushBuffer
     2.000861366             75,305      probe_postgres:FlushBuffer
     3.001263163             79,662      probe_postgres:FlushBuffer
     4.001650795             80,317      probe_postgres:FlushBuffer
     5.002033273             79,951      probe_postgres:FlushBuffer
     6.002418646             79,993      probe_postgres:FlushBuffer
while at 1GB shared_buffers it sits nearly at zero all the time. So a roughly 3x speed-up looks possible if
StartupXLog didn't have to care about dirty buffers in the critical path (the bgwriter would instead, as you say?),
at least in append-only scenarios. But... I've checked some real systems (even older PostgreSQL versions not doing
that much replication), and indeed FlushBuffer() is happening there at e.g. 8-12k/s with pretty huge shared buffers
(>100GB, 0.5TB RAM), but those are *system-wide* numbers: the work is really performed by the bgwriter, not by the
startup/recovering process as in this redo-bench case, because the DB is open for reads and replicating -- so it
appears that this isn't an optimization for the hot standby case, but for the DB-closed startup
recovery/restart/disaster/PITR scenario.
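
For completeness, a rough sketch of how such a dynamic probe can be set up with perf (the binary path and the
process-matching pattern are just examples, adjust to your install):

    # add a uprobe on FlushBuffer() in the postgres binary
    perf probe -x /usr/pgsql-14/bin/postgres FlushBuffer

    # count FlushBuffer() hits per second in the recovering startup process
    perf stat -e probe_postgres:FlushBuffer -I 1000 \
        -p "$(pgrep -f 'postgres: startup')"

    # remove the probe when done
    perf probe -d probe_postgres:FlushBuffer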

As for the 24GB shared_buffers scenario, where dirty buffers are not a problem at all, the top of the profile again
looks as expected (output trimmed; the profiling command is sketched below the output):

    13.41%  postgres  postgres           [.] hash_search_with_hash_value
               |--8.31%--BufTableLookup <- ReadBuffer_common <- ReadBufferWithoutRelcache
                --5.11%--smgropen
                          |--2.77%--XLogReadBufferExtended
                           --2.34%--ReadBufferWithoutRelcache
     7.88%  postgres  postgres           [.] MarkBufferDirty
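
For reference, one way such a call-graph profile can be collected (a sketch; the PID lookup is just an example):

    # sample the recovering startup process with call graphs for ~30s
    perf record -g -p "$(pgrep -f 'postgres: startup')" -- sleep 30
    # summarize, with the caller/callee breakdown shown above
    perf report --children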

I've tried to get the cache miss ratio via PMCs, but apparently on EC2 they report as not-supported or zeros (even on
bigger instances). Interestingly, however, the workload has an IPC of 1.40 (instruction bound), which seems strange
to me as I would expect BufTableLookup() to be heavily memory bound (?). Maybe I'll try on some different hardware
one day.
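
For what it's worth, the IPC and cache-miss numbers can be collected with a plain counter run along these lines (a
sketch; the process-matching pattern is just an example, and on EC2 the cache events may show up as not-supported or
zeros as mentioned above):

    # hardware counters for the recovering startup process for ~30s
    perf stat -e cycles,instructions,cache-references,cache-misses \
        -p "$(pgrep -f 'postgres: startup')" -- sleep 30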

>>                           __pread_nocancel
>>                            --11.44%--FileRead
>>                                      mdread
>>                                      ReadBuffer_common
>>                                      ReadBufferWithoutRelcache
>>                                      XLogReadBufferExtended
>>                                      XLogReadBufferForRedoExtended
>
> For these reads, the solution should be WAL prefetching, (..) But... when combined with Andres's work-in-progress AIO
> stuff (..)

Yes, I've heard a thing or two about those :) I hope I'll be able to deliver some measurements of those two together
(AIO + WAL prefetch) sooner or later.

-Jakub Wartak.

