RE: pgcon unconference / impact of block size on performance

From: Jakub Wartak
Subject: RE: pgcon unconference / impact of block size on performance
Date:
Msg-id: AM8PR07MB8248009560DB2C7B4C92D874F6A69@AM8PR07MB8248.eurprd07.prod.outlook.com
In response to: Re: pgcon unconference / impact of block size on performance  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: pgcon unconference / impact of block size on performance  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Lists: pgsql-hackers
> On 6/9/22 13:23, Jakub Wartak wrote:
> >>>>>> The really puzzling thing is why is the filesystem so much slower
> >>>>>> for smaller pages. I mean, why would writing 1K be 1/3 of writing
> >>>>>> 4K? Why would a filesystem have such an effect?
> >>>>>
> >>>>> Ha! I don't care at this point as 1 or 2kB seems too small to
> >>>>> handle many real world scenarios ;)
> >>> [..]
> >>>> Independently of that, it seems like an interesting behavior and it
> >>>> might tell us something about how to optimize for larger pages.
> >>>
> >>> OK, curiosity won:
> >>>
> >>> With randwrite on ext4 directio using 4kb the avgqu-sz reaches
> >>> ~90-100 (close to fio's 128 queue depth?) and I'm getting ~70k IOPS
> >>> [with maxdepth=128]. With randwrite on ext4 directio using 1kb the
> >>> avgqu-sz is just 0.7 and I'm getting just ~17-22k IOPS [with
> >>> maxdepth=128] -> conclusion: something is being locked, preventing
> >>> the queue from building up. With randwrite on ext4 directio using
> >>> 4kb the avgqu-sz reaches ~2.3 (so something is queued) and I'm also
> >>> getting ~70k IOPS with the minimal possible maxdepth=4 ->
> >>> conclusion: I just need to split the lock contention by 4.
> >>>
> >>> The 1kB (slow) profile top function is aio_write() -> .... ->
> >>> iov_iter_get_pages() -> internal_get_user_pages_fast(), and there's
> >>> sadly plenty of "lock" keywords inside {related to memory manager,
> >>> padding to full page size, inode locking}. Also one can find some
> >>> articles / commits related to it [1], which didn't make a good
> >>> impression to be honest, as fio is using just 1 file (even while I'm
> >>> on kernel 5.10.x). So I've switched to 4x files and numjobs=4 and
> >>> easily got 60k IOPS, contention solved whatever it was :) So I would
> >>> assume PostgreSQL (with its splitting of data files on 1GB boundaries
> >>> by default and its multiprocess architecture) should be relatively
> >>> safe from such ext4 inode(?)/mm(?) contentions even with the smallest
> >>> 1kb block sizes on Direct I/O some day.
> >>>
> >>
> >> Interesting. So what parameter values would you suggest?
> >
> > At least have 4x filename= entries and numjobs=4
> >
> >> FWIW some of the tests I did were on xfs, so I wonder if that might
> >> be hitting similar/other bottlenecks.
> >
> > Apparently XFS also shows the same contention on a single file for
> > 1..2kb randwrite, see [ZZZ].
> >
> 
> I don't have any results yet, but after thinking about this a bit I find this really
> strange. Why would there be any contention with a single fio job?  Doesn't contention 
> imply multiple processes competing for the same resource/lock etc.?

Maybe 1 job throws a lot of concurrent random I/Os that contend against the same inodes / pages (?) 
 
> Isn't this simply due to the iodepth increase? IIUC with multiple fio jobs, each
> will use a separate iodepth value. So with numjobs=4, we'll really use iodepth*4,
> which can make a big difference.

I was thinking the same (it should be enough to have a big queue depth), but apparently one needs many files (inodes?)
too:

On 1 file I'm not getting a lot of IOPS on small blocksize (even with numjobs), < 20k IOPS always:
numjobs=1/ext4/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=13.5k, BW=13.2MiB/s (13.8MB/s)(396MiB/30008msec); 0 zone resets
numjobs=1/ext4/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=49.1k, BW=192MiB/s (201MB/s)(5759MiB/30001msec); 0 zone resets
numjobs=1/ext4/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=16.8k, BW=16.4MiB/s (17.2MB/s)(494MiB/30001msec); 0 zone resets
numjobs=1/ext4/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=62.5k, BW=244MiB/s (256MB/s)(7324MiB/30001msec); 0 zone resets
numjobs=1/xfs/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=14.7k, BW=14.3MiB/s (15.0MB/s)(429MiB/30008msec); 0 zone resets
numjobs=1/xfs/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=46.4k, BW=181MiB/s (190MB/s)(5442MiB/30002msec); 0 zone resets
numjobs=1/xfs/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=22.3k, BW=21.8MiB/s (22.9MB/s)(654MiB/30001msec); 0 zone resets
numjobs=1/xfs/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=59.6k, BW=233MiB/s (244MB/s)(6988MiB/30001msec); 0 zone resets
numjobs=4/ext4/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=13.9k, BW=13.6MiB/s (14.2MB/s)(407MiB/30035msec); 0 zone resets [FAIL 4*qdepth]
numjobs=4/ext4/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=52.9k, BW=207MiB/s (217MB/s)(6204MiB/30010msec); 0 zone resets
numjobs=4/ext4/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=17.9k, BW=17.5MiB/s (18.4MB/s)(525MiB/30001msec); 0 zone resets [FAIL 4*qdepth]
numjobs=4/ext4/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=63.3k, BW=247MiB/s (259MB/s)(7417MiB/30001msec); 0 zone resets
numjobs=4/xfs/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=14.3k, BW=13.9MiB/s (14.6MB/s)(419MiB/30033msec); 0 zone resets [FAIL 4*qdepth]
numjobs=4/xfs/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=50.5k, BW=197MiB/s (207MB/s)(5917MiB/30010msec); 0 zone resets
numjobs=4/xfs/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=19.6k, BW=19.1MiB/s (20.1MB/s)(574MiB/30001msec); 0 zone resets [FAIL 4*qdepth]
numjobs=4/xfs/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=63.6k, BW=248MiB/s (260MB/s)(7448MiB/30001msec); 0 zone resets
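
For reference, the single-file cases above correspond to a fio invocation roughly like this (just a sketch reconstructed from the directory naming - the file path, size and runtime are placeholders rather than the exact values; swap --ioengine / --bs / --numjobs per row):

  # single file, O_DIRECT randwrite, 1kB blocks, iodepth=128
  fio --name=singlefile-1k --filename=/mnt/ext4/fiofile.0 --size=10G \
      --rw=randwrite --bs=1k --direct=1 \
      --ioengine=io_uring --iodepth=128 --numjobs=1 \
      --time_based --runtime=30 --group_reporting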

Now with 4 files: it is necessary to have *both* 4 files and more jobs (numjobs), irrespective of I/O interface and
filesystem, to get at least closer to half of the max IOPS:
 
numjobs=1/ext4/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=28.3k, BW=27.6MiB/s (28.9MB/s)(834MiB/30230msec); 0 zone resets
numjobs=1/ext4/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=57.8k, BW=226MiB/s (237MB/s)(6772MiB/30001msec); 0 zone resets
numjobs=1/ext4/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=17.3k, BW=16.9MiB/s (17.7MB/s)(506MiB/30001msec); 0 zone resets
numjobs=1/ext4/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=61.6k, BW=240MiB/s (252MB/s)(7215MiB/30001msec); 0 zone resets
numjobs=1/xfs/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=24.3k, BW=23.8MiB/s (24.9MB/s)(713MiB/30008msec); 0 zone resets
numjobs=1/xfs/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=54.7k, BW=214MiB/s (224MB/s)(6408MiB/30002msec); 0 zone resets
numjobs=1/xfs/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=22.1k, BW=21.6MiB/s (22.6MB/s)(648MiB/30001msec); 0 zone resets
numjobs=1/xfs/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=65.7k, BW=257MiB/s (269MB/s)(7705MiB/30001msec); 0 zone resets
numjobs=4/ext4/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=34.1k, BW=33.3MiB/s (34.9MB/s)(999MiB/30020msec); 0 zone resets [OK?]
numjobs=4/ext4/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=64.5k, BW=252MiB/s (264MB/s)(7565MiB/30003msec); 0 zone resets
numjobs=4/ext4/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=49.7k, BW=48.5MiB/s (50.9MB/s)(1456MiB/30001msec); 0 zone resets [OK]
numjobs=4/ext4/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=67.1k, BW=262MiB/s (275MB/s)(7874MiB/30037msec); 0 zone resets
numjobs=4/xfs/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=33.9k, BW=33.1MiB/s (34.7MB/s)(994MiB/30026msec); 0 zone resets [OK?]
numjobs=4/xfs/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=67.7k, BW=264MiB/s (277MB/s)(7933MiB/30007msec); 0 zone resets
numjobs=4/xfs/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=61.0k, BW=59.5MiB/s (62.4MB/s)(1786MiB/30001msec); 0 zone resets [OK]
numjobs=4/xfs/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=69.2k, BW=270MiB/s (283MB/s)(8111MiB/30004msec); 0 zone resets
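
The 4-file variant differs only in spreading the same workload over four files (fio accepts a colon-separated filename list) plus numjobs=4 for the bottom half of the listing - again only a sketch with placeholder paths:

  # four files (and four jobs), otherwise identical
  fio --name=fourfiles-1k \
      --filename=/mnt/ext4/f.0:/mnt/ext4/f.1:/mnt/ext4/f.2:/mnt/ext4/f.3 \
      --size=10G --rw=randwrite --bs=1k --direct=1 \
      --ioengine=libaio --iodepth=128 --numjobs=4 \
      --time_based --runtime=30 --group_reporting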

It makes me think this looks like some kind of file/inode <-> process locking (reminder: this is the Direct I/O case) -- note
that even with files=4 and numjobs=1 it doesn't reach the levels it should. One way or another PostgreSQL should be safe for
OLTP - that's the first thought, but on second thought - when thinking about extreme IOPS and the single-threaded
checkpointer / bgwriter / WAL recovery on standbys - I'm not so sure. In potential future I/O API implementations - with
Direct I/O (???) - 1kB and 2kB blocks apparently would be limited unless you parallelize those processes, due to some
internal kernel locking (sigh! - at least that's what the results for the 4 files/numjobs=1/../1k cases indicate; this may
vary across kernel versions as per the earlier link).
 
 
> >>>
> >>> Explanation: it's the CPU scheduler migrations mixing the performance
> >>> result during the runs of fio (as you have in your framework). Various
> >>> VCPUs seem to have varying max IOPS characteristics (sic!) and the CPU
> >>> scheduler seems to be unaware of it. At least on 1kB and 4kB blocksize
> >>> this happens; also notice that some VCPUs [XXXX marker] don't reach
> >>> 100% CPU while reaching almost twice the result, whereas cores 0 and 3
> >>> do reach 100% and lack the CPU power to perform more. The only thing
> >>> that I don't get is that it doesn't make sense from the extended lscpu
> >>> output (but maybe it's AWS XEN mixing real CPU mappings, who knows).
> >>
> >> Uh, that's strange. I haven't seen anything like that, but I'm
> >> running on physical HW and not AWS, so it's either that or maybe I
> >> just didn't do the same test.
> >
> > I couldn't believe it until I checked via taskset 😊 BTW: I don't
> > have real HW with NVMe, but it might be worth checking whether
> > placing (taskset -c ...) fio on a hyperthreading VCPU is what causes it
> > (there's /sys/devices/system/cpu/cpu0/topology/thread_siblings and
> > maybe lscpu(1) output). On AWS I have a feeling that lscpu might simply
> > lie and I cannot identify which VCPU is HT and which isn't.
> 
> Did you see the same issue with io_uring?

Yes, tested today, got similar results (io_uring doesn't change a thing, and BTW it looks like the hypervisor shifts real HW
CPUs to logical VCPUs). After reading https://wiki.xenproject.org/wiki/Hyperthreading (section: Is Xen hyperthreading
aware), I think solid NVMe testing shouldn't be conducted on anything virtualized - I have no control over potentially
noisy CPU-heavy neighbors. So please take my results with a grain of salt, unless somebody reproduces these taskset -c ..
fio tests on properly isolated HW. But another thing: that's where PostgreSQL runs in reality.
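
For completeness, the kind of pinning / HT-sibling check meant here is roughly the following (the CPU number and paths are only examples):

  # which logical CPUs share a core with cpu0 (readable variant of thread_siblings)
  cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
  # extended per-CPU topology view - possibly unreliable on AWS/Xen
  lscpu -e=CPU,CORE,SOCKET,ONLINE
  # pin fio to one chosen VCPU and re-run e.g. the single-file 1k case
  taskset -c 2 fio --name=pinned-1k --filename=/mnt/ext4/fiofile.0 --size=10G \
      --rw=randwrite --bs=1k --direct=1 --ioengine=io_uring --iodepth=128 \
      --time_based --runtime=30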
 

-J.
