RE: pgcon unconference / impact of block size on performance

From: Jakub Wartak
Subject: RE: pgcon unconference / impact of block size on performance
Date:
Msg-id: AM8PR07MB82482E971F018439CF0C4927F6A79@AM8PR07MB8248.eurprd07.prod.outlook.com
In response to: Re: pgcon unconference / impact of block size on performance (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: pgcon unconference / impact of block size on performance (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
> >>>> The really
> >>>> puzzling thing is why is the filesystem so much slower for smaller
> >>>> pages. I mean, why would writing 1K be 1/3 of writing 4K?
> >>>> Why would a filesystem have such effect?
> >>>
> >>> Ha! I don't care at this point as 1 or 2kB seems too small to handle
> >>> many real world scenarios ;)
> > [..]
> >> Independently of that, it seems like an interesting behavior and it
> >> might tell us something about how to optimize for larger pages.
> >
> > OK, curiosity won:
> >
> > With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~90-100
> > (close to fio's 128 queue depth?) and I'm getting ~70k IOPS [with maxdepth=128].
> > With randwrite on ext4 directio using 1kb the avgqu-sz is just 0.7 and I'm
> > getting just ~17-22k IOPS [with maxdepth=128] -> conclusion: something is being
> > locked, preventing the queue from building up.
> > With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~2.3 (so something
> > is queued) and I'm also getting ~70k IOPS with the minimal possible maxdepth=4 ->
> > conclusion: I just need to split the lock contention by 4.
> >
> > The 1kB (slow) profile's top function is aio_write() -> .... -> iov_iter_get_pages()
> > -> internal_get_user_pages_fast(), and sadly there are plenty of "lock" keywords
> > inside {related to the memory manager, padding to full page size, inode locking}.
> > One can also find some articles / commits related to it [1], which didn't leave a
> > good feeling to be honest, as fio is using just 1 file (even though I'm on kernel
> > 5.10.x). So I've switched to 4x files and numjobs=4 and easily got 60k IOPS;
> > contention solved, whatever it was :) So I would assume PostgreSQL (with its
> > splitting of data files on 1GB boundaries by default and its multiprocess
> > architecture) should be relatively safe from such ext4 inode(?)/mm(?) contentions
> > even with the smallest 1kb block sizes on Direct I/O some day.
> >
> 
> Interesting. So what parameter values would you suggest?

At least use 4x filename= entries and numjobs=4, i.e. give each job its own file.
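
Roughly this shape, for the record (just a sketch; file names and size= are illustrative, the rest mirrors the libaio / direct=1 / randwrite / iodepth=128 runs in this thread):

[global]
ioengine=libaio
direct=1
rw=randwrite
bs=1k
iodepth=128
runtime=30
time_based
; size is illustrative
size=4g
group_reporting

; four jobs, each with its own file, so the inode/mm lock contention gets split
[j1]
filename=/mnt/nvme/fio-file-1
[j2]
filename=/mnt/nvme/fio-file-2
[j3]
filename=/mnt/nvme/fio-file-3
[j4]
filename=/mnt/nvme/fio-file-4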

> FWIW some of the tests I did were on xfs, so I wonder if that might be hitting
> similar/other bottlenecks.

Apparently XFS also shows the same contention on a single file for 1..2kB randwrite, see the lines marked [ZZZ] below.

[root@x ~]# mount|grep /mnt/nvme
/dev/nvme0n1 on /mnt/nvme type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)

# using 1 fio job and 1 file
[root@x ~]# grep -r -e 'read :' -e 'write:' libaio
libaio/nvme/randread/128/1k/1.txt:  read : io=5779.1MB, bw=196573KB/s, iops=196573, runt= 30109msec
libaio/nvme/randread/128/2k/1.txt:  read : io=10335MB, bw=352758KB/s, iops=176379, runt= 30001msec
libaio/nvme/randread/128/4k/1.txt:  read : io=22220MB, bw=758408KB/s, iops=189601, runt= 30001msec
libaio/nvme/randread/128/8k/1.txt:  read : io=28914MB, bw=986896KB/s, iops=123361, runt= 30001msec
libaio/nvme/randwrite/128/1k/1.txt:  write: io=694856KB, bw=23161KB/s, iops=23161, runt= 30001msec [ZZZ]
libaio/nvme/randwrite/128/2k/1.txt:  write: io=1370.7MB, bw=46782KB/s, iops=23390, runt= 30001msec [ZZZ]
libaio/nvme/randwrite/128/4k/1.txt:  write: io=8261.3MB, bw=281272KB/s, iops=70318, runt= 30076msec [OK]
libaio/nvme/randwrite/128/8k/1.txt:  write: io=11598MB, bw=394320KB/s, iops=49289, runt= 30118msec

# but it's all ok using 4 fio jobs and 4 files
[root@x ~]# grep -r -e 'read :' -e 'write:' libaio
libaio/nvme/randread/128/1k/1.txt:  read : io=6174.6MB, bw=210750KB/s, iops=210750, runt= 30001msec
libaio/nvme/randread/128/2k/1.txt:  read : io=12152MB, bw=413275KB/s, iops=206637, runt= 30110msec
libaio/nvme/randread/128/4k/1.txt:  read : io=24382MB, bw=832116KB/s, iops=208028, runt= 30005msec
libaio/nvme/randread/128/8k/1.txt:  read : io=29281MB, bw=985831KB/s, iops=123228, runt= 30415msec
libaio/nvme/randwrite/128/1k/1.txt:  write: io=1692.2MB, bw=57748KB/s, iops=57748, runt= 30003msec
libaio/nvme/randwrite/128/2k/1.txt:  write: io=3601.9MB, bw=122940KB/s, iops=61469, runt= 30001msec
libaio/nvme/randwrite/128/4k/1.txt:  write: io=8470.8MB, bw=285857KB/s, iops=71464, runt= 30344msec
libaio/nvme/randwrite/128/8k/1.txt:  write: io=11449MB, bw=390603KB/s, iops=48825, runt= 30014msec
 

> >>> Both scenarios (raw and fs) have had direct=1 set. I just cannot understand
> >>> how having direct I/O enabled (which disables caching) achieves better read
> >>> IOPS on ext4 than on a raw device... isn't it a contradiction?
> >>>
> >>
> >> Thanks for the clarification. Not sure what might be causing this.
> >> Did you use the same parameters (e.g. iodepth) in both cases?
> >
> > Explanation: it's the CPU scheduler migrations mixing up the performance results
> > during the fio runs (as you have in your framework). Various VCPUs seem to have
> > varying max IOPS characteristics (sic!) and the CPU scheduler seems to be unaware
> > of it. At least on 1kB and 4kB blocksize this happens; also notice that some VCPUs
> > [XXXX marker] don't reach 100% CPU yet achieve almost twice the result, while
> > cores 0 and 3 do reach 100% and lack the CPU power to perform more. The only thing
> > I don't get is that it doesn't make sense given the extended lscpu output (but
> > maybe it's AWS XEN mixing up the real CPU mappings, who knows).
> 
> Uh, that's strange. I haven't seen anything like that, but I'm running on physical
> HW and not AWS, so it's either that or maybe I just didn't do the same test.

I couldn't believe it until I checked with taskset 😊 BTW: I don't have real HW with NVMe, but it might be worth checking
whether placing fio (via taskset -c ...) on a hyperthreading VCPU is what's causing it (there's
/sys/devices/system/cpu/cpu0/topology/thread_siblings and maybe the lscpu(1) output to go by). On AWS I have a feeling
that lscpu might simply lie and I cannot identify which VCPU is HT and which isn't.
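
Something along these lines is what I mean (the CPU number and the job file name are just placeholders):

lscpu -e                                                         # CORE column shows which VCPUs claim to share a physical core
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # HT sibling(s) of cpu0
taskset -c 0 fio randwrite-1k.fio                                # pin fio to a single VCPU and compare runs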
 

-J.
