Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

From Tobias Oberstein
Subject Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..
Date
Msg-id a55b21d1-7c99-2c66-d661-ef5288f29e30@gmail.com
In reply to Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..  (Andres Freund <andres@anarazel.de>)
Responses Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

>>  pid |                syscall                |   cnt   | cnt_per_sec
>> -----+---------------------------------------+---------+-------------
>>      | syscalls:sys_enter_lseek              | 4091584 |      136386
>>      | syscalls:sys_enter_newfstat           | 2054988 |       68500
>>      | syscalls:sys_enter_read               |  767990 |       25600
>>      | syscalls:sys_enter_close              |  503803 |       16793
>>      | syscalls:sys_enter_newstat            |  434080 |       14469
>>      | syscalls:sys_enter_open               |  380382 |       12679
>>
>> Note: there isn't a lot of load currently (this is from production).
>
> That doesn't really mean that much - sure it shows that lseek is
> frequent, but it doesn't tell you how much impact this has to the

Above is on a mostly idle system ("idle" for our loads) .. when things 
get hot, lseek calls can reach into the millions/sec.

Doing 5 million syscalls per sec comes with overhead no matter how 
lightweight the syscall is, doesn't it?

Using pread instead of lseek+read halves the syscalls.
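
To make the syscall count concrete, here is a minimal sketch in C of the two ways to read a block at a given offset. It is not PostgreSQL code: the function names (read_block_seek, read_block_pread, reads_agree) are made up for illustration, and the temp-file path is an assumption. The first variant issues two syscalls per block, the second only one.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Two syscalls per block: position the file offset, then read from it. */
ssize_t read_block_seek(int fd, void *buf, size_t len, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, len);
}

/* One syscall per block: read at an explicit offset.
 * The descriptor's file offset is not changed by pread. */
ssize_t read_block_pread(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}

/* Hypothetical self-check: both variants must return the same bytes.
 * Returns 1 on success, 0 on any failure. */
int reads_agree(void)
{
    char path[] = "/tmp/pread_demo_XXXXXX";
    char data[64], a[16], b[16];
    int fd = mkstemp(path);
    int ok;

    if (fd < 0)
        return 0;
    unlink(path);               /* file goes away once fd is closed */

    for (size_t i = 0; i < sizeof data; i++)
        data[i] = (char) i;
    if (write(fd, data, sizeof data) != (ssize_t) sizeof data) {
        close(fd);
        return 0;
    }

    ok = read_block_seek(fd, a, sizeof a, 32) == (ssize_t) sizeof a
      && read_block_pread(fd, b, sizeof b, 32) == (ssize_t) sizeof b
      && memcmp(a, b, sizeof a) == 0
      && memcmp(a, data + 32, sizeof a) == 0;

    close(fd);
    return ok;
}
```

Both functions return the same data; the only difference visible to the kernel is one syscall entry instead of two for every block read.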

I really don't understand what you are fighting here ..

> overall workload.  For that'd you'd need a generic (i.e. not syscall
> tracepoint, but cpu cycle) perf profile, and look in the call graph (via
> perf report --children) how much of that is below the lseek syscall.

I see. I might find time to extend our helper function f_perf_syscalls.

>>>>> I'm much less against this change than Tom, but doing artificial syscall
>>>>> microbenchmark seems unlikely to make a big case for using it in
>>>>
>>>> This isn't a syscall benchmark, but FIO.
>>>
>>> There's not really a difference between those, when you use fio to
>>> benchmark seek vs pseek.
>>
>> Sorry, I don't understand what you are talking about.
>
> Fio as you appear to have used is a microbenchmark benchmarking
> individual syscalls.

I am benchmarking IOPS, and while doing so, it becomes apparent that at 
these scales it does matter _how_ IO is done.

The most efficient way is libaio. I get 9.7 million IOPS with low 
CPU load. Using any synchronous IO engine is slower and produces higher 
load.

I do understand that switching to libaio isn't going to fly for PG 
(completely different approach). But doing pread instead of lseek+read 
seems simple enough. But then, I don't know about the PG codebase ..

Among the synchronous methods of doing IO, psync is much better than sync.

pvsync, pvsync2 and pvsync2 + hipri (busy polling, no interrupts) are 
better, but the gain is smaller, and all of them are inferior to libaio.
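
For reference, fio's pvsync engine is built on preadv/pwritev, the vectored counterparts of pread/pwrite. The following is a minimal sketch in C of a vectored positional read, assuming Linux with glibc; the names read_two_blocks and vectored_read_ok and the temp-file path are hypothetical, and this does not reproduce the fio benchmark itself.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Fill two separate buffers from consecutive file ranges starting at
 * `offset`, using a single preadv syscall. */
ssize_t read_two_blocks(int fd, void *first, void *second,
                        size_t blksz, off_t offset)
{
    struct iovec iov[2];

    iov[0].iov_base = first;
    iov[0].iov_len = blksz;
    iov[1].iov_base = second;
    iov[1].iov_len = blksz;
    return preadv(fd, iov, 2, offset);
}

/* Hypothetical self-check: returns 1 if both buffers hold the expected
 * bytes, 0 on any failure. */
int vectored_read_ok(void)
{
    char path[] = "/tmp/preadv_demo_XXXXXX";
    char data[64], a[16], b[16];
    int fd = mkstemp(path);
    int ok;

    if (fd < 0)
        return 0;
    unlink(path);               /* file goes away once fd is closed */

    for (size_t i = 0; i < sizeof data; i++)
        data[i] = (char) i;
    if (write(fd, data, sizeof data) != (ssize_t) sizeof data) {
        close(fd);
        return 0;
    }

    ok = read_two_blocks(fd, a, b, sizeof a, 16) == (ssize_t) (2 * sizeof a)
      && memcmp(a, data + 16, sizeof a) == 0
      && memcmp(b, data + 32, sizeof b) == 0;

    close(fd);
    return ok;
}
```

pvsync2 corresponds to preadv2/pwritev2, which take an additional flags argument; that variant is not shown here.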

>>> Glad to hear it.
>>
>> With 3TB RAM, huge pages is absolutely essential (otherwise, the system bogs
>> down in TLB etc overhead).
>
> I was one of the people working on adding hugepage support to pg, that's
> why I was glad ;)

Ahh;) Sorry, wasn't aware. This is really invaluable. Thanks for that!

Cheers,
/Tobias



