Re: automatic analyze: readahead - add "IO read time" log message

From: Stephen Frost
Subject: Re: automatic analyze: readahead - add "IO read time" log message
Msg-id: 20201026154414.GY16415@tamriel.snowman.net
In response to: automatic analyze: readahead - add "IO read time" log message  (Jakub Wartak <Jakub.Wartak@tomtom.com>)
List: pgsql-hackers
Greetings,

* Jakub Wartak (Jakub.Wartak@tomtom.com) wrote:
> I have, I hope, an interesting observation (and a nano patch proposal) from a system where statistics freshness is a
> critical factor. Autovacuum/statistics auto-gathering was tuned to be very aggressive:
> autovacuum_vacuum_cost_delay=0 (makes autovacuum_vacuum_cost_limit irrelevant)
> autovacuum_naptime=1s
> autovacuum_max_workers=4
>
> Some critical table partitions are configured with autovacuum_analyze_scale_factor=0.001 and
> autovacuum_analyze_threshold=50000 to force auto-analyze jobs pretty often. The interesting logs are like this:
> automatic analyze of table "t1" system usage: CPU: user: 37.52 s, system: 23.01 s, elapsed: 252.14 s
> automatic analyze of table "t2" system usage: CPU: user: 38.70 s, system: 22.63 s, elapsed: 317.33 s
> automatic analyze of table "t2" system usage: CPU: user: 39.38 s, system: 21.43 s, elapsed: 213.58 s
> automatic analyze of table "t1" system usage: CPU: user: 37.91 s, system: 24.49 s, elapsed: 229.45 s

That's certainly pretty aggressive. :)

> and this is the root cause of my question.  As you can see, there is a huge 3x-4x discrepancy between "elapsed" and
> user+system, which is strange at least to me, as there should be no waiting (autovacuum_vacuum_cost_delay=0).
> According to various tools that is true: time was wasted somewhere else, but not in the PostgreSQL analyze itself.
> ps(1) and pidstat(1) also report the same for the worker:

The user/system time there is time-on-CPU (hence the 'CPU: ' prefix).

> 06:56:12 AM       PID    %usr %system  %guest    %CPU   CPU  Command
> 06:56:13 AM    114774    8.00   10.00    0.00   18.00    18  postgres
> 06:56:14 AM    114774    8.00   11.00    0.00   19.00    15  postgres
> 06:56:15 AM    114774    5.00   13.00    0.00   18.00    18  postgres
>
> 06:56:17 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
> 06:56:18 AM    114774  63746.53      0.00      0.00  postgres
> 06:56:19 AM    114774  62896.00      0.00      0.00  postgres
> 06:56:20 AM    114774  62920.00      0.00      0.00  postgres
>
> One could argue that such an autoanalyze worker could perform 5x more work (%CPU -> 100%) here. The I/O system is not
> doing much (total = 242MB/s reads @ 22k IOPS, 7MB/s writes @ 7k IOPS, with service time 0.04ms); although it reports
> high utilization, I'm pretty sure it could push much more. There can be up to 3x-4x such 70-80MB/s analyzes running in
> parallel (let's say 300MB/s for statistics collection alone).

The analyze is doing more-or-less random i/o since it's skipping through
the table picking out select blocks, not doing regular sequential i/o.
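
As a rough illustration of that access pattern (a simplified sketch only,
not the actual BlockSampler code in analyze.c; the helper name and the
numbers in main() are made up), the sampler effectively picks blocks in
ascending order with large gaps, which is easy for a readahead heuristic
to mistake for a sequential scan:

/*
 * Simplified sketch (not the real BlockSampler): choose "targblocks" of
 * "nblocks" in ascending order with equal probability (Knuth's
 * Algorithm S).  The chosen block numbers increase monotonically but
 * skip most of the table, so the resulting pread() offsets look "almost
 * sequential" to the kernel.
 */
#include <stdio.h>
#include <stdlib.h>

static void
sample_blocks(unsigned nblocks, unsigned targblocks)
{
    unsigned    remaining = nblocks;    /* blocks not yet considered */
    unsigned    needed = targblocks;    /* blocks still to choose */

    for (unsigned blk = 0; blk < nblocks && needed > 0; blk++)
    {
        /* keep this block with probability needed/remaining */
        if ((double) rand() / ((double) RAND_MAX + 1.0) <
            (double) needed / (double) remaining)
        {
            printf("read block %u\n", blk);     /* stands in for ReadBufferExtended() */
            needed--;
        }
        remaining--;
    }
}

int
main(void)
{
    sample_blocks(1000, 50);    /* the report had ~1.5M target rows over ~26.7M blocks */
    return 0;
}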

> According to various gdb backtraces, a lot of time is spent here:
> #0  0x00007f98cdfc9f73 in __pread_nocancel () from /lib64/libpthread.so.0
> #1  0x0000000000741a16 in pread (__offset=811253760, __nbytes=8192, __buf=0x7f9413ab7280, __fd=<optimized out>)
>     at /usr/include/bits/unistd.h:84
> #2  FileRead (file=<optimized out>, buffer=0x7f9413ab7280 "\037\005", amount=8192, offset=811253760,
>     wait_event_info=167772173) at fd.c:1883
> #3  0x0000000000764b8f in mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=19890902,
>     buffer=0x7f9413ab7280 "\037\005") at md.c:596
> #4  0x000000000073d69b in ReadBuffer_common (smgr=<optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM,
>     blockNum=19890902, mode=RBM_NORMAL, strategy=0x1102278, hit=0x7fffba7e2d4f)
>     at bufmgr.c:897
> #5  0x000000000073e27e in ReadBufferExtended (reln=0x7f98c0c9ded0, forkNum=MAIN_FORKNUM, blockNum=19890902,
>     mode=<optimized out>, strategy=<optimized out>) at bufmgr.c:665
> #6  0x00000000004c2e2f in heapam_scan_analyze_next_block (scan=<optimized out>, blockno=<optimized out>,
>     bstrategy=<optimized out>) at heapam_handler.c:998
> #7  0x0000000000597de1 in table_scan_analyze_next_block (bstrategy=<optimized out>, blockno=<optimized out>,
>     scan=0x10c8098) at ../../../src/include/access/tableam.h:1462
> #8  acquire_sample_rows (onerel=0x7f98c0c9ded0, elevel=13, rows=0x1342e08, targrows=1500000,
>     totalrows=0x7fffba7e3160, totaldeadrows=0x7fffba7e3158) at analyze.c:1048
> #9  0x0000000000596a50 in do_analyze_rel (onerel=0x7f98c0c9ded0, params=0x10744e4, va_cols=0x0, acquirefunc=0x597ca0
>     <acquire_sample_rows>, relpages=26763525, inh=false,
>     in_outer_xact=false, elevel=13) at analyze.c:502
> [..]
> #12 0x00000000006e76b4 in autovacuum_do_vac_analyze (bstrategy=0x1102278, tab=<optimized out>) at autovacuum.c:3118
> [..]

Sure, we're blocked on a read call trying to get the next block.

> The interesting thing is that targrows is 1.5 million and that blocks are accessed (as expected) in sorted order:
>
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890935, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890996, bstrategy=0x1102278) at heapam_handler.c:984

Not really sure what's interesting here, but it does look like we're
skipping through the table as expected.

> And this probably explains the discrepancy: perf with CPU usage reporting shows a lot of frames waiting on I/O in
> readaheads done by ext4, trimmed for clarity:
>
> # Children      Self       sys       usr  Command   Shared Object       Symbol
>     63.64%     0.00%     0.00%     0.00%  postgres  [kernel.kallsyms]   [k] entry_SYSCALL_64_after_hwframe
>             ---entry_SYSCALL_64_after_hwframe
>                do_syscall_64
>                |--59.66%--sys_pread64
>                |          vfs_read
>                |           --59.09%--__vfs_read
>                |                      --58.24%--generic_file_read_iter
>                |                                |--47.44%--ondemand_readahead
>                |                                |           --46.88%--__do_page_cache_readahead
>                |                                |                     |--32.67%--ext4_mpage_readpages
>                |                                |                     |          |--16.76%--submit_bio
>                |                                |                     |--10.23%--blk_finish_plug
> [..]
>     63.64%     1.99%     1.99%     0.00%  postgres  [kernel.kallsyms]   [k] do_syscall_64
>             |--61.65%--do_syscall_64
>             |          |--59.66%--sys_pread64
>             |          |          vfs_read
>             |          |           --59.09%--__vfs_read
>             |          |                      --58.24%--generic_file_read_iter
>             |          |                                |--47.44%--ondemand_readahead
>             |          |                                |           --46.88%--__do_page_cache_readahead
>
>     61.36%     0.00%     0.00%     0.00%  postgres  postgres            [.] FileRead
>             ---FileRead
>                __pread_nocancel
>                 --60.51%--entry_SYSCALL_64_after_hwframe
>                           do_syscall_64
>                            --59.66%--sys_pread64
>                                      vfs_read
>                                       --59.09%--__vfs_read
>                                                  --58.24%--generic_file_read_iter
>                                                            |--47.44%--ondemand_readahead
>                                                            |           --46.88%--__do_page_cache_readahead
>
>     61.36%     0.85%     0.00%     0.85%  postgres  libpthread-2.17.so  [.] __pread_nocancel
>             |--60.51%--__pread_nocancel
>             |          entry_SYSCALL_64_after_hwframe
>             |          do_syscall_64
>             |           --59.66%--sys_pread64
>             |                     vfs_read
>             |                      --59.09%--__vfs_read
>             |                                 --58.24%--generic_file_read_iter
>             |                                           |--47.44%--ondemand_readahead
>             |                                           |           --46.88%--__do_page_cache_readahead
>
>
>     59.66%     0.00%     0.00%     0.00%  postgres  [kernel.kallsyms]   [k] sys_pread64
>             ---sys_pread64
>                vfs_read
>                 --59.09%--__vfs_read
>                            --58.24%--generic_file_read_iter
>                                      |--47.44%--ondemand_readahead
>                                      |           --46.88%--__do_page_cache_readahead
>                                      |                     |--32.67%--ext4_mpage_readpages
>

With all those 'readahead' calls it certainly makes one wonder if the
Linux kernel is reading more than just the block we're looking for
because it thinks we're doing a sequential read and will therefore want
the next few blocks when, in reality, we're going to skip past them,
meaning that any readahead the kernel is doing is likely just wasted
I/O.
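
One quick way to test that theory outside of PostgreSQL is to tell the
kernel up front that the access will be random.  A minimal standalone
sketch (the file path and the every-10th-block stride are arbitrary
examples, not anything PostgreSQL does):

/*
 * POSIX_FADV_RANDOM disables kernel readahead on the descriptor, so
 * only the 8kB pages actually requested get read.
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int
main(void)
{
    char        buf[8192];
    int         fd = open("/path/to/some/large/file", O_RDONLY);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* hint that the access pattern is random: no readahead */
    if (posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) != 0)
        perror("posix_fadvise");

    /* read every 10th 8kB block, roughly like a sparse ANALYZE sample */
    for (off_t off = 0; pread(fd, buf, sizeof(buf), off) > 0; off += 10 * 8192)
        ;

    close(fd);
    return 0;
}

Comparing the run time and the amount of data the device actually reads
with and without the fadvise call should show how much of the I/O is
readahead waste.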

> [..]
> Perf with --no-children also confirms, for a third time, that there isn't any single function burning a lot of CPU inside the worker:
>
> # Overhead       sys       usr  Command   Shared Object       Symbol
>      5.40%     0.00%     5.40%  postgres  [vdso]              [.] __vdso_clock_gettime
>      5.11%     0.00%     5.11%  postgres  postgres            [.] acquire_sample_rows
>             ---acquire_sample_rows
>      3.98%     0.00%     3.98%  postgres  postgres            [.] heapam_scan_analyze_next_tuple
>             ---heapam_scan_analyze_next_tuple
>      3.69%     3.69%     0.00%  postgres  [kernel.kallsyms]   [k] pvclock_clocksource_read

Sure, makes sense.

> My questions are:
> a) does anybody know if it is expected that getrusage() doesn't include readahead time as the current thread's system
> time? (I don't know; maybe it is performed by other kernel threads?) ru_stime is defined as "This is the total amount
> of time spent executing in kernel mode". Maybe "executing" is the keyword here? (waiting != executing?)

getrusage()'s user/system CPU times are reporting time-on-CPU, not
counting time blocking for i/o.  Waiting isn't the same as executing,
no.
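
A tiny demonstration of that distinction (nothing PostgreSQL-specific;
sleep() simply stands in for being blocked off-CPU, as a process waiting
on I/O would be):

/*
 * getrusage() user/system times count only time actually executing on a
 * CPU; a process that spends its life blocked shows a large elapsed
 * time but near-zero rusage times.
 */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <unistd.h>

int
main(void)
{
    struct timeval  start, end;
    struct rusage   ru;

    gettimeofday(&start, NULL);
    sleep(3);                   /* blocked, off-CPU: not "executing" */
    gettimeofday(&end, NULL);
    getrusage(RUSAGE_SELF, &ru);

    printf("elapsed: %.2f s\n",
           (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6);
    printf("user: %.2f s, system: %.2f s\n",
           ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6,
           ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6);
    return 0;
}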

> b) initially I wanted to add a new pg_rusage_show_verbose() that would also report ru_inblock, but that wouldn't add
> much value for the end user. Adding another timing directly around table_scan_analyze_next_block() also seems like a
> bad idea, as it involves locking underneath. So I've tried the easiest approach: simply log pgStatBlockReadTime as
> strictly the I/O time spent in pread() (ReadBuffer_common() already measures it). The attached patch for PgSQL14-devel
> under heavy I/O conditions (with track_io_timing=on) logs the following:
> "LOG:  automatic analyze of table "test.public.t1_default" system usage: IO read time 0.69 s, CPU: user: 0.18 s,
> system: 0.13 s, elapsed: 0.92 s"

That definitely seems like a useful thing to include and thanks for the
patch!  Please be sure to register it in the commitfest app:
https://commitfest.postgresql.org

> my interpretation would be that IO read time was the most limiting factor (0.69/0.92 = 75%), while *CPU* time on the
> kernel side was just 0.13 s. It could give the end user/DBA the needed information about where the bottleneck is,
> given autovacuum_vacuum_cost_delay=0. With autovacuum_vacuum_cost_delay>0, maybe it would also make sense to include
> time spent sleeping?

Yeah, that would certainly be useful.

> c) I'm curious if anybody has any I/O-related insights into analyze.c processing, especially related to readahead.
> E.g. maybe disabling readahead would help the PostgreSQL analyze.c use case on NVMe? Is it worth it, given that only
> x% of the blocks are needed? The only option I'm aware of would be to e.g. hash-partition the table (to introduce
> parallelism via autovacuum and spread the work across more workers). Any hints or comments?

I would think that, ideally, we'd teach analyze.c to work in the same
way that bitmap heap scans do- that is, use posix_fadvise to let the
kernel know what pages we're going to want next instead of the kernel
guessing (incorrectly) or not doing any pre-fetching.  I didn't spend a
lot of time poking, but it doesn't look like analyze.c tries to do any
prefetching today.  In a similar vein, I wonder if VACUUM should be
doing prefetching too, at least when it's skipping through the
heap based on the visibility map and jumping over all-frozen pages.
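
Very roughly, and purely as a hypothetical sketch (analyze.c has no such
code today, and the function name, the "upcoming" array, and the
prefetch-distance constant below are made up for illustration), the idea
would be to issue PrefetchBuffer() hints for blocks the sampler already
knows it will visit, much like nodeBitmapHeapscan.c does:

/*
 * Hypothetical sketch only.  Under USE_PREFETCH builds, PrefetchBuffer()
 * boils down to posix_fadvise(POSIX_FADV_WILLNEED) for the block, so
 * the kernel can start reading it before we actually call
 * ReadBufferExtended() for it.
 */
#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/rel.h"

#define ANALYZE_PREFETCH_DISTANCE 32    /* hypothetical tuning knob */

static void
prefetch_sample_blocks(Relation onerel, BlockNumber *upcoming, int nupcoming)
{
#ifdef USE_PREFETCH
    /* hint the kernel about the next few blocks the sampler will read */
    for (int i = 0; i < nupcoming && i < ANALYZE_PREFETCH_DISTANCE; i++)
        PrefetchBuffer(onerel, MAIN_FORKNUM, upcoming[i]);
#endif
}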

> All of the above observations are from PostgreSQL 12.4 on Linux kernel 4.14, with ext4 on a striped dm over 3x-4x NVMe drives.
>
> -Jakub Wartak.

> diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
> index 8af12b5c6b..fea1bd6f44 100644
> --- a/src/backend/commands/analyze.c
> +++ b/src/backend/commands/analyze.c
> @@ -312,6 +312,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
>      Oid            save_userid;
>      int            save_sec_context;
>      int            save_nestlevel;
> +    PgStat_Counter startblockreadtime = 0;
>
>      if (inh)
>          ereport(elevel,
> @@ -347,6 +348,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
>      if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
>      {
>          pg_rusage_init(&ru0);
> +        startblockreadtime = pgStatBlockReadTime;
>          if (params->log_min_duration > 0)
>              starttime = GetCurrentTimestamp();
>      }
> @@ -686,10 +688,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
>              TimestampDifferenceExceeds(starttime, GetCurrentTimestamp(),
>                                         params->log_min_duration))
>              ereport(LOG,
> -                    (errmsg("automatic analyze of table \"%s.%s.%s\" system usage: %s",
> +                    (errmsg("automatic analyze of table \"%s.%s.%s\" system usage: IO read time %.2f s, %s",
>                              get_database_name(MyDatabaseId),
>                              get_namespace_name(RelationGetNamespace(onerel)),
>                              RelationGetRelationName(onerel),
> +                            (double) (pgStatBlockReadTime - startblockreadtime)/1000000,
>                              pg_rusage_show(&ru0))));
>      }
>

Haven't looked too closely at this but in general +1 on the idea and
this approach looks pretty reasonable to me.  Only thing I can think of
off-hand is to check how it compares to other places where we report IO
read time and make sure that it looks similar.
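
For reference, pgStatBlockReadTime is accumulated in microseconds, and
other read-time reporting (e.g. pg_stat_database.blk_read_time) exposes
milliseconds, so one illustrative alternative (a fragment varying the
same hunk as the patch above, not a requirement) would be to match that
unit:

/* report the delta in milliseconds, matching blk_read_time elsewhere */
double      read_ms = (double) (pgStatBlockReadTime - startblockreadtime) / 1000.0;

ereport(LOG,
        (errmsg("automatic analyze of table \"%s.%s.%s\" system usage: read time %.3f ms, %s",
                get_database_name(MyDatabaseId),
                get_namespace_name(RelationGetNamespace(onerel)),
                RelationGetRelationName(onerel),
                read_ms,
                pg_rusage_show(&ru0))));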

Thanks,

Stephen
