Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

From Cedric Villemain
Subject Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED
Date
Msg-id 2f91cf01-3ce6-45c6-aa9d-72e4953264df@abcSQL.com
Responses Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED
List pgsql-hackers
Hi,

I wonder what you think of making pg_prewarm use the recent additions to
smgrprefetch and readv?


To try it out, I implemented it in the attached patches. They contain
no documentation update yet, but I will add one if there is interest.

In summary:

1. The first one adds a new check on the parameters (checking that the last
block is not before the first block).
As a consequence, an ERROR is raised instead of silently doing nothing.

2. The second one uses smgrprefetch with a range and, by default, loops
per segment so that there is still a check for interrupts.

3. The third one uses smgrreadv instead of smgrread, by default on
a range of 8 buffers. I am not at all sure that I used readv correctly...
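
To make the three points above concrete, here is a minimal sketch in Python (not the patch code, which is C). The names check_block_range, segment_ranges and read_batches are hypothetical, and RELSEG_SIZE = 131072 assumes a default build (1 GB segments, 8 kB blocks). Whether the patches split ranges exactly this way is an assumption.

```python
RELSEG_SIZE = 131072  # assumption: blocks per segment (1 GB / 8 kB)


def check_block_range(first_block, last_block, nblocks):
    """Raise instead of silently doing nothing on an inverted range."""
    if not 0 <= first_block < nblocks:
        raise ValueError("starting block number out of range")
    if last_block < first_block:
        raise ValueError("ending block cannot be before starting block")
    if last_block >= nblocks:
        raise ValueError("ending block number out of range")


def segment_ranges(first_block, last_block, seg_size=RELSEG_SIZE):
    """Yield (start, count) pairs that never cross a segment boundary.

    One prefetch call per pair leaves room for an interrupt check
    between segments.
    """
    block = first_block
    while block <= last_block:
        seg_last = (block // seg_size + 1) * seg_size - 1
        end = min(seg_last, last_block)
        yield block, end - block + 1
        block = end + 1


def read_batches(first_block, last_block, batch=8):
    """Yield (start, count) pairs of at most `batch` blocks for vectored reads."""
    block = first_block
    while block <= last_block:
        count = min(batch, last_block - block + 1)
        yield block, count
        block += count
```

For a relation of 132744 blocks, segment_ranges(0, 132743) yields one range for segment 0 and one range of 1672 blocks for segment 1.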

Q: posix_fadvise may not work exactly the way you think it does, or does
it?


In detail, and regarding the question:

It's not so obvious that the "feature" is really required or wanted,
depending on what users expect.

The kernel decides what to do with posix_fadvise calls, and how we
pass the parameters does affect that decision.
In the current situation, where prefetch is done block by block, all
blocks are most likely loaded, even if those from the beginning of the
relation may already have been discarded by the time the prefetch
ends.

However, if you instead pass a real range, or the magic len=0, to
posix_fadvise, then how much gets loaded depends "more" on the effective
VM pressure (which is not the case in the previous example).
As a result, only a small part of the relation might be loaded, and this
is probably not what end users expect, despite probably being a good
choice (you can still free cache beforehand to help the kernel).
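
The three call shapes discussed above can be sketched with Python's os.posix_fadvise (Linux only). This is an illustration on a scratch file, not pg_prewarm code. The function names are hypothetical and BLCKSZ = 8192 assumes the default block size. The calls only pass advice, and what the kernel actually loads is up to the kernel.

```python
import os
import tempfile

BLCKSZ = 8192  # assumption: default PostgreSQL block size


def prefetch_block_by_block(fd, first_block, nblocks):
    """One WILLNEED call per 8 kB block; returns the number of calls made."""
    for blkno in range(first_block, first_block + nblocks):
        os.posix_fadvise(fd, blkno * BLCKSZ, BLCKSZ, os.POSIX_FADV_WILLNEED)
    return nblocks


def prefetch_range(fd, first_block, nblocks):
    """A single WILLNEED call covering the whole range."""
    os.posix_fadvise(fd, first_block * BLCKSZ, nblocks * BLCKSZ,
                     os.POSIX_FADV_WILLNEED)
    return 1


def prefetch_to_eof(fd, first_block=0):
    """len=0 means 'from offset to the end of the file'."""
    os.posix_fadvise(fd, first_block * BLCKSZ, 0, os.POSIX_FADV_WILLNEED)
    return 1


def demo(nblocks=16):
    """Run the three variants on a scratch file; returns the call counts."""
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * (nblocks * BLCKSZ))
        f.flush()
        fd = f.fileno()
        return (prefetch_block_by_block(fd, 0, nblocks),
                prefetch_range(fd, 0, nblocks),
                prefetch_to_eof(fd))
```

Calling demo() issues 16 calls for the first variant and a single call for each of the other two. The resulting cache residency has to be checked separately, for instance with cachestat.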

As an example, below I use vm_relation_cachestat(), which provides Linux
cachestat output, vm_relation_fadvise() to unload the cache, and
pg_prewarm for the demo:

# Cache cleared (nr_cache is the number of file system pages in cache,
not postgres blocks):

```
postgres=# select block_start, block_count, nr_pages, nr_cache from vm_relation_cachestat('foo',range:=1024*32);
 block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
           0 |       32768 |    65536 |        0
       32768 |       32768 |    65536 |        0
       65536 |       32768 |    65536 |        0
       98304 |       32768 |    65536 |        0
      131072 |        1672 |     3344 |        0
```
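
The relation between the block_count and nr_pages columns can be checked with a few lines of arithmetic. This sketch assumes 8 kB postgres blocks and 4 kB file system pages, which is consistent with the output above. The function names are hypothetical and do not come from the extension.

```python
BLCKSZ = 8192     # assumption: postgres block size in bytes
PAGE_SIZE = 4096  # assumption: file system page size in bytes


def pages_for_blocks(nblocks):
    """Number of file system pages covered by nblocks postgres blocks."""
    return nblocks * BLCKSZ // PAGE_SIZE


def cachestat_rows(total_blocks, range_blocks):
    """(block_start, block_count, nr_pages) for each reported range."""
    rows = []
    for start in range(0, total_blocks, range_blocks):
        count = min(range_blocks, total_blocks - start)
        rows.append((start, count, pages_for_blocks(count)))
    return rows
```

With total_blocks = 132744 and range_blocks = 1024*32, this reproduces the five rows shown: four full ranges of 32768 blocks (65536 pages each) and a last one of 1672 blocks (3344 pages).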

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
 pg_prewarm
------------
     132744
(1 row)
```

# Checking results:

```
postgres=# select block_start, block_count, nr_pages, nr_cache from vm_relation_cachestat('foo',range:=1024*32);
 block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
           0 |       32768 |    65536 |      320
       32768 |       32768 |    65536 |        0
       65536 |       32768 |    65536 |        0
       98304 |       32768 |    65536 |        0
      131072 |        1672 |     3344 |      320  <-- segment 1

```

# Load block by block and check:

```
postgres=# select from generate_series(0, 132743) g(n), lateral pg_prewarm('foo','prefetch', 'main', n, n);
postgres=# select block_start, block_count, nr_pages, nr_cache from vm_relation_cachestat('foo',range:=1024*32);
 block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
           0 |       32768 |    65536 |    65536
       32768 |       32768 |    65536 |    65536
       65536 |       32768 |    65536 |    65536
       98304 |       32768 |    65536 |    65536
      131072 |        1672 |     3344 |     3344

```

The difference in duration is also really significant: prefetching the
full relation takes 0.3ms, while going block by block takes 1550ms!
You might think it's because of generate_series or whatever, but I see
the exact same behavior with pgfincore.
I can compare loading and unloading durations for similar "async" work.
Here each call starts from block 0 with a len of 132744 and a range of 1
block (i.e. posix_fadvise on 8kB at a time).
So they perform exactly the same number of DONTNEED or WILLNEED
operations, but the duration differs on the first "load":

```
postgres=# select * from vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
 vm_relation_fadvise
---------------------

(1 row)

Time: 25.202 ms
postgres=# select * from vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
 vm_relation_fadvise
---------------------

(1 row)

Time: 1523.636 ms (00:01.524) <----- not free !
postgres=# select * from vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
 vm_relation_fadvise
---------------------

(1 row)

Time: 24.967 ms
```
```
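
For anyone wanting to reproduce this kind of timing outside PostgreSQL, here is a rough sketch in Python (Linux only). fadvise_in_steps is a hypothetical helper that mimics the (first block, length, range, advice) shape of the calls above. It is not the extension's implementation, and BLCKSZ = 8192 is an assumption.

```python
import os
import tempfile
import time

BLCKSZ = 8192  # assumption: default PostgreSQL block size


def fadvise_in_steps(fd, first_block, nblocks, step, advice):
    """Advise nblocks starting at first_block, `step` blocks per call.

    Returns (number of posix_fadvise calls, elapsed seconds).
    """
    calls = 0
    end_block = first_block + nblocks
    start = time.perf_counter()
    for blkno in range(first_block, end_block, step):
        count = min(step, end_block - blkno)
        os.posix_fadvise(fd, blkno * BLCKSZ, count * BLCKSZ, advice)
        calls += 1
    return calls, time.perf_counter() - start


def demo(nblocks=32, step=1):
    """DONTNEED, then WILLNEED twice, on a scratch file; returns the results."""
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * (nblocks * BLCKSZ))
        f.flush()
        os.fsync(f.fileno())  # dirty pages are not dropped by DONTNEED
        fd = f.fileno()
        return [
            fadvise_in_steps(fd, 0, nblocks, step, os.POSIX_FADV_DONTNEED),
            fadvise_in_steps(fd, 0, nblocks, step, os.POSIX_FADV_WILLNEED),
            fadvise_in_steps(fd, 0, nblocks, step, os.POSIX_FADV_WILLNEED),
        ]
```

On a file this small the timings mean little. The point is only that the same number of calls is issued for each advice, so any difference in elapsed time comes from what the kernel does with them.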

Thank you for taking the time to read this longer-than-expected email.

Comments?


---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D
