Re: Cache relation sizes?

Поиск
Список
Период
Сортировка
От Thomas Munro
Тема Re: Cache relation sizes?
Дата
Msg-id CA+hUKG+vCcCpUkc3zxGYderKxRVQYacp66HDej6ENa7e_k0GwQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Cache relation sizes?  (Andy Fan <zhihui.fan1213@gmail.com>)
Ответы Re: Cache relation sizes?  (Andy Fan <zhihui.fan1213@gmail.com>)
Список pgsql-hackers
On Thu, Dec 17, 2020 at 10:22 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Let me try to understand your point.  Suppose process 1 extends a file to
> 2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still
> return 1 based on the comments in the ReadBuffer_common ("because
> of buggy Linux kernels that sometimes return an seek SEEK_END result
> that doesn't account for a recent write").  b). When this patch is introduced,
> we would get 2 blocks for sure if we can get the size from cache. c). After
> user reads the 2 blocks from cache and then the cache is evicted, and user
> reads the blocks again, it is possible to get 1 again.
>
> Do we need to consider case a) in this patch? and Is case c). the situation you
> want to avoid and do we have to introduce fsync to archive this? Basically I
> think case a) should not happen by practice so we don't need to handle case
> c) and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flag.
> I'm not opposed to adding them, I am just interested in the background in case
> you are also willing to discuss it.

Sorry for the lack of background -- there was some discussion on the
thread "Optimize dropping of relation buffers using dlist", but it's
long and mixed up with other topics.  I'll try to summarise here.

It's easy to reproduce files shrinking asynchronously on network
filesystems that reserve space lazily[1], and we know that there have
historically been problems even with local filesystems[2], leading to
that comment about buggy kernels.  I am only interested in the
behaviour I can demonstrate, because I believe Linux is working as
intended when it does that (hypothetical/unknown bugs would probably
be helped by this approach too, but I'm not interested in speculating
about that beyond these parentheses).

So, why should we consider this?  It's true that commit ffae5cc5a60's
change to ReadBuffer_common() already prevents us from eating data by
overwriting an existing buffer due to this phenomenon, but there are
other problems:

1.  A system that in this state can give an incorrect answer to a
query: heapam.c calls RelationGetNumberOfBlocks() at the beginning of
a scan, so it might not see recently added blocks containing committed
data.  Hopefully we'll panic when we try to create a checkpoint and
see errors, but we could be giving bad results for a while.

2.  Commitfest entry #2319 needs an accurate size, or it might leave
stray blocks in the buffer pool that will cause failures or corruption
later.

It's true that we could decide not to worry about this, and perhaps
even acknowledge in some official way that PostgreSQL doesn't work
reliably on some kinds of filesystems.  But in this prototype I
thought it was fun to try to fix our very high rate of lseek()
syscalls and this reliability problem, at the same time.  I also
speculate that we'll eventually have other reasons to want a
demand-paged per-relation shared memory object.

> I would suggest the below change for smgrnblock_fast,  FYI
>
> @@ -182,7 +182,11 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
>                 /*
>                  * The generation doesn't match, the shared relation must have been
>                  * evicted since we got a pointer to it.  We'll need to do more work.
> +                * and invalid the fast path for next time.
>                  */
> +               reln->smgr_shared = NULL;
>         }

Thanks, makes sense -- I'll add this to the next version.

[1] https://www.postgresql.org/message-id/CA%2BhUKGLZTfKuXir9U4JQkY%3Dk3Tb6M_3toGrGOK_fa2d4MPQQ_w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/BAY116-F146A38326C5630EA33B36BD12B0%40phx.gbl



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andreas Karlsson
Дата:
Сообщение: Re: Discarding DISCARD ALL
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: Re: Cache relation sizes?