Re: Cache relation sizes?
From        | Thomas Munro
Subject     | Re: Cache relation sizes?
Date        |
Msg-id      | CA+hUKG+vCcCpUkc3zxGYderKxRVQYacp66HDej6ENa7e_k0GwQ@mail.gmail.com
In reply to | Re: Cache relation sizes? (Andy Fan <zhihui.fan1213@gmail.com>)
Responses   | Re: Cache relation sizes? (Andy Fan <zhihui.fan1213@gmail.com>)
List        | pgsql-hackers
On Thu, Dec 17, 2020 at 10:22 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Let me try to understand your point. Suppose process 1 extends a file to
> 2 blocks from 1 block, and fsync is not called, then a). the lseek *may*
> still return 1, based on the comments in ReadBuffer_common ("because of
> buggy Linux kernels that sometimes return an lseek(SEEK_END) result that
> doesn't account for a recent write"). b). When this patch is introduced,
> we would get 2 blocks for sure if we can get the size from the cache.
> c). After the user reads the 2 blocks from the cache and the cache is
> evicted, and the user reads the blocks again, it is possible to get 1 again.
>
> Do we need to consider case a) in this patch? And is case c) the situation
> you want to avoid, and do we have to introduce fsync to achieve this?
> Basically I think case a) should not happen in practice, so we don't need
> to handle case c), and finally we don't need the fsync/SR_SYNCING/
> SR_JUST_DIRTIED flags. I'm not opposed to adding them; I am just interested
> in the background, in case you are also willing to discuss it.

Sorry for the lack of background -- there was some discussion on the thread
"Optimize dropping of relation buffers using dlist", but it's long and mixed
up with other topics. I'll try to summarise here.

It's easy to reproduce files shrinking asynchronously on network filesystems
that reserve space lazily[1], and we know that there have historically been
problems even with local filesystems[2], leading to that comment about buggy
kernels. I am only interested in the behaviour I can demonstrate, because I
believe Linux is working as intended when it does that (hypothetical/unknown
bugs would probably be helped by this approach too, but I'm not interested
in speculating about that beyond these parentheses). So, why should we
consider this?
It's true that commit ffae5cc5a60's change to ReadBuffer_common() already
prevents us from eating data by overwriting an existing buffer due to this
phenomenon, but there are other problems:

1. A system in this state can give an incorrect answer to a query:
heapam.c calls RelationGetNumberOfBlocks() at the beginning of a scan, so
it might not see recently added blocks containing committed data.
Hopefully we'll panic when we try to create a checkpoint and see errors,
but we could be giving bad results for a while.

2. Commitfest entry #2319 needs an accurate size, or it might leave stray
blocks in the buffer pool that will cause failures or corruption later.

It's true that we could decide not to worry about this, and perhaps even
acknowledge in some official way that PostgreSQL doesn't work reliably on
some kinds of filesystems. But in this prototype I thought it was fun to
try to fix our very high rate of lseek() syscalls and this reliability
problem at the same time. I also speculate that we'll eventually have
other reasons to want a demand-paged per-relation shared memory object.

> I would suggest the below change for smgrnblocks_fast, FYI
>
> @@ -182,7 +182,11 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
>          /*
>           * The generation doesn't match, the shared relation must have been
>           * evicted since we got a pointer to it. We'll need to do more work,
> +         * and invalidate the fast path for next time.
>           */
> +        reln->smgr_shared = NULL;
>      }

Thanks, makes sense -- I'll add this to the next version.

[1] https://www.postgresql.org/message-id/CA%2BhUKGLZTfKuXir9U4JQkY%3Dk3Tb6M_3toGrGOK_fa2d4MPQQ_w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/BAY116-F146A38326C5630EA33B36BD12B0%40phx.gbl