SLRUs in the main buffer pool, redux

From: Thomas Munro
Subject: SLRUs in the main buffer pool, redux
Msg-id: CA+hUKGKAYze99B-jk9NoMp-2BDqAgiRC4oJv+bFxghNgdieq8Q@mail.gmail.com
Replies: Re: SLRUs in the main buffer pool, redux  (Robert Haas <robertmhaas@gmail.com>)
         Re: SLRUs in the main buffer pool, redux  (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
Hi,

I was re-reviewing the proposed batch of GUCs for controlling the SLRU
cache sizes[1], and I couldn't resist sketching out $SUBJECT as an
obvious alternative.  This patch is highly experimental and full of
unresolved bits and pieces (see below for some), but it passes basic
tests and is enough to start trying the idea out and figuring out
where the real problems lie.  The hypothesis here is that CLOG,
multixact, etc data should compete for space with relation data in one
unified buffer pool so you don't have to tune them, and they can
benefit from the better common implementation (mapping, locking,
replacement, bgwriter, checksums, etc and eventually new things like
AIO, TDE, ...).

I know that many people have talked about doing this and maybe they
already have patches along these lines too; I'd love to know what
others imagined differently/better.

In the attached sketch, the SLRU caches are pseudo-relations in
pseudo-database 9.  Yeah.  That's a straw-man idea stolen from the
Zheap/undo project[2] (I also stole DiscardBuffer() from there);
better ideas for identifying these buffers without making BufferTag
bigger are very welcome.  You can list SLRU buffers with:

  WITH slru(relfilenode, path) AS (VALUES (0, 'pg_xact'),
                                          (1, 'pg_multixact/offsets'),
                                          (2, 'pg_multixact/members'),
                                          (3, 'pg_subtrans'),
                                          (4, 'pg_serial'),
                                          (5, 'pg_commit_ts'),
                                          (6, 'pg_notify'))
  SELECT bufferid, path, relblocknumber, isdirty, usagecount, pinning_backends
    FROM pg_buffercache NATURAL JOIN slru
   WHERE reldatabase = 9
   ORDER BY path, relblocknumber;
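
To illustrate the straw-man addressing scheme, a buffer tag for an SLRU
page might be built along these lines (pre-16 BufferTag layout; the
constant SLRU_PSEUDO_DB_OID, the helper name and the choice of
GLOBALTABLESPACE_OID are illustrative placeholders, not necessarily what
the patch does):

  #include "storage/buf_internals.h"

  #define SLRU_PSEUDO_DB_OID 9    /* reserved pseudo-database for SLRU data */

  static void
  InitSlruBufferTag(BufferTag *tag, Oid slru_id, BlockNumber blockNum)
  {
      /* slru_id plays the relfilenode role: 0 = pg_xact, 1 = offsets, ... */
      tag->rnode.spcNode = GLOBALTABLESPACE_OID;  /* arbitrary fixed choice */
      tag->rnode.dbNode = SLRU_PSEUDO_DB_OID;     /* marks "this is SLRU data" */
      tag->rnode.relNode = slru_id;
      tag->forkNum = MAIN_FORKNUM;
      tag->blockNum = blockNum;
  }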

Here are some per-cache starter hypotheses about locking that might be
completely wrong and obviously need real analysis and testing.

pg_xact:

I couldn't easily get rid of XactSLRULock, because it doesn't just
protect buffers; it's also used to negotiate "group CLOG updates".  (I
think it'd be nice to replace that system with an atomic page update
scheme so that concurrent committers stay on CPU, something like [3],
but that's another topic.)  I decided to try a model where readers
only have to pin the page (the reads are sub-byte values that we can
read atomically, and you'll see a value at least as fresh as the time
you took the pin, right?), but writers have to take an exclusive
content lock because otherwise they'd clobber each other at byte
level, and because they need to maintain the page LSN consistently.
Writing back is done with a share lock as usual and log flushing can
be done consistently.  I also wanted to try avoiding the extra cost of
locking and accessing the buffer mapping table in common cases, so I
use ReadRecentBuffer() for repeat access to the same page (this
applies to the other SLRUs too).
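
For illustration only, a pin-only CLOG status read with the
ReadRecentBuffer() fast path could look roughly like this
(SLRU_CLOG_RNODE and ReadSlruBuffer() are placeholder names, not from
the patch; the CLOG_* macros are as in access/transam/clog.c):

  static BlockNumber cached_clog_blkno = InvalidBlockNumber;
  static Buffer      cached_clog_buffer = InvalidBuffer;

  static XidStatus
  CLogReadStatusPinOnly(TransactionId xid)
  {
      BlockNumber blkno = TransactionIdToPage(xid);
      int         byteno = TransactionIdToByte(xid);
      int         bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
      Buffer      buf;
      XidStatus   status;

      /* Try to re-pin the last page without touching the mapping table. */
      if (blkno == cached_clog_blkno &&
          ReadRecentBuffer(SLRU_CLOG_RNODE, MAIN_FORKNUM, blkno,
                           cached_clog_buffer))
          buf = cached_clog_buffer;
      else
      {
          buf = ReadSlruBuffer(SLRU_CLOG_RNODE, blkno);    /* pins the page */
          cached_clog_blkno = blkno;
          cached_clog_buffer = buf;
      }

      /*
       * Sub-byte status bits can be read under a pin alone, with no content
       * lock; the value is at least as fresh as the moment the pin was taken.
       */
      status = (*((char *) BufferGetPage(buf) + byteno) >> bshift) &
          CLOG_XACT_BITMASK;

      ReleaseBuffer(buf);
      return status;
  }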

pg_subtrans:

I got rid of SubtransSLRULock because it only protected page contents.
Can be read with only a pin.  Exclusive page content lock to write.
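
A minimal sketch of that rule, with the same placeholder
ReadSlruBuffer() as above and TransactionIdToEntry as in
access/transam/subtrans.c:

  static void
  SubTransSetParentSketch(TransactionId xid, TransactionId parent)
  {
      BlockNumber blkno = TransactionIdToPage(xid);
      int         entryno = TransactionIdToEntry(xid);
      Buffer      buf = ReadSlruBuffer(SLRU_SUBTRANS_RNODE, blkno);  /* pin */
      TransactionId *ptr;

      /* Writers serialize on the content lock; readers only need the pin. */
      LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
      ptr = (TransactionId *) BufferGetPage(buf);
      ptr[entryno] = parent;
      MarkBufferDirty(buf);
      LockBuffer(buf, BUFFER_LOCK_UNLOCK);
      ReleaseBuffer(buf);
  }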

pg_multixact:

I got rid of the MultiXact{Offset,Members}SLRULock locks.  Can be read
with only a pin.  Writers take exclusive page content lock.  The
multixact.c module still has its own MultiXactGenLock.
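
The pin-only read is the same pattern as for pg_xact, sketched here for
the offsets SLRU (again with placeholder names; the MultiXactIdToOffset*
macros are as in access/transam/multixact.c):

  static MultiXactOffset
  MultiXactIdGetOffsetSketch(MultiXactId multi)
  {
      BlockNumber blkno = MultiXactIdToOffsetPage(multi);
      int         entryno = MultiXactIdToOffsetEntry(multi);
      Buffer      buf = ReadSlruBuffer(SLRU_MULTIXACT_OFFSETS_RNODE, blkno);
      MultiXactOffset offset;

      /* Aligned 4-byte value, read under a pin with no content lock. */
      offset = ((MultiXactOffset *) BufferGetPage(buf))[entryno];
      ReleaseBuffer(buf);
      return offset;
  }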

pg_commit_ts:

I got rid of CommitTsSLRULock since it only protected buffers, but
here I had to take shared content locks to read pages, since the
values can't be read atomically.  Exclusive content lock to write.
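
A sketch of the read side, showing why a share lock is needed here: each
entry is a multi-byte CommitTimestampEntry, unlike CLOG's two-bit
statuses (placeholder names as before; the CommitTimestampEntry layout
and TransactionIdToCTs* macros are as in access/transam/commit_ts.c):

  static TimestampTz
  CommitTsGetSketch(TransactionId xid, RepOriginId *nodeid)
  {
      BlockNumber blkno = TransactionIdToCTsPage(xid);
      int         entryno = TransactionIdToCTsEntry(xid);
      Buffer      buf = ReadSlruBuffer(SLRU_COMMIT_TS_RNODE, blkno);  /* pin */
      CommitTimestampEntry entry;

      LockBuffer(buf, BUFFER_LOCK_SHARE);    /* multi-byte read needs a lock */
      memcpy(&entry,
             (char *) BufferGetPage(buf) + SizeOfCommitTimestampEntry * entryno,
             SizeOfCommitTimestampEntry);
      LockBuffer(buf, BUFFER_LOCK_UNLOCK);
      ReleaseBuffer(buf);

      if (nodeid)
          *nodeid = entry.nodeid;
      return entry.time;
  }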

pg_serial:

I could not easily get rid of SerialSLRULock, because it protects the
SLRU as well as some variables in serialControl.  Shared and exclusive
page content locks.

pg_notify:

I got rid of NotifySLRULock.  Shared and exclusive page content locks
are used for reading and writing.  The module still has a separate
lock NotifyQueueLock to coordinate queue positions.

Some problems tackled incompletely:

* I needed to disable checksums and in-page LSNs, since SLRU pages
hold raw data with no header.  We'd probably eventually want regular
(standard? formatted?) pages (the real work here may be implementing
FPI for SLRUs so that checksums don't break your database on torn
writes).  In the meantime, suppressing those things is done by the
kludge of recognising database 9 as raw data, but there should be
something better than this.  A separate array of size NBuffers holds
"external" page LSNs to drive WAL flushing (a sketch follows this list).

* The CLOG SLRU also tracks groups of async commit LSNs in a
fixed-size array.  The obvious translation would be very wasteful (an array
big enough for NBuffers * groups per page), but I hope that there is a
better way to do this... in the sketch patch I changed it to use the
single per-page LSN for simplicity (basically group size is 32k
instead of 32...), which is certainly not good enough.
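
A rough sketch of the external LSN array from the first item above
(names are illustrative only; the point is just to honour
WAL-before-data the way FlushBuffer() does with the page LSN for
ordinary pages):

  static XLogRecPtr *SlruBufferLSNs;   /* NBuffers entries, in shared memory */

  static void
  SlruSetBufferLSN(Buffer buf, XLogRecPtr lsn)
  {
      /* Caller holds exclusive content lock, as when setting a page LSN. */
      if (SlruBufferLSNs[buf - 1] < lsn)
          SlruBufferLSNs[buf - 1] = lsn;
  }

  static void
  SlruFlushBufferLSN(Buffer buf)
  {
      XLogRecPtr  lsn = SlruBufferLSNs[buf - 1];

      /* WAL must be durable up to 'lsn' before the raw page is written out. */
      if (!XLogRecPtrIsInvalid(lsn))
          XLogFlush(lsn);
  }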

Some stupid problems not tackled yet:

* It holds onto the virtual file descriptor for the last segment
accessed, but there is no invalidation for when segment files are
recycled; that could be fixed with a cycle counter or something like
that.

* It needs to pin buffers during the critical section in commit
processing, but that crashes into the ban on allocating memory while
dealing with resowner.c book-keeping.  It's also hard to know how many
buffers you'll need to pin in advance.  For now, I just commented out
the assertions...

* While hacking on the pg_stat_slru view I realised that there is
support for "other" SLRUs, presumably for extensions to define their
own.  Does anyone actually do that?  I, erm, didn't support that in
this sketch (not too hard though, I guess).

* For some reason this is failing on Windows CI, but I haven't looked
into that yet.

Thoughts on the general concept, technical details?  Existing patches
for this that are further ahead/better?

[1] https://commitfest.postgresql.org/36/2627/
[2] https://commitfest.postgresql.org/36/3228/
[3] http://www.vldb.org/pvldb/vol13/p3195-kodandaramaih.pdf
