Re: Reworking the writing of WAL

From: Robert Haas
Subject: Re: Reworking the writing of WAL
Date:
Msg-id: CA+TgmoYR6sXfyS6gJCE-+BLpcvVDBZaO_=dObL+B+XdQBDsk1w@mail.gmail.com
In reply to: Reworking the writing of WAL  (Simon Riggs <simon@2ndQuadrant.com>)
Responses: Re: Reworking the writing of WAL  (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers
On Fri, Aug 12, 2011 at 11:34 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> 1. Earlier, I suggested that the sync rep code would allow us to
> redesign the way we write WAL, using ideas from group commit. My
> proposal is that when a backend needs to flush WAL to local disk
> it will be added to a SHMQUEUE exactly the same as when we flush WAL
> to sync standby. The WALWriter will be woken by latch and then perform
> the actual work. When complete WALWriter will wake the queue in order,
> so there is a natural group commit effect. The WAL queue will be
> protected by a new lock WALFlushRequestLock, which should be much less
> heavily contended than the way we do things now. Notably this approach
> will mean that all waiters get woken quickly, without having to wait
> for the queue of WALWriteLock requests to drain down, so commit will
> be marginally quicker. On almost idle systems this will give very
> nearly the same response time as having each backend write WAL
> directly. On busy systems this will give optimal efficiency by having
> WALWriter working in a very tight loop to perform the I/O instead of
> queuing itself to get the WALWriteLock with all the other backends. It
> will also allow piggybacking of commits even when WALInsertLock is not
> available.

I like the idea of putting all the backends that are waiting for xlog
flush on a SHM_QUEUE, and having a single process do the flush and
then wake them all up.  That seems like a promising approach, and
should avoid quite a bit of context-switching and spinlocking that
would otherwise be necessary.  However, I think it's possible that the
overhead in the single-client case might be pretty significant, and
I'm wondering whether we might be able to set things up so that
backends can flush their own WAL in the uncontended case.

What I'm imagining is something like this:

struct
{
    slock_t     mutex;                  /* protects everything below */
    XLogRecPtr  CurrentFlushLSN;        /* flush now in progress, or
                                         * InvalidXLogRecPtr if none */
    XLogRecPtr  HighestFlushLSN;        /* highest LSN requested so far */
    SHM_QUEUE   WaitersForCurrentFlush; /* covered by the current flush */
    SHM_QUEUE   WaitersForNextFlush;    /* need a later flush */
};

To flush, you first acquire the mutex.  If the CurrentFlushLSN is not
InvalidXLogRecPtr, then there's a flush in progress, and you add
yourself to either WaitersForCurrentFlush or WaitersForNextFlush,
depending on whether your LSN is lower or higher than CurrentFlushLSN.
If you queue on WaitersForNextFlush, you advance HighestFlushLSN to
the LSN you need flushed.  You then release the spinlock and sleep on
your semaphore.
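
To make that concrete, here's a rough sketch of the waiter side.  The
names flushCtl, FlushWaiter, MyFlushWaiter, and myLSN are all made up
for illustration; XLByteLE/XLByteLT are the usual LSN comparison
macros.

/* Hypothetical per-backend wait entry, living in shared memory. */
typedef struct FlushWaiter
{
    SHM_QUEUE   links;      /* position in one of the two wait queues */
    XLogRecPtr  waitLSN;    /* LSN this backend needs flushed */
    PGPROC     *proc;       /* whose semaphore to kick at wakeup */
} FlushWaiter;

    FlushWaiter *waiter = MyFlushWaiter;        /* hypothetical */

    SpinLockAcquire(&flushCtl->mutex);
    if (!XLogRecPtrIsInvalid(flushCtl->CurrentFlushLSN))
    {
        /* A flush is already in progress; join the right queue. */
        waiter->waitLSN = myLSN;
        if (XLByteLE(myLSN, flushCtl->CurrentFlushLSN))
            SHMQueueInsertBefore(&flushCtl->WaitersForCurrentFlush,
                                 &waiter->links);
        else
        {
            SHMQueueInsertBefore(&flushCtl->WaitersForNextFlush,
                                 &waiter->links);
            if (XLByteLT(flushCtl->HighestFlushLSN, myLSN))
                flushCtl->HighestFlushLSN = myLSN;
        }
        SpinLockRelease(&flushCtl->mutex);
        PGSemaphoreLock(&MyProc->sem, true);    /* sleep until woken */
        return;
    }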

But if you get the mutex and find that CurrentFlushLSN is InvalidXLogRecPtr,
then you know that no flush is in progress.  In that case, you set
CurrentFlushLSN to the maximum of the LSN you need flushed and
HighestFlushLSN and move all WaitersForNextFlush over to
WaitersForCurrentFlush.  You then release the spinlock and perform the
flush.  After doing so, you reacquire the spinlock, remove everyone
from WaitersForCurrentFlush, note whether there are any
WaitersForNextFlush, and release the spinlock.  If there were any
WaitersForNextFlush, you set the WAL writer latch.  You then wake up
anyone you removed from WaitersForCurrentFlush.
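
Continuing the sketch, this is the else branch of the fragment above.
DoTheFlush is a stand-in for the actual write-and-fsync; the one
liberty taken is detaching and waking waiters one at a time, so the
spinlock is never held across a semaphore operation.

    FlushWaiter *w;
    bool         morework = false;

    /* CurrentFlushLSN was invalid, so this backend does the flush. */
    flushCtl->CurrentFlushLSN = myLSN;
    if (XLByteLT(flushCtl->CurrentFlushLSN, flushCtl->HighestFlushLSN))
        flushCtl->CurrentFlushLSN = flushCtl->HighestFlushLSN;

    /* Everyone queued for the next flush becomes part of this one. */
    while ((w = (FlushWaiter *)
            SHMQueueNext(&flushCtl->WaitersForNextFlush,
                         &flushCtl->WaitersForNextFlush,
                         offsetof(FlushWaiter, links))) != NULL)
    {
        SHMQueueDelete(&w->links);
        SHMQueueInsertBefore(&flushCtl->WaitersForCurrentFlush,
                             &w->links);
    }
    SpinLockRelease(&flushCtl->mutex);

    DoTheFlush(flushCtl->CurrentFlushLSN);      /* hypothetical */

    for (;;)
    {
        SpinLockAcquire(&flushCtl->mutex);
        w = (FlushWaiter *)
            SHMQueueNext(&flushCtl->WaitersForCurrentFlush,
                         &flushCtl->WaitersForCurrentFlush,
                         offsetof(FlushWaiter, links));
        if (w != NULL)
            SHMQueueDelete(&w->links);
        else
        {
            /* Queue drained: close out this flush cycle. */
            flushCtl->CurrentFlushLSN = InvalidXLogRecPtr;
            morework = !SHMQueueEmpty(&flushCtl->WaitersForNextFlush);
        }
        SpinLockRelease(&flushCtl->mutex);
        if (w == NULL)
            break;
        PGSemaphoreUnlock(&w->proc->sem);
    }

    if (morework)
        SetLatch(&WalWriterLatch);              /* hypothetical latch */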

Every time the WAL writer latch is set, the WAL writer wakes up and
performs any needed flush, unless there's already one in progress.
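
As a sketch, with FlushPendingXLog() as a hypothetical wrapper around
the flush logic above, its main loop would be just:

    for (;;)
    {
        ResetLatch(&WalWriterLatch);
        /* Same routine the backends run; it does nothing if a flush
         * is already in progress or nobody is waiting. */
        FlushPendingXLog();                     /* hypothetical */
        WaitLatch(&WalWriterLatch, -1L);        /* sleep until set */
    }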

This allows processes to flush their own WAL when there's no
contention, but as more contention develops the work moves to the WAL
writer which will then run in a tight loop, as in your proposal.

> 5. And we would finally get rid of the group commit parameters.

That would be great, and I think the performance will be quite a bit
better, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

