Обсуждение: WALInsertLock tuning

Поиск
Список
Период
Сортировка

WALInsertLock tuning

От
Simon Riggs
Дата:
In earlier discussions of how to improve WALInsertLock contention, it
was observed that we must zero each new page before we advance the WAL
insertion point.
http://postgresql.1045698.n5.nabble.com/Reworking-WAL-locking-td1983647.html

IMHO the page zeroing is completely unnecessary, and replication works
perfectly well without that (as a test of recovery logic). It is
unnecessary because we already allow non-zeroed parts of WAL files,
and provide a mechanism to detect stale data.

The following trivial patch removes the page zeroing, which reduces
the lock duration.

Comments?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

Re: WALInsertLock tuning

От
Tom Lane
Дата:
Simon Riggs <simon@2ndQuadrant.com> writes:
> In earlier discussions of how to improve WALInsertLock contention, it
> was observed that we must zero each new page before we advance the WAL
> insertion point.
> http://postgresql.1045698.n5.nabble.com/Reworking-WAL-locking-td1983647.html

> IMHO the page zeroing is completely unnecessary,

I don't believe it's "completely unnecessary".  It does in fact offer
additional protection against mistakenly taking stale data as valid.
You could maybe argue that the degree of safety increase isn't
sufficient to warrant the cost of zeroing the page, but you've not
offered any quantification of either the risk or the cost savings.
        regards, tom lane


Re: WALInsertLock tuning

От
Simon Riggs
Дата:
On Mon, Jun 6, 2011 at 5:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> In earlier discussions of how to improve WALInsertLock contention, it
>> was observed that we must zero each new page before we advance the WAL
>> insertion point.
>> http://postgresql.1045698.n5.nabble.com/Reworking-WAL-locking-td1983647.html
>
>> IMHO the page zeroing is completely unnecessary,
>
> I don't believe it's "completely unnecessary".  It does in fact offer
> additional protection against mistakenly taking stale data as valid.
> You could maybe argue that the degree of safety increase isn't
> sufficient to warrant the cost of zeroing the page, but you've not
> offered any quantification of either the risk or the cost savings.

If we did ever reference stale data, it would need to have a value of
xl_prev that exactly matched what we expected AND would also need a
CRC that exactly matched also. That would be fairly difficult to
arrange deliberately and pretty close to zero chance of happening
otherwise.

But that even assumes we write the unzeroed data at the end of the
buffer. We don't. We only write data up to the end of the WAL record
on the current page, unless we do a continuation record, which means
only replay we would read in another page and check page headers. So
for this to cause an error, we'd have to have an overall matching CRC,
a matching xl_prev and at least one matching page header, which
contains a pageaddress.

But that even assumes we would read data in a different way to which
it was written.

So the only way we could ever reference the stale data is if the WAL
reading code didn't match the WAL writing code (1) because of a bug or
(2) because of a corrupt pg_control record that caused random access
to an otherwise unreachable part of WAL
AND
the CRC matched, and the xl_prev matched and the next page header matched.

Risk == zero. If it wasn't zero I would even mention it because this
is not a trade-off optimization.

Cost savings are those already identified by yourself in our previous
discussion on this, link given upthread. It's the biggest single
removable item from within WALInsertLock. We can measure them if you
like, but I don't see the value in that. It will clearly be a useful
saving on what we all agree is a heavily contended lock.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: WALInsertLock tuning

От
Jeff Janes
Дата:
On Mon, Jun 6, 2011 at 11:27 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> But that even assumes we write the unzeroed data at the end of the
> buffer. We don't. We only write data up to the end of the WAL record
> on the current page, unless we do a continuation record,

I see no codepath in XLogWrite which writes anything other than full
buffer pages.

Cheers,

Jeff


Re: WALInsertLock tuning

От
Simon Riggs
Дата:
On Mon, Jun 6, 2011 at 10:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Jun 6, 2011 at 11:27 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> But that even assumes we write the unzeroed data at the end of the
>> buffer. We don't. We only write data up to the end of the WAL record
>> on the current page, unless we do a continuation record,
>
> I see no codepath in XLogWrite which writes anything other than full
> buffer pages.

Second line, boolean variable called "ispartialpage".

As I mentioned, even if spare bytes at the end of page were written,
they aren't ever read except in corner case bugs that would be trapped
by other logic put there to protect us.

We're safe. If I didn't believe it and hadn't tested it, I wouldn't speak.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: WALInsertLock tuning

От
Robert Haas
Дата:
On Mon, Jun 6, 2011 at 6:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Mon, Jun 6, 2011 at 10:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Mon, Jun 6, 2011 at 11:27 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>
>>> But that even assumes we write the unzeroed data at the end of the
>>> buffer. We don't. We only write data up to the end of the WAL record
>>> on the current page, unless we do a continuation record,
>>
>> I see no codepath in XLogWrite which writes anything other than full
>> buffer pages.
>
> Second line, boolean variable called "ispartialpage".

That affects our tracking of the write and flush positions, but not
what we actually write to the file:
                       /* OK to write the page(s) */                       from = XLogCtl->pages + startidx * (Size)
XLOG_BLCKSZ;                      nbytes = npages * (Size) XLOG_BLCKSZ;                       errno = 0;
      if (write(openLogFile, from, nbytes) != nbytes)
 

As you can see, nbytes is always a multiple of XLOG_BLCKSZ.

> As I mentioned, even if spare bytes at the end of page were written,
> they aren't ever read except in corner case bugs that would be trapped
> by other logic put there to protect us.
>
> We're safe. If I didn't believe it and hadn't tested it, I wouldn't speak.

I think you need to do some analysis of how much this actually
improves performance.  I don't like the idea of applying performance
patches without performance testing; sometimes the results are
counterintuitive.  And even when they're not, it's nice to know how
much you've gained.

As to the question of whether it's safe, I think I'd agree that the
chances of this backfiring are pretty remote.  I think that with the
zeroing they are exactly zero, because (now that we start XLOG
positions at 0/1) there is now way that xl_prev = {0, 0} can ever be
valid.  Without the zeroing, well, it's remotely possible that xl_prev
could happen to appear valid and that xl_crc could happen to match...
but the chances are presumably quite remote.  Just the chances of the
CRC matching should be around 2^-32.  The chances of an xl_prev match
are harder to predict, because the matching values for CRCs should be
pretty much uniformly distributed, while xl_prev isn't random.  But
even given that the chance of a match is should be very small, so in
practice there is likely no harm.  It strikes me, though, that we
could probably get nearly all of the benefit of this patch by being
willing to zero the first sizeof(XLogRecord) bytes following a record,
but not the rest of the buffer.  That would pretty much wipe out any
chance of an xl_prev match, I think, and would likely still get nearly
all of the performance benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WALInsertLock tuning

От
Simon Riggs
Дата:
On Mon, Jun 6, 2011 at 11:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> As to the question of whether it's safe, I think I'd agree that the
> chances of this backfiring are pretty remote.  I think that with the
> zeroing they are exactly zero, because (now that we start XLOG
> positions at 0/1) there is now way that xl_prev = {0, 0} can ever be
> valid.  Without the zeroing, well, it's remotely possible that xl_prev
> could happen to appear valid and that xl_crc could happen to match...
> but the chances are presumably quite remote.  Just the chances of the
> CRC matching should be around 2^-32.  The chances of an xl_prev match
> are harder to predict, because the matching values for CRCs should be
> pretty much uniformly distributed, while xl_prev isn't random.  But
> even given that the chance of a match is should be very small, so in
> practice there is likely no harm.

And if such a thing did actually happen we would also need to have an
accidentally correct value of all of the rest of the header values.
And even if we did we would apply at most one junk WAL record. Then we
are onto the next WAL record where we would need have to the same luck
all over again.

The probability of these occurrences is well below the acceptable
threshold for other problems occurring.

> It strikes me, though, that we
> could probably get nearly all of the benefit of this patch by being
> willing to zero the first sizeof(XLogRecord) bytes following a record,
> but not the rest of the buffer.  That would pretty much wipe out any
> chance of an xl_prev match, I think, and would likely still get nearly
> all of the performance benefit.

Which adds something onto the path of every XlogInsert(), rather than
once per page, so I'm a little hesitant to agree.

If we did that, we would only need to write out an additional 12 bytes
per WAL record, not the full sizeof(XLogRecord).

But even so, I think its wasted effort.

Measuring the benefit of a performance patch is normal, but I'm not
proposing this as a risk trade-off. It's just a straight removal of
multiple cycles from a hot code path. The exact benefit will depend
upon whether the WALInsertLock is the hot lock, which it likely will
be when other patches are applied.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: WALInsertLock tuning

От
Heikki Linnakangas
Дата:
On 07.06.2011 10:21, Simon Riggs wrote:
> On Mon, Jun 6, 2011 at 11:25 PM, Robert Haas<robertmhaas@gmail.com>  wrote:
>> It strikes me, though, that we
>> could probably get nearly all of the benefit of this patch by being
>> willing to zero the first sizeof(XLogRecord) bytes following a record,
>> but not the rest of the buffer.  That would pretty much wipe out any
>> chance of an xl_prev match, I think, and would likely still get nearly
>> all of the performance benefit.
>
> Which adds something onto the path of every XlogInsert(), rather than
> once per page, so I'm a little hesitant to agree.

You would only need to do it just before you write out the WAL. I guess 
you'd need to grab WALInsertLock in XLogWrite() to prevent more WAL 
records from being inserted on the page until you're done zeroing it, 
though.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WALInsertLock tuning

От
Simon Riggs
Дата:
On Tue, Jun 7, 2011 at 8:27 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 07.06.2011 10:21, Simon Riggs wrote:
>>
>> On Mon, Jun 6, 2011 at 11:25 PM, Robert Haas<robertmhaas@gmail.com>
>>  wrote:
>>>
>>> It strikes me, though, that we
>>> could probably get nearly all of the benefit of this patch by being
>>> willing to zero the first sizeof(XLogRecord) bytes following a record,
>>> but not the rest of the buffer.  That would pretty much wipe out any
>>> chance of an xl_prev match, I think, and would likely still get nearly
>>> all of the performance benefit.
>>
>> Which adds something onto the path of every XlogInsert(), rather than
>> once per page, so I'm a little hesitant to agree.
>
> You would only need to do it just before you write out the WAL. I guess
> you'd need to grab WALInsertLock in XLogWrite() to prevent more WAL records
> from being inserted on the page until you're done zeroing it, though.

How would that help?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: WALInsertLock tuning

От
Robert Haas
Дата:
On Tue, Jun 7, 2011 at 3:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> It strikes me, though, that we
>> could probably get nearly all of the benefit of this patch by being
>> willing to zero the first sizeof(XLogRecord) bytes following a record,
>> but not the rest of the buffer.  That would pretty much wipe out any
>> chance of an xl_prev match, I think, and would likely still get nearly
>> all of the performance benefit.
>
> Which adds something onto the path of every XlogInsert(), rather than
> once per page, so I'm a little hesitant to agree.

Urk.  Well, we don't want that, for sure.   The previous discussion
was talking about moving the zeroing around somehow, rather than
getting rid of it, so maybe there's some way to make it work...

One other thought is that I think that this patch might cause a
user-visible behavior change.  Right now, when you hit the end of
recovery, you most typically get a message saying - record with zero
length.  Not always, but often.  If we adopt this approach, you'll get
a wider variety of error messages there, depending on exactly how the
new record fails validation.  I dunno if that's important to be worth
caring about, or not.

> If we did that, we would only need to write out an additional 12 bytes
> per WAL record, not the full sizeof(XLogRecord).
>
> But even so, I think its wasted effort.
>
> Measuring the benefit of a performance patch is normal, but I'm not
> proposing this as a risk trade-off. It's just a straight removal of
> multiple cycles from a hot code path. The exact benefit will depend
> upon whether the WALInsertLock is the hot lock, which it likely will
> be when other patches are applied.

I don't think it's too hard to construct a test case where it is, even
now.  pgbench on a medium-sized machine ought to do it, with
synchronous_commit turned off.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WALInsertLock tuning

От
Simon Riggs
Дата:
On Tue, Jun 7, 2011 at 1:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> One other thought is that I think that this patch might cause a
> user-visible behavior change.  Right now, when you hit the end of
> recovery, you most typically get a message saying - record with zero
> length.  Not always, but often.  If we adopt this approach, you'll get
> a wider variety of error messages there, depending on exactly how the
> new record fails validation.  I dunno if that's important to be worth
> caring about, or not.

Not.

We've never said what the message would be, only that it would fail.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: WALInsertLock tuning

От
Heikki Linnakangas
Дата:
On 07.06.2011 10:55, Simon Riggs wrote:
> On Tue, Jun 7, 2011 at 8:27 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> You would only need to do it just before you write out the WAL. I guess
>> you'd need to grab WALInsertLock in XLogWrite() to prevent more WAL records
>> from being inserted on the page until you're done zeroing it, though.
>
> How would that help?

It doesn't matter whether the pages are zeroed while they sit in memory. 
And if you write a full page of WAL data, any wasted bytes at the end of 
the page don't matter, because they're ignored at replay anyway. The 
possibility of mistaking random garbage for valid WAL only occurs when 
we write a partial WAL page to disk. So, it is enough to zero the 
remainder of the partial WAL page (or just the next few words) when we 
write it out.

That's a lot cheaper than fully zeroing every page. (except for the fact 
that you'd need to hold WALInsertLock while you do it)

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WALInsertLock tuning

От
Tom Lane
Дата:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 07.06.2011 10:55, Simon Riggs wrote:
>> How would that help?

> It doesn't matter whether the pages are zeroed while they sit in memory. 
> And if you write a full page of WAL data, any wasted bytes at the end of 
> the page don't matter, because they're ignored at replay anyway. The 
> possibility of mistaking random garbage for valid WAL only occurs when 
> we write a partial WAL page to disk. So, it is enough to zero the 
> remainder of the partial WAL page (or just the next few words) when we 
> write it out.

> That's a lot cheaper than fully zeroing every page. (except for the fact 
> that you'd need to hold WALInsertLock while you do it)

I think avoiding the need to hold both locks at once is probably exactly
why the zeroing was done where it is.

An interesting alternative is to have XLogInsert itself just plop down a
few more zeroes immediately after the record it's inserted, before it
releases WALInsertLock.  This will be redundant work once the next
record gets added, but it's cheap enough to not matter IMO.  As was
mentioned upthread, zeroing out the bytes that will eventually hold the
next record's xl_prev field ought to be enough to maintain a guarantee
that we won't believe the next record is valid.
        regards, tom lane


Re: WALInsertLock tuning

От
Simon Riggs
Дата:
On Tue, Jun 7, 2011 at 4:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> On 07.06.2011 10:55, Simon Riggs wrote:
>>> How would that help?
>
>> It doesn't matter whether the pages are zeroed while they sit in memory.
>> And if you write a full page of WAL data, any wasted bytes at the end of
>> the page don't matter, because they're ignored at replay anyway. The
>> possibility of mistaking random garbage for valid WAL only occurs when
>> we write a partial WAL page to disk. So, it is enough to zero the
>> remainder of the partial WAL page (or just the next few words) when we
>> write it out.
>
>> That's a lot cheaper than fully zeroing every page. (except for the fact
>> that you'd need to hold WALInsertLock while you do it)
>
> I think avoiding the need to hold both locks at once is probably exactly
> why the zeroing was done where it is.
>
> An interesting alternative is to have XLogInsert itself just plop down a
> few more zeroes immediately after the record it's inserted, before it
> releases WALInsertLock.  This will be redundant work once the next
> record gets added, but it's cheap enough to not matter IMO.  As was
> mentioned upthread, zeroing out the bytes that will eventually hold the
> next record's xl_prev field ought to be enough to maintain a guarantee
> that we won't believe the next record is valid.

Lets see what the overheads are with a continuous stream of short WAL
records, say xl_heap_delete records.

xl header is 32 bytes, xl_heap_delete is 24 bytes.

So there would be ~145 records per page. 12 byte zeroing overhead per
record gives 1740 total zero bytes written per page.

The overhead is at worst case less than 25% of current overhead, plus
its spread out across multiple records.

When we get lots of full pages into WAL just after checkpoint we don't
get as much overhead - nearly every full page forces a page switch. So
we're removing overhead from where it hurts the most and amortising
across other records.

Maths work for me.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: WALInsertLock tuning

От
Fujii Masao
Дата:
On Tue, Jun 7, 2011 at 9:54 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, Jun 7, 2011 at 1:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> One other thought is that I think that this patch might cause a
>> user-visible behavior change.  Right now, when you hit the end of
>> recovery, you most typically get a message saying - record with zero
>> length.  Not always, but often.  If we adopt this approach, you'll get
>> a wider variety of error messages there, depending on exactly how the
>> new record fails validation.  I dunno if that's important to be worth
>> caring about, or not.
>
> Not.
>
> We've never said what the message would be, only that it would fail.

BTW, walreceiver doesn't zero the page before writing the WAL. So,
if zeroing the page is *really* required for safe recovery, we might need
to change walreceiver.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: WALInsertLock tuning

От
Bruce Momjian
Дата:
I assume this was addressed with this commit:
commit 465883b0a2b4236ba6b31b648a9eabef3b7cdddbAuthor: Simon Riggs <simon@2ndQuadrant.com>Date:   Tue Jun 28 22:58:17
2011+0100    Introduce compact WAL record for the common case of commit (non-DDL).    XLOG_XACT_COMMIT_COMPACT leaves
outinvalidation messages and relfilenodes,    saving considerable space for the vast majority of transaction commits.
XLOG_XACT_COMMIT keeps same definition as XLOG_PAGE_MAGIC 0xD067 and earlier.    Leonardo Francalanci and Simon Riggs

 

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Mon, Jun 6, 2011 at 11:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 
> > As to the question of whether it's safe, I think I'd agree that the
> > chances of this backfiring are pretty remote. ?I think that with the
> > zeroing they are exactly zero, because (now that we start XLOG
> > positions at 0/1) there is now way that xl_prev = {0, 0} can ever be
> > valid. ?Without the zeroing, well, it's remotely possible that xl_prev
> > could happen to appear valid and that xl_crc could happen to match...
> > but the chances are presumably quite remote. ?Just the chances of the
> > CRC matching should be around 2^-32. ?The chances of an xl_prev match
> > are harder to predict, because the matching values for CRCs should be
> > pretty much uniformly distributed, while xl_prev isn't random. ?But
> > even given that the chance of a match is should be very small, so in
> > practice there is likely no harm.
> 
> And if such a thing did actually happen we would also need to have an
> accidentally correct value of all of the rest of the header values.
> And even if we did we would apply at most one junk WAL record. Then we
> are onto the next WAL record where we would need have to the same luck
> all over again.
> 
> The probability of these occurrences is well below the acceptable
> threshold for other problems occurring.
> 
> > It strikes me, though, that we
> > could probably get nearly all of the benefit of this patch by being
> > willing to zero the first sizeof(XLogRecord) bytes following a record,
> > but not the rest of the buffer. ?That would pretty much wipe out any
> > chance of an xl_prev match, I think, and would likely still get nearly
> > all of the performance benefit.
> 
> Which adds something onto the path of every XlogInsert(), rather than
> once per page, so I'm a little hesitant to agree.
> 
> If we did that, we would only need to write out an additional 12 bytes
> per WAL record, not the full sizeof(XLogRecord).
> 
> But even so, I think its wasted effort.
> 
> Measuring the benefit of a performance patch is normal, but I'm not
> proposing this as a risk trade-off. It's just a straight removal of
> multiple cycles from a hot code path. The exact benefit will depend
> upon whether the WALInsertLock is the hot lock, which it likely will
> be when other patches are applied.
> 
> -- 
> ?Simon Riggs?????????????????? http://www.2ndQuadrant.com/
> ?PostgreSQL Development, 24x7 Support, Training & Services
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: WALInsertLock tuning

От
Robert Haas
Дата:
On Thu, Oct 13, 2011 at 9:35 PM, Bruce Momjian <bruce@momjian.us> wrote:
>
> I assume this was addressed with this commit:
>
>        commit 465883b0a2b4236ba6b31b648a9eabef3b7cdddb
>        Author: Simon Riggs <simon@2ndQuadrant.com>
>        Date:   Tue Jun 28 22:58:17 2011 +0100
>
>            Introduce compact WAL record for the common case of commit (non-DDL).
>            XLOG_XACT_COMMIT_COMPACT leaves out invalidation messages and relfilenodes,
>            saving considerable space for the vast majority of transaction commits.
>            XLOG_XACT_COMMIT keeps same definition as XLOG_PAGE_MAGIC 0xD067 and earlier.
>
>            Leonardo Francalanci and Simon Riggs

No, that's completely unrelated.  I think it just never quite made it
to prime time - it was analyzed theoretically but some of the testing
discussed on the thread doesn't seem to have been done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company