Discussion: Lock problem with autovacuum truncating heap


Lock problem with autovacuum truncating heap

From:
Jan Wieck
Date:
We have run across a problem with autovacuum that occurs when it can 
truncate off a large number of empty blocks. It behaves differently in 
version 9.0 than in previous versions. Both behaviors are bad.

Consider a relation that receives constant inserts/updates, which are 
satisfied using free space at the beginning of the heap. Delete operations 
have removed large numbers of tuples at the end of the relation. The 
next autovacuum will find a large number of blocks at the end that 
can be truncated.

vacuumlazy.c now takes out an access exclusive lock and scans the relation 
*backwards* to find out if concurrent access has created new tuples in 
the to-be-truncated space. Apparently such a backward scan in 8K blocks 
isn't really a good access strategy.

Up to 8.4, it simply holds the lock until it is done, which in our case 
stalled a production system for 12 minutes! This is obviously bad.

In 9.0, the autovacuum process will be aborted about 1 second after 
another transaction starts waiting for a lock. The result is that even a 
simple INSERT will take 1 second. The autovacuum restarts shortly after 
and somehow gets to a point where it causes this 1-second hiccup 
every 2 minutes. This is slightly better but still far from optimal in a 
world where transaction response times are measured in milliseconds.

My current idea for a fix is to modify lazy_truncate_heap(). It does 
acquire and release the exclusive lock, so it should be possible to do 
this in smaller chunks, releasing and reacquiring the lock so that 
client transactions can get their work done as well. At the same time I 
would change count_nondeletable_pages() so that it uses a forward scan 
direction (if that leads to a speedup).


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Lock problem with autovacuum truncating heap

From:
Simon Riggs
Date:
On Sat, Mar 26, 2011 at 2:30 PM, Jan Wieck <JanWieck@yahoo.com> wrote:

> My current idea for a fix is to modify lazy_truncate_heap(). It does acquire
> and release the exclusive lock, so it should be possible to do this in
> smaller chunks, releasing and reacquiring the lock so that client
> transactions can get their work done as well.

Agreed, presumably with vacuum delay in there as well?

> At the same time I would
> change count_nondeletable_pages() so that it uses a forward scan direction
> (if that leads to a speedup).

Do we need that? Linux readahead works in both directions doesn't it?
Guess it wouldn't hurt too much.

BTW does it read the blocks at that point using a buffer strategy?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Lock problem with autovacuum truncating heap

From:
Itagaki Takahiro
Date:
On Sun, Mar 27, 2011 at 01:12, Simon Riggs <simon@2ndquadrant.com> wrote:
>> At the same time I would
>> change count_nondeletable_pages() so that it uses a forward scan direction
>> (if that leads to a speedup).

+1.

> Do we need that? Linux readahead works in both directions doesn't it?
> Guess it wouldn't hurt too much.

Yes, probably. AFAIK, RHEL 5 cannot do readahead for backward scans.
It might be improved in the latest kernels, but it would be safe
not to rely on the kernel for anything except simple forward scans.

-- 
Itagaki Takahiro


Re: Lock problem with autovacuum truncating heap

From:
Robert Haas
Date:
On Mar 26, 2011, at 1:44 PM, Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote:
> On Sun, Mar 27, 2011 at 01:12, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> At the same time I would
>>> change count_nondeletable_pages() so that it uses a forward scan direction
>>> (if that leads to a speedup).
>
> +1.

Hmm.  That would speed up truncations that are large relative to the table size, but slow down small truncations.  And
small truncations are likely to be more common than big ones.

Maybe we could do a mix... back up 16MB and scan forward; if all those pages are empty then back up 16MB from the start
point and scan forward from there.  Or whatever we think the right chunk size is to get some benefit from kernel
readahead without making the "truncate 1 block" case slow.
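
For illustration only, a standalone sketch of that probing pattern over a plain file might look like the following; the all-zero "empty page" test is just a stand-in for the real check that count_nondeletable_pages() performs, and none of this is backend code:

/*
 * Standalone sketch of the probing pattern: step back from the end of the
 * file in fixed-size chunks and scan each chunk forward.  "Empty" here
 * simply means an all-zero 8K page, which only stands in for the real
 * page inspection done by count_nondeletable_pages().
 */
#include <stdio.h>
#include <stdlib.h>

#define BLCKSZ     8192
#define CHUNKBLKS  2048            /* 16MB worth of 8K blocks */

static int
page_is_empty(const char *page)
{
    for (int i = 0; i < BLCKSZ; i++)
        if (page[i] != 0)
            return 0;
    return 1;
}

/* Returns the number of blocks to keep (i.e. the first truncatable block). */
static long
find_new_rel_end(FILE *fp, long nblocks)
{
    char   *page = malloc(BLCKSZ);
    long    keep = nblocks;                 /* nothing truncatable so far */
    long    chunk_start = nblocks;

    while (chunk_start > 0)
    {
        long    last_nonempty = -1;

        chunk_start = (chunk_start > CHUNKBLKS) ? chunk_start - CHUNKBLKS : 0;

        /* scan this chunk forward, remembering the last non-empty block */
        fseek(fp, chunk_start * (long) BLCKSZ, SEEK_SET);
        for (long blk = chunk_start; blk < keep; blk++)
        {
            if (fread(page, 1, BLCKSZ, fp) != BLCKSZ)
                break;
            if (!page_is_empty(page))
                last_nonempty = blk;
        }

        if (last_nonempty >= 0)
        {
            keep = last_nonempty + 1;       /* found the new end of data */
            break;
        }
        keep = chunk_start;                 /* whole chunk empty; step back more */
    }

    free(page);
    return keep;
}

int
main(int argc, char **argv)
{
    FILE   *fp;
    long    nblocks;

    if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL)
        return 1;
    fseek(fp, 0, SEEK_END);
    nblocks = ftell(fp) / BLCKSZ;
    printf("keep %ld of %ld blocks\n", find_new_rel_end(fp, nblocks), nblocks);
    fclose(fp);
    return 0;
}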

...Robert

Re: Lock problem with autovacuum truncating heap

From:
Jan Wieck
Date:
On 3/26/2011 12:12 PM, Simon Riggs wrote:
> On Sat, Mar 26, 2011 at 2:30 PM, Jan Wieck<JanWieck@yahoo.com>  wrote:
>
>>  My current idea for a fix is to modify lazy_truncate_heap(). It does acquire
>>  and release the exclusive lock, so it should be possible to do this in
>>  smaller chunks, releasing and reacquiring the lock so that client
>>  transactions can get their work done as well.
>
> Agreed, presumably with vacuum delay in there as well?

Not sure about that. My theory is that unless somebody needs access to 
that table, just have at it like it is now.

The current implementation seems to assume that the blocks being checked 
for emptiness are still found in memory (vacuum just scanned them). And 
that seems to be correct most of the time, in which case adding vacuum 
delay only gives the blocks more time to get evicted and need to be 
read back in.


>
>>  At the same time I would
>>  change count_nondeletable_pages() so that it uses a forward scan direction
>>  (if that leads to a speedup).
>
> Do we need that? Linux readahead works in both directions doesn't it?
> Guess it wouldn't hurt too much.
>
> BTW does it read the blocks at that point using a buffer strategy?

Is reading a file backwards "in 8K blocks" actually an access pattern 
that may confuse buffer strategies?

I don't know. I also don't know if what I am suggesting is much better. 
If you think about it, I merely suggested trying the same access 
pattern with larger chunks. We need to run some tests to find out.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Lock problem with autovacuum truncating heap

From:
Jan Wieck
Date:
On 3/26/2011 3:17 PM, Robert Haas wrote:
> On Mar 26, 2011, at 1:44 PM, Itagaki Takahiro<itagaki.takahiro@gmail.com>  wrote:
>>  On Sun, Mar 27, 2011 at 01:12, Simon Riggs<simon@2ndquadrant.com>  wrote:
>>>>  At the same time I would
>>>>  change count_nondeletable_pages() so that it uses a forward scan direction
>>>>  (if that leads to a speedup).
>>
>>  +1.
>
> Hmm.  That would speed up truncations that are large relative to the table size, but slow down small truncations.
> And small truncations are likely to be more common than big ones.

For small truncations the blocks to check are most likely found in 
memory (shared or OS buffer) anyway, in which case the access pattern 
should be rather irrelevant.

>
> Maybe we could do a mix... back up 16MB and scan forward; if all those pages are empty then back up 16MB from the
> start point and scan forward from there.  Or whatever we think the right chunk size is to get some benefit from kernel
> readahead without making the "truncate 1 block" case slow.

That was what I meant. Go in steps of 16-64MB backwards and scan from 
there to the current end in forward direction to find a nondeletable 
block. In between these steps, release and reacquire the exclusive lock 
so that client transactions can get their work done.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Lock problem with autovacuum truncating heap

From:
Simon Riggs
Date:
On Sat, Mar 26, 2011 at 8:05 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> On 3/26/2011 12:12 PM, Simon Riggs wrote:
>>
>> On Sat, Mar 26, 2011 at 2:30 PM, Jan Wieck<JanWieck@yahoo.com>  wrote:
>>
>>>  My current idea for a fix is to modify lazy_truncate_heap(). It does
>>> acquire
>>>  and release the exclusive lock, so it should be possible to do this in
>>>  smaller chunks, releasing and reacquiring the lock so that client
>>>  transactions can get their work done as well.
>>
>> Agreed, presumably with vacuum delay in there as well?
>
> Not sure about that. My theory is that unless somebody needs access to that
> table, just have at it like it is now.
>
> The current implementation seems to assume that the blocks, checked for
> being empty, are still found in memory (vacuum just scanned them). And that
> seems to be correct most of the time, in which case adding vacuum delay only
> gives more time that the blocks get evicted and have to be read back in.

I think someone fairly clever already thought of that.
vacuum_cost_page_hit = 1 by default, so the cost of accessing pages
still in memory is 1/10th the cost of disk access. So we will only
perform the delay for each chunk if we had to read it from disk.
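
As a rough worked example (the defaults below are assumptions about the usual settings, not something stated in this thread): with vacuum_cost_limit = 200, vacuum_cost_page_hit = 1 and vacuum_cost_page_miss = 10, a vacuum can touch about ten times as many cached pages as disk pages between cost-based sleeps:

/* Back-of-the-envelope only; the GUC defaults below are assumed, not quoted. */
#include <stdio.h>

int
main(void)
{
    const int vacuum_cost_limit     = 200;  /* assumed default */
    const int vacuum_cost_page_hit  = 1;    /* assumed default */
    const int vacuum_cost_page_miss = 10;   /* assumed default */

    printf("pages per sleep if cached:    %d\n",
           vacuum_cost_limit / vacuum_cost_page_hit);   /* 200 */
    printf("pages per sleep if from disk: %d\n",
           vacuum_cost_limit / vacuum_cost_page_miss);  /* 20 */
    return 0;
}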

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Lock problem with autovacuum truncating heap

From:
Robert Haas
Date:
On Mar 26, 2011, at 4:16 PM, Jan Wieck <JanWieck@Yahoo.com> wrote:
> That was what I meant. Go in steps of 16-64MB backwards and scan from there to the current end in forward direction
> to find a nondeletable block. In between these steps, release and reacquire the exclusive lock so that client
> transactions can get their work done.

Well, VACUUM uses a 16MB ring buffer, so anything that size or smaller should hit shared_buffers most of the time.

I wonder though if this might defeat read-behind on operating systems that do have a working implementation.  With our
current approach each read will end at the point the previous read started, which might be an algorithm somebody is
using to detect a backward scan.

...Robert

Re: Lock problem with autovacuum truncating heap

From:
Simon Riggs
Date:
On Sun, Mar 27, 2011 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mar 26, 2011, at 4:16 PM, Jan Wieck <JanWieck@Yahoo.com> wrote:
>> That was what I meant. Go in steps of 16-64MB backwards and scan from there to the current end in forward direction
>> to find a nondeletable block. In between these steps, release and reacquire the exclusive lock so that client
>> transactions can get their work done.
>
> Well, VACUUM uses a 16MB ring buffer, so anything that size or smaller should hit shared_buffers most of the time.
>
> I wonder though if this might defeat read-behind on operating systems that do have a working implementation.  With
> our current approach each read will end at the point the previous read started, which might be an algorithm somebody is
> using to detect a backward scan.

Good point. That means the last 16MB of buffers will be in
shared_buffers. Anything more than that will definitely not be,
because we wrote them out ourselves.

So we should truncate in 16MB chunks also.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Lock problem with autovacuum truncating heap

From:
Jan Wieck
Date:
On 3/27/2011 1:24 PM, Simon Riggs wrote:
> So we should truncate in 16MB chunks also.

On second thought, fiddling with the scan direction is probably too 
much of a change for back releases anyway. That 8.3/8.4 can get into a 
situation where autovacuum causes a 12-minute freeze of a production 
server could be argued to be a bug. Likewise that 9.0 will never succeed 
in truncating but causes a 1-second hiccup every two minutes.

Since we are talking about stable releases, I think just releasing and 
reacquiring the exclusive lock is enough. We can then try to further 
improve things for future releases.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Lock problem with autovacuum truncating heap

From:
Greg Stark
Date:
On Sun, Mar 27, 2011 at 8:25 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
>
> Since we are talking about stable releases, I think just releasing and
> reacquiring the exclusive lock is enough. We can then try to further improve
> things for future releases.

I like all of:

1) move the truncating to a new transaction, just like we currently
handle toast tables in a separate transaction from the main vacuum. I'm
not sure why we do it for toast tables but it makes sense here. If we
get killed by autovacuum's lock-conflict detection, we want to commit the
changes to the pg_class entry so that autovacuum doesn't invoke us again.

2) Don't bother trying to truncate if we've been called from
autovacuum at all. This doesn't help people who run vacuum from a cron
job, but it does help anyone who doesn't know what's going on and is
just randomly having their table exclusive-locked at arbitrary times
in the middle of peak production hours. I doubt the truncation really
helps much in normal operation anyway, and if you've deleted all the
rows in your table it's not a bad recommendation to say you should run
vacuum manually and not rely on autovacuum in that instance.

3) Scan backwards 8MB at a time, scanning each 8MB chunk forwards,
instead of just going backwards block by block.

-- 
greg


Re: Lock problem with autovacuum truncating heap

From:
Robert Haas
Date:
On Sun, Mar 27, 2011 at 3:25 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> Since we are talking about stable releases, I think just releasing and
> reacquiring the exclusive lock is enough. We can then try to further improve
> things for future releases.

That seems unsafe - things can change under you while you don't hold the lock...

I kind of like the idea of committing the transaction and then
beginning a new one just to do the truncation.  Given the way the
deadlock detector treats autovacuum, the current coding seems quite
risky.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Lock problem with autovacuum truncating heap

From:
Jan Wieck
Date:
On 3/27/2011 6:21 PM, Robert Haas wrote:
> On Sun, Mar 27, 2011 at 3:25 PM, Jan Wieck<JanWieck@yahoo.com>  wrote:
>>  Since we are talking about stable releases, I think just releasing and
>>  reacquiring the exclusive lock is enough. We can then try to further improve
>>  things for future releases.
>
> That seems unsafe - things can change under you while you don't hold the lock...

The only change relevant in this case would be some concurrent client 
extending the relation while we don't hold the lock. A call to 
RelationGetNumberOfBlocks() after reacquiring the lock will tell. Safety 
reestablished.

> I kind of like the idea of committing the transaction and then
> beginning a new one just to do the truncation.  Given the way the
> deadlock detector treats autovacuum, the current coding seems quite
> risky.

I don't like a 1,000 ms hiccup in my system, regardless of how many 
transaction hoops you make it go through.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Lock problem with autovacuum truncating heap

From:
Robert Haas
Date:
On Sun, Mar 27, 2011 at 9:41 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> On 3/27/2011 6:21 PM, Robert Haas wrote:
>>
>> On Sun, Mar 27, 2011 at 3:25 PM, Jan Wieck<JanWieck@yahoo.com>  wrote:
>>>
>>>  Since we are talking about stable releases, I think just releasing and
>>>  reacquiring the exclusive lock is enough. We can then try to further
>>> improve
>>>  things for future releases.
>>
>> That seems unsafe - things can change under you while you don't hold the
>> lock...
>
> The only change relevant in this case would be some concurrent client
> extending the relation while we don't hold the lock. A call to
> RelationGetNumberOfBlocks() after reacquiring the lock will tell. Safety
> reestablished.

I thought that the risk was that someone might write tuples into the
blocks that we're thinking of truncating.

>> I kind of like the idea of committing the transaction and then
>> beginning a new one just to do the truncation.  Given the way the
>> deadlock detector treats autovacuum, the current coding seems quite
>> risky.
>
> I don't like a 1,000 ms hiccup in my system, regardless of how many
> transaction hoops you make it go through.

I can't argue with that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Lock problem with autovacuum truncating heap

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Mar 27, 2011 at 3:25 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
>> Since we are talking about stable releases, I think just releasing and
>> reacquiring the exclusive lock is enough. We can then try to further improve
>> things for future releases.

> That seems unsafe - things can change under you while you don't hold the lock...

And more to the point, it wouldn't actually fix anything, unless you
chop things up so finely that autovac is never holding the lock for more
than milliseconds.  (I believe it wouldn't even be enough if you could
guarantee that autovac didn't hold the lock for more than
deadlock_timeout, because some other process could reach the timeout and
run the deadlock detector very shortly after autovac acquires its lock.)

I don't believe that *any* of what's being discussed here is suitable
material for back-patching.  And it's not material for 9.1, either.
The time for rewriting VACUUM for 9.1 was three months ago.
        regards, tom lane


Re: Lock problem with autovacuum truncating heap

From:
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> I like all of:

> 1) move the truncating to a new transaction just like we currently do
> toast tables in a separate transaction from the main vacuum.

+1 if we are going to continue the behavior of allowing other
transactions to kick autovac off the exclusive lock.  However, if we can
find a way to avoid the need of that, then breaking it into multiple
transactions would just be useless complication (and extra cycles).

In particular, I thought the direction Jan was headed was to release and
reacquire the lock between truncating off limited-size chunks of the
file.  If we do that, we probably *don't* want or need to allow autovac
to be booted off the lock more quickly.

> 2) Don't bother trying to truncate if we've been called from
> autovacuum at all.

No, I think that's seriously going in the wrong direction.  We are
trying to make autovacuum more useful and transparent, not find new
reasons why people have to use manual vacuuming.

> 3) Scanning backwards 8MB at a time scanning each 8MB forwards instead
> of just going back by block backwards.

Maybe.  I'd want to see some experimental evidence justifying the choice
of chunk size; I'm pretty sure this will become counterproductive once
the chunk size is too large.
        regards, tom lane


Re: Lock problem with autovacuum truncating heap

From:
Jan Wieck
Date:
On 3/27/2011 9:51 PM, Robert Haas wrote:
> On Sun, Mar 27, 2011 at 9:41 PM, Jan Wieck<JanWieck@yahoo.com>  wrote:
>>  On 3/27/2011 6:21 PM, Robert Haas wrote:
>>>
>>>  On Sun, Mar 27, 2011 at 3:25 PM, Jan Wieck<JanWieck@yahoo.com>    wrote:
>>>>
>>>>    Since we are talking about stable releases, I think just releasing and
>>>>    reacquiring the exclusive lock is enough. We can then try to further
>>>>  improve
>>>>    things for future releases.
>>>
>>>  That seems unsafe - things can change under you while you don't hold the
>>>  lock...
>>
>>  The only change relevant in this case would be some concurrent client
>>  extending the relation while we don't hold the lock. A call to
>>  RelationGetNumberOfBlocks() after reacquiring the lock will tell. Safety
>>  reestablished.
>
> I thought that the risk was that someone might write tuples into the
> blocks that we're thinking of truncating.

Currently the risk is that while vacuum is doing its main work, someone 
can either extend the relation or reuse space inside one of the empty 
blocks (that are about to be truncated away). That is why the function 
lazy_truncate_heap() does the following:
    1) acquire exclusive lock    2) check via RelationGetNumberOfBlocks() if it has been extended       before locking
-abort if so    3) check via count_nondeletable_pages() what the highest block       in the to be truncated range is,
thatcontains a (newly created)       tuple    4) truncate the relation    5) release the lock
 

The function count_nondeletable_pages() is the one doing the block-wise 
reverse scan. It does check for interrupts, and that is the place where 
the deadlock code will boot vacuum.

What I am proposing is to put all those 5 steps into a loop that tries 
to bite off smaller bits from the end of the table, instead of trying to 
swallow the whole dead space at once.
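
As a minimal sketch of that loop shape (a standalone simulation, not backend code: the lock calls are empty stand-ins, the "relation" is just an array of per-block flags, and the 512-block batch size is picked arbitrarily):

/*
 * Standalone simulation of the proposed loop.  Each pass takes the exclusive
 * lock, rechecks the current relation length, finds the highest still-used
 * block within the last batch, truncates, and releases the lock again.
 */
#include <stdio.h>

#define NBLOCKS   10000
#define BATCH     512              /* blocks considered per lock window */

static int  rel[NBLOCKS];          /* 1 = block contains a tuple */
static int  rel_len = NBLOCKS;     /* current end of the relation */

static void lock_rel(void)   { /* AccessExclusiveLock stand-in */ }
static void unlock_rel(void) { /* lock release stand-in */ }

int
main(void)
{
    /* pretend the last 6000 blocks were emptied by the main vacuum pass */
    for (int i = 0; i < NBLOCKS; i++)
        rel[i] = (i < 4000);

    for (;;)
    {
        lock_rel();                             /* step 1 */

        int end = rel_len;                      /* step 2: recheck the length */
        int target = (end > BATCH) ? end - BATCH : 0;

        /* step 3: highest block in [target, end) that still has a tuple */
        int last_used = -1;
        for (int blk = target; blk < end; blk++)
            if (rel[blk])
                last_used = blk;

        int new_len = (last_used >= 0) ? last_used + 1 : target;

        rel_len = new_len;                      /* step 4: truncate */
        unlock_rel();                           /* step 5 */

        printf("truncated to %d blocks\n", rel_len);

        /* stop once we hit a block that is still in use, or block 0 */
        if (last_used >= 0 || rel_len == 0)
            break;
    }
    return 0;
}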

count_nondeletable_pages() is a static function, only called from 
lazy_truncate_heap(), so fiddling with the scan direction inside of it 
would be totally safe from a functional side-effect point of view. Whether 
to do so depends on whether reversing its scan direction actually has a 
performance benefit. I agree with Tom that at some "chunk" size 
the effect might be negative. That is because currently it scans 
backwards and returns at the first block containing a tuple. To scan 
forward, it has to scan all the blocks, remembering the last one that 
contained a tuple.
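
Purely to illustrate that last point, the two scan directions over an array of per-block "still in use" flags would look roughly like this; the array is only a stand-in for the real page inspection:

/*
 * Illustration only: the difference between the two scan directions over
 * the candidate range [start, end), where used[] says whether a block still
 * contains a tuple.  Both return the number of blocks that must be kept.
 */

/* current style: walk backwards, stop at the first block with a tuple */
static int
scan_backward(const int *used, int start, int end)
{
    for (int blk = end - 1; blk >= start; blk--)
        if (used[blk])
            return blk + 1;
    return start;
}

/* forward style: must visit every block, remembering the last one in use */
static int
scan_forward(const int *used, int start, int end)
{
    int last_used = -1;

    for (int blk = start; blk < end; blk++)
        if (used[blk])
            last_used = blk;
    return (last_used >= 0) ? last_used + 1 : start;
}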

>>  I don't like a 1,000 ms hiccup in my system, regardless of how many
>>  transaction hoops you make it go through.
>
> I can't argue with that.

I assumed we have a consensus that both locking a system for 10+ 
minutes and having a 1,000 ms hiccup every 2 minutes are problems 
we need to fix.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Lock problem with autovacuum truncating heap

From:
Jan Wieck
Date:
On 3/27/2011 10:43 PM, Tom Lane wrote:

> In particular, I thought the direction Jan was headed was to release and
> reacquire the lock between truncating off limited-size chunks of the
> file.  If we do that, we probably *don't* want or need to allow autovac
> to be booted off the lock more quickly.

That is correct.

>>  3) Scanning backwards 8MB at a time scanning each 8MB forwards instead
>>  of just going back by block backwards.
>
> Maybe.  I'd want to see some experimental evidence justifying the choice
> of chunk size; I'm pretty sure this will become counterproductive once
> the chunk size is too large.

Me too, which is why that part of my proposal is highly questionable and 
requires a lot of evidence to be even remotely considered for back releases.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Lock problem with autovacuum truncating heap

From:
Jan Wieck
Date:
On 3/28/2011 12:35 PM, Jan Wieck wrote:
> On 3/27/2011 10:43 PM, Tom Lane wrote:
>
>>  In particular, I thought the direction Jan was headed was to release and
>>  reacquire the lock between truncating off limited-size chunks of the
>>  file.  If we do that, we probably *don't* want or need to allow autovac
>>  to be booted off the lock more quickly.
>
> That is correct.
>
>>>   3) Scanning backwards 8MB at a time scanning each 8MB forwards instead
>>>   of just going back by block backwards.
>>
>>  Maybe.  I'd want to see some experimental evidence justifying the choice
>>  of chunk size; I'm pretty sure this will become counterproductive once
>>  the chunk size is too large.
>
> Me too, which is why that part of my proposal is highly questionable and
> requires a lot of evidence to be even remotely considered for back releases.

Attached is a patch against HEAD that implements the part that truncates
the heap in small batches (512 pages at a time) without fiddling with
the scan direction.

It does several retries when attempting to get the exclusive lock. This
is because, when doing it this way, I discovered that locks queued up
behind the exclusive lock held by autovacuum make it too likely that the
lock attempt fails after just a few batches.

I am going to see what a similar logic will do to 8.4, where the
exclusive lock has far more severe consequences to client connections.
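
The retry logic could look roughly like the following sketch; the function name, retry count and sleep interval here are invented for illustration and are not taken from the actual patch:

/*
 * Sketch of the lock-retry idea only; try_acquire_exclusive(), the retry
 * count and the sleep below are made-up stand-ins, not the patch's values.
 */
#include <stdbool.h>
#include <time.h>

#define LOCK_RETRIES  10

/* Stand-in for a conditional AccessExclusiveLock attempt. */
static bool
try_acquire_exclusive(void)
{
    return true;
}

/* Returns true once the lock is held, false if we give up for this batch. */
static bool
acquire_lock_with_retries(void)
{
    struct timespec pause = { 0, 50 * 1000 * 1000 };    /* 50 ms */

    for (int attempt = 0; attempt < LOCK_RETRIES; attempt++)
    {
        if (try_acquire_exclusive())
            return true;
        nanosleep(&pause, NULL);    /* give queued lockers a chance to drain */
    }
    return false;
}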


Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

Attachments

Re: Lock problem with autovacuum truncating heap

From:
Jim Nasby
Date:
On Mar 27, 2011, at 9:43 PM, Tom Lane wrote:
>> 1) move the truncating to a new transaction just like we currently do
>> toast tables in a separate transaction from the main vacuum.
>
> +1 if we are going to continue the behavior of allowing other
> transactions to kick autovac off the exclusive lock.  However, if we can
> find a way to avoid the need of that, then breaking it into multiple
> transactions would just be useless complication (and extra cycles).
>
> In particular, I thought the direction Jan was headed was to release and
> reacquire the lock between truncating off limited-size chunks of the
> file.  If we do that, we probably *don't* want or need to allow autovac
> to be booted off the lock more quickly.
<snip>
>> 3) Scanning backwards 8MB at a time scanning each 8MB forwards instead
>> of just going back by block backwards.
>
> Maybe.  I'd want to see some experimental evidence justifying the choice
> of chunk size; I'm pretty sure this will become counterproductive once
> the chunk size is too large.

I have an idea that might be better than protecting truncation by exclusive-locking the table: a "virtual end of
relation" pointer we can lock that makes the truncation fast and easy? The idea is that vacuum would grab the lock
protecting that pointer, see what the last valid page actually is, and then truncate everything else. Incrementing that
pointer is one more thing we'd have to do when extending a relation, but it's just an increment (and AFAIK we already
have a lock for extending a rel).

The challenge would come in actually moving that pointer backwards. My thought is that vacuum can simply set it back to
the last known page with data and set a bit indicating that the pointer *might* be ahead of the actual valid end of the
relation. Anything that wants to extend the relation when that bit is set would need to:

- See if the next page actually exists
- See if it actually contains data (vacuum could mark completely empty pages as invalid to speed that up)

Additionally, we'd need to make sure that nothing else could write to a page > virtual end of relation.

Once that's set, vacuum could start at virtual_end_of_relation and scan forward, ensuring no later pages have data.
After that's done it can truncate anything > virtual_end (which might have moved forward in the meantime).
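
For what it's worth, one possible reading of that bookkeeping as a data structure is sketched below; every name and the exact checks are invented for illustration and nothing like this exists in the backend:

/*
 * One possible reading of the proposed bookkeeping; all names are made up.
 */
#include <stdbool.h>

typedef unsigned int BlockNum;

typedef struct VirtualRelEnd
{
    BlockNum    virtual_end;    /* first block past the last known valid page */
    bool        maybe_stale;    /* vacuum moved it back; may still be too high */
} VirtualRelEnd;

/*
 * Extension path: return the block a new tuple may be placed in, assuming
 * the caller already holds the relation extension lock protecting the struct.
 */
static BlockNum
claim_block_for_extension(VirtualRelEnd *v,
                          bool (*page_exists)(BlockNum),
                          bool (*page_has_data)(BlockNum))
{
    /* If vacuum flagged the pointer, verify it before trusting it. */
    while (v->maybe_stale &&
           v->virtual_end > 0 &&
           (!page_exists(v->virtual_end - 1) ||
            !page_has_data(v->virtual_end - 1)))
        v->virtual_end--;
    v->maybe_stale = false;

    return v->virtual_end++;    /* extending: one more valid block */
}
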
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net