Thread: Open issues for HOT patch


Open issues for HOT patch

From: Tom Lane
I have finished a first review pass over all of the HOT patch
(updated code is posted on -patches).  I haven't found any showstoppers,
but there still seem to be several areas that need discussion:

* The patch makes undocumented changes that cause autovacuum's decisions
to be driven by total estimated dead space rather than total number of
dead tuples.  Do we like this?  What should happen to the default
threshold parameters (they are not even in the same units as before...)?
Is there any value in even continuing to track dead tuple counts, per
se, in the pgstats machinery?  It seems redundant/expensive to track
both tuple counts and byte counts, and it's not like the size of the
stats file is not already known to be a performance issue ...

* I'm still pretty unhappy about the patch's use of a relcache copy of
GetAvgFSMRequestSize()'s result.  The fact that there's no provision for
ever updating the value while the relcache entry lives is part of it,
but the bigger part is that I'd rather not have anything at all
depending on that number.  FSM in its current form desperately needs to
die; and once it's replaced by some form of distributed on-disk storage,
it's unlikely that we will have any simple means of getting an
equivalent number.  The average request size was never meant for
external use anyway, but only as a filter to help reject useless entries
from getting into the limited shared-memory FSM space.  Perhaps we could
replace that heuristic with something that is page-local; seems like
dividing the total used space by the number of item pointers would give
at least a rough approximation of the page's average tuple size.
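
For illustration, that page-local arithmetic might look something like the
sketch below (not code from the patch; note that it deliberately counts every
line pointer, which is the weakness raised later in this thread):

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Illustrative sketch only (not code from the patch): estimate the
 * average tuple size on a page as tuple storage divided by the number
 * of line pointers.  This counts every line pointer, including
 * redirected/dead/unused ones.
 */
static Size
page_avg_tuple_size(Page page)
{
	PageHeader	phdr = (PageHeader) page;
	int			nitems = PageGetMaxOffsetNumber(page);

	if (nitems <= 0)
		return 0;

	/* tuple data occupies the space between pd_upper and pd_special */
	return (Size) (phdr->pd_special - phdr->pd_upper) / nitems;
}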

* We also need to think harder about when to invoke the page pruning
code.  As the patch stands, if you set a breakpoint at
heap_page_prune_opt it'll seem to be hit constantly (eg, once for every
system catalog probe), which seems uselessly often.  And yet it also
seems not often enough, because one thing I found out real fast is that
the "prune if free space < 1.2 average tuple size" heuristic fails badly
when you look at queries that execute multiple updates within the same
heap page.  We only prune when we first pin a particular target page,
and so the additional updates don't afford another chance to see if it's
time to prune.

I'd like to see if we can arrange to only do pruning when reading a page
that is known to be an update target (ie, never during plain SELECTs);
I suspect this would be relatively easy with some executor and perhaps
planner changes.  But that only fixes the first half of the gripe above;
I'm not at all sure what to do about the multiple-updates-per-page
issue.

Comments?
        regards, tom lane


Re: Open issues for HOT patch

From: Bruce Momjian
Tom Lane wrote:
> * We also need to think harder about when to invoke the page pruning
> code.  As the patch stands, if you set a breakpoint at
> heap_page_prune_opt it'll seem to be hit constantly (eg, once for every
> system catalog probe), which seems uselessly often.  And yet it also
> seems not often enough, because one thing I found out real fast is that
> the "prune if free space < 1.2 average tuple size" heuristic fails badly
> when you look at queries that execute multiple updates within the same
> heap page.  We only prune when we first pin a particular target page,
> and so the additional updates don't afford another chance to see if it's
> time to prune.
> 
> I'd like to see if we can arrange to only do pruning when reading a page
> that is known to be an update target (ie, never during plain SELECTs);
> I suspect this would be relatively easy with some executor and perhaps
> planner changes.  But that only fixes the first half of the gripe above;
> I'm not at all sure what to do about the multiple-updates-per-page
> issue.

If we only prune on an update (or insert), why not just prune every
time?  I figure the prune/defrag has to be lighter than the
update/insert itself.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

 + If your life is a hard drive, Christ can be your backup. +


Re: Open issues for HOT patch

From: Tom Lane
Bruce Momjian <bruce@momjian.us> writes:
> If we only prune on an update (or insert) why not just do prune every
> time?

The problem is that you can't prune anymore once you have an existing pin on the
target page.  I'd really like to get around that, but so far it seems
unacceptably fragile --- the executor really doesn't expect tuples to
get moved around underneath it.
        regards, tom lane


Re: Open issues for HOT patch

From: Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > If we only prune on an update (or insert) why not just do prune every
> > time?
> 
> The problem is you can't prune anymore once you have existing pin on the
> target page.  I'd really like to get around that, but so far it seems
> unacceptably fragile --- the executor really doesn't expect tuples to
> get moved around underneath it.

I thought you could do the pruning before you pin the page only in
update/insert cases.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

 + If your life is a hard drive, Christ can be your backup. +


Re: Open issues for HOT patch

From: Tom Lane
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> The problem is you can't prune anymore once you have existing pin on the
>> target page.  I'd really like to get around that, but so far it seems
>> unacceptably fragile --- the executor really doesn't expect tuples to
>> get moved around underneath it.

> I thought you could do the pruning before you pin the page only in
> update/insert cases.

But then what happens when you want to update a second tuple on the same
page?  None of our existing plan types release and reacquire pin if they
don't have to, and I really doubt that we want to give up that
optimization.
        regards, tom lane


Re: Open issues for HOT patch

From: Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> The problem is you can't prune anymore once you have existing pin on the
> >> target page.  I'd really like to get around that, but so far it seems
> >> unacceptably fragile --- the executor really doesn't expect tuples to
> >> get moved around underneath it.
> 
> > I thought you could do the pruning before you pin the page only in
> > update/insert cases.
> 
> But then what happens when you want to update a second tuple on the same
> page?  None of our existing plan types release and reacquire pin if they
> don't have to, and I really doubt that we want to give up that
> optimization.

You will prune when you lock the page, and at that point, unless you have
enough room for both tuples, I doubt trying again just before the second tuple
is going to help.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

 + If your life is a hard drive, Christ can be your backup. +


Re: Open issues for HOT patch

From: "Heikki Linnakangas"
Bruce Momjian wrote:
> If we only prune on an update (or insert) why not just do prune every
> time?  I figure the prune/defrag has to be lighter than the
> update/insert itself.

Pruning is quite a costly operation. You need to check the visibility of
each tuple on the page, following tuple chains as you go, mark line
pointers as not used or redirected, and finally memmove all the tuples
to remove the gaps between them. And it needs to be WAL-logged.
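
For readers who have not looked at the heap code, a heavily simplified outline
of that work follows (an illustrative sketch only; the real implementation
lives in the heapam routines and is considerably more involved):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/off.h"

/*
 * Heavily simplified outline of what pruning a heap page involves.
 * Sketch only; visibility checks, redirect handling, WAL record
 * construction and critical sections are all elided.
 */
static void
prune_page_sketch(Buffer buf)
{
	Page		page = BufferGetPage(buf);
	OffsetNumber offnum;
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

	for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
	{
		/*
		 * 1. Check the visibility of the tuple at this line pointer,
		 *    walking its HOT chain (t_ctid links) to find dead members.
		 * 2. Mark line pointers of removable tuples unused or dead, and
		 *    redirect chain roots to the first surviving chain member.
		 */
	}

	/*
	 * 3. Defragment: memmove the surviving tuples together so the free
	 *    space becomes one contiguous hole.
	 * 4. Mark the buffer dirty and emit a WAL record describing the
	 *    change so crash recovery replays it correctly.
	 */
	MarkBufferDirty(buf);
}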

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Open issues for HOT patch

From: Heikki Linnakangas
Tom Lane wrote:
> * We also need to think harder about when to invoke the page pruning
> code.  As the patch stands, if you set a breakpoint at
> heap_page_prune_opt it'll seem to be hit constantly (eg, once for every
> system catalog probe), which seems uselessly often.  And yet it also
> seems not often enough, because one thing I found out real fast is that
> the "prune if free space < 1.2 average tuple size" heuristic fails badly
> when you look at queries that execute multiple updates within the same
> heap page.  We only prune when we first pin a particular target page,
> and so the additional updates don't afford another chance to see if it's
> time to prune.
> 
> I'd like to see if we can arrange to only do pruning when reading a page
> that is known to be an update target (ie, never during plain SELECTs);
> I suspect this would be relatively easy with some executor and perhaps
> planner changes.  But that only fixes the first half of the gripe above;
> I'm not at all sure what to do about the multiple-updates-per-page
> issue.

There is one wacky idea I haven't dared to propose yet:

We could lift the limitation that you can't defragment a page that's
pinned, if we play some smoke and mirrors in the buffer manager. When
you prune a page, make a *copy* of the page you're pruning, and keep
both versions in the buffer cache. Old pointers keep pointing to the old
version. Any new calls to ReadBuffer will return the new copy, and the
old copy can be dropped when its pin count drops to zero.

Tracking multiple copies of a page requires some changes to the buffer
manager. LockBuffer would need to return the latest version of the page,
because anything that checks visibility or does updates would need to
use the latest copy, and callers of LockBuffer would need to be changed
accordingly.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Open issues for HOT patch

From: Tom Lane
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> But then what happens when you want to update a second tuple on the same
>> page?  None of our existing plan types release and reacquire pin if they
>> don't have to, and I really doubt that we want to give up that
>> optimization.

> You will prune when you lock the page and at that point unless you got
> enough room for both tuples I doubt trying just before the second tuple
> is going to help.

No, you're missing the point completely.  If the free space on the page
is, say, 1.5x the average tuple size, the code *won't* prune, and then
it will be stuck when it goes to do the second tuple update, because
there is no chance to reconsider the prune/no-prune decision after some
space is eaten by the first update.
        regards, tom lane


Re: Open issues for HOT patch

From: "Pavan Deolasee"


On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> * I'm still pretty unhappy about the patch's use of a relcache copy of
> GetAvgFSMRequestSize()'s result.  The fact that there's no provision for
> ever updating the value while the relcache entry lives is part of it,
> but the bigger part is that I'd rather not have anything at all
> depending on that number.


We could fix the first part by adding relcache invalidation whenever
the average FSM request size changes by a certain margin. But
I am not insisting on using the avgFSM mechanism to decide when
to prune.
 

> Perhaps we could
> replace that heuristic with something that is page-local; seems like
> dividing the total used space by the number of item pointers would give
> at least a rough approximation of the page's average tuple size.


We might get it completely wrong unless we know the number of
normal line pointers (redirected, dead and unused line pointers
do not take any real storage).

Another option would be to prune whenever the free space goes
below the table fillfactor, and hope that users would set fillfactor so that
at least one updated tuple can fit in the block. I know it's not best to
rely on the users, but it can be a good hint.

Yet another option would be to set a hint on the page whenever we
fail to do a HOT update because there is not enough free space in the
block. Next time we shall prune, and so the subsequent updates
would be HOT updates.
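
A rough sketch of that last idea, with a hypothetical flag name and assuming
a spare bit in a pd_flags-style page-header field (the real patch would choose
its own representation):

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Hypothetical page-header flag; the name, bit value, and the field that
 * holds it are assumptions made purely for illustration.
 */
#define PD_FAILED_HOT_UPDATE	0x0008

/*
 * Would be called from heap_update() when an update could not be done
 * HOT for lack of free space on the page.
 */
static void
mark_hot_update_failed(Page page)
{
	((PageHeader) page)->pd_flags |= PD_FAILED_HOT_UPDATE;
}

/*
 * Would be tested (and cleared) by the next pruning attempt: if the hint
 * is set, prune regardless of the usual free-space heuristic.
 */
static bool
page_wants_prune(Page page)
{
	PageHeader	phdr = (PageHeader) page;

	if (phdr->pd_flags & PD_FAILED_HOT_UPDATE)
	{
		phdr->pd_flags &= ~PD_FAILED_HOT_UPDATE;
		return true;
	}
	return false;
}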

None of these are perfect though.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: Open issues for HOT patch

From: Tom Lane
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> There is one wacky idea I haven't dared to propose yet:

> We could lift the limitation that you can't defragment a page that's
> pinned, if we play some smoke and mirrors in the buffer manager. When
> you prune a page, make a *copy* of the page you're pruning, and keep
> both versions in the buffer cache. Old pointers keep pointing to the old
> version. Any new calls to ReadBuffer will return the new copy, and the
> old copy can be dropped when its pin count drops to zero.

No, that's way too wacky.  How do you prevent people from making further
changes to the "old" version?  For instance, marking a tuple deleted?

The actual practical application we have, I think, would only require
being able to defrag a page that our own backend has pins on, which is
something that might be more workable --- but it still seems awfully
fragile.  It could maybe be made to work in the simplest case of a
plain UPDATE, because in practice I think the executor will never
reference the old tuple's contents after heap_update() returns.  But
this falls down in more complex situations involving joins --- we might
continue to try to join the same "old" tuple to other rows, and then any
pass-by-reference Datums we are using are corrupt if the tuple got
moved.
        regards, tom lane


Re: Open issues for HOT patch

From: Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> But then what happens when you want to update a second tuple on the same
> >> page?  None of our existing plan types release and reacquire pin if they
> >> don't have to, and I really doubt that we want to give up that
> >> optimization.
> 
> > You will prune when you lock the page and at that point unless you got
> > enough room for both tuples I doubt trying just before the second tuple
> > is going to help.
> 
> No, you're missing the point completely.  If the free space on the page
> is, say, 1.5x the average tuple size, the code *won't* prune, and then
> it will be stuck when it goes to do the second tuple update, because
> there is no chance to reconsider the prune/no-prune decision after some
> space is eaten by the first update.

My point is that if you only do this for INSERT/UPDATE, you can prune
when you have less than enough room for 3-4 tuples, and if you add the
xmin of the earliest prune xact you can prune even more aggressively.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

 + If your life is a hard drive, Christ can be your backup. +


Re: Open issues for HOT patch

From: Tom Lane
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Perhaps we could
>> replace that heuristic with something that is page-local; seems like
>> dividing the total used space by the number of item pointers would give
>> at least a rough approximation of the page's average tuple size.
>> 
> We might get it completely wrong unless we know the number of
> normal line pointers (redirected, dead and unused line pointers
> do not take any real storage).

Sure, but it's only a heuristic anyway.  Probably a more serious
objection is that it fails in the wrong direction: if you start to get
line pointer bloat then the estimated average tuple size goes down,
making it less likely to prune instead of more.  But maybe we could do
something that looks at free space and the number of pointers independently,
rather than operating in terms of average tuple size?

> Another option would be to prune whenever the free space goes
> below table fillfactor and hope that users would set fillfactor so that
> atleast one updated tuple can fit in the block. I know its not best to
> rely on the users though. But it can be good hint.

If default fillfactor weren't 100% then this might be good ;-).  But
we could use max(1-fillfactor, BLCKSZ/8) or some such.
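
Expressed in bytes, with "1-fillfactor" read as the fraction of the block that
the fillfactor leaves free (an interpretation, not code from the patch), that
check might look roughly like:

#include "postgres.h"

/*
 * Sketch of the proposed trigger: prune once the free space on the page
 * drops below the space that the fillfactor is supposed to keep free,
 * with a BLCKSZ/8 floor so tables using the default fillfactor of 100
 * still get pruned.  Illustrative only.
 */
static bool
page_prune_needed(Size free_space, int fillfactor)
{
	Size		target = (BLCKSZ * (Size) (100 - fillfactor)) / 100;

	if (target < BLCKSZ / 8)
		target = BLCKSZ / 8;	/* 1kB with the default 8kB block size */

	return free_space < target;
}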

> Yet another option would be to set a hint on the page whenever we
> fail to do HOT update because of not-enough-free-space in the
> block. Next time we shall prune and so the subsequent updates
> would be HOT update.

This would be a good idea independent of anything else, I think.
There's plenty of room for page hint bits, and the write will be
"free" since we must set the old tuple XMAX anyway.
        regards, tom lane


Re: Open issues for HOT patch

From: "Pavan Deolasee"


On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> Another option would be to prune whenever the free space goes
>> below table fillfactor

> If default fillfactor weren't 100% then this might be good ;-).  But
> we could use max(1-fillfactor, BLCKSZ/8) or some such.

>> Yet another option would be to set a hint on the page whenever we
>> fail to do HOT update

> This would be a good idea independent of anything else, I think.



Or maybe a combination of the above two would work well. If the user
has not specified any fillfactor (and so the table is using the default 100), the
first update is most likely a COLD update and we set the hint bit. The
retired tuple is then pruned before the next update comes, and we shall
do a HOT update. OTOH, if the user has specified a fillfactor less than 100
to avoid any COLD update, the first update will be HOT. At that point,
the used space would go above the fillfactor and hence we shall prune
before the next update comes, making room for the next HOT update.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: Open issues for HOT patch

From: Tom Lane
I wrote:
> * The patch makes undocumented changes that cause autovacuum's decisions
> to be driven by total estimated dead space rather than total number of
> dead tuples.  Do we like this?

No one seems to have picked up on this point, but after reflection
I think there's actually a pretty big problem here.  Per-page pruning
is perfectly capable of keeping dead space in check.  In a system with
HOT running well, the reasons to vacuum a table will be:

1. Remove dead index entries.
2. Remove LP_DEAD line pointers.
3. Truncate off no-longer-used end pages.
4. Transfer knowledge about free space into FSM.

Pruning cannot accomplish #1, #2, or #3, and without significant changes
in the FSM infrastructure it has no hope about #4 either.  What I'm
afraid of is that steady page-level pruning will keep the amount of dead
space low, causing autovacuum never to fire, causing the indexes to
bloat indefinitely because of #1 and the table itself to bloat
indefinitely because of #2 and #4.  Thus, the proposed change in
autovacuum seems badly misguided: instead of making autovacuum trigger
on things that only it can fix, it makes autovacuum trigger on something
that per-page pruning can deal with perfectly well.

I'm inclined to think that we should continue to drive autovac off a
count of dead rows, as this is directly related to points #1 and #2,
and doesn't seem any worse for #3 and #4 than an estimate based on space
would be.  Possibly it would be sensible for per-page pruning to report
a reduction in number of dead rows when it removes heap-only tuples,
but I'm not entirely sure --- any thoughts?
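
For reference, the dead-row trigger being defended here is essentially the
documented autovacuum condition; in rough form (variable names are
illustrative):

#include "postgres.h"

/*
 * Essence of the existing autovacuum trigger: vacuum a table once its
 * dead-tuple count exceeds a fixed base threshold plus a scale factor
 * times the estimated table size.
 */
static bool
relation_needs_autovacuum(float4 reltuples, long dead_tuples,
						  int base_threshold, float4 scale_factor)
{
	float4		vacthresh = base_threshold + scale_factor * reltuples;

	return dead_tuples > vacthresh;
}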

If we do this, then it's not clear that having pgstats track dead space
is worth the trouble at all.  It might possibly be of value for testing
purposes to see how well pruning is doing, but I'm unconvinced that it's
worth bloating stats messages and files to have this number in a
production system.  An alternative that would serve as well for testing
would be to teach contrib/pgstattuple to measure dead space.

Comments?
        regards, tom lane


Re: Open issues for HOT patch

From: "Joshua D. Drake"

Tom Lane wrote:
> I wrote:
>> * The patch makes undocumented changes that cause autovacuum's decisions
>> to be driven by total estimated dead space rather than total number of
>> dead tuples.  Do we like this?

> If we do this, then it's not clear that having pgstats track dead space
> is worth the trouble at all.  It might possibly be of value for testing
> purposes to see how well pruning is doing, but I'm unconvinced that it's
> worth bloating stats messages and files to have this number in a
> production system.  An alternative that would serve as well for testing
> would be to teach contrib/pgstattuple to measure dead space.

As a DBA, I can say it doesn't really matter to me *how we track* the
dead space, as long as tracking it is:

1. Clear
2. Simple
3. Available by default (thus pgstattuple needs to be pushed into core)


Sincerely,

Joshua D. Drake



--
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/
             UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: Open issues for HOT patch

From: "Pavan Deolasee"


On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> * The patch makes undocumented changes that cause autovacuum's decisions
>> to be driven by total estimated dead space rather than total number of
>> dead tuples.  Do we like this?
>
> No one seems to have picked up on this point, but after reflection
> I think there's actually a pretty big problem here.  Per-page pruning
> is perfectly capable of keeping dead space in check.  In a system with
> HOT running well, the reasons to vacuum a table will be:
>
> 1. Remove dead index entries.
> 2. Remove LP_DEAD line pointers.
> 3. Truncate off no-longer-used end pages.
> 4. Transfer knowledge about free space into FSM.
>
> Pruning cannot accomplish #1, #2, or #3, and without significant changes
> in the FSM infrastructure it has no hope about #4 either.


I guess we already have a mechanism to remove dead index entries
outside vacuum. So my take would be to handle vacuum based on dead
index entries separately. Maybe we can track the number of dead index
entries and trigger vacuum on the base relation if it goes beyond a
threshold.

For LP_DEAD line pointers, with some adjustments to the patch, we
can make it track dead space in a page by accounting for the
LP_DEAD pointers. So if there is bloat because of LP_DEAD
pointers, that will be reflected in the dead space and help us trigger
vacuum on the table.



> I'm inclined to think that we should continue to drive autovac off a
> count of dead rows

If we do that, I guess it would make sense to count only non-HOT
dead tuples, because HOT tuples neither create LP_DEAD line
pointers nor cause index bloat.

Thanks,
Pavan
 

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: Open issues for HOT patch

From: Tom Lane
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In a system with
>> HOT running well, the reasons to vacuum a table will be:
>> 
>> 1. Remove dead index entries.
>> 2. Remove LP_DEAD line pointers.
>> 3. Truncate off no-longer-used end pages.
>> 4. Transfer knowledge about free space into FSM.
>> 
>> Pruning cannot accomplish #1, #2, or #3, and without significant changes
>> in the FSM infrastructure it has no hope about #4 either.

> I guess we already have mechanism to remove dead index entries
> outside vacuum.

Not a trustworthy one --- unless you have a solid proposal for making it
work with bitmap indexscans, it would be foolish to design autovacuum
behavior on the assumption that dead index entries aren't a problem.

(Also, IIRC only btree has been taught to recover dead entries at all.)
        regards, tom lane


Re: Open issues for HOT patch

From: "Pavan Deolasee"


On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In a system with
>> HOT running well, the reasons to vacuum a table will be:
>>
>> 1. Remove dead index entries.
>> 2. Remove LP_DEAD line pointers.
>> 3. Truncate off no-longer-used end pages.
>> 4. Transfer knowledge about free space into FSM.
>>
>> Pruning cannot accomplish #1, #2, or #3, and without significant changes
>> in the FSM infrastructure it has no hope about #4 either.

> I guess we already have mechanism to remove dead index entries
> outside vacuum.

Not a trustworthy one --- unless you have a solid proposal for making it
work with bitmap indexscans, it would be foolish to design autovacuum
behavior on the assumption that dead index entries aren't a problem.



Hmm.. I think we need to drop this for now because I am sure any
such proposal would need a lot more discussion. Maybe it's something
we can pick up for 8.4.

So we go back to tracking dead tuples. I would still be inclined
towards tracking non-HOT dead tuples, or subtracting the count of
pruned HOT tuples, because we don't want to trigger autovacuum
too often; rather, let pruning clean up as much dead space as possible.
This also means that the tuple storage reclaimed by pruning
non-HOT dead tuples does not impact autovacuum behavior,
positively or negatively. And ISTM that this does not address #4?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: Open issues for HOT patch

From: Decibel!
On Tue, Sep 18, 2007 at 11:32:52AM -0400, Tom Lane wrote:
> > Another option would be to prune whenever the free space goes
> > below table fillfactor and hope that users would set fillfactor so that
> > atleast one updated tuple can fit in the block. I know its not best to
> > rely on the users though. But it can be good hint.
>
> If default fillfactor weren't 100% then this might be good ;-).  But

Erik Jones and I were just talking about FILLFACTOR...

Is the plan to keep it at 100% with HOT? ISTM that's not such a great
idea, since it forces at least the first update (if not many more) to be
COLD.

I realize that ideally we'd probably want FILLFACTOR to take things like
average tuple size and average number of updates per page into account,
but for a first pass 90% would likely be a good compromise...
--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Re: Open issues for HOT patch

From: Decibel!
On Tue, Sep 18, 2007 at 09:31:03AM -0700, Joshua D. Drake wrote:
>
> Tom Lane wrote:
> > I wrote:
> >> * The patch makes undocumented changes that cause autovacuum's decisions
> >> to be driven by total estimated dead space rather than total number of
> >> dead tuples.  Do we like this?
>
> > If we do this, then it's not clear that having pgstats track dead space
> > is worth the trouble at all.  It might possibly be of value for testing
> > purposes to see how well pruning is doing, but I'm unconvinced that it's
> > worth bloating stats messages and files to have this number in a
> > production system.  An alternative that would serve as well for testing
> > would be to teach contrib/pgstattuple to measure dead space.
>
> As a DBA, I can say it doesn't really matter to me *how we track* the
> dead space, as long as tracking it is:
>
> 1. Clear
> 2. Simple
> 3. Available by default (thus pgstattuple needs to push into core)
3 isn't that important to me, but 4 is:

4. Doesn't hammer the database to measure

And pgstattuple fails #4 miserably. Want to know the average dead space
in a 500GB database? Yeah, right....
--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Re: Open issues for HOT patch

From: "Joshua D. Drake"

Decibel! wrote:
> On Tue, Sep 18, 2007 at 09:31:03AM -0700, Joshua D. Drake wrote:

>>> If we do this, then it's not clear that having pgstats track dead space
>>> is worth the trouble at all.  It might possibly be of value for testing
>>> purposes to see how well pruning is doing, but I'm unconvinced that it's
>>> worth bloating stats messages and files to have this number in a
>>> production system.  An alternative that would serve as well for testing
>>> would be to teach contrib/pgstattuple to measure dead space.
>> As a DBA, I can say it doesn't really matter to me *how we track* the
>> dead space, as long as tracking it is:
>>
>> 1. Clear
>> 2. Simple
>> 3. Available by default (thus pgstattuple needs to push into core)
>  
> 3 isn't that important to me, but 4 is:
> 
> 4. Doesn't hammer the database to measure
> 
> And pgstattuple fails #4 miserably. Want to know the average dead space
> in a 500GB database? Yeah, right....

Point taken, and agreed (although I still think it needs to be in core).

Joshua D. Drake

--
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/
             UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: Open issues for HOT patch

From: Simon Riggs
On Tue, 2007-09-18 at 12:10 -0400, Tom Lane wrote:
> I wrote:
> > * The patch makes undocumented changes that cause autovacuum's decisions
> > to be driven by total estimated dead space rather than total number of
> > dead tuples.  Do we like this?
> 
> No one seems to have picked up on this point, but after reflection
> I think there's actually a pretty big problem here.  Per-page pruning
> is perfectly capable of keeping dead space in check.  In a system with
> HOT running well, the reasons to vacuum a table will be:
> 
> 1. Remove dead index entries.
> 2. Remove LP_DEAD line pointers.
> 3. Truncate off no-longer-used end pages.
> 4. Transfer knowledge about free space into FSM.
> 
> Pruning cannot accomplish #1, #2, or #3, and without significant changes
> in the FSM infrastructure it has no hope about #4 either.  What I'm
> afraid of is that steady page-level pruning will keep the amount of dead
> space low, causing autovacuum never to fire, causing the indexes to
> bloat indefinitely because of #1 and the table itself to bloat
> indefinitely because of #2 and #4.  Thus, the proposed change in
> autovacuum seems badly misguided: instead of making autovacuum trigger
> on things that only it can fix, it makes autovacuum trigger on something
> that per-page pruning can deal with perfectly well.
> 
> I'm inclined to think that we should continue to drive autovac off a
> count of dead rows, as this is directly related to points #1 and #2,
> and doesn't seem any worse for #3 and #4 than an estimate based on space
> would be.  Possibly it would be sensible for per-page pruning to report
> a reduction in number of dead rows when it removes heap-only tuples,
> but I'm not entirely sure --- any thoughts?

Some behavioural comments only: I was part of the earlier discussion
about when-to-VACUUM and don't have any fixed view of how to do this.

If HOT is running well, then there will be less need for #1, #3 and #4,
as I understand it. Deletes will still cause the need for #1, #3, #4 as
well as dead-space removal. Many tables have only Inserts and Deletes,
so we need to take that into account.

On large tables, VACUUM hurts very badly, so I would like to see it run
significantly less often.

In your last post you mentioned multiple UPDATEs. Pruning multiple times
for successive UPDATEs isn't going to release more space, so why do it?

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com



Re: Open issues for HOT patch

From: Heikki Linnakangas
Decibel! wrote:
> On Tue, Sep 18, 2007 at 11:32:52AM -0400, Tom Lane wrote:
>>> Another option would be to prune whenever the free space goes
>>> below table fillfactor and hope that users would set fillfactor so that
>>> atleast one updated tuple can fit in the block. I know its not best to
>>> rely on the users though. But it can be good hint.
>> If default fillfactor weren't 100% then this might be good ;-).  But
> 
> Erik Jones and I were just talking about FILLFACTOR...
> 
> Is the plan to keep it at 100% with HOT? ISTM that's not such a great
> idea, since it forces at least the first update (if not many more) to be
> COLD.

I think we should still keep it at 100%. Most tables are not updated,
and a non-100% fillfactor will be a waste of space when the extra space is
not needed. Even a table that is updated should reach a steady state
after a few cold updates. Those cold updates will make room on the pages
for future updates, now that we can prune them and leave only dead line
pointers behind.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Open issues for HOT patch

From: "Pavan Deolasee"


On 9/19/07, Simon Riggs <simon@2ndquadrant.com> wrote:

> In your last post you mentioned multiple UPDATEs. Pruning multiple times
> for successive UPDATEs isn't going to release more space, so why do it?


I guess the point is that when you do the first update, there is enough free space
in the page and hence we won't prune the page. We run out of free space for
subsequent updates and want to prune, but now we can't prune because the executor
is holding a pin on the buffer.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: Open issues for HOT patch

From: Heikki Linnakangas
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> We could lift the limitation that you can't defragment a page that's
>> pinned, if we play some smoke and mirrors in the buffer manager. When
>> you prune a page, make a *copy* of the page you're pruning, and keep
>> both versions in the buffer cache. Old pointers keep pointing to the old
>> version. Any new calls to ReadBuffer will return the new copy, and the
>> old copy can be dropped when its pin count drops to zero.
> 
> No, that's way too wacky.  How do you prevent people from making further
> changes to the "old" version?  For instance, marking a tuple deleted?

To make any changes to the "old" version, you need to lock the page with
LockBuffer. LockBuffer needs to return a buffer with the latest version
of the page, and the caller has to use that version for any changes.
Changing all callers of LockBuffer (that lock heap pages) to do that is
the biggest change involved, AFAICS.

Hint bit updates to the old version we could just forget about.

> The actual practical application we have, I think, would only require
> being able to defrag a page that our own backend has pins on, which is
> something that might be more workable --- but it still seems awfully
> fragile.  It could maybe be made to work in the simplest case of a
> plain UPDATE, because in practice I think the executor will never
> reference the old tuple's contents after heap_update() returns.  But
> this falls down in more complex situations involving joins --- we might
> continue to try to join the same "old" tuple to other rows, and then any
> pass-by-reference Datums we are using are corrupt if the tuple got
> moved.

Ugh, yeah that's too fragile.

Another wacky idea:

Within our own backend, we could keep track of which tuples we've
accessed, and defrag could move all other tuples as long as the ones
that we might still have pointers to are not touched. The bookkeeping
wouldn't have to be exact, as long as it's conservative.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Open issues for HOT patch

From: Gregory Stark
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> There is one wacky idea I haven't dared to propose yet:
>
> We could lift the limitation that you can't defragment a page that's
> pinned, if we play some smoke and mirrors in the buffer manager. When
> you prune a page, make a *copy* of the page you're pruning, and keep
> both versions in the buffer cache. Old pointers keep pointing to the old
> version. Any new calls to ReadBuffer will return the new copy, and the
> old copy can be dropped when its pin count drops to zero.

Fwiw when Heikki first mentioned this idea I thought it was the craziest thing
I ever heard. But the more I thought about it the more I liked it. I've come
to the conclusion that while it's a wart, it's not much worse than the wart of
the super-exclusive lock which it replaces. In fact it's arguably cleaner in
some ways.

As a result vacuum would never have to wait for arbitrarily long pins and
there wouldn't be the concept of a vacuum waiting for a vacuum lock with
strange lock queueing semantics. It also means we could move tuples around on
the page more freely.

The only places which would have to deal with a possible new buffer would be
precisely those places that lock the page. If you aren't locking the page then
you definitely aren't about to fiddle with any bits that matter since your
writes could be lost. Certainly you're not about to set xmin or xmax or
anything like that. 

You might set hint bits which would be lost but probably not often since you
would have already checked the visibility of the tuples with the page locked.
There may be one or two places where we fiddle bits for a tuple we've just
inserted ourselves thinking nobody else can see it yet, but the current
philosophy seems to be leaning towards treating such coding practices as
unacceptably fragile anyways.

The buffer manager doesn't really need to track multiple "versions" of pages.
It would just mark the old version as an orphaned buffer which is
automatically a victim for the clock sweep as soon as the pin count drops to
0. It will never need to return such a buffer. What we would need is enough
information to reread the buffer if someone tries to lock it and to unpin it
when someone unpins a newer version.

At first I thought the cost of copying the page would be a downside, but in
fact Heikki pointed out that in defragmentation we're already copying the
page. In fact, copying it to new memory, instead of into memory which is almost
certainly in processor caches and would need to be invalidated, would
actually be faster, and avoiding the use of memmove could be faster too.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Open issues for HOT patch

From: Tom Lane
Decibel! <decibel@decibel.org> writes:
> 3 isn't that important to me, but 4 is:
> 4. Doesn't hammer the database to measure

> And pgstattuple fails #4 miserably. Want to know the average dead space
> in a 500GB database? Yeah, right....

So we could put a vacuum_cost_delay() in it ...
        regards, tom lane


Re: Open issues for HOT patch

From: Decibel!
On Sep 19, 2007, at 8:08 AM, Tom Lane wrote:
> Decibel! <decibel@decibel.org> writes:
>> 3 isn't that important to me, but 4 is:
>> 4. Doesn't hammer the database to measure
>
>> And pgstattuple fails #4 miserably. Want to know the average dead  
>> space
>> in a 500GB database? Yeah, right....
>
> So we could put a vacuum_cost_delay() in it ...

pg_stat_delay()? ;)

That's better than what we have now, without a doubt. But I'd still  
prefer to have a table I can just read on the fly.

We do have a TODO to replace the current stats infrastructure with  
something that has less overhead, right? :)
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828




Re: Open issues for HOT patch

From: Alvaro Herrera
Gregory Stark wrote:
> 
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> 
> > There is one wacky idea I haven't dared to propose yet:
> >
> > We could lift the limitation that you can't defragment a page that's
> > pinned, if we play some smoke and mirrors in the buffer manager. When
> > you prune a page, make a *copy* of the page you're pruning, and keep
> > both versions in the buffer cache. Old pointers keep pointing to the old
> > version. Any new calls to ReadBuffer will return the new copy, and the
> > old copy can be dropped when its pin count drops to zero.
> 
> Fwiw when Heikki first mentioned this idea I thought it was the craziest thing
> I ever heard. But the more I thought about it the more I liked it. I've come
> to the conclusion that while it's a wart, it's not much worse than the wart of
> the super-exclusive lock which it replaces. In fact it's arguably cleaner in
> some ways.

I read this as very similar to RCU.  In some scenarios it makes a lot of
sense.  I don't think it's a thing to be attacked in 8.3 though, since
it is a big change to the bufmgr.

--
Alvaro Herrera                        http://www.advogato.org/person/alvherre
"The Postgresql hackers have what I call a "NASA space shot" mentality. Quite
refreshing in a world of "weekend drag racer" developers."  (Scott Marlowe)