Обсуждение: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

Поиск
Список
Период
Сортировка

For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Phoenix Kiula"
Дата:
The manual is vague. Several threads about this, in language that is
ambiguous to me.

So a YES/NO question:

Q: To get rid of index bloat, is a VACUUM ANALYZE enough? Or must I
reindex/cluster indexes?

Thanks.

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Joshua D. Drake"
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Phoenix Kiula wrote:
> The manual is vague. Several threads about this, in language that is
> ambiguous to me.
>
> So a YES/NO question:
>
> Q: To get rid of index bloat, is a VACUUM ANALYZE enough? Or must I
> reindex/cluster indexes?

If you overrun your max_fsm_pages, no:
  else yes;

Sincerely,

Joshua D. Drake

>
> Thanks.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>


- --

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/
            UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG717QATb/zqfZUUQRAh6uAJ9CGXbA2BxXvMbSZP9Gv8gI9QBkXgCePhqe
6aS3fp60g7YrWECspTVcxyE=
=u2o/
-----END PGP SIGNATURE-----

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Ow Mun Heng
Дата:
On Mon, 2007-09-17 at 22:14 -0700, Joshua D. Drake wrote:
> Phoenix Kiula wrote:
> > So a YES/NO question:
> >
> > Q: To get rid of index bloat, is a VACUUM ANALYZE enough? Or must I
> > reindex/cluster indexes?
>
> If you overrun your max_fsm_pages, no:
>   else yes;

Maybe my english suck, but I don't understand the above answer.

If I overrun my Max_FSM_pages then a vacuum analyse is enough to return
it back to normal.

If I _didn't_ overrun my fsm, then a reindex/cluster is necessary.

Did I get that right? (I feel it's wrong and a reindex/cluster is needed
only when I overrun my max_fsm)




Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Joshua D. Drake"
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ow Mun Heng wrote:
> On Mon, 2007-09-17 at 22:14 -0700, Joshua D. Drake wrote:
>> Phoenix Kiula wrote:
>>> So a YES/NO question:
>>>
>>> Q: To get rid of index bloat, is a VACUUM ANALYZE enough? Or must I
>>> reindex/cluster indexes?
>> If you overrun your max_fsm_pages, no:
>>   else yes;
>
> Maybe my english suck, but I don't understand the above answer.
>
> If I overrun my Max_FSM_pages then a vacuum analyse is enough to return
> it back to normal.

If you overrun your max_fsm_pages then vacuum analyze IS NOT ENOUGH. If
you do not overrun your max_fsm_pages, yes vacuum analyze can deal with
the issue.

Joshua D. Drake

>
> If I _didn't_ overrun my fsm, then a reindex/cluster is necessary.
>
> Did I get that right? (I feel it's wrong and a reindex/cluster is needed
> only when I overrun my max_fsm)
>
>
>


- --

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/
            UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG72kwATb/zqfZUUQRAqNMAJsFjqWirgGF+VlEIwaVDnxBAefeSwCfesD1
osqiudjcEY/tyibvNZRJ/UU=
=apjz
-----END PGP SIGNATURE-----

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Filip Rembiałkowski"
Дата:
2007/9/18, Joshua D. Drake <jd@commandprompt.com>:

> If you overrun your max_fsm_pages then vacuum analyze IS NOT ENOUGH. If
> you do not overrun your max_fsm_pages, yes vacuum analyze can deal with
> the issue.

Are you sure? I have a situation where above is no true. postgres
version 8.1.8. while vacuum verbose says:

INFO:  free space map contains 2329221 pages in 490 relations
DETAIL:  A total of 2345744 page slots are in use (including overhead).
2345744 page slots are required to track all free space.
Current limits are:  10000000 page slots, 1000 relations, using 58698 KB.

... and we have constant problem with index bloat and need to REINDEX
frequently.

the database is very redundant and has quite hight data retention rate
(it's an ecommerce site)


--
Filip Rembiałkowski

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Bill Moran
Дата:
"Filip Rembiałkowski" <plk.zuber@gmail.com> wrote:
>
> 2007/9/18, Joshua D. Drake <jd@commandprompt.com>:
>
> > If you overrun your max_fsm_pages then vacuum analyze IS NOT ENOUGH. If
> > you do not overrun your max_fsm_pages, yes vacuum analyze can deal with
> > the issue.
>
> Are you sure? I have a situation where above is no true. postgres
> version 8.1.8. while vacuum verbose says:
>
> INFO:  free space map contains 2329221 pages in 490 relations
> DETAIL:  A total of 2345744 page slots are in use (including overhead).
> 2345744 page slots are required to track all free space.
> Current limits are:  10000000 page slots, 1000 relations, using 58698 KB.
>
> ... and we have constant problem with index bloat and need to REINDEX
> frequently.
>
> the database is very redundant and has quite hight data retention rate
> (it's an ecommerce site)

I've been involved in a number of the discussions on this, and I think
part of the confusion stems from the fact that "index bloat" is an
ambiguous term.

If the index gets large enough that it no longer fits in shared memory,
and reindexing it will reduce its size to where it _will_ fit in shared
memory, then the index _could_ be said to be "bloated".

However, an equally valid solution to that problem is to increase the
amount of shared memory available (possibly by adding RAM).

Unfortunately, folks like Phoenix are looking for yes/no answers, and
with many of these questions, the _correct_ answer is "it depends on
your workload"

If you find that reindexing improves performance, then you should
investigate further.  Depending on the exact nature of the problem,
there are many possible solutions, three that come to mind:
* Add RAM/SHM
* REINDEX on a regular schedule
* (with newer version) reduce the fill factor and REINDEX

--
Bill Moran
http://www.potentialtech.com

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Ow Mun Heng
Дата:
On Tue, 2007-09-18 at 06:01 -0400, Bill Moran wrote:
> * (with newer version) reduce the fill factor and REINDEX

What is fill factor?

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Rodrigo De León"
Дата:
On 9/18/07, Ow Mun Heng <Ow.Mun.Heng@wdc.com> wrote:
> On Tue, 2007-09-18 at 06:01 -0400, Bill Moran wrote:
> > * (with newer version) reduce the fill factor and REINDEX
>
> What is fill factor?

See "Index Storage Parameters":
http://www.postgresql.org/docs/8.2/static/sql-createindex.html

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Phoenix Kiula"
Дата:
> Unfortunately, folks like Phoenix are looking for yes/no answers, and
> with many of these questions, the _correct_ answer is "it depends on
> your workload"


I wanted merely to simplify the advice that gets dispensed on this
list, often conflicting to novice ears like mine. So I appreciate your
notes very much.


> If you find that reindexing improves performance, then you should
> investigate further.  Depending on the exact nature of the problem,
> there are many possible solutions, three that come to mind:
> * Add RAM/SHM


Can I add SHM with merely by managing the entry in sysctl.conf? My
current values:

kernel.shmmax = 536870912
kernel.shmall = 536870912

My "shared_buffers" in postgresql.conf is "20000". From the website
http://www.desknow.com/kb/idx/12/061/article/ I notice that shmmax
should be sharedbuffer*8192, so I suppose my shmmax can be much lower
than the above, but I raised it for performance. Am I wrong to do so?


> * REINDEX on a regular schedule


This is sadly not really feasible, because we need to offer a 100%
availability website. REINDEX does not work concurrently so it is not
really an option for us. My max_fsm_pages and max_fsm_relations are
way above the numbers that come up after the VACUUM ANALYZE VERBOSE
run.

But still, the autovacuum stuff seems like it is not working at all.
Some related entries in the conf file:

autovacuum                   = on
autovacuum_vacuum_cost_delay = 20
vacuum_cost_delay            = 20
autovacuum_naptime           = 30
stats_start_collector        = on
stats_row_level              = on
autovacuum_vacuum_threshold  = 80
autovacuum_analyze_threshold = 80

And yet, the db often slows down, at which point I manually login and
run a manual VACUUM ANALYZE and it seems fine for some more time.
Sometimes, I also restart pgsql and that seems to help for a while.

Another advice on these forums is to see "vmstat 1", without actually
specifying how to draw inferences from it. The "free" version of it is
coming up at decent rates, as follows:


procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0  29124 110760 108980 3467736    0    1   206   140    0     4  2  1 85 12
 0  0  29124 110632 108980 3467736    0    0     0     0 1052   108  0  0 100  0
 2  0  29124 108840 108980 3467736    0    0     0     0 1112   299  1  1 98  0
 1  0  29124 109288 108980 3467736    0    0     0     0 1073   319  2  1 98  0
.....




> * (with newer version) reduce the fill factor and REINDEX
>


I think some of my tables are updated very frequently so a smaller
fill factor will be nice. How can I find the current fill factor on my
tables? Also, is there some method or science to calculating a decent
fill factor -- size of table, number of indexes, frequency of updates,
and such? We have one major table which faces a lot of INSERTs and
UPDATES in a day (up to 10,000) but many many more SELECTs (up to 10
million).

Thanks.

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Sander Steffann"
Дата:
Hi,

> Can I add SHM with merely by managing the entry in sysctl.conf? My
> current values:
>
> kernel.shmmax = 536870912
> kernel.shmall = 536870912
>
> My "shared_buffers" in postgresql.conf is "20000". From the website
> http://www.desknow.com/kb/idx/12/061/article/ I notice that shmmax
> should be sharedbuffer*8192, so I suppose my shmmax can be much lower
> than the above, but I raised it for performance. Am I wrong to do so?

You need to configure the kernel so it allows processes to use more shared
memory. This does not mean that a process automatically uses it. For
PostgreSQL you will need to increase shared_buffers to make it use the extra
available shared memory. With your shared memory settings you can probably
increase shared_buffers to about 65000.

With the 'ipcs' command you can see how much shared memory PostgreSQL uses.
Look under 'Shared Memory Segments' to memory owned by user postgres.

- Sander



Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Phoenix Kiula"
Дата:
On 18/09/2007, Sander Steffann <s.steffann@computel.nl> wrote:
> Hi,
>
> > Can I add SHM with merely by managing the entry in sysctl.conf? My
> > current values:
> >
> > kernel.shmmax = 536870912
> > kernel.shmall = 536870912
> >
> > My "shared_buffers" in postgresql.conf is "20000". From the website
> > http://www.desknow.com/kb/idx/12/061/article/ I notice that shmmax
> > should be sharedbuffer*8192, so I suppose my shmmax can be much lower
> > than the above, but I raised it for performance. Am I wrong to do so?
>
> You need to configure the kernel so it allows processes to use more shared
> memory. This does not mean that a process automatically uses it. For
> PostgreSQL you will need to increase shared_buffers to make it use the extra
> available shared memory. With your shared memory settings you can probably
> increase shared_buffers to about 65000.
>


Thanks, the IPCS command shows me this:


------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x0052e2c1 6782976    postgres  600        176668672  2


Now, I can merrily increase the shared_buffers, but the manual warns
me against increasing the value too much because it is "per
transaction" value.

So here's the conflict for a novice like me:

1. Do not increase shared_buffer too much because it is per-transaction.
2. Do increase the SHM for performance, but it is only useful if you
also increase shared_buffer.

So which is it?

Would it help to increase the effective_cache_size? It is currently at
"512000".

I have 4GB ram on the machine, but am willing to devote about 2GB to pgsql.

Thanks!

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Sander Steffann"
Дата:
Hi,

> Now, I can merrily increase the shared_buffers, but the manual warns
> me against increasing the value too much because it is "per
> transaction" value.

Shared_buffers is not per-transaction. Where did you find this information?

- Sander



Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Bill Moran
Дата:
In response to "Phoenix Kiula" <phoenix.kiula@gmail.com>:

> > If you find that reindexing improves performance, then you should
> > investigate further.  Depending on the exact nature of the problem,
> > there are many possible solutions, three that come to mind:
> > * Add RAM/SHM
>
> Can I add SHM with merely by managing the entry in sysctl.conf? My
> current values:
>
> kernel.shmmax = 536870912
> kernel.shmall = 536870912

These values define the max allowed.  They exist to keep poorly written
applications from sucking up all the available memory.  Setting them
higher than is needed does not cause any problems, unless a greedy or
poorly-written application grabs all that memory.

> My "shared_buffers" in postgresql.conf is "20000". From the website
> http://www.desknow.com/kb/idx/12/061/article/ I notice that shmmax
> should be sharedbuffer*8192, so I suppose my shmmax can be much lower
> than the above, but I raised it for performance. Am I wrong to do so?

It's completely impossible to tell without knowing more about your
physical hardware.  The rule of thumb is 1/3 physical RAM to start, then
adjust if more or less seems to help.  That advice is for versions of
PG >= 8.  If you're still running a 7.X version, upgrade.

How much RAM does this system have in it?  Unless you have other
applications running on this system using RAM, you should allocate
more of it to shared_buffers.  If 160M is 1/3 your RAM, you probably
need to add more RAM.

How big is your database?  If it's possible to fit it all in
shared_buffers, that will give you the best performance.

> > * REINDEX on a regular schedule
>
> This is sadly not really feasible, because we need to offer a 100%
> availability website. REINDEX does not work concurrently so it is not
> really an option for us. My max_fsm_pages and max_fsm_relations are
> way above the numbers that come up after the VACUUM ANALYZE VERBOSE
> run.

Hence my comment about "depending on your workload" and "investigating
the situation" to determine the best solution.

> But still, the autovacuum stuff seems like it is not working at all.
> Some related entries in the conf file:
>
> autovacuum                   = on
> autovacuum_vacuum_cost_delay = 20
> vacuum_cost_delay            = 20
> autovacuum_naptime           = 30
> stats_start_collector        = on
> stats_row_level              = on
> autovacuum_vacuum_threshold  = 80
> autovacuum_analyze_threshold = 80
>
> And yet, the db often slows down, at which point I manually login and
> run a manual VACUUM ANALYZE and it seems fine for some more time.
> Sometimes, I also restart pgsql and that seems to help for a while.

You don't mention *_scale_factor settings.  Those are going to be
important as well.  Based on your symptoms, it sounds like autovacuum
is not getting those tables vacuumed enough.  I recommend raising the
debug level and watching the logs to see if autovacuum is actually
getting tables vacuumed.  Consider lowering your *_scale_factor values
if not.  Or even reducing autovacuum_naptime.

> Another advice on these forums is to see "vmstat 1", without actually
> specifying how to draw inferences from it. The "free" version of it is
> coming up at decent rates, as follows:
>
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  1  0  29124 110760 108980 3467736    0    1   206   140    0     4  2  1 85 12
>  0  0  29124 110632 108980 3467736    0    0     0     0 1052   108  0  0 100  0
>  2  0  29124 108840 108980 3467736    0    0     0     0 1112   299  1  1 98  0
>  1  0  29124 109288 108980 3467736    0    0     0     0 1073   319  2  1 98  0
> .....

Explaining how to interpret the output of this command and determine
what to do with it is not something easily done in a short paragraph.
However, it looks like you've got a lot of RAM being used for the disk
cache.  That memory would probably be better used as shared_buffers, so
I suggest you increase that value considerably.

> > * (with newer version) reduce the fill factor and REINDEX
>
> I think some of my tables are updated very frequently so a smaller
> fill factor will be nice. How can I find the current fill factor on my
> tables? Also, is there some method or science to calculating a decent
> fill factor -- size of table, number of indexes, frequency of updates,
> and such? We have one major table which faces a lot of INSERTs and
> UPDATES in a day (up to 10,000) but many many more SELECTs (up to 10
> million).

I'm not sure how to find the current value, but a smaller fill factor
on busy tables should lead to less fragmentation, thus more efficient
indexes over time.  Keep in mind that a smaller fill factor will also
lead to larger indexes initially.

--
Bill Moran
http://www.potentialtech.com

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Phoenix Kiula"
Дата:
Thanks for a very informative post! One question:


> I'm not sure how to find the current value, but a smaller fill factor
> on busy tables should lead to less fragmentation, thus more efficient
> indexes over time.  Keep in mind that a smaller fill factor will also
> lead to larger indexes initially.


What constitutes a "small fill factor"? Would 70 be good? I guess my
current must have been the default, which the manual says is 100. Or
did you mean really small fill factor like 20? In this context, what
is "packing" in the manual -- is that some kind of compression?

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Richard Broersma Jr
Дата:
--- Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

> What constitutes a "small fill factor"? Would 70 be good? I guess my
> current must have been the default, which the manual says is 100.

On the following link:

http://www.postgresql.org/docs/8.2/interactive/sql-createindex.html#SQL-CREATEINDEX-STORAGE-PARAMETERS

I found this:

"B-trees use a default fillfactor of 90, but any value from 10 to 100 can be selected."

Regards,
Richard Broersma Jr.

index fillfactor (was Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER)

От
Bill Moran
Дата:
In response to "Phoenix Kiula" <phoenix.kiula@gmail.com>:

> Thanks for a very informative post! One question:
>
> > I'm not sure how to find the current value, but a smaller fill factor
> > on busy tables should lead to less fragmentation, thus more efficient
> > indexes over time.  Keep in mind that a smaller fill factor will also
> > lead to larger indexes initially.
>
> What constitutes a "small fill factor"? Would 70 be good?

Unfortunately, I can't say.  I have not yet had the opportunity to
experiment with different fillfactors, so I can only speak in vague
estimations on this topic.

> I guess my
> current must have been the default, which the manual says is 100.

I expect it's at the default, but the docs say that is 90%:
http://www.postgresql.org/docs/8.2/static/sql-createindex.html

Where did you see 100?

> Or
> did you mean really small fill factor like 20? In this context, what
> is "packing" in the manual -- is that some kind of compression?

Hopefully, someone more knowledgeable will chime in with some wise
suggestions.  Barring that, I can only suggest you experiment to find
what works for your workload, but don't rule out the possibility that
extremely low fillfactor values might work well for you.

--
Bill Moran
http://www.potentialtech.com

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Gregory Williamson"
Дата:

Sorry for top-posting -- challenged reader.

Can't speak directly to PostgreSQL but in Informix the fill factor is useful for tweaking indexes. A very high fill factor is useful for tables that are static -- any inserts or changes to the index trigger a *lot* of moving of b-tree branches. But the high fill factor means that each page has more useful data references in it. A very low fill factor means that pages are "sparse" and so inserts and updates are less likely to trigger massive b-tree rebalancings.

I've never used it on PostgreSQL (yet!) but am looking forward to it.

Beware of premature optimization!

HTH,

Greg Williamson
Senior DBA
GlobeXplorer LLC, a DigitalGlobe company

Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information and must be protected in accordance with those provisions. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.

(My corporate masters made me say this.)



-----Original Message-----
From: pgsql-general-owner@postgresql.org on behalf of Richard Broersma Jr
Sent: Tue 9/18/2007 10:29 AM
To: Phoenix Kiula; Bill Moran
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

--- Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

> What constitutes a "small fill factor"? Would 70 be good? I guess my
> current must have been the default, which the manual says is 100.

On the following link:

http://www.postgresql.org/docs/8.2/interactive/sql-createindex.html#SQL-CREATEINDEX-STORAGE-PARAMETERS

I found this:

"B-trees use a default fillfactor of 90, but any value from 10 to 100 can be selected."

Regards,
Richard Broersma Jr.

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Richard Broersma Jr
Дата:
--- Gregory Williamson <Gregory.Williamson@digitalglobe.com> wrote:

> A very low fill factor means that pages are
> "sparse" and so inserts and updates are less likely to trigger massive b-tree rebalancings.

I take it that "massive b-tree rebalancings" could cause a problem with the performance of disk
writing though-put from UPDATEs and INSERTs?

Regards,
Richard Broersma Jr.

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Greg Williamson
Дата:
Richard Broersma Jr wrote:
> --- Gregory Williamson <Gregory.Williamson@digitalglobe.com> wrote:
>
>
>> A very low fill factor means that pages are
>> "sparse" and so inserts and updates are less likely to trigger massive b-tree rebalancings.
>>
>
> I take it that "massive b-tree rebalancings" could cause a problem with the performance of disk
> writing though-put from UPDATEs and INSERTs?
>
> Regards,
> Richard Broersma Jr.
>
Precisely -- even if it can keep everything in RAM it can occupy quite a
few cycles to rebalance a large b-tree. And eventually those changes do
need to get written to disk so the next checkpoint (I think) will also
have more work.
G

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Phoenix Kiula"
Дата:
On 19/09/2007, Gregory Williamson <Gregory.Williamson@digitalglobe.com> wrote:

...

>  Can't speak directly to PostgreSQL but in Informix the fill factor is
> useful for tweaking indexes. A very high fill factor is useful for tables
> that are static -- any inserts or changes to the index trigger a *lot* of
> moving of b-tree branches. But the high fill factor means that each page has
> more useful data references in it. A very low fill factor means that pages
> are "sparse" and so inserts and updates are less likely to trigger massive
> b-tree rebalancings.



Assuming pgsql's fill factor is similar to Informix' (yes, a highly
suspect assumption), could we say:

1. A small fill factor such as 10 or 20 would be good for the index
size and will not trigger massive btree rebalancings? (I'm first
playing with a value of 60 for now and seeing how it works out...seems
ok at the moment!)

2. Is this fill factor enough to have on the table, or should I also
do a fill factor for specific indexes? Or both the table and the
index? (I have four btree indexes on the table)

Thanks

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Richard Broersma Jr
Дата:
--- Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

> 2. Is this fill factor enough to have on the table, or should I also
> do a fill factor for specific indexes? Or both the table and the
> index? (I have four btree indexes on the table)

I don't think that fill factor can be applied to the table.  The CREATE TABLE reference doc show
that fill factor can be used in the CREATE TABLE statement, but it is only applied to syntax that
creates an implied index.

i.e.  CREATE TABLE test (
        test_id    INTEGER   PRIMARY KEY WITH ( FILLFACTOR = 70 ),
        test_val   TEXT );

Primary key will create an implied index.  Fill factor is applied to that implied index.

Regards,
Richard Broersma Jr.

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Phoenix Kiula"
Дата:
On 19/09/2007, Richard Broersma Jr <rabroersma@yahoo.com> wrote:
> --- Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
>
> > 2. Is this fill factor enough to have on the table, or should I also
> > do a fill factor for specific indexes? Or both the table and the
> > index? (I have four btree indexes on the table)
>
> I don't think that fill factor can be applied to the table.  The CREATE TABLE reference doc show
> that fill factor can be used in the CREATE TABLE statement, but it is only applied to syntax that
> creates an implied index.
>
> i.e.  CREATE TABLE test (
>         test_id    INTEGER   PRIMARY KEY WITH ( FILLFACTOR = 70 ),
>         test_val   TEXT );
>
> Primary key will create an implied index.  Fill factor is applied to that implied index.
>
> Regards,
> Richard Broersma Jr.
>



Then I am confused again about how the fill factor stuff works. Let's
say I have a table with four BTREE indexes. Should all of them have a
fill factor of about 60 (lower than the default 90, that is) to be
effective? Or will it help if I lower the fill factor on only a couple
of the most often used ones? The primary key index is very, very
rarely updated so I don't need it to have a fill factor.

I could try and see these one by one, but that's no better than
touching/feeling the database blind-folded. I would love to see some
writeup about this whole thing, but it seems hard to come by!

Many thanks

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Richard Broersma Jr
Дата:
--- Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

> Then I am confused again about how the fill factor stuff works. Let's
> say I have a table with four BTREE indexes. Should all of them have a
> fill factor of about 60 (lower than the default 90, that is) to be
> effective? Or will it help if I lower the fill factor on only a couple
> of the most often used ones? The primary key index is very, very
> rarely updated so I don't need it to have a fill factor.
>
> I could try and see these one by one, but that's no better than
> touching/feeling the database blind-folded. I would love to see some
> writeup about this whole thing, but it seems hard to come by!

I will try to explain everything that I understand about indexes and tables.  I am sure that if
some part of my understanding is incorrect, someone will chime in to correct me.

In PostgreSQL, tables are physically distinct from indexes.  This means that any give table is
written to disk as a file(s), and indexes are also written to disk as a separate file(s).  A table
and index are both saved to disk in segmented block referred to a pages (I believe the default
size is 8K).

The advantage of the index file is that it is significantly smaller in size, so it takes less time
to sequentially scan and less time to read from disk.

Now when you want to find a record, PostgreSQL will/(may choose to) sequentially scan the index
until it find the record entry that corresponds with your criteria.  This entry has a table
cross-reference to the actual page that contains the record that is "pointed" at by the index.
Lastly, the entire table page containing your record is read from disk to memory for further query
processing.

<ASSUMPTION>
When you insert a record into a table that generates an entry into the b-tree index file,
PostgreSQL will scan the pages of the index file to find the correct place and index page to add
this entry.  If the page is already full, PostgreSQL "probably" replaces the old full pages with
two new pages with a distribution of that chunk of the B-tree index, and then adds the new entry
to one of those pages. This operation can become very expensive if many new pages need to be
created from single INSERT/UPDATE statement.
</ASSUMPTION>

By using fill factor, you are telling PostgreSQL to automatically leave a portion of any newly
created index page partially blank for future use.  When a newly created index entry needs to be
saved, it can be stored in one of the "holes" left in the index page.

A large fill factor create both advantages and dis-advantages.  For writing operations, it is a
big advantage because, a large fill factor will leave alot of holes for INSERT and UPDATE
operations to use.  This can help increase the number of UPDATE/INSERT per seconds that you server
can handle since, they index pages do not have to be rebuilt very often.

However, the disadvantage is that, a newly created index with a large fill factor has "index
bloat" designed into it.  This mean that the index pages have a large portion of holes.  So
PostgreSQL will create more index pages than it would normally in order to hold both your index
and the pre-defined holes that you specified by your fill-factor.  Larger indexes require more
time to be read from disk to memory and will require more time to sequentially scan to find to
find the cross-reference table page location of records of interest.  So the net effect is that
larger indexes will make SELECT statement slower.

This is my understanding for tables indexes and fill factor. I hope it helps.

Regards,
Richard Broersma Jr.

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Albe Laurenz"
Дата:
Phoenix Kiula wrote:
> Then I am confused again about how the fill factor stuff works. Let's
> say I have a table with four BTREE indexes. Should all of them have a
> fill factor of about 60 (lower than the default 90, that is) to be
> effective? Or will it help if I lower the fill factor on only a couple
> of the most often used ones? The primary key index is very, very
> rarely updated so I don't need it to have a fill factor.
>
> I could try and see these one by one, but that's no better than
> touching/feeling the database blind-folded. I would love to see some
> writeup about this whole thing, but it seems hard to come by!

Let me present my understanding; I hope I won't confuse issues further.

Indexes and tables are bath organized in pages, each page contains
several
entries or rows.

When an INSERT or UPDATE on the table occurs, a new row (version) is
created (and the old version of the row will be freed upon VACUUM).

Any index entry that points to this row will have to be changed
because the location of the row has changed.
That means that there will also have to be a new entry in the index,
even if the key has not changed (simply modifying the existing
index entry to point to the new row location won't do, because there
may be transactions that still need the old version of the row).

Reducing fillfactor on tables (default 100) will reduce the number
of table pages that need to be touched during an UPDATE.

Reducing fillfactor on a B-tree index (default 90) will reduce the
frequency of page splits that can happen upon INSERT or UPDATE.

Both at the cost of wasting some disk (and memory) space.

So I *guess* that when you decide that a table will be heavily updated
and you want to reduce disk I/O at the cost of wasting some space,
it will be a good idea to reduce fillfactor on the table and all its
indexes.

I emphasize the "guess" because
a) I may have made a mistake in my deductions :^) and
b) I cannot tell you good numbers to choose.

As in most performance tuning questions, the best thing you can probably
is to test and compare various settings and see which performs best
for you....

Yours,
Laurenz Albe

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Erik Jones
Дата:
On Sep 19, 2007, at 9:29 AM, Richard Broersma Jr wrote:

> --- Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
>
>> Then I am confused again about how the fill factor stuff works. Let's
>> say I have a table with four BTREE indexes. Should all of them have a
>> fill factor of about 60 (lower than the default 90, that is) to be
>> effective? Or will it help if I lower the fill factor on only a
>> couple
>> of the most often used ones? The primary key index is very, very
>> rarely updated so I don't need it to have a fill factor.
>>
>> I could try and see these one by one, but that's no better than
>> touching/feeling the database blind-folded. I would love to see some
>> writeup about this whole thing, but it seems hard to come by!
>
> I will try to explain everything that I understand about indexes
> and tables.  I am sure that if
> some part of my understanding is incorrect, someone will chime in
> to correct me.
>
> In PostgreSQL, tables are physically distinct from indexes.  This
> means that any give table is
> written to disk as a file(s), and indexes are also written to disk
> as a separate file(s).  A table
> and index are both saved to disk in segmented block referred to a
> pages (I believe the default
> size is 8K).

Yes, that can be changed at compile time, although I  don't think
I've ever heard of any advantages to doing that.

>
> The advantage of the index file is that it is significantly smaller
> in size, so it takes less time
> to sequentially scan and less time to read from disk.
>
> Now when you want to find a record, PostgreSQL will/(may choose to)
> sequentially scan the index
> until it find the record entry that corresponds with your
> criteria.  This entry has a table
> cross-reference to the actual page that contains the record that is
> "pointed" at by the index.
> Lastly, the entire table page containing your record is read from
> disk to memory for further query
> processing.
>
> <ASSUMPTION>
> When you insert a record into a table that generates an entry into
> the b-tree index file,
> PostgreSQL will scan the pages of the index file to find the
> correct place and index page to add
> this entry.  If the page is already full, PostgreSQL "probably"
> replaces the old full pages with
> two new pages with a distribution of that chunk of the B-tree
> index, and then adds the new entry
> to one of those pages. This operation can become very expensive if
> many new pages need to be
> created from single INSERT/UPDATE statement.
> </ASSUMPTION>

Yes, the point of B-trees is that they have fast lookup times, but
updates can be expensive when you have to re-balance your leaf nodes.

> By using fill factor, you are telling PostgreSQL to automatically
> leave a portion of any newly
> created index page partially blank for future use.  When a newly
> created index entry needs to be
> saved, it can be stored in one of the "holes" left in the index page.

That future use is only for updates.

>
> A large fill factor create both advantages and dis-advantages.  For
> writing operations, it is a
> big advantage because, a large fill factor will leave alot of holes
> for INSERT and UPDATE
> operations to use.  This can help increase the number of UPDATE/
> INSERT per seconds that you server
> can handle since, they index pages do not have to be rebuilt very
> often.

Not quite.  Once a page has reached it's fill factor percentage full,
no more inserts will happen on that page, only updates.  Also, I
think you have large/small backwards wrt fill factor.  If  you have a
fill factor of, say, 40% then once a page has reached 40% full no
more inserts will happen (unless some space is reclaimed by vacuum).
So, smaller fill factors == bigger holes.  The bigger the fill
factor, the smaller the whole:  if you have a fill factor of 90%,
only 10% is reserved for updates of rows on that page.

>
> However, the disadvantage is that, a newly created index with a
> large fill factor has "index
> bloat" designed into it.  This mean that the index pages have a
> large portion of holes.  So
> PostgreSQL will create more index pages than it would normally in
> order to hold both your index
> and the pre-defined holes that you specified by your fill-factor.
> Larger indexes require more
> time to be read from disk to memory and will require more time to
> sequentially scan to find to
> find the cross-reference table page location of records of
> interest.  So the net effect is that
> larger indexes will make SELECT statement slower.
>
> This is my understanding for tables indexes and fill factor. I hope
> it helps.

Again, with the large v. small fill factor point.  Using fill factor
seems to be a trade-off between space and update efficiency.  Let's
say that after so many (potential) updates you know that each row
will become static, i.e. no more updates will happen.  Let's use some
numbers and make them easy to work with.  Say each row will be
updated exactly once and you use a fill factor of 50%.  Now, say 5K
of fresh index data is written.  The first 4K will go into one page
at which point it has hit it's 50% fill factor threshold and the
other 1K has to go into another page.  Now, let's say each tuple in
the index is updated, the first page is now at 100% full as all of
the updates for the first page filled up the other 50%.  The page is
now full until you vacuum.  However, once you vacuum and the first 4K
that was originally inserted is freed, that 4K will never get used
again as the page is back down to 50% so no more inserts can happen
on that page, and we know that each tuple would only be updated that
once and you're left with 50% "bloat".  So, you can see that for
tables/indexes that aren't going to see a lot of updates to the same
tuples something as low as 50% is probably a pretty aggressive fill
factor.  In fact, I would say that you shouldn't really even bother
playing around with the fill factor unless you know that the table/
index sees a lot of updates.  For my example where each tuple sees
only one update, the index default fill factor of 90% is probably fine.

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com



Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Richard Broersma Jr
Дата:
> Not quite.  Once a page has reached it's fill factor percentage full,
> no more inserts will happen on that page, only updates.  Also, I
> think you have large/small backwards wrt fill factor.  If  you have a
> fill factor of, say, 40% then once a page has reached 40% full no
> more inserts will happen (unless some space is reclaimed by vacuum).
> So, smaller fill factors == bigger holes.  The bigger the fill
> factor, the smaller the whole:  if you have a fill factor of 90%,
> only 10% is reserved for updates of rows on that page.

So (just to reiterate), fill factor can be applied to both a table and/or an index(es).  But the
"holes" built into the page of a table or index can only be filled by UPDATE Statements.

Thanks for the clarification!

Regards,

Richard Broesma Jr.


Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Erik Jones
Дата:
On Sep 19, 2007, at 10:30 AM, Richard Broersma Jr wrote:

>> Not quite.  Once a page has reached it's fill factor percentage full,
>> no more inserts will happen on that page, only updates.  Also, I
>> think you have large/small backwards wrt fill factor.  If  you have a
>> fill factor of, say, 40% then once a page has reached 40% full no
>> more inserts will happen (unless some space is reclaimed by vacuum).
>> So, smaller fill factors == bigger holes.  The bigger the fill
>> factor, the smaller the whole:  if you have a fill factor of 90%,
>> only 10% is reserved for updates of rows on that page.
>
> So (just to reiterate), fill factor can be applied to both a table
> and/or an index(es).  But the
> "holes" built into the page of a table or index can only be filled
> by UPDATE Statements.
>
> Thanks for the clarification!

Yep.  Although, to be strictly honest, I guess the term UPDATE isn't
the best term to use for indexes.  My description works best for
tables, see the section on FILLFACTOR in http://www.postgresql.org/
docs/8.2/interactive/sql-createindex.html for a better description of
what happens for indexes -- slightly different semantics, but the
same general effect.

Also, note that once we have HOT, figuring out fill factor for
indexes will be a whole different ball game.  Currently, an update to
any tuple in a table, results in a new index entry.  With hot, index
entries will only happen if the indexed column is changed in the update.

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com



Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Richard Broersma Jr
Дата:
--- Erik Jones <erik@myemma.com> wrote:

> Also, note that once we have HOT...

I am not sure what the acronym "HOT" stands for.  Does it have something to do with MVCC?

Regards,
Richard Broersma Jr.



Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Erik Jones
Дата:
On Sep 19, 2007, at 11:00 AM, Richard Broersma Jr wrote:

> --- Erik Jones <erik@myemma.com> wrote:
>
>> Also, note that once we have HOT...
>
> I am not sure what the acronym "HOT" stands for.  Does it have
> something to do with MVCC?
>

Heap Only Tuple.  Here's a link to the (latest?) readme for it:
http://archives.postgresql.org/pgsql-patches/2007-09/msg00261.php

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com



Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Vivek Khera
Дата:
On Sep 18, 2007, at 1:14 AM, Joshua D. Drake wrote:

>> Q: To get rid of index bloat, is a VACUUM ANALYZE enough? Or must I
>> reindex/cluster indexes?
>
> If you overrun your max_fsm_pages, no:
>   else yes;

my algorithm is: if (true) then yes;

my FSM is way bigger than I ever use (vacuum never reports shortage)
and I still get bloat that needs to be purged out with a reindex on
occasion.


Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Phoenix Kiula"
Дата:
On 24/09/2007, Vivek Khera <vivek@khera.org> wrote:
>
> my FSM is way bigger than I ever use (vacuum never reports shortage)
> and I still get bloat that needs to be purged out with a reindex on
> occasion.



Vivek,

I feel your pain. But I seem to have (mostly) solved my problem in three ways:

1. Increase the shared_buffer and effective_cache_size settings in
postgresql.conf. There are some websites that suggest that increasing
shared_buffer beyond 40,000 may in fact have counter-intuitive
results, but not in my case. I'm at 60,000 and it seems to work well.
Effective_cache_size is 512000.

2. Reduce the fill factor on your table. This is the single most
performance boost. On a table that is frequently updated on a TEXT
column, I reduced it to 60 and have never looked back. For others, I'm
experimenting with 80 and it seems to be working well.

3. Make your autovacuum settings as aggressive as can be. Basically I
found that doing a cronjob of vacuuming every five hours worked really
well, which suggested that autovacuum was not really kicking in as
often it was needed. So I reduced the threshold (100 for vacuum, 80
for analyze...i.e., the number of tuples that get updated before
either process kicks in) and reduced quite aggressively the scores.
Here are my settings:

autovacuum                   = on
autovacuum_vacuum_cost_delay = 10
vacuum_cost_delay            = 10
autovacuum_naptime           = 10
autovacuum_vacuum_threshold  = 75
autovacuum_analyze_threshold = 25
autovacuum_analyze_scale_factor  = 0.02
autovacuum_vacuum_scale_factor   = 0.01


Hope this helps some.

PK.

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Vivek Khera
Дата:
On Sep 24, 2007, at 12:00 PM, Phoenix Kiula wrote:

> I feel your pain. But I seem to have (mostly) solved my problem in
> three ways:

My particular usage pattern (add data continuously, purge *some* of
the data once per week or every other week.  The purge is what seems
to kill it.  Last time I reindexed, I reclaimed over 20Gb of disk
space.  That was after 6 months from the prior reindex.

Recommending I run vacuum intermixed with the data purge is a non-
starter; the vacuum on these tables takes a couple of hours.  I'd
never finish purging my data with that kind of delay.

Recommending splitting my tables with inheritance is not going to
work since my purges don't happen across any lines which would make
sense for splitting with inheritance.

I will investigate the fill-factor.  That seems like it may make some
sense the way I do inserts and updates...


Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
Gregory Stark
Дата:
"Vivek Khera" <vivek@khera.org> writes:

> On Sep 24, 2007, at 12:00 PM, Phoenix Kiula wrote:
>
>> I feel your pain. But I seem to have (mostly) solved my problem in  three
>> ways:
>
> My particular usage pattern (add data continuously, purge *some* of  the data
> once per week or every other week.  The purge is what seems  to kill it.  Last
> time I reindexed, I reclaimed over 20Gb of disk  space.  That was after 6
> months from the prior reindex.

Do you have a pattern of loading a ton of data covering a range of indexed key
values and then deleting all but a few values spread evenly throughout that
range? And then never inserting new key values in those ranges again? For
example loading records indexed by timestamp and then deleting all but the
first record of the day.

That kind of pattern does need a regular reindex because the index pages will
have those few values left on them preventing them from being reused.

If that's not your usage pattern then perhaps you should describe your usage
pattern in more detail. But I suspect you would be best served by simply
vacuuming much more often.


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

От
"Phoenix Kiula"
Дата:
On 25/09/2007, Vivek Khera <vivek@khera.org> wrote:

> Recommending I run vacuum intermixed with the data purge is a non-
> starter; the vacuum on these tables takes a couple of hours.  I'd
> never finish purging my data with that kind of delay.

...
> I will investigate the fill-factor.  That seems like it may make some
> sense the way I do inserts and updates...


Undoubtedly.

But if most of your indexed keys are gone, then a reindex is useful.
If this is a hugely live system and you don't have a great number of
indexes, then a somewhat kludgish way to try could be to create a copy
of the table, do what you wish with it (delete rows, index them, then
cluster them on that index)...and whenever the process finishes (3
hours, or 3 days...no matter, because it doesn't hurt your live
system), you simply rename the old table to TABLE_OLD and the new
table to TABLE. The renaming operation is instant.

Anyway, what is your maintenance_work_mem? Try increasing your
maintenance_work_mem and see if that helps vacuuming first. Vacuum
operations can be sped up dramatically. We need regular vacuums and
that is critical to our application, so I have a m_w_m of 512K.

Mind you -- even if your DB vacuums for a couple hours, vacuum doesn't
affect the performance of your live system while it is happening, so
frequent vacuuming cannot hurt you one way or another, and it can
surely help.