Thread: Dumping a database that is not accepting commands?

Dumping a database that is not accepting commands?

From: Natalie Wenz
Hi all,

Background:

I have a large database from our test environment that got into trouble with some high volume and some long-running
queries about…six weeks ago? We have a buffer mechanism that has been storing the new data since the database stopped
accepting connections, so we haven't really lost any data, which is good. But the single-user backend vacuum freeze is
still grinding away, using 100% cpu most of the time, except when it's doing a lot of writes. We did cancel the vacuum
once so we could stop the database and take a snapshot of the filesystem (we are using ZFS on FreeBSD) and copy the
snapshot to another machine. This allowed us to get a fresh database started (with 9.3!) where we could unspool the last
six weeks of data into a fresh database, and be able to access at least that much of our data.

Now:
I have a copy of the database (with data from all time up until the database shut itself down six weeks ago) that I
just need the data from. I am becoming impatient with the vacuum, as it appears to have not even started working on the
files for one of the largest relations in the database (that table was about 14TB last I saw). I'm trying to find
alternatives to waiting another who knows how many weeks for the vacuum to finish just to have the database in a state
where I can dump the data out, since this is no longer the "live" version. This copy is running on hardware with plenty
of space to work with. The database has about a million transactions before it wraps.
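
For reference, the remaining headroom can be estimated with something like the following (the numbers are approximate;
the server refuses new commands once fewer than about a million transactions remain before wraparound), run from the
single-user backend or any session that can still start a transaction:

-- Rough estimate of how many transactions are left before wraparound shutdown
SELECT datname,
       age(datfrozenxid) AS xid_age,
       2147483648 - 1000000 - age(datfrozenxid) AS approx_xids_remaining
  FROM pg_database
 ORDER BY xid_age DESC;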

Is it possible to somehow just dump the data using pg_dump or pg_dumpall? I haven't tried to see if those utilities
will work when the database is protecting itself from data loss. If it were possible, would it be wise (do those
utilities use tons of transactions to dump, or would it be safely within the million or so that I have)? I suppose I
could use copy? Would there be any risks if I were to do that? Maybe none of this is risky at this point because we can
always clone the original snapshot, and try again.
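
If I go the manual route, I'm picturing something roughly like this for each table (the table and path names here are
made up):

-- One COPY per table, writing server-side to a scratch volume
COPY my_big_table TO '/some/scratch/volume/my_big_table.copy';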

Any ideas or suggestions?

Thanks!
Natalie

Re: Dumping a database that is not accepting commands?

From: Kevin Grittner
Natalie Wenz <nataliewenz@ebureau.com> wrote:

> I have a large database from our test environment that got into trouble with
> some high volume and some long-running queries about…six weeks ago? We have a
> buffer mechanism that has been storing the new data since the database stopped
> accepting connections, so we haven't really lost any data, which is good.
> But the single-user backend vacuum freeze is still grinding away, using 100% cpu
> most of the time, except when it's doing a lot of writes. We did cancel the
> vacuum once so we could stop the database and take a snapshot of the filesystem
> (we are using ZFS on FreeBSD) and copy the snapshot to another machine. This
> allowed us to get a fresh database started (with 9.3!) where we could unspool
> the last six weeks of data into a fresh database, and be able to access at least
> that much of our data.

I'm not sure how you could have done that without dealing with the
wraparound before the upgrade.

> Now:
> I have a copy of the database (with data from all time up until the database
> shut itself down six weeks ago) that I just need the data from. I am becoming
> impatient with the vacuum, as it appears to have not even started working on the
> files for one of the largest relations in the database (that table was about
> 14TB last I saw). I'm trying to find alternatives to waiting another who
> knows how many weeks for the vacuum to finish just to have the database in a
> state where I can dump the data out, since this is no longer the
> "live" version. This copy running on hardware with plenty of space to
> work with. The database has about a million transactions before it wraps.

The copy running on 9.3, or the original?

> Is it possible to somehow just dump the data using pg_dump or pg_dumpall? I
> haven't tried to see if those utilities will work when the database is
> protecting itself from data loss. If it were possible, would it be wise (do
> those utilities use tons of transactions to dump, or would it be safely within
> the million or so that I have)? I suppose I could use copy?

pg_dump uses COPY (by default, anyway), and does all its work in a
single transaction.  But it needs to start that transaction.

> Would there be any risks if I were to do that? Maybe none of this
> is risky at this point because we can always clone the original
> snapshot, and try again.

I'm not sure that pg_dump followed by restore would be expected to
be faster than finishing the VACUUM, unless that is configured to
pace itself way too slowly.
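
If you do keep going with VACUUM, make sure it isn't throttling itself. Something along these lines before kicking it
off (session-level settings; the table name is just an example):

-- Keep a manual VACUUM from sleeping between cost-limited chunks,
-- and give it plenty of memory for the dead-tuple list
SET vacuum_cost_delay = 0;
SET maintenance_work_mem = '1GB';
VACUUM (FREEZE, VERBOSE) my_big_table;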

> Any ideas or suggestions?

After getting past this crisis, I would take a close look at your
vacuuming regimen -- it sounds like it is not aggressive enough to
keep you out of trouble.

I'm sorry that I don't have a better suggestion for resolving the
crisis than running VACUUM at maximum speed.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Dumping a database that is not accepting commands?

From: Natalie Wenz
On Sep 17, 2013, at 7:43 AM, Kevin Grittner <kgrittn@ymail.com> wrote:

> Natalie Wenz <nataliewenz@ebureau.com> wrote:

>> I have a large database from our test environment that got into trouble with
>> some high volume and some long-running queries about…six weeks ago? We have a
>> buffer mechanism that has been storing the new data since the database stopped
>> accepting connections, so we haven't really lost any data, which is good.
>> But the single-user backend vacuum freeze is still grinding away, using 100% cpu
>> most of the time, except when it's doing a lot of writes. We did cancel the
>> vacuum once so we could stop the database and take a snapshot of the filesystem
>> (we are using ZFS on FreeBSD) and copy the snapshot to another machine. This
>> allowed us to get a fresh database started (with 9.3!) where we could unspool
>> the last six weeks of data into a fresh database, and be able to access at least
>> that much of our data.

> I'm not sure how you could have done that without dealing with the
> wraparound before the upgrade.

Sorry; my description of what is going on was a little unclear. We didn't upgrade the existing database. We moved it to different hardware, and just created a brand new database to accept the data that had been backing up in sqlite files while our original database was offline. I'm still dealing with the wraparound on the original, just on a different machine.



>> Now:
>> I have a copy of the database (with data from all time up until the database
>> shut itself down six weeks ago) that I just need the data from. I am becoming
>> impatient with the vacuum, as it appears to have not even started working on the
>> files for one of the largest relations in the database (that table was about
>> 14TB last I saw). I'm trying to find alternatives to waiting another who
>> knows how many weeks for the vacuum to finish just to have the database in a
>> state where I can dump the data out, since this is no longer the
>> "live" version. This copy running on hardware with plenty of space to
>> work with. The database has about a million transactions before it wraps.

> The copy running on 9.3, or the original?

The brand new database on 9.3 is running great. The original, running on 9.1, is the one I'd like to extract data from (it is currently vacuuming). After I get the data out of the original, whether I wait for the vacuum to finish or not, I'll have to manually import all of the old data into the new database, but it was decided that that was a reasonable tradeoff if it meant we could get our current data out of the buffer files and get our regular tools and reports back online today. We want all of the older data eventually, but it's not critical for our day-to-day operation.


>> Is it possible to somehow just dump the data using pg_dump or pg_dumpall? I
>> haven't tried to see if those utilities will work when the database is
>> protecting itself from data loss. If it were possible, would it be wise (do
>> those utilities use tons of transactions to dump, or would it be safely within
>> the million or so that I have)? I suppose I could use copy?

> pg_dump uses COPY (by default, anyway), and does all its work in a
> single transaction.  But it needs to start that transaction.

I'll give that a try. It sounds like I technically have enough transactions left to copy the data out, so that is good. Hopefully the database will let me use that utility. Otherwise I'll just do a copy from the backend.


>> Would there be any risks if I were to do that? Maybe none of this
>> is risky at this point because we can always clone the original
>> snapshot, and try again.

> I'm not sure that pg_dump followed by restore would be expected to
> be faster than finishing the VACUUM, unless that is configured to
> pace itself way too slowly.

That is actually a great point. I guess I'm anxious to *do* something, but maybe it will be so slow I might not gain much. I have been wondering about our vacuum configuration, actually. I would absolutely welcome feedback in this area, both in general (how a database like this should be tuned) and on whether there is any tuning to change when an emergency backend vacuum has to be run, in the interest of the vacuum running as aggressively as possible.


 version                         | PostgreSQL 9.1.9 on x86_64-unknown-freebsd9.1, compiled by gcc (GCC) 4.2.1 20070831 patched [FreeBSD], 64-bit
 autovacuum                      | on
 autovacuum_analyze_scale_factor | 0.1
 autovacuum_freeze_max_age       | 800000000
 autovacuum_max_workers          | 3
 autovacuum_vacuum_cost_delay    | 0
 autovacuum_vacuum_scale_factor  | 0.1
 checkpoint_segments             | 128
 effective_cache_size            | 12GB
 listen_addresses                | *
 log_autovacuum_min_duration     | 10s
 log_destination                 | stderr
 log_filename                    | logfile-%A.log
 log_line_prefix                 | %t:%u:%r:[%p]: 
 log_rotation_age                | 1d
 log_rotation_size               | 1GB
 log_truncate_on_rotation        | on
 logging_collector               | on
 maintenance_work_mem            | 10GB
 max_connections                 | 500
 max_stack_depth                 | 2MB
 random_page_cost                | 1
 seq_page_cost                   | 1
 shared_buffers                  | 128MB
 synchronous_commit              | off
 temp_buffers                    | 128MB
 TimeZone                        | US/Central
 vacuum_cost_limit               | 500
 wal_buffers                     | 32MB
 work_mem                        | 256MB

This is the tuning of the original database (only settings changed from the defaults are shown). The machine it was running on had 48GB of memory. The database was 36TB, with two tables taking up the bulk of that (about 14TB each), and about 10 other tables and a few large indexes making up the rest. Our typical usage pattern is mostly inserts, with some hourly summaries (which take maybe 5 minutes), some daily summaries (which take about 20-40 minutes), and a couple of end-of-month queries that take several hours. We have the same setup and tuning in production, which is about the same size, with an additional end-of-month query that runs off one of the 14TB tables, which can take 4-7 days.
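
(A listing like the one above can be produced with a query along these lines:

-- Show only settings that differ from their built-in defaults
SELECT name, current_setting(name) AS current_setting
  FROM pg_settings
 WHERE source NOT IN ('default', 'override')
 ORDER BY name;
)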

>> Any ideas or suggestions?

> After getting past this crisis, I would take a close look at your
> vacuuming regimen -- it sounds like it is not aggressive enough to
> keep you out of trouble.

Normally the autovacuum seems to keep up sufficiently. We got into trouble with a bad combination of not-typical long-running queries, a disk failure and subsequent zpool repair, and the only person who checks the log files regularly (me) was out of the office. This has been so painful and slow to recover from, I don't think we'll ever get into this mess again. (At least not quite like this. I seem to have a knack for finding *new* ways to break things.) I will also take this opportunity to mention again that if anyone is considering making a patch for 64-bit xids, you would make at least one small group of people very, very happy.  :)

Side question: some of the longer queries that were running when the database yakked were deletes of old data that had been manually migrated to a new table with a more appropriate format. We were running out of disk space, so we were trying to clear up some space by removing data we now had in two places. Does a delete of previously-frozen rows unfreeze them, or anything like that? Because in a series of maybe a dozen queries or so, we deleted billions of rows. Does that generate a significant amount of extra work for the autovacuumer? 

> I'm sorry that I don't have a better suggestion for resolving the
> crisis than running VACUUM at maximum speed.

Thanks for your feedback! It's kind of a dismal situation, but I'm anxious to learn whatever I can from it. 


> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: Dumping a database that is not accepting commands?

From: bricklen
On Tue, Sep 17, 2013 at 9:48 AM, Natalie Wenz <nataliewenz@ebureau.com> wrote:
>  maintenance_work_mem            | 10GB
>  shared_buffers                  | 128MB

maintenance_work_mem seems pretty high, and shared_buffers seems really low.  Out of curiosity, were those set as a product of internal testing which determined those were effective settings?

Re: Dumping a database that is not accepting commands?

From: Natalie Wenz
No…  the shared_buffers value is just a legacy value that never got changed (the shmmax value in sysctl is still 1073741824).  When I set up the new database, I set the shared_buffers to 25% of system memory, so 12GB. (And since the new database is on 9.3, I didn't have to adjust the sysctl value for shmmax! Happy day!)

We used to have maintenance_work_mem set to something smaller, but we bumped that up after…<coughs>… the last time this database shut itself down to avoid wraparound in March 2012. We were hoping that would help speed the recovery at that time. Not sure if it did, but we left it that way afterward anyway.


On Sep 17, 2013, at 2:02 PM, bricklen <bricklen@gmail.com> wrote:

> On Tue, Sep 17, 2013 at 9:48 AM, Natalie Wenz <nataliewenz@ebureau.com> wrote:
>>  maintenance_work_mem            | 10GB
>>  shared_buffers                  | 128MB

> maintenance_work_mem seems pretty high, and shared_buffers seems really low.  Out of curiosity, were those set as a product of internal testing which determined those were effective settings?


Re: Dumping a database that is not accepting commands?

From: Natalie Wenz
It occurs to me that in asking for feedback on the tuning, I am asking about two separate things:

Was there anything in the tuning below that contributed to the database getting into trouble? And is there anything I should change in that tuning to make the single-user vacuum as fast as it can be for optimal recovery time?

 version                         | PostgreSQL 9.1.9 on x86_64-unknown-freebsd9.1, compiled by gcc (GCC) 4.2.1 20070831 patched [FreeBSD], 64-bit
 autovacuum                      | on
 autovacuum_analyze_scale_factor | 0.1
 autovacuum_freeze_max_age       | 800000000
 autovacuum_max_workers          | 3
 autovacuum_vacuum_cost_delay    | 0
 autovacuum_vacuum_scale_factor  | 0.1
 checkpoint_segments             | 128
 effective_cache_size            | 12GB
 listen_addresses                | *
 log_autovacuum_min_duration     | 10s
 log_destination                 | stderr
 log_filename                    | logfile-%A.log
 log_line_prefix                 | %t:%u:%r:[%p]: 
 log_rotation_age                | 1d
 log_rotation_size               | 1GB
 log_truncate_on_rotation        | on
 logging_collector               | on
 maintenance_work_mem            | 10GB
 max_connections                 | 500
 max_stack_depth                 | 2MB
 random_page_cost                | 1
 seq_page_cost                   | 1
 shared_buffers                  | 128MB
 synchronous_commit              | off
 temp_buffers                    | 128MB
 TimeZone                        | US/Central
 vacuum_cost_limit               | 500
 wal_buffers                     | 32MB
 work_mem                        | 256MB

This is the tuning of the original database (only settings changed from the defaults are shown). The machine it was running on had 48GB of memory. The database was 36TB, with two tables taking up the bulk of that (about 14TB each), and about 10 other tables and a few large indexes making up the rest. Our typical usage pattern is mostly inserts, with some hourly summaries (which take maybe 5 minutes), some daily summaries (which take about 20-40 minutes), and a couple of end-of-month queries that take several hours. We have the same setup and tuning in production, which is about the same size, with an additional end-of-month query that runs off one of the 14TB tables, which can take 4-7 days.


As far as ideal tuning for the new database, running on 9.3, which will eventually hold all the data from the sad, recovering original database with the usage patterns described above, how is this for a starting point? I tried to follow the basic guidelines in the High Performance book, but sometimes I feel like I'm largely guessing.

              name               |                                                current_setting                                                
---------------------------------+---------------------------------------------------------------------------------------------------------------
 version                         | PostgreSQL 9.3.0 on x86_64-unknown-freebsd9.1, compiled by gcc (GCC) 4.2.1 20070831 patched [FreeBSD], 64-bit
 autovacuum                      | on
 autovacuum_analyze_scale_factor | 0.1
 autovacuum_freeze_max_age       | 800000000
 autovacuum_max_workers          | 3
 autovacuum_vacuum_cost_delay    | 0
 autovacuum_vacuum_scale_factor  | 0.1
 checkpoint_segments             | 128
 effective_cache_size            | 12GB
 lc_collate                      | C
 lc_ctype                        | C
 listen_addresses                | *
 log_autovacuum_min_duration     | 1min
 log_destination                 | stderr
 log_filename                    | logfile-%A.log
 log_line_prefix                 | %t:%u:%r:[%p]: 
 log_min_duration_statement      | 1min
 log_rotation_age                | 1d
 log_rotation_size               | 1GB
 log_truncate_on_rotation        | on
 logging_collector               | on
 maintenance_work_mem            | 4GB
 max_connections                 | 500
 max_stack_depth                 | 2MB
 random_page_cost                | 1
 seq_page_cost                   | 1
 shared_buffers                  | 12GB
 synchronous_commit              | off
 temp_buffers                    | 128MB
 TimeZone                        | US/Central
 vacuum_cost_limit               | 500
 wal_buffers                     | 16MB
 work_mem                        | 256MB




Re: Dumping a database that is not accepting commands?

From: Kevin Grittner
Natalie Wenz <nataliewenz@ebureau.com> wrote:

> Sorry; my description of what is going on was a little unclear.
> We didn't upgrade the existing database. We moved it to different
> hardware, and just created a brand new database to accept the
> data that had been backing up in sqlite files while our original
> database was offline. I'm still dealing with the wraparound on
> the original, just on a different machine.

OK, to restate to be sure I understand, the original database is
being vacuumed in a new location, and a new 9.3 database in the
original location has absorbed the queued data?  Once you complete
the vacuum, you will copy the old data back to the new database at
the old location?

> autovacuum_freeze_max_age       | 800000000

> Normally the autovacuum seem to keep up sufficiently. We got into
> trouble with a bad combination of not-typical long-running
> queries, a disk failure and subsequent zpool repair, and the only
> person who checks the log files regularly (me) was out of the
> office. This has been so painful and slow to recover from, I
> don't think we'll ever get into this mess again. (At least not
> quite like this. I seem to have a knack for finding *new* ways to
> break things.)

A perfect storm of events, eh?  It's hard to have things always go
smoothly in the face of such events, but I see a couple things you
might want to consider.  Increasing autovacuum_freeze_max_age
reduces the amount of time you have to get back on track.  You
might want to take that back down to the default.  There was a bug
causing wraparound prevention autovacuums to trigger too
frequently, which is now fixed in the latest minor releases, so
making that change might not be as painful as you expect.  Make
sure you are monitoring for long-running transactions, so you don't
get burned by one that is accidental.  Also, if you have a
processing cycle where there are off-peak hours on a daily or
weekly basis, you might want to run a VACUUM ANALYZE command
during those windows, to get some of the freezing done before it
is critical.
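
For the long-running transaction monitoring, even something simple on a schedule helps; for example (the one-hour
threshold is arbitrary, and on 9.1 the columns are procpid and current_query rather than pid, state, and query):

-- Flag transactions that have been open for more than an hour (9.2+ column names)
SELECT pid, usename, xact_start, now() - xact_start AS xact_age, state, query
  FROM pg_stat_activity
 WHERE xact_start IS NOT NULL
   AND xact_start < now() - interval '1 hour'
 ORDER BY xact_start;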

> I will also take this opportunity to mention again that if anyone
> is considering a making a patch for 64-bit xids, you would make
> at least one small group of people very, very happy.  :)

While 64-bit xids isn't likely, there is some work taking a more
creative approach to the issue which might make you even happier in
a year or so.  :-)

> Side question: some of the longer queries that were running when
> the database yakked were deletes of old data that had been
> manually migrated to a new table with a more appropriate format.
> We were running out of disk space, so we were trying to clear up
> some space by removing data we now had in two places. Does a
> delete of previously-frozen rows unfreeze them, or anything like
> that? Because in a series of maybe a dozen queries or so, we
> deleted billions of rows. Does that generate a significant amount
> of extra work for the autovacuumer?

The pages which had tuples deleted would need to be cleaned up by
vacuum, and rewritten.  It would also remove all index entries for
all deleted rows.  It might also scan backward from the end of the
table to release space to the OS.  That could conceivably be enough
avoidable work to make your idea of copying out the remaining data
feasible.  What percentage of the rows were deleted?  Could your
copy out be to a separate set of drives?

If you know at the time a row is added what group it will be in for
deletion, it might pay to move to partitioning, so that a group of
rows could be deleted pretty much as fast as you can drop a table.
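
On the versions you are running that means inheritance-based partitioning; a bare-bones sketch, with made-up table and
column names:

-- Parent table plus one child per month; inserts are routed to the
-- right child by a trigger or by the application
CREATE TABLE events (
    event_time  timestamptz NOT NULL,
    payload     text
);

CREATE TABLE events_2013_09 (
    CHECK (event_time >= DATE '2013-09-01' AND event_time < DATE '2013-10-01')
) INHERITS (events);

-- Retiring a month later on is a quick catalog operation rather than a
-- billion-row DELETE that vacuum has to clean up afterward
DROP TABLE events_2013_09;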

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Dumping a database that is not accepting commands?

From: Natalie Wenz

On Sep 17, 2013, at 3:46 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

> Natalie Wenz <nataliewenz@ebureau.com> wrote:

>> Sorry; my description of what is going on was a little unclear.
>> We didn't upgrade the existing database. We moved it to different
>> hardware, and just created a brand new database to accept the
>> data that had been backing up in sqlite files while our original
>> database was offline. I'm still dealing with the wraparound on
>> the original, just on a different machine.

> OK, to restate to be sure I understand, the original database is
> being vacuumed in a new location, and a new 9.3 database in the
> original location has absorbed the queued data?  Once you complete
> the vacuum, you will copy the old data back to the new database at
> the old location?

Yes, exactly.


>> autovacuum_freeze_max_age       | 800000000

>> Normally the autovacuum seems to keep up sufficiently. We got into
>> trouble with a bad combination of not-typical long-running
>> queries, a disk failure and subsequent zpool repair, and the only
>> person who checks the log files regularly (me) was out of the
>> office. This has been so painful and slow to recover from, I
>> don't think we'll ever get into this mess again. (At least not
>> quite like this. I seem to have a knack for finding *new* ways to
>> break things.)

> A perfect storm of events, eh?  It's hard to have things always go
> smoothly in the face of such events, but I see a couple things you
> might want to consider.  Increasing autovacuum_freeze_max_age
> reduces the amount of time you have to get back on track.  You
> might want to take that back down to the default.  There was a bug
> causing wraparound prevention autovacuums to trigger too
> frequently, which is now fixed in the latest minor releases, so
> making that change might not be as painful as you expect.  Make
> sure you are monitoring for long-running transactions, so you don't
> get burned by one that is accidental.  Also, if you have a
> processing cycle where there are off-peak hours on a daily or
> weekly basis, you might want to run a VACUUM ANALYZE command
> during those windows, to get some of the freezing done before it
> is critical.

We talked a little bit about lowering the autovacuum_freeze_max_age, at least some, but there was concern that it would end up doing a lot more lengthy full-table scans. Is that a legitimate concern? Would it be prudent to change any of the other values back to their defaults at the same time? For example, we have the autovacuum_vacuum_cost_delay set to 0, with the idea that we don't mind if we take a hit on performance while the autovacuum is running; our priority is that it be able to finish as quickly as possible. If we start the vacuum earlier, though, maybe that should be bumped up too?

Does the autovacuum do different work when it is vacuuming to prevent wraparound (and that's triggered when a table passes autovacuum_freeze_max_age, right?) than when a vacuum is triggered by the table changing in size by a certain amount, or when a vacuum is invoked manually?

(Are there any books, or articles, that cover "Vacuuming and Autovacuuming: the gory details"?)

Rereading the Routine Vacuuming page in the docs, this sentence caught my eye:

"However, for static tables (including tables that receive inserts, but no updates or deletes), there is no need to vacuum for space reclamation, so it can be useful to try to maximize the interval between forced autovacuums on very large static tables. Obviously one can do this either by increasing autovacuum_freeze_max_age or decreasing vacuum_freeze_min_age."

We generally never delete from this database at all. This case was unusual; I was migrating the data from one table to another because we added some columns, and changed the datatype of many of the columns from text to more appropriate types (timestamp, int, uuid, inet, etc). Ideally, even then we wouldn't have preferred to delete anything until the whole table was migrated, but disk space became an issue. Bleh. 

With that in mind, would you still recommend putting autovacuum_freeze_max_age back to 200 million?

>> I will also take this opportunity to mention again that if anyone
>> is considering making a patch for 64-bit xids, you would make
>> at least one small group of people very, very happy.  :)

> While 64-bit xids isn't likely, there is some work taking a more
> creative approach to the issue which might make you even happier in
> a year or so.  :-)


Oh, the suspense! I will be on the lookout for that!


>> Side question: some of the longer queries that were running when
>> the database yakked were deletes of old data that had been
>> manually migrated to a new table with a more appropriate format.
>> We were running out of disk space, so we were trying to clear up
>> some space by removing data we now had in two places. Does a
>> delete of previously-frozen rows unfreeze them, or anything like
>> that? Because in a series of maybe a dozen queries or so, we
>> deleted billions of rows. Does that generate a significant amount
>> of extra work for the autovacuumer?

> The pages which had tuples deleted would need to be cleaned up by
> vacuum, and rewritten.  It would also remove all index entries for
> all deleted rows.  It might also scan backward from the end of the
> table to release space to the OS.  That could conceivably be enough
> avoidable work to make your idea of copying out the remaining data
> feasible.  What percentage of the rows were deleted?  Could your
> copy out be to a separate set of drives?

I just did a little estimating, and it looks like around 48% of the data from one of the 14TB tables was deleted, about 45 billion rows. Sounds like I gave the vacuum a major amount of work to do. Yikes. I know the server I'm working on has over 300TB free, but I'm not certain how it's set up; I'll check whether it's possible to copy out to a separate set of drives.
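
(An estimate like that can also be sanity-checked against the table statistics, something like the following — only
approximate, and it needs a cluster that is still accepting queries and collecting stats:

SELECT relname, n_live_tup, n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS pct_dead
  FROM pg_stat_user_tables
 ORDER BY n_dead_tup DESC;
)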

> If you know at the time a row is added what group it will be in for
> deletion, it might pay to move to partitioning, so that a group of
> rows could be deleted pretty much as fast as you can drop a table.

Partitioning has come up occasionally in the past, as this database can be so cumbersome to work with. I'll look into that some more, and try to pitch it to the Official Deciders.


> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: Dumping a database that is not accepting commands?

From: Kevin Grittner
Natalie Wenz <nataliewenz@ebureau.com> wrote:

>>> autovacuum_freeze_max_age       | 800000000

> We talked a little bit about lowering the
> autovacuum_freeze_max_age, at least some, but there was concern
> that it would end up doing a lot more lengthy full-table scans.
> Is that a legitimate concern?

It will cause full-table scans to occur more often, but possibly
not as much as you fear if your baseline is based on how it behaved
before recent bug fixes.  It will tend to flush "hot" data from the
caches, at least to some degree.  The up side of doing it more
often is that it will have fewer writes to do each time it is run,
which might avoid write gluts that hurt performance more than the
reads.

> Would it be prudent to change any of the other values back to
> their defaults at the same time? For example, we have the
> autovacuum_vacuum_cost_delay set to 0, with the idea that we
> don't mind if we take a hit on performance while the autovacuum
> is running; our priority is that it be able to finish as quickly
> as possible. If we start the vacuum earlier, though, maybe that
> should be bumped up too?

Maybe, but you do have a lot of machine there.  You might not
notice the hit very much.


> Does the autovacuum do different work when it is vacuuming to
> prevent wraparound (and that's triggered when a table passes the
> autovacuum_freeze_max_age, right?) and a vacuum triggered by the
> table changing in size by a certain amount, or a manually-invoked
> vacuum?

A "normal" vacuum just visits pages which need work based on the
visibility map, so those really are almost a fixed amount of work
per week regardless of the frequency of runs.  It's just a matter
of whether you wait until a lot of work needs to be done and do it
all at once, or do smaller runs that nibble away at it.  The latter
usually has less noticeable impact.
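
The anti-wraparound vacuums, on the other hand, are driven by how old each table's relfrozenxid is; you can watch that
directly with something like:

-- Tables closest to forcing an anti-wraparound (freeze) vacuum;
-- compare the ages against autovacuum_freeze_max_age
SELECT c.relname, age(c.relfrozenxid) AS frozenxid_age
  FROM pg_class c
 WHERE c.relkind = 'r'
 ORDER BY frozenxid_age DESC
 LIMIT 10;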

> (Are there any books, or articles, that cover "Vacuuming and
> Autovacuuming: the gory details"?)

The best book I know of for this is Greg Smith's "PostgreSQL 9.0
High Performance":

http://www.postgresql.org/docs/books/

(Full disclosure, I was one of the technical editors, but don't get
money from sales.)


> Rereading the Routine Vacuuming page in the docs, this sentence
> caught my eye:
>
> "However, for static tables (including tables that receive
> inserts, but no updates or deletes), there is no need to vacuum
> for space reclamation, so it can be useful to try to maximize the
> interval between forced autovacuums on very large static tables.
> Obviously one can do this either by increasing
> autovacuum_freeze_max_age or decreasing vacuum_freeze_min_age."
>
> We generally never delete from this database at all. This case
> was unusual; I was migrating the data from one table to another
> because we added some columns, and changed the datatype of many
> of the columns from text to more appropriate types (timestamp,
> int, uuid, inet, etc). Ideally, even then we wouldn't have
> preferred to delete anything until the whole table was migrated,
> but disk space became an issue. Bleh.

That might be a reason to tweak the settings, but since recent
versions of vacuum skip heap pages that won't benefit from vacuum,
I probably wouldn't.

> With that in mind, would you still recommend putting the
> autovacuum_freeze_max_age back to 200 million?

Where possible, I'm a big fan of incremental change.  I might nudge
it in that direction a little at a time and watch the behavior.  I
do think that periodic VACUUM ANALYZE statements (weekly?) of the
database might be a good supplement to the autovacuum jobs,
especially if you have a time when load tends to be lower to
schedule that in.
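
Concretely, that periodic pass could be as simple as a scheduled database-wide run (VERBOSE is optional, but handy for
keeping an eye on what it is doing):

-- Weekly, during a low-load window
VACUUM (ANALYZE, VERBOSE);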

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company