Обсуждение: wrong database name in error message?

Поиск
Список
Период
Сортировка

wrong database name in error message?

От
Rural Hunter
Дата:
I'm on 9.2.4 and I have several databases on the instance say db1, db2,
db2 etc.  Today I got this error message on connection of any of the
databases:
ERROR: database is not accepting commands to avoid wraparound data loss
in database "db1"
   Suggestion:Stop the postmaster and use a standalone backend to
vacuum that database.

The error messages says problem is on db1 when I connected to any database. But after I did vacuum full on some other
databasessuch as db2, db2, but not db1, the problem disappeared. 
Was that just a wrong database name in the error message?



Re: wrong database name in error message?

От
David Johnston
Дата:
Rural Hunter wrote
> I'm on 9.2.4 and I have several databases on the instance say db1, db2,
> db2 etc.  Today I got this error message on connection of any of the
> databases:
> ERROR: database is not accepting commands to avoid wraparound data loss
> in database "db1"
>    Suggestion:Stop the postmaster and use a standalone backend to
> vacuum that database.
>
> The error messages says problem is on db1 when I connected to any
> database. But after I did vacuum full on some other databases such as db2,
> db2, but not db1, the problem disappeared.
> Was that just a wrong database name in the error message?

Likely auto-vacuum kicked in and took care of the wraparound issue while the
system was handling your manual vacuum routines...but that is just a theory

David J.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/wrong-database-name-in-error-message-tp5770815p5770822.html
Sent from the PostgreSQL - admin mailing list archive at Nabble.com.


Re: wrong database name in error message?

От
Rural Hunter
Дата:
于 2013/9/14 10:25, David Johnston 写道:
> Likely auto-vacuum kicked in and took care of the wraparound issue
> while the system was handling your manual vacuum routines...but that
> is just a theory
I don't think so. I had to use single connection mode to run the vacuum
full on other dbs.



Re: wrong database name in error message?

От
Alvaro Herrera
Дата:
Rural Hunter escribió:
> I'm on 9.2.4 and I have several databases on the instance say db1,
> db2, db2 etc.  Today I got this error message on connection of any
> of the databases:
> ERROR: database is not accepting commands to avoid wraparound data
> loss in database "db1"
>   Suggestion:Stop the postmaster and use a standalone backend to
> vacuum that database.
>
> The error messages says problem is on db1 when I connected to any database. But after I did vacuum full on some other
databasessuch as db2, db2, but not db1, the problem disappeared. 
> Was that just a wrong database name in the error message?

The ultimate source of truth here are the pg_class and pg_database
catalogs (pg_class for each database stores the age of every table in
that database; pg_database stores the minimum of such values in each
database).  The database name you see in the error messages is stored in
pg_control (actually it's the OID that's stored not the name), but
vacuuming other databases might have updated the pg_control info because
of updated calculations from the shared catalog.

I don't know how to explain the discrepancy other than concurrent
processing by autovacuum, though.  Perhaps autovacuum, in the last few
Xids you had left, processed that database, but the field in pg_control
didn't get updated until after you processed the other databases?  Not
sure about this.  But it's past my bed time here, so no further
speculation from me.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: wrong database name in error message?

От
Rural Hunter
Дата:
Sure. thanks any away and have a good night.

Let me put here the whole scenario:
1. I was called by our application users that all the updating was
failing. So I went to check the db. Any update transaction including
manual vacuum is blocked out by the error message:
ERROR: database is not accepting commands to avoid wraparound data loss
in database "db1"
   Suggestion:Stop the postmaster and use a standalone backend to
vacuum that database.

2. Since db1 is a very large database(it is the main db the user is
using) I can not afford to take long time to vacuum full on that. So I
thought about to try on other small dbs first.

3. I stop the instance.

4. I use "echo 'vacuum full;' | postgres --single -D /pgdata [other
dbs]" to vacuum some other dbs. I still got several warning messages
when vacuum the first database(let's say db2):
2013-09-14 08:56:44 CST [5536]: [453-1] user=,db=,host= WARNING:
database "db1" must be vacuumed within 999775 transactions
2013-09-14 08:56:44 CST [5536]: [454-1] user=,db=,host= HINT:  To avoid
a database shutdown, execute a database-wide VACUUM in that database.

Here the error message still points to db1.

5.  When I ran the single connection vacuum on other dbs(not db1), there
was not any error/warning message. So I tried to start whole instance.

6. I started the instance and found everything is fine.

So actually I have 3 questions here:
1. Was the db name in the error message wrong?
2. How would that happend? Shouldn't auto vacuum handle it and avoid
such problem?
3. How to detect such problem earlier?

于 2013/9/14 12:55, Alvaro Herrera 写道:
> The ultimate source of truth here are the pg_class and pg_database
> catalogs (pg_class for each database stores the age of every table in
> that database; pg_database stores the minimum of such values in each
> database). The database name you see in the error messages is stored
> in pg_control (actually it's the OID that's stored not the name), but
> vacuuming other databases might have updated the pg_control info
> because of updated calculations from the shared catalog. I don't know
> how to explain the discrepancy other than concurrent processing by
> autovacuum, though. Perhaps autovacuum, in the last few Xids you had
> left, processed that database, but the field in pg_control didn't get
> updated until after you processed the other databases? Not sure about
> this. But it's past my bed time here, so no further speculation from me.



Re: wrong database name in error message?

От
Kevin Grittner
Дата:
Rural Hunter <ruralhunter@gmail.com> wrote:

> Let me put here the whole scenario:
> 1. I was called by our application users that all the updating was
> failing. So I went to check the db. Any update transaction including
> manual vacuum is blocked out by the error message:
> ERROR: database is not accepting commands to avoid wraparound data loss
> in database "db1"
>   Suggestion:Stop the postmaster and use a standalone backend to
> vacuum that database.
>
> 2. Since db1 is a very large database(it is the main db the user is
> using) I can not afford to take long time to vacuum full on that. So I
> thought about to try on other small dbs first.

Why in the world would you want to use VACUUM FULL in this circumstance?

> 3. I stop the instance.
>
> 4. I use "echo 'vacuum full;' | postgres --single -D /pgdata [other
> dbs]" to vacuum some other dbs. I still got several warning messages
> when vacuum the first database(let's say db2):
> 2013-09-14 08:56:44 CST [5536]: [453-1] user=,db=,host= WARNING:
> database "db1" must be vacuumed within 999775 transactions
> 2013-09-14 08:56:44 CST [5536]: [454-1] user=,db=,host= HINT:  To avoid
> a database shutdown, execute a database-wide VACUUM in that database.

WARNING, not error, so the VACUUM would have run.

> Here the error message still points to db1.

I'm not sure which database would be referenced if the table which
needed the VACUUM was a shared table, like pg_database or
pg_authid.

> 5.  When I ran the single connection vacuum on other dbs(not db1), there
> was not any error/warning message. So I tried to start whole instance.
>
> 6. I started the instance and found everything is fine.
>
> So actually I have 3 questions here:
> 1. Was the db name in the error message wrong?

Probably not, to the extent that running VACUUM (FULL is not
necessary) against that database would have solved the problem.  If
it was a shared catalog table it might be that it was not the
*only* database which would work.

> 2. How would that happend? Shouldn't auto vacuum handle it and avoid
> such problem?

There are two possibilities -- either you had a long-running
transaction in the cluster or your autovacuum is not configured to
be aggressive enough to keep you out of trouble.

> 3. How to detect such problem earlier?

We would need a description of the machine (cores, RAM, storage
system) and the output of these queries to be able to make good
suggestions on tuning autovacuum:

SELECT version();
SELECT name, current_setting(name), source
  FROM pg_settings
  WHERE source NOT IN ('default', 'override');

In addition, make sure that you are monitoring for long-running
transactions.  A reasonable monitoring scheme might be to alert
when either of these queries returns any rows:

select * from pg_stat_activity where xact_start < (now() - interval '1 hour');
select * from pg_prepared_xacts where prepared < (now() - interval '1 minute');

You can, of course, adjust the intervals to what makes the most
sense for your environment.  If you have max_prepared_transactions
set to zero, the latter query is not really necessary.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: wrong database name in error message?

От
Rural Hunter
Дата:
于 2013/9/15 1:06, Kevin Grittner 写道:
> Rural Hunter <ruralhunter@gmail.com> wrote:
>
> Why in the world would you want to use VACUUM FULL in this circumstance?
> the db name in the error message wrong?
I just googled around and found the solution. What's the other option?
> There are two possibilities -- either you had a long-running
> transaction in the cluster or your autovacuum is not configured to
> be aggressive enough to keep you out of trouble.
I checked the running transactions before I stopped the instance. There
was no long running transaction exception one auto-vacuum running for
about 1 hour.
>
> We would need a description of the machine (cores, RAM, storage
> system) and the output of these queries to be able to make good
> suggestions on tuning autovacuum:
>
> SELECT version();
> SELECT name, current_setting(name), source
>    FROM pg_settings
>    WHERE source NOT IN ('default', 'override');
I'm on Ubuntu 12.04.1 64bit with 32 cores and 377G memory. The data is
stored on several rai10 SAS 15k disks.

postgres=# SELECT version();
version
---------------------------------------------------------------------------------------------------------------
  PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC)
4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit
(1 row)
postgres=# SELECT name, current_setting(name), source
postgres-#   FROM pg_settings
postgres-#   WHERE source NOT IN ('default', 'override');
             name             |
current_setting                                   | source

-----------------------------+------------------------------------------------------------------------------------+----------------------
  application_name            | psql.bin | client
  archive_command             | test ! -f /dbbk/postgres/logarch/%f.gz
&& gzip -c %p >/dbbk/postgres/logarch/%f.gz | configuration file
  archive_mode                | on | configuration file
  autovacuum                  | on | configuration file
  autovacuum_freeze_max_age   | 2000000000 | configuration file
  checkpoint_segments         | 20 | configuration file
  client_encoding             | UTF8 | client
  DateStyle                   | ISO, YMD | configuration file
  default_text_search_config  | chinesecfg | configuration file
  effective_cache_size        | 100GB | configuration file
  full_page_writes            | off | configuration file
  lc_messages                 | zh_CN.utf8 | configuration file
  lc_monetary                 | zh_CN.utf8 | configuration file
  lc_numeric                  | zh_CN.utf8 | configuration file
  lc_time                     | zh_CN.utf8 | configuration file
  listen_addresses            | * | configuration file
  log_autovacuum_min_duration | 30min | configuration file
  log_destination             | stderr | configuration file
  log_line_prefix             | %t [%p]: [%l-1]
user=%u,db=%d,host=%h                                              |
configuration file
  log_min_duration_statement  | 10s | configuration file
  log_statement               | ddl | configuration file
  log_timezone                | PRC | configuration file
  logging_collector           | on | configuration file
  maintenance_work_mem        | 20GB | configuration file
  max_connections             | 2500 | configuration file
  max_stack_depth             | 2MB | environment variable
  max_wal_senders             | 1 | configuration file
  port                        | 3500 | configuration file
  shared_buffers              | 16GB | configuration file
  synchronous_commit          | off | configuration file
  TimeZone                    | PRC | configuration file
  track_activities            | on | configuration file
  track_counts                | on | configuration file
  vacuum_freeze_table_age     | 1000000000 | configuration file
  wal_buffers                 | 16MB | configuration file
  wal_level                   | hot_standby | configuration file
  work_mem                    | 8MB | configuration file
(37 rows)
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



Re: wrong database name in error message?

От
bricklen
Дата:
On Sat, Sep 14, 2013 at 6:05 PM, Rural Hunter <ruralhunter@gmail.com> wrote:
于 2013/9/15 1:06, Kevin Grittner 写道:
Rural Hunter <ruralhunter@gmail.com> wrote:

Why in the world would you want to use VACUUM FULL in this circumstance?
the db name in the error message wrong?
I just googled around and found the solution. What's the other option?

A plain "VACUUM" will suffice -- there is no need to also use the FULL keyword in this case (wraparound).

Re: wrong database name in error message?

От
Alvaro Herrera
Дата:
Rural Hunter escribió:

> 2. Since db1 is a very large database(it is the main db the user is
> using) I can not afford to take long time to vacuum full on that. So
> I thought about to try on other small dbs first.
>
> 3. I stop the instance.
>
> 4. I use "echo 'vacuum full;' | postgres --single -D /pgdata [other
> dbs]" to vacuum some other dbs.

Two things.  One is you don't need VACUUM FULL, as already pointed out;
plain VACUUM suffices.  The other is that you don't actually need to
vacuum all tables; only those with a very old pg_class.relfrozenxid.
The one with the oldest value is that feeds pg_database.datfrozenxid;
and that's what feeds the "must be vacuumed within XY transactions"
messages.  So you can just connect to db1, examine pg_class looking for
tables whose age(relfrozenxid) is old, and vacuum only those.  No need
for downtime.

Now the interesting question is why didn't autovacuum get to these.
Normally it does, but when there's conflicting activity (say you have
periodic ALTER TABLE for some reason) it might not be able to.  Check
the log for ERRORs that made autovacuum kill itself, for example.  In
the long run, the best solution is to not have such conflicting activity
in the first place.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: wrong database name in error message?

От
Kevin Grittner
Дата:
Rural Hunter <ruralhunter@gmail.com> wrote:

> I'm on Ubuntu 12.04.1 64bit with 32 cores and 377G memory. The
> data is stored on several rai10 SAS 15k disks.

With a machine that beefy I have found it necessary to make the
autovacuum settings more aggressive.  Otherwise the need for
vacuuming can outpace the ability of autovacuum to keep up.

>   autovacuum_freeze_max_age  | 2000000000 | configuration file

>   vacuum_freeze_table_age    | 1000000000 | configuration file

There's your problem.  You left so little space between when
autovacuum would kick in for wraparound prevention (2 billion
transactions) and when the server prevents new transactions in
order to protect your data (2 ^ 31 - 1000000 transactions) that
autovacuum didn't have enough time to complete its effort to do so.

Changing a setting to ten times its default value is something
which should always be approached with caution.  In this case you
changed the threshold for starting the work to prevent data loss
from a little under 10% of the distance to the disastrous condition
to a little under 100% of that distance.

You could play with non-standard setting for these, but if you go
anywhere near this extreme you risk downtime like you have just
experienced.  Personally, I have never had a reason to change these
from the defaults.

To ensure that autovacuum can keep up with the activity on a
machine like this, I have generally gone to something like:

autovacuum_cost_limit = 800

If you have more than a couple large tables which take long enough
to scan to prevent small, frequently-updated tables from getting
attention soon enough, you might want to boost
autovacuum_max_workers, too.
 
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: wrong database name in error message?

От
Rural Hunter
Дата:
于 2013/9/16 1:31, Kevin Grittner 写道:
> There's your problem.  You left so little space between when
> autovacuum would kick in for wraparound prevention (2 billion
> transactions) and when the server prevents new transactions in
> order to protect your data (2 ^ 31 - 1000000 transactions) that
> autovacuum didn't have enough time to complete its effort to do so.
>
> Changing a setting to ten times its default value is something
> which should always be approached with caution.  In this case you
> changed the threshold for starting the work to prevent data loss
> from a little under 10% of the distance to the disastrous condition
> to a little under 100% of that distance.
>
> You could play with non-standard setting for these, but if you go
> anywhere near this extreme you risk downtime like you have just
> experienced.  Personally, I have never had a reason to change these
> from the defaults.
OK, thanks for pointing out the problem. This was changed quite long
time ago when I saw too frequent auto vacuums to prevent the wrap-around
on a very busy/large table which slow down the performance. I will
change it back to the default to see how it works.
>
> To ensure that autovacuum can keep up with the activity on a
> machine like this, I have generally gone to something like:
>
> autovacuum_cost_limit = 800
>
> If you have more than a couple large tables which take long enough
> to scan to prevent small, frequently-updated tables from getting
> attention soon enough, you might want to boost
> autovacuum_max_workers, too.
>
I will try the parameters as you suggested too.

So you guys still think the problem is on db1(that's my main db) as the
error message stated? Just auto-vacuum on db1 kicked off somehow and
fixed the problem when I was running vacuum on other dbs?


Re: wrong database name in error message?

От
Kevin Grittner
Дата:
Rural Hunter <ruralhunter@gmail.com> wrote:

> This was changed quite long time ago when I saw too frequent auto
> vacuums to prevent the wrap-around on a very busy/large table
> which slow down the performance. I will change it back to the
> default to see how it works.

There was a long-standing bug which could cause over-frequent
wraparound prevention autovacuums.  As long as you are on the
latest minor release, things should be much better now.

> I will try the parameters as you suggested too.

Possibly.  As I said before, I think the symptoms might better fit
a situation where the table in need of VACUUM was a shared table
and it just happened to mention db1 because that was the database
it was scanning at the time.  (Every database includes the shared
system tables in its catalog.)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: wrong database name in error message?

От
Rural Hunter
Дата:
于 2013/9/17 0:02, Kevin Grittner 写道:
> Possibly. As I said before, I think the symptoms might better fit a
> situation where the table in need of VACUUM was a shared table and it
> just happened to mention db1 because that was the database it was
> scanning at the time. (Every database includes the shared system
> tables in its catalog.)
OK, that sounds reasonable. thanks a lot for your advice and explanation.