Обсуждение: Something Weird Going on with VACUUM ANALYZE

Поиск
Список
Период
Сортировка

Something Weird Going on with VACUUM ANALYZE

От
Shaun Thomas
Дата:
Hey,

This is PostgreSQL 9.1.9.

So we've had "vacuumdb -avz" launched via cron at 3am for a few years
now, and recently noticed some queries behaving very badly. While
checking pg_stat_user_tables, I see this for several hundred of them:

        relname       |           last_analyze
---------------------+----------------------------------
  some_table          | 13-SEP-13 03:27:13.289291 -05:00
  another_table       | 13-SEP-13 03:33:51.262007 -05:00
  yet_another_table   | 13-SEP-13 03:23:27.630734 -05:00

Yet last_vacuum shows this:

        relname      |           last_vacuum
--------------------+----------------------------------
  some_table         | 17-SEP-13 03:23:41.84311 -05:00
  another_table      | 17-SEP-13 03:21:25.588267 -05:00
  yet_another_table  | 17-SEP-13 03:21:28.944848 -05:00

So I thought to myself, "Self, that's pretty freaking odd." The last
vacuumdb (with analyze flag enabled) was this morning at 3am.

Apparently something magical happened last Friday, and now analyze is
broken somehow? Am I missing something, here? The log claims everything
worked out OK:

2013-09-17 03:20:37 CDT|STATEMENT:  VACUUM (VERBOSE, ANALYZE);
2013-09-17 03:37:31 CDT|LOG:  duration: 2246467.567 ms  statement:
VACUUM (VERBOSE, ANALYZE);

These are from the same pid doing the vacuum. What's weird, is that the
lines don't match up in time.  The reported duration is 37 minutes, and
since the vacuum launches at 3:00am, it matches with the last line. If
that's the case, what on Earth is that line at 3:20 all about? The
durations for the last few days have also been about 50% shorter than
historically, which is mysterious all by itself.

I mean...

WITH stats AS (
SELECT relname,
        ceil(extract(epoch FROM last_vacuum)) AS vac_sec,
        ceil(extract(epoch FROM last_analyze)) AS an_sec
   FROM pg_stat_all_tables
  WHERE last_vacuum IS NOT NULL
    AND last_analyze IS NOT NULL
)
SELECT sum(CASE WHEN vac_sec = an_sec THEN 1 ELSE 0 END) AS same_vac,
        sum(CASE WHEN vac_sec != an_sec THEN 1 ELSE 0 END) AS diff_vac
   FROM stats

  same_vac | diff_vac
----------+----------
       508 |      739

I can't really figure this out. Any help would be appreciated. Since
VACUUM ANALYZE as a single statement is being wonky, I'm willing to use
a script like this until we can figure out the cause:

DO $$
DECLARE
   tablename VARCHAR;
   schemaname VARCHAR;
BEGIN
   FOR schemaname, tablename IN
       SELECT s.schemaname, s.relname
         FROM pg_stat_all_tables s
        WHERE s.last_analyze < CURRENT_DATE
   LOOP
     EXECUTE 'ANALYZE ' || quote_ident(schemaname) || '.' ||
             quote_ident(tablename);
   END LOOP;
END;
$$ LANGUAGE plpgsql;


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: Something Weird Going on with VACUUM ANALYZE

От
Albe Laurenz
Дата:
Shaun Thomas wrote:
> This is PostgreSQL 9.1.9.
> 
> So we've had "vacuumdb -avz" launched via cron at 3am for a few years
> now, and recently noticed some queries behaving very badly. While
> checking pg_stat_user_tables, I see this for several hundred of them:
> 
>         relname       |           last_analyze
> ---------------------+----------------------------------
>   some_table          | 13-SEP-13 03:27:13.289291 -05:00
>   another_table       | 13-SEP-13 03:33:51.262007 -05:00
>   yet_another_table   | 13-SEP-13 03:23:27.630734 -05:00
> 
> Yet last_vacuum shows this:
> 
>         relname      |           last_vacuum
> --------------------+----------------------------------
>   some_table         | 17-SEP-13 03:23:41.84311 -05:00
>   another_table      | 17-SEP-13 03:21:25.588267 -05:00
>   yet_another_table  | 17-SEP-13 03:21:28.944848 -05:00
> 
> So I thought to myself, "Self, that's pretty freaking odd." The last
> vacuumdb (with analyze flag enabled) was this morning at 3am.
> 
> Apparently something magical happened last Friday, and now analyze is
> broken somehow? Am I missing something, here? The log claims everything
> worked out OK:
> 
> 2013-09-17 03:20:37 CDT|STATEMENT:  VACUUM (VERBOSE, ANALYZE);
> 2013-09-17 03:37:31 CDT|LOG:  duration: 2246467.567 ms  statement:
> VACUUM (VERBOSE, ANALYZE);

It does sound odd.

What happens if you run VACUUM (VERBOSE, ANALYZE) manually?
Are the statistics updated?
Are there any warnings?

> These are from the same pid doing the vacuum. What's weird, is that the
> lines don't match up in time.  The reported duration is 37 minutes, and
> since the vacuum launches at 3:00am, it matches with the last line. If
> that's the case, what on Earth is that line at 3:20 all about? The
> durations for the last few days have also been about 50% shorter than
> historically, which is mysterious all by itself.

No idea about this.
Is there a lot of load on the system?

Yours,
Laurenz Albe

Re: Something Weird Going on with VACUUM ANALYZE

От
Shaun Thomas
Дата:
On 09/18/2013 02:32 AM, Albe Laurenz wrote:

> What happens if you run VACUUM (VERBOSE, ANALYZE) manually?
> Are the statistics updated?
> Are there any warnings?

It looks like running it manually worked fine, but I only ran it on
specific tables that showed up in the list of un-analyzed.

I see no errors, no warnings, no problems of any kind in the verbose log
I save every day. According to the database, everything is fine.

> No idea about this.
> Is there a lot of load on the system?

According to sar, it's a desert. There's a ton of disk IO caused by the
VACUUM ANALYZE, but that's it. The IO map is... very odd, though. It
looks like the vacuum IO didn't start until 3:20 instead of 3:00,
possibly because some lock was preventing it from starting. That might
explain that delay from the log, but not the fact it vacuumed for 17
minutes and neglected to analyze, instead of 55 minutes while doing both.

Logs show no unexpected or unusual behavior, either. And of course, the
vacuumdb ran fine this morning. If it comes up again, I'll dig further,
but I can't find any reason why our database stopped analyzing itself
for four days.

Thanks for the attempt though. :)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: Something Weird Going on with VACUUM ANALYZE

От
Jeff Janes
Дата:
On Tue, Sep 17, 2013 at 1:39 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
Hey,

This is PostgreSQL 9.1.9.

 
...
 
Apparently something magical happened last Friday, and now analyze is broken somehow? Am I missing something, here? The log claims everything worked out OK:

2013-09-17 03:20:37 CDT|STATEMENT:  VACUUM (VERBOSE, ANALYZE);
2013-09-17 03:37:31 CDT|LOG:  duration: 2246467.567 ms  statement: VACUUM (VERBOSE, ANALYZE);

These are from the same pid doing the vacuum. What's weird, is that the lines don't match up in time.  The reported duration is 37 minutes, and since the vacuum launches at 3:00am, it matches with the last line. If that's the case, what on Earth is that line at 3:20 all about?

That line is the final line of a multi-line log entry.  To know what it is about, you need to look at the lines before it in the logfile.  Perhaps it is failing to obtain a lock or something.

Cheers,

Jeff

Re: Something Weird Going on with VACUUM ANALYZE

От
Shaun Thomas
Дата:
On 09/18/2013 11:50 AM, Jeff Janes wrote:

> That line is the final line of a multi-line log entry.  To know what
> it is about, you need to look at the lines before it in the logfile.
> Perhaps it is failing to obtain a lock or something.

Thanks for the input, Jeff. The full chunk for that PID (22078) is:

|2013-09-17 03:20:08 CDT|LOG:  automatic vacuum of table
"table_redacted": could not (re)acquire exclusive lock for truncate scan
2013-09-17 03:20:08 CDT|STATEMENT:  VACUUM (VERBOSE, ANALYZE);
2013-09-17 03:20:37 CDT|LOG:  sending cancel to blocking autovacuum PID
23000
DETAIL:  Process 22078 waits for ShareUpdateExclusiveLock on relation
18745 of database 16384.
2013-09-17 03:20:37 CDT|STATEMENT:  VACUUM (VERBOSE, ANALYZE);
2013-09-17 03:37:31 CDT|LOG:  duration: 2246467.567 ms  statement:
VACUUM (VERBOSE, ANALYZE);

Which suggests an autovacuum was screwing with the launch time of the
database-wide VACUUM VERBOSE, or at least interfered with that one
table. What that doesn't explain, is why the VACUUM VERBOSE would finish
much sooner when this happens. 17 minutes in this case (if it was
blocking, or 37 if it was just stuck on that one table) instead of the
usual 55. I could see it starting later, but it should still take 55
minutes.

It did it again today, but this time, it "finished" in 25 minutes, and
didn't send a cancel to the other autovacuum. The fact it's so much
faster after encountering the truncate scan error, makes me wonder if
the analyze portion (high CPU usage) just plain stops working after that.

The first line also seems extremely misleading. a manually launched
vacuum is not an autovacuum, so why is it complaining about an
autovacuum? Shouldn't the PID doing the autovacuum emit the log message?
This all seems a little sketchy.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: Something Weird Going on with VACUUM ANALYZE

От
Kevin Grittner
Дата:
Shaun Thomas <sthomas@optionshouse.com> wrote:

> The first line also seems extremely misleading. a manually
> launched vacuum is not an autovacuum, so why is it complaining
> about an autovacuum? Shouldn't the PID doing the autovacuum emit
> the log message? This all seems a little sketchy.

The table truncation in autovacuum had severe problems in some
conditions which were causing production down time.  The fix
accidentally introduced some messages which are confusing, and
caused the statistics from autovacuum to fail to be generated more
often than had previously been the case.  These problems should be
fixed in the next minor release.  Until then, if you have a
workload which triggers these issues, the workaround is to schedule
ANALYZE jobs.

Where things are now running faster, it's hard to say how much is
from missing the analyze step, how much is due to caching issues,
and how much is actual gain from the fixes for the original
problems.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Something Weird Going on with VACUUM ANALYZE

От
Shaun Thomas
Дата:
On 09/19/2013 08:42 AM, Kevin Grittner wrote:

> The fix accidentally introduced some messages which are confusing,
> and caused the statistics from autovacuum to fail to be generated
> more often than had previously been the case.  These problems should
> be fixed in the next minor release.  Until then, if you have a
> workload which triggers these issues, the workaround is to schedule
> ANALYZE jobs.

Thanks Kevin, that explanation totally makes sense. I can switch to a
separate VACUUM and ANALYZE step to avoid this until the fix comes through.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email