Thread: Problem with pgstat timeouts


Problem with pgstat timeouts

From:
"Benjamin Krajmalnik"
Date:

About a month ago, I started receiving quite a few pgstat timeouts on my production database.

PostgreSQL 9.0.3 on amd64-portbld-freebsd8.1, compiled by GCC cc (GCC) 4.2.1 20070719  [FreeBSD], 64-bit

I am not sure where to start looking for the cause.

The database has been up since March without any restart.

 

db1# ps -aux | grep stat

pgsql      1287  1.0  0.0 22832  4724  ??  Ss   11Mar11 4265:13.87 postgres: stats collector process    (postgres)

 

It appears that these timeouts are affecting the performance of the application – sometimes it hangs for a few seconds before data gets returned, which was not the case before this.

 

 

2011-12-22 03:45:04 MST WARNING:  pgstat wait timeout

2011-12-22 03:45:04 MST WARNING:  pgstat wait timeout

2011-12-22 03:45:04 MST WARNING:  pgstat wait timeout

2011-12-22 03:45:04 MST WARNING:  pgstat wait timeout

2011-12-22 03:45:41 MST WARNING:  pgstat wait timeout

2011-12-22 03:45:41 MST WARNING:  pgstat wait timeout

2011-12-22 03:45:41 MST WARNING:  pgstat wait timeout

2011-12-22 03:46:18 MST WARNING:  pgstat wait timeout

2011-12-22 03:46:18 MST WARNING:  pgstat wait timeout

2011-12-22 03:47:32 MST WARNING:  pgstat wait timeout

2011-12-22 03:47:32 MST WARNING:  pgstat wait timeout

2011-12-22 03:47:32 MST WARNING:  pgstat wait timeout

2011-12-22 03:48:09 MST WARNING:  pgstat wait timeout

2011-12-22 03:48:09 MST WARNING:  pgstat wait timeout

2011-12-22 03:48:09 MST WARNING:  pgstat wait timeout

2011-12-22 03:49:26 MST WARNING:  pgstat wait timeout

 

The runtime statistics settings for the database are as follows:

 

#------------------------------------------------------------------------------

# RUNTIME STATISTICS

#------------------------------------------------------------------------------

                                       

# - Query/Index Statistics Collector -

                                       

#track_activities = on

#track_counts = on

#track_functions = none                 # none, pl, all

#track_activity_query_size = 1024       # (change requires restart)

                                       

# On FreeBSD, this is a performance hog, so keep it off if you need speed  

update_process_title = off

#update_process_title = on

#stats_temp_directory = 'pg_stat_tmp'

                                       

                                        

# - Statistics Monitoring -

                                       

#log_parser_stats = off

#log_planner_stats = off

#log_executor_stats = off

#log_statement_stats = off
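
A side note for readers who land here with the same symptom: the stats collector continuously rewrites its stats file in stats_temp_directory (commented out above, so the default pg_stat_tmp under the data directory is in use), and pointing that at a memory-backed filesystem is a common way to take that small but constant write load off the data drives. This is only a sketch - the mount point is hypothetical and assumes a tmpfs/md filesystem already exists there:

stats_temp_directory = '/pg_stats_ram'  # hypothetical RAM-backed mount; takes effect on reload, no restart needed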

 

 

Re: Problem with pgstat timeouts

From:
Tom Lane
Date:
"Benjamin Krajmalnik" <kraj@servoyant.com> writes:
> About a month ago, I started receiving quite a few pgstat timeouts on my
> production database.

> PostgreSQL 9.0.3 on amd64-portbld-freebsd8.1, compiled by GCC cc (GCC)
> 4.2.1 20070719  [FreeBSD], 64-bit

> I am not sure where to start looking for the cause.

Yeah, so are the rest of us :-(.  People have been reporting sometimes
seeing these messages without any clear cause.  We've been unable to
find any bug in Postgres that would explain them, and the obvious
explanation of "your system is so overloaded that the stats collector
is too slow to respond" isn't very satisfying if response time seems
snappy otherwise.  I think we have seen one case where the problem was
provably caused by significant (multi-second) clock skew between
different processors, but that's likely not applicable to many people.

Can you reconstruct what changed in your system around the time you
started seeing these?

> It appears that these timeouts are affecting the performance of the
> application - sometimes it hangs for a few seconds before data gets
> returned, which was not the case before this.

Usually these timeouts are reported by autovacuum, and so should not be
affecting the performance of foreground queries, at least not queries
that are not explicitly requesting pgstats data.  The fact that your
log messages don't include any query context reinforces the idea that
they are coming from autovacuum for you too.  So if you are seeing
query response time problems, maybe the pgstat timeouts are another
symptom, not the cause, of an underlying too-much-load problem.  You
might try turning on query duration logging and see if you can correlate
response-time glitches and pgstat timeouts to anything else observable,
such as peaks of system load average.

            regards, tom lane
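
For readers who want to follow Tom's suggestion, the relevant postgresql.conf knobs (9.0 parameter names) look roughly like this; the 250 ms threshold is only an illustrative starting point, not a value recommended in the thread:

log_min_duration_statement = 250        # log any statement that runs longer than 250 ms
log_checkpoints = on                    # log checkpoint start/complete with write/sync timings
log_line_prefix = '%t [%p] '            # timestamp and PID, to line entries up with iostat/load samples

All three take effect on a reload (pg_ctl reload, or SELECT pg_reload_conf()); no restart is required.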

Re: Problem with pgstat timeouts

From:
"Benjamin Krajmalnik"
Date:
Thanks, Tom.
The system is running 16 cores (dual E5620's), 96GB RAM.
Nothing changed on the system when this started happening - at least
nothing in the configuration - it has not been touched since it was set
up.
Load averages are between 6 and 8 (all 3 of them).
Memory-wise, I do not see a problem - plenty of free memory, and only
120K of swap has been used.
The only thing which I see when I run iostat is load on the drives every
so often.  mfid0 is the database, mfid1 is pg_xlog.


db1# iostat -x 5
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   1.6   0
ad6        0.0   0.0     0.0     0.0    0   0.8   0
mfid0     11.6 172.2   171.7  2988.5    0 178.1   6
mfid1      0.1 734.0     1.1  5898.8    0   0.2   4
ar0        0.3   0.7     1.0    10.3    0 114.7   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      2.8  73.4    22.4  1961.9    0   0.2   2
mfid1      0.0 1784.9     0.0 14303.2    0   0.0   5
ar0        0.0   1.0     0.0     6.8    0   0.8   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      2.2 652.7    44.8  7671.1 2053  99.6  10
mfid1      0.0 1525.9     0.0 12207.6    1   0.1  12
ar0        0.0   0.0     0.0     0.0    0   0.0   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      1.0 2202.4     8.0 19776.3    0 632.0  68
mfid1      0.0 574.3     0.0  4666.4    0   0.5  23
ar0        0.0   0.0     0.0     0.0    0   0.0   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      1.8  75.8    14.4  2242.8    0   0.3   2
mfid1      0.0 1722.5     0.0 13802.0    0   0.0   5
ar0        0.0   0.0     0.0     0.0    0   0.0   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      1.4  71.4    11.2  1991.6    0   0.2   1
mfid1      0.0 1708.9     0.0 13671.4    0   0.0   4
ar0        0.2   2.4     3.2    38.4    0   2.5   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      1.0  78.0     8.0  2222.5    0   0.1   1
mfid1      0.0 1541.0     0.0 12352.1    0   0.0   5
ar0        0.0   0.6     0.0     9.6    0   0.6   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      1.4  56.6    11.2  1629.1    0   0.3   2
mfid1      0.0 1558.3     0.0 12468.3    0   0.0   6
ar0        0.0   0.6     0.0     3.2    0   0.5   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      1.8  72.2    14.4  1706.7    0   0.2   1
mfid1      0.0 1696.5     0.0 13596.2    0   0.0   5
ar0        0.0   0.0     0.0     0.0    0   0.0   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      0.0 1757.7     0.0 16093.5  351 648.0  48
mfid1      0.0 503.9     0.0  4031.3    0   0.2  10
ar0        0.0   0.0     0.0     0.0    0   0.0   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      0.0 125.0     0.0  2542.5    0 404.4   1
mfid1      0.0 1743.9     0.0 13951.1    0   0.1   5
ar0        0.0   0.0     0.0     0.0    0   0.0   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      1.2  73.4     9.6  2122.6    0   0.1   1
mfid1      0.0 1624.3     0.0 13020.0    0   0.0   4
ar0        0.0   2.0     0.0    32.0    0   1.8   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      0.2  67.4     1.6  1955.8    0   0.1   0
mfid1      0.0 1334.5     0.0 10699.7    0   0.0   5
ar0        0.0   0.4     0.0     6.4    0   0.5   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      1.0  64.6     8.0  1789.1    0   0.2   1
mfid1      0.0 1541.0     0.0 12352.1    0   0.0   5
ar0        0.0   0.0     0.0     0.0    0   0.0   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      1.0  78.4     8.0  1855.3    0   0.2   1
mfid1      0.0 1652.1     0.0 13217.0    0   0.0   5
ar0        0.0   0.8     0.0     6.4    0   0.7   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      0.6 691.6     4.8  7919.5 3461  82.4  11
mfid1      0.0 1477.7     0.0 11821.7    1   0.3  15
ar0        0.0   0.0     0.0     0.0    0   0.0   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      0.4 1346.2     3.2 12558.1    0 855.1  41
mfid1      0.0 1020.8     0.0  8190.6    0   0.5  39
ar0        0.0   0.0     0.0     0.0    0   0.0   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      1.0  62.2     8.0  1792.6    0   0.3   2
mfid1      0.0 1636.3     0.0 13112.4    0   0.0   4
ar0        0.0   0.8     0.0    12.8    0   0.8   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      0.2  63.0     1.6  1818.1    0   0.1   0
mfid1      0.0 1710.6     0.0 13684.6    0   0.0   5
ar0        0.0   1.2     0.0    19.2    0   1.2   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      0.8  83.8     6.4  2330.6    0   0.1   1
mfid1      0.0 1218.8     0.0  9750.2    0   0.0   4
ar0        0.0   0.4     0.0     6.4    0   0.5   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      0.0  68.6     0.0  1655.7    0   0.1   0
mfid1      0.0 1708.9     0.0 13671.4    0   0.0   5
ar0        0.0   0.6     0.0     3.2    0   0.5   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      0.0  71.6     0.0  1999.4    0   0.0   0
mfid1      0.0 1071.3     0.0  8594.1    0   0.1   3
ar0        0.0   0.0     0.0     0.0    0   0.0   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      0.4 1373.6     3.2 12956.5    0 550.7  35
mfid1      0.0 999.8     0.0  8022.7    0   0.8  37
ar0        0.0   0.0     0.0     0.0    0   0.0   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      0.8  68.8     8.0  1908.5    0   0.2   1
mfid1      0.0 1370.5     0.0 10962.3    0   0.0   3
ar0        0.0   0.6     0.0     9.6    0   0.6   0
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        0.0   0.0     0.0     0.0    0   0.0   0
ad6        0.0   0.0     0.0     0.0    0   0.0   0
mfid0      1.2  63.8     9.6  1912.1    0   0.3   2
mfid1      0.0 1355.8     0.0 10846.2    0   0.0   3
ar0        0.0   1.0     0.0    16.0    0   1.1   0

Looking at /var/log/messages, everything appears to be ok.
The only thing I found, but it was from over a month ago, was the
following:
Nov 17 06:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 06:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 06:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 06:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
Nov 17 06:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 06:35:03 db1 kernel: MCA: Misc 0x4701041000041280
Nov 17 07:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 07:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 07:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 07:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
Nov 17 07:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 07:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 17 09:35:03 db1 kernel: MCA: Bank 8, Status 0x88000040000200cf
Nov 17 09:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 09:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 09:35:03 db1 kernel: MCA: CPU 0 COR (1) MS channel ?? memory error
Nov 17 09:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 18 13:34:59 db1 kernel: MCA: Bank 8, Status 0xcc0000800001009f
Nov 18 13:34:59 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 18 13:34:59 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 18 13:34:59 db1 kernel: MCA: CPU 0 COR (2) OVER RD channel ?? memory error
Nov 18 13:34:59 db1 kernel: MCA: Address 0x84aa98000
Nov 18 13:34:59 db1 kernel: MCA: Misc 0x4701041000044707
But it has not reoccurred.



Re: Problem with pgstat timeouts

From:
Tom Lane
Date:
"Benjamin Krajmalnik" <kraj@servoyant.com> writes:
> The only thing which I see when I run iostat is load on the drives every
> so often.  mfid0 is the database, mfid1 is pg_xlog.

Hmm.  Maybe checkpoints or something else saturating your drives for a
little while?  You might want to be more aggressive about smoothing out
checkpoints, cf
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
At the very least I'd enable checkpoint logging for awhile and see
if there seems to be any correlation to the response-time glitches.

            regards, tom lane

Re: Problem with pgstat timeouts

From:
"Benjamin Krajmalnik"
Date:
checkpoint_segments = 1024              # in logfile segments, min 1, 16MB each
checkpoint_timeout = 60min              # range 30s-1h
checkpoint_completion_target = 0.85     # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s               # 0 disables
select * from pg_stat_bgwriter gives me the following:
checkpoints_timed - 6887
checkpoints_req - 36
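
(For context, those two counters come from pg_stat_bgwriter; the rest of the view - 9.0 column names, not figures quoted in this thread - shows whether buffers are mostly written by checkpoints, by the background writer, or directly by backends, which is a rough indicator of whether checkpoints are keeping up:)

select checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint,  -- written during checkpoints
       buffers_clean,       -- written by the background writer
       maxwritten_clean,    -- times the bgwriter stopped early after hitting its write limit
       buffers_backend      -- written directly by backends; high values suggest the writers are lagging
from pg_stat_bgwriter;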

A snapshot from the log (I am logging checkpoints)

2011-12-22 01:04:33 MST LOG:  checkpoint starting: time
2011-12-22 01:12:32 MST WARNING:  pgstat wait timeout
2011-12-22 01:12:32 MST WARNING:  pgstat wait timeout
2011-12-22 01:13:12 MST WARNING:  pgstat wait timeout
2011-12-22 01:13:12 MST WARNING:  pgstat wait timeout
2011-12-22 01:13:12 MST WARNING:  pgstat wait timeout
2011-12-22 01:13:18 MST WARNING:  pgstat wait timeout
2011-12-22 01:13:18 MST WARNING:  pgstat wait timeout
2011-12-22 01:35:35 MST WARNING:  pgstat wait timeout
2011-12-22 01:35:35 MST WARNING:  pgstat wait timeout
2011-12-22 01:38:32 MST WARNING:  pgstat wait timeout
2011-12-22 01:38:32 MST WARNING:  pgstat wait timeout
2011-12-22 01:41:28 MST LOG:  checkpoint complete: wrote 329593 buffers (41.9%); 0 transaction log file(s) added, 0 removed, 731 rec



2011-12-22 09:04:33 MST LOG:  checkpoint starting: time
2011-12-22 09:06:54 MST WARNING:  pgstat wait timeout
2011-12-22 09:06:54 MST WARNING:  pgstat wait timeout
2011-12-22 09:06:54 MST WARNING:  pgstat wait timeout
2011-12-22 09:07:33 MST WARNING:  pgstat wait timeout
2011-12-22 09:07:33 MST WARNING:  pgstat wait timeout
2011-12-22 09:07:33 MST WARNING:  pgstat wait timeout
2011-12-22 09:13:41 MST WARNING:  pgstat wait timeout
2011-12-22 09:22:07 MST ERROR:  canceling autovacuum task
2011-12-22 09:22:07 MST CONTEXT:  automatic vacuum of table
"ishield.public.tbltmptests"
2011-12-22 09:28:41 MST WARNING:  pgstat wait timeout
2011-12-22 09:28:41 MST WARNING:  pgstat wait timeout
2011-12-22 09:31:35 MST WARNING:  pgstat wait timeout
2011-12-22 09:31:35 MST WARNING:  pgstat wait timeout
2011-12-22 09:33:16 MST WARNING:  pgstat wait timeout
2011-12-22 09:33:16 MST WARNING:  pgstat wait timeout
2011-12-22 09:33:55 MST WARNING:  pgstat wait timeout
2011-12-22 09:33:55 MST WARNING:  pgstat wait timeout
2011-12-22 09:33:55 MST WARNING:  pgstat wait timeout
2011-12-22 09:35:10 MST WARNING:  pgstat wait timeout
2011-12-22 09:35:10 MST WARNING:  pgstat wait timeout
2011-12-22 09:35:10 MST ERROR:  canceling autovacuum task
2011-12-22 09:35:10 MST CONTEXT:  automatic vacuum of table
"ishield.public.tbltmptests"
2011-12-22 09:36:21 MST WARNING:  pgstat wait timeout
2011-12-22 09:36:21 MST WARNING:  pgstat wait timeout
2011-12-22 09:37:02 MST WARNING:  pgstat wait timeout
2011-12-22 09:37:02 MST WARNING:  pgstat wait timeout
2011-12-22 09:37:02 MST WARNING:  pgstat wait timeout
2011-12-22 09:37:43 MST WARNING:  pgstat wait timeout
2011-12-22 09:37:43 MST WARNING:  pgstat wait timeout
2011-12-22 09:38:20 MST WARNING:  pgstat wait timeout
2011-12-22 09:38:20 MST WARNING:  pgstat wait timeout
2011-12-22 09:38:59 MST WARNING:  pgstat wait timeout
2011-12-22 09:38:59 MST WARNING:  pgstat wait timeout
2011-12-22 09:38:59 MST WARNING:  pgstat wait timeout
2011-12-22 09:42:33 MST LOG:  checkpoint complete: wrote 362856 buffers (46.1%); 0 transaction log file(s) added, 0 removed, 908 rec

I do have an aggressive autovacuum because I have some tables that get
updated thousands of times per minute and otherwise they bloat terribly,
but it is very seldom that I see cancellation of autovacuum, and only on
this table, which is a temporary table used for the insertion of the
data, in which records are inserted and deleted - never updated.
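
(As an aside, the per-table form of this kind of aggressive autovacuum tuning is normally expressed as storage parameters on the hot table. The table name below is taken from the CONTEXT lines above and the numbers are purely illustrative, not the settings actually in use here:)

-- illustrative only: fire autovacuum after a fixed number of dead rows
-- rather than a fraction of the table, and drop the cost delay
alter table public.tbltmptests set (
    autovacuum_vacuum_scale_factor = 0.0,
    autovacuum_vacuum_threshold    = 5000,
    autovacuum_vacuum_cost_delay   = 0
);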



Re: Problem with pgstat timeouts

From:
"Benjamin Krajmalnik"
Date:
Just a WAG - is it possible this is being caused by the connection
pooler (pgbouncer in my case)?



Re: Problem with pgstat timeouts

From:
"Benjamin Krajmalnik"
Date:
I have proceeded to do some more checking, and I see in iostat that the pg_xlog drive has a significantly higher busy
state than before.  Whereas it was barely busy when we first spun up the server (total %busy since we started the server
is about 6%), it is now in the 80s almost steady state.  We have a set of partitioned tables which are continuously
updated, and based on the size of them they no longer fit in the shared memory which was allocated.  pg_xlog is on a SAS
RAID1.  The server is set up with streaming replication to an identical server.  The one thing which I just checked is
the RAID mode on the server.
db1# megacli64 -LDInfo -lAll -aAll


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 2.452 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy       : Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None


Virtual Drive: 1 (Target Id: 1)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 418.656 GB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 2
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy       : Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None

Is it possible that the WriteThrough is what is causing the high io (and maybe the pgstat wait timeouts as well)?
If this is the case, would it be safe to change the cache to Write back?

Additionally, and somewhat unrelated, is there anything special which I need to do when restarting the primary server
vis-à-vis the streaming replication server?  In other words, if I were to restart the main server, will the streaming
replication server reconnect and pick up once the primary comes online?


Re: Problem with pgstat timeouts

From:
"Kevin Grittner"
Date:
"Benjamin Krajmalnik" <kraj@servoyant.com> wrote:

> Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache
>   if Bad BBU
> Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write
>   Cache if Bad BBU

> Is it possible that the WriteThrough is what is causing the high
> io (and maybe the pgstat wait timeouts as well)?

Yes.

> If this is the case, would it be safe to change the cache to Write
> back?

Maybe.  How did it get into this (non-default) state?  Are batteries
installed and healthy?

> Additionally, and somewhat unrelated, is there anything special
> which I need to do when restarting the primary server vis-à-vis
> the streaming replication server?  In other words, if I were to
> restart the main server, will the streaming replication server
> reconnect and pick up once the primary comes online?

I think that should be pretty automatic as long as you haven't
promoted the standby to be the new master.

-Kevin
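
(For completeness: in 9.0 the standby's behaviour is driven by its recovery.conf. With a sketch like the one below - host, user and trigger path are placeholders - the walreceiver simply keeps retrying the connection, so the standby reattaches on its own once the restarted primary accepts connections again:)

standby_mode = 'on'
primary_conninfo = 'host=db1.example.com port=5432 user=replication'   # placeholder host/user
trigger_file = '/usr/local/pgsql/data/failover.trigger'                # promote only if this file appears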

Re: Problem with pgstat timeouts

From:
"Benjamin Krajmalnik"
Date:
Batteries are installed and healthy.  I wonder if it did it during one of the charge/discharge cycles and just did not
revert (interestingly, I have one more server that did just that).  I am trying to get the info from LSI as to what may
have caused it, if it is safe to revert the cache mode while it is running, and, of course, the exact command so I don't
foobar things up.
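
(For readers in the same spot: the MegaCLI incantations usually involved are roughly the ones below. Flags differ between MegaCLI releases, so treat this as a sketch to check against LSI's documentation rather than the confirmed procedure being waited on here:)

db1# megacli64 -AdpBbuCmd -GetBbuStatus -aAll    # battery state of charge and relearn status
db1# megacli64 -LDGetProp -Cache -LAll -aAll     # current cache policy for each logical drive
db1# megacli64 -LDSetProp WB -L1 -a0             # request WriteBack on logical drive 1, adapter 0

With the default "No Write Cache if Bad BBU" policy the controller is only supposed to fall back to WriteThrough while the BBU is bad or relearning, and to return to WriteBack on its own afterwards, so a policy that stays stuck on WriteThrough points at the BBU or the controller firmware - which is where this thread ends up.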



Re: Problem with pgstat timeouts

From:
"Benjamin Krajmalnik"
Date:
Just a follow-up - thanks to Tom, Kevin, and the rest of the user group, as usual, for their great support.
It appears that I have hit a nice firmware bug, so I am going to have to upgrade the controller firmware.  Apparently,
it has some issues communicating with the BBU, so although the BBU is showing as good, trying to bring it to WriteBack
says it is "below threshold".  Since I am going to be taking down 4 servers (our entire infrastructure) to perform the
firmware upgrades, I will end up replacing the BBUs for good measure.

Once again, thanks to everyone for their assistance, and I guess this pgstat timeout can be attributed to the
controller :)


