Thread: Who is causing all this i/o?

Who is causing all this i/o?

From:
Craig James
Date:
Our development server (PG 8.4.4 on Ubuntu server) is constantly doing something, and I can't figure out what.  The two
production servers, which are essentially identical, don't show these symptoms.  In a nutshell, it's showing 10K blocks
per second of data going out, all the time, and essentially zero blocks per second of input.

To start with, if I turn off Postgres, everything is quiet:

$ /etc/init.d/postgresql stop
Stopping PostgreSQL: server stopped
ok
[radon:root] /emi/logs/londiste # vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
  0  0 126140 1115944 112692 10466452    0    0   132   829    1    3  2  1 97  1
  0  0 126140 1116068 112692 10466452    0    0     0     0  150  175  0  0 100  0
  0  0 126140 1116184 112708 10466456    0    0     0    94  123  153  0  0 100  0
  0  0 126140 1116432 112708 10466456    0    0     0     0   96  126  0  0 100  0
  0  0 126140 1116928 112708 10466456    0    0     0     0   94  111  0  0 100  0
  0  0 126140 1116688 112708 10466456    0    0     0     0  114  144  0  0 100  0
  0  0 126140 1116364 112708 10466456    0    0    18    74  229  266  1  0 99  0
  0  0 126140 1116488 112708 10466492    0    0     0     0  111  138  0  0 100  0
  1  0 126140 1116868 112716 10466492    0    0     0    20   96  121  0  0 100  0

But when I restart Postgres, it looks like this, and runs this way 24/7.  Notice the "bo" column:

$ /etc/init.d/postgresql start
Starting PostgreSQL: ok
$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
  0  0 126140 1078020 112744 10480652    0    0   132   829    1    3  2  1 97  1
  0  1 126140 1054320 112744 10487792    0    0    52 10714  900  976  2  1 97  0
  0  1 126140 1015988 112760 10511060    0    0  1828  9620 1491 3527  5  2 87  6
  0  1 126140 1008316 112768 10517936    0    0   672 10994  982 1209  1  2 86 11
  0  0 126140 1017780 112776 10528824    0    0  1466 10746 1003 1428  2  1 89  8
  0  0 126140 1013324 112784 10533680    0    0     0  9718  766  858  1  0 99  0
  1  0 126140 991600 112784 10539220    0    0     0 10714  962 1127  2  1 97  0
  0  0 126140 999924 112788 10544380    0    0     6 10726  918 1227  2  1 97  0
  0  0 126140 995700 112800 10548816    0    0     0  9744  823  889  1  1 99  0
  2  0 126140 964560 112804 10554048    0    0     2 10718  903 1196  3  1 96  0

That seems to be a lot of output data going somewhere, but there doesn't seem to be anything going on that should cause
it:

postgres=# select datname, procpid, usename, substr(current_query,0,25) current_query, waiting from pg_stat_activity;
   datname   | procpid | usename  |      current_query       | waiting
------------+---------+----------+--------------------------+---------
  postgres   |   12927 | postgres | select datname, procpid, | f
  global     |   12185 | postgres | <IDLE>                   | f
  accounting |   12367 | postgres | <IDLE>                   | f
  orders     |   12225 | orders   | <IDLE>                   | f
  archive    |   12267 | postgres | <IDLE>                   | f
(5 rows)

After a lot of digging around, I found this in the /postgres/pg_stat_tmp directory.  If I list the directory including
the i-nodes once every second, I find that a new 2MB file is being created roughly once every two seconds:

# ls -li
total 2384
69731459 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat
# ls -li
total 2384
69731258 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat
# ls -li
total 2384
69731459 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat
# ls -li
total 2384
69731459 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat
# ls -li
total 3332
69731258 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat
69731459 -rw------- 1 postgres postgres  970752 2011-05-20 14:07 pgstat.tmp
# ls -li
total 2384
69731459 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat
# ls -li
total 2384
69731258 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat
# ls -li
total 2384
69731258 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat
# ls -li
total 2384
69731258 -rw------- 1 postgres postgres 2437957 2011-05-20 14:07 pgstat.stat

This is a development server, so there are no queries coming in from the web or other random activities.  Nothing is
going on.

The other two servers are configured identically.  If I diff the configuration files, the only difference is the IP
addresses for the "listen" section.

Can anyone tell me what's going on?  Why is pgstat.stat being rewritten on this server constantly and not on the other
two servers?

Thanks,
Craig


Re: Who is causing all this i/o?

From:
"Joshua D. Drake"
Date:
> The other two servers are configured identically. If I diff the
> configuration files, the only difference is the IP addresses for the
> "listen" section.
>
> Can anyone tell me what's going on? Why is pgstat.stat being rewritten
> on this server constantly and not on the other two servers?

Do you have something monitoring the stats of the system aggressively?
(like pg_stat_user_tables?)
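
For example, a monitoring job that polls the stats views every few
seconds would do it.  A hypothetical query of that kind, just to
illustrate what to look for:

  SELECT relname, seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
    FROM pg_stat_user_tables
   ORDER BY seq_scan DESC
   LIMIT 10;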

JD


>
> Thanks,
> Craig
>
>


--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579

Re: Who is causing all this i/o?

From:
Craig James
Date:
On 5/20/11 3:38 PM, Joshua D. Drake wrote:
>
>> The other two servers are configured identically. If I diff the
>> configuration files, the only difference is the IP addresses for the
>> "listen" section.
>>
>> Can anyone tell me what's going on? Why is pgstat.stat being rewritten
>> on this server constantly and not on the other two servers?
>
> Do you have something monitoring the stats of the system aggressively? (like pg_stat_user_tables?)
No, there is nothing like that running on this server.  Just Apache, and Apache can only be accessed from inside the
company, and its logs showed no activity at all.

Thanks,
Craig
>
> JD
>
>
>>
>> Thanks,
>> Craig
>>
>>
>
>


Re: Who is causing all this i/o?

From:
Steve Crawford
Date:
On 05/20/2011 02:14 PM, Craig James wrote:
> Our development server (PG 8.4.4 on Ubuntu server) is constantly doing
> something, and I can't figure out what.  The two production servers,
> which are essentially identical, don't show these symptoms.  In a
> nutshell, it's showing 10K blocks per second of data going out, all
> the time, and essentially zero blocks per second of input.
...
There are people on this list better equipped than I to get into the
guts of diagnosis, but a quick Google shows you aren't the only person
to see high IO with disk activity. There are suggestions that a db with
lots of temp-table creation and nothing triggering a vacuum could be
related. Another thread suggested that bloated system tables could be to
blame. Others look at the stats collectors. Groping a bit in the dark, I
would try...

Checking PG logs for info - especially anything indicating that vacuum
isn't running or isn't being triggered.

Manual vacuum.

Vacuum full of system tables (pg_catalog - especially pg_class and
pg_attribute) - possibly followed by a reindex all followed by a manual
vacuum. Note: IIRC, that file is shared across databases so you will
need to clean each one.

Reset stats collector at startup.

Check system after each change and report back if anything improves or
degrades the situation.
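
In psql, a rough sketch of those steps for one database might look like
this ("mydb" is a placeholder for each database name):

  VACUUM VERBOSE;                  -- plain manual vacuum first
  VACUUM FULL pg_catalog.pg_class;
  VACUUM FULL pg_catalog.pg_attribute;
  REINDEX SYSTEM mydb;             -- reindex the system catalogs
  VACUUM;                          -- follow-up manual vacuum
  SELECT pg_stat_reset();          -- reset this database's stats counters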

Cheers,
Steve




Re: Who is causing all this i/o?

From:
Scott Marlowe
Date:
On Fri, May 20, 2011 at 3:14 PM, Craig James <craig_james@emolecules.com> wrote:
> Our development server (PG 8.4.4 on Ubuntu server) is constantly doing
> something, and I can't figure out what.  The two production servers, which
> are essentially identical, don't show these symptoms.  In a nutshell, it's
> showing 10K blocks per second of data going out, all the time, and
> essentially zero blocks per second of input.
> After a lot of digging around, I found this in the /postgres/pg_stat_tmp
> directory.  If I list the directory including the i-nodes once every second,
> I find that a new 2MB file is being created roughly once every two seconds:

Have you got a lot of databases in your development environment?  I
think that can sometimes cause a lot of pg_stat writes.
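
A quick way to count them:

  SELECT count(*) FROM pg_database;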

Re: Who is causing all this i/o?

From:
Craig James
Date:
On 5/20/11 4:25 PM, Scott Marlowe wrote:
> On Fri, May 20, 2011 at 3:14 PM, Craig James<craig_james@emolecules.com>  wrote:
>> Our development server (PG 8.4.4 on Ubuntu server) is constantly doing
>> something, and I can't figure out what.  The two production servers, which
>> are essentially identical, don't show these symptoms.  In a nutshell, it's
>> showing 10K blocks per second of data going out, all the time, and
>> essentially zero blocks per second of input.
>> After a lot of digging around, I found this in the /postgres/pg_stat_tmp
>> directory.  If I list the directory including the i-nodes once every second,
>> I find that a new 2MB file is being created roughly once every two seconds:
> Have you got a lot of databases in your development environment?  I
> think that can sometimes cause a lot of pg_stat writes.
Yes.  The production servers have a dozen or so databases, but the development server has a couple hundred databases.
Does that count as "a lot of databases"?

We've had even more databases in the past (>500) and didn't see this sort of I/O activity.

The odd thing is that this activity is gradually growing, so slowly that you don't notice it right away.  A month or
two ago, nothing.  A couple weeks ago, a constant stream of "grass" in xload's graph.  Today, xload is showing 50% load
all the time.  I can only guess that the load will continue to increase.

Is there a way to tell Postgres not to do this, or to do it less?  It's not a big disaster (yet), but it's annoying to
have the system spewing a couple megabytes of pg_stat_tmp data every few seconds.

Thanks,
Craig

Re: Who is causing all this i/o?

From:
vequalsitimesz@gmail.com
Date:
I am tracking down possibly similar behavior on a production server, and a colleague introduced me to Dstat.  I am using it right now to try to track down the root of my problem.  It may or may not be helpful in figuring out exactly what's causing this, and then in figuring out how to fix it.

http://dag.wieers.com/home-made/dstat/#

then run something like:
dstat -t -l --cpu24 --sys --tcp --vm --nocolor --top-io --top-latency-avg --output /tmp/dstat.csv

Sincerely

Re: Who is causing all this i/o?

From:
Tom Lane
Date:
Craig James <craig_james@emolecules.com> writes:
> On 5/20/11 4:25 PM, Scott Marlowe wrote:
>> On Fri, May 20, 2011 at 3:14 PM, Craig James<craig_james@emolecules.com>  wrote:
>>> Our development server (PG 8.4.4 on Ubuntu server) is constantly doing
>>> something, and I can't figure out what.  The two production servers, which
>>> are essentially identical, don't show these symptoms.  In a nutshell, it's
>>> showing 10K blocks per second of data going out, all the time, and
>>> essentially zero blocks per second of input.
>>> After a lot of digging around, I found this in the /postgres/pg_stat_tmp
>>> directory.  If I list the directory including the i-nodes once every second,
>>> I find that a new 2MB file is being created roughly once every two seconds:

>> Have you got a lot of databases in your development environment?  I
>> think that can sometimes cause a lot of pg_stat writes.

> Yes.  The production servers have a dozen or so databases, but the
> development server has a couple hundred databases.  Does that count as
> "a lot of databases"?

Yeah.  I think what is happening is that the autovacuum launcher is
visiting every database, doing accesses to the system catalogs (and not
much more than that), which results in access-count updates in the stats
collector, which have to get written to disk.

What's not apparent however is why the stats collector is writing disk
so much.  8.4 does have the logic change to not write stats out unless
something is asking to see them.  So either it's really pre-8.4, or you
have a monitoring task that is constantly asking to see stats.

One possible band-aid solution is to increase autovacuum_naptime.  This
is defined as the AV cycle time *for each database*, so AV wakes up and
touches another database every naptime/#databases seconds.  If your
number of databases has been growing over time, this would probably
explain why the problem is getting worse.

            regards, tom lane

Re: Who is causing all this i/o?

From:
Craig James
Date:
On 5/21/11 8:11 AM, Tom Lane wrote:
> Craig James<craig_james@emolecules.com>  writes:
>> On 5/20/11 4:25 PM, Scott Marlowe wrote:
>>> On Fri, May 20, 2011 at 3:14 PM, Craig James<craig_james@emolecules.com>   wrote:
>>>> Our development server (PG 8.4.4 on Ubuntu server) is constantly doing
>>>> something, and I can't figure out what.  The two production servers, which
>>>> are essentially identical, don't show these symptoms.  In a nutshell, it's
>>>> showing 10K blocks per second of data going out, all the time, and
>>>> essentially zero blocks per second of input.
>>>> After a lot of digging around, I found this in the /postgres/pg_stat_tmp
>>>> directory.  If I list the directory including the i-nodes once every second,
>>>> I find that a new 2MB file is being created roughly once every two seconds:
>>> Have you got a lot of databases in your development environment?  I
>>> think that can sometimes cause a lot of pg_stat writes.
>> Yes.  The production servers have a dozen or so databases, but the
>> development server has a couple hundred databases.  Does that count
>> as "a lot of databases"?
> Yeah.  I think what is happening is that the autovacuum launcher is
> visiting every database, doing accesses to the system catalogs (and not
> much more than that), which results in access-count updates in the stats
> collector, which have to get written to disk.
>
> What's not apparent however is why the stats collector is writing disk
> so much.  8.4 does have the logic change to not write stats out unless
> something is asking to see them.  So either it's really pre-8.4,
$ ./postgres --version
postgres (PostgreSQL) 8.4.4

>   or you
> have a monitoring task that is constantly asking to see stats.
Nothing like that.  There's virtually nothing on this server but Postgres and Apache.  There are some fast-cgi perl
programs that stay connected to Postgres all the time, but the constant I/O starts up even if I kill Apache.
> One possible band-aid solution is to increase autovacuum_naptime.  This
> is defined as the AV cycle time *for each database*, so AV wakes up and
> touches another database every naptime/#databases seconds.
I'll try this.  Since it's a development system, it tends to get large bursts of loading followed by a bit of
searching. Cutting down on vacuums should be fine. 
>    If your
> number of databases has been growing over time, this would probably
> explain why the problem is getting worse.
For the last few months the number of databases has been almost static.  Maybe five added (of 200) in the last three
months, and probably a few others dropped.

Thanks,
Craig
>             regards, tom lane
>


Re: Who is causing all this i/o?

From:
Shianmiin
Date:
Tom Lane-2 wrote:
>
> What's not apparent however is why the stats collector is writing disk
> so much.  8.4 does have the logic change to not write stats out unless
> something is asking to see them.  So either it's really pre-8.4, or you
> have a monitoring task that is constantly asking to see stats.
>

We have a PostgreSQL 9.0.4 on CentOS for performance testing and we are
seeing a similar issue.  We have a "crazy" setup: one database with 1000
identical schemas.  There are occasional I/O write storms of over 100
MB/sec without any disk reads, and they can last for a couple of minutes
when the schemas/data are aggressively populated by pg_restore.  All the
I/O writes seem to be on pgstat.tmp.

The I/O write storm seemed to be triggered by vacuum.

The way we get around that is to point stats_temp_directory to a RAM
disk, which eliminates the I/O write storm effectively.
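
For reference, a sketch of that workaround (the path is just an example;
this setting only needs a reload, not a restart):

  -- in postgresql.conf, pointing at a tmpfs mount:
  --   stats_temp_directory = '/var/run/pg_stats_tmp'
  SELECT pg_reload_conf();
  SHOW stats_temp_directory;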

Is this an issue with our specific usage scenario, or a general issue to be resolved?


Re: Who is causing all this i/o?

From:
Craig James
Date:
On 6/17/11 11:51 AM, Shianmiin wrote:
> Tom Lane-2 wrote:
>> What's not apparent however is why the stats collector is writing disk
>> so much.  8.4 does have the logic change to not write stats out unless
>> something is asking to see them.  So either it's really pre-8.4, or you
>> have a monitoring task that is constantly asking to see stats.
>>
> We have a PostgreSQL 9.0.4 on CentOS for performance testing and we are
> seeing a similar issue.  We have a "crazy" setup: one database with 1000
> identical schemas.  There are occasional I/O write storms of over 100
> MB/sec without any disk reads, and they can last for a couple of minutes
> when the schemas/data are aggressively populated by pg_restore.  All the
> I/O writes seem to be on pgstat.tmp.
>
> The I/O write storm seemed to be triggered by vacuum.
Based on the advice I got from my original question, I changed autovacuum_naptime to "5min", and the problem completely
disappeared.  (I know that's a long interval, but this particular server gets maybe 5-10 heavy updates per week and is
idle the rest of the time.)
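
For anyone finding this later, the change is just a postgresql.conf edit
plus a reload (no restart is needed for this setting):

  -- in postgresql.conf:
  --   autovacuum_naptime = '5min'    -- default is 1min
  SELECT pg_reload_conf();
  SHOW autovacuum_naptime;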

select count(1) from pg_database ;
  count
-------
    267

It seems like there's a problem somewhere.  Autovacuum has improved enormously in the last couple of years, but some
change to its algorithm causes a lot of I/O thrashing when there are more than a few hundred separate databases.

Craig

Re: Who is causing all this i/o?

From:
Tom Lane
Date:
Craig James <craig_james@emolecules.com> writes:
> On 6/17/11 11:51 AM, Shianmiin wrote:
>> We have a PostgreSQL 9.0.4 on CentOS for performance testing and we are
>> seeing a similar issue.  We have a "crazy" setup: one database with 1000
>> identical schemas.  There are occasional I/O write storms of over 100
>> MB/sec without any disk reads, and they can last for a couple of minutes
>> when the schemas/data are aggressively populated by pg_restore.  All the
>> I/O writes seem to be on pgstat.tmp.

> Based on the advice I got from my original question, I changed autovacuum_naptime to "5min", and the problem
> completely disappeared.  (I know that's a long interval, but this particular server gets maybe 5-10 heavy updates
> per week and is idle the rest of the time.)

> select count(1) from pg_database ;
>   count
> -------
>     267

> It seems like there's a problem somewhere.  Autovacuum has improved enormously in the last couple of years, but some
> change to its algorithm causes a lot of I/O thrashing when there are more than a few hundred separate databases.

Well, if you have a lot of databases then you definitely need to
increase the naptime to keep autovac's demands for stats within bounds.
I don't find that surprising, though I do wonder if we ought to redefine
the way that GUC works: right now, you get one autovac wakeup every
naptime/databases seconds.
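
In concrete terms, with the couple hundred databases reported upthread,
the wakeup interval is the naptime divided by the database count:

  SELECT round(60.0/267, 2)  AS secs_default_1min,
         round(300.0/267, 2) AS secs_at_5min;

i.e. one stats-touching wakeup roughly every 0.22 seconds at the default
naptime, or every 1.1 seconds at 5min.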

The OP seems to have some other issue, though, since he says he's only
got 1 database.

            regards, tom lane