Discussion: big select is resulting in a large amount of disk writing by kjournald

big select is resulting in a large amount of disk writing by kjournald

From: Joseph S
Date:
I just installed a shiny new database server with pg 8.4.1 running on
CentOS 5.4. After using slony to replicate over my database I decided to
do some basic performance tests to see how spiffy my shiny new server
is.  This machine has 32G ram, over 31 of which is used for the system
file cache.

So I run "select count(*) from large_table" and I see in xosview a solid
block of write activity. Runtime is 28125.644 ms for the first run.  The
second run does not show a block of write activity and takes 3327.441 ms

top shows that this writing is being done by kjournald. What is going on
here?  There is not a lot of write activity on this server so there
should not be a significant number of dirty cache pages that kjournald
would need to write out before it could read in my table. Certainly in
the 31G being used for file cache there should be enough non-dirty pages
that could be dropped to read in my table w/o having to flush anything
to disk. My table size is 2,870,927,360 bytes.

# cat /proc/sys/vm/dirty_expire_centisecs
2999
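
(As a side note, the amount of dirty data the kernel is actually
holding can be checked directly; this /proc query is standard, though
it was not part of the original post:)

# grep -E '^(Dirty|Writeback):' /proc/meminfo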

I restarted postgres and ran a count(*) on an even larger table.

[local]=> explain analyze select count(*) from et;
                                                             QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=6837051.82..6837051.83 rows=1 width=0) (actual
time=447240.157..447240.157 rows=1 loops=1)
    ->  Seq Scan on et  (cost=0.00..6290689.25 rows=218545025 width=0)
(actual time=5.971..400326.911 rows=218494524 loops=1)
  Total runtime: 447240.402 ms
(3 rows)

Time: 447258.525 ms
[local]=> explain analyze select count(*) from et;
                                                             QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=6837113.44..6837113.45 rows=1 width=0) (actual
time=103011.724..103011.724 rows=1 loops=1)
    ->  Seq Scan on et  (cost=0.00..6290745.95 rows=218546995 width=0)
(actual time=9.844..71629.497 rows=218496012 loops=1)
  Total runtime: 103011.832 ms
(3 rows)

Time: 103012.523 ms

[local]=> select pg_relation_size('et');
  pg_relation_size
------------------
       33631543296
(1 row)


I posted xosview snapshots from the two runs at:
http://www.tupari.net/2009-12-9/ This time the first run showed a mix of
read/write activity instead of the solid write I saw before.

Re: big select is resulting in a large amount of disk writing by kjournald

From: Kenneth Marshall
Date:
Hint bit I/O?

Ken

On Wed, Dec 09, 2009 at 01:29:00PM -0500, Joseph S wrote:
> So I run "select count(*) from large_table" and I see in xosview a solid
> block of write activity. Runtime is 28125.644 ms for the first run.  The
> second run does not show a block of write activity and takes 3327.441 ms
> [...]

Re: big select is resulting in a large amount of disk writing by kjournald

From: Greg Smith
Date:
Joseph S wrote:
> So I run "select count(*) from large_table" and I see in xosview a
> solid block of write activity. Runtime is 28125.644 ms for the first
> run.  The second run does not show a block of write activity and takes
> 3327.441 ms
http://wiki.postgresql.org/wiki/Hint_Bits

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: big select is resulting in a large amount of disk writing by kjournald

From: Joseph S
Date:
Greg Smith wrote:
> http://wiki.postgresql.org/wiki/Hint_Bits

Hmm. A large select results in a lot of writes? This seems broken.  And
if we are setting these hint bits then what do we need VACUUM for?  Is
there any way to tune this behavior? Is there any way to get stats on
how many rows/pages would need hint bits set?

Re: big select is resulting in a large amount of disk writing by kjournald

From: Greg Smith
Date:
Joseph S wrote:
> Hmm. A large select results in a lot of writes? This seems broken.
> And if we are setting these hint bits then what do we need VACUUM
> for?  Is there any way to tune this behavior? Is there any way to get
> stats on how many rows/pages would need hint bits set?
Basically, the idea is that if you're pulling a page in for something
else that requires you to compute the hint bits, just do it now so
VACUUM doesn't have to later, while you're in there anyway.  Why make
VACUUM do the work later if you're doing part of it now anyway?  If you
reorganize your test to VACUUM first *before* running the "select
count(*) from...", you'll discover the writes during SELECT go away.
You're running into the worst-case behavior.  For example, if you had
inserted the rows more slowly, you might discover that autovacuum would
do this cleanup before you even got to looking at the data.
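
In psql, that reorganized test is just the following (a minimal sketch,
reusing the "large_table" name from the original post):

VACUUM large_table;                -- sets the hint bits during the vacuum
SELECT count(*) FROM large_table;  -- now a read-only scan, no write burst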

There's no tuning for the behavior beyond making autovacuum more
aggressive (to improve odds it will get there first), and no visibility
into what's happening either.  And cranking up autovacuum has its own
downsides.  This situation shows up a lot when you're benchmarking
things, but not as much in the real world, so it's hard to justify
improving.
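
("More aggressive" means something like the following postgresql.conf
sketch; the values are illustrative, not recommendations from this
thread:)

autovacuum = on
autovacuum_naptime = 15s               # check for work more often (default 1min)
autovacuum_vacuum_scale_factor = 0.05  # vacuum after ~5% of a table changes (default 0.2)
autovacuum_vacuum_cost_delay = 10ms    # less throttling than the 20ms default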

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: big select is resulting in a large amount of disk writing by kjournald

From: Joseph S
Date:
Greg Smith wrote:
> Basically, the idea is that if you're pulling a page in for something
> else that requires you to compute the hint bits, just do it now so
> VACUUM doesn't have to later, while you're in there anyway.  Why make
> VACUUM do the work later if you're doing part of it now anyway?  If you

Then why not do all the work the VACUUM does?  What does VACUUM do anyway?

> reorganize your test to VACUUM first *before* running the "select
> count(*) from...", you'll discover the writes during SELECT go away.
> You're running into the worst-case behavior.  For example, if you had
> inserted the rows more slowly, you might discover that autovacuum would
> do this cleanup before you even got to looking at the data.

I think autovacuum did hit these tables after slony copied them (I
remember seeing it running). Would the hint bits be set during a
reindex?  For example, the indexing slony does after the initial copy?
I'm not sure if slony commits the transaction before it does the
reindex. It probably doesn't.
> And cranking up autovacuum has its own
> downsides.  This situation shows up a lot when you're benchmarking
> things, but not as much in the real world, so it's hard to justify
> improving.

Actually, I think I have been running into this situation already.  There
were many reports that ran much faster the second time around than the
first, and I assumed that was just because the data was in the memory
cache.  Now I'm thinking I was running into this hint bit behavior.

Re: big select is resulting in a large amount of disk writing by kjournald

From: "Kevin Grittner"
Date:
Joseph S <jks@selectacast.net> wrote:
> I just installed a shiny new database server with pg 8.4.1 running
> on CentOS 5.4. After using slony to replicate over my database I
> decided to do some basic performance tests to see how spiffy my
> shiny new server is.  This machine has 32G ram, over 31 of which
> is used for the system file cache.
>
> So I run "select count(*) from large_table" and I see in xosview a
> solid block of write activity. Runtime is 28125.644 ms for the
> first run.  The second run does not show a block of write activity
> and takes 3327.441 ms

As others have mentioned, this is due to hint bit updates, and doing
an explicit VACUUM after the load and before you start using the
database will avoid run-time issues.  You also need statistics, so
be sure to do VACUUM ANALYZE.

There is one other sneaky surprise awaiting you, however.  Since
this stuff was all loaded with a narrow range of transaction IDs,
they will all need to be frozen at about the same time; so somewhere
down the road, either during a routine database vacuum or possibly
in the middle of normal operations, all of these rows will need to
be rewritten *again* to change the transaction IDs used for managing
MVCC to the special "frozen" value.  We routinely follow a load with
VACUUM FREEZE ANALYZE of the database to combine the update to
freeze the tuples with the update to set the hint bits and avoid
this problem.
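
As a sketch, that post-load step is a single command run once the copy
finishes:

-- freeze tuples and set hint bits in one pass, and gather statistics
VACUUM FREEZE ANALYZE;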

There has been some talk about possibly writing tuples in a frozen
state with the hint bits already set if they are loaded in the same
database transaction which creates the table, but I'm not aware of
anyone currently working on this.

-Kevin

Re: big select is resulting in a large amount of disk writing by kjournald

From: Scott Carey
Date:
On 12/10/09 8:41 AM, "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

>
> There has been some talk about possibly writing tuples in a frozen
> state with the hint bits already set if they are loaded in the same
> database transaction which creates the table, but I'm not aware of
> anyone currently working on this.
>

Wow, that would be nice.  That would cut in half the time it takes to
restore a several-TB db (3 days down to 1.5 here).

I assume this would help a lot of "CREATE TABLE AS SELECT ..." use cases
too.  That is often the fastest way to do a large update on a table, but
it can still be annoyingly write-intensive.
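
(A rough sketch of that rewrite-instead-of-update pattern; table and
column names are placeholders, and index recreation is elided:)

BEGIN;
CREATE TABLE t_new AS
    SELECT id, val * 1.1 AS val    -- the "large update", applied on the fly
    FROM t;
-- recreate indexes and constraints on t_new here
DROP TABLE t;
ALTER TABLE t_new RENAME TO t;
COMMIT;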

