Discussion: big select is resulting in a large amount of disk writing by kjournald
I just installed a shiny new database server with pg 8.4.1 running on CentOS 5.4. After using slony to replicate over my database I decided to do some basic performance tests to see how spiffy my shiny new server is. This machine has 32G ram, over 31 of which is used for the system file cache.

So I run "select count(*) from large_table" and I see in xosview a solid block of write activity. Runtime is 28125.644 ms for the first run. The second run does not show a block of write activity and takes 3327.441 ms.

top shows that this writing is being done by kjournald. What is going on here? There is not a lot of write activity on this server, so there should not be a significant number of dirty cache pages that kjournald would need to write out before it could read in my table. Certainly in the 31G being used for file cache there should be enough non-dirty pages that could be dropped to read in my table w/o having to flush anything to disk. My table size is 2,870,927,360 bytes.

# cat /proc/sys/vm/dirty_expire_centisecs
2999

I restarted postgres and ran a count(*) on an even larger table.

[local]=> explain analyze select count(*) from et;
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6837051.82..6837051.83 rows=1 width=0) (actual time=447240.157..447240.157 rows=1 loops=1)
   ->  Seq Scan on et  (cost=0.00..6290689.25 rows=218545025 width=0) (actual time=5.971..400326.911 rows=218494524 loops=1)
 Total runtime: 447240.402 ms
(3 rows)

Time: 447258.525 ms

[local]=> explain analyze select count(*) from et;
                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6837113.44..6837113.45 rows=1 width=0) (actual time=103011.724..103011.724 rows=1 loops=1)
   ->  Seq Scan on et  (cost=0.00..6290745.95 rows=218546995 width=0) (actual time=9.844..71629.497 rows=218496012 loops=1)
 Total runtime: 103011.832 ms
(3 rows)

Time: 103012.523 ms

[local]=> select pg_relation_size('et');
 pg_relation_size
------------------
      33631543296
(1 row)

I posted xosview snapshots from the two runs at: http://www.tupari.net/2009-12-9/ This time the first run showed a mix of read/write activity instead of the solid write I saw before.
Re: big select is resulting in a large amount of disk writing by kjournald
From
Kenneth Marshall
Date:
Hint bit I/O?

Ken

On Wed, Dec 09, 2009 at 01:29:00PM -0500, Joseph S wrote:
> So I run "select count(*) from large_table" and I see in xosview a solid
> block of write activity. Runtime is 28125.644 ms for the first run. The
> second run does not show a block of write activity and takes 3327.441 ms
>
> top shows that this writing is being done by kjournald. What is going on
> here? There is not a lot of write activity on this server so there should
> not be a significant number of dirty cache pages that kjournald would need
> to write out before it could read in my table.
Joseph S wrote:
> So I run "select count(*) from large_table" and I see in xosview a
> solid block of write activity. Runtime is 28125.644 ms for the first
> run. The second run does not show a block of write activity and takes
> 3327.441 ms

http://wiki.postgresql.org/wiki/Hint_Bits

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
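For what it's worth, here is a minimal sketch that reproduces the effect that wiki page describes on a scratch table; the table name, row count, and the use of CHECKPOINT are purely illustrative assumptions, not something from the original report:

-- Hypothetical scratch table, ~10M rows (CHECKPOINT needs superuser):
CREATE TABLE hint_bit_demo AS
    SELECT g AS id FROM generate_series(1, 10000000) g;
CHECKPOINT;   -- flush the bulk-load writes so they don't get mixed into the test

\timing on
SELECT count(*) FROM hint_bit_demo;   -- first scan: pages are dirtied to set hint bits, so writes appear
SELECT count(*) FROM hint_bit_demo;   -- second scan: hint bits already set, read-only

DROP TABLE hint_bit_demo;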
Greg Smith wrote:
> Joseph S wrote:
>> So I run "select count(*) from large_table" and I see in xosview a
>> solid block of write activity. Runtime is 28125.644 ms for the first
>> run. The second run does not show a block of write activity and takes
>> 3327.441 ms
> http://wiki.postgresql.org/wiki/Hint_Bits

Hmm. A large select results in a lot of writes? This seems broken. And if we are setting these hint bits, then what do we need VACUUM for? Is there any way to tune this behavior? Is there any way to get stats on how many rows/pages would need hint bits set?
Joseph S wrote:
> Greg Smith wrote:
>> Joseph S wrote:
>>> So I run "select count(*) from large_table" and I see in xosview a
>>> solid block of write activity. Runtime is 28125.644 ms for the first
>>> run. The second run does not show a block of write activity and
>>> takes 3327.441 ms
>> http://wiki.postgresql.org/wiki/Hint_Bits
>
> Hmm. A large select results in a lot of writes? This seems broken.
> And if we are setting these hint bits then what do we need VACUUM
> for? Is there any way to tune this behavior? Is there any way to get
> stats on how many rows/pages would need hint bits set?

Basically, the idea is that if you're pulling a page in for something else that requires you to compute the hint bits, just do it now so VACUUM doesn't have to later, while you're in there anyway. Why make VACUUM do the work later if you're doing part of it now anyway?

If you reorganize your test to VACUUM first *before* running the "select count(*) from...", you'll discover the writes during SELECT go away. You're running into the worst-case behavior. For example, if you inserted a bunch of things more slowly, you might discover that autovacuum would do this cleanup before you even got to looking at the data.

There's no tuning for the behavior beyond making autovacuum more aggressive (to improve the odds it will get there first), and no visibility into what's happening either. And cranking up autovacuum has its own downsides. This situation shows up a lot when you're benchmarking things, but not as much in the real world, so it's hard to justify improving.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
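A sketch of the reorganized test described above, using the table name from the original post; the exact timings and whether autovacuum has already visited part of the table are assumptions that will vary per site:

VACUUM ANALYZE large_table;          -- sets hint bits (and refreshes statistics) while it scans the table

\timing on
SELECT count(*) FROM large_table;    -- first read after the VACUUM: no hint-bit writes, so kjournald stays quiet
SELECT count(*) FROM large_table;    -- second read: same, and likely faster once the table is cached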
Greg Smith wrote:
> Joseph S wrote:
>> Greg Smith wrote:
>>> Joseph S wrote:
>>>> So I run "select count(*) from large_table" and I see in xosview a
>>>> solid block of write activity. Runtime is 28125.644 ms for the first
>>>> run. The second run does not show a block of write activity and
>>>> takes 3327.441 ms
>>> http://wiki.postgresql.org/wiki/Hint_Bits
>>
>> Hmm. A large select results in a lot of writes? This seems broken.
>> And if we are setting these hint bits then what do we need VACUUM
>> for? Is there any way to tune this behavior? Is there any way to get
>> stats on how many rows/pages would need hint bits set?
>
> Basically, the idea is that if you're pulling a page in for something
> else that requires you to compute the hint bits, just do it now so
> VACUUM doesn't have to later, while you're in there anyway. Why make
> VACUUM do the work later if you're doing part of it now anyway? If you

Then why not do all the work the VACUUM does? What does VACUUM do anyway?

> reorganize your test to VACUUM first *before* running the "select count(*)
> from...", you'll discover the writes during SELECT go away. You're
> running into the worst-case behavior. For example, if you inserted a
> bunch of things more slowly, you might discover that autovacuum would do
> this cleanup before you even got to looking at the data.

I think autovacuum did hit these tables after slony copied them (I remember seeing them running). Would the hint bits be set during a reindex? For example the indexing slony does after the initial copy? I'm not sure if slony commits the transaction before it does the reindex. It probably doesn't.

> downsides. This situation shows up a lot when you're benchmarking
> things, but not as much in the real world, so it's hard to justify
> improving.

Actually I think I have been running into this situation. There were many reports that ran much faster the second time around than the first, and I assumed it was just because the data was in memory cache. Now I'm thinking I was running into this.
Re: big select is resulting in a large amount of disk writing by kjournald
From
"Kevin Grittner"
Date:
Joseph S <jks@selectacast.net> wrote:
> I just installed a shiny new database server with pg 8.4.1 running
> on CentOS 5.4. After using slony to replicate over my database I
> decided to do some basic performance tests to see how spiffy my
> shiny new server is. This machine has 32G ram, over 31 of which
> is used for the system file cache.
>
> So I run "select count(*) from large_table" and I see in xosview a
> solid block of write activity. Runtime is 28125.644 ms for the
> first run. The second run does not show a block of write activity
> and takes 3327.441 ms

As others have mentioned, this is due to hint bit updates, and doing an explicit VACUUM after the load and before you start using the database will avoid run-time issues. You also need statistics, so be sure to do VACUUM ANALYZE.

There is one other sneaky surprise awaiting you, however. Since this stuff was all loaded with a narrow range of transaction IDs, they will all need to be frozen at about the same time; so somewhere down the road, either during a routine database vacuum or possibly in the middle of normal operations, all of these rows will need to be rewritten *again* to change the transaction IDs used for managing MVCC to the special "frozen" value. We routinely follow a load with VACUUM FREEZE ANALYZE of the database to combine the update that freezes the tuples with the update that sets the hint bits, and avoid this problem.

There has been some talk about possibly writing tuples in a frozen state with the hint bits already set if they are loaded in the same database transaction which creates the table, but I'm not aware of anyone currently working on this.

-Kevin
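A sketch of the post-load step described above; whether you run it database-wide or per table, and whether you can afford the extra I/O right after the load, are assumptions that depend on your setup:

-- Database-wide, right after the load/replication finishes:
VACUUM FREEZE ANALYZE;               -- freezes tuples, sets hint bits, and gathers planner statistics in one pass

-- Or limited to a relation the load actually touched (table name from the original post):
VACUUM FREEZE ANALYZE large_table;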
On 12/10/09 8:41 AM, "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
>
> There has been some talk about possibly writing tuples in a frozen
> state with the hint bits already set if they are loaded in the same
> database transaction which creates the table, but I'm not aware of
> anyone currently working on this.
>

Wow, that would be nice. That would cut in half the time it takes to restore a several TB db (3 days to 1.5 here).

I assume this would help a lot of "CREATE TABLE AS SELECT ..." use cases too. That is often the fastest way to do a large update on a table, but it can still be annoyingly write intensive.

> -Kevin