Discussion: Slow Vacuum (was: vacuum output question)
To follow up on an old thread that I started - I had a customer with a system where manual vacuum runs were taking a very long time. I was seeing output like this:

INFO: "cpe": found 95498 removable, 18757 nonremovable row versions in 11117 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 280173 unused item pointers.
0 pages are entirely empty.
CPU 5.35s/0.99u sec elapsed 724.38 sec.

Then, running vacuum again immediately afterward, on a system that was basically idle, would take nearly the same amount of time to vacuum the table.

Getting a copy of the database from the customer, loading it onto my Postgres system, and running the vacuum would result in runs that took less than a second (as expected).

General opinion was that it was a disk I/O problem. We rebooted the system, and magically, the problem went away.

As tends to be the case with "magically" fixed problems, this one is back. Now I have a different customer running Postgres 8.1 on Fedora Core 6, and their system produced the log snippet above. hdparm shows disk I/O throughput being perfectly normal on their system. Everything else on the system seems to be working correctly. The reboot solution doesn't help; vacuum still runs painfully slow.

I'm still waiting for a copy of their postgresql.conf file, but I'm guessing that almost everything was left at the default settings.

I don't suppose anyone knows of any bugs between Postgres 8.1 and older Linux kernels that led to behavior like this? I can't really just ask them to upgrade their OS and/or Postgres on the hunch that the problem should go away... And I haven't yet been able to reproduce anything like this on a system that I have control over.

Thanks for any ideas.

Dan
On Tue, Dec 30, 2008 at 9:10 AM, Dan Armbrust <daniel.armbrust.list@gmail.com> wrote:
> To follow up on an old thread that I started - I had a customer with
> a system where manual vacuum runs were taking a very long time.
> I was seeing output like this:
>
> INFO: "cpe": found 95498 removable, 18757 nonremovable row versions in 11117 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> There were 280173 unused item pointers.
> 0 pages are entirely empty.
> CPU 5.35s/0.99u sec elapsed 724.38 sec.
>
> Then, running vacuum again immediately afterward, on a system that was
> basically idle, would take nearly the same amount of time to
> vacuum the table.

You do realize that except for the end of a table, vacuum recovers no actual space; it just makes it available for new tuples to move in there. So it makes sense that the second vacuum would take just as long, or nearly so. CLUSTER, or VACUUM FULL followed by REINDEX, will regain the lost space.

> Getting a copy of the database from the customer, loading it onto
> my Postgres system, and running the vacuum would result in runs that
> took less than a second (as expected).
>
> General opinion was that it was a disk I/O problem. We rebooted the
> system, and magically, the problem went away.

Hard to say. Have them run

vmstat 1

while vacuuming and see what bi/bo look like.
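(A minimal sketch of the space-reclaiming step, assuming the database is reachable as "yourdb"; the table name "cpe" comes from the vacuum output above. Both commands take exclusive locks on the table, so they belong in a maintenance window.)

    # reclaim the bloat (exclusive locks on the table while these run)
    psql -d yourdb -c 'VACUUM FULL VERBOSE cpe;'
    psql -d yourdb -c 'REINDEX TABLE cpe;'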
>> INFO: "cpe": found 95498 removable, 18757 nonremovable row versions in 11117 pages
>> DETAIL: 0 dead row versions cannot be removed yet.
>> There were 280173 unused item pointers.
>> 0 pages are entirely empty.
>> CPU 5.35s/0.99u sec elapsed 724.38 sec.
>>
>> Then, running vacuum again immediately afterward, on a system that was
>> basically idle, would take nearly the same amount of time to
>> vacuum the table.
>
> You do realize that except for the end of a table, vacuum recovers no
> actual space; it just makes it available for new tuples to move in there.
> So it makes sense that the second vacuum would take just as long, or
> nearly so.

Yep. The real issue is that it took 724 seconds instead of, say, half a second, like it does on my system. I wasn't sure whether to expect vacuum to take longer when it finds a large number of tuples it needs to make available, so I have them run it twice so I can easily compare against the time it takes with essentially no work to do.

> Hard to say. Have them run
>
> vmstat 1
>
> while vacuuming and see what bi/bo look like.

Haven't looked at that yet on this particular system. Last time, on different hardware, when this occurred the vmstat 'wa' column showed very large values while vacuum was running. I don't recall what the bi/bo columns indicated. top also showed very high load averages while vacuum was running, but basically no CPU use.

Are there any common tools that could do a better disk benchmark than hdparm -Tt?

Thanks, Dan
>> INFO: "cpe": found 95498 removable, 18757 nonremovable row versions in 11117 pages
>> DETAIL: 0 dead row versions cannot be removed yet.
>> There were 280173 unused item pointers.
>> 0 pages are entirely empty.
>> CPU 5.35s/0.99u sec elapsed 724.38 sec.
>
> How many idle transactions are there?

Not sure. I don't expect that there were many; their system isn't very highly loaded compared to most of our customers' systems.

The way that they reported the problem to us was that if they enable autovacuum, whenever it runs (about 4 times an hour), their system stops processing the things it needs to process for several minutes, due to table lock contention. Their workaround had been to run a daily autovacuum at the lowest-load time of day, to cause the least disruption.
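(A quick way to check for that, sketched against the 8.1-era pg_stat_activity columns; current_query is only populated when stats_command_string is enabled.)

    -- backends sitting idle inside a transaction keep VACUUM from removing dead rows
    SELECT procpid, usename, query_start, current_query
    FROM pg_stat_activity
    WHERE current_query = '<IDLE> in transaction';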
On Tue, Dec 30, 2008 at 9:32 AM, Dan Armbrust <daniel.armbrust.list@gmail.com> wrote:
> Haven't looked at that yet on this particular system. Last time, on
> different hardware, when this occurred the vmstat 'wa' column showed
> very large values while vacuum was running. I don't recall what the
> bi/bo columns indicated.

Definitely sounds like poor I/O.

> top also showed very high load averages while vacuum was running, but
> basically no CPU use.

Yeah, load is the number of things running or waiting to run. If vacuum is sucking up all the I/O, and this machine doesn't have much I/O capability, then it's quite possible for other processes stuck behind it to crank up the load factor.

Also, were there any vacuum cost delay settings over 0 on this machine when the test was run?

> Are there any common tools that could do a better disk benchmark than
> hdparm -Tt?

Keep in mind, hdparm hits the drive directly, not through the filesystem. I use bonnie++ or iozone to test I/O.
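(A minimal bonnie++ invocation, with an assumed path and size: point -d at a directory on the same filesystem as the Postgres data directory, make -s (in MB) at least twice RAM so the run isn't served from cache, and -u names the user to run as.)

    bonnie++ -d /var/lib/pgsql/bench -s 4096 -u postgres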
On Tue, Dec 30, 2008 at 9:47 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Keep in mind, hdparm hits the drive directly, not through the
> filesystem. I use bonnie++ or iozone to test I/O.

Also dd and vmstat together.
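(For example, a rough sketch with an arbitrary test file path and size: push a couple of gigabytes through the filesystem with dd and compare the bo column in vmstat against what you see while vacuum runs.)

    # ~2 GB sequential write, flushed to disk before dd reports its rate
    dd if=/dev/zero of=/tmp/ddtest bs=8k count=262144 conv=fdatasync
    rm /tmp/ddtest

    # in a second terminal while dd (or the vacuum) runs
    vmstat 1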
On Tue, Dec 30, 2008 at 10:37:04AM -0600, Dan Armbrust wrote:
> The way that they reported the problem to us was that if they enable
> autovacuum, whenever it runs (about 4 times an hour), their system stops
> processing the things it needs to process for several minutes, due to
> table lock contention.

If that's true, there are a couple possibilities for why. One of them is that autovacuum needs tuning. Another is that certain tables aren't really good candidates for autovacuum (prior to 8.3, this is pretty common). Another is that they're starved for I/O; you need to check that when processing and vacuuming are happening, not the rest of the time. Another is that they're doing something like explicit locking or very large numbers of UPDATEs in one transaction, and vacuum is just highlighting a problem with what they're doing.

> Their workaround had been to run a daily autovacuum at the lowest-load
> time of day, to cause the least disruption.

What is a "daily autovacuum"? It sounds like some tables just need vacuuming more often. If they find that the system is not responsive during that, it tells us that they need more disk bandwidth or that they need to integrate vacuuming some tables with their program.

A

--
Andrew Sullivan
ajs@crankycanuck.ca
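(On 8.1 and 8.2, per-table autovacuum tuning is a row in the pg_autovacuum system catalog; a sketch with illustrative thresholds, where -1 means "fall back to the server-wide default".)

    -- make autovacuum visit the hot table more often, and throttle it a little
    INSERT INTO pg_autovacuum
      (vacrelid, enabled, vac_base_thresh, vac_scale_factor,
       anl_base_thresh, anl_scale_factor, vac_cost_delay, vac_cost_limit)
    VALUES
      ('cpe'::regclass, true, 500, 0.05, 250, 0.05, 20, -1);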
>> Their workaround had been to run a daily autovacuum at the lowest-load
>> time of day, to cause the least disruption.
>
> What is a "daily autovacuum"? It sounds like some tables just need
> vacuuming more often. If they find that the system is not responsive
> during that, it tells us that they need more disk bandwidth or that
> they need to integrate vacuuming some tables with their program.

Sorry, I meant a daily manual vacuum.

On paper, their hardware is plenty fast for their workload. Out of hundreds of sites, all running the same software putting load on the database, this is only the second time we have seen this odd behaviour of very slow vacuums.

I guess I was hoping that someone would be able to chime in and say: yes, in so-and-so version, we fixed an obscure bug that sometimes caused huge slowdowns, perhaps when combined with certain Linux kernels. It was a nice dream anyway :)

iozone looks useful. I'll see if I can get on their system and do some proper benchmarks.

Thanks, Dan
On Tue, Dec 30, 2008 at 10:14 AM, Dan Armbrust <daniel.armbrust.list@gmail.com> wrote:
> On paper, their hardware is plenty fast for their workload. Out of
> hundreds of sites, all running the same software putting load on the
> database, this is only the second time we have seen this odd
> behaviour of very slow vacuums.
>
> I guess I was hoping that someone would be able to chime in and say:
> yes, in so-and-so version, we fixed an obscure bug that sometimes
> caused huge slowdowns, perhaps when combined with certain Linux
> kernels. It was a nice dream anyway :)

This actually kinda sounds like one of those RAID cards that every so often decides to check the consistency of the RAID array, and when it's doing that, the whole system slows to a crawl.
Here is an interesting new datapoint.

Modern Ubuntu distro - PostgreSQL 8.1. SATA drive. No RAID. Cannot reproduce slow vacuum performance; vacuums take less than a second for the whole database.

Reinstall OS - Fedora Core 6 - PostgreSQL 8.1. Push data through PostgreSQL for a couple hours (same as above) and now vacuum reports this:

INFO: vacuuming "public.cpe"
INFO: index "pk_cpe" now contains 50048 row versions in 2328 pages
DETAIL: 415925 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.05s/0.33u sec elapsed 0.94 sec.
INFO: index "ix_cpe_ispid" now contains 50090 row versions in 1338 pages
DETAIL: 415925 index row versions were removed.
953 index pages have been deleted, 0 are currently reusable.
CPU 0.27s/0.22u sec elapsed 8.93 sec.
INFO: index "ix_cpe_enable" now contains 50676 row versions in 1637 pages
DETAIL: 415925 index row versions were removed.
1161 index pages have been deleted, 0 are currently reusable.
CPU 0.45s/0.31u sec elapsed 14.01 sec.
INFO: "cpe": removed 415925 row versions in 10844 pages
DETAIL: CPU 1.48s/0.25u sec elapsed 35.86 sec.
INFO: "cpe": found 415925 removable, 50003 nonremovable row versions in 10849 pages
DETAIL: 6 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 2.40s/1.18u sec elapsed 61.13 sec.

It took 61 seconds to vacuum, and the number of index row versions removed was huge. We then issued a reindex command for the entire database, and now the vacuum times are back down under a second.

What on earth could be going on between PostgreSQL 8.1 and Fedora 6 that is bloating and/or corrupting the indexes like this?

Thanks, Dan
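(In command terms, a database-wide reindex followed by a re-vacuum looks like this; the database name is assumed, and REINDEX takes exclusive locks on the tables it rebuilds.)

    REINDEX DATABASE yourdb;
    VACUUM VERBOSE cpe;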
Dan Armbrust wrote:
> What on earth could be going on between PostgreSQL 8.1 and Fedora 6
> that is bloating and/or corrupting the indexes like this?

Postgres 8.1 was slow to vacuum btree indexes. My guess is that your indexes are so bloated that it takes a lot of time to scan them.

I think the solution here is to vacuum this table far more often.

--
Alvaro Herrera                        http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
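(A rough way to see that bloat is to compare index size against the table in pg_class; relpages is only as fresh as the last VACUUM or ANALYZE, and the object names below are the ones from the vacuum output above.)

    SELECT relname, relpages, reltuples
    FROM pg_class
    WHERE relname IN ('cpe', 'pk_cpe', 'ix_cpe_ispid', 'ix_cpe_enable');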
On Tue, Jan 6, 2009 at 1:39 PM, Dan Armbrust <daniel.armbrust.list@gmail.com> wrote:
> Here is an interesting new datapoint.
>
> Modern Ubuntu distro - PostgreSQL 8.1. SATA drive. No RAID. Cannot
> reproduce slow vacuum performance; vacuums take less than a second
> for the whole database.
>
> Reinstall OS - Fedora Core 6 - PostgreSQL 8.1. Push data through
> PostgreSQL for a couple hours (same as above) and now vacuum reports
> this:

Are you pushing the same amount of data through the Ubuntu server? If not, then the comparison is invalid; if so, then yeah, there's some kind of difference between the platforms.

Note that Fedora Core 6 is quite old compared to Ubuntu 8.04 or 8.10. Also it's more likely to be installed on older and/or slower equipment.
On Tue, Jan 6, 2009 at 3:01 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Postgres 8.1 was slow to vacuum btree indexes. My guess is that your
> indexes are so bloated that it takes a lot of time to scan them.
>
> I think the solution here is to vacuum this table far more often.

Actually, the customer-reported problem is that when they enable autovacuum, the performance basically tanks, because vacuum runs so slowly that they can't bear to have it run frequently.

Though perhaps they had bloated indexes before they started autovacuum, and it never fixed them. Perhaps it will behave properly if we do a reindex and then enable autovacuum.
> On Tue, Jan 6, 2009 at 1:39 PM, Dan Armbrust
> <daniel.armbrust.list@gmail.com> wrote:
>> Modern Ubuntu distro - PostgreSQL 8.1. SATA drive. No RAID. Cannot
>> reproduce slow vacuum performance; vacuums take less than a second
>> for the whole database.
>>
>> Reinstall OS - Fedora Core 6 - PostgreSQL 8.1. Push data through
>> PostgreSQL for a couple hours (same as above) and now vacuum reports
>> this:
>
> Are you pushing the same amount of data through the Ubuntu server? If
> not, then the comparison is invalid; if so, then yeah, there's some
> kind of difference between the platforms.
>
> Note that Fedora Core 6 is quite old compared to Ubuntu 8.04 or 8.10.
> Also it's more likely to be installed on older and/or slower
> equipment.

Yep - actually, we pushed much more data through the Ubuntu system and could never reproduce the problem. On the Fedora Core 6 system, the problem happened very quickly.

In our testing here, the Ubuntu test was on the same hardware as the Fedora Core 6 system (not just identical, but the same actual box).

It seems that there is some sort of bad interaction between some part of the older OS and PostgreSQL. We have also seen what appears to be the same issue on a CentOS 4.4 system, which has a rather similar package level to Fedora Core 6.
On Tuesday 06 January 2009, "Dan Armbrust" <daniel.armbrust.list@gmail.com> wrote:
> What on earth could be going on between PostgreSQL 8.1 and Fedora 6
> that is bloating and/or corrupting the indexes like this?

Obviously the choice of operating system has no impact on the contents of your index.

A better question might be: what did your application or maintenance procedures do differently in the different tests?

--
Alan
> Obviously the choice of operating system has no impact on the contents of
> your index.
>
> A better question might be: what did your application or maintenance
> procedures do differently in the different tests?
>
> --
> Alan

Our problem for a long time has been assuming the "obvious". But we now have tests that show otherwise.

I'm now thinking something along the lines of an obscure filesystem or kernel interaction bug that was perhaps corrected in newer releases of the OS.

Now that we can finally reproduce the problem in house, we are still doing more tests to figure out the specifics: does the problem go away with Postgres 8.3, with ext2/ext3/ReiserFS, etc.
On Tue, Jan 6, 2009 at 2:07 PM, Dan Armbrust <daniel.armbrust.list@gmail.com> wrote:
> Actually, the customer-reported problem is that when they enable
> autovacuum, the performance basically tanks, because vacuum runs so
> slowly that they can't bear to have it run frequently.

Actually this is kinda backwards. What's happening is that the vacuum uses up so much I/O that nothing else can get through. The answer is to make it run slower, by use of autovacuum_vacuum_cost_delay: set it to 10 or 20 and see if they can then run autovacuum during the day without these issues. Note that vacuum was improved a fair bit from 8.1 to 8.2, and even more so from 8.2 to 8.3.

> Though perhaps they had bloated indexes before they started
> autovacuum, and it never fixed them. Perhaps it will behave properly
> if we do a reindex and then enable autovacuum.

Definitely look at the cost_delay setting. It makes a huge difference.
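(In postgresql.conf terms the suggestion looks roughly like this; the 20 ms value is an illustrative starting point, not a measured recommendation.)

    autovacuum = on
    autovacuum_vacuum_cost_delay = 20   # ms to sleep when the cost limit is hit; -1 falls back to vacuum_cost_delay
    autovacuum_vacuum_cost_limit = -1   # -1 falls back to vacuum_cost_limit
    #vacuum_cost_delay = 0              # raising this above 0 throttles manual VACUUM too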
"Dan Armbrust" <daniel.armbrust.list@gmail.com> writes: > INFO: "cpe": found 415925 removable, 50003 nonremovable row versions > in 10849 pages > What on earth could be going on between PostgreSQL 8.1 and Fedora 6 > that is bloating and/or corrupting the indexes like this? You're focusing on the indexes when the problem is dead table rows. It's very hard to believe that there's any OS dependence as such involved in that. I wonder whether (a) the Ubuntu and Fedora packages you're using are the same 8.1.x point release; (b) if there's any interesting non-default behavior built into the Ubuntu packaging ... like running autovacuum automatically, for instance. regards, tom lane
On Tue, Jan 6, 2009 at 3:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> You're focusing on the indexes when the problem is dead table rows.
>
> It's very hard to believe that there's any OS dependence as such
> involved in that. I wonder whether (a) the Ubuntu and Fedora packages
> you're using are the same 8.1.x point release; (b) there's any
> interesting non-default behavior built into the Ubuntu packaging
> ... like running autovacuum automatically, for instance.

In our testing, Postgres 8.1 was built from source (PostgreSQL website source) on both systems. No distro packages involved.

Believe me, we are as baffled as you. We have been chasing this bug off and on for months, on a couple of different customer sites now.

Thanks, Dan
I'm no closer to a solution, but here are some additional data points - all taken on Fedora Core 6.

Postgres 8.1 built from source. Autovacuum disabled. Create empty database.
Run our load on the system for 2 hours to populate and exercise the database.
Run vacuum. Takes more than a minute.
Run vacuum immediately again. Takes more than a minute.
Reindex the database. Takes 10 seconds.
Run vacuum again. Takes 2 seconds.
Allow load to run on the system for 30 minutes.
Run vacuum again. Takes more than a minute.

Postgres 8.3 built from source. Autovacuum disabled. Create empty database.
Run our load on the system for 2 hours to populate and exercise the database.
Run vacuum. Takes more than a minute.
Run vacuum immediately again. Takes 15 seconds.
Run vacuum immediately again. Takes 15 seconds.
Reindex the database. Takes 10 seconds.
Run vacuum again. Takes 2 seconds.

So, PostgreSQL 8.3 shows better behaviour, but it is still showing some sort of performance issue, which a reindex fixes.

And then of course, the kicker is that we can't recreate any of these issues when running the same exact test, on the same exact hardware, but using a different underlying OS. When we were running under a modern Ubuntu, the vacuum never took more than 2 seconds.

We will be checking other OSs soon. I guess if we can't figure out what is causing it, we can at least isolate the distros that we need to tell our customers not to use (or to schedule a reindex if they insist on not upgrading their OS).
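(In command terms, the timing steps above amount to roughly the following; the database name is assumed.)

    time psql -d yourdb -c 'VACUUM VERBOSE;'
    time psql -d yourdb -c 'VACUUM VERBOSE;'      # immediately again, system otherwise idle
    time psql -d yourdb -c 'REINDEX DATABASE yourdb;'
    time psql -d yourdb -c 'VACUUM VERBOSE;'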
On Wed, Jan 7, 2009 at 10:26 AM, Dan Armbrust <daniel.armbrust.list@gmail.com> wrote:
> I'm no closer to a solution, but here are some additional data points
> - all taken on Fedora Core 6.

So, what does vmstat 10 say when you're running the "long" vacuum on each system?
On PostgreSQL 8.1, while a long vacuum is running, the output of vmstat 10 looks like this (sorry, can't format this very well in this e-mail client):

 r  b   swpd   free  buff   cache   si  so    bi    bo   in     cs us sy id wa st
 5  2    112  53732  4388 1163404    0   0 13524  1322  941  75053 45 55  0  0  0
 3  2    112  51596  2336 1160032    0   0 20731   732  999 196116 35 58  0  7  0
 3  2    112  53784  2448 1155388    0   0  5729  1680  716 114442 46 52  0  2  0
 7  1    112  52088  2568 1156796    0   0   205  1919  546  28929 56 44  0  0  0
 3  2    112  51772  2652 1157028    0   0   162  2010  534   2845 56 44  0  0  0
 3  2    112  52296  2736 1156332    0   0   174  1910  517   2648 53 47  0  0  0
 2  2    112  52736  2832 1155388    0   0   173  1960  521   2698 57 43  0  0  0
 6  1    112  51172  2908 1157400    0   0   173  1825  511   2621 52 48  0  0  0
 2  1    112  52156  2984 1155876    0   0   152  1766  495   2669 53 47  0  0  0
 5  2    112  53828  3064 1154028    0   0   124  1762  498   2673 54 46  0  0  0
 5  0    112  53396  3140 1154684    0   0   122  1803  532   2871 52 48  0  0  0
 6  1    112  52004  3240 1155892    0   0   150  2004  522   3712 56 44  0  0  0
 5  2    112  53100  3348 1152364    0   0   128  2142  543   2711 54 38  0  8  0
 1  2    112  56704  3428 1148132    0   0   102  2229  560   2925 56 44  0  0  0
 2  1    112  54876  3500 1150080    0   0    91  2129  598   3068 65 35  0  0  0
 3  1    112  53196  3588 1152164    0   0    90  2091  542   2803 57 43  0  0  0
 8  2    112  53528  3664 1151868    0   0   106  2180  553   2947 55 39  0  6  0

Vacuum hasn't stopped yet at this point; it's been running for many minutes.

On PostgreSQL 8.3, we had this:

 r  b   swpd   free  buff   cache   si  so    bi    bo   in     cs us sy id wa st
 6  0   9604  51932 14276 1138996    0   0     3  1808  523   5115 45  9 42  4  0
 2  0   9580  53284 14124 1138092   10   0   564  2561  585   5126 53 10 19 18  0
 7  0   9564  53412 14144 1137696    0   0     0  1682  497   4985 42  9 48  2  0
 3  0   9532  53320 13880 1137940    6   0    37  1757  538   4979 48  8 41  3

More random data:

On an overnight test of PostgreSQL 8.3 on Fedora Core 6, with autovacuum on (using all default settings), the amount of time that it takes to run a manual vacuum on a constant-sized database (only row updates, or deletes followed by a replacement add) has doubled from 2 seconds to 4 seconds. A reindex brings the time back down to 2 seconds.

Dan
On Thu, Jan 8, 2009 at 10:43 AM, Dan Armbrust <daniel.armbrust.list@gmail.com> wrote:
> On PostgreSQL 8.1, while a long vacuum is running, the output of
> vmstat 10 looks like this (sorry, can't format this very well in this
> e-mail client):
>
> [vmstat output snipped]
>
> Vacuum hasn't stopped yet at this point; it's been running for many minutes.
>
> On PostgreSQL 8.3, we had this:
>
> [vmstat output snipped]
>
> More random data:
>
> On an overnight test of PostgreSQL 8.3 on Fedora Core 6, with
> autovacuum on (using all default settings), the amount of time that it
> takes to run a manual vacuum on a constant-sized database (only row
> updates, or deletes followed by a replacement add) has doubled from 2
> seconds to 4 seconds. A reindex brings the time back down to 2
> seconds.

Well, your throughput on this machine is horrible. It looks like with 8.1 all your time is sys + user CPU for your CPUs, while with 8.3 you've got more idle and more I/O wait, which tells me that 8.3 is smarter about vacuuming, so it's spending less time working the CPUs and more time waiting for the I/O subsystem.

Either way, getting only 2 or so megs a second of writes is pretty bad. I can get those numbers from a laptop. An older laptop, like a single-core 1.6GHz Pentium M based T42 or something. My laptop, which is new from last year, is about twice as fast as your server in terms of I/O. I'm running 8.3.5 btw.
Here's vmstat 10 output while vacuuming a db:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free  buff   cache   si  so   bi    bo   in   cs us sy id wa
 0  1  45104  27228 26876 1806996    0   0  132  3545 1009 4531 17  4 50 29
 0  1  45104  31680 26804 1803040    0   0  108  2144 1012 2187 15  3 39 44
 7  0  45104  27120 26916 1807800    0   0  103  4439 1057 5018 17  4 37 42
 2  1  45104  25732 26972 1808868    0   0   92  4174 1049 4439 16  4 37 43
 0  1  45104  32928 26988 1801116    0   0   97  3709 1029 3254 17  3 34 45

pgbench gives me numbers like this:

 0  1  45116  29920 26772 1810308    0   0  132  4540  932 2100 16  3 41 40
 1  2  45116  31956 26732 1807844    0   0  380  8864  897 1824 12  4 32 52

You could replace that machine with something much faster for the price of an upgrade or two on the Dell. I spent a lot of effort in the last year trying to increase the performance of our Dell 1950s and finally gave up and just replaced them as db servers. One decent-sized dual quad-core Opteron can run the same load as 5 other database servers, while running under a much lower load and with better performance than any of the 5 dbs it replaced.

We had bought a 1950 last year with an MD-1000 before I started, and it had cost about $12k in total for 7 SAS drives, the controller, and the original 1950. The new db server to replace it cost $500 less, had 16 SAS drives, a much faster Areca 1680 RAID controller, and 8 Opteron cores. It is literally 10 to 40 times faster depending on the workload.

The 1850s we have are half as fast as our 1950s as db servers, so I'm pretty sure you could run faster with only a handful of SAS drives or Intel X25-Ms.
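(For what it's worth, a pgbench run of that sort looks something like this; the scale factor and client count here are assumptions, not the values actually used.)

    # one-time initialization at scale factor 100 (roughly 1.5 GB of data)
    pgbench -i -s 100 pgbench
    # 8 concurrent clients, 25000 transactions each, with vmstat 10 in another terminal
    pgbench -c 8 -t 25000 pgbench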
Well, I can confirm the problem is caused by indexes here as well - I do a reindex twice a month because of that (and because of general performance problems caused by large indexes). Maybe it is time to focus on index reusability - since HOT is already there, and autovacuum does the job too.
> Well, your throughput on this machine is horrible. It looks like with
> 8.1 all your time is sys + user CPU for your CPUs, while with 8.3 you've
> got more idle and more I/O wait, which tells me that 8.3 is smarter
> about vacuuming, so it's spending less time working the CPUs and more
> time waiting for the I/O subsystem.
>
> Either way, getting only 2 or so megs a second of writes is pretty bad. I
> can get those numbers from a laptop. An older laptop, like a single-core
> 1.6GHz Pentium M based T42 or something. My laptop, which is new
> from last year, is about twice as fast as your server in terms of I/O.

This is my problem in a nutshell. As of yet, I have no rational explanation for this performance. The servers in question are not slow. This particular server never shows this problem when running a newer OS, but I have not yet finished isolating which OSs have problems on this hardware. No other software on the system exhibits any sort of I/O issue other than PostgreSQL.

I have customers with $20K servers that can't handle the workload that I can handle on an old cruddy laptop. However, much of the appeal of our product is low per-site installation costs, so expensive servers don't fit into the mix.

Random futzing - reindexing, manual full vacuums, rebooting the server - each of these has cleared the error condition on one site or another and allowed the system to go back to functioning the way it should for months, until the problem randomly crops up again.

I'm still looking into it, but at the same time, we have enough workarounds to the issue now (scheduled reindex, install a newer OS, upgrade to Postgres 8.3) that this is becoming a low-priority mystery rather than the high-priority one it has been.

Thanks for your thoughts,
Dan