Discussion: Performance question - 83 GB table, 150 million rows, distinct select
Linux F12 64bit
Postgres 8.4.4
16 proc / 32GB
8 disk 15KRPM SAS/Raid 5 (I know!)

shared_buffers = 6000MB
#temp_buffers = 8MB
max_prepared_transactions = 0
work_mem = 250MB
maintenance_work_mem = 1000MB

We now have about 180mill records in that table. The database size is about 580GB, and the userstats table - which is the biggest one and the one we query the most - is 83GB.

Just a basic query takes 4 minutes:

For e.g. select count(distinct uid) from userstats where log_date > '11/7/2011'

Since we are looking for distinct values we can't obviously use an index. But I'm wondering what should be expected and what is caused by tuning or the lack thereof? Doing an iostat I see maybe 10-15%, however the CPU that this query is attached to is obviously in the 99-100% busy arena. Or am I really IO-bound for this single query (sure, lots of data, but?!).

It takes roughly 5.5 hours to do a concurrent re-index, and this DB is vacuumed nightly.

Just not sure if this is what to expect; however, there are many other DBs out there bigger than ours, so I'm curious: what can I do?
Thanks
Tory

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.41    0.00    0.20    1.61    0.00   96.78

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda              92.17      3343.06      1718.85 46273281004 23791660544

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.47    0.00    0.61    5.85    0.00   92.07

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda             417.50     90372.00         0.00      180744           0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.88    0.00    0.76    6.34    0.00   90.03

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda             725.00    183560.00       148.00      367120         296

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.18    0.00    0.60    3.59    0.00   93.63

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda             711.00    179952.00       240.00      359904         480

[blue@adb01 ~]$ iostat -xd 2
Linux 2.6.32.26-175.fc12.x86_64 (adb01)   11/16/2011   _x86_64_   (16 CPU)

Device:  rrqm/s  wrqm/s     r/s    w/s     rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.18  191.40   68.71  23.45    3343.22  1718.85    54.92     0.12   4.61   2.05  18.94
sda        2.00    0.00  706.50   8.00  178832.00   128.00   250.47    77.76  31.21   1.40 100.00
sda        4.98   17.41  584.58  35.32  148497.51   672.64   240.64    38.04 227.07   1.61  99.55
sda        3.50    0.00  688.50   2.00  174556.00    32.00   252.84     2.81   4.66   1.44  99.30
sda        1.00   10.00  717.50   1.50  182084.00    92.00   253.37     2.43   3.37   1.38  99.45
^C

[blue@]$ iostat 2
Linux 2.6.32.26-175.fc12.x86_64 (adb01)   11/16/2011   _x86_64_   (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.41    0.00    0.20    1.61    0.00   96.78

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda              92.17      3343.33      1718.85 46277115652 23791678248

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.79    0.00    0.51    8.51    0.00   83.20

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda             235.50     45168.00         0.00       90336           0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.90    0.00    0.35    4.46    0.00   89.29

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda             160.00     14688.00       132.00       29376         264

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.01    0.00    0.51   12.80    0.00   78.67

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda             163.50     11324.00       700.00       22648        1400
On November 16, 2011 02:53:17 PM Tory M Blue wrote:
> We now have about 180mill records in that table. The database size is
> about 580GB and the userstats table which is the biggest one and the
> one we query the most is 83GB.
>
> Just a basic query takes 4 minutes:
>
> For e.g. select count(distinct uid) from userstats where log_date
> >'11/7/2011'
>
> Just not sure if this is what to expect, however there are many other
> DB's out there bigger than ours, so I'm curious what can I do?

That query should use an index on log_date if one exists - unless the planner thinks it would need to look at too much of the table.

Also, the normal approach to making large statistics tables more manageable is to partition them by date range.
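On 8.4, the date-range partitioning suggested here would be done with table inheritance plus CHECK constraints. A minimal sketch only - the child/index names and the monthly granularity are assumptions, not anything from the thread:

```sql
-- Parent table; the children hold the actual rows.
CREATE TABLE userstats (
    uid      integer,
    log_date date
);

-- One child per month, with a CHECK constraint the planner can use
-- to skip the partition entirely.
CREATE TABLE userstats_2011_11 (
    CHECK (log_date >= DATE '2011-11-01' AND log_date < DATE '2011-12-01')
) INHERITS (userstats);

CREATE INDEX idx_userstats_2011_11_logdate ON userstats_2011_11 (log_date);

-- Route inserts on the parent into the right child (a rule is the
-- simplest form; a trigger works too):
CREATE RULE userstats_insert_2011_11 AS
    ON INSERT TO userstats
    WHERE (NEW.log_date >= DATE '2011-11-01' AND NEW.log_date < DATE '2011-12-01')
    DO INSTEAD INSERT INTO userstats_2011_11 VALUES (NEW.*);

-- With constraint_exclusion = partition (the 8.4 default), a query like
--   SELECT count(DISTINCT uid) FROM userstats WHERE log_date > '2011-11-07';
-- scans only the children whose CHECK constraints can match.
```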
On Wed, Nov 16, 2011 at 4:27 PM, Alan Hodgson <ahodgson@simkin.ca> wrote:
> On November 16, 2011 02:53:17 PM Tory M Blue wrote:
>> We now have about 180mill records in that table. The database size is
>> about 580GB and the userstats table which is the biggest one and the
>> one we query the most is 83GB.
>>
>> Just a basic query takes 4 minutes:
>>
>> For e.g. select count(distinct uid) from userstats where log_date
>> >'11/7/2011'
>>
>> Just not sure if this is what to expect, however there are many other
>> DB's out there bigger than ours, so I'm curious what can I do?
>
> That query should use an index on log_date if one exists. Unless the planner
> thinks it would need to look at too much of the table.

Agreed. We'd need to know how selective that where clause is. Seeing some forced index usage versus a regular explain analyze would be useful, i.e.:

set enable_seqscan=off;
explain analyze select ...

> Also, the normal approach to making large statistics tables more manageable is
> to partition them by date range.

If the OP's considering partitioning, they should really consider upgrading to 9.1, which has much better performance of things like aggregates against partitioned tables.
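Spelled out against the query from the original post, the comparison above looks like this (a diagnostic only - enable_seqscan is a planner hint for the session, and should be reset afterwards):

```sql
-- Forbid sequential scans so the planner is pushed toward the index:
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT count(DISTINCT uid) FROM userstats WHERE log_date > '2011-11-07';

-- Restore the default and get the plan the planner actually prefers:
RESET enable_seqscan;
EXPLAIN ANALYZE
SELECT count(DISTINCT uid) FROM userstats WHERE log_date > '2011-11-07';
```

Comparing the two "actual time" totals shows whether the planner's choice was right, and how selective the log_date condition really is.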
On Wed, Nov 16, 2011 at 3:32 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> If the OP's considering partitioning, they should really consider
> upgrading to 9.1 which has much better performance of things like
> aggregates against partition tables.
Could you elaborate on this a bit, or point me at some docs? I manage a 600GB db which does almost nothing but aggregates on partitioned tables - the largest of which has approx 600 million rows across all partitions. Grouping in the aggregates tends to be on the partition column and rarely, if ever, would a group cross multiple partitions. We're on 9.0 and could definitely use some performance gains.
Hi.

On 16 Listopad 2011, 23:53, Tory M Blue wrote:
> We now have about 180mill records in that table. The database size is
> about 580GB and the userstats table which is the biggest one and the
> one we query the most is 83GB.
>
> Just a basic query takes 4 minutes:
>
> For e.g. select count(distinct uid) from userstats where log_date
> >'11/7/2011'
>
> Since we are looking for distinct we can't obviously use an index. But
> I'm wondering what should be expected and what is caused be tuning or
> lack there of? Doing an iostat I see maybe 10-15%, however the cpu
> that this query is attached to is obviously in the 99-100% busy arena.
> Or am I really IOBound for this single query (sure lots of data
> but?!).

What do you mean by "can't use an index"? The query may use an index to evaluate the WHERE condition, no matter if there's a distinct or not. The index-only scans that might be used to speed up this query are committed in 9.2 - but even those might use the index both for a plain count and a count distinct.

But you're right - you're not bound by I/O (although I don't know what those 15% are - iowait, util or what?). The COUNT(DISTINCT) has to actually keep all the distinct values to determine which are actually distinct.

If there's enough memory (work_mem) to keep all the values, this may be done using a hash table (hash aggregate). Otherwise it has to sort them. You can see this in the explain plan (which you haven't posted). Anyway, this is actually rather CPU intensive - how much exactly depends on the data type. Comparing integers is much easier / cheaper than comparing text values. What data type is the 'uid' column?

> It takes roughly 5.5 hours to do a concurrent re-index and this DB is
> vac'd nightly.
>
> Just not sure if this is what to expect, however there are many other
> DB's out there bigger than ours, so I'm curious what can I do?

Well, not much. Use an integer data type for the 'uid' column (unless you're already using it). Then you can use more work_mem so that a hash aggregate is used (maybe it's already used - we need to see the explain plan to check).

Then you could precompute the distinct values somehow - for example, if there are only a few distinct values for each day, you could do something like this every day (aggregating yesterday's slice):

INSERT INTO userstats_distinct
SELECT DISTINCT date_trunc('day', log_date), uid
  FROM userstats
 WHERE log_date >= date_trunc('day', now()) - interval '1 day'
   AND log_date <  date_trunc('day', now());

and then just

SELECT COUNT(DISTINCT uid) FROM userstats_distinct
 WHERE log_date > '11/7/2011';

The point is to preaggregate the data to the desired granularity (e.g. day); how much it improves the performance depends on how much the amount of data decreases.

Another option is to use estimates instead of exact results - I've actually written an extension for that, maybe you'll find it useful. It's available on github (https://github.com/tvondra/distinct_estimators) and pgxn (http://pgxn.org/tag/estimate/). I've posted a brief description here:

http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/

and the current extensions actually perform much better. It's not that difficult to reach 1% precision. Let me know if this is interesting for you and if you need help with the extensions.

Tomas
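A related rewrite worth noting: count(DISTINCT ...) itself always sorts its input inside the Aggregate node in 8.4, but pushing the DISTINCT into a subquery lets the planner choose a hash aggregate when work_mem is large enough. A sketch using the query from the thread:

```sql
-- Equivalent result to count(DISTINCT uid), but the inner DISTINCT can
-- be planned as a HashAggregate when work_mem can hold the uid set:
SELECT count(*) FROM (
    SELECT DISTINCT uid
      FROM userstats
     WHERE log_date > '2011-11-07'
) AS distinct_uids;
```

Whether this actually wins depends on the number of distinct uids versus work_mem; EXPLAIN ANALYZE on both forms will show which path the planner takes.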
Thanks all,

I misspoke on our use of the index. We do have an index on log_date and it is being used. Here is the explain analyze plan:

'Aggregate  (cost=7266186.16..7266186.17 rows=1 width=8) (actual time=127575.030..127575.030 rows=1 loops=1)'
'  ->  Bitmap Heap Scan on userstats  (cost=135183.17..7240890.38 rows=10118312 width=8) (actual time=8986.425..74815.790 rows=33084417 loops=1)'
'        Recheck Cond: (log_date > '2011-11-04'::date)'
'        ->  Bitmap Index Scan on idx_userstats_logdate  (cost=0.00..132653.59 rows=10118312 width=0) (actual time=8404.147..8404.147 rows=33084417 loops=1)'
'              Index Cond: (log_date > '2011-11-04'::date)'
'Total runtime: 127583.898 ms'

Partitioning tables: this is used primarily when you are usually accessing only a part of the data. We want our queries to go across the entire date range, so we don't really meet the criteria for partitioning (had to do some quick research).

Thanks again
Tory
On Wed, Nov 16, 2011 at 4:59 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> But you're right - you're not bound by I/O (although I don't know what are
> those 15% - iowait, util or what?). The COUNT(DISTINCT) has to actually
> keep all the distinct values to determine which are actually distinct.

Actually I meant to comment on this: he is IO bound. Look at %util, it's at 99 or 100.

Also, if you have 16 cores and look at something like vmstat you'll see a 6% wait state. That 6% represents one CPU core waiting for IO; the other cores will add up the rest to 100%.
On Wed, Nov 16, 2011 at 4:52 PM, Samuel Gendler <sgendler@ideasculptor.com> wrote:
> Could you elaborate on this a bit, or point me at some docs? I manage a
> 600GB db which does almost nothing but aggregates on partitioned tables -
> the largest of which has approx 600 million rows across all partitions.
> grouping in the aggregates tends to be on the partition column and rarely,
> if ever, would a group cross multiple partitions. We're on 9.0 and could
> definitely use some performance gains.

It's covered in the release notes for 9.1:

http://developer.postgresql.org/pgdocs/postgres/release-9-1.html
On 11/16/2011 04:53 PM, Tory M Blue wrote:
> Linux F12 64bit
> Postgres 8.4.4
> 16 proc / 32GB
> 8 disk 15KRPM SAS/Raid 5 (I know!)
>
> shared_buffers = 6000MB
> #temp_buffers = 8MB
> max_prepared_transactions = 0
> work_mem = 250MB
> maintenance_work_mem = 1000MB
>
> We now have about 180mill records in that table. The database size is
> about 580GB and the userstats table which is the biggest one and the
> one we query the most is 83GB.
>
> Just a basic query takes 4 minutes:
>
> For e.g. select count(distinct uid) from userstats where log_date>'11/7/2011'

How would you feel about keeping a monthly summary table? Update it daily, with only a day's worth of stats; then you could query the summary table much faster.

That's what I do for my website stats. I log details for a month, then summarize everything into a summary table, and blow away the details. You wouldn't have to delete the details if you wanted them - just keeping the summary table updated would be enough.

-Andy
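The summary-table idea above might be sketched like this. The table and column names (and the daily granularity) are assumptions for illustration, not the OP's schema:

```sql
-- One row per (day, uid); small compared to the raw log.
CREATE TABLE userstats_daily (
    day  date    NOT NULL,
    uid  integer NOT NULL,
    hits bigint  NOT NULL,
    PRIMARY KEY (day, uid)
);

-- Run once per day (e.g. from cron) to fold in yesterday's rows:
INSERT INTO userstats_daily (day, uid, hits)
SELECT date_trunc('day', log_date)::date, uid, count(*)
  FROM userstats
 WHERE log_date >= date_trunc('day', now()) - interval '1 day'
   AND log_date <  date_trunc('day', now())
 GROUP BY 1, 2;

-- Reports then hit the much smaller summary; this is still an exact
-- distinct count, since every (day, uid) pair appears at most once:
SELECT count(DISTINCT uid) FROM userstats_daily WHERE day > DATE '2011-11-07';
```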
On 17 Listopad 2011, 2:57, Scott Marlowe wrote:
> On Wed, Nov 16, 2011 at 4:59 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>> But you're right - you're not bound by I/O (although I don't know what are
>> those 15% - iowait, util or what?). The COUNT(DISTINCT) has to actually
>> keep all the distinct values to determine which are actually distinct.
>
> Actually I meant to comment on this, he is IO bound. Look at % Util,
> it's at 99 or 100.
>
> Also, if you have 16 cores and look at something like vmstat you'll
> see 6% wait state. That 6% represents one CPU core waiting for IO,
> the other cores will add up the rest to 100%.

Aaaah, I keep forgetting about this and I somehow ignored the iostat results too. Yes, he's obviously IO bound. But this actually means that pre-aggregating the data (as I described in my previous post) would probably help him even more (less data, less CPU).

Tomas
On Wed, Nov 16, 2011 at 6:27 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 17 Listopad 2011, 2:57, Scott Marlowe wrote:
>> Actually I meant to comment on this, he is IO bound. Look at % Util,
>> it's at 99 or 100.
>>
>> Also, if you have 16 cores and look at something like vmstat you'll
>> see 6% wait state. That 6% represents one CPU core waiting for IO,
>> the other cores will add up the rest to 100%.
>
> Aaaah, I keep forgetting about this and I somehow ignored the iostat
> results too. Yes, he's obviously IO bound.

I'm not so sure on the io-bound. Been battling/reading about it all day. 1 CPU is pegged at 100%, but the disk is not. If I do something else via another CPU I have no issues accessing the disks - writing/deleting/reading. It appears that what was said about this being very CPU intensive makes more sense to me. The query is only using 1 CPU, and that appears to be getting overwhelmed.

%util: This number depicts the percentage of time that the device spent servicing requests. On a large query, or something that is taking a while, it's going to be hitting the disk all the time, and I'm thinking that is what the util is telling me, especially since iowait is in the 10-15% range. Again, just trying to absorb.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.93    0.00    0.60    9.84    0.00   88.62

Device:  rrqm/s  wrqm/s     r/s    w/s    rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00   86.50 3453.00   1.50  55352.00    16.00    16.03     5.24   0.66   0.29 100.00

I mean, await time and service time are in the 0.29 to 0.66 ms range - that doesn't read as IO-bound to me. But I'm more than willing to learn something not totally Postgres specific. I just don't see it...

An average queue size of 2.21 to 6 - that's really not a ton of stuff "waiting":

Device:  rrqm/s  wrqm/s     r/s    w/s    rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    3.50 3060.00   2.00  49224.00    20.00    16.08     2.21   0.76   0.33  99.95

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.80    0.00    0.51   11.01    0.00   87.68

Device:  rrqm/s  wrqm/s     r/s    w/s    rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    5.00 3012.50   3.00  48200.00    92.00    16.01     2.11   0.74   0.33  99.95

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.93    0.00    0.60    9.84    0.00   88.62

Device:  rrqm/s  wrqm/s     r/s    w/s    rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00   86.50 3453.00   1.50  55352.00    16.00    16.03     5.24   0.66   0.29 100.00
On Wed, Nov 16, 2011 at 7:42 PM, Tory M Blue <tmblue@gmail.com> wrote:
> I'm not so sure on the io-bound. Been battling/reading about it all
> day. 1 CPU is pegged at 100%, but the disk is not. If I do something

Look here in iostat:

> Device:  rrqm/s  wrqm/s     r/s    w/s    rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
> sda        0.00    3.50 3060.00   2.00  49224.00    20.00    16.08     2.21   0.76   0.33  99.95

See that last column - it's % utilization. Once it hits 100% you are anywhere from pretty close to IO bound to right on past it.

I agree with the previous poster: you should roll these up ahead of time into a materialized view for fast reporting.
On Wed, Nov 16, 2011 at 8:02 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> See that last column, it's % utilization. Once it hits 100% you are
> anywhere from pretty close to IO bound to right on past it.
>
> I agree with the previous poster, you should roll these up ahead of
> time into a materialized view for fast reporting.

A followup: a good tool to see how your machine is running over time is the sar command, with the sysstat service running and collecting data. You can get summary views of the last x weeks rolled up in 5 minute increments, on all kinds of system metrics.
On Wed, Nov 16, 2011 at 7:02 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> See that last column, it's % utilization. Once it hits 100% you are
> anywhere from pretty close to IO bound to right on past it.
>
> I agree with the previous poster, you should roll these up ahead of
> time into a materialized view for fast reporting.

Ya, I'm getting mixed opinions on that. Avg queue size is nothing, and await and svctime are nothing, so maybe I'm on the edge, but it's not, at face value, the cause of the slow query times. I think the data structure is; however, as it seems I need to query against all the data, I'm unclear how to best set that up. Partitioning is not the answer, it seems.
On 17 Listopad 2011, 4:16, Tory M Blue wrote:
> Ya I'm getting mixed opinions on that. avg queue size is nothing and
> await and svctime is nothing, so maybe I'm on the edge, but it's not
> "at face value", the cause of the slow query times. I think the data
> structure is, however as it seems I need to query against all the
> data, I'm unclear how to best set that up. Partitioning is not the
> answer it seems.

What do you mean by "nothing"? There are 3060 reads/s, and servicing each one takes 0.33 ms - that means the drive is 100% utilized.

The problem with the iostat results you've posted earlier is that they use either "-xd" or none of those switches. That means you can't see CPU stats and extended I/O stats at the same time - use just "-x" next time.

Anyway, the results show that "%iowait" is about 6% - as Scott Marlowe pointed out, this means 1 core is waiting for I/O. That's the core running your query. Try to execute the query 16x and you'll see the iowait is 100%.

I'm not sure I understand what you mean by accessing all the data. You can do that with partitioning too, although the execution plan may not be as efficient as with a plain table. Try to partition the data by date (a partition for each day / week) - my impression is that you're querying data by date, so this is a "natural" partitioning.

Anyway, what I recommended in my previous post was intelligent reduction of the data - imagine, for example, there are 1000 unique visitors and each of them does 1000 actions per day. That means 1,000,000 rows. What you can do is aggregate the data by user (at the end of the day, thus processing just the single day), i.e. something like this:

SELECT uid, count(*) FROM users WHERE log_date ... GROUP BY uid

and store it in a table "users_aggregated". This table has just 1000 rows (one for each user), so it's 1000x smaller. But you can do this

SELECT COUNT(DISTINCT uid) FROM users_aggregated

and you'll get exactly the correct result.

Tomas
Tory,

A seq scan across 83GB in 4 minutes is pretty good. That's over 300MB/s. Even if you assume that 1/3 of the table was already cached, that's still over 240MB/s. Good disk array.

Either you need an index, or you need to not do this query at user request time. Or a LOT more RAM.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Wed, Nov 16, 2011 at 7:47 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> What do you mean by "nothing"? There are 3060 reads/s, servicing each one
> takes 0.33 ms - that means the drive is 100% utilized.
>
> The problem with the iostat results you've posted earlier is that they
> either use "-xd" or none of those switches. That means you can't see CPU
> stats and extended I/O stats at the same time - use just "-x" next time.
>
> Anyway the results show that "%iowait" is about 6% - as Scott Marlowe
> pointed out, this means 1 core is waiting for I/O. That's the core running
> your query. Try to execute the query 16x and you'll see the iowait is
> 100%.

Yes, this I understand, and it is correct. But I'm wrestling with the idea that the disk is completely saturated. I've seen cases where I actually run into high iowait and see that load climbs as processes stack up.

I'm not arguing (please know this) - I appreciate the help and will try almost anything that is offered here - but I think if I just threw money at the situation (hardware), I wouldn't get any closer to resolution of my issue. I am very interested in other solutions and more DB structure changes etc.

Thanks!
Tory
On Wed, Nov 16, 2011 at 9:19 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Tory,
>
> A seq scan across 83GB in 4 minutes is pretty good. That's over
> 300MB/s. Even if you assume that 1/3 of the table was already cached,
> that's still over 240mb/s. Good disk array.
>
> Either you need an index, or you need to not do this query at user
> request time. Or a LOT more RAM.

Thanks Josh,

That's also the other scenario: what is expected. Maybe the 4 minutes - which turns into 5.5 hours or 23 hours for a report - is just standard based on our data and sizing. Then it's about stopping the chase and starting to look at tuning, or redesign if possible, to allow reports to finish in a timely fashion. The data is going to grow a tad still, but reporting requirements are on the rise.

You folks are the right place to seek answers from; I just need to make sure I'm giving you the information that will allow you to assist/help me.

Memory is not expensive these days, so it's possible that I bump the server to 192GB or whatever to give me the headroom, but we are trying to dig a tad deeper into the data/queries/tuning before I go the hardware route again.

Tory
On Thu, Nov 17, 2011 at 12:23 AM, Tory M Blue <tmblue@gmail.com> wrote:
> Yes this I understand and is correct. But I'm wrestling with the idea
> that the Disk is completely saturated. I've seen where I actually run
> into high IO/Wait and see that load climbs as processes stack.
>
> I'm not arguing (please know this), I appreciate the help and will try
> almost anything that is offered here, but I think if I just threw
> money at the situation (hardware), I wouldn't get any closer to
> resolution of my issue. I am very interested in other solutions and
> more DB structure changes etc.

But remember, you're doing all that in a single query. Your disk subsystem might be able to deliver even more *throughput* if it were given many more concurrent requests. A big RAID10 is really good at handling multiple concurrent requests, but it's pretty much impossible to saturate a big RAID array with only a single read stream.

With a single query, the query can only run as fast as the single stream of requests can be satisfied. And as the next read is issued as soon as the previous one is done (the kernel readahead/buffering of the seq scan helps here), your iostat is going to show 100% util, because there is always the next read "in progress", even if the average queue size is low (1).

If you had a 24-spindle array, you could add another 20 queries, and you would see the queue size go up, but the util would still only be 100%; latency would stay about the same, even though your throughput could be 20 times greater.

So, as long as you have a single query scanning that entire 83GB table, and that table has to come from disk (i.e. not cached kernel buffers in ram), you're going to be limited by the amount of time it takes to read that table in 8K chunks.

Options for improving it are:

1) Making sure your array/controller/kernel are doing the maximum read-ahead/buffering possible, to make reading that 83GB as quick as possible

2) Changing the query to not need to scan all 83GB

#2 is where you're going to see orders-of-magnitude differences in performance, and there are lots of options there. But because there are so many options, and so many variables in what type of other queries, inserts, updates, and deletes are done on the data, no one of them is necessarily "the best" for everyone. But if you have the ability to alter queries/schema slightly, you've got lots of avenues to explore ;-) And folks here are more than willing to offer advice and options that may be *very* fruitful.

1) Multicolumn index (depending on insert/update/delete patterns)
2) Partition by date (depending on query types)
3) Rollup views of history (depending on query types)
4) Trigger based mat-view style rollups (depending on insert/update/delete patterns coupled with query types)

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
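Option 4 above might look something like the following sketch - hypothetical table, trigger, and function names, maintaining a per-day set of seen uids at insert time so count(distinct) never touches the big table:

```sql
-- One row per (day, uid) ever seen; maintained by trigger.
CREATE TABLE userstats_daily_uids (
    day date    NOT NULL,
    uid integer NOT NULL,
    PRIMARY KEY (day, uid)
);

CREATE OR REPLACE FUNCTION userstats_track_uid() RETURNS trigger AS $$
BEGIN
    -- Record the uid for that day unless it is already present.
    PERFORM 1 FROM userstats_daily_uids
     WHERE day = NEW.log_date::date AND uid = NEW.uid;
    IF NOT FOUND THEN
        INSERT INTO userstats_daily_uids (day, uid)
        VALUES (NEW.log_date::date, NEW.uid);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER userstats_track_uid_trg
    BEFORE INSERT ON userstats
    FOR EACH ROW EXECUTE PROCEDURE userstats_track_uid();

-- Exact distinct count over any date range, from the small table:
SELECT count(DISTINCT uid) FROM userstats_daily_uids
 WHERE day > DATE '2011-11-07';
```

Note the existence check can race under concurrent inserts; a production version would catch the unique-violation inside the function (or simply batch this nightly instead, as in option 3).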
On 17 Listopad 2011, 15:17, Aidan Van Dyk wrote:
> With a single query, the query can only run as fast as the single
> stream of requests can be satisfied. And since the next read is issued
> as soon as the previous one completes (the kernel readahead/buffering
> of the seq scan helps here), your iostat is going to show 100% util,
> because there is always a next read "in progress", even if the average
> queue size is low (1). If you had a 24-spindle array, you could add
> another 20 queries and watch the queue size go up, but the util would
> still be only 100% and latency would stay about the same, even though
> your throughput could be 20 times greater.

This is probably the reason why interpreting iostat results is tricky.
It's quite straightforward with a single drive, but once you get to
arrays it's much more difficult. The %util stays at 100%, but that may
actually mean the I/O is not saturated, because some of the drives just
sit there doing nothing. That's why it's important to look at await and
svctm - when "await >> svctm", that's another sign of saturation.

Tory, you've mentioned you're on an 8-drive RAID5, but all the iostat
results you've posted are for "sda". I'm used to "sda" being a regular
drive, not an array - are you sure it's the right device? Are you using
a hardware controller or sw-raid? With sw-based RAID you can easily see
stats for each of the drives (so it's easier to see what's going on in
the array).

> So, as long as you have a single query scanning that entire 83GB
> table, and that table has to come from disk (i.e. not cached kernel
> buffers in ram), you're going to be limited by the amount of time it
> takes to read that table in 8K chunks.

I don't think he's doing that - the explain plan he posted earlier
showed a bitmap index scan, and since the table is a "log" of actions
(just growing, ordered by log_time) that's pretty much the best
available plan.
So it reads only the interesting part of the table (not the whole 83GB),
in a sequential way.

> Options for improving it are:
>
> 1) Multicolumn index (depending on insert/update/delete patterns)

I don't think this is going to help him (unless the query is much more
complicated than presented here). It might become interesting with
index-only scans and an index on (log_time, uid), but that's in 9.2dev.

> 2) Partition by date (depending on query types)

No, this is not going to help him much, since he's already scanning only
the interesting part of the table (thanks to the bitmap index scan). It
might eliminate the first step (reading the index and preparing the
bitmap), but that's not the dominant part of the execution time (it
takes about 8s, while the whole query takes 127s).

> 3) Rollup views of history (depending on query types)

This is probably the most promising path - prebuild some intermediate
results for each day, then aggregate just those intermediate results
later. I've already described some ideas in my previous posts.

> 4) Trigger-based mat-view-style rollups (depending on
> insert/update/delete patterns coupled with query types)

Not sure this can work with count(distinct) - you'd have to keep all
the distinct values (leading back to 3) or maintain counters for every
interval you're interested in (day, hour, ...). Anyway, this is going to
be much more complicated than (3); I wouldn't use it unless I really
wanted continuously updated stats.

Tomas
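[Editor's note: to illustrate why option 4 is the more complicated route, here is a hypothetical sketch of a trigger-maintained per-day uid set for PostgreSQL 8.4 (which lacks INSERT ... ON CONFLICT, hence the exception block); table, function, and trigger names are invented for the example.]

```sql
-- Side table holding each uid at most once per day.
CREATE TABLE daily_uids (
    log_day date    NOT NULL,
    uid     integer NOT NULL,
    PRIMARY KEY (log_day, uid)
);

CREATE OR REPLACE FUNCTION track_daily_uid() RETURNS trigger AS $$
BEGIN
    BEGIN
        -- Record this uid for the row's day; the primary key
        -- rejects duplicates.
        INSERT INTO daily_uids (log_day, uid)
        VALUES (NEW.log_date::date, NEW.uid);
    EXCEPTION WHEN unique_violation THEN
        -- Already seen today; nothing to do.
        NULL;
    END;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER userstats_track_uid
    AFTER INSERT ON userstats
    FOR EACH ROW EXECUTE PROCEDURE track_daily_uid();
```

This keeps the stats continuously up to date, at the cost of extra work on every insert into the hot table - which is exactly the trade-off Tomas warns about versus a nightly rollup.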
On Thu, Nov 17, 2011 at 11:17 AM, Aidan Van Dyk <aidan@highrise.ca> wrote:
> But remember, you're doing all that in a single query. So your disk
> subsystem might be able to deliver even more *throughput* if it were
> given many more concurrent requests. A big RAID10 is really good at
> handling multiple concurrent requests. But it's pretty much impossible
> to saturate a big RAID array with only a single read stream.

The query uses a bitmap heap scan, which means it would benefit from a
high effective_io_concurrency. What's your effective_io_concurrency
setting? A good starting point is the number of spindles in your array,
though I usually use 1.5x that number since it gives me a little more
throughput.

You can also set it on a query-by-query basis, so you don't need to
change the configuration. If you do change postgresql.conf, a reload is
enough to make PG pick it up, so it's an easy thing to try.
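[Editor's note: the per-query experiment suggested above could look like this; the value 12 is a hypothetical 1.5x of the 8 spindles mentioned earlier in the thread.]

```sql
-- Raise effective_io_concurrency for this session only, so the bitmap
-- heap scan can prefetch more pages, then run the problem query.
SET effective_io_concurrency = 12;

SELECT count(DISTINCT uid)
FROM userstats
WHERE log_date > '2011-11-07';

-- Revert to the server default afterwards.
RESET effective_io_concurrency;
```

Since SET only affects the current session, this is safe to try without touching postgresql.conf.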