Discussion: Performance question 83 GB Table 150 million rows, distinct select

Performance question 83 GB Table 150 million rows, distinct select

From
Tory M Blue
Date:
Linux F12 64bit
Postgres 8.4.4
16 proc / 32GB
8 disk 15KRPM SAS/Raid 5 (I know!)


shared_buffers = 6000MB
#temp_buffers = 8MB
max_prepared_transactions = 0
work_mem = 250MB
maintenance_work_mem = 1000MB





We now have about 180mill records in that table. The database size is
about 580GB and the userstats table which is the biggest one and the
one we query the most is 83GB.

Just a basic query takes 4 minutes:

For e.g. select count(distinct uid) from userstats where log_date >'11/7/2011'

Since we are looking for distinct values we obviously can't use an index. But
I'm wondering what should be expected and what is caused by tuning or
the lack thereof? Doing an iostat I see maybe 10-15%, however the CPU
that this query is attached to is obviously in the 99-100% busy arena.
Or am I really I/O bound for this single query (sure, lots of data,
but?!).

It takes roughly 5.5 hours to do a concurrent re-index and this DB is
vac'd nightly.

Just not sure if this is what to expect; however, there are many other
DBs out there bigger than ours, so I'm curious: what can I do?

Thanks
Tory

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.41    0.00    0.20    1.61    0.00   96.78

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              92.17      3343.06      1718.85 46273281004 23791660544

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.47    0.00    0.61    5.85    0.00   92.07

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             417.50     90372.00         0.00     180744          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.88    0.00    0.76    6.34    0.00   90.03

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             725.00    183560.00       148.00     367120        296

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.18    0.00    0.60    3.59    0.00   93.63

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             711.00    179952.00       240.00     359904        480

[blue@adb01 ~]$ iostat -xd 2
Linux 2.6.32.26-175.fc12.x86_64 (adb01)     11/16/2011     _x86_64_    (16 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0.18   191.40   68.71   23.45  3343.22  1718.85
54.92     0.12    4.61   2.05  18.94

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               2.00     0.00  706.50    8.00 178832.00   128.00
250.47    77.76   31.21   1.40 100.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               4.98    17.41  584.58   35.32 148497.51   672.64
240.64    38.04  227.07   1.61  99.55

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               3.50     0.00  688.50    2.00 174556.00    32.00
252.84     2.81    4.66   1.44  99.30

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               1.00    10.00  717.50    1.50 182084.00    92.00
253.37     2.43    3.37   1.38  99.45

^C
[blue@]$ iostat 2
Linux 2.6.32.26-175.fc12.x86_64 (adb01)     11/16/2011     _x86_64_    (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.41    0.00    0.20    1.61    0.00   96.78

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              92.17      3343.33      1718.85 46277115652 23791678248

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.79    0.00    0.51    8.51    0.00   83.20

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             235.50     45168.00         0.00      90336          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.90    0.00    0.35    4.46    0.00   89.29

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             160.00     14688.00       132.00      29376        264

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.01    0.00    0.51   12.80    0.00   78.67

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             163.50     11324.00       700.00      22648       1400

Re: Performance question 83 GB Table 150 million rows, distinct select

From
Alan Hodgson
Date:
On November 16, 2011 02:53:17 PM Tory M Blue wrote:
> We now have about 180mill records in that table. The database size is
> about 580GB and the userstats table which is the biggest one and the
> one we query the most is 83GB.
>
> Just a basic query takes 4 minutes:
>
> For e.g. select count(distinct uid) from userstats where log_date
> >'11/7/2011'
>
> Just not sure if this is what to expect; however, there are many other
> DBs out there bigger than ours, so I'm curious: what can I do?

That query should use an index on log_date if one exists. Unless the planner
thinks it would need to look at too much of the table.

Also, the normal approach to making large statistics tables more manageable is
to partition them by date range.
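
On 8.4 that means inheritance-based partitioning. A minimal sketch, assuming
the table and column names from the thread (everything else here is
illustrative):

CREATE TABLE userstats_2011_11 (
    CHECK (log_date >= DATE '2011-11-01' AND log_date < DATE '2011-12-01')
) INHERITS (userstats);

CREATE INDEX userstats_2011_11_log_date ON userstats_2011_11 (log_date);

-- with constraint_exclusion enabled, a query such as
--   SELECT count(DISTINCT uid) FROM userstats WHERE log_date > '2011-11-07'
-- only scans the child tables whose CHECK constraints can match.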

Re: Performance question 83 GB Table 150 million rows, distinct select

From
Scott Marlowe
Date:
On Wed, Nov 16, 2011 at 4:27 PM, Alan Hodgson <ahodgson@simkin.ca> wrote:
> On November 16, 2011 02:53:17 PM Tory M Blue wrote:
>> We now have about 180mill records in that table. The database size is
>> about 580GB and the userstats table which is the biggest one and the
>> one we query the most is 83GB.
>>
>> Just a basic query takes 4 minutes:
>>
>> For e.g. select count(distinct uid) from userstats where log_date
>> >'11/7/2011'
>>
>> Just not sure if this is what to expect; however, there are many other
>> DBs out there bigger than ours, so I'm curious: what can I do?
>
> That query should use an index on log_date if one exists. Unless the planner
> thinks it would need to look at too much of the table.

Agreed.  We'd need to know how selective that where clause is.  Seeing
some forced index usage versus regular explain analyze would be
useful.  i.e.

set enable_seqscan=off;
explain analyze select ...
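
For the query in question that would be something like:

set enable_seqscan=off;  -- session-local, just for this test
explain analyze select count(distinct uid) from userstats
 where log_date > '2011-11-07';
set enable_seqscan=on;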

> Also, the normal approach to making large statistics tables more manageable is
> to partition them by date range.

If the OP's considering partitioning, they should really consider
upgrading to 9.1, which has much better performance for things like
aggregates against partitioned tables.

Re: Performance question 83 GB Table 150 million rows, distinct select

From
Samuel Gendler
Date:


On Wed, Nov 16, 2011 at 3:32 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

If the OP's considering partitioning, they should really consider
upgrading to 9.1, which has much better performance for things like
aggregates against partitioned tables.


Could you elaborate on this a bit, or point me at some docs? I manage a 600GB db which does almost nothing but aggregates on partitioned tables - the largest of which has approx 600 million rows across all partitions. Grouping in the aggregates tends to be on the partition column and rarely, if ever, would a group cross multiple partitions. We're on 9.0 and could definitely use some performance gains.


Re: Performance question 83 GB Table 150 million rows, distinct select

From
"Tomas Vondra"
Date:
Hi.

On 16 Listopad 2011, 23:53, Tory M Blue wrote:
>
> We now have about 180mill records in that table. The database size is
> about 580GB and the userstats table which is the biggest one and the
> one we query the most is 83GB.
>
> Just a basic query takes 4 minutes:
>
> For e.g. select count(distinct uid) from userstats where log_date
> >'11/7/2011'
>
> Since we are looking for distinct values we obviously can't use an index. But
> I'm wondering what should be expected and what is caused by tuning or
> the lack thereof? Doing an iostat I see maybe 10-15%, however the CPU
> that this query is attached to is obviously in the 99-100% busy arena.
> Or am I really I/O bound for this single query (sure, lots of data,
> but?!).

What do you mean by "can't use an index"? The query may use an index to
evaluate the WHERE condition, whether there's a distinct or not.

The index-only scans that might speed this query up further are
committed for 9.2 - with those, an index might serve both a plain count
and a count distinct.

But you're right - you're not bound by I/O (although I don't know what
those 15% are - iowait, util or what?). The COUNT(DISTINCT) has to actually
keep all the distinct values to determine which are actually distinct.

If there's enough memory (work_mem) to keep all the values, this may be
done using a hash table (hash aggregate). Otherwise it has to sort them.
You can see this in explain plan (which you haven't posted).

Anyway, this is actually rather CPU intensive - how much depends on
the data type. Comparing integers is much easier / cheaper than comparing
text values. What data type is the 'uid' column?

> It takes roughly 5.5 hours to do a concurrent re-index and this DB is
> vac'd nightly.
>
> Just not sure if this is what to expect; however, there are many other
> DBs out there bigger than ours, so I'm curious: what can I do?

Well, not much. Use an integer data type for the 'uid' column (unless
you're already doing that). Then you can use more work_mem so that a hash
aggregate is used (maybe it's already used - we need to see the explain
plan to check).
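
For example (just a sketch - the right value depends on the RAM available,
and work_mem can be set per session):

SET work_mem = '500MB';
EXPLAIN ANALYZE
SELECT count(DISTINCT uid) FROM userstats WHERE log_date > '2011-11-07';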

Then you could precompute the distinct values somehow - for example if
there are only a few distinct values for each day, you could do something
like this every day

INSERT INTO userstats_distinct
SELECT DISTINCT date_trunc('day', log_date), uid FROM userstats
 WHERE log_date >= date_trunc('day', now() - interval '1 day')
   AND log_date <  date_trunc('day', now());

and then just

SELECT COUNT(DISTINCT uid) FROM userstats_distinct
 WHERE log_date > '11/7/2011';


The point is to pre-aggregate the data to the desired granularity (e.g.
day); how much it improves the performance depends on how much the amount
of data decreases.

Another option is to use estimates instead of exact results - I've
actually written an extension for that, maybe you'll find that useful.
It's available on github (https://github.com/tvondra/distinct_estimators)
and pgxn (http://pgxn.org/tag/estimate/). I've posted a brief description
here:

http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/

and the current extensions actually perform much better. It's not that
difficult to reach 1% precision. Let me know if this is interesting for
you and if you need help with the extensions.

Tomas


Re: Performance question 83 GB Table 150 million rows, distinct select

From
Tory M Blue
Date:
Thanks all,  I misspoke on our use of the index.

We do have an index on log_date and it is being used; here is the
explain analyze plan.



'Aggregate  (cost=7266186.16..7266186.17 rows=1 width=8) (actual
time=127575.030..127575.030 rows=1 loops=1)'
'  ->  Bitmap Heap Scan on userstats  (cost=135183.17..7240890.38
rows=10118312 width=8) (actual time=8986.425..74815.790 rows=33084417
loops=1)'
'        Recheck Cond: (log_date > '2011-11-04'::date)'
'        ->  Bitmap Index Scan on idx_userstats_logdate
(cost=0.00..132653.59 rows=10118312 width=0) (actual
time=8404.147..8404.147 rows=33084417 loops=1)'
'              Index Cond: (log_date > '2011-11-04'::date)'
'Total runtime: 127583.898 ms'

Partitioning Tables

This is used primarily when you are usually accessing only a part of
the data. We want our queries to go across the entire date range, so
we don't really meet the criteria for partitioning (had to do some
quick research).

Thanks again
Tory

Re: Performance question 83 GB Table 150 million rows, distinct select

From
Scott Marlowe
Date:
On Wed, Nov 16, 2011 at 4:59 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

> But you're right - you're not bound by I/O (although I don't know what
> those 15% are - iowait, util or what?). The COUNT(DISTINCT) has to actually
> keep all the distinct values to determine which are actually distinct.

Actually I meant to comment on this, he is IO bound.  Look at % Util,
it's at 99 or 100.

Also, if you have 16 cores and look at something like vmstat you'll
see 6% wait state.  That 6% represents one CPU core waiting for IO,
the other cores will add up the rest to 100%.

Re: Performance question 83 GB Table 150 million rows, distinct select

From
Scott Marlowe
Date:
On Wed, Nov 16, 2011 at 4:52 PM, Samuel Gendler
<sgendler@ideasculptor.com> wrote:
> Could you elaborate on this a bit, or point me at some docs?  I manage a
> 600GB db which does almost nothing but aggregates on partitioned tables -
> the largest of which has approx 600 million rows across all partitions.
>  grouping in the aggregates tends to be on the partition column and rarely,
> if ever, would a group cross multiple partitions. We're on 9.0 and could
> definitely use some performance gains.

It's covered in the release notes for 9.1:

http://developer.postgresql.org/pgdocs/postgres/release-9-1.html

Re: Performance question 83 GB Table 150 million rows, distinct select

From
Andy Colson
Date:
On 11/16/2011 04:53 PM, Tory M Blue wrote:
> Linux F12 64bit
> Postgres 8.4.4
> 16 proc / 32GB
> 8 disk 15KRPM SAS/Raid 5 (I know!)
>
>
> shared_buffers = 6000MB
> #temp_buffers = 8MB
> max_prepared_transactions = 0
> work_mem = 250MB
> maintenance_work_mem = 1000MB
>
>
>
>
>
> We now have about 180mill records in that table. The database size is
> about 580GB and the userstats table which is the biggest one and the
> one we query the most is 83GB.
>
> Just a basic query takes 4 minutes:
>
> For e.g. select count(distinct uid) from userstats where log_date>'11/7/2011'
>

How'd you feel about keeping a monthly summary table?  Update it daily, with only a day's worth of stats; then you could
query the summary table much faster.

That's what I do for my website stats.  I log details for a month, then summarize everything into a summary table, and
blow away the details.  You wouldn't have to delete the details if you wanted them; just keeping the summary table
updated would be enough.
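
Something like this, say (a sketch - the rollup table and column names are
made up, adjust to the real schema):

CREATE TABLE userstats_daily (
    day  date    NOT NULL,
    uid  integer NOT NULL,
    hits bigint  NOT NULL,
    PRIMARY KEY (day, uid)
);

-- run once a day (e.g. from cron) to roll up yesterday's rows
INSERT INTO userstats_daily (day, uid, hits)
SELECT date_trunc('day', log_date)::date, uid, count(*)
  FROM userstats
 WHERE log_date >= date_trunc('day', now() - interval '1 day')
   AND log_date <  date_trunc('day', now())
 GROUP BY 1, 2;

-- the report then runs against the much smaller table
SELECT count(DISTINCT uid) FROM userstats_daily WHERE day > '2011-11-07';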

-Andy

Re: Performance question 83 GB Table 150 million rows, distinct select

From
"Tomas Vondra"
Date:
On 17 Listopad 2011, 2:57, Scott Marlowe wrote:
> Actually I meant to comment on this, he is IO bound.  Look at % Util,
> it's at 99 or 100.
>
> Also, if you have 16 cores and look at something like vmstat you'll
> see 6% wait state.  That 6% represents one CPU core waiting for IO,
> the other cores will add up the rest to 100%.

Aaaah, I keep forgetting about this and I somehow ignored the iostat
results too. Yes, he's obviously IO bound.

But this actually means that pre-aggregating the data (as I described in my
previous post) would probably help him even more (less data, less CPU).

Tomas


Re: Performance question 83 GB Table 150 million rows, distinct select

From
Tory M Blue
Date:
On Wed, Nov 16, 2011 at 6:27 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Aaaah, I keep forgetting about this and I somehow ignored the iostat
> results too. Yes, he's obviously IO bound.

I'm not so sure on the io-bound. Been battling/reading about it all
day. 1 CPU is pegged at 100%, but the disk is not. If I do something
else via another CPU I have no issues accessing the disks,
writing/deleting/reading. It appears that what was said about this
being very CPU intensive makes more sense to me. The query is only
using 1 CPU and that appears to be getting overwhelmed.

%util: This number depicts the percentage of time that the device
spent in servicing requests.

On a large query, or something that is taking a while, it's going to be
writing to disk all the time, and I'm thinking that is what the util is
telling me, especially since IOwait is in the 10-15% range.

Again just trying to absorb

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.93    0.00    0.60    9.84    0.00   88.62

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    86.50 3453.00    1.50 55352.00    16.00
16.03     5.24    0.66   0.29 100.00

I mean, await time and service time are in the .29 to .66 msec range; that
doesn't read as I/O bound to me. But I'm more than willing to learn
something not totally Postgres specific.

But I just don't see it...  An average queue size of 2.21 to 6 - that's
really not a ton of stuff "waiting"

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     3.50 3060.00    2.00 49224.00    20.00
16.08     2.21    0.76   0.33  99.95

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.80    0.00    0.51   11.01    0.00   87.68

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     5.00 3012.50    3.00 48200.00    92.00
16.01     2.11    0.74   0.33  99.95

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.93    0.00    0.60    9.84    0.00   88.62

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    86.50 3453.00    1.50 55352.00    16.00
16.03     5.24    0.66   0.29 100.00

Re: Performance question 83 GB Table 150 million rows, distinct select

From
Scott Marlowe
Date:
On Wed, Nov 16, 2011 at 7:42 PM, Tory M Blue <tmblue@gmail.com> wrote:
> I'm not so sure on the io-bound. Been battling/reading about it all
> day. 1 CPU is pegged at 100%, but the disk is not. If I do something

Look here in iostat:

> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
> avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     3.50 3060.00    2.00 49224.00    20.00
> 16.08     2.21    0.76   0.33  99.95

See that last column, it's % utilization.   Once it hits 100% you are
anywhere from pretty close to IO bound to right on past it.

I agree with the previous poster, you should roll these up ahead of
time into a materialized view for fast reporting.

Re: Performance question 83 GB Table 150 million rows, distinct select

From
Scott Marlowe
Date:
A followup.  A good tool to see how your machine is running over time
is the sar command, with the sysstat service running and
collecting data.  You can get summary views of the last x weeks rolled
up in 5-minute increments on all kinds of system metrics.

Re: Performance question 83 GB Table 150 million rows, distinct select

From
Tory M Blue
Date:
On Wed, Nov 16, 2011 at 7:02 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Look here in iostat:
>
>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> avgrq-sz avgqu-sz   await  svctm  %util
>> sda               0.00     3.50 3060.00    2.00 49224.00    20.00
>> 16.08     2.21    0.76   0.33  99.95
>
> See that last column, it's % utilization.   Once it hits 100% you are
> anywhere from pretty close to IO bound to right on past it.
>
> I agree with the previous poster, you should roll these up ahead of
> time into a materialized view for fast reporting.
>
Ya, I'm getting mixed opinions on that. Avg queue size is nothing and
await and svctm are nothing, so maybe I'm on the edge, but it's not,
at face value, the cause of the slow query times. I think the data
structure is; however, as it seems I need to query against all the
data, I'm unclear how to best set that up. Partitioning is not the
answer, it seems.

Re: Performance question 83 GB Table 150 million rows, distinct select

From
"Tomas Vondra"
Date:
On 17 Listopad 2011, 4:16, Tory M Blue wrote:
> Ya, I'm getting mixed opinions on that. Avg queue size is nothing and
> await and svctm are nothing, so maybe I'm on the edge, but it's not,

What do you mean by "nothing"? There are 3060 reads/s, and servicing each
one takes 0.33 ms - that's 3060 x 0.33 = ~1010 ms of service time per
second, which means the drive is 100% utilized.

The problem with the iostat results you've posted earlier is that they
either use "-xd" or none of those switches. That means you can't see CPU
stats and extended I/O stats at the same time - use just "-x" next time.

Anyway the results show that "%iowait" is about 6% - as Scott Marlowe
pointed out, this means 1 core is waiting for I/O. That's the core running
your query. Try to execute the query 16x and you'll see the iowait go to
100%.

> "at face value", the cause of the slow query times. I think the data
> structure is, however as it seems I need to query against all the
> data, I'm unclear how to best set that up. Partitioning is not the
> answer it seems.

I'm not sure I understand what you mean by accessing all the data. You can
do that with partitioning too, although the execution plan may not be as
efficient as with a plain table. Try to partition the data by date (a
partition for each day / week) - my impression is that you're querying
data by date so this is a "natural" partitioning.

Anyway, what I've recommended in my previous post was intelligent reduction
of the data - imagine for example there are 1000 unique visitors and each
of them does 1000 actions per day. That means 1,000,000 rows. What you
can do is aggregate the data by user (at the end of the day, thus
processing just that single day), i.e. something like this

SELECT uid, count(*) FROM users WHERE log_date ... GROUP BY uid

and store this in a table "users_aggregated". This table has just 1000
rows (one for each user), so it's 1000x smaller.

But you can do this

SELECT COUNT(DISTINCT uid) FROM users_aggregated

and you'll get exactly the correct result.


Tomas


Re: Performance question 83 GB Table 150 million rows, distinct select

From
Josh Berkus
Date:
Tory,

A seq scan across 83GB in 4 minutes is pretty good.   That's over
300MB/s.  Even if you assume that 1/3 of the table was already cached,
that's still over 240MB/s.  Good disk array.

Either you need an index, or you need to not do this query at user
request time.  Or a LOT more RAM.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Performance question 83 GB Table 150 million rows, distinct select

From
Tory M Blue
Date:
On Wed, Nov 16, 2011 at 7:47 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Ya, I'm getting mixed opinions on that. Avg queue size is nothing and
>> await and svctm are nothing, so maybe I'm on the edge, but it's not,
>
> What do you mean by "nothing"? There are 3060 reads/s, servicing each one
> takes 0.33 ms - that means the drive is 100% utilized.
>
> The problem with the iostat results you've posted earlier is that they
> either use "-xd" or none of those switches. That means you can't see CPU
> stats and extended I/O stats at the same time - use just "-x" next time.
>
> Anyway the results show that "%iowait" is about 6% - as Scott Marlowe
> pointed out, this means 1 core is waiting for I/O. That's the core running
> your query. Try to execute the query 16x and you'll see the iowait is
> 100%.

Yes this I understand and is correct. But I'm wrestling with the idea
that the Disk is completely saturated. I've seen where I actually run
into high IO/Wait and see that load climbs as processes stack.

I'm not arguing (please know this), I appreciate the help and will try
almost anything that is offered here, but I think if I just threw
money at the situation (hardware), I wouldn't get any closer to
resolution of my issue. I am very interested in other solutions and
more DB structure changes etc.

Thanks !
Tory

Re: Performance question 83 GB Table 150 million rows, distinct select

От
Tory M Blue
Дата:
On Wed, Nov 16, 2011 at 9:19 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Tory,
>
> A seq scan across 83GB in 4 minutes is pretty good.   That's over
> 300MB/s.  Even if you assume that 1/3 of the table was already cached,
> that's still over 240MB/s.  Good disk array.
>
> Either you need an index, or you need to not do this query at user
> request time.  Or a LOT more RAM.

Thanks josh,

That's also the other scenario: what is expected? Maybe the 4 minutes,
which turns into 5.5 hours or 23 hours for a report, is just standard
based on our data and sizing.

Then it's about stopping the chase and starting to look at tuning or
redesign, if possible, to allow reports to finish in a timely
fashion. The data is going to grow a tad still, but reporting
requirements are on the rise.

You folks are the right place to seek answers from; I just need to
make sure I'm giving you the information that will allow you to
assist/help me.

Memory is not expensive these days, so it's possible that I bump the
server to 192GB or whatever to give me the headroom, but we are
trying to dig a tad deeper into the data/queries/tuning before I go
the hardware route again.

Tory

Re: Performance question 83 GB Table 150 million rows, distinct select

From
Aidan Van Dyk
Date:
On Thu, Nov 17, 2011 at 12:23 AM, Tory M Blue <tmblue@gmail.com> wrote:

> Yes this I understand and is correct. But I'm wrestling with the idea
> that the Disk is completely saturated. I've seen where I actually run
> into high IO/Wait and see that load climbs as processes stack.
>
> I'm not arguing (please know this), I appreciate the help and will try
> almost anything that is offered here, but I think if I just threw
> money at the situation (hardware), I wouldn't get any closer to
> resolution of my issue. I am very interested in other solutions and
> more DB structure changes etc.

But remember, you're doing all that in a single query.  So your disk
subsystem might be able to deliver even more *throughput* if it
were given many more concurrent requests.  A big raid10 is really good
at handling multiple concurrent requests.  But it's pretty much
impossible to saturate a big raid array with only a single read
stream.

With a single query, the query can only run as fast as the single
stream of requests can be satisfied.  And as the next read is issued
as soon as the previous one is done (the kernel readahead/buffering of
the seq scan helps here), your iostat is going to show 100% util, because
there is always the next read "in progress", even if the average
queue size is low (1).  If you had a 24-spindle array, you could add
another 20 queries and you would see the queue size go up, but the
util would still only be 100%; latency would stay about the same, even
though your throughput could be 20 times greater.

So, as long as you have a single query scanning that entire 83GB
table, and that table has to come from disk (i.e. not cached kernel
buffers in ram), you're going to be limited by the amount of time it
takes to read that table in 8K chunks.

Options for improving it are:

1) Making sure your array/controller/kernel are doing the maximum
read-ahead/buffering possible to make reading that 83GB as quick as
possible
2) Changing the query to not need to scan all 83GB.

#2 is where you're going to see orders-of-magnitude differences in
performance, and there are lots of options there.  But because there
are so many options, and so many variables in what type of other
queries, inserts, updates, and deletes are done on the data, no one of
them is necessarily "the best" for everyone.

But if you have the ability to alter queries/schema slightly, you've
got lots of avenues to explore ;-)  And folks here are more than
willing to offer advice and options that may be *very* fruitful.

1) Multicolumn index (depending on insert/update/delete patterns) - see
the sketch after this list
2) partition by date (depending on query types)
3) rollup views of history (depending on query types)
4) trigger based mat-view style rollups (depending on
insert/update/delete patterns coupled with query types)
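
As a sketch of option 1 (using the column names from the thread; note that
without the index-only scans coming in 9.2, this may not buy much for this
particular query):

CREATE INDEX CONCURRENTLY idx_userstats_logdate_uid
    ON userstats (log_date, uid);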


a.
--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Performance question 83 GB Table 150 million rows, distinct select

From
"Tomas Vondra"
Date:
On 17 Listopad 2011, 15:17, Aidan Van Dyk wrote:
> With a single query, the query can only run as fast as the single
> stream of requests can be satisfied.  And as the next read is issued
> as soon as the previous is done (the kernel readahead/buffering the
> seq scan helps here), your iostat is going to show 100% util, because
> the there is always the next read "in progress", even if the average
> queue size is low (1).  If you had a 24 spindle array, you could add
> another 20 queries, and you could see the queue size go up, but the
> util would still only be 100%, latency would stay about the same, even
> though your throughput could be 20 times greater.

This is probably the reason why interpreting iostat results is tricky.
It's quite straightforward with a single drive, but once you get to arrays
it's much more difficult. The %util remains 100% but it may actually mean
the I/O is not saturated, because some of the drives just sit there doing
nothing. That's why it's important to look at await and svctm - when
"await >> svctm" it's another sign of saturation.

Tory, you've mentioned you're on an 8-drive RAID5, but all the iostat
results you've posted are about "sda". I'm kinda used to "sda" being a
regular drive, not an array - are you sure it's the right device? Are you
using a controller or sw-raid? With a sw-based RAID you can easily see
stats for each of the drives (so it's easier to see what's going on in
the array).

> So, as long as you have a single query scanning that entire 83GB
> table, and that table has to come from disk (i.e. not cached kernel
> buffers in ram), you're going to be limited by the amount of time it
> takes to read that table in 8K chunks.

I don't think he's doing that - the explain plan he posted earlier showed
a bitmap index scan, and as the table is a "log" of actions (just growing
and ordered by log_date) this is pretty much the best available plan. So
it reads only the interesting part of the table (not the whole 83GB) in a
sequential way.

> Options for improving it are:
>
> 1) Multicolumn index (depending on insert/update/delete patterns)

I don't think this is going to help him (unless the query is much more
complicated than he presented here). This might be interesting with
index-only scans and an index on (log_date, uid), but that's in 9.2dev.

> 2) partition by date (depending on query types)

No, this is not going to help him much as he's already scanning only the
interesting part of the table (thanks to the bitmap index scan). It might
eliminate the first step (reading the index and preparing the bitmap), but
that's not the dominant part of the execution time (it takes about 8s and
the whole query takes 127s).

> 3) rollup views of history (depending on query types)

This is probably the most promising path - prebuild some intermediate
results for a day, aggregate just the intermediate results later. I've
already described some ideas in my previous posts.

> 4) trigger based mat-view style rollups (depending on
> insert/update/delete patterns coupled with query types)

Not sure if this can work with 'count(distinct)' - you'd have to keep all
the distinct values (leading to 3) or maintain the counters for every
interval you're interested in (day, hour, ...). Anyway this is going to be
much more complicated than (3); I wouldn't use it unless I really wanted
continuously updated stats.

Tomas




Re: Performance question 83 GB Table 150 million rows, distinct select

From
Claudio Freire
Date:
On Thu, Nov 17, 2011 at 11:17 AM, Aidan Van Dyk <aidan@highrise.ca> wrote:
> But remember, you're doing all that in a single query.  So your disk
> subsystem might even be able to perform even more *througput* if it
> was given many more concurrent request.  A big raid10 is really good
> at handling multiple concurrent requests.  But it's pretty much
> impossible to saturate a big raid array with only a single read
> stream.

The query uses a bitmap heap scan, which means it would benefit from a
high effective_io_concurrency.

What's your effective_io_concurrency setting?

A good place to start setting it is the number of spindles on your
array, though I usually use 1.5x that number since it gives me a
little more throughput.

You can set it on a query-by-query basis too, so you don't need to
change the configuration. If you do, a reload is enough to make PG
pick it up, so it's an easy thing to try.
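
For example (a sketch - the value 12 assumes the 8-spindle array mentioned
earlier, times 1.5):

-- per session, just for this query:
SET effective_io_concurrency = 12;
SELECT count(DISTINCT uid) FROM userstats WHERE log_date > '2011-11-07';

-- or in postgresql.conf, followed by a reload:
-- effective_io_concurrency = 12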