Re: Auto-clustering?

From: Filip Rembiałkowski
Subject: Re: Auto-clustering?
Date:
Msg-id: AANLkTinN8bZSK-iXOZRsDd-N2=DGoyQ26Wc_nJRP2UYD@mail.gmail.com
In response to: Auto-clustering? (Royce Ausburn <royce@inomial.com>)
Responses: Re: Auto-clustering?
           Re: Auto-clustering?
List: pgsql-performance

2010/12/17 Royce Ausburn <royce@inomial.com>
Hi all,

I have a table that in the typical case holds two minute sample data for a few thousand sources.  Often we need to report on these data for a particular source over a particular time period and we're finding this query tends to get a bit slow.

The structure of the table:

                                     Table "public.sample"
      Column       |           Type           |                    Modifiers                    
-------------------+--------------------------+-------------------------------------------------
 client            | integer                  | not null
 aggregateid       | bigint                   | not null
 sample            | bigint                   | not null default nextval('samplekey'::regclass)
 customer          | integer                  | 
 period            | integer                  | not null
 starttime         | integer                  | not null
 duration          | integer                  | not null
 ip                | text                     | 
 tariff            | integer                  | 
 bytessentrate     | bigint                   | 
 bytessent         | bigint                   | 
 bytesreceived     | bigint                   | 
 packets           | integer                  | not null
 queuetype         | integer                  | not null default 0
 collection        | integer                  | 
 bytesreceivedrate | bigint                   | 
 greatestrate      | bigint                   | 
 invalidated       | timestamp with time zone | 
Indexes:
    "sample_pkey" PRIMARY KEY, btree (sample)
    "sample_collection_starttime_idx" btree (collection, starttime)
    "sample_customer_starttime_idx" btree (customer, starttime)
    "sample_sample_idx" btree (client, sample)
Foreign-key constraints:
    "sample_client_fkey" FOREIGN KEY (client) REFERENCES client(client)


fc=# explain analyse select collection, period, tariff, sum(bytesSent), sum(bytesReceived), sum(packets), max(sample), (starttime / 3600) * 3600 as startchunk from sample_20101001 where starttime between 1287493200 and 1290171599 and collection=128 and ip = '10.9.125.207' group by startchunk, tariff, collection, period;
                                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=34959.01..34959.03 rows=1 width=44) (actual time=67047.850..67047.850 rows=0 loops=1)
   ->  Bitmap Heap Scan on sample_20101001  (cost=130.56..34958.91 rows=5 width=44) (actual time=67047.847..67047.847 rows=0 loops=1)
         Recheck Cond: ((collection = 128) AND (starttime >= 1287493200) AND (starttime <= 1290171599))
         Filter: (ip = '10.9.125.207'::text)
         ->  Bitmap Index Scan on sample_20101001_collection_starttime_idx  (cost=0.00..130.56 rows=9596 width=0) (actual time=9806.115..9806.115 rows=6830 loops=1)
               Index Cond: ((collection = 128) AND (starttime >= 1287493200) AND (starttime <= 1290171599))
 Total runtime: 67048.201 ms
(7 rows)


how about (auto)vacuuming?
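A quick way to check that, assuming the stats collector is enabled (n_dead_tup needs 8.3 or later; table name taken from the plan above):

  SELECT relname, n_dead_tup, last_vacuum, last_autovacuum,
         last_analyze, last_autoanalyze
  FROM pg_stat_user_tables
  WHERE relname = 'sample_20101001';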

I figure at most there should only be ~20,000 rows to be read from disk, and I expect that the index is doing a pretty good job of making sure only the rows that need reading are read. Inclusion of the ip in the query is almost redundant, as most of the time an ip has its own collection... My suspicion is that the rows we're interested in are very sparsely distributed on disk, so we're having to read too many pages for the query...


you can test this suspicion in a very simple way (a concrete SQL sketch follows the list):
- create a test table (like yours, including indexes and constraints, but with no data)
- insert into test select * from yours order by collection, starttime
- analyze the test table
- run the query against the new table
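In SQL, roughly (a sketch only: "sample_test" is an invented name, I am assuming (collection, starttime) is the order you would cluster on, and note that LIKE copies indexes but not foreign keys):

  -- empty structural copy of the table, with its indexes
  CREATE TABLE sample_test (LIKE sample_20101001 INCLUDING DEFAULTS INCLUDING INDEXES);

  -- load the rows pre-sorted, emulating what CLUSTER would produce
  INSERT INTO sample_test SELECT * FROM sample_20101001 ORDER BY collection, starttime;

  -- refresh planner statistics, then re-run the report query against sample_test
  ANALYZE sample_test;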

If the new query is much faster, and if you have intensive random UPDATE/DELETE/INSERT activity, a periodic CLUSTER could be a good idea...
but it depends on actual usage patterns (SELECT/modify ratio, types of updates, and so on).
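For reference, a sketch of what that would look like here (CLUSTER ... USING is the 8.4+ spelling; older versions write CLUSTER indexname ON tablename):

  -- rewrite the table in index order; takes an exclusive lock for the duration
  CLUSTER sample_20101001 USING sample_20101001_collection_starttime_idx;

  -- later runs can omit the index name; the choice is remembered
  CLUSTER sample_20101001;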


All of the queries on this table report on a single collection, so ideally a collection's data would all be stored in the same part of the disk, or at least clumped together. This can be achieved using "cluster"; however, as far as I know there's no automated, non-cronesque means of clustering, and having the table become unusable during the cluster is not ideal.

cron is a form of automation, isn't it :-)
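Something like this, purely as an illustration (schedule and connection details invented; remember CLUSTER locks the table while it runs):

  # crontab: re-cluster every Sunday at 03:00, off-peak
  0 3 * * 0  psql -d fc -c 'CLUSTER sample_20101001;'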


I've considered partitioning, but I don't think that's going to give the effect I need. Apparently partitioning is only going to scale to a few dozen child tables, so that's only going to give one order of magnitude of performance for significant complexity.



regarding partitioning: I guess it starts to make sense at around 10M rows or 10 GB in one table.
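For what it's worth, on pre-9.x versions partitioning means inheritance plus CHECK constraints; a hypothetical monthly child keyed on starttime (name and epoch boundaries invented for illustration) would look roughly like:

  -- child table holding October 2010 samples (epochs are 2010-10-01 / 2010-11-01 UTC)
  CREATE TABLE sample_201010 (
      CHECK (starttime >= 1285891200 AND starttime < 1288569600)
  ) INHERITS (sample);

  -- lets the planner skip children whose CHECK range cannot match the WHERE clause
  SET constraint_exclusion = on;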

regarding clustering: it does not help with index bloat.
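If index bloat is the real concern, REINDEX is the direct tool for it (like CLUSTER, it locks out writers while it runs):

  REINDEX TABLE sample_20101001;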

and finally, you did not specify which PostgreSQL version you are using.
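For the record, checking it is a one-liner:

  fc=# SELECT version();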


cheers,
Filip
