Re: Huge Data sets, simple queries

Поиск
Список
Период
Сортировка
От Craig A. James
Тема Re: Huge Data sets, simple queries
Дата
Msg-id 43DD85D2.5020207@modgraph-usa.com
обсуждение исходный текст
Ответ на Huge Data sets, simple queries  ("Mike Biamonte" <mike@dbeat.com>)
Список pgsql-performance
Mike Biamonte wrote:
> Does anyone have any experience with extremely large data sets?
> I'm mean hundreds of millions of rows.
>
> The queries I need to run on my 200 million transactions are relatively
> simple:
>
>    select month, count(distinct(cardnum)) count(*), sum(amount) from
> transactions group by month;

This may be heretical to post to a relational-database group, but sometimes a problem can be better solved OUTSIDE of
therelational system. 

I had a similar problem recently: I have a set of about 100,000 distinct values, each of which occurs one to several
milliontimes in the database, with an aggregate total of several hundred million occurances in the database. 

Sorting this into distinct lists ("Which rows contain this value?") proved quite time consuming (just like your case),
buton reflection, I realized that it was dumb to expect a general-purpose sorting algorithm to sort a list about which
Ihad specialized knowledge.  General-purpose sorting usually takes O(N*log(N)), but if you have a small number of
distinctvalues, you can use "bucket sorting" and sort in O(N) time, a huge improvement.  In my case, it was even more
specialized-- there was a very small number of the lists that contained thousands or millions of items, but about 95%
ofthe lists only had a few items. 

Armed with this knowledge, it took me couple weeks to write a highly-specialized sorting system that used a combination
ofPostgres, in-memory and disk caching, and algorithms dredged up from Knuth.  The final result ran in about four
hours.

The thing to remember about relational databases is that the designers are constrained by the need for generality,
reliabilityand SQL standards.  Given any particular well-defined task where you have specialized knowledge about the
data,and/or you don't care about transactional correctness, and/or you're not concerned about data loss, a good
programmercan always write a faster solution. 

Of course, there's a huge penalty.  You lose support, lose of generality, the application takes on complexity that
shouldbe in the database, and on and on.  A hand-crafted solution should be avoided unless there's simply no other way. 

A relational database is a tool.  Although powerful, like any tool it has limitations.  Use the tool where it's useful,
anduse other tools when necessary. 

Craig

В списке pgsql-performance по дате отправления:

Предыдущее
От: Michael Adler
Дата:
Сообщение: Re: Huge Data sets, simple queries
Следующее
От: Arnau Rebassa Villalonga
Дата:
Сообщение: Where is my bottleneck?