Re: Optimizing data layout for reporting in postgres
От | Scott Marlowe |
---|---|
Тема | Re: Optimizing data layout for reporting in postgres |
Дата | |
Msg-id | dcc563d10912241453g55040de3sc370c8b2c50bc3a1@mail.gmail.com обсуждение исходный текст |
Ответ на | Optimizing data layout for reporting in postgres (Doug El <doug_el@yahoo.com>) |
Список | pgsql-general |
On Thu, Dec 24, 2009 at 1:01 PM, Doug El <doug_el@yahoo.com> wrote: > Hi, > > I have to summarize some data to be queried and there are about 5 million raw records a day I need to summarize. In a nutshellI don't think I'm laying it out in an optimal fashion, or not taking advantage of some postgres features perhaps,I'm looking for feedback. > > The raw incoming data is in the form of > > ip string uint uint uint uint > > So for any given record say: > > 8.8.8.8 helloworld 1 2 3 4 > > First, I need to be able to query how many total and how many unique requests there were (unique by ip), over given timeframe. > > So for the below data on the same day that's total two, but one unique > > 8.8.8.8 helloworld 1 2 3 4 > 8.8.8.8 helloworld 1 2 3 4 > > Further for all fields (but ip which is not stored) I need to be able to query and get total/unique counts based off anycombination of criteria. > > So if I refer to them as columns A-E > > A B C D E > string uint uint uint uint > > I need to be able and say how many where col A = 'helloworld' and say col C = 4. > Or perhaps col E = 4 and col c < 3 etc, any combination. > > The only way I could see to do this was to take the 5 million daily raw records, sort them, then summarize that list withtotal and unique counts as so: > > A B C D E F G H > date stringid uint uint uint uint total unique > > Primary key is A-F (date stringid uint uint uint uint) > > This gives me a summary of about 900k records a day from the 4 million raw. > > I have things organized with monthly tables and yearly schemas. The string column also has its own monthly lookup table,so there's just a string id that's looked up. > > The database however is still quite huge and grows very fast, even simple daily queries are fairly slow even on a fastserver. I have a few indexes on what I know are common columns queried against but again, any combination of data canbe queried, and indexes do increase db size of course. > > I feel like there's got to be some better way to organize this data and make it searchable. Overall speed is more importantthan disk space usage for this application. > > Perhaps there are some native features in postgres I'm not taking advantage of here, that would tip the scales in my favor.I've done a fair amount of research on the configuration file settings and feel like I have a fairly optimized configfor it as far as that goes, and have done the things mentioned here: http://wiki.postgresql.org/wiki/SlowQueryQuestions > > Very much appreciate any suggestions, thank you in advance. We run a nightly cron job that creates all the summary tables etc at midnight. On a fast machine it takes about 1 to 2 hours to run, but makes the queries run during the day go from 10 or 20 seconds to a few hundred milliseconds. You might want to look into table partitioning and also materialized views. There's a great tutorial on how to roll your own at: http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
В списке pgsql-general по дате отправления: