Fw: [GENERAL] Including unique users in huge data warehouse in Postgresql...

Поиск
Список
Период
Сортировка
От Mark Jensen
Тема Fw: [GENERAL] Including unique users in huge data warehouse in Postgresql...
Дата
Msg-id 20061129194339.47957.qmail@web82105.mail.mud.yahoo.com
обсуждение исходный текст
Ответы Re: Fw: [GENERAL] Including unique users in huge data
Список pgsql-performance
posting this here instead of the GENERAL list...richard is right, this is more of a performance question than a general
question.

thanks,

____________________________________
Mark Jensen

----- Forwarded Message ----
From: Mark Jensen <musicnyman1974@yahoo.com>
To: Richard Huxton <dev@archonet.com>
Cc: pgsql-general@postgresql.org
Sent: Wednesday, November 29, 2006 2:40:58 PM
Subject: Re: [GENERAL] Including unique users in huge data warehouse in Postgresql...

thanks Richard.  I've talking to Ron Mayer about this as well offline.  I think the main problem is dedupping users,
andnot being able to aggregate visits in the fact table.  that's where most of the query time takes place.  but the
businessguys just won't accept using visits, but not actual uniques dedupped.  if visits was in the fact table as an
integeri could sum up, i'd be fine.  Ron also said he's put the unique user ids into arrays so it's faster to count
them,but placing them into aggregate tables.  only problem is i'd still have to know what things to aggregate by to
createthese, which is impossible since we have so many dimensions and facts that are going to be ad-hoc.  i have about
20summary tables i create per day, but most of the time, they have something new they want to query that's not in
summary. and will never come up again. 

I tried installing Bizgres using their Bizgres loader and custom postgresql package with bitmap indexes, but doesn't
seemto increase performance "that" much.  or as much as i would like compared to the normal postgresql install.  loads
arepretty slow when using their bitmap indexes compared to just using btree indexes in the standard postgresql install.
Query time is pretty good, but i also have to make sure load times are acceptable as well.  and had some problems with
thebizgres loader losing connection to the database for no reason at all, but when using the normal copy command in
8.2RC1,works fine.  love the new query inclusion in the copy command by the way, makes it so easy to aggregrate hourly
facttables into daily/weekly/monthly in one shot :) 

and yes, work_mem is optimized as much as possible.  postgresql is using about 1.5 gigs of working memory when it runs
thesequeries.  looking into getting 64 bit hardware with 16-32 gigs of RAM so i can throw most of this into memory to
speedit up.  we're also using 3par storage which is pretty fast.  we're going to try and put postgresql on a local disk
arrayusing RAID 5 as well to see if it makes a difference. 

and yes, right now, these are daily aggregate tables summed up from the hourly.  so about 17 million rows per day.
hourlyfact tables are impossible to query right now, so i have to at least put these into daily fact tables.  so when
youhave 30 days in this database, then yes, table scans are going to be huge, thus why it's taking so long, plus
deduppingon unique user id :) 

and you're right, i should put this on the performance mailing list... see you there :)

thanks guys.

____________________________________
Mark Jensen

----- Original Message ----
From: Richard Huxton <dev@archonet.com>
To: Mark Jensen <musicnyman1974@yahoo.com>
Cc: pgsql-general@postgresql.org
Sent: Wednesday, November 29, 2006 2:29:35 PM
Subject: Re: [GENERAL] Including unique users in huge data warehouse in Postgresql...

Mark Jensen wrote:
> So i've been given the task of designing a data warehouse in
> either Postgresql or Mysql for our clickstream data for our sites.  I
> started with Mysql but the joins in Mysql are just way too slow
> compared to Postgresql when playing with star schemas.

Mark - it's not my usual area, but no-one else has picked up your
posting, so I'll poke my nose in. The other thing you might want to do
is post this on the performance list - that's probably the best place.
Might be worth talking to those at www.bizgres.org too (although I think
they all hang out on the performance list).

 > I can't say
> which sites i'm working on, but we get close to 3-5 million uniques
> users per day, so over time, that's a lot of unique users to keep
> around and de-dup your fact tables by.  Need to be able to query normal
> analytics like:
<snip>

> i've
> made a lot of optimizations in postgresql.conf by playing with work_mem
> and shared_buffers and such and i think the database is using as much
> as it can disk/memory/cpu wise.

Big work_mem, I'm guessing. Limiting factor is presumably disk I/O.

<snip>
> here's a sample query that takes a while to run... just a simple report that shows gender by area of the site.
>
> select A.gender as gender, B.area as area, sum(C.imps) as imps, sum(C.clicks) as clicks, count(distinct(C.uu_id)) as
users
> from uus as A, areas as B, daily_area_fact as C
> where A.uu_id = C.uu_id
> and B.area_id = C.area_id
> group by gender,area;
>
> so
> by just having one day of data, with 3,168,049 rows in the user
> dimension table (uus), 17,213,420 in the daily_area_fact table that
> joins all the dimension tables, takes about 15 minutes.  if i had 30-90
> days in this fact table, who knows how long this would take... i know
> doing a distinct on uu_id is very expensive, so that's the main problem
> here i guess and would want to know if anyone else is doing it this way
> or better.

In the end, I'd suspect the seq-scan over the fact table will be your
biggest problem. Can you pre-aggregate your fact-table into daily summaries?

See you over on the performance list, where there are more experienced
people than myself to help you.
--
   Richard Huxton
   Archonet Ltd






____________________________________________________________________________________
Do you Yahoo!?
Everyone is raving about the all-new Yahoo! Mail beta.
http://new.mail.yahoo.com

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings






____________________________________________________________________________________
Yahoo! Music Unlimited
Access over 1 million songs.
http://music.yahoo.com/unlimited

В списке pgsql-performance по дате отправления:

Предыдущее
От: Ron Mayer
Дата:
Сообщение: Re: RES: Priority to a mission critical transaction
Следующее
От: "Luke Lonergan"
Дата:
Сообщение: Re: Fw: [GENERAL] Including unique users in huge data