Looking for ideas on how to speed up warehouse loading

From: Sean Shanny
Subject: Looking for ideas on how to speed up warehouse loading
Date:
Msg-id: 40883FAB.3070109@earthlink.net
Replies: Re: Looking for ideas on how to speed up warehouse loading  (Sean Shanny <shannyconsulting@earthlink.net>)
Re: Looking for ideas on how to speed up warehouse loading  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Looking for ideas on how to speed up warehouse loading  (Joe Conway <mail@joeconway.com>)
Re: Looking for ideas on how to speed up warehouse loading  (CoL <col@mportal.hu>)
List: pgsql-performance
To all,

Essentials:  Running PostgreSQL 7.4.1 on OS X on a loaded G5 with dual
procs, 8GB of memory, direct-attached via Fibre Channel to a fully
optioned 3.5TB XRaid (14 spindles, 2 sets of 7 in RAID 5) box running
RAID 50.

Background:  We are loading what are essentially XML-based access logs
from 20+ webservers daily, about 6GB of raw data.  We have a classic
star schema.  All the ETL tools are custom Java code or standard *nix
tools like sort, uniq, etc.

The problem:  We have about 46 million rows in a table with the
following schema:

Table "public.d_referral"
       Column       |  Type   | Modifiers
--------------------+---------+-----------
 id                 | integer | not null
 referral_raw_url   | text    | not null
 job_control_number | integer | not null
Indexes:
    "d_referral_pkey" primary key, btree (id)
    "idx_referral_url" btree (referral_raw_url)

This is one of our dimension tables.  Part of the daily ETL process is
to match all the new referral URLs against the existing data in the
d_referral table.  Some of the values in referral_raw_url can be 5000
characters long :-( .  The average length is 109.57 characters.
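
Stats like those can be gathered with a simple length aggregate; this
query is a sketch of mine, not from the original post:

SELECT avg(length(referral_raw_url)) AS avg_len,
       max(length(referral_raw_url)) AS max_len
FROM d_referral;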

I sort and uniq all the incoming referrals and load them into a temp table.

Table "public.referral_temp"
 Column | Type | Modifiers
--------+------+-----------
 url    | text | not null
Indexes:
    "referral_temp_pkey" primary key, btree (url)

I then do a left join:

SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral t1
ON t2.url = t1.referral_raw_url ORDER BY t1.id

This is the output from EXPLAIN ANALYZE.  (Please note that I disable
index scans prior to issuing this, because the plan takes forever using
indexes.)
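
Concretely, the session setting is (in 7.4 the GUC is spelled
enable_indexscan):

SET enable_indexscan = false;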

explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;

                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=4012064.81..4013194.45 rows=451856 width=115) (actual time=1297320.823..1297739.813 rows=476176 loops=1)
   Sort Key: t1.id
   ->  Hash Left Join  (cost=1052345.95..3969623.10 rows=451856 width=115) (actual time=1146650.487..1290230.590 rows=476176 loops=1)
         Hash Cond: ("outer".url = "inner".referral_raw_url)
         ->  Seq Scan on referral_temp t2  (cost=0.00..6645.56 rows=451856 width=111) (actual time=20.285..1449.634 rows=476176 loops=1)
         ->  Hash  (cost=729338.16..729338.16 rows=46034716 width=124) (actual time=1146440.710..1146440.710 rows=0 loops=1)
               ->  Seq Scan on d_referral t1  (cost=0.00..729338.16 rows=46034716 width=124) (actual time=14.502..-1064277.123 rows=46034715 loops=1)
 Total runtime: 1298153.193 ms
(8 rows)



What I would like to know is whether there are better ways to do the
join.  I need to get all the rows back from the referral_temp table, as
they are used for assigning FKs for the fact table later in processing.
When I iterate over the returned rows, those with a null t1.id are
assigned a new FK and pushed into the d_referral table as new entries,
and are also written to a text file for later use.  The matching
records are written to a text file for later use.
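
For illustration, that null-id split can also be expressed directly in
SQL; this is a sketch of mine, assuming a sequence d_referral_id_seq
and a placeholder job number, neither of which appears above:

-- referrals with no match in the dimension get a freshly minted id
INSERT INTO d_referral (id, referral_raw_url, job_control_number)
SELECT nextval('d_referral_id_seq'), t2.url, 0   -- 0 = placeholder job_control_number
FROM referral_temp t2
LEFT OUTER JOIN d_referral t1 ON t2.url = t1.referral_raw_url
WHERE t1.id IS NULL;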

If we cannot improve the join performance, my question becomes: are
there better tools outside of PostgreSQL for matching up these 46
million strings, which are growing at a rate of 1 million every 3 days?
We don't want to have to invest zillions of dollars in hardware, but if
we have to we will.  I just want to make sure we have covered all the
non-hardware possibilities for improvement before we start investing in
large disk arrays.

Thanks.

--sean

In the pgsql-performance list, by date:

Previous
From: Eduardo Almeida
Message: Re: [pgsql-advocacy] MySQL vs PG TPC-H benchmarks
Next
From: Manfred Koizar
Message: Re: Setting Shared Buffers, Effective Cache, Sort Mem Parameters