Discussion: Join runs for > 10 hours and then fills up >1.3TB of disk space

Join runs for > 10 hours and then fills up >1.3TB of disk space

From: kevin kempter
Date:
Hi List;

I have a table with 9,961,914 rows in it (see the describe of
bigtab_stats_fact_tmp14 below).

I also have a table with 7,785 rows in it (see the describe of
xsegment_dim below).

I'm running the join shown below; it takes > 10 hours and
eventually runs out of disk space on a 1.4TB file system.

I've included below a describe of both tables, the join, and an explain
plan; any help / suggestions would be much appreciated!

I need to get this beast to run as quickly as possible (without
filling up my file system).


Thanks in advance...


select
    f14.xpublisher_dim_id,
    f14.xtime_dim_id,
    f14.xlocation_dim_id,
    f14.xreferrer_dim_id,
    f14.xsite_dim_id,
    f14.xsystem_cfg_dim_id,
    f14.xaffiliate_dim_id,
    f14.customer_id,
    pf_dts_id,
    episode_id,
    sessionid,
    bytes_received,
    bytes_transmitted,
    total_played_time_sec,
    segdim.xsegment_dim_id as episode_level_segid
from
    bigtab_stats_fact_tmp14 f14,
    xsegment_dim segdim
where
    f14.customer_id = segdim.customer_srcid
    and f14.show_id = segdim.show_srcid
    and f14.season_id = segdim.season_srcid
    and f14.episode_id = segdim.episode_srcid
    and segdim.segment_srcid is NULL;

                              QUERY PLAN
------------------------------------------------------------------------------------------------
 Merge Join  (cost=1757001.74..73569676.49 rows=3191677219 width=118)
   Merge Cond: ((segdim.episode_srcid = f14.episode_id) AND (segdim.customer_srcid = f14.customer_id) AND (segdim.show_srcid = f14.show_id) AND (segdim.season_srcid = f14.season_id))
   ->  Sort  (cost=1570.35..1579.46 rows=3643 width=40)
         Sort Key: segdim.episode_srcid, segdim.customer_srcid, segdim.show_srcid, segdim.season_srcid
         ->  Seq Scan on xsegment_dim segdim  (cost=0.00..1354.85 rows=3643 width=40)
               Filter: (segment_srcid IS NULL)
   ->  Sort  (cost=1755323.26..1780227.95 rows=9961874 width=126)
         Sort Key: f14.episode_id, f14.customer_id, f14.show_id, f14.season_id
         ->  Seq Scan on bigtab_stats_fact_tmp14 f14  (cost=0.00..597355.74 rows=9961874 width=126)
(9 rows)

# \d bigtab_stats_fact_tmp14
         Table "public.bigtab_stats_fact_tmp14"
          Column           |            Type             | Modifiers
---------------------------+-----------------------------+-----------
 pf_dts_id                 | bigint                      |
 pf_device_id              | bigint                      |
 segment_id                | bigint                      |
 cdn_id                    | bigint                      |
 collector_id              | bigint                      |
 digital_envoy_id          | bigint                      |
 maxmind_id                | bigint                      |
 quova_id                  | bigint                      |
 website_id                | bigint                      |
 referrer_id               | bigint                      |
 affiliate_id              | bigint                      |
 custom_info_id            | bigint                      |
 start_dt                  | timestamp without time zone |
 total_played_time_sec     | numeric(18,5)               |
 bytes_received            | bigint                      |
 bytes_transmitted         | bigint                      |
 stall_count               | integer                     |
 stall_duration_sec        | numeric(18,5)               |
 hiccup_count              | integer                     |
 hiccup_duration_sec       | numeric(18,5)               |
 watched_duration_sec      | numeric(18,5)               |
 rewatched_duration_sec    | numeric(18,5)               |
 requested_start_position  | numeric(18,5)               |
 requested_stop_position   | numeric(18,5)               |
 post_position             | numeric(18,5)               |
 is_vod                    | numeric(1,0)                |
 sessionid                 | bigint                      |
 create_dt                 | timestamp without time zone |
 segment_type_id           | bigint                      |
 customer_id               | bigint                      |
 content_publisher_id      | bigint                      |
 content_owner_id          | bigint                      |
 episode_id                | bigint                      |
 duration_sec              | numeric(18,5)               |
 device_id                 | bigint                      |
 os_id                     | bigint                      |
 browser_id                | bigint                      |
 cpu_id                    | bigint                      |
 xsystem_cfg_dim_id        | bigint                      |
 xreferrer_dim_id          | bigint                      |
 xaffiliate_dim_id         | bigint                      |
 xsite_dim_id              | bigint                      |
 xpublisher_dim_id         | bigint                      |
 season_id                 | bigint                      |
 show_id                   | bigint                      |
 xsegment_dim_id           | bigint                      |
 location_id               | bigint                      |
 zipcode                   | character varying(20)       |
 xlocation_dim_id          | bigint                      |
 location_srcid            | bigint                      |
 timezone                  | real                        |
 xtime_dim_id              | bigint                      |
Indexes:
    "bigtab_stats_fact_tmp14_idx1" btree (customer_id)
    "bigtab_stats_fact_tmp14_idx2" btree (show_id)
    "bigtab_stats_fact_tmp14_idx3" btree (season_id)
    "bigtab_stats_fact_tmp14_idx4" btree (episode_id)

# \d xsegment_dim
                                 Table "public.xsegment_dim"
        Column        |            Type             |                        Modifiers
----------------------+-----------------------------+---------------------------------------------------------
 xsegment_dim_id      | bigint                      | not null default nextval('xsegment_dim_seq'::regclass)
 customer_srcid       | bigint                      | not null
 show_srcid           | bigint                      | not null
 show_name            | character varying(500)      | not null
 season_srcid         | bigint                      | not null
 season_name          | character varying(500)      | not null
 episode_srcid        | bigint                      | not null
 episode_name         | character varying(500)      | not null
 segment_type_id      | integer                     |
 segment_type         | character varying(500)      |
 segment_srcid        | bigint                      |
 segment_name         | character varying(500)      |
 effective_dt         | timestamp without time zone | not null default now()
 inactive_dt          | timestamp without time zone |
 last_update_dt       | timestamp without time zone | not null default now()
Indexes:
    "xsegment_dim_pk" PRIMARY KEY, btree (xsegment_dim_id)
    "seg1" btree (customer_srcid)
    "seg2" btree (show_srcid)
    "seg3" btree (season_srcid)
    "seg4" btree (episode_srcid)
    "seg5" btree (segment_srcid)
    "xsegment_dim_ix1" btree (customer_srcid)

Re: Join runs for > 10 hours and then fills up >1.3TB of disk space

From: "Claus Guttesen"
Date:
> I have a table with 9,961,914 rows in it (see the describe of
> bigtab_stats_fact_tmp14 below).
>
> I also have a table with 7,785 rows in it (see the describe of xsegment_dim
> below).
>
> I'm running the join shown below; it takes > 10 hours and eventually runs
> out of disk space on a 1.4TB file system.
>
> I've included below a describe of both tables, the join, and an explain
> plan; any help / suggestions would be much appreciated!
>
> I need to get this beast to run as quickly as possible (without filling up
> my file system).
>
>
> Thanks in advance...

What version of PostgreSQL are you using? According to
http://www.postgresql.org/docs/8.2/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
you may benefit from adjusting work_mem (see the sketch further down).

You also have an index on segment_srcid (in table xsegment_dim), but if
you search for NULL and you have enough of those, the planner falls back
to a seq scan:

> Seq Scan on xsegment_dim segdim  (cost=0.00..1354.85 rows=3643 width=40)
>   Filter: (segment_srcid IS NULL)

Maybe you could insert some default value into segment_srcid (some
arbitrarily large number) instead of NULL, and then search for values
greater than that?
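
A rough sketch of that idea, if you try it (999999999 is a made-up
sentinel; pick a value no real segment_srcid can ever take):

update xsegment_dim
   set segment_srcid = 999999999   -- hypothetical sentinel value
 where segment_srcid is null;
analyze xsegment_dim;
-- and in the join, replace "segdim.segment_srcid is NULL" with
-- "segdim.segment_srcid >= 999999999"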

You could also try lowering random_page_cost from the default (4) to 2.
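
Both work_mem and random_page_cost can be tried per-session before
re-running the join; the values below are only illustrative starting
points, not recommendations for your box:

set work_mem = '256MB';        -- per-session; the 8.3 default is only 1MB
set random_page_cost = 2;      -- default is 4
-- then re-run the EXPLAIN and the join itself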

> [...]



--
regards
Claus

When lenity and cruelty play for a kingdom,
the gentlest gamester is the soonest winner.

Shakespeare

Re: Join runs for > 10 hours and then fills up >1.3TB of disk space

From: kevin kempter
Date:
Also, I'm running version 8.3 on a CentOS box with two dual-core CPUs
and 32GB of RAM.


On May 16, 2008, at 12:58 AM, kevin kempter wrote:

> Sorry, I goofed on the query text. Here's the correct query:
> [...]


Re: Join runs for > 10 hours and then fills up >1.3TB of disk space

From: kevin kempter
Date:
Sorry, I goofed on the query text. Here's the correct query:

select
    f14.xpublisher_dim_id,
    f14.xtime_dim_id,
    f14.xlocation_dim_id,
    f14.xreferrer_dim_id,
    f14.xsite_dim_id,
    f14.xsystem_cfg_dim_id,
    f14.xaffiliate_dim_id,
    f14.customer_id,
    f14.pf_dts_id,
    f14.episode_id,
    f14.sessionid,
    f14.bytes_received,
    f14.bytes_transmitted,
    f14.total_played_time_sec,
    segdim.xsegment_dim_id as episode_level_segid
from
    bigtab_stats_fact_tmp14 f14,
    xsegment_dim segdim
where
    f14.customer_id = segdim.customer_srcid
    and f14.show_id = segdim.show_srcid
    and f14.season_id = segdim.season_srcid
    and f14.episode_id = segdim.episode_srcid
    and segdim.segment_srcid is NULL;

On May 16, 2008, at 12:31 AM, kevin kempter wrote:

> Hi List;
> [...]


Re: Join runs for > 10 hours and then fills up >1.3TB of disk space

From: Simon Riggs
Date:
On Fri, 2008-05-16 at 00:31 -0600, kevin kempter wrote:

> I'm running the join shown below and it takes > 10 hours and
> eventually runs out of disk space on a 1.4TB file system

Well, running for 10 hours doesn't by itself mean there's a software
problem, nor does running out of disk space.

Please crunch some numbers before you ask, such as how much disk space
was used by the query, how big you'd expect it to be, etc., plus provide
information such as what the primary key of the large table is and what
your release level is.
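
For instance, the on-disk size of each input is easy to establish
(pg_total_relation_size() and pg_size_pretty() are standard built-ins
in 8.1 and later):

select pg_size_pretty(pg_total_relation_size('bigtab_stats_fact_tmp14'));
select pg_size_pretty(pg_total_relation_size('xsegment_dim'));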

Are you sure you want to retrieve an estimated 3 billion rows? Can you
cope if that estimate is wrong and the true figure is much higher? Do
you think the estimate is realistic?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Join runs for > 10 hours and then fills up >1.3TB of disk space

From: Richard Huxton
Date:
kevin kempter wrote:
> Hi List;
>
> I have a table with 9,961,914 rows in it (see the describe of
> bigtab_stats_fact_tmp14 below)
>
> I also have a table with 7,785 rows in it (see the describe of
> xsegment_dim below)
>
> I'm running the join shown below and it takes > 10 hours and eventually
> runs out of disk space on a 1.4TB file system

> QUERY PLAN
> --------------------------------------------------------------------
> Merge Join  (cost=1757001.74..73569676.49 rows=3191677219 width=118)

Dumb question, Kevin, but are you really expecting 3.2 billion rows in
the result set? Because that's approaching 400GB of result set without
any overheads.
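
(To put a number on that: 3,191,677,219 estimated rows × 118 bytes of
estimated row width ≈ 376GB, before any per-tuple or temp-file
overhead.)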

--
   Richard Huxton
   Archonet Ltd

Re: Join runs for > 10 hours and then fills up >1.3TB of disk space

From: kevin kempter
Date:
I'm expecting 9,961,914 rows returned. Each row in the big table
should have a corresponding key in the smaller table; I want to
basically "expand" the big table column list by one, by adding the
appropriate key from the smaller table for each row in the big table.
It's not a cartesian product join.
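
(A quick way to verify that one-match-per-row expectation is to check
whether the four join columns are unique among the NULL-segment rows of
the small table; a sketch:

select count(*) as total_rows,
       count(distinct (customer_srcid, show_srcid,
                       season_srcid, episode_srcid)) as distinct_keys
from xsegment_dim
where segment_srcid is null;
-- if distinct_keys < total_rows, some keys match more than once and
-- the join will multiply rows)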



On May 16, 2008, at 1:40 AM, Richard Huxton wrote:

> kevin kempter wrote:
>> [...]
>
> Dumb question, Kevin, but are you really expecting 3.2 billion rows
> in the result set? Because that's approaching 400GB of result set
> without any overheads.
>
> --
>  Richard Huxton
>  Archonet Ltd


Re: Join runs for > 10 hours and then fills up >1.3TB of disk space

From: Richard Huxton
Date:
kevin kempter wrote:
> I'm expecting 9,961,914 rows returned. Each row in the big table should
> have a corresponding key in the smaller table; I want to basically
> "expand" the big table column list by one, by adding the appropriate
> key from the smaller table for each row in the big table. It's not a
> cartesian product join.

Didn't seem likely, to be honest.

What happens if you try the query as a cursor, perhaps with an order-by
on customer_id or something to encourage index use? Do you ever get a
first row back?

In fact, what happens if you slap an index over all your join columns on
xsegment_dim? With 7,000 rows that should make it a cheap test.
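
Sketched out, with both ideas together (the index and cursor names are
made up, and the select list is trimmed for brevity):

create index xsegment_dim_join_idx
    on xsegment_dim (customer_srcid, show_srcid, season_srcid, episode_srcid);

begin;
declare seg_cur cursor for
    select f14.customer_id, segdim.xsegment_dim_id
      from bigtab_stats_fact_tmp14 f14, xsegment_dim segdim
     where f14.customer_id = segdim.customer_srcid
       and f14.show_id = segdim.show_srcid
       and f14.season_id = segdim.season_srcid
       and f14.episode_id = segdim.episode_srcid
       and segdim.segment_srcid is null
     order by f14.customer_id;
fetch 10 from seg_cur;    -- do we ever see a first row?
close seg_cur;
commit;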

--
   Richard Huxton
   Archonet Ltd

Re: Join runs for > 10 hours and then fills up >1.3TB of disk space

From: Richard Huxton
Date:
kevin kempter wrote:
> Hi List;
>
> I have a table with 9,961,914 rows in it (see the describe of
> bigtab_stats_fact_tmp14 below)
>
> I also have a table with 7,785 rows in it (see the describe of
> xsegment_dim below)

Something else is puzzling me with this - you're joining over four fields.

> from
> bigtab_stats_fact_tmp14 f14,
> xsegment_dim segdim
> where
> f14.customer_id = segdim.customer_srcid
> and f14.show_id = segdim.show_srcid
> and f14.season_id = segdim.season_srcid
> and f14.episode_id = segdim.episode_srcid
> and segdim.segment_srcid is NULL;

> Merge Join  (cost=1757001.74..73569676.49 rows=3191677219 width=118)
>   ->  Sort  (cost=1570.35..1579.46 rows=3643 width=40)
>   ->  Sort  (cost=1755323.26..1780227.95 rows=9961874 width=126)

Here it's still expecting roughly 320 matches against each row from the
large table (3,191,677,219 ÷ 9,961,874 ≈ 320). That's ~10% of the small
table (or of the fraction of it that PG expects to pass the filter),
which seems very high for four clauses ANDed together.
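
One way to see where the planner gets that from is the per-column
statistics (pg_stats is a standard system view):

select attname, n_distinct, null_frac
  from pg_stats
 where tablename = 'xsegment_dim'
   and attname in ('customer_srcid', 'show_srcid',
                   'season_srcid', 'episode_srcid');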

--
   Richard Huxton
   Archonet Ltd

Re: Join runs for > 10 hours and then fills up >1.3TB of disk space

From: "Luke Lonergan"
Date:

Try 'set enable_mergejoin = off' and see if you get a hash join.
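
For example (per-session only, and it changes no data; the select list
is trimmed for brevity):

set enable_mergejoin = off;
explain
select segdim.xsegment_dim_id
  from bigtab_stats_fact_tmp14 f14, xsegment_dim segdim
 where f14.customer_id = segdim.customer_srcid
   and f14.show_id = segdim.show_srcid
   and f14.season_id = segdim.season_srcid
   and f14.episode_id = segdim.episode_srcid
   and segdim.segment_srcid is null;
reset enable_mergejoin;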

- Luke

----- Original Message -----
From: pgsql-performance-owner@postgresql.org <pgsql-performance-owner@postgresql.org>
To: Richard Huxton <dev@archonet.com>
Cc: pgsql-performance@postgresql.org <pgsql-performance@postgresql.org>
Sent: Fri May 16 04:00:41 2008
Subject: Re: [PERFORM] Join runs for > 10 hours and then fills up >1.3TB of disk space

I'm expecting 9,961,914 rows returned. [...]



Re: Join runs for > 10 hours and then fills up >1.3TB of disk space

From: kevin kempter
Date:
On further investigation it turns out that I/we have a serious data
issue: my small table is full of 'UNKNOWN' tags, so my query cannot
associate the data correctly, and thus I really will end up with 2+
billion rows.
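
For the record, the duplication is easy to spot by grouping on the join
keys, e.g.:

select customer_srcid, show_srcid, season_srcid, episode_srcid,
       count(*)
  from xsegment_dim
 where segment_srcid is null
 group by customer_srcid, show_srcid, season_srcid, episode_srcid
having count(*) > 1
 order by count(*) desc;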


Thanks everyone for your help




On May 16, 2008, at 1:38 AM, Simon Riggs wrote:

> [...]
>
> Are you sure you want to retrieve an estimated 3 billion rows? Can you
> cope if that estimate is wrong and the true figure is much higher? Do
> you think the estimate is realistic?