Обсуждение: Performance of outer joins?
I have a situation that can be summarized to the following:
-- day in 20061215 format
Create table calendar (
day integer unique not null
);
Create table customers (
id serial unique not null,
name varchar,
address varchar,
);
Create table deliveries (
customers_id integer not null references customers(id),
calendar_day integer not null references calendar(day),
delivered bool not null default false,
unique(customers_id, calendar_id)
);
Imagine tens of thousands of customers, a few million deliveries. A query
that's structurally similar to the following query is rather slow. It's
taking over 30 seconds, climbing fast on reasonable HW. (SMP Opteron, 10k
SCSI, 4 GB RAM) If I remove the outer join, performance is < 1 second.
SELECT customers.id as customers_id,
customers.name AS customers_name,
calendar.day AS calendar_day,
CASE WHEN (deliveries.delivered IS NULL) THEN 'n/a'
WHEN (deliveries.delivered=TRUE) THEN 'yes'
ELSE 'no' END AS delivered
FROM customers
JOIN calendars ON
(
-- GIVE A CALENDAR OF POSSIBLE DAYS FOR DELIVERIES
calendar.day < 20061201
AND calendar.day >= 20060101
)
LEFT OUTER JOIN deliveries ON
(
customers.id=deliveries.customers_id
AND deliveries.calendar_day=calendar.day
)
;
What can I do to improve the performance of this oft-used query? Is there a
better way to do this, or am I doomed to looping thru results and parsing the
results in code?
Thanks,
-Ben
--
"I kept looking around for somebody to solve the problem.
Then I realized I am somebody"
-Anonymous
Benjamin Smith <ben@schoolpathways.com> writes:
> FROM customers
> JOIN calendars ON
> (
> -- GIVE A CALENDAR OF POSSIBLE DAYS FOR DELIVERIES
> calendar.day < 20061201
> AND calendar.day >= 20060101
> )
Haven't you written a cross-product join here? Seems like a bad
idea if you'd like it to be fast. Especially since you then left-joined
it to something else, meaning that every row in the cross product
must in fact generate output. Do you really want to be deluged with
a table showing all the customers you DIDN'T deliver to on each day,
for every day in the year?
regards, tom lane
ben would something like this work in your situation?
SELECT customer.id, customer.name, deliveries.calendar_day,
deliveries.delivered
FROM ben_customers as customer, ben_deliveries as deliveries
WHERE customer.id = deliveries.customers_id
and deliveries.calendar_day in (Select day
from ben_calendar
where day < 20061201 and day >= 20060101)
I think that would cut down the deliveries table fairly quickly, thus
making the customer_id join nice and quick. You would also only be
returning data from the tables in your from clause. I suppose it
depends on what you are trying to display. If you want a list of all
the possible days, and deliveries on those days then I'd approach it a
bit differently.
On Dec 15, 7:59 pm, b...@schoolpathways.com (Benjamin Smith) wrote:
>
> What can I do to improve the performance of this oft-used query? Is there a
> better way to do this, or am I doomed to looping thru results and parsing the
> results in code?
>