Обсуждение: OT: Data structure design question: How do they count so fast?

Поиск
Список
Период
Сортировка

OT: Data structure design question: How do they count so fast?

От
Brendan Duddridge
Дата:
Hi,

First of all, the reason I'm posting on the PostgreSQL Performance list is we have a performance issue with one of our applications and it's related to the speed at which PostgreSQL can do counts. But it's also related to the data structure we've designed to develop our comparison shopping engine. It's very normalized and the speed of our queries are slowing down considerably as we add more and more data.

So I've been looking at some of the other comparison shopping engines and I'm trying to figure out how they manage to get the counts of their products for a set of attributes and attribute values so quickly.

For example, on the following page, they have 1,260,658 products:


They display 3 attributes with values on the page: Price Range, Home Furnishing Type, and Store. Plus there are a selection of other attributes not displaying the value choices.

For Price Range, they have the following values and product counts (in brackets):

# Below $20 (204,315)
# $20 - $50 (234,694)
# $50 - $80 (188,811)
# $80 - $130 (182,721)
# $130 - $240 (222,519)

For Home Furnishing Type they have:

# Wall Art and Decor (438,493)
# Lighting (243,098)
# Bathroom Furnishings (132,441)
# Rugs (113,216)
# Decorative Accents (65,418)

And for Store they have:

# Art.com (360,933)
# HomeAnnex (130,410)
# AllPosters.com (72,529)
# HomeClick.com (61,423)
# 1STOPlighting Superstore (32,074)

Now, initially I thought they would just pre-compute these counts, but the problem is, when you click on any of the above attribute values, they reduce the remaining possible set of matching products (and set of possible remaining attributes and attribute values) by the amount displayed next to the attribute value selected. You can click on any combination of attribute values to filter down the remaining set of matching products, so there's a large combination of paths you can take to arrive at a set of products you might be interested in.

Do you think they are pre-computed? Or do you think they might use a query similar to the following?:

select pav.attribute_value_id, count(p.product_id)
from product_attribute_value pav,
 attribute a,
 product p
where a.attribute_id in (some set of attribute ids) and
pav.product_id = p.product_id and
pav.attribute_id = a.attribute_id and p.product_id in
(select product_id
 from category_product
 where category_id = some category id) and
p.is_active = 'true'
group by pav.attribute_value_id;

It would seem to me that although the above query suggests a normalized database structure, that joining with 3 tables plus a 4th table in the sub-query with an IN qualifier and grouping to get the product counts would take a VERY long time, especially on a possible result set of 1,260,658 products. 

The other issue is trying to figure out what the remaining set of possible attribute and attribute values are in order to reach the remaining set of products.

Does anyone have any insights into what kind of data structures would be necessary to accomplish such a feat? I know that PostgreSQL has performance issues with counting rows, so I can't imagine being able to use the above kind of query to get the results we need. I don't know what kind of database backend mysimon is using either. It would also seem to be logical that having a flattened data structure would seem to be necessary in order to get the performance required. Their pages seem to load pretty fast.

We are possibly considering some kind of pre-computed decision-tree type data structure to get the counts of the set of products that could be reached by selecting any combination of attributes and attribute values. Does this seem like a reasonable idea?

Perhaps there's a better way?

I also apologize if this isn't the appropriate list to publish this kind of question to. The reason I posted here was because it is a performance related question, but not necessarily directly related to PostgreSQL. It's just that we are using PostgreSQL for our product comparison engine so I thought there could be some PostgreSQL specific optimizations that could be made. If not, please let me know and I'll move it elsewhere. 


Thanks very much,


____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 |  brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com 
Вложения

Re: OT: Data structure design question: How do they count so fast?

От
Brendan Duddridge
Дата:
Hi Brandon,

Thanks for your suggestion. I'll think about that one. Part of the
problem is also trying to figure out what the remaining set of
attributes and attribute values are, so that slows it down
considerably too. There are many many combinations of attribute
values that can be clicked on.

More work to do!

Thanks,

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 |  brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 9, 2006, at 9:56 PM, Brandon Hines wrote:

> Brendan,
>
> I have a number of applications the require similar functionality.
> What I typically do is to create a count table that gets updated
> with a trigger.  But instead of keeping absolute counts, I keep
> counts of the smallest necessary element.  For example, I have a
> table with approx 12 million elements, each element belongs to one
> of a thousand classes of elements.  The materialized view table
> with only about a thousand rows is small enough for sum() queries
> of various classes fast enough for web pages.
>
> -Brandon
>
> Brendan Duddridge wrote:
>> Hi,
>> First of all, the reason I'm posting on the PostgreSQL Performance
>> list is we have a performance issue with one of our applications
>> and it's related to the speed at which PostgreSQL can do counts.
>> But it's also related to the data structure we've designed to
>> develop our comparison shopping engine. It's very normalized and
>> the speed of our queries are slowing down considerably as we add
>> more and more data.
>> So I've been looking at some of the other comparison shopping
>> engines and I'm trying to figure out how they manage to get the
>> counts of their products for a set of attributes and attribute
>> values so quickly.
>> For example, on the following page, they have 1,260,658 products:
>> http://www.mysimon.com/Home-Furnishings/9000-10975_8-0.html?tag=glnav
>> They display 3 attributes with values on the page: Price Range,
>> Home Furnishing Type, and Store. Plus there are a selection of
>> other attributes not displaying the value choices.
>> For Price Range, they have the following values and product counts
>> (in brackets):
>> # Below $20 (204,315)
>> # $20 - $50 (234,694)
>> # $50 - $80 (188,811)
>> # $80 - $130 (182,721)
>> # $130 - $240 (222,519)
>> For Home Furnishing Type they have:
>> # Wall Art and Decor (438,493)
>> # Lighting (243,098)
>> # Bathroom Furnishings (132,441)
>> # Rugs (113,216)
>> # Decorative Accents (65,418)
>> And for Store they have:
>> # Art.com (360,933)
>> # HomeAnnex (130,410)
>> # AllPosters.com (72,529)
>> # HomeClick.com (61,423)
>> # 1STOPlighting Superstore (32,074)
>> Now, initially I thought they would just pre-compute these counts,
>> but the problem is, when you click on any of the above attribute
>> values, they reduce the remaining possible set of matching
>> products (and set of possible remaining attributes and attribute
>> values) by the amount displayed next to the attribute value
>> selected. You can click on any combination of attribute values to
>> filter down the remaining set of matching products, so there's a
>> large combination of paths you can take to arrive at a set of
>> products you might be interested in.
>> Do you think they are pre-computed? Or do you think they might use
>> a query similar to the following?:
>> select pav.attribute_value_id, count(p.product_id)
>> from product_attribute_value pav,
>>  attribute a,
>>  product p
>> where a.attribute_id in (some set of attribute ids) and
>> pav.product_id = p.product_id and
>> pav.attribute_id = a.attribute_id and p.product_id in
>> (select product_id
>>  from category_product
>>  where category_id = some category id) and
>> p.is_active = 'true'
>> group by pav.attribute_value_id;
>> It would seem to me that although the above query suggests a
>> normalized database structure, that joining with 3 tables plus a
>> 4th table in the sub-query with an IN qualifier and grouping to
>> get the product counts would take a VERY long time, especially on
>> a possible result set of 1,260,658 products. The other issue is
>> trying to figure out what the remaining set of possible attribute
>> and attribute values are in order to reach the remaining set of
>> products.
>> Does anyone have any insights into what kind of data structures
>> would be necessary to accomplish such a feat? I know that
>> PostgreSQL has performance issues with counting rows, so I can't
>> imagine being able to use the above kind of query to get the
>> results we need. I don't know what kind of database backend
>> mysimon is using either. It would also seem to be logical that
>> having a flattened data structure would seem to be necessary in
>> order to get the performance required. Their pages seem to load
>> pretty fast.
>> We are possibly considering some kind of pre-computed decision-
>> tree type data structure to get the counts of the set of products
>> that could be reached by selecting any combination of attributes
>> and attribute values. Does this seem like a reasonable idea?
>> Perhaps there's a better way?
>> I also apologize if this isn't the appropriate list to publish
>> this kind of question to. The reason I posted here was because it
>> is a performance related question, but not necessarily directly
>> related to PostgreSQL. It's just that we are using PostgreSQL for
>> our product comparison engine so I thought there could be some
>> PostgreSQL specific optimizations that could be made. If not,
>> please let me know and I'll move it elsewhere. Thanks very much,
>> *
>> *____________________________________________________________________
>> *Brendan Duddridge* | CTO | 403-277-5591 x24 |
>> brendan@clickspace.com <mailto:brendan@clickspace.com>
>> *
>> *ClickSpace Interactive Inc.
>> Suite L100, 239 - 10th Ave. SE
>> Calgary, AB  T2G 0V9
>> http://www.clickspace.com
>
>


Вложения

Re: OT: Data structure design question: How do they count

От
Richard Huxton
Дата:
Brendan Duddridge wrote:
>
> Now, initially I thought they would just pre-compute these counts, but
> the problem is, when you click on any of the above attribute values,
> they reduce the remaining possible set of matching products (and set of
> possible remaining attributes and attribute values) by the amount
> displayed next to the attribute value selected. You can click on any
> combination of attribute values to filter down the remaining set of
> matching products, so there's a large combination of paths you can take
> to arrive at a set of products you might be interested in.
>
> Do you think they are pre-computed? Or do you think they might use a
> query similar to the following?:

Pre-computed almost certainly, but at what level of granularity? And
with application-level caching?

> select pav.attribute_value_id, count(p.product_id)
> from product_attribute_value pav,
>      attribute a,
>      product p
> where a.attribute_id in (some set of attribute ids) and
> pav.product_id = p.product_id and
> pav.attribute_id = a.attribute_id and p.product_id in
>     (select product_id
>      from category_product
>      where category_id = some category id) and
> p.is_active = 'true'
> group by pav.attribute_value_id;
>
> It would seem to me that although the above query suggests a normalized
> database structure, that joining with 3 tables plus a 4th table in the
> sub-query with an IN qualifier and grouping to get the product counts
> would take a VERY long time, especially on a possible result set of
> 1,260,658 products.

Hmm - I'm not sure I'd say this was necessarily normalised. In the
example you gave there were three definite types of attribute:
  1. Price range (< 20, 20-50, ...)
  2. Product type (lighting, rugs, ...)
  3. Store (art.com, homeannex, ...)
Your example discards this type information.

I'm also not sure it lets store A sell widgets for 19.99 and B for 25.99

So - let's look at how we might break this down into simple relations:
  product_types (product_id, prod_type, prod_subtype)
  product_availability (product_id, store_id, price_range)
and so on for each set of parameters.

Then, if PG isn't calculating fast enough I'd be tempted to throw in a
summary table:
  product_counts(store_id, price_range, prod_type, prod_subtype, ...,
num_products)
Then total over this for the top-level queries.

I'd also cache common top-level queries at the applicaton level anyway.

--
   Richard Huxton
   Archonet Ltd

Re: OT: Data structure design question: How do they count so fast?

От
Brendan Duddridge
Дата:
Hi Richard (and anyone else who want's to comment!),

I'm not sure it will really work pre-computed. At least not in an
obvious way (for me! :-)) It's fine to display a pre-computed list of
product counts for the initial set of attribute and attribute values,
but we need to be able to display the counts of products for any
combination of selected attribute values.

Once an attribute value is picked that reduces the set of products to
a small enough set, the queries are fast, so, perhaps we just need to
pre-compute the counts for combinations of attribute values that lead
to a high count of products. Hence, our thought on building a
decision-tree type data structure. This would store the various
combinations of attributes and attribute values, along with the
product count, along with a list of attributes and attribute values
that are applicable for the current set of selected attribute values.
This sounds like it could get rather complicated, so we were hoping
someone might have an idea on a much simpler solution.

Thanks,

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 |  brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 10, 2006, at 3:23 AM, Richard Huxton wrote:

> Brendan Duddridge wrote:
>> Now, initially I thought they would just pre-compute these counts,
>> but the problem is, when you click on any of the above attribute
>> values, they reduce the remaining possible set of matching
>> products (and set of possible remaining attributes and attribute
>> values) by the amount displayed next to the attribute value
>> selected. You can click on any combination of attribute values to
>> filter down the remaining set of matching products, so there's a
>> large combination of paths you can take to arrive at a set of
>> products you might be interested in.
>> Do you think they are pre-computed? Or do you think they might use
>> a query similar to the following?:
>
> Pre-computed almost certainly, but at what level of granularity?
> And with application-level caching?
>
>> select pav.attribute_value_id, count(p.product_id)
>> from product_attribute_value pav,
>>      attribute a,
>>      product p
>> where a.attribute_id in (some set of attribute ids) and
>> pav.product_id = p.product_id and
>> pav.attribute_id = a.attribute_id and p.product_id in
>>     (select product_id
>>      from category_product
>>      where category_id = some category id) and
>> p.is_active = 'true'
>> group by pav.attribute_value_id;
>> It would seem to me that although the above query suggests a
>> normalized database structure, that joining with 3 tables plus a
>> 4th table in the sub-query with an IN qualifier and grouping to
>> get the product counts would take a VERY long time, especially on
>> a possible result set of 1,260,658 products.
>
> Hmm - I'm not sure I'd say this was necessarily normalised. In the
> example you gave there were three definite types of attribute:
>  1. Price range (< 20, 20-50, ...)
>  2. Product type (lighting, rugs, ...)
>  3. Store (art.com, homeannex, ...)
> Your example discards this type information.
>
> I'm also not sure it lets store A sell widgets for 19.99 and B for
> 25.99
>
> So - let's look at how we might break this down into simple relations:
>  product_types (product_id, prod_type, prod_subtype)
>  product_availability (product_id, store_id, price_range)
> and so on for each set of parameters.
>
> Then, if PG isn't calculating fast enough I'd be tempted to throw
> in a summary table:
>  product_counts(store_id, price_range, prod_type,
> prod_subtype, ..., num_products)
> Then total over this for the top-level queries.
>
> I'd also cache common top-level queries at the applicaton level
> anyway.
>
> --
>   Richard Huxton
>   Archonet Ltd
>