Обсуждение: Could use some advice on search architecture

Поиск

Список

Период

Сортировка

Could use some advice on search architecture

От

Ron Pasch

Дата:

18 апреля 2014 г., 13:59:59

Hello,

I'm contemplating what architecture I should use to make searching as
fast as possible given the information available and the search
requirements. Let me give some background first;

- The database contains products of can potentially have a lot of them
(up to about 3 to 5 million)
- Each product has about 30 different properties defined about them.
Things like what color they are etc. All these properties are enumerated
choices, so for instance for color there is a list of available static
never changing options of which one can be chosen for that product. This
is the same for all those 30 properties. Currently they are stored as
enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
'option2', etc..)
- It should be possible to search for products and provide properties
that the product SHOULD have, not must have. For instance, for color,
the search could specify that it should return products that are either
red, blue or green.
- The products that match with the most properties should be in the top
of the search results
- If different products match with the same amount of properties, the
ordering should then be on the product that is most popular. There is
information in the database (and if need be also in the same table)
about how many times a product is sold.
- The results will be paginated per 15 products

The requirement is that these searches should be as fast as possible,
with a maximum of about 200 ms time taken for a search query.

What would be the best approach to this if I were to do this in the
database only? Should/can this be done with postgresql only or should I
look into other types of technology? (Lucene? Sphinx? others?)

Any advice on this would be greatly appreciated.

Thx in advance!

Ron

Re: Could use some advice on search architecture

От

Susan Cassidy

Дата:

18 апреля 2014 г., 15:10:18

First, I would not restrict color to 30 colors, if anything like furniture or clothing, etc. is involved. Colors are very important to consumers, and exact colors are important. I would re-think my color selections.

Make sure you have indexes on all the appropriate columns, of course.

Susan

On Fri, Apr 18, 2014 at 6:59 AM, Ron Pasch <postgresql@ronpasch.nl> wrote:

Hello,

I'm contemplating what architecture I should use to make searching as fast as possible given the information available and the search requirements. Let me give some background first;

- The database contains products of can potentially have a lot of them (up to about 3 to 5 million)
- Each product has about 30 different properties defined about them. Things like what color they are etc. All these properties are enumerated choices, so for instance for color there is a list of available static never changing options of which one can be chosen for that product. This is the same for all those 30 properties. Currently they are stored as enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1', 'option2', etc..)
- It should be possible to search for products and provide properties that the product SHOULD have, not must have. For instance, for color, the search could specify that it should return products that are either red, blue or green.
- The products that match with the most properties should be in the top of the search results
- If different products match with the same amount of properties, the ordering should then be on the product that is most popular. There is information in the database (and if need be also in the same table) about how many times a product is sold.
- The results will be paginated per 15 products

The requirement is that these searches should be as fast as possible, with a maximum of about 200 ms time taken for a search query.

What would be the best approach to this if I were to do this in the database only? Should/can this be done with postgresql only or should I look into other types of technology? (Lucene? Sphinx? others?)

Any advice on this would be greatly appreciated.

Thx in advance!

Ron

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Could use some advice on search architecture

От

Andy Colson

Дата:

18 апреля 2014 г., 20:24:07

On 4/18/2014 8:59 AM, Ron Pasch wrote:
> Hello,
>
> I'm contemplating what architecture I should use to make searching as
> fast as possible given the information available and the search
> requirements. Let me give some background first;
>
> - The database contains products of can potentially have a lot of them
> (up to about 3 to 5 million)
> - Each product has about 30 different properties defined about them.
> Things like what color they are etc. All these properties are enumerated
> choices, so for instance for color there is a list of available static
> never changing options of which one can be chosen for that product. This
> is the same for all those 30 properties. Currently they are stored as
> enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
> 'option2', etc..)
> - It should be possible to search for products and provide properties
> that the product SHOULD have, not must have. For instance, for color,
> the search could specify that it should return products that are either
> red, blue or green.
> - The products that match with the most properties should be in the top
> of the search results
> - If different products match with the same amount of properties, the
> ordering should then be on the product that is most popular. There is
> information in the database (and if need be also in the same table)
> about how many times a product is sold.
> - The results will be paginated per 15 products
>
> The requirement is that these searches should be as fast as possible,
> with a maximum of about 200 ms time taken for a search query.
>
> What would be the best approach to this if I were to do this in the
> database only? Should/can this be done with postgresql only or should I
> look into other types of technology? (Lucene? Sphinx? others?)
>
> Any advice on this would be greatly appreciated.
>
> Thx in advance!
>
> Ron
>
>
>


As we are PG users, on a PG list, we are gonna recommend PG, obviously. :-)

Actually though, I recommend PG.


 > - The products that match with the most properties should be in the top
 > of the search results


That kinda query is going to be really difficult, I think, regardless of
what you use.  To find that you'll have to measure every product (all 5
million) and then sort the results.


 > enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
 > 'option2', etc..)

So, you have something like:

CREATE TYPE colortype AS ENUM ('red', 'green', 'blue');
CREATE TYPE sizetype AS ENUM ('small', 'medium', 'large');

create table product
(
   id serial,
   name text,
   color colortype,
   size  sizetype,
    ...
);


I assume the problem is you don't want to index all 30 properties?  That
makes sense.

 > - It should be possible to search for products and provide properties
 > that the product SHOULD have, not must have.

I don't understand this.  Say you have a sprocket in red and green.  Do
you want to search for:

select * from product where name = 'sprocket' and (color = 'red' or
color = 'green')

Or do you want something else?  Does the user say they'd "prefer" blue,
but will take whatever else you have?

Do you search for some properties exactly and some "preferred"?

Perhaps you could describe a little more how you want to query the
database?  Or, maybe, what your user's are searching for?

-Andy

Re: Could use some advice on search architecture

От

Robin

Дата:

19 апреля 2014 г., 07:39:26


On 18/04/2014 21:24, Andy Colson wrote:
> On 4/18/2014 8:59 AM, Ron Pasch wrote:
>> Hello,
>>
>> I'm contemplating what architecture I should use to make searching as
>> fast as possible given the information available and the search
>> requirements. Let me give some background first;
>>
>> - The database contains products of can potentially have a lot of them
>> (up to about 3 to 5 million)
>> - Each product has about 30 different properties defined about them.
>> Things like what color they are etc. All these properties are enumerated
>> choices, so for instance for color there is a list of available static
>> never changing options of which one can be chosen for that product. This
>> is the same for all those 30 properties. Currently they are stored as
>> enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
>> 'option2', etc..)
>> - It should be possible to search for products and provide properties
>> that the product SHOULD have, not must have. For instance, for color,
>> the search could specify that it should return products that are either
>> red, blue or green.
>> - The products that match with the most properties should be in the top
>> of the search results
>> - If different products match with the same amount of properties, the
>> ordering should then be on the product that is most popular. There is
>> information in the database (and if need be also in the same table)
>> about how many times a product is sold.
>> - The results will be paginated per 15 products
>>
>> The requirement is that these searches should be as fast as possible,
>> with a maximum of about 200 ms time taken for a search query.
>>
>> What would be the best approach to this if I were to do this in the
>> database only? Should/can this be done with postgresql only or should I
>> look into other types of technology? (Lucene? Sphinx? others?)
>>
>> Any advice on this would be greatly appreciated.
>>
>> Thx in advance!
>>
>> Ron
>>
>>
>>
>
>
> As we are PG users, on a PG list, we are gonna recommend PG,
> obviously. :-)
>
> Actually though, I recommend PG.
>
>
> > - The products that match with the most properties should be in the top
> > of the search results
>
>
> That kinda query is going to be really difficult, I think, regardless
> of what you use.  To find that you'll have to measure every product
> (all 5 million) and then sort the results.
>
>
> > enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
> > 'option2', etc..)
>
> So, you have something like:
>
> CREATE TYPE colortype AS ENUM ('red', 'green', 'blue');
> CREATE TYPE sizetype AS ENUM ('small', 'medium', 'large');
>
> create table product
> (
>   id serial,
>   name text,
>   color colortype,
>   size  sizetype,
>    ...
> );
>
>
> I assume the problem is you don't want to index all 30 properties?
> That makes sense.
>
> > - It should be possible to search for products and provide properties
> > that the product SHOULD have, not must have.
>
> I don't understand this.  Say you have a sprocket in red and green.
> Do you want to search for:
>
> select * from product where name = 'sprocket' and (color = 'red' or
> color = 'green')
>
> Or do you want something else?  Does the user say they'd "prefer"
> blue, but will take whatever else you have?
>
> Do you search for some properties exactly and some "preferred"?
>
> Perhaps you could describe a little more how you want to query the
> database?  Or, maybe, what your user's are searching for?
>
> -Andy
>
>
Well, given that there are known limited attributes, this is the type of
application that really really suits a column oriented database, such as
Sybase IQ (now sold by SAP). Its a neat product that scales. Great
performance with drag'n'drop analytics.

Unless you can charm IQ out of SAP (it has been known to happen), you
might have to look at some other techniques

So consider some binary data representation
Red - 1 (0000 0001)
Orange - 2 (0000 0010)
Yellow - 4 (0000 0100)
Green - 8 (0000 1000)
Blue - 16 (0001 0000)
Indigo - 32 (0010 0000)
Violet - 64 (0100 0000)

This way, you can encode several colours in 1 value
Red or Green or Indigo = 1 + 8  + 32 = 41 = 0010 1001


Robin

Re: Could use some advice on search architecture

От

Marc Mamin

Дата:

19 апреля 2014 г., 11:49:40

Hi,

Seems that this blog post is worth reading in your case
http://hlinnaka.iki.fi/2014/03/28/gin-as-a-substitute-for-bitmap-indexes/
regards,
Marc Mamin

________________________________________
Von: pgsql-general-owner@postgresql.org [pgsql-general-owner@postgresql.org]" im Auftrag von "Robin
[robinstc@live.co.uk]
Gesendet: Samstag, 19. April 2014 09:38
An: pgsql-general@postgresql.org
Betreff: Re: [GENERAL] Could use some advice on search architecture

On 18/04/2014 21:24, Andy Colson wrote:
> On 4/18/2014 8:59 AM, Ron Pasch wrote:
>> Hello,
>>
>> I'm contemplating what architecture I should use to make searching as
>> fast as possible given the information available and the search
>> requirements. Let me give some background first;
>>
>> - The database contains products of can potentially have a lot of them
>> (up to about 3 to 5 million)
>> - Each product has about 30 different properties defined about them.
>> Things like what color they are etc. All these properties are enumerated
>> choices, so for instance for color there is a list of available static
>> never changing options of which one can be chosen for that product. This
>> is the same for all those 30 properties. Currently they are stored as
>> enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
>> 'option2', etc..)
>> - It should be possible to search for products and provide properties
>> that the product SHOULD have, not must have. For instance, for color,
>> the search could specify that it should return products that are either
>> red, blue or green.
>> - The products that match with the most properties should be in the top
>> of the search results
>> - If different products match with the same amount of properties, the
>> ordering should then be on the product that is most popular. There is
>> information in the database (and if need be also in the same table)
>> about how many times a product is sold.
>> - The results will be paginated per 15 products
>>
>> The requirement is that these searches should be as fast as possible,
>> with a maximum of about 200 ms time taken for a search query.
>>
>> What would be the best approach to this if I were to do this in the
>> database only? Should/can this be done with postgresql only or should I
>> look into other types of technology? (Lucene? Sphinx? others?)
>>
>> Any advice on this would be greatly appreciated.
>>
>> Thx in advance!
>>
>> Ron
>>
>>
>>
>
>
> As we are PG users, on a PG list, we are gonna recommend PG,
> obviously. :-)
>
> Actually though, I recommend PG.
>
>
> > - The products that match with the most properties should be in the top
> > of the search results
>
>
> That kinda query is going to be really difficult, I think, regardless
> of what you use.  To find that you'll have to measure every product
> (all 5 million) and then sort the results.
>
>
> > enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
> > 'option2', etc..)
>
> So, you have something like:
>
> CREATE TYPE colortype AS ENUM ('red', 'green', 'blue');
> CREATE TYPE sizetype AS ENUM ('small', 'medium', 'large');
>
> create table product
> (
>   id serial,
>   name text,
>   color colortype,
>   size  sizetype,
>    ...
> );
>
>
> I assume the problem is you don't want to index all 30 properties?
> That makes sense.
>
> > - It should be possible to search for products and provide properties
> > that the product SHOULD have, not must have.
>
> I don't understand this.  Say you have a sprocket in red and green.
> Do you want to search for:
>
> select * from product where name = 'sprocket' and (color = 'red' or
> color = 'green')
>
> Or do you want something else?  Does the user say they'd "prefer"
> blue, but will take whatever else you have?
>
> Do you search for some properties exactly and some "preferred"?
>
> Perhaps you could describe a little more how you want to query the
> database?  Or, maybe, what your user's are searching for?
>
> -Andy
>
>
Well, given that there are known limited attributes, this is the type of
application that really really suits a column oriented database, such as
Sybase IQ (now sold by SAP). Its a neat product that scales. Great
performance with drag'n'drop analytics.

Unless you can charm IQ out of SAP (it has been known to happen), you
might have to look at some other techniques

So consider some binary data representation
Red - 1 (0000 0001)
Orange - 2 (0000 0010)
Yellow - 4 (0000 0100)
Green - 8 (0000 1000)
Blue - 16 (0001 0000)
Indigo - 32 (0010 0000)
Violet - 64 (0100 0000)

This way, you can encode several colours in 1 value
Red or Green or Indigo = 1 + 8  + 32 = 41 = 0010 1001


Robin


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Could use some advice on search architecture

От

Robin

Дата:

19 апреля 2014 г., 12:11:10

bottom post
On 19/04/2014 12:46, R. Pasch wrote:

On 19-4-2014 9:38, Robin wrote:

Well, given that there are known limited attributes, this is the type of application that really really suits a column oriented database, such as Sybase IQ (now sold by SAP). Its a neat product that scales. Great performance with drag'n'drop analytics.

Unless you can charm IQ out of SAP (it has been known to happen), you might have to look at some other techniques

So consider some binary data representation
Red - 1 (0000 0001)
Orange - 2 (0000 0010)
Yellow - 4 (0000 0100)
Green - 8 (0000 1000)
Blue - 16 (0001 0000)
Indigo - 32 (0010 0000)
Violet - 64 (0100 0000)

This way, you can encode several colours in 1 value
Red or Green or Indigo = 1 + 8 + 32 = 41 = 0010 1001

Robin

I stopped reading when I heard the word "sold by SAP" ;-) This project is solely build with open-source and freely available software.

I've been thinking about using a binary data representation but didn't come to a solution to this specific problem quite yet. Per property of a product, only one bit would be 1 and the rest would be 0. What would a query look like to match all products that have a bit in the correct position?

Say for instance these are a couple records (and yes, property values can be null as well)

title, property1, property2, property3
================================
product1, 0000 0001, 0000 0010, NULL
product2, 0000 0100, 0100 0000, 0010 0000
product3, 0010 0000, 0010 0000, 0100 0000

Say that I would like to retrieve the products that either have property1 as 0010 0000, 1000 000 or 0000 0001. Combined that would be 0010 1001 and would have to match product1 and product3 as they both have their individual bit matching one of the bits being asked for. What would a where statement look like using this type of binary representation?

If that would be fairly simple to do and fast (most important factor) then I could do an OR construction on all property columns and have something count the amount of properties that actually matched. Is that something you can do with a binary operator of some sort as well? Count the amount of overlapping bits?

Say for instance I have a binary value of 0110 0101 and another binary value of 1100 0100, how could I found out how many bits matched? (in this case the number of matching bits would be 2)

I understand the reluctance to pay SAP-style rates, as a longtime DB user, I have learned some 'charm' techniques.

However, I poked around a bit for alternatives, as I do like the column-oriented approach, and found something called - MonetDB - it apparently has a column-store db kernel, and is open source - I suggest you have a look, if it does what it says on the label, then it looks like a find.

There is a discussion of bitmask-trickiness here also dealing with colours

Robin

Re: Could use some advice on search architecture

От

Dorian Hoxha

Дата:

19 апреля 2014 г., 12:24:26

Postgresql has 2 column store, 1-in memory(cant remember the name) and http://www.citusdata.com/blog/76-postgresql-columnar-store-for-analytics

On Sat, Apr 19, 2014 at 2:10 PM, Robin <robinstc@live.co.uk> wrote:

bottom post
On 19/04/2014 12:46, R. Pasch wrote:
On 19-4-2014 9:38, Robin wrote:

Well, given that there are known limited attributes, this is the type of application that really really suits a column oriented database, such as Sybase IQ (now sold by SAP). Its a neat product that scales. Great performance with drag'n'drop analytics.

Unless you can charm IQ out of SAP (it has been known to happen), you might have to look at some other techniques

So consider some binary data representation
Red - 1 (0000 0001)
Orange - 2 (0000 0010)
Yellow - 4 (0000 0100)
Green - 8 (0000 1000)
Blue - 16 (0001 0000)
Indigo - 32 (0010 0000)
Violet - 64 (0100 0000)

This way, you can encode several colours in 1 value
Red or Green or Indigo = 1 + 8 + 32 = 41 = 0010 1001

Robin

I stopped reading when I heard the word "sold by SAP" ;-) This project is solely build with open-source and freely available software.

I've been thinking about using a binary data representation but didn't come to a solution to this specific problem quite yet. Per property of a product, only one bit would be 1 and the rest would be 0. What would a query look like to match all products that have a bit in the correct position?

Say for instance these are a couple records (and yes, property values can be null as well)

title, property1, property2, property3
================================
product1, 0000 0001, 0000 0010, NULL
product2, 0000 0100, 0100 0000, 0010 0000
product3, 0010 0000, 0010 0000, 0100 0000

Say that I would like to retrieve the products that either have property1 as 0010 0000, 1000 000 or 0000 0001. Combined that would be 0010 1001 and would have to match product1 and product3 as they both have their individual bit matching one of the bits being asked for. What would a where statement look like using this type of binary representation?

If that would be fairly simple to do and fast (most important factor) then I could do an OR construction on all property columns and have something count the amount of properties that actually matched. Is that something you can do with a binary operator of some sort as well? Count the amount of overlapping bits?

Say for instance I have a binary value of 0110 0101 and another binary value of 1100 0100, how could I found out how many bits matched? (in this case the number of matching bits would be 2)

I understand the reluctance to pay SAP-style rates, as a longtime DB user, I have learned some 'charm' techniques.

However, I poked around a bit for alternatives, as I do like the column-oriented approach, and found something called - MonetDB - it apparently has a column-store db kernel, and is open source - I suggest you have a look, if it does what it says on the label, then it looks like a find.

There is a discussion of bitmask-trickiness here also dealing with colours

Robin

Re: Could use some advice on search architecture

От

Andy Colson

Дата:

19 апреля 2014 г., 15:04:28

On 04/19/2014 06:26 AM, Ron Pasch wrote:
>  > - It should be possible to search for products and provide properties
>  > that the product SHOULD have, not must have.
>>
>> I don't understand this.  Say you have a sprocket in red and green. Do you want to search for:
>>
>> select * from product where name = 'sprocket' and (color = 'red' or color = 'green')
>>
>> Or do you want something else?  Does the user say they'd "prefer" blue, but will take whatever else you have?
>>
>> Do you search for some properties exactly and some "preferred"?
>>
>> Perhaps you could describe a little more how you want to query the database?  Or, maybe, what your user's are
searchingfor? 
>>
>> -Andy
>>
>>
>
> Yes, the user can prefer certain properties and the products that match most of the properties should be in the top
ofthe results, but if a product doesn't match all of them but just some of them, they should still be returned, but
lowerin the results. 
>
> I'm seriously wondering if doing this solely with postgres is even possible without having long execution times. I've
donesome tests with 5 million records and just doing the "or" construction you mentioned above, which resulted in 600
to900 ms queries and returning only those records of which all properties match at least one selected value. 
>
> I was thinking that perhaps using a search engine like lucene or sphinx would be more appropriate, but then I wonder
whatI would exactly be indexing and how I would be querying that, but that's a question for a different mailing list
;-)

Please keep the list cc'd, so others can help as well.

Yeah, doing a bunch of or's is gonna have to test all 5 million products.

I wonder if there is a way we can treat this like a two step process.

1) cut down the number of products

2) sort them by #matches, popularity, etc

You've talked about #2, but how about #1.  Is there any way to either include or exclude a product?  Users don't just
askfor red, they ask for "tires (maybe red)".  Not all 5 million products are tires, right? 

-Andy

Re: Could use some advice on search architecture

От

Marc Mamin

Дата:

19 апреля 2014 г., 17:36:31

>On 04/19/2014 06:26 AM, Ron Pasch wrote:
>>  > - It should be possible to search for products and provide properties
>>  > that the product SHOULD have, not must have.
>>>
>>> I don't understand this.  Say you have a sprocket in red and green. Do you want to search for:
>>>
>>> select * from product where name = 'sprocket' and (color = 'red' or color = 'green')
>>>
>>> Or do you want something else?  Does the user say they'd "prefer" blue, but will take whatever else you have?
>>>
>>> Do you search for some properties exactly and some "preferred"?
>>>
>>> Perhaps you could describe a little more how you want to query the database?  Or, maybe, what your user's are
searchingfor? 
>>>
>>> -Andy
>>>
>>
>> Yes, the user can prefer certain properties and the products that match most of the properties should be in the top
ofthe results, but if a product doesn't match all of them but just some of them, they should still be returned, but
lowerin the results. 
>>
>> I'm seriously wondering if doing this solely with postgres is even possible without having long execution times.
I'vedone some tests with 5 million records and just doing the "or" construction you mentioned above, which resulted in
600to 900 ms queries and returning only those records of which all properties match at least one selected value. 

 I don't think that OR clauses are the right way as the aim is to count how many attributes do match the search.
 basically a standard approach would look like:

     SELECT pID, sum(match) as matches FROM
     (
     selct pID, 1 as match from Products where color ='pink'
     UNION ALL
     selct pID, 1 as match from Products where size ='XXL'
     ...
     )foo
    GROUP BY pID order by matches DESC

How many distinct attributes are involved ? ( 15 colors + 9 sizes + ....)

Marc

>>
>> I was thinking that perhaps using a search engine like lucene or sphinx would be more appropriate, but then I wonder
whatI would exactly be indexing and how I would be querying that, but that's a question for a different mailing list
;-)
>
>Please keep the list cc'd, so others can help as well.
>
>Yeah, doing a bunch of or's is gonna have to test all 5 million products.
>
>I wonder if there is a way we can treat this like a two step process.
>
>1) cut down the number of products
>
>2) sort them by #matches, popularity, etc
>
>You've talked about #2, but how about #1.  Is there any way to either include or exclude a product?  Users don't just
askfor red, they ask for "tires (maybe red)".  Not all 5 million products are tires, right? 
>
>-Andy

Re: Could use some advice on search architecture

От

Robin

Дата:

19 апреля 2014 г., 22:58:43

Because this is a topic that interests me I have done some digging.

MonetDB is a proper column-store DBMS. It is used on some decent sized projects, several of which are EU-funded.

Digging a bit deeper I have discovered that a PostgreSQL Foreign Data Wrapper for MonetDB has been created - monetdb_fdw.

The fdw is written up here

There is a video that shows what happens when a query is executed in PostgreSQL and MonetDB

The video shows an analytical query being processed

Using PostgreSQL - 177 seconds
Using MonetDB - 8 seconds
Using a remote MonetDB server through MonetDB FDW - 1 second

I think its worth investigating further

Robin St.Clair

Re: Could use some advice on search architecture

От

Jov

Дата:

20 апреля 2014 г., 00:05:42

the other is named imcs.
imcs is mainly a memory database,it is very fast for olap,because of multi threads parallel query plan and column storage.for billions of data,most group by like querys return within 2s.
we use it for several weeks and happy for the performance.

jov

在 2014-4-19 下午8:27，"Dorian Hoxha" <dorian.hoxha@gmail.com>写道：

Postgresql has 2 column store, 1-in memory(cant remember the name) and http://www.citusdata.com/blog/76-postgresql-columnar-store-for-analytics

On Sat, Apr 19, 2014 at 2:10 PM, Robin <robinstc@live.co.uk> wrote:
bottom post
On 19/04/2014 12:46, R. Pasch wrote:
On 19-4-2014 9:38, Robin wrote:

Well, given that there are known limited attributes, this is the type of application that really really suits a column oriented database, such as Sybase IQ (now sold by SAP). Its a neat product that scales. Great performance with drag'n'drop analytics.

Unless you can charm IQ out of SAP (it has been known to happen), you might have to look at some other techniques

So consider some binary data representation
Red - 1 (0000 0001)
Orange - 2 (0000 0010)
Yellow - 4 (0000 0100)
Green - 8 (0000 1000)
Blue - 16 (0001 0000)
Indigo - 32 (0010 0000)
Violet - 64 (0100 0000)

This way, you can encode several colours in 1 value
Red or Green or Indigo = 1 + 8 + 32 = 41 = 0010 1001

Robin

I stopped reading when I heard the word "sold by SAP" ;-) This project is solely build with open-source and freely available software.

I've been thinking about using a binary data representation but didn't come to a solution to this specific problem quite yet. Per property of a product, only one bit would be 1 and the rest would be 0. What would a query look like to match all products that have a bit in the correct position?

Say for instance these are a couple records (and yes, property values can be null as well)

title, property1, property2, property3
================================
product1, 0000 0001, 0000 0010, NULL
product2, 0000 0100, 0100 0000, 0010 0000
product3, 0010 0000, 0010 0000, 0100 0000

Say that I would like to retrieve the products that either have property1 as 0010 0000, 1000 000 or 0000 0001. Combined that would be 0010 1001 and would have to match product1 and product3 as they both have their individual bit matching one of the bits being asked for. What would a where statement look like using this type of binary representation?

If that would be fairly simple to do and fast (most important factor) then I could do an OR construction on all property columns and have something count the amount of properties that actually matched. Is that something you can do with a binary operator of some sort as well? Count the amount of overlapping bits?

Say for instance I have a binary value of 0110 0101 and another binary value of 1100 0100, how could I found out how many bits matched? (in this case the number of matching bits would be 2)

I understand the reluctance to pay SAP-style rates, as a longtime DB user, I have learned some 'charm' techniques.

However, I poked around a bit for alternatives, as I do like the column-oriented approach, and found something called - MonetDB - it apparently has a column-store db kernel, and is open source - I suggest you have a look, if it does what it says on the label, then it looks like a find.

There is a discussion of bitmask-trickiness here also dealing with colours

Robin

Re: Could use some advice on search architecture

От

Ron Pasch

Дата:

21 апреля 2014 г., 12:10:44

I ended up running some tests using 5 million rows of products. I used
about 5 properties that a product should always be matched to, and then
I used the following in the select;

(CASE property1 in (option1, option2, option3, etc) WHEN TRUE THEN 1
ELSE 0 END)
+ (CASE property2 in (option1, option2, option3, etc) WHEN TRUE THEN 1
ELSE 0 END)
+ (CASE property3 in (option1, option2, option3, etc) WHEN TRUE THEN 1
ELSE 0 END)
...
AS numberOfMatchingProperties

That way I can use the number of matching properties in the order by
clause and have the properties that must always match filter out the
bulk of the 5 million records.

The tests that I've done return around 100.000 records in about 100 to
150 milliseconds using this technique, and using OFFSET and LIMIT to
paginate those by about 15 records each time is very very fast.

This I can live with :) Thx for letting me pick your brains on this a
little.

Cheers,
Ron

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Could use some advice on search architecture