Обсуждение: Comparing user attributes with bitwise operators

Поиск
Список
Период
Сортировка

Comparing user attributes with bitwise operators

От
Patrick Clery
Дата:
I'm working on a dating/personals/match-making site, that has used many
different methods of "match-making", that all seem to be very slow. One I am
attempting now that seems to be an efficient method of storage, but not the
best for indexing, is using bitwise operators to compare one person's profile
to another's.

This method seems to be very fast on a small scale, but I am dealing with a
large user-base here, in excess of 200,000 users that will be executing this
search function every time they login (the search results of their profile
will appear on the main page after they have logged in). I've opted to use
"label tables" for each possible set of answers. (i.e: Marital Status)

For this table, the set of bits -- bit(5) -- are represented as such:

+-----+------------+
| Bit | Meaning    |
+-----+------------+
| 1   | single     |
| 2   | separated  |
| 3   | married    |
| 4   | divorced   |
| 5   | widowed    |
+-----+------------+

Here's the structure of the marital status table:

# \d marital_matrix
            
Table "public.marital_matrix"
  Column   |      Type      |                               Modifiers
-----------+----------------+-----------------------------------------------------------------------
 member_id | integer        | not null default
nextval('public.marital_matrix_member_id_seq'::text)
 status    | bit varying(5) | not null default (0)::bit(5)
 p_status  | bit varying(5) | not null default (0)::bit(5)
Indexes:
    "marital_matrix_pkey" PRIMARY KEY, btree (member_id)
    "idx_marital_matrix" btree ((status::"bit" & p_status::"bit"))
    "idx_marital_matrix_single" btree ((status::"bit" & p_status::"bit"))
    "idx_marital_p_status" btree (p_status)
    "idx_marital_status" btree (status)
Foreign-key constraints:
    "$1" FOREIGN KEY (member_id) REFERENCES members(member_id) ON DELETE
CASCADE DEFERRABLE INITIALLY DEFERRED

To give you an idea of the selectivity (NOTE: there are only 50,000 rows, a
smaller sample than what I will actually be using):

datingsite=> select count(*),status,p_status from marital_matrix group by
status,p_status;
 count | status | p_status
-------+--------+----------
    89 | 00001  | 00000
  1319 | 00010  | 00000
  2465 | 00100  | 00000
     1 | 00100  | 11111
 46117 | 10000  | 00000

here is the user I'll be comparing against, which has selected that he be
matched with any but married people:

datingsite=> SELECT * FROM marital_matrix WHERE member_id = 21;
 member_id | status | p_status
-----------+--------+----------
        21 | 10000  | 11011
(1 row)




Here are a few possible comparison methods I can think of (NOTE: tests were
run on a 2.4Ghz Intel CPU w/ 256M RAM on FreeBSD 4.10:


METHOD 1: Any bit that is set in p_status (prefered marital status) of the
searching user should be set in the potential match's marital status. This is
the method I'd like to improve, if possible. Running the query twice didn't
produce a different runtime.

EXPLAIN ANALYZE
SELECT
    m2.member_id
FROM
    marital_matrix m1, marital_matrix m2
WHERE
    m1.member_id = 21 AND
    m2.status & m1.p_status != B'00000';
                                                                QUERY PLAN
                    

---------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..2357.79 rows=49742 width=4) (actual
time=18.062..708.469 rows=47525 loops=1)
   Join Filter: ((("inner".status)::"bit" & ("outer".p_status)::"bit") <>
B'00000'::"bit")
   ->  Index Scan using marital_matrix_pkey on marital_matrix m1
(cost=0.00..5.01 rows=1 width=9) (actual time=0.035..0.045 rows=1 loops=1)
         Index Cond: (member_id = 21)
   ->  Seq Scan on marital_matrix m2  (cost=0.00..1602.91 rows=49991 width=13)
(actual time=17.966..255.529 rows=49991 loops=1)
 Total runtime: 905.694 ms
(6 rows)


METHOD 2: Specifying the value (I don't think this would make a difference,
but I'll post anyways):

EXPLAIN ANALYZE
SELECT
    member_id
FROM
    marital_matrix
WHERE
    status & B'11011' != B'00000';

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Seq Scan on marital_matrix  (cost=0.00..1852.87 rows=49742 width=4) (actual
time=18.113..281.101 rows=47525 loops=1)
   Filter: (((status)::"bit" & B'11011'::"bit") <> B'00000'::"bit")
 Total runtime: 480.836 ms
(3 rows)


METHOD 3: Checking for one bit only. This is definitely not a "real world"
example and unacceptable since the p_status column can and will have multiple
bits. For categories other than "Marital Status", such as "Prefered Hair
Color", the users are likely to select multiple bits (they choose all that
apply). This query does use the index, but is still not very fast at all:

EXPLAIN ANALYZE
SELECT
    member_id
FROM
    marital_matrix m1
WHERE
    status & B'10000' = B'10000';
                                                                      QUERY
PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_marital_matrix_single on marital_matrix m1
(cost=0.00..903.59 rows=250 width=4) (actual time=0.042..258.907 rows=46117
loops=1)
   Index Cond: (((status)::"bit" & B'10000'::"bit") = B'10000'::"bit")
 Total runtime: 451.162 ms
(3 rows)

METHOD 4: Using an IN statement. This method seems to be very fussy about
using the index, and I have at some point made it use the index when there
are less than 3 possibilites. Also, for fields other than Marital Status,
users will be able to select many bits for their own profile, which means
there would be many permutations:

EXPLAIN ANALYZE
SELECT
    member_id
FROM
    marital_matrix
WHERE
    status & B'11011' IN (B'10000',B'01000');
          QUERY PLAN
                  

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on marital_matrix  (cost=0.00..2602.73 rows=993 width=4) (actual
time=17.845..288.279 rows=47525 loops=1)
   Filter: ((((status)::"bit" & B'11011'::"bit") = B'10000'::"bit") OR
(((status)::"bit" & B'11011'::"bit") = B'01000'::"bit") OR (((status)::"bit"
& B'11011'::"bit") = B'00010'::"bit") OR (((status)::"bit" & B'11011'::"bit")
= B'00001'::"bit"))
 Total runtime: 488.651 ms
(3 rows)


Method 3 is the only one that used the index, but the only really acceptable
method here is Method 1.

My questions are...
- Is there any hope in getting this to use an efficient index?
- Any mathmaticians know if there is a way to reorder my bitwise comparison to
have the operator use = and not an != (perhaps to force an index)? (AFAIK,
the answer to the second question is no)

If anyone could offer any performance tips here I'd really appreciate it. I
imagine that having this schema wouldn't last an hour with the amount of CPU
cycles it would be consuming on math operations.

Also, I have read the thread that was posted here by Daniel in August:
http://archives.postgresql.org/pgsql-performance/2004-08/msg00328.php

I have spoke with Daniel on this issue and we both agree it's very difficult
to find a solution that can scale to very large sites.

I would very much appreciate any advice that some experienced users may have
to offer me for such a situation. TIA

Patrick

Re: Comparing user attributes with bitwise operators

От
Christopher Kings-Lynne
Дата:
Sounds like you want a many-to-many table that maps user_ids to match_ids

Then you can put an index over (user_id, match_id) and the search will
be very fast.

Chris

Patrick Clery wrote:
> I'm working on a dating/personals/match-making site, that has used many
> different methods of "match-making", that all seem to be very slow. One I am
> attempting now that seems to be an efficient method of storage, but not the
> best for indexing, is using bitwise operators to compare one person's profile
> to another's.
>
> This method seems to be very fast on a small scale, but I am dealing with a
> large user-base here, in excess of 200,000 users that will be executing this
> search function every time they login (the search results of their profile
> will appear on the main page after they have logged in). I've opted to use
> "label tables" for each possible set of answers. (i.e: Marital Status)
>
> For this table, the set of bits -- bit(5) -- are represented as such:
>
> +-----+------------+
> | Bit | Meaning    |
> +-----+------------+
> | 1   | single     |
> | 2   | separated  |
> | 3   | married    |
> | 4   | divorced   |
> | 5   | widowed    |
> +-----+------------+
>
> Here's the structure of the marital status table:
>
> # \d marital_matrix
              
> Table "public.marital_matrix"
>   Column   |      Type      |                               Modifiers
> -----------+----------------+-----------------------------------------------------------------------
>  member_id | integer        | not null default
> nextval('public.marital_matrix_member_id_seq'::text)
>  status    | bit varying(5) | not null default (0)::bit(5)
>  p_status  | bit varying(5) | not null default (0)::bit(5)
> Indexes:
>     "marital_matrix_pkey" PRIMARY KEY, btree (member_id)
>     "idx_marital_matrix" btree ((status::"bit" & p_status::"bit"))
>     "idx_marital_matrix_single" btree ((status::"bit" & p_status::"bit"))
>     "idx_marital_p_status" btree (p_status)
>     "idx_marital_status" btree (status)
> Foreign-key constraints:
>     "$1" FOREIGN KEY (member_id) REFERENCES members(member_id) ON DELETE
> CASCADE DEFERRABLE INITIALLY DEFERRED
>
> To give you an idea of the selectivity (NOTE: there are only 50,000 rows, a
> smaller sample than what I will actually be using):
>
> datingsite=> select count(*),status,p_status from marital_matrix group by
> status,p_status;
>  count | status | p_status
> -------+--------+----------
>     89 | 00001  | 00000
>   1319 | 00010  | 00000
>   2465 | 00100  | 00000
>      1 | 00100  | 11111
>  46117 | 10000  | 00000
>
> here is the user I'll be comparing against, which has selected that he be
> matched with any but married people:
>
> datingsite=> SELECT * FROM marital_matrix WHERE member_id = 21;
>  member_id | status | p_status
> -----------+--------+----------
>         21 | 10000  | 11011
> (1 row)
>
>
>
>
> Here are a few possible comparison methods I can think of (NOTE: tests were
> run on a 2.4Ghz Intel CPU w/ 256M RAM on FreeBSD 4.10:
>
>
> METHOD 1: Any bit that is set in p_status (prefered marital status) of the
> searching user should be set in the potential match's marital status. This is
> the method I'd like to improve, if possible. Running the query twice didn't
> produce a different runtime.
>
> EXPLAIN ANALYZE
> SELECT
>     m2.member_id
> FROM
>     marital_matrix m1, marital_matrix m2
> WHERE
>     m1.member_id = 21 AND
>     m2.status & m1.p_status != B'00000';
>                                                                 QUERY PLAN
                      
>
---------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=0.00..2357.79 rows=49742 width=4) (actual
> time=18.062..708.469 rows=47525 loops=1)
>    Join Filter: ((("inner".status)::"bit" & ("outer".p_status)::"bit") <>
> B'00000'::"bit")
>    ->  Index Scan using marital_matrix_pkey on marital_matrix m1
> (cost=0.00..5.01 rows=1 width=9) (actual time=0.035..0.045 rows=1 loops=1)
>          Index Cond: (member_id = 21)
>    ->  Seq Scan on marital_matrix m2  (cost=0.00..1602.91 rows=49991 width=13)
> (actual time=17.966..255.529 rows=49991 loops=1)
>  Total runtime: 905.694 ms
> (6 rows)
>
>
> METHOD 2: Specifying the value (I don't think this would make a difference,
> but I'll post anyways):
>
> EXPLAIN ANALYZE
> SELECT
>     member_id
> FROM
>     marital_matrix
> WHERE
>     status & B'11011' != B'00000';
>
>                                                       QUERY PLAN

>
----------------------------------------------------------------------------------------------------------------------
>  Seq Scan on marital_matrix  (cost=0.00..1852.87 rows=49742 width=4) (actual
> time=18.113..281.101 rows=47525 loops=1)
>    Filter: (((status)::"bit" & B'11011'::"bit") <> B'00000'::"bit")
>  Total runtime: 480.836 ms
> (3 rows)
>
>
> METHOD 3: Checking for one bit only. This is definitely not a "real world"
> example and unacceptable since the p_status column can and will have multiple
> bits. For categories other than "Marital Status", such as "Prefered Hair
> Color", the users are likely to select multiple bits (they choose all that
> apply). This query does use the index, but is still not very fast at all:
>
> EXPLAIN ANALYZE
> SELECT
>     member_id
> FROM
>     marital_matrix m1
> WHERE
>     status & B'10000' = B'10000';
>                                                                       QUERY
> PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------------------------
>  Index Scan using idx_marital_matrix_single on marital_matrix m1
> (cost=0.00..903.59 rows=250 width=4) (actual time=0.042..258.907 rows=46117
> loops=1)
>    Index Cond: (((status)::"bit" & B'10000'::"bit") = B'10000'::"bit")
>  Total runtime: 451.162 ms
> (3 rows)
>
> METHOD 4: Using an IN statement. This method seems to be very fussy about
> using the index, and I have at some point made it use the index when there
> are less than 3 possibilites. Also, for fields other than Marital Status,
> users will be able to select many bits for their own profile, which means
> there would be many permutations:
>
> EXPLAIN ANALYZE
> SELECT
>     member_id
> FROM
>     marital_matrix
> WHERE
>     status & B'11011' IN (B'10000',B'01000');
>           QUERY PLAN
                    
>
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Seq Scan on marital_matrix  (cost=0.00..2602.73 rows=993 width=4) (actual
> time=17.845..288.279 rows=47525 loops=1)
>    Filter: ((((status)::"bit" & B'11011'::"bit") = B'10000'::"bit") OR
> (((status)::"bit" & B'11011'::"bit") = B'01000'::"bit") OR (((status)::"bit"
> & B'11011'::"bit") = B'00010'::"bit") OR (((status)::"bit" & B'11011'::"bit")
> = B'00001'::"bit"))
>  Total runtime: 488.651 ms
> (3 rows)
>
>
> Method 3 is the only one that used the index, but the only really acceptable
> method here is Method 1.
>
> My questions are...
> - Is there any hope in getting this to use an efficient index?
> - Any mathmaticians know if there is a way to reorder my bitwise comparison to
> have the operator use = and not an != (perhaps to force an index)? (AFAIK,
> the answer to the second question is no)
>
> If anyone could offer any performance tips here I'd really appreciate it. I
> imagine that having this schema wouldn't last an hour with the amount of CPU
> cycles it would be consuming on math operations.
>
> Also, I have read the thread that was posted here by Daniel in August:
> http://archives.postgresql.org/pgsql-performance/2004-08/msg00328.php
>
> I have spoke with Daniel on this issue and we both agree it's very difficult
> to find a solution that can scale to very large sites.
>
> I would very much appreciate any advice that some experienced users may have
> to offer me for such a situation. TIA
>
> Patrick
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

Re: Comparing user attributes with bitwise operators

От
Greg Stark
Дата:
Patrick Clery <patrick@phpforhire.com> writes:

> Method 3 is the only one that used the index, but the only really acceptable
> method here is Method 1.
>
> My questions are...
> - Is there any hope in getting this to use an efficient index?
> - Any mathmaticians know if there is a way to reorder my bitwise comparison to
> have the operator use = and not an != (perhaps to force an index)? (AFAIK,
> the answer to the second question is no)

The only kind of index that is capable of indexing this type of data structure
for arbitrary searches would be a GiST index. I'm not aware of any
implementation for bitfields, though it would be an appropriate use.

What there is now is the contrib/intarray package. You would have to store
more than just the bitfields, you would have to store an array of integer
flags. That might be denser actually if you end up with many flags few of
which are set.

GiST indexes allow you to search arbitrary combinations of set and unset
flags. using the "@@" operator

  int[] @@ query_int  - returns TRUE if array satisfies query (like '1&(2|3)')

You might be able to look at the code there and adapt it to apply to bit
fields. If so I think it would be a useful tool. But GiST indexing is pretty
esoteric stuff.

--
greg

Re: Comparing user attributes with bitwise operators

От
Greg Stark
Дата:
Patrick Clery <patrick@phpforhire.com> writes:

> Here's the structure of the marital status table:

Also I find it very odd that you have a "marital status table". marital status
is just one attribute of member. Do you expect to have more than one marital
status bitfield per member? How would you distinguish which one to use?

It's going to make it very hard to combine criteria against other attributes
even if you do manage to get a GiST index to work against marital status and
you do the same with the other, then postgres will have to do some sort of
merge join between them. It also means you'll have to write the same code over
and over for each of these tables.

I think you're much more likely to want to merge all these attributes into a
single "member_attributes" table, or even into the member table itself. Then
your goal would be to match all the member_attribute bits against all the
member_preferred bits in the right way.

The more conventional approach is to break them out into a fact separate
table:

member_id, attribute_id

And then just have a list of pairs that apply. This kind of normalized data is
much more flexible for writing all kinds of queries against. But like you've
found, it's hard to optimize this to be fast enough for transactional use.

I think the normal approach with dating sites is to leave this for a batch job
that populates a match table for everyone and just have the web site display
the contents out of that table.

--
greg

Re: Comparing user attributes with bitwise operators

От
Daniel Ceregatti
Дата:
Christopher Kings-Lynne wrote:

> Sounds like you want a many-to-many table that maps user_ids to match_ids
>
> Then you can put an index over (user_id, match_id) and the search will
> be very fast.
>
> Chris
>
If I understand you correctly, I believe I've tried this approach. While
matching on a single attribute and a single value was indeed very fast
and used an index, as soon as I tried to match on more than one value
(where valueid in (1, 2, 3)) the index was no longer used. Since my
approach used ints, I used in(), which is effectively "or", which is
presumably why the index is no longer used. With the bit, one would do a
bitwise "or" (where value & search = value). This cannot be easily
indexed, afaik.

The other problem I had with a 1:many table, where there was a row for
every person's attributes (~20M rows) was that somehow time was lost in
either sorting or somewhere else. Individual queries against a single
attribute would be very fast, but as soon as I tried to join another
attribute, the query times got really bad. See http://sh.nu/w/email.txt
line 392 (Don't worry, there are line numbers in the page).

So far I've stuck with my original plan, which is to maintain a 1:1
table of people:attributes where each attribute is in its own column.
Still, no index is used, but it's been the best performer up to now.

I'm still looking for a better plan though.

Daniel

--

Daniel Ceregatti - Programmer
Omnis Network, LLC

You are fighting for survival in your own sweet and gentle way.


Re: Comparing user attributes with bitwise operators

От
Patrick Clery
Дата:
I have currently implemented a schema for my "Dating Site" that is storing
user search preferences and user attributes in an int[] array using the
contrib/intarray package (suggested by Greg Stark). But there are a few
problems.
    a) query_int can't be cast to int4.
    b) query_int can't be indexed.

datingsite=> alter table people_attributes add column bla query_int;
ALTER TABLE
datingsite=> create index idx_query_int on people_attributes (bla);
ERROR:  data type query_int has no default operator class for access method
"btree"
HINT:  You must specify an operator class for the index or define a default
operator class for the data type.
datingsite=> create index idx_query_int on people_attributes (bla
gist__int_ops);
ERROR:  operator class "gist__int_ops" does not exist for access method
"btree"
datingsite=> alter table people_attributes drop column bla;
ALTER TABLE

    c) query_int can only be used in one operation against int[]:

    README.intarray:
    int[] @@ query_int  - returns TRUE if array satisfies query (like '1&(2|3)')

     It is not possible to use >=, <=, =, etc. Also, this operator does not work
like example says:

datingsite=> select '{2,3}'::int[] @@ '1'::query_int;
 ?column?
----------
 f
(1 row)

    d) I can't find a way to simply check if an integer is an array without
declaring it as an array; Therefore, I need to use an int[] type for a column
that will only be storing one int4 if I want to compare it to an int[] array:

    README.intarray:
    int[] && int[]  - overlap - returns TRUE if arrays has at least one common
elements.

    e) int[] and query_int are somewhat ugly to deal with since query_int needs
to be quoted as a string, and int[] is returned as '{1,2,3}'. Or maybe I'm
just being anal :)


Because of these limitations, I've chosen to declare the attribute columns as
int[] arrays (even though they will only contain one value) so that I can use
'{1,2,3}'::int[] && column_name:

    README.intarray:
    int[] && int[]  - overlap - returns TRUE if arrays has at least one common
elements.

Here is the schema:

create table people (
    person_id serial,
    datecreated timestamp with time zone default now (),
    signup_ip cidr not null,
    username character varying(30) not null,
    password character varying(28) not null,
    email character varying(65) not null,
    dob date not null,
    primary key (person_id)
);

create table people_attributes (
    person_id int references people (person_id) on delete cascade initially
deferred,
    askmecount int not null default 0,
    age int[] not null default '{1}'::int[],
    gender int[] not null default '{1}'::int[],
    orientation int[] not null default '{1}'::int[],
    bodytype int[] not null default '{1}'::int[],
    children int[] not null default '{1}'::int[],
    drinking int[] not null default '{1}'::int[],
    education int[] not null default '{1}'::int[],
    ethnicity int[] not null default '{1}'::int[],
    eyecolor int[] not null default '{1}'::int[],
    haircolor int[] not null default '{1}'::int[],
    hairstyle int[] not null default '{1}'::int[],
    height int[] not null default '{1}'::int[],
    income int[] not null default '{1}'::int[],
    occupation int[] not null default '{1}'::int[],
    relation int[] not null default '{1}'::int[], /* multiple answer */
    religion int[] not null default '{1}'::int[],
    seeking int[] not null default '{1}'::int[], /* multiple answer */
    smoking int[] not null default '{1}'::int[],
    want_children int[] not null default '{1}'::int[],
    weight int[] not null default '{1}'::int[],

    primary key (person_id)
)
without oids;

create index people_attributes_search on people_attributes using gist (
    age gist__int_ops,
    gender gist__int_ops,
    orientation gist__int_ops,
    bodytype gist__int_ops,
    children gist__int_ops,
    drinking gist__int_ops,
    education gist__int_ops,
    ethnicity gist__int_ops,
    eyecolor gist__int_ops,
    haircolor gist__int_ops,
    hairstyle gist__int_ops,
    height gist__int_ops,
    income gist__int_ops,
    occupation gist__int_ops,
    relation gist__int_ops,
    religion gist__int_ops,
    seeking gist__int_ops,
    smoking gist__int_ops,
    want_children gist__int_ops,
    weight gist__int_ops
    );

/* These will be compared against the people_attributes table */
create table people_searchprefs (
    person_id int references people (person_id) on delete cascade initially
deferred,
    age int[] not null default
'{18,19,20,21,22,23,24,25,26,27,28,29,30}'::int[],
    gender int[] not null default '{1,2,4}'::int[],
    orientation int[] not null default '{1,2,8}'::int[],
    bodytype int[] not null default '{1,2,3,4,5,6}'::int[],
    children int[] not null default '{0}'::int[],
    drinking int[] not null default '{0}'::int[],
    education int[] not null default '{0}'::int[],
    ethnicity int[] not null default '{0}'::int[],
    eyecolor int[] not null default '{0}'::int[],
    haircolor int[] not null default '{0}'::int[],
    hairstyle int[] not null default '{0}'::int[],
    height int[] not null default '{0}'::int[],
    income int[] not null default '{0}'::int[],
    occupation int[] not null default '{0}'::int[],
    relation int[] not null default '{0}'::int[],
    religion int[] not null default '{0}'::int[],
    seeking int[] not null default '{0}'::int[],
    smoking int[] not null default '{0}'::int[],
    want_children int[] not null default '{0}'::int[],
    weight int[] not null default '{0}'::int[],

    primary key (person_id)
)
without oids;


And now, the moment you've all been waiting for: performance!


(Number of profiles)

datingsite=> select count(*) from people_attributes ;
 count
-------
 96146
(1 row)

(age, gender and sexual orientation will always be a part of the query, and
are necessary to be invoke the index. The query is to show females, age 30-40
of any orientation. But first, without the index)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation;
                                                                   QUERY PLAN
                         

-------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on people_attributes  (cost=0.00..9078.56 rows=1 width=36) (actual
time=0.044..299.537 rows=937 loops=1)
   Filter: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND
('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 304.707 ms
(3 rows)


( with the index )

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation;
                                                                     QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=36) (actual time=0.064..52.383 rows=937
loops=1)
   Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND
('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 57.032 ms
(3 rows)

(more realistically, it will have a limit of 10)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation limit 10;

                                                                        QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..6.02 rows=1 width=36) (actual time=0.235..0.651 rows=10
loops=1)
   ->  Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=36) (actual time=0.224..0.550 rows=10 loops=1)
         Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age)
AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 0.817 ms
(4 rows)

(slower with an sort key)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation order by age;
                                                                        QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=6.03..6.03 rows=1 width=68) (actual time=62.572..66.338 rows=937
loops=1)
   Sort Key: age
   ->  Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=68) (actual time=0.223..55.999 rows=937
loops=1)
         Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age)
AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 71.206 ms
(5 rows)

(no better with a limit)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation order by age limit 10;
                                                                           QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6.03..6.03 rows=1 width=68) (actual time=69.391..69.504 rows=10
loops=1)
   ->  Sort  (cost=6.03..6.03 rows=1 width=68) (actual time=69.381..69.418
rows=10 loops=1)
         Sort Key: age
         ->  Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=68) (actual time=0.068..61.648 rows=937
loops=1)
               Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[]
&& age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] &&
orientation))
 Total runtime: 69.899 ms
(6 rows)

The last query is the most likely since I will need to be sorting by some key.
If I wasn't sorting it looks like it wouldn't be too bad, but sorting is
inevitable I think.  I've only imported  96,146 of the 150,000 profiles. This
seems a bit slow now, and it doesn't look like it will scale.

My questions are:

- Is there a way of speeding up the sort?
- Will using queries like " WHERE orientation IN (1,2,4) " be any
better/worse?
- The queries with the GiST index are faster, but is it of any benefit when
the int[] arrays all contain a single value?
- Is there any hope for this structure?

Thanks for the suggestion Greg, and thanks to those who responded to this
thread.



On Thursday 16 September 2004 02:44, Greg Stark wrote:
> The only kind of index that is capable of indexing this type of data
> structure for arbitrary searches would be a GiST index. I'm not aware of
> any implementation for bitfields, though it would be an appropriate use.
>
> What there is now is the contrib/intarray package. You would have to store
> more than just the bitfields, you would have to store an array of integer
> flags. That might be denser actually if you end up with many flags few of
> which are set.
>
> GiST indexes allow you to search arbitrary combinations of set and unset
> flags. using the "@@" operator
>
>   int[] @@ query_int  - returns TRUE if array satisfies query (like
> '1&(2|3)')
>
> You might be able to look at the code there and adapt it to apply to bit
> fields. If so I think it would be a useful tool. But GiST indexing is
> pretty esoteric stuff.



    primary key (person_id)
)
without oids;


Re: Comparing user attributes with bitwise operators

От
Greg Stark
Дата:
Patrick Clery <patrick@phpforhire.com> writes:

> PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=6.03..6.03 rows=1 width=68) (actual time=69.391..69.504 rows=10 loops=1)
>    ->  Sort  (cost=6.03..6.03 rows=1 width=68) (actual time=69.381..69.418 rows=10 loops=1)
>          Sort Key: age
>          ->  Index Scan using people_attributes_search on people_attributes (cost=0.00..6.02 rows=1 width=68) (actual
time=0.068..61.648rows=937 loops=1) 
>                Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender)
AND('{1,2,4}'::integer[] && orientation)) 
>  Total runtime: 69.899 ms
> (6 rows)
...
> - Is there a way of speeding up the sort?

The sort seems to have only taken 8ms out of 69ms or just over 10%. As long as
the index scan doesn't match too many records the sort should never be any
slower so it shouldn't be the performance bottleneck. You might consider
putting a subquery inside the order by with a limit to ensure that the sort
never gets more than some safe maximum. Something like:

select * from (select * from people_attributes where ... limit 1000) order by age limit 10

This means if the query matches more than 1000 it won't be sorted properly by
age; you'll get the top 10 out of some random subset. But you're protected
against ever having to sort more than 1000 records.

> - Will using queries like " WHERE orientation IN (1,2,4) " be any better/worse?

Well they won't use the GiST index, so no. If there was a single column with a
btree index then this would be the cleanest way to go.

> - The queries with the GiST index are faster, but is it of any benefit when
> the int[] arrays all contain a single value?

Well you've gone from 5 minutes to 60ms. You'll have to do more than one test
to be sure but it sure seems like it's of some benefit.

If they're always a single value you could make it an expression index instead
and not have to change your data model.

Just have the fields be integers individually and make an index as:

create index idx on people_attributes using gist (
  (array[age]) gist__int_ops,
  (array[gender]) gist__int_ops,
...
)


However I would go one step further. I would make the index simply:

create index idx on people_attributes using gist (
  (array[age,gender,orientation,...]) gist__int_ops
)

And ensure that all of these attributes have distinct domains. Ie, that they
don't share any values. There are 4 billion integer values available so that
shouldn't be an issue.

Then you could use query_int to compare them the way you want. You
misunderstood how query_int is supposed to work. You index an array column and
then you can check it against a query_int just as you're currently checking
for overlap. Think of @@ as a more powerful version of the overlap operator
that can do complex logical expressions.

The equivalent of

    where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
      and '{2}'::int[] && gender
      and '{1,2,4}'::int[] && orientation

would then become:

   WHERE array[age,gender,orientation] @@ '(30|31|32|33|34|35|36|37|38|39|40)&(2)&(1|2|4)'

except you would have to change orientation and gender to not both have a
value of 2.

You might consider doing the expression index a bit of overkill actually. You
might consider just storing a column "attributes" with an integer array
directly in the table.

You would also want a table that lists the valid attributes to be sure not to
have any overlaps:

1   age          1
2   age          2
...
101 gender       male
102 gender       female
103 orientation  straight
104 orientation  gay
105 orientation  bi
106 bodytype     scrawny
...


> - Is there any hope for this structure?

You'll have to test this carefully. I tried using GiST indexes for my project
and found that I couldn't load the data and build the GiST indexes fast
enough. You have to test the costs of building and maintaining this index,
especially since it has so many columns in it.

But it looks like your queries are in trouble without it so hopefully it'll be
ok on the insert/update side for you.

--
greg

Re: Comparing user attributes with bitwise operators

От
Patrick Clery
Дата:
Sorry I have taken this long to reply, Greg, but here are the results of the
personals site done with contrib/intarray:

The first thing I did was add a serial column to the attributes table. So
instead of having a unique constraint on (attribute_id,value_id), every row
has a unique value:

datingsite=> \d attribute_names
                                           Table "public.attribute_names"
     Column     |         Type          |
Modifiers
----------------+-----------------------+---------------------------------------------------------------------------
 attribute_id   | integer               | not null default
nextval('public.attribute_names_attribute_id_seq'::text)
 attribute_name | character varying(50) | not null
Indexes:
    "attribute_names_pkey" PRIMARY KEY, btree (attribute_id)
    "attribute_names_attribute_id_key" UNIQUE, btree (attribute_id,
attribute_name

an example insert:
insert into attribute_names (attribute_name) values ('languages');



datingsite=> \d attribute_values
                                        Table "public.attribute_values"
    Column    |          Type          |
Modifiers
--------------+------------------------+------------------------------------------------------------------------
 attribute_id | integer                | not null
 order_id     | integer                | not null default
(nextval('order_id_seq'::text) - 1)
 label        | character varying(255) | not null
 value_id     | integer                | not null default
nextval('public.attribute_values_value_id_seq'::text)
Indexes:
    "attribute_values_pkey" PRIMARY KEY, btree (value_id)
Foreign-key constraints:
    "attribute_values_attribute_id_fkey" FOREIGN KEY (attribute_id) REFERENCES
attribute_names(attribute_id)

an example insert (22 is the attribute_id of "languages"):
insert into attribute_values (attribute_id, label) values (22, 'English');


The "value_id" column is where the integers inside the int[] arrays will
reference. Even age (between 18-99) and height (between 48-84) have rows for
every possible choice, as well as "Ask me!" where a user could choose to
leave that blank.

Here is "the int[] table":

create table people_attributes (
    person_id int references people (person_id) on delete cascade initially
deferred,
    askmecount int not null default 0,
    age int not null references attribute_values(value_id) on delete restrict,
    gender int not null references attribute_values(value_id) on delete
restrict,
    bodytype int not null references attribute_values(value_id) on delete
restrict,
    children int not null references attribute_values(value_id) on delete
restrict,
    drinking int not null references attribute_values(value_id) on delete
restrict,
    education int not null references attribute_values(value_id) on delete
restrict,
    ethnicity int not null references attribute_values(value_id) on delete
restrict,
    eyecolor int not null references attribute_values(value_id) on delete
restrict,
    haircolor int not null references attribute_values(value_id) on delete
restrict,
    hairstyle int not null references attribute_values(value_id) on delete
restrict,
    height int not null references attribute_values(value_id) on delete
restrict,
    income int not null references attribute_values(value_id) on delete
restrict,
    languages int[] not null,
    occupation int not null references attribute_values(value_id) on delete
restrict,
    orientation int not null references attribute_values(value_id) on delete
restrict,
    relation int not null references attribute_values(value_id) on delete
restrict,
    religion int not null references attribute_values(value_id) on delete
restrict,
    smoking int not null references attribute_values(value_id) on delete
restrict,
    want_children int not null references attribute_values(value_id) on delete
restrict,
    weight int not null references attribute_values(value_id) on delete
restrict,

    seeking int[] not null,

    primary key (person_id)
)
without oids;


If you'll notice that "seeking" and "languages" are both int[] types. I did
this because those will be multiple choice. The index was created like so:

create index people_attributes_search on people_attributes using gist (
    (array[
        age,
        gender,
        orientation,
        children,
        drinking,
        education,
        ethnicity,
        eyecolor,
        haircolor,
        hairstyle,
        height,
        income,
        occupation,
        relation,
        religion,
        smoking,
        want_children,
        weight
    ] + seeking + languages) gist__int_ops
    );

seeking and languages are appended with the intarray + op.


I'm not going to go too in depth on how this query was generated since that
was mostly done with the PHP side of things, but from the structure it should
be obvious. I did, however, have to generate a few SQL functions using Smarty
templates since it would be way to tedious to map all these values by hand.

There are 96,000 rows (people) in the people_attributes table. Here is what is
going on in the following query: "Show me all single (48) females (88) who
are heterosexual (92) age between 18 and 31 (95|96|97|98|99|100|101|102|103|
104|105|106|107|108)"

EXPLAIN ANALYZE SELECT *
FROM people_attributes pa
WHERE person_id <> 1
AND (ARRAY[age, gender, orientation, children, drinking, education,
ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation,
relation, religion, smoking, want_children, weight] + seeking + languages) @@
'48 & 89 & 92 & ( 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105
| 106  | 107 | 108 )'


Index Scan using people_attributes_search on people_attributes pa
(cost=0.00..386.45 rows=96 width=140) (actual time=0.057..19.266 rows=516
loops=1)
   Index Cond: (((ARRAY[age, gender, orientation, children, drinking,
education, ethnicity, eyecolor, haircolor, hairstyle, height, income,
occupation, relation, religion, smoking, want_children, weight] + seeking) +
languages) @@ '48 & 89 & 92 & ( ( ( ( ( ( ( ( ( ( ( ( ( 95 | 96 ) | 97 ) |
98 ) | 99 ) | 100 ) | 101 ) | 102 ) | 103 ) | 104 ) | 105 ) | 106 ) | 107 ) |
108 )'::query_int)
   Filter: (person_id <> 1)
 Total runtime: 21.646 ms


The speed only seems to vary significant on very broad searches, e.g: "All
females." But once the threshold of about 2 or more attributes is met, the
times are very acceptable.

If we get a little more specific by adding "non-smokers and non-drinkers
between 18 and 22", slight improvements:


EXPLAIN ANALYZE SELECT *
FROM people_attributes pa
WHERE person_id <> 1
AND (ARRAY[age, gender, orientation, children, drinking, education,
ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation,
relation, religion, smoking, want_children, weight] + seeking + languages) @@
'48 & 89 & 92 & ( 95 | 96 | 97 | 98) & 67 & 2'

Index Scan using people_attributes_search on people_attributes pa
(cost=0.00..386.45 rows=96 width=140) (actual time=0.077..13.090 rows=32
loops=1)
   Index Cond: (((ARRAY[age, gender, orientation, children, drinking,
education, ethnicity, eyecolor, haircolor, hairstyle, height, income,
occupation, relation, religion, smoking, want_children, weight] + seeking) +
languages) @@ '48 & 89 & 92 & ( ( ( 95 | 96 ) | 97 ) | 98 ) & 67 &
2'::query_int)
   Filter: (person_id <> 1)
 Total runtime: 13.393 ms


All in all, my final thoughts on this are that it is "hella faster" than the
previous methods. Vertical tables for your user attributes will not work for
a personals site -- there are just too many conditions to be able to
efficiently use an index. Out of all the methods I have tried, verticle table
was not even remotely scalable on large amounts of data. Horizontal table is
the way to go, but it wouldn't perform like this if not for the intarray
module. The array method works quite nicely, especially for the columns like
"languages" and "seeking" that are multiple choice. However, even though this
method is fast, I still might opt for caching the results because the "real
world" search query involves a lot more and will be executed non-stop. But to
have it run this fast the first time certainly helps.

The only drawback I can think of is that the attributes no longer have values
like 1,2,3 -- instead they could be any integer value. This puts a spin on
the programming side of things, which required me to write "code that writes
code" on a few occassions during the attribute "mapping" process. For
example, keeping an associative array of all the attributes without fetching
that data from the database each time. My advice: if you're not a masochist,
use a template engine (or simply parse out a print_r() ) to create these PHP
arrays or SQL functions.

Greg, thanks a lot for the advice. I owe you a beer ;)


On Saturday 18 September 2004 23:07, you wrote:
> Patrick Clery <patrick@phpforhire.com> writes:
> > PLAN
> > -------------------------------------------------------------------------
> >--------------------------------------------------------------------------
> >-------------- Limit  (cost=6.03..6.03 rows=1 width=68) (actual
> > time=69.391..69.504 rows=10 loops=1) ->  Sort  (cost=6.03..6.03 rows=1
> > width=68) (actual time=69.381..69.418 rows=10 loops=1) Sort Key: age
> >          ->  Index Scan using people_attributes_search on
> > people_attributes (cost=0.00..6.02 rows=1 width=68) (actual
> > time=0.068..61.648 rows=937 loops=1) Index Cond:
> > (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND
> > ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
> > Total runtime: 69.899 ms
> > (6 rows)
>
> ...
>
> > - Is there a way of speeding up the sort?
>
> The sort seems to have only taken 8ms out of 69ms or just over 10%. As long
> as the index scan doesn't match too many records the sort should never be
> any slower so it shouldn't be the performance bottleneck. You might
> consider putting a subquery inside the order by with a limit to ensure that
> the sort never gets more than some safe maximum. Something like:
>
> select * from (select * from people_attributes where ... limit 1000) order
> by age limit 10
>
> This means if the query matches more than 1000 it won't be sorted properly
> by age; you'll get the top 10 out of some random subset. But you're
> protected against ever having to sort more than 1000 records.
>
> > - Will using queries like " WHERE orientation IN (1,2,4) " be any
> > better/worse?
>
> Well they won't use the GiST index, so no. If there was a single column
> with a btree index then this would be the cleanest way to go.
>
> > - The queries with the GiST index are faster, but is it of any benefit
> > when the int[] arrays all contain a single value?
>
> Well you've gone from 5 minutes to 60ms. You'll have to do more than one
> test to be sure but it sure seems like it's of some benefit.
>
> If they're always a single value you could make it an expression index
> instead and not have to change your data model.
>
> Just have the fields be integers individually and make an index as:
>
> create index idx on people_attributes using gist (
>   (array[age]) gist__int_ops,
>   (array[gender]) gist__int_ops,
> ...
> )
>
>
> However I would go one step further. I would make the index simply:
>
> create index idx on people_attributes using gist (
>   (array[age,gender,orientation,...]) gist__int_ops
> )
>
> And ensure that all of these attributes have distinct domains. Ie, that
> they don't share any values. There are 4 billion integer values available
> so that shouldn't be an issue.
>
> Then you could use query_int to compare them the way you want. You
> misunderstood how query_int is supposed to work. You index an array column
> and then you can check it against a query_int just as you're currently
> checking for overlap. Think of @@ as a more powerful version of the overlap
> operator that can do complex logical expressions.
>
> The equivalent of
>
>     where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
>       and '{2}'::int[] && gender
>       and '{1,2,4}'::int[] && orientation
>
> would then become:
>
>    WHERE array[age,gender,orientation] @@
> '(30|31|32|33|34|35|36|37|38|39|40)&(2)&(1|2|4)'
>
> except you would have to change orientation and gender to not both have a
> value of 2.
>
> You might consider doing the expression index a bit of overkill actually.
> You might consider just storing a column "attributes" with an integer array
> directly in the table.
>
> You would also want a table that lists the valid attributes to be sure not
> to have any overlaps:
>
> 1   age          1
> 2   age          2
> ...
> 101 gender       male
> 102 gender       female
> 103 orientation  straight
> 104 orientation  gay
> 105 orientation  bi
> 106 bodytype     scrawny
> ...
>
> > - Is there any hope for this structure?
>
> You'll have to test this carefully. I tried using GiST indexes for my
> project and found that I couldn't load the data and build the GiST indexes
> fast enough. You have to test the costs of building and maintaining this
> index, especially since it has so many columns in it.
>
> But it looks like your queries are in trouble without it so hopefully it'll
> be ok on the insert/update side for you.

Re: Comparing user attributes with bitwise operators

От
Josh Berkus
Дата:
Patrick,

First off, thanks for posting this solution!  I love to see a new demo of The
Power of Postgres(tm) and have been wondering about this particular problem
since it came up on IRC.

> The array method works quite nicely, especially for the
> columns like "languages" and "seeking" that are multiple choice. However,
> even though this method is fast, I still might opt for caching the results
> because the "real world" search query involves a lot more and will be
> executed non-stop. But to have it run this fast the first time certainly
> helps.

Now, for the bad news:  you need to test having a large load of users updating
their data.   The drawback to GiST indexes is that they are low-concurrency,
because the updating process needs to lock the whole index (this has been on
our TODO list for about a decade, but it's a hard problem).

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Comparing user attributes with bitwise operators

От
Patrick Clery
Дата:
Another problem I should note is that when I first insert all the data into
the people_attributes table ("the int[] table"), the GiST index is not used:

THE INDEX:
"people_attributes_search" gist ((ARRAY[age, gender, orientation, children,
drinking, education,
ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation,
relation, religion, smoking, w
ant_children, weight] + seeking + languages))

PART OF THE QUERY PLAN:
Seq Scan on people_attributes pa  (cost=0.00..0.00 rows=1 width=20)
                     Filter: (((ARRAY[age, gender, orientation, children,
drinking, education, ethnicity, eyecolor, haircolor, hairstyle, height,
income, occupation, relation, religion, smoking, want_children, weight] +
seeking) + languages) @@ '( ( 4 | 5 ) | 6 ) & 88 & 48 & ( 69 | 70 ) & 92 &
( ( ( ( ( ( ( ( ( ( ( ( ( 95 | 96 ) | 97 ) | 98 ) | 99 ) | 100 ) | 101 ) |
102 ) | 103 ) | 104 ) | 105 ) | 106 ) | 107 ) | 108 ) &
( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( 190
| 191 ) | 192 ) | 193 ) | 194 ) | 195 ) | 196 ) | 197 ) | 198 ) | 199 ) |
200 ) | 201 ) | 202 ) | 203 ) | 204 ) | 205 ) | 206 ) | 207 ) | 208 ) | 209 )
| 210 ) | 211 ) | 212 ) | 213 ) | 214 ) | 215 ) | 216 ) | 217 ) | 218 ) |
219 ) | 220 ) | 221 ) | 222 ) | 223 ) | 224 ) | 225 ) | 226 ) | 227 ) | 228 )
| 229 ) | 230 ) | 231 ) | 232 ) | 233 ) | 234 ) | 235 ) | 236 ) | 237 ) |
238 ) | 239 ) | 240 ) | 241 ) | 242 ) | 243 )'::query_int)


So I run "VACUUM ANALYZE people_attributes", then run again:

PART OF THE QUERY PLAN:
Index Scan using people_attributes_pkey on people_attributes pa
(cost=0.00..5.32 rows=1 width=20)
         Index Cond: (pa.person_id = "outer".person_id)
         Filter: (((ARRAY[age, gender, orientation, children, drinking,
education, ethnicity, eyecolor, haircolor, hairstyle, height, income,
occupation, relation, religion, smoking, want_children, weight] + seeking) +
languages) @@ '( ( 4 | 5 ) | 6 ) & 88 & 48 & ( 69 | 70 ) & 92 &
( ( ( ( ( ( ( ( ( ( ( ( ( 95 | 96 ) | 97 ) | 98 ) | 99 ) | 100 ) | 101 ) |
102 ) | 103 ) | 104 ) | 105 ) | 106 ) | 107 ) | 108 ) &
( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( 190
| 191 ) | 192 ) | 193 ) | 194 ) | 195 ) | 196 ) | 197 ) | 198 ) | 199 ) |
200 ) | 201 ) | 202 ) | 203 ) | 204 ) | 205 ) | 206 ) | 207 ) | 208 ) | 209 )
| 210 ) | 211 ) | 212 ) | 213 ) | 214 ) | 215 ) | 216 ) | 217 ) | 218 ) |
219 ) | 220 ) | 221 ) | 222 ) | 223 ) | 224 ) | 225 ) | 226 ) | 227 ) | 228 )
| 229 ) | 230 ) | 231 ) | 232 ) | 233 ) | 234 ) | 235 ) | 236 ) | 237 ) |
238 ) | 239 ) | 240 ) | 241 ) | 242 ) | 243 )'::query_int)

Still not using the index. I'm trying to DROP INDEX and recreate it, but the
query just stalls. I remember last time this situation happened that I just
dropped and recreated the index, and voila it was using the index again. Now
I can't seem to get this index to drop. Here's the table structure:


    Column     |   Type    |     Modifiers
---------------+-----------+--------------------
 person_id     | integer   | not null
 askmecount    | integer   | not null default 0
 age           | integer   | not null
 gender        | integer   | not null
 bodytype      | integer   | not null
 children      | integer   | not null
 drinking      | integer   | not null
 education     | integer   | not null
 ethnicity     | integer   | not null
 eyecolor      | integer   | not null
 haircolor     | integer   | not null
 hairstyle     | integer   | not null
 height        | integer   | not null
 income        | integer   | not null
 languages     | integer[] | not null
 occupation    | integer   | not null
 orientation   | integer   | not null
 relation      | integer   | not null
 religion      | integer   | not null
 smoking       | integer   | not null
 want_children | integer   | not null
 weight        | integer   | not null
 seeking       | integer[] | not null
Indexes:
    "people_attributes_pkey" PRIMARY KEY, btree (person_id)
    "people_attributes_search" gist ((ARRAY[age, gender, orientation,
children, drinking, education,
ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation,
relation, religion, smoking, w
ant_children, weight] + seeking + languages))
Foreign-key constraints:
    "people_attributes_weight_fkey" FOREIGN KEY (weight) REFERENCES
attribute_values(value_id) ON DEL
ETE RESTRICT
    "people_attributes_person_id_fkey" FOREIGN KEY (person_id) REFERENCES
people(person_id) ON DELETE
 CASCADE DEFERRABLE INITIALLY DEFERRED
    "people_attributes_age_fkey" FOREIGN KEY (age) REFERENCES
attribute_values(value_id) ON DELETE RE
STRICT
    "people_attributes_gender_fkey" FOREIGN KEY (gender) REFERENCES
attribute_values(value_id) ON DEL
ETE RESTRICT
    "people_attributes_bodytype_fkey" FOREIGN KEY (bodytype) REFERENCES
attribute_values(value_id) ON
 DELETE RESTRICT
    "people_attributes_children_fkey" FOREIGN KEY (children) REFERENCES
attribute_values(value_id) ON
 DELETE RESTRICT
    "people_attributes_drinking_fkey" FOREIGN KEY (drinking) REFERENCES
attribute_values(value_id) ON
 DELETE RESTRICT
    "people_attributes_education_fkey" FOREIGN KEY (education) REFERENCES
attribute_values(value_id)
ON DELETE RESTRICT
    "people_attributes_ethnicity_fkey" FOREIGN KEY (ethnicity) REFERENCES
attribute_values(value_id)
ON DELETE RESTRICT
    "people_attributes_eyecolor_fkey" FOREIGN KEY (eyecolor) REFERENCES
attribute_values(value_id) ON
 DELETE RESTRICT
    "people_attributes_haircolor_fkey" FOREIGN KEY (haircolor) REFERENCES
attribute_values(value_id)
ON DELETE RESTRICT
    "people_attributes_hairstyle_fkey" FOREIGN KEY (hairstyle) REFERENCES
attribute_values(value_id)
ON DELETE RESTRICT
    "people_attributes_height_fkey" FOREIGN KEY (height) REFERENCES
attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_income_fkey" FOREIGN KEY (income) REFERENCES
attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_occupation_fkey" FOREIGN KEY (occupation) REFERENCES
attribute_values(value_id
) ON DELETE RESTRICT
    "people_attributes_orientation_fkey" FOREIGN KEY (orientation) REFERENCES
attribute_values(value_
id) ON DELETE RESTRICT
    "people_attributes_relation_fkey" FOREIGN KEY (relation) REFERENCES
attribute_values(value_id) ON
 DELETE RESTRICT
    "people_attributes_religion_fkey" FOREIGN KEY (religion) REFERENCES
attribute_values(value_id) ON
 DELETE RESTRICT
    "people_attributes_smoking_fkey" FOREIGN KEY (smoking) REFERENCES
attribute_values(value_id) ON D
ELETE RESTRICT
    "people_attributes_want_children_fkey" FOREIGN KEY (want_children)
REFERENCES attribute_values(va
lue_id) ON DELETE RESTRICT


Is it all the foreign keys that are stalling the drop? I have done VACUUM
ANALYZE on the entire db. Could anyone offer some insight as to why this
index is not being used or why the index is not dropping easily?


On Tuesday 05 October 2004 10:32, you wrote:
> Patrick,
>
> First off, thanks for posting this solution!  I love to see a new demo of
> The Power of Postgres(tm) and have been wondering about this particular
> problem since it came up on IRC.
>
> > The array method works quite nicely, especially for the
> > columns like "languages" and "seeking" that are multiple choice. However,
> > even though this method is fast, I still might opt for caching the
> > results because the "real world" search query involves a lot more and
> > will be executed non-stop. But to have it run this fast the first time
> > certainly helps.
>
> Now, for the bad news:  you need to test having a large load of users
> updating their data.   The drawback to GiST indexes is that they are
> low-concurrency, because the updating process needs to lock the whole index
> (this has been on our TODO list for about a decade, but it's a hard
> problem).

Re: Comparing user attributes with bitwise operators

От
Patrick Clery
Дата:
Err... I REINDEX'ed it and it is now using the index. :)

I'd still appreciate if anyone could tell me why this needs to be
reindexed. Is the index not updated when the records are inserted?

> On Wednesday 06 October 2004 12:55, I wrote:
> > Another problem I should note is that when I first insert all the data
> > into the people_attributes table ("the int[] table"), the GiST index is
> > not used:
> >
> > THE INDEX:
> > "people_attributes_search" gist ((ARRAY[age, gender, orientation,
> > children, drinking, education,
> > ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation,
> > relation, religion, smoking, w
> > ant_children, weight] + seeking + languages))
> >
> > PART OF THE QUERY PLAN:
> > Seq Scan on people_attributes pa  (cost=0.00..0.00 rows=1 width=20)
> >                      Filter: (((ARRAY[age, gender, orientation, children,
> > drinking, education, ethnicity, eyecolor, haircolor, hairstyle, height,
> > income, occupation, relation, religion, smoking, want_children, weight] +
> > seeking) + languages) @@ '( ( 4 | 5 ) | 6 ) & 88 & 48 & ( 69 | 70 ) & 92
> > & ( ( ( ( ( ( ( ( ( ( ( ( ( 95 | 96 ) | 97 ) | 98 ) | 99 ) | 100 ) | 101
> > ) | 102 ) | 103 ) | 104 ) | 105 ) | 106 ) | 107 ) | 108 ) &
> > ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( (
> > ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( 190
> >
> > | 191 ) | 192 ) | 193 ) | 194 ) | 195 ) | 196 ) | 197 ) | 198 ) | 199 ) |
> >
> > 200 ) | 201 ) | 202 ) | 203 ) | 204 ) | 205 ) | 206 ) | 207 ) | 208 ) |
> > 209 )
> >
> > | 210 ) | 211 ) | 212 ) | 213 ) | 214 ) | 215 ) | 216 ) | 217 ) | 218 ) |
> >
> > 219 ) | 220 ) | 221 ) | 222 ) | 223 ) | 224 ) | 225 ) | 226 ) | 227 ) |
> > 228 )
> >
> > | 229 ) | 230 ) | 231 ) | 232 ) | 233 ) | 234 ) | 235 ) | 236 ) | 237 ) |
> >
> > 238 ) | 239 ) | 240 ) | 241 ) | 242 ) | 243 )'::query_int)
> >
> >
> > So I run "VACUUM ANALYZE people_attributes", then run again:
> >
> > PART OF THE QUERY PLAN:
> > Index Scan using people_attributes_pkey on people_attributes pa
> > (cost=0.00..5.32 rows=1 width=20)
> >          Index Cond: (pa.person_id = "outer".person_id)
> >          Filter: (((ARRAY[age, gender, orientation, children, drinking,
> > education, ethnicity, eyecolor, haircolor, hairstyle, height, income,
> > occupation, relation, religion, smoking, want_children, weight] +
> > seeking) + languages) @@ '( ( 4 | 5 ) | 6 ) & 88 & 48 & ( 69 | 70 ) & 92
> > & ( ( ( ( ( ( ( ( ( ( ( ( ( 95 | 96 ) | 97 ) | 98 ) | 99 ) | 100 ) | 101
> > ) | 102 ) | 103 ) | 104 ) | 105 ) | 106 ) | 107 ) | 108 ) &
> > ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( (
> > ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( 190
> >
> > | 191 ) | 192 ) | 193 ) | 194 ) | 195 ) | 196 ) | 197 ) | 198 ) | 199 ) |
> >
> > 200 ) | 201 ) | 202 ) | 203 ) | 204 ) | 205 ) | 206 ) | 207 ) | 208 ) |
> > 209 )
> >
> > | 210 ) | 211 ) | 212 ) | 213 ) | 214 ) | 215 ) | 216 ) | 217 ) | 218 ) |
> >
> > 219 ) | 220 ) | 221 ) | 222 ) | 223 ) | 224 ) | 225 ) | 226 ) | 227 ) |
> > 228 )
> >
> > | 229 ) | 230 ) | 231 ) | 232 ) | 233 ) | 234 ) | 235 ) | 236 ) | 237 ) |
> >
> > 238 ) | 239 ) | 240 ) | 241 ) | 242 ) | 243 )'::query_int)
> >
> > Still not using the index. I'm trying to DROP INDEX and recreate it, but
> > the query just stalls. I remember last time this situation happened that
> > I just dropped and recreated the index, and voila it was using the index
> > again. Now I can't seem to get this index to drop. Here's the table
> > structure:
> >
> >
> >     Column     |   Type    |     Modifiers
> > ---------------+-----------+--------------------
> >  person_id     | integer   | not null
> >  askmecount    | integer   | not null default 0
> >  age           | integer   | not null
> >  gender        | integer   | not null
> >  bodytype      | integer   | not null
> >  children      | integer   | not null
> >  drinking      | integer   | not null
> >  education     | integer   | not null
> >  ethnicity     | integer   | not null
> >  eyecolor      | integer   | not null
> >  haircolor     | integer   | not null
> >  hairstyle     | integer   | not null
> >  height        | integer   | not null
> >  income        | integer   | not null
> >  languages     | integer[] | not null
> >  occupation    | integer   | not null
> >  orientation   | integer   | not null
> >  relation      | integer   | not null
> >  religion      | integer   | not null
> >  smoking       | integer   | not null
> >  want_children | integer   | not null
> >  weight        | integer   | not null
> >  seeking       | integer[] | not null
> > Indexes:
> >     "people_attributes_pkey" PRIMARY KEY, btree (person_id)
> >     "people_attributes_search" gist ((ARRAY[age, gender, orientation,
> > children, drinking, education,
> > ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation,
> > relation, religion, smoking, w
> > ant_children, weight] + seeking + languages))
> > Foreign-key constraints:
> >     "people_attributes_weight_fkey" FOREIGN KEY (weight) REFERENCES
> > attribute_values(value_id) ON DEL
> > ETE RESTRICT
> >     "people_attributes_person_id_fkey" FOREIGN KEY (person_id) REFERENCES
> > people(person_id) ON DELETE
> >  CASCADE DEFERRABLE INITIALLY DEFERRED
> >     "people_attributes_age_fkey" FOREIGN KEY (age) REFERENCES
> > attribute_values(value_id) ON DELETE RE
> > STRICT
> >     "people_attributes_gender_fkey" FOREIGN KEY (gender) REFERENCES
> > attribute_values(value_id) ON DEL
> > ETE RESTRICT
> >     "people_attributes_bodytype_fkey" FOREIGN KEY (bodytype) REFERENCES
> > attribute_values(value_id) ON
> >  DELETE RESTRICT
> >     "people_attributes_children_fkey" FOREIGN KEY (children) REFERENCES
> > attribute_values(value_id) ON
> >  DELETE RESTRICT
> >     "people_attributes_drinking_fkey" FOREIGN KEY (drinking) REFERENCES
> > attribute_values(value_id) ON
> >  DELETE RESTRICT
> >     "people_attributes_education_fkey" FOREIGN KEY (education) REFERENCES
> > attribute_values(value_id)
> > ON DELETE RESTRICT
> >     "people_attributes_ethnicity_fkey" FOREIGN KEY (ethnicity) REFERENCES
> > attribute_values(value_id)
> > ON DELETE RESTRICT
> >     "people_attributes_eyecolor_fkey" FOREIGN KEY (eyecolor) REFERENCES
> > attribute_values(value_id) ON
> >  DELETE RESTRICT
> >     "people_attributes_haircolor_fkey" FOREIGN KEY (haircolor) REFERENCES
> > attribute_values(value_id)
> > ON DELETE RESTRICT
> >     "people_attributes_hairstyle_fkey" FOREIGN KEY (hairstyle) REFERENCES
> > attribute_values(value_id)
> > ON DELETE RESTRICT
> >     "people_attributes_height_fkey" FOREIGN KEY (height) REFERENCES
> > attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_income_fkey" FOREIGN KEY (income) REFERENCES
> > attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_occupation_fkey" FOREIGN KEY (occupation)
> > REFERENCES attribute_values(value_id
> > ) ON DELETE RESTRICT
> >     "people_attributes_orientation_fkey" FOREIGN KEY (orientation)
> > REFERENCES attribute_values(value_
> > id) ON DELETE RESTRICT
> >     "people_attributes_relation_fkey" FOREIGN KEY (relation) REFERENCES
> > attribute_values(value_id) ON
> >  DELETE RESTRICT
> >     "people_attributes_religion_fkey" FOREIGN KEY (religion) REFERENCES
> > attribute_values(value_id) ON
> >  DELETE RESTRICT
> >     "people_attributes_smoking_fkey" FOREIGN KEY (smoking) REFERENCES
> > attribute_values(value_id) ON D
> > ELETE RESTRICT
> >     "people_attributes_want_children_fkey" FOREIGN KEY (want_children)
> > REFERENCES attribute_values(va
> > lue_id) ON DELETE RESTRICT
> >
> >
> > Is it all the foreign keys that are stalling the drop? I have done VACUUM
> > ANALYZE on the entire db. Could anyone offer some insight as to why this
> > index is not being used or why the index is not dropping easily?
> >
> > On Tuesday 05 October 2004 10:32, you wrote:
> > > Patrick,
> > >
> > > First off, thanks for posting this solution!  I love to see a new demo
> > > of The Power of Postgres(tm) and have been wondering about this
> > > particular problem since it came up on IRC.
> > >
> > > > The array method works quite nicely, especially for the
> > > > columns like "languages" and "seeking" that are multiple choice.
> > > > However, even though this method is fast, I still might opt for
> > > > caching the results because the "real world" search query involves a
> > > > lot more and will be executed non-stop. But to have it run this fast
> > > > the first time certainly helps.
> > >
> > > Now, for the bad news:  you need to test having a large load of users
> > > updating their data.   The drawback to GiST indexes is that they are
> > > low-concurrency, because the updating process needs to lock the whole
> > > index (this has been on our TODO list for about a decade, but it's a
> > > hard problem).
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 6: Have you searched our list archives?
> >
> >                http://archives.postgresql.org

Re: Comparing user attributes with bitwise operators

От
Greg Stark
Дата:
Patrick Clery <patrick@phpforhire.com> writes:

> PART OF THE QUERY PLAN:
> Index Scan using people_attributes_pkey on people_attributes pa (cost=0.00..5.32 rows=1 width=20)
>          Index Cond: (pa.person_id = "outer".person_id)
>          Filter: (((ARRAY[age, gender, orientation, children, drinking,

You'll probably have to show the rest of the plan for anyone to have much idea
what's going on. It seems to be part of a join of some sort and the planner is
choosing to drive the join from the wrong table. This may make it awkward to
force the right plan using enable_seqscan or anything like that. But GiST
indexes don't have very good selectivity estimates so I'm not sure you can
hope for the optimizer to guess right on its own.

> Is it all the foreign keys that are stalling the drop? I have done VACUUM
> ANALYZE on the entire db. Could anyone offer some insight as to why this
> index is not being used or why the index is not dropping easily?

I don't think foreign keys cause problems dropping indexes. Foreign key
constraints are just checked whenever there's an insert/update/delete. Perhaps
you're just underestimating the size of this index and the amount of time
it'll take to delete it? Or are there queries actively executing using the
index while you're trying to delete it? Or a vacuum running?

--
greg