Comparing user attributes with bitwise operators

Поиск
Список
Период
Сортировка
I'm working on a dating/personals/match-making site, that has used many
different methods of "match-making", that all seem to be very slow. One I am
attempting now that seems to be an efficient method of storage, but not the
best for indexing, is using bitwise operators to compare one person's profile
to another's.

This method seems to be very fast on a small scale, but I am dealing with a
large user-base here, in excess of 200,000 users that will be executing this
search function every time they login (the search results of their profile
will appear on the main page after they have logged in). I've opted to use
"label tables" for each possible set of answers. (i.e: Marital Status)

For this table, the set of bits -- bit(5) -- are represented as such:

+-----+------------+
| Bit | Meaning    |
+-----+------------+
| 1   | single     |
| 2   | separated  |
| 3   | married    |
| 4   | divorced   |
| 5   | widowed    |
+-----+------------+

Here's the structure of the marital status table:

# \d marital_matrix
            
Table "public.marital_matrix"
  Column   |      Type      |                               Modifiers
-----------+----------------+-----------------------------------------------------------------------
 member_id | integer        | not null default
nextval('public.marital_matrix_member_id_seq'::text)
 status    | bit varying(5) | not null default (0)::bit(5)
 p_status  | bit varying(5) | not null default (0)::bit(5)
Indexes:
    "marital_matrix_pkey" PRIMARY KEY, btree (member_id)
    "idx_marital_matrix" btree ((status::"bit" & p_status::"bit"))
    "idx_marital_matrix_single" btree ((status::"bit" & p_status::"bit"))
    "idx_marital_p_status" btree (p_status)
    "idx_marital_status" btree (status)
Foreign-key constraints:
    "$1" FOREIGN KEY (member_id) REFERENCES members(member_id) ON DELETE
CASCADE DEFERRABLE INITIALLY DEFERRED

To give you an idea of the selectivity (NOTE: there are only 50,000 rows, a
smaller sample than what I will actually be using):

datingsite=> select count(*),status,p_status from marital_matrix group by
status,p_status;
 count | status | p_status
-------+--------+----------
    89 | 00001  | 00000
  1319 | 00010  | 00000
  2465 | 00100  | 00000
     1 | 00100  | 11111
 46117 | 10000  | 00000

here is the user I'll be comparing against, which has selected that he be
matched with any but married people:

datingsite=> SELECT * FROM marital_matrix WHERE member_id = 21;
 member_id | status | p_status
-----------+--------+----------
        21 | 10000  | 11011
(1 row)




Here are a few possible comparison methods I can think of (NOTE: tests were
run on a 2.4Ghz Intel CPU w/ 256M RAM on FreeBSD 4.10:


METHOD 1: Any bit that is set in p_status (prefered marital status) of the
searching user should be set in the potential match's marital status. This is
the method I'd like to improve, if possible. Running the query twice didn't
produce a different runtime.

EXPLAIN ANALYZE
SELECT
    m2.member_id
FROM
    marital_matrix m1, marital_matrix m2
WHERE
    m1.member_id = 21 AND
    m2.status & m1.p_status != B'00000';
                                                                QUERY PLAN
                    

---------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..2357.79 rows=49742 width=4) (actual
time=18.062..708.469 rows=47525 loops=1)
   Join Filter: ((("inner".status)::"bit" & ("outer".p_status)::"bit") <>
B'00000'::"bit")
   ->  Index Scan using marital_matrix_pkey on marital_matrix m1
(cost=0.00..5.01 rows=1 width=9) (actual time=0.035..0.045 rows=1 loops=1)
         Index Cond: (member_id = 21)
   ->  Seq Scan on marital_matrix m2  (cost=0.00..1602.91 rows=49991 width=13)
(actual time=17.966..255.529 rows=49991 loops=1)
 Total runtime: 905.694 ms
(6 rows)


METHOD 2: Specifying the value (I don't think this would make a difference,
but I'll post anyways):

EXPLAIN ANALYZE
SELECT
    member_id
FROM
    marital_matrix
WHERE
    status & B'11011' != B'00000';

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Seq Scan on marital_matrix  (cost=0.00..1852.87 rows=49742 width=4) (actual
time=18.113..281.101 rows=47525 loops=1)
   Filter: (((status)::"bit" & B'11011'::"bit") <> B'00000'::"bit")
 Total runtime: 480.836 ms
(3 rows)


METHOD 3: Checking for one bit only. This is definitely not a "real world"
example and unacceptable since the p_status column can and will have multiple
bits. For categories other than "Marital Status", such as "Prefered Hair
Color", the users are likely to select multiple bits (they choose all that
apply). This query does use the index, but is still not very fast at all:

EXPLAIN ANALYZE
SELECT
    member_id
FROM
    marital_matrix m1
WHERE
    status & B'10000' = B'10000';
                                                                      QUERY
PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_marital_matrix_single on marital_matrix m1
(cost=0.00..903.59 rows=250 width=4) (actual time=0.042..258.907 rows=46117
loops=1)
   Index Cond: (((status)::"bit" & B'10000'::"bit") = B'10000'::"bit")
 Total runtime: 451.162 ms
(3 rows)

METHOD 4: Using an IN statement. This method seems to be very fussy about
using the index, and I have at some point made it use the index when there
are less than 3 possibilites. Also, for fields other than Marital Status,
users will be able to select many bits for their own profile, which means
there would be many permutations:

EXPLAIN ANALYZE
SELECT
    member_id
FROM
    marital_matrix
WHERE
    status & B'11011' IN (B'10000',B'01000');
          QUERY PLAN
                  

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on marital_matrix  (cost=0.00..2602.73 rows=993 width=4) (actual
time=17.845..288.279 rows=47525 loops=1)
   Filter: ((((status)::"bit" & B'11011'::"bit") = B'10000'::"bit") OR
(((status)::"bit" & B'11011'::"bit") = B'01000'::"bit") OR (((status)::"bit"
& B'11011'::"bit") = B'00010'::"bit") OR (((status)::"bit" & B'11011'::"bit")
= B'00001'::"bit"))
 Total runtime: 488.651 ms
(3 rows)


Method 3 is the only one that used the index, but the only really acceptable
method here is Method 1.

My questions are...
- Is there any hope in getting this to use an efficient index?
- Any mathmaticians know if there is a way to reorder my bitwise comparison to
have the operator use = and not an != (perhaps to force an index)? (AFAIK,
the answer to the second question is no)

If anyone could offer any performance tips here I'd really appreciate it. I
imagine that having this schema wouldn't last an hour with the amount of CPU
cycles it would be consuming on math operations.

Also, I have read the thread that was posted here by Daniel in August:
http://archives.postgresql.org/pgsql-performance/2004-08/msg00328.php

I have spoke with Daniel on this issue and we both agree it's very difficult
to find a solution that can scale to very large sites.

I would very much appreciate any advice that some experienced users may have
to offer me for such a situation. TIA

Patrick

В списке pgsql-performance по дате отправления:

Предыдущее
От: Kevin Neufeld
Дата:
Сообщение: declared cursor uses slow plan
Следующее
От: Christopher Kings-Lynne
Дата:
Сообщение: Re: Comparing user attributes with bitwise operators