Thread: where in (select array)


where in (select array)

From: Marcus Engene
Hi List,

I have the might_like table that contains products a user might like if
he likes the present one (item).

CREATE TABLE might_like
(
 item                       INTEGER NOT NULL
,created_at                 TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
,child                      INTEGER NOT NULL
)
WITHOUT OIDS;

CREATE INDEX might_like_x1 ON might_like(item);

Since there are (will be) hundreds of thousands of items, each with 20+
might-like items, I thought it would be nice to reduce the set to 1/20th
by using an array ("vector").

CREATE TABLE might_like_vector
(
 item                       INTEGER NOT NULL
,created_at                 TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
,child_arr                  INTEGER[]
)
WITHOUT OIDS;

CREATE INDEX might_like_vector_x1 ON might_like_vector(item);

But then this doesn't work (the subquery returns whole INTEGER[] values,
so pic.objectid ends up being compared against an array and the types
don't match):

select
    ...
from
    item pic
where
    pic.objectid in (
        select mlv.child_arr
        from might_like_vector mlv
        where mlv.item = 125 AND
              mlv.created_at > now() - interval '1 week'
    )
limit 16
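
A possible workaround, assuming at most one vector row matches (one cached
vector per item), is to cast the scalar subquery to integer[] so that the
array form of ANY applies instead of the row-comparison form:

select
    pic.*
from
    item pic
where
    pic.objectid = ANY (
        (select mlv.child_arr
         from might_like_vector mlv
         where mlv.item = 125 AND
               mlv.created_at > now() - interval '1 week')::integer[]
    )
limit 16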

This generate_series version seems to work, but it is ugly:

select
    ...
from
    item pic
where
    pic.objectid in (
        select mlv.child_arr[s.a]
        from might_like_vector mlv
            ,generate_series(1,20) as s(a)
        where mlv.item = 125 AND
              mlv.created_at > now() - interval '1 week'
    )
limit 16

Is there a better way to do it?

Thanks,
Marcus


Re: where in (select array)

From: "Grzegorz Jaśkiewicz"
Just as a little advice, I would actually use joins, for performance reasons. WHERE ... IN (...) seems to be rather slow, especially if you use it on large sets of data.
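
A minimal sketch of what that join might look like against the row-based
might_like table, reusing the table and column names from the question:

select
    pic.*
from
    item pic
    join might_like ml on ml.child = pic.objectid
where
    ml.item = 125 AND
    ml.created_at > now() - interval '1 week'
limit 16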

Re: where in (select array)

From: Richard Huxton
Marcus Engene wrote:
> Hi List,
>
> I have the might_like table that contains products a user might like if
> he likes the present one (item).
>
> CREATE TABLE might_like
> (
> item                       INTEGER NOT NULL
> ,created_at                 TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
> ,child                      INTEGER NOT NULL
> )
> WITHOUT OIDS;
>
> CREATE INDEX might_like_x1 ON might_like(item);
>
> Since there are (will be) hundreds of thousands of items, each with 20+
> might-like items, I thought it would be nice to reduce the set to 1/20th
> by using an array ("vector").
>
> CREATE TABLE might_like_vector
> (
> item                       INTEGER NOT NULL
> ,created_at                 TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
> ,child_arr                  INTEGER[]
> )
> WITHOUT OIDS;

You haven't reduced the set at all, you've just turned part of it
sideways. You might gain something on your search, but I'm guessing
you've not tested it.

Hmm - the attached script generates 100,000 items and 10 liked ones for
each (well, for the first 99,990 it says you like the next 10 items).
They're all given different timestamps at day intervals, which means
you'll end up with six or seven matches for your sample query.

> But then this doesn't work (the subquery returns whole INTEGER[] values,
> so pic.objectid ends up being compared against an array and the types
> don't match):
>
> select
>    ...
> from
>    item pic
> where
>    pic.objectid in (
>        select mlv.child_arr
>        from might_like_vector mlv
>        where mlv.item = 125 AND
>              mlv.created_at > now() - interval '1 week'
>    )
> limit 16

Without messing around with arrays you get this query (which seems
readable enough to me):

SELECT
    objectid, objname
FROM
    items i
    JOIN might_like m ON (i.objectid = m.child)
WHERE
    m.created_at > (now() - '1 week'::interval)
    AND m.item = 125
ORDER BY
    objectid
LIMIT
    16
;

I'm getting times less than a millisecond for this - are you sure it's
worth fiddling with arrays?

--
  Richard Huxton
  Archonet Ltd
BEGIN;

CREATE SCHEMA mightlike;

SET search_path = mightlike;

CREATE TABLE items (
    objectid  integer NOT NULL,
    objname   text NOT NULL
);

CREATE TABLE might_like (
    item       integer NOT NULL,
    created_at timestamp with time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
    child      integer NOT NULL
);

INSERT INTO items SELECT i, 'item number ' || i
FROM generate_series(1, 100000) i;

INSERT INTO might_like SELECT i, (now() - j * '1 day'::interval), i+j
FROM generate_series(1, 99990) i, generate_series(1, 10) j;

ALTER TABLE items ADD PRIMARY KEY (objectid);
ALTER TABLE might_like ADD PRIMARY KEY (item, child);
ALTER TABLE might_like ADD CONSTRAINT valid_child FOREIGN KEY (child) REFERENCES items;
CREATE INDEX might_like_idx1 ON might_like (item, created_at);

-- EXPLAIN ANALYSE
SELECT
    objectid, objname
FROM
    items i
    JOIN might_like m ON (i.objectid = m.child)
WHERE
    m.created_at > (now() - '1 week'::interval)
    AND m.item = 125
ORDER BY
    objectid
LIMIT
    16
;

ROLLBACK;
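
Since everything happens between BEGIN and ROLLBACK, the script leaves the
database unchanged and can be re-run freely; assuming it is saved as
mightlike.sql, something like this runs it against a scratch database:

psql -f mightlike.sql yourdb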

Re: where in (select array)

From: Marcus Engene
Richard Huxton wrote:
> Marcus Engene wrote:
>
>> Hi List,
>>
>> I have the might_like table that contains products a user might like if
>> he likes the present one (item).
>>
>> CREATE TABLE might_like
>> (
>> item                       INTEGER NOT NULL
>> ,created_at                 TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
>> ,child                      INTEGER NOT NULL
>> )
>> WITHOUT OIDS;
>>
>> CREATE INDEX might_like_x1 ON might_like(item);
>>
>> Since there are (will be) hundreds of thousands of items, each with 20+
>> might-like items, I thought it would be nice to reduce the set to 1/20th
>> by using an array ("vector").
>>
>> CREATE TABLE might_like_vector
>> (
>> item                       INTEGER NOT NULL
>> ,created_at                 TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
>> ,child_arr                  INTEGER[]
>> )
>> WITHOUT OIDS;
>>
>
> You haven't reduced the set at all, you've just turned part of it
> sideways. You might gain something on your search, but I'm guessing
> you've not tested it.
>
> Hmm - the attached script generates 100,000 items and 10 liked ones for
> each (well, for the first 99,990 it says you like the next 10 items).
> They're all given different timestamps at day intervals, which means
> you'll end up with six or seven matches for your sample query.
>
Sorry, I was a bit unclear. This runs on a web server. The might-like
rows are generated when they are not yet available for an item as the
corresponding page is built. The one-row-per-might-like-item version has
been live since yesterday and, as of this writing, holds half a million
rows. The caching is primarily triggered by a Google indexer agent.

This might-like generation is expensive, so caching the results in the db
is a must; the web server's CPU usage halved when this caching went live
and covered the most commonly used items.

When the might-like data is read from the database, the generated
presentation HTML is put in memcached with a timeout of 1h (presently).
Memcached is probably overkill here, but using it in situations like this
makes the site easier to scale by adding web servers, and postpones the
problem of clustering pg.

So with memcached I care less about saving a few ms of select latency
and more about postponing other approaching problems, like keeping the
db dump manageable. Right now it's a 100 MB gzipped dump, which is very
manageable, so where possible I'd like to keep the data compact. I
imagine it's cheaper, disk- and dump-wise, to do the array thing here and
in some other similar usages, and therefore it would be nice to have a
non-ugly usage pattern. Also, we're going to collect usage statistics to
further optimize the behavior of the site, and I'm really worried about
these millions of rows.

If this is a bad approach to the problem, I'm very interested to hear
it. Regardless of the best approach, a "proper" solution to the
array-IN subquery would be super appreciated too :)

Thanks for your answer!

Best regards,
Marcus

Re: where in (select array)

From: Richard Huxton
Marcus Engene wrote:
> So with memcached I care less about saving a few ms of select latency
> and more about postponing other approaching problems, like keeping the
> db dump manageable. Right now it's a 100 MB gzipped dump, which is very
> manageable, so where possible I'd like to keep the data compact.

If you remove the duplication you reduce the amount it will compress by.
Oh - and you're not gzipping the dump yourself, are you? Use the "-Fc"
format to dump with and it'll already be compressed.
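
For example, assuming a database named mydb:

pg_dump -Fc -f mightlike.dump mydb
pg_restore -d mydb mightlike.dump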

> I
> imagine it's cheaper, disk- and dump-wise, to do the array thing here and
> in some other similar usages, and therefore it would be nice to have a
> non-ugly usage pattern.

Don't imagine, test. And then factor in the cost of fiddling around with
arrays when you need to access individual values. And the cost of the
time you spent working on all this.
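
For the disk part of that, a quick sketch using the built-in size
functions (available in recent releases):

SELECT pg_size_pretty(pg_total_relation_size('might_like'));
SELECT pg_size_pretty(pg_total_relation_size('might_like_vector'));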

> Also, we're going to collect usage statistics to
> further optimize the behavior of the site, and I'm really worried about
> these millions of rows.

Either:
1. Collect usage stats, *then* optimise. Then collect more stats and
make sure what you did was useful.
2. Stop worrying, and knock off early. Go have a couple of drinks with
work colleagues and unwind - it's the weekend!
3. Flip a coin and decide how to proceed based on that.

Now, #1 is the most effective, but also the most effort and cost. Options
2 and 3 are about equal, but both beat spending time optimising something
you haven't measured yet.

--
  Richard Huxton
  Archonet Ltd

Re: where in (select array)

From: Marcus Engene
Richard Huxton wrote:
>> I imagine it's cheaper, disk- and dump-wise, to do the array thing here and
>> in some other similar usages, and therefore it would be nice to have a
>> non-ugly usage pattern.
>>
>
> Don't imagine, test. And then factor in the cost of fiddling around with
> arrays when you need to access individual values. And the cost of the
> time you spent working on all this.
>
On my dev 8.2.4 box, using real values from a db dump
(931,873 might_like rows, 46,539 might_like_vector rows), I get:

Might like (row version):
10 s to dump the second time, 38 MB text, 4 MB gzipped

Might like (vector version):
2 s to dump the second time, 7.6 MB text, 2 MB gzipped

Might like (row version):
explain cost, my IN () version: ~200
explain cost, join version: ~670
explain cost, virtual table *): ~670

*)
select
   ...
from
   (select ...) as a

Might like (vector version):
explain cost, my IN () version: ~1669

If there were a "generate_series"-style function for arrays, the choice
would have been easy, I think.
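
For what it's worth, PostgreSQL 8.4 and later provide unnest(), which is
essentially that: it expands an array into a set of rows, so the lookup
could be written without the generate_series trick:

select
    pic.*
from
    item pic
where
    pic.objectid in (
        select unnest(mlv.child_arr)
        from might_like_vector mlv
        where mlv.item = 125 AND
              mlv.created_at > now() - interval '1 week'
    )
limit 16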

Best regards,
Marcus