Обсуждение: Optimizing `WHERE x IN` query
Hi!
I am attempting to replicate YouTube's subscription feed. Each user has a list
of channel IDs (as text[]) that they are subscribed to. The `users` table
looks like:
```
=# \d users
Table "public.users"
Column | Type | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------
+---------
email | text | | not null |
subscriptions | text[] | | |
feed_needs_update | boolean | | |
Indexes:
"users_pkey" PRIMARY KEY, btree (email)
"email_unique_idx" UNIQUE, btree (lower(email))
"users_email_key" UNIQUE CONSTRAINT, btree (email)
```
For storing videos, I have a table `channel_videos` from which I want to
select all videos where the video's `ucid` (channel ID) is in the user's
subscriptions. `channel_videos` looks like:
```
=# \d channel_videos
Table "public.channel_videos"
Column | Type | Collation | Nullable |
Default
--------------------+--------------------------+-----------+----------
+---------
id | text | | not null |
title | text | | |
published | timestamp with time zone | | |
updated | timestamp with time zone | | |
ucid | text | | |
author | text | | |
views | bigint | | |
Indexes:
"channel_videos_id_key" UNIQUE CONSTRAINT, btree (id)
"channel_videos_published_idx" btree (published)
"channel_videos_ucid_idx" btree (ucid)
```
In case it might help with indexing, a UCID always begins with `UC`, then 22
random characters in [a-zA-Z0-9_-] (Base64-urlsafe).
Currently, the query I'm using to generate a user's feed is:
```
SELECT * FROM channel_videos WHERE ucid IN (SELECT unnest(subscriptions) FROM
users WHERE email = $1) ORDER BY published DESC;
```
This works great when `subscriptions` and `channel_videos` are both fairly
small. Unfortunately, at this point `channel_videos` contains roughly 28M
rows. Attempting to generate a feed for a user with 177 subscriptions takes
around 40-70s, here's the relevant `EXPLAIN (ANALYZE, BUFFERS)`:
```
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM channel_videos WHERE ucid IN
(SELECT unnest(subscriptions) FROM users WHERE email = 'omarroth') ORDER BY
published DESC;
QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=478207.59..478506.73 rows=119656 width=142) (actual
time=68599.562..68613.824 rows=46456 loops=1)
Sort Key: channel_videos.published DESC
Sort Method: external merge Disk: 6352kB
Buffers: shared hit=456 read=13454 dirtied=13 written=456, temp read=794
written=797
-> Nested Loop (cost=55.94..459526.49 rows=119656 width=142) (actual
time=0.555..68531.550 rows=46456 loops=1)
Buffers: shared hit=453 read=13454 dirtied=13 written=456
-> HashAggregate (cost=10.18..11.18 rows=100 width=32) (actual
time=0.417..0.850 rows=177 loops=1)
Group Key: unnest(users.subscriptions)
Buffers: shared hit=39 read=7
-> ProjectSet (cost=0.41..8.93 rows=100 width=32) (actual
time=0.305..0.363 rows=177 loops=1)
Buffers: shared hit=39 read=7
-> Index Scan using users_email_key on users
(cost=0.41..8.43 rows=1 width=127) (actual time=0.049..0.087 rows=1 loops=1)
Index Cond: (email = 'omarroth'::text)
Buffers: shared hit=5 read=4
-> Bitmap Heap Scan on channel_videos (cost=45.76..4583.18
rows=1197 width=142) (actual time=28.835..387.068 rows=262 loops=177)
Recheck Cond: (ucid = (unnest(users.subscriptions)))
Heap Blocks: exact=12808
Buffers: shared hit=414 read=13447 dirtied=13 written=456
-> Bitmap Index Scan on channel_videos_ucid_idx
(cost=0.00..45.46 rows=1197 width=0) (actual time=22.255..22.255 rows=263
loops=177)
Index Cond: (ucid = (unnest(users.subscriptions)))
Buffers: shared hit=291 read=762 written=27
Planning Time: 1.463 ms
Execution Time: 68619.316 ms
(23 rows)
```
Since a feed is expected to be fairly consistent to what would appear on
YouTube, `channel_videos` receives fairly frequent UPDATEs and INSERTs and is
vacuumed fairly frequently:
```
=# SELECT * FROM pg_stat_user_tables WHERE relname = 'channel_videos';
relid | schemaname | relname | seq_scan | seq_tup_read | idx_scan |
idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup
| n_dead_tup | n_mod_since_analyze | last_vacuum |
last_autovacuum | last_analyze | last_autoanalyze
| vacuum_count | autovacuum_count | analyze_count | autoanalyze_count
-------+------------+----------------+----------+--------------+----------
+---------------+-----------+-----------+-----------+---------------
+------------+------------+---------------------
+------------------------------+-----------------
+-------------------------------+-------------------------------
+--------------+------------------+---------------+-------------------
16444 | public | channel_videos | 3 | 6346042 | 28294649 |
6605201260 | 218485 | 13280741 | 2413 | 11241541 | 24601363 |
585451 | 763200 | 2019-07-05 17:57:21.34215+00 |
| 2019-07-05 17:59:21.013308+00 | 2019-07-04 14:54:02.845591+00 | 4
| 0 | 4 | 2
(1 row)
The machine running Postgres has 4 vCPUs, 8GB of RAM (which I expect is the
problem), and 160GB SSD. `uname -srv`:
# uname -srv
Linux 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC 2019
```
I am running Postgres 11.4, `SELECT version()`:
PostgreSQL 11.4 (Ubuntu 11.4-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled
by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, 64-bit
(1 row)
Omar Roth schrieb am 07.07.2019 um 15:43:
> Currently, the query I'm using to generate a user's feed is:
>
> ```
> SELECT * FROM channel_videos WHERE ucid IN (SELECT unnest(subscriptions) FROM
> users WHERE email = $1) ORDER BY published DESC;
> ```
You could try an EXISTS query without unnest:
select cv.*
from channel_videos cv
where exists ucid (select *
from users u
where cv.ucid = any(u.subscriptions)
and u.email = $1);
Did you try if a properly normalized model performs better?
The suggested query indeed appears to be faster. Thank you.
> Did you try if a properly normalized model performs better?
I've tested the below schema, which doesn't appear to perform much better but
has a couple other advantages for my application:
```
create table subscriptions (
email text references users(email),
ucid text,
primary key (email, ucid)
);
explain (analyze, buffers) select cv.* from channel_videos cv, subscriptions s
where cv.ucid = s.ucid and s.email = $1;
```
Is there something else here I'm missing?
Le 07/07/2019 à 16:33, Thomas Kellerer a écrit : > Omar Roth schrieb am 07.07.2019 um 15:43: >> Currently, the query I'm using to generate a user's feed is: >> >> ``` >> SELECT * FROM channel_videos WHERE ucid IN (SELECT >> unnest(subscriptions) FROM >> users WHERE email = $1) ORDER BY published DESC; >> ``` > > You could try an EXISTS query without unnest: > > select cv.* > from channel_videos cv > where exists ucid (select * > from users u > where cv.ucid = any(u.subscriptions) > and u.email = $1); > > Did you try if a properly normalized model performs better? > > Hi We had big performance issues with queries like that, and we modified them to use && (see https://www.postgresql.org/docs/current/functions-array.html ), resulting in a big perf boost so, with your model, the query could be ``` select cv.* from channel_videos cv inner join user u on cv.ucid && u.subscription where u.email = $1; ``` or ``` select cv.* from channel_videos cv inner join ( select subscription from user where email = $1) as u on cv.ucid && u.subscription ; ``` (disclaimer, I didn't try this queries, they may contain typos) Regards Nicolas
> We had big performance issues with queries like that, and we modified > them to use && (see > https://www.postgresql.org/docs/current/functions-array.html ), > resulting in a big perf boost Much appreciated! Unfortunately I'm having trouble turning your suggestions into a working query. `cv.ucid && u.subscriptions` doesn't appear to work in my case since the comparison is `text && text[]`. Something like: ``` explain (analyze, buffers) select cv.* from channel_videos cv inner join (select subscriptions from users where email = $1) as u on array[cv.ucid] && u.subscriptions; ``` does work but unfortunately doesn't appear to be faster than the current query. Cheers
Did you create a GIN index on subscriptions column to support the && operator?