Optimizing `WHERE x IN` query

Поиск
Список
Период
Сортировка
От Omar Roth
Тема Optimizing `WHERE x IN` query
Дата
Msg-id DM5PR18MB1051CD296F9F94AE0D67FD84DFF70@DM5PR18MB1051.namprd18.prod.outlook.com
обсуждение исходный текст
Ответы Re: Optimizing `WHERE x IN` query
Список pgsql-performance
Hi!

I am attempting to replicate YouTube's subscription feed. Each user has a list
of channel IDs (as text[]) that they are subscribed to. The `users` table
looks like:

```
=# \d users
                             Table "public.users"
      Column       |           Type           | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------
+---------
 email             | text                     |           | not null |
 subscriptions     | text[]                   |           |          |
 feed_needs_update | boolean                  |           |          |
Indexes:
    "users_pkey" PRIMARY KEY, btree (email)
    "email_unique_idx" UNIQUE, btree (lower(email))
    "users_email_key" UNIQUE CONSTRAINT, btree (email)
```

For storing videos, I have a table `channel_videos` from which I want to
select all videos where the video's `ucid` (channel ID) is in the user's
subscriptions. `channel_videos` looks like:

```
=# \d channel_videos
                         Table "public.channel_videos"
       Column       |           Type           | Collation | Nullable |
Default
--------------------+--------------------------+-----------+----------
+---------
 id                 | text                     |           | not null |
 title              | text                     |           |          |
 published          | timestamp with time zone |           |          |
 updated            | timestamp with time zone |           |          |
 ucid               | text                     |           |          |
 author             | text                     |           |          |
 views              | bigint                   |           |          |
Indexes:
    "channel_videos_id_key" UNIQUE CONSTRAINT, btree (id)
    "channel_videos_published_idx" btree (published)
    "channel_videos_ucid_idx" btree (ucid)
```

In case it might help with indexing, a UCID always begins with `UC`, then 22
random characters in [a-zA-Z0-9_-] (Base64-urlsafe).

Currently, the query I'm using to generate a user's feed is:

```
SELECT * FROM channel_videos WHERE ucid IN (SELECT unnest(subscriptions) FROM
users WHERE email = $1) ORDER BY published DESC;
```

This works great when `subscriptions` and `channel_videos` are both fairly
small. Unfortunately, at this point `channel_videos` contains roughly 28M
rows. Attempting to generate a feed for a user with 177 subscriptions takes
around 40-70s, here's the relevant `EXPLAIN (ANALYZE, BUFFERS)`:

```
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM channel_videos WHERE ucid IN
(SELECT unnest(subscriptions) FROM users WHERE email = 'omarroth') ORDER BY
published DESC;
                                                                      QUERY
PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=478207.59..478506.73 rows=119656 width=142) (actual
time=68599.562..68613.824 rows=46456 loops=1)
   Sort Key: channel_videos.published DESC
   Sort Method: external merge  Disk: 6352kB
   Buffers: shared hit=456 read=13454 dirtied=13 written=456, temp read=794
written=797
   ->  Nested Loop  (cost=55.94..459526.49 rows=119656 width=142) (actual
time=0.555..68531.550 rows=46456 loops=1)
         Buffers: shared hit=453 read=13454 dirtied=13 written=456
         ->  HashAggregate  (cost=10.18..11.18 rows=100 width=32) (actual
time=0.417..0.850 rows=177 loops=1)
               Group Key: unnest(users.subscriptions)
               Buffers: shared hit=39 read=7
               ->  ProjectSet  (cost=0.41..8.93 rows=100 width=32) (actual
time=0.305..0.363 rows=177 loops=1)
                     Buffers: shared hit=39 read=7
                     ->  Index Scan using users_email_key on users
(cost=0.41..8.43 rows=1 width=127) (actual time=0.049..0.087 rows=1 loops=1)
                           Index Cond: (email = 'omarroth'::text)
                           Buffers: shared hit=5 read=4
         ->  Bitmap Heap Scan on channel_videos  (cost=45.76..4583.18
rows=1197 width=142) (actual time=28.835..387.068 rows=262 loops=177)
               Recheck Cond: (ucid = (unnest(users.subscriptions)))
               Heap Blocks: exact=12808
               Buffers: shared hit=414 read=13447 dirtied=13 written=456
               ->  Bitmap Index Scan on channel_videos_ucid_idx
(cost=0.00..45.46 rows=1197 width=0) (actual time=22.255..22.255 rows=263
loops=177)
                     Index Cond: (ucid = (unnest(users.subscriptions)))
                     Buffers: shared hit=291 read=762 written=27
 Planning Time: 1.463 ms
 Execution Time: 68619.316 ms
(23 rows)

```

Since a feed is expected to be fairly consistent to what would appear on
YouTube, `channel_videos` receives fairly frequent UPDATEs and INSERTs and is
vacuumed fairly frequently:

```
=# SELECT * FROM pg_stat_user_tables WHERE relname = 'channel_videos';
 relid | schemaname |    relname     | seq_scan | seq_tup_read | idx_scan |
idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup
| n_dead_tup | n_mod_since_analyze |         last_vacuum          |
last_autovacuum |         last_analyze          |       last_autoanalyze
| vacuum_count | autovacuum_count | analyze_count | autoanalyze_count
-------+------------+----------------+----------+--------------+----------
+---------------+-----------+-----------+-----------+---------------
+------------+------------+---------------------
+------------------------------+-----------------
+-------------------------------+-------------------------------
+--------------+------------------+---------------+-------------------
 16444 | public     | channel_videos |        3 |      6346042 | 28294649 |
6605201260 |    218485 |  13280741 |      2413 |      11241541 |   24601363 |
585451 |              763200 | 2019-07-05 17:57:21.34215+00 |
| 2019-07-05 17:59:21.013308+00 | 2019-07-04 14:54:02.845591+00 |            4
|                0 |             4 |                 2
(1 row)
The machine running Postgres has 4 vCPUs, 8GB of RAM (which I expect is the
problem), and 160GB SSD. `uname -srv`:

# uname -srv
Linux 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC 2019
```
I am running Postgres 11.4, `SELECT version()`:

 PostgreSQL 11.4 (Ubuntu 11.4-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled
by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, 64-bit
(1 row)





В списке pgsql-performance по дате отправления:

Предыдущее
От: Ancoron Luciferis
Дата:
Сообщение: Re: Custom opclass for column statistics?
Следующее
От: Thomas Kellerer
Дата:
Сообщение: Re: Optimizing `WHERE x IN` query