Searching GIN-index (FTS) and sort by timestamp-column

Поиск
Список
Период
Сортировка
От Andreas Joseph Krogh
Тема Searching GIN-index (FTS) and sort by timestamp-column
Дата
Msg-id VisenaEmail.42.91df4628bdf7755c.1537e96e852@tc7-visena
обсуждение исходный текст
Ответы Re: Searching GIN-index (FTS) and sort by timestamp-column
Список pgsql-performance
Hi all.
 
I'm doing full-text-search and want do display the results in the order the articles were received (timestamp). I have millions of articles where the newest are the most interesting, and a search may match many articles so doing the sort using some INDEX would be great.
 
Take the following example-schema:
 
create extension if not exists btree_gin;
drop table if EXISTS delivery;
create table delivery(    id BIGSERIAL primary key,    fts_all TSVECTOR not null,    folder_id BIGINT NOT NULL,    received_timestamp TIMESTAMP not null,    message varchar not null
);
create index fts_idx ON delivery using gin(fts_all, folder_id);

CREATE OR REPLACE FUNCTION update_delivery_tsvector_tf() RETURNS TRIGGER AS $$
BEGIN
    NEW.fts_all = to_tsvector('simple', NEW.message);
    return NEW;
END;
$$ LANGUAGE PLPGSQL;

CREATE TRIGGER update_delivery_tsvector_t BEFORE INSERT OR UPDATE ON delivery
FOR EACH ROW EXECUTE PROCEDURE update_delivery_tsvector_tf();

insert into delivery(folder_id, received_timestamp, message)
values (1, '2015-01-01', 'Yes hit four')    , (1, '2014-01-01', 'Hi man')    , (2, '2013-01-01', 'Hi man')    , (2, '2013-01-01', 'fish')
;

analyze delivery;

set ENABLE_SEQSCAN to off;

explain analyze SELECT del.id    , del.received_timestamp
FROM delivery del
WHERE 1 = 1
      AND del.fts_all @@ to_tsquery('simple', 'hi:*')      AND del.folder_id = 1
ORDER BY  del.received_timestamp DESC LIMIT 101 OFFSET 0;
 
I use btree_gin extention to make folder_id part of index.
 
I get the following plan (using 9.6 from master):
                                                         QUERY PLAN                                                         
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=5.23..5.23 rows=1 width=16) (actual time=0.042..0.043 rows=2 loops=1)
   ->  Sort  (cost=5.23..5.23 rows=1 width=16) (actual time=0.040..0.040 rows=2 loops=1)
         Sort Key: received_timestamp DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on delivery del  (cost=3.90..5.22 rows=1 width=16) (actual time=0.029..0.030 rows=2 loops=1)
               Recheck Cond: (fts_all @@ '''hi'':*'::tsquery)
               Filter: (folder_id = 1)
               Rows Removed by Filter: 1
               Heap Blocks: exact=1
               ->  Bitmap Index Scan on fts_idx  (cost=0.00..3.90 rows=1 width=0) (actual time=0.018..0.018 rows=3 loops=1)
                     Index Cond: (fts_all @@ '''hi'':*'::tsquery)
 Planning time: 0.207 ms
 Execution time: 0.085 ms
(13 rows)
 
Here is the explain from a real-world query:
 
EXPLAIN ANALYZE SELECT del.entity_id    , del.received_timestamp
FROM origo_email_delivery del
WHERE 1 = 1
      AND del.fts_all @@ to_tsquery('simple', 'andre:*')      AND del.folder_id = 44964
ORDER BY del.received_timestamp DESC LIMIT 101 OFFSET 0;
                                                                        QUERY PLAN                                                                        
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=92260.66..92260.91 rows=101 width=16) (actual time=347.891..347.907 rows=101 loops=1)
   ->  Sort  (cost=92260.66..92291.08 rows=12167 width=16) (actual time=347.888..347.899 rows=101 loops=1)
         Sort Key: received_timestamp DESC
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Bitmap Heap Scan on origo_email_delivery del  (cost=2480.95..91794.77 rows=12167 width=16) (actual time=152.568..346.132 rows=18257 loops=1)
               Recheck Cond: (fts_all @@ '''andre'':*'::tsquery)
               Filter: (folder_id = 44964)
               Rows Removed by Filter: 264256
               Heap Blocks: exact=80871
               ->  Bitmap Index Scan on temp_fts_idx  (cost=0.00..2477.91 rows=309588 width=0) (actual time=134.903..134.903 rows=282513 loops=1)
                     Index Cond: (fts_all @@ '''andre'':*'::tsquery)
 Planning time: 0.530 ms
 Execution time: 347.967 ms
(13 rows)
 
 
1. Why isnt' folder_id part of the index-cond?
2. Is there a way to make it use the (same) index to sort by received_timestamp?
3. Using a GIN-index, is there a way to use the index at all for sorting?
4. It doesn't seem like ts_rank uses the index for sorting either.
 
Thanks.
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Вложения

В списке pgsql-performance по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Merge joins on index scans
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Searching GIN-index (FTS) and sort by timestamp-column