Re: GIN improvements part2: fast scan

Поиск
Список
Период
Сортировка
От Rod Taylor
Тема Re: GIN improvements part2: fast scan
Дата
Msg-id CAKddOFDakagyJYYyVHrx69K89HwSZZLvZU0xJ03hZ_oLkPO7XQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: GIN improvements part2: fast scan  (Alexander Korotkov <aekorotkov@gmail.com>)
Ответы Re: GIN improvements part2: fast scan  (Alexander Korotkov <aekorotkov@gmail.com>)
Список pgsql-hackers
2%.

It's essentially sentence fragments from 1 to 5 words in length. I wasn't expecting it to be much smaller.

10 recent value selections:

 white vinegar reduce color running
 vinegar cure uti
 cane vinegar acidity depends parameter
 how remedy fir clogged shower
 use vinegar sensitive skin
 home remedies removing rust heating
 does non raw apple cider
 home remedies help maintain healthy
 can vinegar mess up your
 apple cide vineger ph balance

regards,

Rod



On Fri, Nov 15, 2013 at 12:51 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Fri, Nov 15, 2013 at 6:57 PM, Rod Taylor <pg@rbt.ca> wrote:
I tried again this morning using gin-packed-postinglists-16.patch and gin-fast-scan.6.patch. No crashes.

It is about a 0.1% random sample of production data (10,000,000 records) with the below structure. Pg was compiled with debug enabled in both cases.

      Table "public.kp"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | bigint  | not null
 string | text    | not null
 score1 | integer |
 score2 | integer |
 score3 | integer |
 score4 | integer |
Indexes:
    "kp_pkey" PRIMARY KEY, btree (id)
    "kp_string_key" UNIQUE CONSTRAINT, btree (string)
    "textsearch_gin_idx" gin (to_tsvector('simple'::regconfig, string)) WHERE score1 IS NOT NULL



This is a query tested. All data is in Pg buffer cache for these timings. Words like "the" and "and" are very common (~9% of entries, each) and a word like "hotel" is much less common (~0.2% of entries).

  SELECT id,string
    FROM kp
   WHERE score1 IS NOT NULL
     AND to_tsvector('simple', string) @@ to_tsquery('simple', ?)
             -- ? is substituted with the query strings
ORDER BY score1 DESC, score2 ASC
LIMIT 1000;

 Limit  (cost=56.04..56.04 rows=1 width=37) (actual time=250.010..250.032 rows=142 loops=1)
   ->  Sort  (cost=56.04..56.04 rows=1 width=37) (actual time=250.008..250.017 rows=142 loops=1)
         Sort Key: score1, score2
         Sort Method: quicksort  Memory: 36kB
         ->  Bitmap Heap Scan on kp  (cost=52.01..56.03 rows=1 width=37) (actual time=249.711..249.945 rows=142 loops=1)
               Recheck Cond: ((to_tsvector('simple'::regconfig, string) @@ '''hotel'' & ''and'' & ''the'''::tsquery) AND (score1 IS NOT NULL))
               ->  Bitmap Index Scan on textsearch_gin_idx  (cost=0.00..52.01 rows=1 width=0) (actual time=249.681..249.681 rows=142 loops=1)
                     Index Cond: (to_tsvector('simple'::regconfig, string) @@ '''hotel'' & ''and'' & ''the'''::tsquery)
 Total runtime: 250.096 ms



Times are from \timing on.

MASTER
=======
the:               888.436 ms   926.609 ms   885.502 ms
and:               944.052 ms   937.732 ms   920.050 ms
hotel:              53.992 ms    57.039 ms    65.581 ms
and & the & hotel: 260.308 ms   248.275 ms   248.098 ms

These numbers roughly match what we get with Pg 9.2. The time savings between 'the' and 'and & the & hotel' is mostly heap lookups for the score and the final sort.



The size of the index on disk is about 2% smaller in the patched version.

PATCHED
=======
the:              1055.169 ms 1081.976 ms  1083.021 ms
and:               912.173 ms  949.364 ms   965.261 ms
hotel:              62.591 ms   64.341 ms    62.923 ms
and & the & hotel: 268.577 ms  259.293 ms   257.408 ms
hotel & and & the: 253.574 ms  258.071 ms  250.280 ms

I was hoping that the 'and & the & hotel' case would improve with this patch to be closer to the 'hotel' search, as I thought that was the kind of thing it targeted. Unfortunately, it did not. I actually applied the patches, compiled, initdb/load data, and ran it again thinking I made a mistake.

Reordering the terms 'hotel & and & the' doesn't change the result. 
 
Oh, in this path new consistent method isn't implemented for tsvector opclass, for array only. Will be fixed soon.
BTW, was index 2% smaller or 2 times smaller? If it's 2% smaller than I need to know more about your dataset :)

------
With best regards,
Alexander Korotkov.


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Adrian Klaver
Дата:
Сообщение: Review:Patch: SSL: prefer server cipher order
Следующее
От: Kevin Grittner
Дата:
Сообщение: Re: Cannot allocate memory