Re: procost for to_tsvector

Поиск
Список
Период
Сортировка
От Andrew Gierth
Тема Re: procost for to_tsvector
Дата
Msg-id 87h9trs0zm.fsf@news-spur.riddles.org.uk
обсуждение исходный текст
Ответ на Re: procost for to_tsvector  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: procost for to_tsvector  (Jeff Janes <jeff.janes@gmail.com>)
Список pgsql-hackers
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> Nyet ... at least not without you actually making that argument,Tom> with numbers, rather than just handwaving.
Weuse 100 for plpgsqlTom> and suchlike functions.  I'd be OK with making it 10 just onTom> general principles, but
claimingthat it's as expensive as aTom> plpgsql function requires evidence.
 

[TL/DR: 10 isn't enough, even 100 may be too low]

On a text corpus consisting of ~18 thousand blog comments + ~5% of dead
rows, median length 302 bytes, only about 3% long enough to be toasted,
and selecting a common word (~22% of the table):

explain analyze select * from comments where to_tsvector('english',message) @@ '''one'''::tsquery;
                                                  QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------Seq
Scanon comments  (cost=0.00..2406.18 rows=4140 width=792) (actual time=0.601..3946.589 rows=4056 loops=1)  Filter:
(to_tsvector('english'::regconfig,message) @@ '''one'''::tsquery)  Rows Removed by Filter: 14310Planning time: 0.270
msExecutiontime: 3954.745 ms
 
(5 rows)
                                                              QUERY PLAN
               
 

-----------------------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on comments  (cost=204.09..2404.30 rows=4140 width=792) (actual time=2.401..11.564 rows=4056 loops=1)  Recheck
Cond:(to_tsvector('english'::regconfig, message) @@ '''one'''::tsquery)  Heap Blocks: exact=1911  ->  Bitmap Index Scan
oncomments_to_tsvector_idx  (cost=0.00..203.05 rows=4140 width=0) (actual time=1.974..1.974 rows=4313 loops=1)
IndexCond: (to_tsvector('english'::regconfig, message) @@ '''one'''::tsquery)Planning time: 0.278 msExecution time:
17.640ms
 
(7 rows)

(strangely, the seqscan plan is picked despite having a cost more than a
point higher? what's up with that?)

So for two plans with virtually identical cost, we have an execution
time difference on the order of 200x.

We can rule out the performance of the @@ by using a precalculated
tsvector:

explain analyze select * from comments where tsv @@ '''one'''::tsquery;
QUERYPLAN                                                  
 
--------------------------------------------------------------------------------------------------------------Seq Scan
oncomments  (cost=0.00..2359.31 rows=4140 width=792) (actual time=0.023..47.746 rows=4056 loops=1)  Filter: (tsv @@
'''one'''::tsquery) Rows Removed by Filter: 14310Planning time: 0.262 msExecution time: 54.220 ms
 
(5 rows)

So we're looking at an execution time for to_tsvector on the order of
200us, which is a seriously big deal when looking at a potential
seqscan.  That's not just _as_ expensive as a plpgsql function, but more
than 50 times as expensive as a simple one like this:

create function f1(text) returns integer language plpgsqlas $f$ begin return length($1); end; $f$;

select sum(length(message)) from comments;  --  89ms
select sum(f1(message)) from comments;      -- 155ms

66ms difference divided by 18366 rows = 3.6us per call

Now, obviously the default cost for plpgsql functions is assuming that
the function is a whole lot more complex than that, so one wouldn't
argue that to_tsvector should cost 5000. But there's a strong case for
arguing that it should cost a whole lot more than 100, because even at
that value the relative costs for the first two plans in this post only
differ by 2x, compared to a 200x runtime difference.  A value of 10
would be inadequate in many cases; in this example it leaves the slower
plan with a cost only ~15% higher, which is way too close to be
comfortable.

(As another example, a function with a simple query in it, such as
obj_description, can have runtimes on the order of 40us, still 5x faster
than to_tsvector.)

-- 
Andrew (irc:RhodiumToad)



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jim Nasby
Дата:
Сообщение: Re: proposal: searching in array function - array_position
Следующее
От: Haribabu Kommi
Дата:
Сообщение: Re: Parallel Seq Scan