Re: Missing CFI in hlCover()?

From: Tom Lane
Subject: Re: Missing CFI in hlCover()?
Msg-id: 1343033.1596066368@sss.pgh.pa.us
In response to: Re: Missing CFI in hlCover()?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Missing CFI in hlCover()?  (Stephen Frost <sfrost@snowman.net>)
List: pgsql-hackers
I wrote:
> (I wonder if we need to try to make it faster.  I'd supposed that the
> loop was cheap enough to be a non-problem, but with large enough
> documents maybe not?  It seems like converting to a hash table could
> be worthwhile for a large doc.)

OK, I dug into Stephen's test case off-list.  While some CFIs would
be a good idea, that's just band-aid'ing the symptom.  The actual
problem is that hlCover() is taking way too much time.  The test case
boils down to "common_word & rare_word" matched to a very long document,
wherein the rare_word appears only near the front.  Once we're past
that match, hlCover() tries all the remaining matches for common_word,
of which there are plenty ... and for each one, it scans clear to the
end of the document, looking vainly for a substring that will satisfy
the "common_word & rare_word" query.  So obviously, this is O(N^2)
in the length of the document :-(.
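To see the blowup concretely, here is a toy model of the scan pattern described above (this is illustrative C, not the actual hlCover() code): for each match of common_word past the sole rare_word, a full forward scan runs to the end of the document and never completes the cover.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the pathological scan.  doc[i] is 1 where common_word
 * matches, 2 where rare_word matches.  For each common_word match we
 * scan forward looking for a position that would complete the cover;
 * when rare_word never appears again, every scan runs to end-of-document.
 * Returns the total number of positions visited by all scans.
 */
static size_t
wasted_scans(const int *doc, size_t len)
{
	size_t		total = 0;

	for (size_t i = 0; i < len; i++)
	{
		if (doc[i] != 1)
			continue;			/* not a common_word match */
		for (size_t j = i + 1; j < len; j++)
		{
			total++;
			if (doc[j] == 2)
				break;			/* cover completed; stop scanning */
		}
	}
	return total;
}
```

With rare_word only at the front and common_word everywhere else, the total work is the familiar arithmetic series, i.e. O(N^2) in document length.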

I have to suppose that I introduced this problem in c9b0c678d, since
AFAIR we weren't getting ts_headline-takes-forever complaints before
that.  It's not immediately obvious why the preceding algorithm doesn't
have a similar issue, but then there's not anything at all that was
obvious about the preceding algorithm.

The most obvious way to fix the O(N^2) hazard is to put a limit on the
length of "cover" (matching substring) that we'll consider.  For the
mark_hl_words caller, we won't highlight more than max_words words
anyway, so it would be reasonable to bound covers to that length or
some small multiple of it.  The other caller, mark_hl_fragments, is
willing to highlight up to max_fragments fragments of up to max_words
words each, and there can be some daylight between the fragments, so
it's not quite clear what the longest reasonable match is.  Still, I doubt it's
useful to show a headline consisting of a few words from the start of
the document and a few words from thousands of words later, so a limit
of max_fragments times max_words times something would probably be
reasonable.

We could hard-code a rule like that, or we could introduce a new
explicit parameter for the maximum cover length.  The latter would be
more flexible, but we need something back-patchable and I'm concerned
about the compatibility hazards of adding a new parameter in minor
releases.  So on the whole I propose hard-wiring a multiplier of,
say, 10 for both these cases.
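A minimal sketch of the bounded scan (the name max_cover and the toy token encoding are hypothetical, not the actual patch): the forward scan simply abandons a candidate cover once it exceeds the limit, so each start position costs O(max_cover) rather than O(document length).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy bounded scan; same encoding as before (1 = common_word match,
 * 2 = rare_word match).  A candidate cover is abandoned once it would
 * exceed max_cover tokens, e.g. max_words times a multiplier of 10,
 * turning the worst case from O(N^2) into O(N * max_cover).
 */
static size_t
bounded_scans(const int *doc, size_t len, size_t max_cover)
{
	size_t		total = 0;

	for (size_t i = 0; i < len; i++)
	{
		if (doc[i] != 1)
			continue;			/* not a common_word match */
		for (size_t j = i + 1; j < len && j - i < max_cover; j++)
		{
			total++;
			if (doc[j] == 2)
				break;			/* cover completed within the bound */
		}
	}
	return total;
}
```

On the same worst-case document as above, total work now grows linearly with document length for a fixed bound.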

            regards, tom lane



In the pgsql-hackers list, by date:

Previous
From: Fujii Masao
Subject: Re: [PATCH] Tab completion for VACUUM of partitioned tables
Next
From: Fujii Masao
Subject: Re: [PATCH] Initial progress reporting for COPY command