Re: [PATCH] Fix for slow GIN index queries when "gin_fuzzy_search_limit"setting is relatively small for large tables

Поиск
Список
Период
Сортировка
От Adé
Тема Re: [PATCH] Fix for slow GIN index queries when "gin_fuzzy_search_limit"setting is relatively small for large tables
Дата
Msg-id CAEknJCfV1aNS5MgXKdRGpmaT9a0TGWifPtQvZCZxAa5pqtf2ew@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [PATCH] Fix for slow GIN index queries when "gin_fuzzy_search_limit" setting is relatively small for large tables  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: [PATCH] Fix for slow GIN index queries when "gin_fuzzy_search_limit" setting is relatively small for large tables  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
Hi, Tom. Thanks for taking a look.
 
It seems like what you're actually trying
to accomplish is to ensure that entryLoadMoreItems's "stepright" path
is taken, instead of re-descending the index from the root.

What I was primarily trying to do is make sure that when entryLoadMoreItems is called, it loads new items that it didn’t load previously, which would occur in special cases. But the solution to that goal does result in the "stepright" path being used. I explain the main goal in a segment of my bug report this way, though it's a bit longwinded (from https://www.postgresql.org/message-id/16220-1a0a4f0cb67cafdc@postgresql.org):


Since the program doesn't load all items into memory at once, it calls the
"entryLoadMoreItems" function when it needs to get another page of items to
iterate through. The "entryLoadMoreItems" function calls are passed an
"advancePast" variable as an argument. This variable decides what leaf page
in the items/posting tree should more items be retrieved from. Usually when
iterating through all possible results, execution will exit the do-while
loop responsible for iteration in order to perform some important actions
(including the updating of the "advancePast" variable) before returning into
the loop again to iterate over more items. However, when "dropItem" returns
true in succession a great many times, the do-while loop can not be exited
for updating the "advancePast" variable until a non-drop finally occurs.
When this "advancePast" variable is not updated it leads to calls to
"entryLoadMoreItems" repeatedly returning the same items when stuck in the
do-while loop by a high succession of dropped items (because "advancePast"
is never updated to a value after items already iterated through).
 
Along with the issue of returning the same items, there's the issue of how
the "entryLoadMoreItems" function traverses the posting tree from the root
each time it's called while stuck in the do-while loop. This especially is
the cause for the bad performance seen for low "gin_fuzzy_search_limit"
values. In some cases, this situation is made even worse when "advancePast"
is set to a value that leads to loading a page of items that has relatively
few items actually past "advancePast", and so it must almost immediately
call "entryLoadMoreItems" again. But because "advancePast" never gets
updated, this results in a higher than usual succession of
"entryLoadMoreItems" function calls (the program loads the same page,
iterates over the same relatively few items until it goes and loads the same
page again), with each call requiring traversal from the root of the posting
tree down to the same leaf page as before.
 
My patch makes it so that when stuck in the do-while loop after many
successive "dropItems" returning true, the program instead now loads actual
new items by passing the last item dropped into the "entryLoadMoreItems"
function instead of the "advancePast" variable that can't be appropriately
updated while stuck in the do-while loop. This means "entryLoadMoreItems"
will instead load items ordered right after the last dropped item. This last
item dropped is also the current item ("curItem") and so the
"entryLoadMoreItems" can directly obtain the next page of items by making a
step right from the current page, instead of traversing from the root of the
posting tree, which is the most important fix for performance.

In regards to this:

While we're here, what do you think about the comment in the other
code branch just above:
                /* XXX: shouldn't we apply the fuzzy search limit here? */
I'm personally inclined to suspect that the fuzzy-search business has
got lots of bugs, which haven't been noticed because (a) it's so squishily
defined that one can hardly tell whether a given result is buggy or not,
and (b) nearly nobody uses it anyway (possibly not unrelated to (a)).
As somebody who evidently is using it, you're probably more motivated
to track down bugs than the rest of us.
 
I think the comment is correct. It should be applied if you are to stay consistent. Like the comment above that comment says, that code branch is for the two cases of either (1) reaching the last page of a posting tree or (2) when an "entry"/keyword has so few results that the item pointers fit in just one page containing a posting list. If there is a chance of a dropped item in the other pages of the posting tree, there should be a chance of dropped items in the last page too for consistency sake at least. And there should also be a chance of dropped items when iterating a single posting list of entry with relatively few results. Placing "|| (entry->reduceResult == true && dropItem(entry))" at the end of the while condition should be all that's needed to apply the fuzzy search limit there.

And I agree that probably usage of the fuzzy search feature is extremely rare and the way I'm using it probably even more rare. So thank you for taking a look at it. It's a really great feature for me though and I'm glad the creator placed it in.

Regards,
Ade

On Tue, Mar 10, 2020 at 7:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Adé <ade.hey@gmail.com> writes:
> Like the title says, using "gin_fuzzy_search_limit" degrades speed when it
> has a relatively low setting.
> ...
> Attached is SQL to test and observe this issue and also attached is a patch
> I want to eventually submit to a commitfest.

I took a brief look at this.  It seems like what you're actually trying
to accomplish is to ensure that entryLoadMoreItems's "stepright" path
is taken, instead of re-descending the index from the root.  Okay,
I can see why that'd be a big win, but why are you tying it to the
dropItem behavior?  It should apply any time we're iterating this loop
more than once.  IOW, it seems to me like the logic about when to step
right is just kinda broken, and this is a band-aid rather than a full fix.
The performance hit is worse for fuzzy-search mode because it will
iterate the loop more (relative to the amount of work done elsewhere),
but there's still a potential for wasted work.

Actually, a look at the code coverage report shows that the
not-step-right path is never taken at all in our standard regression
tests.  Maybe that just says bad things about the tests' coverage, but
now I'm wondering whether we could just flush that code path altogether,
and assume that we should always step right at this point.

[ cc'ing heikki and alexander, who seem to have originated that code
at 626a120656a75bf4fe64b1d0d83c23cb38d3771a.  The commit message says
it saves a lot of I/O, but I'm wondering if this report disproves that.
In any case the lack of test coverage is pretty damning. ]

While we're here, what do you think about the comment in the other
code branch just above:

                /* XXX: shouldn't we apply the fuzzy search limit here? */

I'm personally inclined to suspect that the fuzzy-search business has
got lots of bugs, which haven't been noticed because (a) it's so squishily
defined that one can hardly tell whether a given result is buggy or not,
and (b) nearly nobody uses it anyway (possibly not unrelated to (a)).
As somebody who evidently is using it, you're probably more motivated
to track down bugs than the rest of us.

                        regards, tom lane

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Ashutosh Bapat
Дата:
Сообщение: Re: BEFORE ROW triggers for partitioned tables
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Refactor compile-time assertion checks for C/C++