Re: GIN improvements part2: fast scan

From: Tomas Vondra
Subject: Re: GIN improvements part2: fast scan
Msg-id: 52E1D47D.2060606@fuzzy.cz
In reply to: Re: GIN improvements part2: fast scan  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses: Re: GIN improvements part2: fast scan  (Alexander Korotkov <aekorotkov@gmail.com>)
List: pgsql-hackers
On 23.1.2014 17:22, Heikki Linnakangas wrote:
> On 01/14/2014 05:35 PM, Alexander Korotkov wrote:
>> Attached version is rebased against last version of packed posting lists.
> 
> Thanks!
> 
> I think we're missing a trick with multi-key queries. We know that when
> multiple scan keys are used, they are ANDed together, so we can do the
> skip optimization even without the new tri-state consistent function.
> 
> To get started, I propose the three attached patches. These only
> implement the optimization for the multi-key case, which doesn't require
> any changes to the consistent functions and hence no catalog changes.
> Admittedly this isn't anywhere near as useful in practice as the single
> key case, but let's go for the low-hanging fruit first. This
> nevertheless introduces some machinery that will be needed by the full
> patch anyway.
> 
> I structured the code somewhat differently than your patch. There is no
> separate fast-path for the case where the optimization applies. Instead,
> I'm passing the advancePast variable all the way down to where the next
> batch of items are loaded from the posting tree. keyGetItem is now
> responsible for advancing the entry streams, and the logic in
> scanGetItem has been refactored so that it advances advancePast
> aggressively, as soon as one of the key streams let us conclude that no
> items < a certain point can match.
> 
> scanGetItem might yet need to be refactored when we get to the full
> preconsistent check stuff, but one step at a time.
> 
> 
> The first patch is the most interesting one, and contains the
> scanGetItem changes. The second patch allows seeking to the right
> segment in a posting tree page, and the third allows starting the
> posting tree scan from root, when skipping items (instead of just
> following the right-links).
> 
> Here are some simple performance test results, demonstrating the effect
> of each of these patches. This is a best-case scenario. I don't think
> these patches have any adverse effects even in the worst-case scenario,
> although I haven't actually tried hard to measure that. I used this to
> create a test table:
> 
> create table foo (intarr int[]);
> -- Every row contains 0 (frequent term), and a unique number.
> insert into foo select array[0,g] from generate_series(1, 10000000) g;
> -- Add another tuple with 0, 1 combo physically to the end of the table.
> insert into foo values (array[0,1]);
> 
> The query I used is this:
> 
> postgres=# select count(*) from foo where intarr @> array[0] and intarr
> @> array[1];
>  count
> -------
>      2
> (1 row)
> 
> I measured the time that query takes, and the number of pages hit, using
> "explain (analyze, buffers true) ...".
> 
> patches          time (ms)    buffers
> -------------------------------------
> unpatched        650          1316
> patch 1          0.52         1316
> patches 1+2      0.50         1316
> patches 1+2+3    0.13         15
> 
> So, the second patch isn't doing much in this particular case. But it's
> trivial, and I think it will make a difference in other queries where
> you have the opportunity to skip, but return a lot of tuples overall.
> 
> In summary, these are fairly small patches, and useful on their own, so
> I think these should be committed now. But please take a look and see if
> the logic in scanGetItem/keyGetItem looks correct to you. After this, I
> think the main fast scan logic will go into keyGetItem.
> 
> PS. I find it a bit surprising that in your patch, you're completely
> bailing out if there are any partial-match keys involved. Is there some
> fundamental reason for that, or just not implemented?

I've done some initial testing (with all the three patches applied)
today to see if there are any crashes or obvious failures and found none
so far. The only issue I've noticed is this LOG message in ginget.c:
    elog(LOG, "entryLoadMoreItems, %u/%u, skip: %d",
         ItemPointerGetBlockNumber(&advancePast),
         ItemPointerGetOffsetNumber(&advancePast),
         !stepright);

which produces enormous amount of messages. I suppose that was used for
debugging purposes and shouldn't be there?

I plan to do more thorough testing over the weekend, but I'd like to
make sure I understand what to expect. My understanding is that this
patch should:

- give the same results as the current code (e.g. the fulltext queries
should not return different rows / change the ts_rank etc.)

- improve the performance of fulltext queries

Are there any obvious rules what queries will benefit most from this?
The queries generated by the tool I'm using for testing are mostly of
this form:
    SELECT id FROM messages
     WHERE body_tsvector @@ plainto_tsquery('english', 'word1 word2 ...')
     ORDER BY ts_rank(...) DESC
     LIMIT :n;

with varying number of words and LIMIT values. During the testing today
I haven't noticed any obvious performance difference, but I haven't
spent much time on that.

regards
Tomas
