Re: Improving executor performance

From Peter Geoghegan
Subject Re: Improving executor performance
Msg-id CAM3SWZRq4P6cpMWM6OHWXAqDf_EUsLSrWuywdmVk9G41KNeKvA@mail.gmail.com
In reply to Re: Improving executor performance  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Wed, Jul 13, 2016 at 6:18 PM, Andres Freund <andres@anarazel.de> wrote:
> While, as in 6) above, removing linked lists from the relevant
> structures helps, it's not that much. Staring at this for a long while
> made me realize that, somewhat surprisingly to me, is that one of the
> major problems is that we are bottlenecked on stack usage. Constantly
> entering and leaving this many functions for trivial expressions
> evaluations hurts considerably. Some of that is the additional numbers
> of instructions, some of that is managing return jump addresses, and
> some of that the actual bus traffic. It's also rather expensive to check
> for stack limits at a very high rate.

You'll recall that I complained about how parallel CREATE INDEX, while
generally effective, became incredibly CPU bound on the still-serial
merge in my C-collated text test case (I told you this in person, I
think). I looked into addressing this bottleneck and made an
interesting discovery, one that reminds me of what you say here
about function call overhead.

I hacked up varstrfastcmp_c() to assume 4-byte varlena headers. With
that assumption, no function call to pg_detoast_datum_packed() is made
(which otherwise happens through all that DatumGetVarStringPP() macro
indirection). Of course, the assumption is dangerous in the general
case, but it could be made to work in most cases with a little care,
since in practice the vast majority of text comparisons are of text
Datums with a 4-byte varlena header. SortSupport could reliably detect
whether it was safe, with a little help from the text opclass, or
something along those lines. This might not just be useful with
sorting, but it does seem to be particularly useful with parallel
sorting.
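
To make the idea concrete, here is a minimal standalone sketch of a
comparator that takes the 4-byte header for granted. It is not the
actual hack and does not use PostgreSQL's headers: ToyText,
toy_text_make() and toy_fastcmp_c() are hypothetical names, and the
header here is a plain uint32 total length rather than the real
varlena bit layout. The point is only that once the header form is
assumed, the comparison reduces to two length reads, a memcmp(), and a
length tie-break, with no detoasting call on the way in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Toy stand-in for a text value with a fixed 4-byte length header.
 * ASSUMPTION: this is a simplified model, not PostgreSQL's varlena format.
 */
typedef struct ToyText
{
    uint32_t total_len;     /* total size in bytes, header included */
    char     data[];        /* payload bytes, not NUL-terminated */
} ToyText;

#define TOY_HDRSZ ((uint32_t) offsetof(ToyText, data))

/* Build a ToyText from a C string; returns NULL if allocation fails. */
static ToyText *
toy_text_make(const char *s)
{
    size_t   len = strlen(s);
    ToyText *t = malloc(TOY_HDRSZ + len);

    if (t == NULL)
        return NULL;
    t->total_len = (uint32_t) (TOY_HDRSZ + len);
    memcpy(t->data, s, len);
    return t;
}

/*
 * C-collation style comparison that assumes both inputs carry the
 * 4-byte header, so lengths and payload pointers are read directly.
 * The caller is responsible for guaranteeing that assumption holds.
 */
static int
toy_fastcmp_c(const ToyText *a, const ToyText *b)
{
    uint32_t len1, len2;
    int      result;

    assert(a != NULL && b != NULL);

    len1 = a->total_len - TOY_HDRSZ;
    len2 = b->total_len - TOY_HDRSZ;

    /* Compare the common prefix bytewise. */
    result = memcmp(a->data, b->data, len1 < len2 ? len1 : len2);

    /* On a tie, the shorter string sorts first. */
    if (result == 0 && len1 != len2)
        result = (len1 < len2) ? -1 : 1;

    return result;
}
```

In a real patch, the check that every input really has the 4-byte form
would have to happen once, up front, rather than per comparison.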

Just on my laptop, this makes some parallel CREATE INDEX gensort test
cases [1] take as much as 15% less time to execute overall. That's a
big difference. I looked at the disassembly, and the number of
instructions for varstrfastcmp_c() was reduced from 113 to 29. That's
the kind of difference that could add up to a lot.

[1] https://github.com/petergeoghegan/gensort
-- 
Peter Geoghegan


