cost_sort() improvements

Поиск
Список
Период
Сортировка
От Teodor Sigaev
Тема cost_sort() improvements
Дата
Msg-id 803ccdce-39ef-f1c3-3c65-c79959f7081d@sigaev.ru
обсуждение исходный текст
Ответы Re: cost_sort() improvements  (Peter Geoghegan <pg@bowt.ie>)
Re: cost_sort() improvements  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: cost_sort() improvements  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: cost_sort() improvements  (Teodor Sigaev <teodor@sigaev.ru>)
Список pgsql-hackers
Hi!

Current estimation of sort cost has following issues:
  - it doesn't differ one and many columns sort
  - it doesn't pay attention to comparison function cost and column width
  - it doesn't try to count number of calls of comparison function on per column
    basis

At least [1] and [2] hit into to that issues and have an objections/questions 
about correctness of cost sort estimation. Suggested patch tries to improve 
current estimation and solve that issues.

Basic ideas:
  - let me introduce some designations
     i   - index of column, started with 1
     N   - number of tuples to sort
     Gi  - number of groups, calculated by i number columns. Obviously,
           G0 == 1. Number of groups is caculated by estimate_num_groups().
     NGi - number of tuples in one group. NG0 == N and NGi = N/G(i-1)
     Fi  - cost of comparison function call of i-th column
     Wi  - average width in bytes of 1-ith column.
     Ci  - total cost of one comparison call of column calculated as
           Fi * fine_function(Wi) where fine_function(Wi) looks like:
           if (Wi <= sizeof(Datum))
               return 1.0; //any type passed in Datum
           else
               return 1 + log16(Wi/sizeof(Datum));
           log16 just an empirical choice
  - So, cost for first column is 2.0 * C0 * log2(N)
    First column is always compared, multiplier 2.0 is defined per regression
    test. Seems, it estimates a cost for transferring tuple to CPU cache,
    starting cost for unpacking tuple, cost of call qsort compare wrapper
    function, etc. Removing this multiplier causes too optimistic prediction of
    cost.
  - cost for i-th columns:
    Ci * log2(NGi) => Ci * log2(N/G(i-1))
    So, here I believe, that i-th comparison function will be called only inside
    group which number is defined by (i-1) leading columns. Note, following
    discussion [1] I add multiplier 1.5 here to be closer to worst case when
    groups are significantly differ. Right now there is no way to determine how
    differ they could be. Note, this formula describes cost of first column too
    except difference with multiplier.
  - Final cost is cpu_operator_cost * N * sum(per column costs described above).
    Note, for single column with width <= sizeof(datum) and F1 = 1 this formula
    gives exactly the same result as current one.
  - for Top-N sort empiric is close to old one: use 2.0 multiplier as constant
    under log2, and use log2(Min(NGi, output_tuples)) for second and following
    columns.

I believe, suggested method could be easy enhanced to support [1] and used in [2].

Open items:
  - Fake Var. I faced with generate_append_tlist() which generates Var with
    varno = 0, it was invented, as I can see, at 2002 with comment 'should be
    changed in future'. Future hasn't yet come... I've added workaround to
    prevent call estimate_num_group() with such Vars, but I'm not sure that is
    enough.
  - Costs of all comparison functions now are the same and equal to 1. May be,
    it's time to change that.
  - Empiric constants. I know, it's impossible to remove them at all, but, may
    be, we need to find better estimation of them.

[1] https://commitfest.postgresql.org/18/1124/
[2] https://commitfest.postgresql.org/18/1651/

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Geoghegan
Дата:
Сообщение: Re: Tips on committing
Следующее
От: Teodor Sigaev
Дата:
Сообщение: Re: POC: GROUP BY optimization