Re: [PERFORM] A Better External Sort?

Поиск
Список
Период
Сортировка
От Dann Corbit
Тема Re: [PERFORM] A Better External Sort?
Дата
Msg-id D425483C2C5C9F49B5B7A41F8944154757D133@postal.corporate.connx.com
обсуждение исходный текст
Ответ на [PERFORM] A Better External Sort?  (Ron Peacetree <rjpeace@earthlink.net>)
Список pgsql-hackers
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of PFC
> Sent: Thursday, September 29, 2005 9:10 AM
> To: rjpeace@earthlink.net
> Cc: Pg Hackers; pgsql-performance@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>
>     Just to add a little anarchy in your nice debate...
>
>     Who really needs all the results of a sort on your terabyte
table ?

Reports with ORDER BY/GROUP BY, and many other possibilities.  40% of
mainframe CPU cycles are spent sorting.  That is because huge volumes of
data require lots of energy to be meaningfully categorized.  Let's
suppose that instead of a terabyte of data (or a petabyte or whatever)
we have 10% of it.  That's still a lot of data.

>     I guess not many people do a SELECT from such a table and want
all
> the
> results.

What happens when they do?  The cases where it is already fast are not
very important.  The cases where things go into the crapper are the ones
that need attention.

>So, this leaves :
>     - Really wanting all the results, to fetch using a cursor,
>     - CLUSTER type things, where you really want everything in
order,
>     - Aggregates (Sort->GroupAggregate), which might really need to
sort
> the
> whole table.
>     - Complex queries where the whole dataset needs to be examined,
in
> order
> to return a few values
>     - Joins (again, the whole table is probably not going to be
> selected)
>     - And the ones I forgot.
>
>     However,
>
>     Most likely you only want to SELECT N rows, in some ordering :
>     - the first N (ORDER BY x LIMIT N)
>     - last N (ORDER BY x DESC LIMIT N)

For these, the QuickSelect algorithm is what is wanted.  For example:
#include <stdlib.h>
typedef double  Etype;

extern Etype    RandomSelect(Etype * A, size_t p, size_t r, size_t i);
extern size_t   RandRange(size_t a, size_t b);
extern size_t   RandomPartition(Etype * A, size_t p, size_t r);
extern size_t   Partition(Etype * A, size_t p, size_t r);

/*
**
** In the following code, every reference to CLR means:
**
**    "Introduction to Algorithms"
**    By Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest
**    ISBN 0-07-013143-0
*/


/*
** CLR, page 187
*/
Etype           RandomSelect(Etype A[], size_t p, size_t r, size_t i)
{
    size_t          q,
                    k;
    if (p == r)
        return A[p];
    q = RandomPartition(A, p, r);
    k = q - p + 1;

    if (i <= k)
        return RandomSelect(A, p, q, i);
    else
        return RandomSelect(A, q + 1, r, i - k);
}

size_t          RandRange(size_t a, size_t b)
{
    size_t          c = (size_t) ((double) rand() / ((double) RAND_MAX +
1) * (b - a));
    return c + a;
}

/*
** CLR, page 162
*/
size_t          RandomPartition(Etype A[], size_t p, size_t r)
{
    size_t          i = RandRange(p, r);
    Etype           Temp;
    Temp = A[p];
    A[p] = A[i];
    A[i] = Temp;
    return Partition(A, p, r);
}

/*
** CLR, page 154
*/
size_t          Partition(Etype A[], size_t p, size_t r)
{
    Etype           x,
                    temp;
    size_t          i,
                    j;

    x = A[p];
    i = p - 1;
    j = r + 1;

    for (;;) {
        do {
            j--;
        } while (!(A[j] <= x));
        do {
            i++;
        } while (!(A[i] >= x));
        if (i < j) {
            temp = A[i];
            A[i] = A[j];
            A[j] = temp;
        } else
            return j;
    }
}

>     - WHERE x>value ORDER BY x LIMIT N
>     - WHERE x<value ORDER BY x DESC LIMIT N
>     - and other variants
>
>     Or, you are doing a Merge JOIN against some other table ; in
that
> case,
> yes, you might need the whole sorted terabyte table, but most likely
there
> are WHERE clauses in the query that restrict the set, and thus, maybe
we
> can get some conditions or limit values on the column to sort.

Where clause filters are to be applied AFTER the join operations,
according to the SQL standard.

>     Also the new, optimized hash join, which is more memory
efficient,
> might
> cover this case.

For == joins.  Not every order by is applied to joins.  And not every
join is an equal join.

>     Point is, sometimes, you only need part of the results of your
sort.
> And
> the bigger the sort, the most likely it becomes that you only want
part of
> the results.

That is an assumption that will sometimes be true, and sometimes not.
It is not possible to predict usage patterns for a general purpose
database system.

> So, while we're in the fun hand-waving, new algorithm trying
> mode, why not consider this right from the start ? (I know I'm totally
in
> hand-waving mode right now, so slap me if needed).
>
>     I'd say your new, fancy sort algorithm needs a few more input
values
> :
>
>     - Range of values that must appear in the final result of the
sort :
>         none, minimum, maximum, both, or even a set of values
from the
> other
> side of the join, hashed, or sorted.

That will already happen (or it certainly ought to)

>     - LIMIT information (first N, last N, none)

That will already happen (or it certainly ought to -- I would be pretty
surprised if it does not happen)

>     - Enhanced Limit information (first/last N values of the second
> column to
> sort, for each value of the first column) (the infamous "top10 by
> category" query)
>     - etc.

All the filters will (at some point) be applied to the data unless they
cannot be applied to the data by formal rule.

>     With this, the amount of data that needs to be kept in memory is
> dramatically reduced, from the whole table (even using your compressed
> keys, that's big) to something more manageable which will be closer to
the
> size of the final result set which will be returned to the client, and
> avoid a lot of effort.

Sorting the minimal set is a good idea.  Sometimes there is a big
savings there. I would be pretty surprised if a large fraction of data
that does not have to be included is actually processed during the
sorts.

>     So, this would not be useful in all cases, but when it applies,
it
> would
> be really useful.

No argument there.  And if an algorithm is being reworked, it is a good
idea to look at things like filtering to see if all filtering that is
allowed by the language standard before the sort takes place is applied.

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tony Caduto
Дата:
Сообщение: Re: Found small issue with OUT params
Следующее
От: "Joshua D. Drake"
Дата:
Сообщение: Re: Found small issue with OUT params