Thread: Shrinking tuplesort.c's SortTuple struct (Was: More ideas for speeding up sorting)


Shrinking tuplesort.c's SortTuple struct (Was: More ideas for speeding up sorting)

From:
Peter Geoghegan
Date:
On Fri, Sep 9, 2016 at 6:14 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> 1. SortTuple.tupindex is not used when the sort fits in memory. If we
> also got rid of the isnull1 flag, we could shrink SortTuple from 24
> bytes to 16 bytes (on 64-bit systems). That would allow you to pack more
> tuples into memory, which generally speeds things up. We could for
> example steal the least-significant bit from the 'tuple' pointer for
> that, because the pointers are always at least 4-byte aligned. (But see
> next point)

I implemented what you sketched here, back in 2016. That is, I found a
way to make SortTuples 16 bytes on 64-bit platforms, down from 24
bytes (24 bytes includes alignment padding). Note that it only became
possible to do this after commit 8b304b8b72b removed replacement
selection sort, which reduced the number of things that use the
SortTuple.tupindex field to just one: it is now only used for tapenum
during merging. (I also had to remove SortTuple.isnull1 to get down to
16 bytes.)

The easy part was removing SortTuple.tupindex itself -- it was fairly
natural to stash that in the slab allocation for each tape. I used the
aset.c trick of having a metadata "chunk" immediately prior to the
address that represents the allocation proper -- we can just back up by a few
bytes from stup.tuple to find the place to stash the tape number
during merging. The worst thing about this change was that it makes a
tape slab allocation mandatory in cases that previously didn't have
any need for a stup.tuple allocation (e.g. datum tuplesorts of
pass-by-value types), though only during merging. Since we must always
store the tapenum when merging, we always need a slab buffer for each
tape when merging. This aspect wasn't so bad.
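In plain C, the trick can be sketched like this (names such as `TapeChunkHeader` and `tape_alloc` are illustrative, not taken from the patch):

```c
#include <stdlib.h>

/* Hidden metadata "chunk" placed immediately before the allocation
 * proper, in the style of aset.c's chunk headers. */
typedef struct TapeChunkHeader
{
    int         tapenum;        /* tape this tuple came from, during merging */
} TapeChunkHeader;

/* Allocate tuple space with room for the hidden header; the caller
 * only ever sees the address of the tuple itself. */
static void *
tape_alloc(size_t tuple_size, int tapenum)
{
    TapeChunkHeader *hdr = malloc(sizeof(TapeChunkHeader) + tuple_size);

    hdr->tapenum = tapenum;
    return (char *) hdr + sizeof(TapeChunkHeader);
}

/* Recover the tape number by backing up a few bytes from the tuple
 * pointer, just as described above for stup.tuple. */
static int
tape_get_tapenum(void *tuple)
{
    TapeChunkHeader *hdr =
        (TapeChunkHeader *) ((char *) tuple - sizeof(TapeChunkHeader));

    return hdr->tapenum;
}
```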

The hard/ugly part was getting rid of the remaining "unnecessary"
SortTuple field, isnull1. This involved squeezing an extra bit out of
the stup.tuple pointer, by stealing the least-significant bit. This
was invasive in about the way you'd expect it to be. It wasn't awful,
but it also wasn't something I'd countenance pursuing without getting
a fairly noticeable benefit for users. (Actually, the code that I
wrote so far *is* pretty awful, but I could certainly clean it up some
more if I felt like it.)
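A minimal sketch of the bit-stealing scheme, with made-up names; the invasive part is that every site touching stup.tuple or stup.isnull1 has to go through accessors like these:

```c
#include <stdint.h>

/* Tuple pointers are always at least 4-byte aligned, so the
 * least-significant bit is guaranteed to be zero and can encode the
 * isnull1 flag instead. */
#define NULL1_BIT   ((uintptr_t) 1)

static inline void *
pack_tuple(void *tuple, int isnull1)
{
    return (void *) ((uintptr_t) tuple | (isnull1 ? NULL1_BIT : 0));
}

/* Strip the flag bit to recover the real pointer. */
static inline void *
unpack_tuple(void *packed)
{
    return (void *) ((uintptr_t) packed & ~NULL1_BIT);
}

static inline int
unpack_isnull1(void *packed)
{
    return ((uintptr_t) packed & NULL1_BIT) != 0;
}
```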

I think that the rough patch that I came up with gives us an accurate
picture of what the benefits of having SortTuples that are only 16
bytes wide are. The benefits seem kind of underwhelming at this point.
For something like a "SELECT COUNT(distinct my_int4_col) FROM tab"
query, which uses the qsort_ssup() qsort specialization, we can easily
go from getting an external sort to getting an internal sort. We can
maybe end up sorting about 20% faster if things really work out for
the patch. But in cases that users really care about, such as REINDEX,
the difference is in the noise. ISTM that this is simply not worth the
trouble at this time. These days, external sorts are often slightly
faster than internal sorts in practice, due to the fact that we can do
an on-the-fly merge with external sorts, so we could easily hurt
performance by making more memory available!

I don't want to completely close the door on the idea of shrinking
SortTuple to 16 bytes. I can imagine a world in which that matters.
For example, perhaps there will one day be a strong case to be made
for SIMD-based internal sort algorithms for simple cases, such as the
"COUNT(distinct my_int4_col)" query case that I mentioned. Probably
SIMD-based multiway merge sort. I understand that such algorithms are
very sensitive to things like SIMD CPU register sizes, and were only
considered plausible competitors to quicksort due to the advent of
512-bit SIMD registers, which haven't been available in mainstream
CPUs for that long.

I have to admit that this SIMD sorting stuff seems like a bit of a
stretch, though. The papers in this area all seem to make rather
optimistic assumptions about the distribution of values. And, I think
we'd have to be even more aggressive about shrinking SortTuples in
order to realize the benefits of SIMD-based sorting. Besides, sorting
itself is the bottleneck for tuplesort-using operations less and less
these days -- the only remaining interesting bottleneck is probably in
code like index_form_tuple(), which is probably a good target for JIT.
In general, it's much harder to make tuplesort.c noticeably faster
than it used to be -- we've picked all the low-hanging fruit.

--
Peter Geoghegan



Re: Shrinking tuplesort.c's SortTuple struct (Was: More ideas for speeding up sorting)

From:
Heikki Linnakangas
Date:
On 10/08/2019 02:14, Peter Geoghegan wrote:
> The easy part was removing SortTuple.tupindex itself -- it was fairly
> natural to stash that in the slab allocation for each tape. I used the
> aset.c trick of having a metadata "chunk" immediately prior to the
> address that represents the allocation proper -- we can just back up by a few
> bytes from stup.tuple to find the place to stash the tape number
> during merging. The worst thing about this change was that it makes a
> tape slab allocation mandatory in cases that previously didn't have
> any need for a stup.tuple allocation (e.g. datum tuplesorts of
> pass-by-value types), though only during merging. Since we must always
> store the tapenum when merging, we always need a slab buffer for each
> tape when merging. This aspect wasn't so bad.

Hmm. Wouldn't it be more straightforward to have the extra tupindex 
field at the end of the struct? Something like:

typedef struct
{
    void       *tuple;          /* the tuple itself */
    Datum       datum1;         /* value of first key column */
    bool        isnull1;        /* is first key column NULL? */
} SortTuple;

typedef struct
{
    SortTuple   stuple;
    int         tupindex;       /* see notes above */
} MergeTuple;

The initial sorting phase would deal with SortTuples, and the merge 
phase would deal with MergeTuples. The same comparison routines work 
with both.
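To make that concrete, here is a standalone sketch (with assumed, simplified stand-in definitions mirroring the structs above) of why one comparator can serve both phases:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;        /* stand-in for the real Datum */

typedef struct
{
    void       *tuple;          /* the tuple itself */
    Datum       datum1;         /* value of first key column */
    bool        isnull1;        /* is first key column NULL? */
} SortTuple;

typedef struct
{
    SortTuple   stuple;
    int         tupindex;       /* tape number, used only when merging */
} MergeTuple;

/* Comparator written against SortTuple only. */
static int
cmp_datum1(const SortTuple *a, const SortTuple *b)
{
    return (a->datum1 > b->datum1) - (a->datum1 < b->datum1);
}

/* During merging, MergeTuples can reuse the same routine: C guarantees
 * there is no padding before a struct's first member, so a MergeTuple *
 * may be treated as a pointer to its embedded SortTuple. */
static int
cmp_merge(const MergeTuple *a, const MergeTuple *b)
{
    return cmp_datum1((const SortTuple *) a, (const SortTuple *) b);
}
```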

> The hard/ugly part was getting rid of the remaining "unnecessary"
> SortTuple field, isnull1. This involved squeezing an extra bit out of
> the stup.tuple pointer, by stealing the least-significant bit. This
> was invasive in about the way you'd expect it to be. It wasn't awful,
> but it also wasn't something I'd countenance pursuing without getting
> a fairly noticeable benefit for users. (Actually, the code that I
> wrote so far *is* pretty awful, but I could certainly clean it up some
> more if I felt like it.)
> 
> I think that the rough patch that I came up with gives us an accurate
> picture of what the benefits of having SortTuples that are only 16
> bytes wide are. The benefits seem kind of underwhelming at this point.
> For something like a "SELECT COUNT(distinct my_int4_col) FROM tab"
> query, which uses the qsort_ssup() qsort specialization, we can easily
> go from getting an external sort to getting an internal sort. We can
> maybe end up sorting about 20% faster if things really work out for
> the patch.

If you separate the NULLs from non-NULLs in a separate array, as was 
discussed back in 2016, instead of stealing a bit, you can squeeze some 
instructions out of the comparison routines, which might give some extra 
speedup.
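A standalone sketch of that idea, with simplified stand-in types: move the NULL entries to the tail in one pass, then sort only the non-NULL prefix with a comparator that has no NULL branch at all.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct
{
    long        datum1;         /* stand-in for the first key column */
    bool        isnull1;
} STup;

/* Comparator with no NULL check -- only ever sees non-NULL entries. */
static int
cmp_nonnull(const void *a, const void *b)
{
    long        x = ((const STup *) a)->datum1;
    long        y = ((const STup *) b)->datum1;

    return (x > y) - (x < y);
}

/* Partition NULLs to the tail, qsort the non-NULL prefix, and return
 * the number of non-NULL entries. */
static size_t
sort_separating_nulls(STup *a, size_t n)
{
    size_t      nonnull = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (!a[i].isnull1)
        {
            STup        tmp = a[nonnull];

            a[nonnull++] = a[i];
            a[i] = tmp;
        }
    }
    qsort(a, nonnull, sizeof(STup), cmp_nonnull);
    return nonnull;
}
```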

> But in cases that users really care about, such as REINDEX,
> the difference is in the noise. ISTM that this is simply not worth the
> trouble at this time. These days, external sorts are often slightly
> faster than internal sorts in practice, due to the fact that we can do
> an on-the-fly merge with external sorts, so we could easily hurt
> performance by making more memory available!

Yeah, that's a bit sad.

That makes me think: even when everything fits in memory, it might make 
sense to divide the input into a few batches, qsort them individually, 
and do an on-the-fly merge of the batches. I guess I'm essentially 
suggesting that we should use merge instead of quicksort for the 
in-memory case, too.

If we had the concept of in-memory batches, you could merge together 
in-memory and external batches. That might be handy. For example, when 
doing an external sort, instead of flushing the last run to disk before 
you start merging, you could keep it in memory. That might be 
significant in the cases where the input is only slightly too big to fit 
in memory.

- Heikki



Re: Shrinking tuplesort.c's SortTuple struct (Was: More ideas for speeding up sorting)

From:
Peter Geoghegan
Date:
On Sat, Aug 10, 2019 at 1:20 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm. Wouldn't it be more straightforward to have the extra tupindex
> field at the end of the struct?

> The initial sorting phase would deal with SortTuples, and the merge
> phase would deal with MergeTuples. The same comparison routines work
> with both.

Maybe, but then you would have to use MergeTuples in the
tuplesort_heap* routines, which are not just used when merging
external sort runs. You'd probably incur a penalty for top-N heap
sorts too. Now, that could still be worth it, but it's something to
consider.

> If you separate the NULLs from non-NULLs in a separate array, as was
> discussed back in 2016, instead of stealing a bit, you can squeeze some
> instructions out of the comparison routines, which might give some extra
> speedup.

That might work well, but partitioning the memtuples array isn't
trivial. Routines like grow_memtuples() still need to work, and that
seems like it would be tricky. So again, this may well be a better way
to do it, but that isn't obvious.

> > But in cases that users really care about, such as REINDEX,
> > the difference is in the noise. ISTM that this is simply not worth the
> > trouble at this time. These days, external sorts are often slightly
> > faster than internal sorts in practice, due to the fact that we can do
> > an on-the-fly merge with external sorts, so we could easily hurt
> > performance by making more memory available!
>
> Yeah, that's a bit sad.

I think that this is likely to be the problem with any combination of
enhancements that remove fields from the SortTuple struct to get it
down to 16 bytes: making SortTuples only 16 bytes just isn't that
compelling.

> That makes me think: even when everything fits in memory, it might make
> sense to divide the input into a few batches, qsort them individually,
> and do an on-the-fly merge of the batches. I guess I'm essentially
> suggesting that we should use merge instead of quicksort for the
> in-memory case, too.

That might make sense. The Alphasort paper [1] recommends using
quicksort on CPU-cached sized chunks, and merging the chunks together
as they're written out as a single on-disk run. The Alphasort paper is
probably the first place where the abbreviated keys technique is
described, and had a lot of good ideas.
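As a rough illustration of that general approach (not AlphaSort itself, which sorts cache-sized runs and merges them as they stream to disk): qsort fixed-size batches individually, then merge adjacent sorted runs pairwise.

```c
#include <stdlib.h>
#include <string.h>

static int
cmp_int(const void *a, const void *b)
{
    int         x = *(const int *) a;
    int         y = *(const int *) b;

    return (x > y) - (x < y);
}

/* Sort n ints by qsorting batches of 'batch' elements (think: a batch
 * sized to fit in CPU cache), then repeatedly merging adjacent runs. */
static void
batch_merge_sort(int *a, size_t n, size_t batch)
{
    int        *tmp;

    for (size_t i = 0; i < n; i += batch)
        qsort(a + i, (i + batch <= n) ? batch : n - i, sizeof(int), cmp_int);

    tmp = malloc(n * sizeof(int));
    for (size_t width = batch; width < n; width *= 2)
    {
        for (size_t i = 0; i < n; i += 2 * width)
        {
            size_t      mid = (i + width < n) ? i + width : n;
            size_t      end = (i + 2 * width < n) ? i + 2 * width : n;
            size_t      l = i, r = mid, o = i;

            /* standard two-way merge of runs [i, mid) and [mid, end) */
            while (l < mid && r < end)
                tmp[o++] = (a[l] <= a[r]) ? a[l++] : a[r++];
            while (l < mid)
                tmp[o++] = a[l++];
            while (r < end)
                tmp[o++] = a[r++];
        }
        memcpy(a, tmp, n * sizeof(int));
    }
    free(tmp);
}
```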

> If we had the concept of in-memory batches, you could merge together
> in-memory and external batches. That might be handy. For example, when
> doing an external sort, instead of flushing the last run to disk before
> you start merging, you could keep it in memory. That might be
> significant in the cases where the input is only slightly too big to fit
> in memory.

The patch that I wrote to make tuplesort.c use quicksort in preference
to replacement selection sort for generating initial runs started out
as an implementation of something that I called "quicksort with
spillover". The idea was that you could spill only a few extra tuples
to disk when you almost had enough workMem, and then merge the on-disk
run with the larger, quicksorted in-memory run. It worked alright, but
it felt more important to make external sorts use quicksort in
general. Robert Haas really hated it at the time, because it relied on
various magic numbers, based on heuristics.

The easiest and least controversial way to make internal sorting
faster may be to update our Quicksort algorithm to use the same
implementation that was added to Java 7 [2]. It uses all of the same
tricks as our existing Bentley & McIlroy implementation, but is
more cache efficient. It's considered the successor to B&M, and had
input from Bentley himself. It is provably faster than B&M for a wide
variety of inputs, at least on modern hardware.
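For reference, the core of the dual-pivot scheme looks roughly like this for plain ints; the Java 7 version adds insertion sort for small ranges and several other refinements.

```c
static void
swap_int(int *a, int *b)
{
    int         t = *a;

    *a = *b;
    *b = t;
}

/* Dual-pivot quicksort (Yaroslavskiy's scheme): partition into three
 * ranges -- < p, between p and q, > q -- then recurse on each. */
static void
dual_pivot_qsort(int *a, int lo, int hi)
{
    int         p, q, lt, gt, i;

    if (lo >= hi)
        return;
    if (a[lo] > a[hi])
        swap_int(&a[lo], &a[hi]);
    p = a[lo];                  /* smaller pivot */
    q = a[hi];                  /* larger pivot, p <= q */
    lt = lo + 1;                /* a[lo+1 .. lt-1] < p */
    gt = hi - 1;                /* a[gt+1 .. hi-1] > q */
    i = lo + 1;
    while (i <= gt)
    {
        if (a[i] < p)
            swap_int(&a[i++], &a[lt++]);
        else if (a[i] > q)
        {
            while (i < gt && a[gt] > q)
                gt--;
            swap_int(&a[i], &a[gt--]);
            /* a[i] not advanced: re-examine the swapped-in element */
        }
        else
            i++;
    }
    lt--;
    gt++;
    swap_int(&a[lo], &a[lt]);   /* move pivots into final position */
    swap_int(&a[hi], &a[gt]);
    dual_pivot_qsort(a, lo, lt - 1);
    dual_pivot_qsort(a, lt + 1, gt - 1);
    dual_pivot_qsort(a, gt + 1, hi);
}
```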

[1] http://www.vldb.org/journal/VLDBJ4/P603.pdf
[2] https://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf
-- 
Peter Geoghegan