Re: Abbreviated keys for Numeric

Поиск
Список
Период
Сортировка
Искать

Re: Abbreviated keys for Numeric

От:
Tom Lane <tgl@sss.pgh.pa.us>
Дата:

Re: Abbreviated keys for Numeric

От:
Tom Lane <tgl@sss.pgh.pa.us>
Дата:

Re: Re: Abbreviated keys for Numeric

От:
Andres Freund <andres@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric

От:
"ktm@rice.edu" <ktm@rice.edu>
Дата:

Re: Abbreviated keys for Numeric

От:
Tom Lane <tgl@sss.pgh.pa.us>
Дата:

Re: Abbreviated keys for Numeric

От:
Tom Lane <tgl@sss.pgh.pa.us>
Дата:

Re: Abbreviated keys for Numeric

От:
Tom Lane <tgl@sss.pgh.pa.us>
Дата:

Re: Abbreviated keys for Numeric

От:
Tom Lane <tgl@sss.pgh.pa.us>
Дата:

Re: Re: Abbreviated keys for Numeric

От:
Petr Jelinek <petr@2ndquadrant.com>
Дата:

Re: Re: Abbreviated keys for Numeric

От:
Gavin Flower <GavinFlower@archidevsys.co.nz>
Дата:

Re: Abbreviated keys for Numeric (was: Re: B-Tree support function number 3 (strxfrm() optimization))

От:
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric (was: Re: B-Tree support function number 3 (strxfrm() optimization))

От:
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Дата:
On 21.2.2015 00:14, Peter Geoghegan wrote:
> On Fri, Feb 20, 2015 at 1:33 PM, Tomas Vondra 
>  wrote:
>> For example with the same percentile_disc() test as in the other
>> thread:
>> 
>> create table stuff as select random()::numeric as randnum from
>> generate_series(1,1000000);
>> 
>> analyze stuff;
>> 
>> select percentile_disc(0) within group (order by randnum) from
>> stuff;
>> 
>> 
>> I get pretty much no difference in runtimes (not even for the
>> smallest dataset, where the Datum patch speedup was significant).
>> 
>> What am I doing wrong?
> 
> So you're testing both the patches (numeric + datum tuplesort) at the
> same time?

No, I was just testing two similar patches separately. I.e. master vs.
each patch separately.

> I can't think why this would make any difference. Did you forget to 
> initdb, so that the numeric sortsupport routine was used?

No, but just to be sure I repeated the benchmarks and I still get the
same results. Each test run does this:

1) remove data directory
2) initdb
3) copy postgresql.conf (with minor tweaks - work_mem/shared_buffers)
4) start
5) create database
6) create test table
7) run a query 5x

I repeated this, just to be sure, but nope - still no speedup :-(

For master vs. patch, I do get these results:

                                 master   patched   speedup
   ---------------------------------------------------------
    generate_series(1,1000000)     1.20      1.25   0.96
    generate_series(1,2000000)     2.75      2.75   1.00
    generate_series(1,3000000)     4.40      4.40   1.00

So, no difference :(

Scripts attached, but it's really trivial test - hopefully I haven't
done anything dumb.

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Abbreviated keys for Numeric (was: Re: B-Tree support function number 3 (strxfrm() optimization))

От:
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric (was: Re: B-Tree support function number 3 (strxfrm() optimization))

От:
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Дата:
Hi,

On 21.2.2015 02:06, Tomas Vondra wrote:
> On 21.2.2015 02:00, Andrew Gierth wrote:
>>>>>>> "Tomas" == Tomas Vondra  writes:
>>
>>  >> Right...so don't test a datum sort case, since that isn't supported
>>  >> at all in the master branch. Your test case is invalid for that
>>  >> reason.
>>
>>  Tomas> What do you mean by 'Datum sort case'?
>>
>> A case where the code path goes via tuplesort_begin_datum rather than
>> tuplesort_begin_heap.
>>
>>  Tomas> The test I was using is this:
>>
>>  Tomas>    select percentile_disc(0) within group (order by randnum) from stuff;
>>
>> Sorting single columns in aggregate calls uses the Datum sort path (in
>> fact I think it's currently the only place that does).
>>
>> Do that test with _both_ the Datum and Numeric sort patches in place,
>> and you will see the effect. With only the Numeric patch, the numeric
>> abbrev code is not called.
> 
> D'oh! Thanks for the explanation.

OK, so I've repeated the benchmarks with both patches applied, and I
think the results are interesting. I extended the benchmark a bit - see
the SQL script attached.

  1) multiple queries

     select percentile_disc(0) within group (order by val) from stuff

     select count(distinct val) from stuff

     select * from
       (select * from stuff order by val offset 100000000000) foo

  2) multiple data types - int, float, text and numeric

  3) multiple scales - 1M, 2M, 3M, 4M and 5M rows

Each query was executed 10x, the timings were averaged. I do know some
of the data types don't benefit from the patches, but I included them to
get a sense of how noisy the results are.

I did the measurements for

  1) master
  2) master + datum_sort_abbrev.patch
  3) master + datum_sort_abbrev.patch + numeric_sortsup.patch

and then computed the speedup for each type/scale combination (the
impact on all the queries is almost exactly the same).

Complete results are available here: http://bit.ly/1EA4mR9

I'll post all the summary here, although some of the numbers are about
the other abbreviated keys patch.


1) datum_sort_abbrev.patch vs. master

    scale      float      int    numeric     text
    ---------------------------------------------
    1          101%       99%       105%     404%
    2          101%       98%        96%      98%
    3          101%      101%        99%      97%
    4          100%      101%        98%      95%
    5           99%       98%        93%      95%

2) numeric_sortsup.patch vs. master

    scale     float       int    numeric     text
    ---------------------------------------------
    1           97%       98%       374%     396%
    2          100%      101%       407%      96%
    3           99%      102%       407%      95%
    4           99%      101%       423%      92%
    5           95%       99%       411%      92%


I think the gains are pretty awesome - I mean, 400% speedup for Numeric
accross the board? Yes please!

The gains for text are also very nice, although in this case that only
happens for the smallest scale (1M rows), and for larger scales it's
actually slower than current master :-(

It's not just rainbows and unicorns, though. With both patches applied,
text sorts get even slower (up to ~8% slower than master), It also seems
to impact float (which gets ~5% slower, for some reason), but I don't
see how that could happen ... but I suspect this might be noise.

I'll repeat the tests on another machine after the weekend, and post an
update whether the results are the same or significantly different.

regards

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Abbreviated keys for Numeric

От:
Gavin Flower <GavinFlower@archidevsys.co.nz>
Дата:

Re: Abbreviated keys for Numeric

От:
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Gavin Flower <GavinFlower@archidevsys.co.nz>
Дата:

Re: Abbreviated keys for Numeric

От:
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Petr Jelinek <petr@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Petr Jelinek <petr@2ndquadrant.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Dunstan <andrew@dunslane.net>
Дата:

On 04/04/2015 10:27 AM, Tom Lane wrote:
> Andrew Gierth  writes:
>> "Tom" == Tom Lane  writes:
>>   Tom> ... btw, has anyone noticed that this patch broke hamerkop and
>>   Tom> bowerbird?  Or at least, it's hard to see what other recent commit
>>   Tom> would explain the failures they're showing.
>> Now that Robert committed the fix for 64bit Datum w/o USE_FLOAT8_BYVAL,
>> bowerbird seems fixed (hamerkop hasn't run yet).
>> I see nothing in the win32 stuff that tries to define USE_FLOAT8_BYVAL
>> on 64-bit windows, is this just an oversight or does it not actually
>> work there? or is it for on-disk compatibility with 32-bit windows?
> That flag doesn't affect on-disk compatibility.  It could certainly break
> third-party extensions, but we accept the same hazard on non-Windows with
> equanimity.  I suspect this point simply wasn't revisited when we added
> support for 64-bit Windows.
>
> Having said that, I'm fine with leaving this as-is, if only because
> it means we're exercising the --disable-float8-byval code paths in
> the buildfarm ;-)
>
> 			


This seems quite wrong. If we want those paths tested we should ensure 
that buildfarm members are set up with that explicit setting.

I think not making this the default for 64 bit MSVC builds was simply an 
error of omission.

The attached patch should set float8byval as the default on 64 bit MSVC 
builds and error out if it is explicitly set on 32 bit platforms.

cheers

andrew

Re: Abbreviated keys for Numeric

От:
Andrew Dunstan <andrew@dunslane.net>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Abbreviated keys for Numeric (was: Re: B-Tree support function number 3 (strxfrm() optimization))

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:
Another spinoff from the abbreviation discussion. Peter Geoghegan
suggested on IRC that numeric would benefit from abbreviation, and
indeed it does (in some cases by a factor of about 6-7x or more, because
numeric comparison is no speed demon).

This patch abbreviates numerics to a weight+initial digits
representation (the details differ slightly between 32bit and 64bit
builds to make the best use of the available bits).

On 32bit, numeric values that are between about 10^-44 and 10^83, and
which differ either in order of magnitude or in the leading 7
significant decimal digits (not base-10000 digits, single decimals) will
get distinct abbreviations. On 64bit the range is 10^-176 to 10^332 and
the first 4 base-10000 digits are kept, thus comparing 13 to 16 decimal
digits. This is expected to be ample for applications using numeric to
store numbers; applications that store things in numeric that aren't
actually numbers might not see the benefit, but I have not found any
detectable slowdown from the patch even on constructed pathological
data.

-- 
Andrew (irc:RhodiumToad)


Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:
So here's the latest (and, hopefully, last) version:

 - adds diagnostic output from numeric_abbrev_abort using the trace_sort
   GUC

 - fixed Datum cs. uint32 issues in hash_uint32

 - added a short comment about excess-k representation

 - tweaked the indenting and comments a bit

I'm not particularly committed to any specific way of handling the
DEC_DIGITS issue. (I moved away from the "transparently skip
abbreviations" approach of the original because it seemed that reducing
#ifdefism in the code was a desirable feature.)

The INT64_MIN/MAX changes should be committed fairly soon. (I haven't
posted a patch for TRACE_SORT)

-- 
Andrew (irc:RhodiumToad)

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:
So here's a version 3 patch:

1. #ifdefism is reduced to a minimum by (a) desupporting values of NBASE
other than 10000, and (b) keeping the 32bit code, so that there is now
no longer any question of abbreviation not being used (it always is).
So the only #ifs in the code body (rather than declarations) are to
select which implementation of numeric_abbrev_convert_var to use.

2. Some (but not all) stylistic changes have been made following Peter's
version.

3. Buffer management has been simplified by simply allocating the
maximum needed size upfront (since it's only needed for short varlenas).
The truncation behavior has been removed.

4. Updated oid choice etc.

The substance of the code is unchanged from my original patch.  I didn't
add diagnostic output to numeric_abbrev_abort, see my separate post
about the suggestion of a GUC for that.

-- 
Andrew (irc:RhodiumToad)

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Andrew Gierth <andrew@tao11.riddles.org.uk>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:
On Fri, Apr 3, 2015 at 1:39 PM, Andrew Gierth
 wrote:
> It already does; it changes how int64 values are expected to be stored
> in Datum variables. _Everything_ that currently stores an int64 value in
> a Datum is affected.

But this isn't a value of the SQL type "int64".  It's just a bit
pattern that has to fit inside however big a Datum happens to be.

> As I see it, you need a really good reason to override that in a
> specific case, and supporting 64-bit abbreviations on a
> --disable-float8-byval build really isn't a good reason (since 32-bit
> abbrevs work fine and such builds should be vanishingly rare anyway).

In my opinion, there is way too much crap around already just to cater
to --disable-float8-byval.  I think not adding more is a perfectly
reasonable goal.  If somebody wants to remove --disable-float8-byval
some day, I want to minimize the amount of stuff they have to change
in order to do that.  I think that keeping this off the list of stuff
that will require modification is a worthy goal.

> The fact that making this one low-benefit change has introduced no less
> than three separate bugs - the compile error with that #ifdef, the use
> of Int64GetDatum for NANs, and the use of Int64GetDatum for the return
> value of the abbreviation function should possibly be taken as a hint to
> how bad an idea is.

But all of those are trivial, and the first would have been caught by
my compiler if I weren't using such a crappy old compiler.  If
anything that might require as much as 10 lines of code churn to fix
is not worth doing, very little is worth doing.

> If you're determined to go this route - over my protest - then you need
> to do something like define a NumericAbbrevGetDatum(x) macro and use it
> in place of the Int64GetDatum / Int32GetDatum ones for both NAN and the
> return from numeric_abbrev_convert_var.

Patch for that attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:
Attached is a revision of this patch, that I'm calling v2. What do you
think, Andrew?

I've cut the int32 representation/alternative !USE_FLOAT8_BYVAL
encoding scheme entirely, which basically means that 32-bit systems
don't get to have this optimization. 64-bit systems have been
commonplace now for about a decade. This year, new phones came out
with 64-bit architectures, so increasingly even people that work with
embedded systems don't care about 32-bit. I'm not suggesting that
legacy doesn't matter - far from it - but I care much less about
having the latest performance improvements on what are largely legacy
systems. Experience suggests that this is a good time of the cycle to
cut scope. The last commitfest has a way of clarifying what is
actually important.

It seems unwise to include what is actually a fairly distinct encoding
scheme, which the int32/ !USE_FLOAT8_BYVAL variant really was (the
same can't really be said for text abbreviation, since that can
basically work the same way on 32-bit systems, with very little extra
code). This isn't necessarily the right decision in general, but I
feel it's the right decision in the present climate of everyone
frantically closing things out, and feeling burnt out. I'm sorry that
I threw some of your work away, but since we both have other pressing
concerns, perhaps this is understandable. It may be revisited, or I
may lose the argument on this point, but going this way cuts the code
by about 30%, and makes me feel a lot better about the risk of
regressing marginal cases, since I know we always have 8 bytes to work
with. There might otherwise be a danger of regressing under tested
32-bit platforms, or indeed missing other bugs, and frankly I don't
have time to think about that right now.

Other than that, I've tried to keep things closer to the text opclass.
For example, the cost model now has a few debugging traces (disabled
by default). I have altered the ad-hoc cost model so that it no longer
concerns itself with NULL inputs, which seemed questionable (not least
since the abbreviation conversion function isn't actually called for
NULL inputs. Any attempt to track the full count within numeric code
therefore cannot work.). I also now allocate a buffer of scratch
memory separately from the main sortsupport object - doing one
allocation for all sortsupport state, bunched together as a buffer
seemed like a questionable micro-optimization. For similar reasons, I
avoid playing tricks in the VARATT_IS_SHORT() case -- my preferred
approach to avoiding palloc()/pfree() cycles is to simply re-use the
same buffer across calls to numeric_abbrev_convert(), and maybe risk
having to enlarge the relatively tiny buffer once or twice. In other
words, it works more or less the same way as it does with text
abbreviation.

It seemed unwise to silently disable abbreviation when someone
happened to build with DEC_DIGITS != 4. A static assertion now gives
these unusual cases the opportunity to make an informed decision about
either disabling abbreviation or not changing DEC_DIGITS in light of
the performance penalty, in a self-documenting way.

The encoding scheme is unchanged. I think that your conclusions on
those details were sound. Numeric abbreviation has a more compelling
cost/benefit ratio than even that of text. I easily managed to get the
same 6x - 7x improvement that you reported when sorting 10 million
random numeric rows.

Thanks
-- 
Peter Geoghegan

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric (was: Re: B-Tree support function number 3 (strxfrm() optimization))

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric (was: Re: B-Tree support function number 3 (strxfrm() optimization))

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric (was: Re: B-Tree support function number 3 (strxfrm() optimization))

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:
On Sun, Mar 22, 2015 at 1:01 PM, Andrew Gierth
 wrote:
> The substance of the code is unchanged from my original patch.  I didn't
> add diagnostic output to numeric_abbrev_abort, see my separate post
> about the suggestion of a GUC for that.

I don't think that V2 really changed the substance, which seems to be
the implication of your remarks here. You disagreed with my decision
on NULL values - causing me to reconsider my position (so that's now
irrelevant) - and you disagreed with not including support for 32-bit
platforms. Those were the only non-stylistic changes, though. I
certainly didn't change any details of the algorithm that you
proposed, which, FWIW, I think is rather clever. I added a few
defensive assertions to the encoding/conversion routine (which I see
you've removed in V3, a long with a couple of other helpful
assertions), and restructured and expanded upon the comments, but
that's all.

You haven't really taken into my account my V2 feedback with this V3
revision. Even after you yourself specifically called out your
non-explanation of excess-44 as a possible point of confusion for
readers of your patch, you didn't change or expand upon your remarks
on that one iota.

I see that code like this (from your V1) appears in your V3, as if V2
never happened:

+/*
+ * non-fmgr interface to the comparison routine to allow sortsupport to elide
+ * the fmgr call.  The saving here is small given how slow numeric
+ * comparisons are, but there's no reason not to implement it.
+ */

This comment is totally redundant at best, and misleading at worst. Of
course we have a non-fmgr interface - it's needed to make abbreviation
work. And of course *any* opclass that performs abbreviation won't get
any great saving from cases where only the fmgr interface is elided
(e.g. numeric is not the leading sorttuple attribute). Text is unusual
in that there is a small additional saving over fmgr-elision for
obscure reasons.

Ditto for comments like this, which re-appear in V3:

+/*
+ * Ordinary (non-sortsupport) comparisons follow.
+ */

Your V3 has obsolete comments here:

+ nss = palloc(sizeof(NumericSortSupport));
+
+ /*
+ * palloc a buffer for handling unaligned packed values in addition to
+ * the support struct
+ */
+ nss->buf = palloc(VARATT_SHORT_MAX + VARHDRSZ + 1);

I still don't think you should be referencing the text opclass behavior here:

+ * number.  We make no attempt to estimate the cardinality of the real values,
+ * since it plays no part in the cost model here (if the abbreviation is equal,
+ * the cost of comparing equal and unequal underlying values is comparable).
+ * We discontinue even checking for abort (saving us the hashing overhead) if
+ * the estimated cardinality gets to 100k; that would be enough to support many
+ * billions of rows while doing no worse than breaking even.

This is dubious:

+#if DEC_DIGITS != 4
+#error "Numeric bases other than 10000 are no longer supported"
+#endif

Because there is a bunch of code within numeric.c that deals with the
DEC_DIGITS != 4 case. For example, this code has been within numeric.c
forever:

#if DEC_DIGITS == 4 || DEC_DIGITS == 2
static NumericDigit const_ten_data[1] = {10};
static NumericVar const_ten =
{1, 0, NUMERIC_POS, 0, NULL, const_ten_data};
#elif DEC_DIGITS == 1
static NumericDigit const_ten_data[1] = {1};
static NumericVar const_ten =
{1, 1, NUMERIC_POS, 0, NULL, const_ten_data};
#endif

As has this:

while (ndigits-- > 0)
{
#if DEC_DIGITS == 4
    *digits++ = ((decdigits[i] * 10 + decdigits[i + 1]) * 10 +
    decdigits[i + 2]) * 10 + decdigits[i + 3];
#elif DEC_DIGITS == 2
    *digits++ = decdigits[i] * 10 + decdigits[i + 1];
#elif DEC_DIGITS == 1
    *digits++ = decdigits[i];
#else
#error unsupported NBASE
#endif
    i += DEC_DIGITS;
}

I tend to think that when Tom wrote this code back in 2003, he thought
it might be useful to change DEC_DIGITS on certain builds. And so, we
ought to continue to support it to the extent that we already do,
allowing these cases to opt out of abbreviation in an informed manner
(since it seems hard to make abbreviation work with DEC_DIGITS != 4
builds). In any case, you should have deleted all this code (which
there is rather a lot of) in proposing to not support DEC_DIGITS != 4
builds generally.

Attached revision, V4, incorporates some of your V3 changes. As I
mentioned, I changed my mind on the counting of non-NULL values, which
V4 reflects...but there are also comments that now make it clear why
that might be useful.

I've retained your new allocate once approach to buffer sizing, which
seems sound if a little obscure.

This abort function code (from your V1 + V3) seems misleading:

+ if (memtupcount < 10000 || nss->input_count < 10000 || !nss->estimating)
+     return false;

It's impossible for "memtupcount < 10000" while "nss->input_count <
10000" as well, since the latter counts a subset of what the former
counts. So I only check the latter now.

I have not added back 32-bit support, which, IMV isn't worth it at
this time - it is, after all, almost a fully independent abbreviation
scheme to the 64-bit scheme. Right now we ought to be cutting scope,
or 9.5 will never reach feature freeze. I have already spent
considerably more time on this patch than I had intended to. I need to
catch a flight to New York at the crack of dawn tomorrow, to get to
pgConf US, and I have much work to do on my UPSERT patch, so I've
simply run out of time. If the committer that picks this up wants to
look at 32-bit support, that may make sense. I think that given the
current lack of cycles from everyone, we'll be doing well to even get
64-bit numeric abbreviation support in 9.5. I ask that you have a some
perspective on cutting 32-bit support, Andrew. In any case, I've
personally run out of time for this for 9.5.

We may add a patch to change things so that a GUC controls
abbreviation debug output, and we may add a patch that standardizes
that INT64_MIN and INT64_MAX are available everywhere. But unless and
until those other patches are committed, I see no reason to assume
that they will be, and so those aspects remain unchanged from my V2.
Unless there is strong opposition, or bugs come to light, or the
GUC/macros are added by independent commits to the master branch, I
expect to mark this revision "Ready for Committer" shortly.

Thanks
-- 
Peter Geoghegan

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric (was: Re: B-Tree support function number 3 (strxfrm() optimization))

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Peter Geoghegan <pg@heroku.com>
Дата:

Re: Abbreviated keys for Numeric

От:
Robert Haas <robertmhaas@gmail.com>
Дата:
FAQ