Обсуждение: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

Поиск
Список
Период
Сортировка

Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

От
Peter Geoghegan
Дата:
We lack SortSupport for many character-like-type cases. In full, the
cases within the core system are:

* char(n) opfamily (bpchar_ops).

* text_pattern_ops opfamily (includes text and varchar "pattern"
opclasses, which are generally recommended for accelerating LIKE
operator queries).

* bpchar_pattern_ops -- the operator family/class for char(n), used
where "pattern" style indexing is required for the char(n) type.

* bytea default opclass. This is a type that, like the others, shares
its representation with text (a varlena header and some data bytes --
a string, essentially). Its comparator already behaves identically to
that of the text comparator when the "C" collation is used. I've
actually seen a specific request for this [1].

These cases do matter. For one thing, even if they're less important
than the default text/varchar opclasses, having such large
inconsistencies in character type sort performance is a fairly major
POLA violation; opclasses like text_pattern_ops are *supposed* to be
faster though less capable alternatives to the defaults. For another,
char(n) is in the SQL standard, which may be why all TPC benchmarks
use char(n) for columns that are sorted on or used for grouping.
char(n) sorting can be made about 8x faster with
SortSupport/abbreviation, making it the best candidate for
abbreviation optimization that I've ever seen. It would be regrettable
if we accidentally lost a benchmark due to not having char(n)
SortSupport.

Attached patch adds SortSupport for all of the cases listed above.

I did not bother doing anything with contrib/citext, because I think
it's not worth it. I think that we should definitely invest in case
insensitive collations, and retire contrib/citext. The fact that the
*default* collation (and not the input collation) is passed by
citextcmp()'s call to str_tolower() suggests to me that the only way
to make citext do the right thing is to basically introduce case
insensitive collations to Postgres. Of course, doing so would make
contrib/citext obsolete.

I also didn't bother extending the name type's SortSupport (something
that has existed since 9.2) to perform abbreviation, although that
wouldn't be very hard. I saw no point.

I think I might also get around to adding abbreviated key support for
the network address types during the 9.6 cycle. That seems like
something a less experienced contributor could easily pick up, though
-- if anyone wants to take it off my hands, please do so.

[1] https://lwn.net/Articles/653721/
--
Peter Geoghegan

Вложения

Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

От
Peter Geoghegan
Дата:
On Thu, Nov 12, 2015 at 4:38 PM, Peter Geoghegan <pg@heroku.com> wrote:
> * bytea default opclass. This is a type that, like the others, shares
> its representation with text (a varlena header and some data bytes --
> a string, essentially). Its comparator already behaves identically to
> that of the text comparator when the "C" collation is used. I've
> actually seen a specific request for this [1].

I probably should have given an explanation for why it's okay that NUL
bytes can exist in strings from the point of view of the generalized
SortSupport for text worker function. The next revision will have
comments along those lines.

-- 
Peter Geoghegan



Re: Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

От
Alvaro Herrera
Дата:
Peter Geoghegan wrote:
> On Thu, Nov 12, 2015 at 4:38 PM, Peter Geoghegan <pg@heroku.com> wrote:
> > * bytea default opclass. This is a type that, like the others, shares
> > its representation with text (a varlena header and some data bytes --
> > a string, essentially). Its comparator already behaves identically to
> > that of the text comparator when the "C" collation is used. I've
> > actually seen a specific request for this [1].
> 
> I probably should have given an explanation for why it's okay that NUL
> bytes can exist in strings from the point of view of the generalized
> SortSupport for text worker function. The next revision will have
> comments along those lines.

Did you ever post "the next revision"?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

От
Alvaro Herrera
Дата:
Peter Geoghegan wrote:
> We lack SortSupport for many character-like-type cases. In full, the
> cases within the core system are:

You're stealthily introducing a new abstraction called "string",
including a typedef and DatumGetString support macros.  Is that really
necessary?  Shouldn't it be discussed specifically?  I don't necessarily
oppose it as is, mainly because it's limited to within varlena.c for
now, but I'm not sure it'd okay to make it more public.

Also, there's a lot of churn in this patch to just remove tss to sss.
Can't we just keep the original variable name?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

От
Peter Geoghegan
Дата:
On Thu, Jan 7, 2016 at 7:41 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> You're stealthily introducing a new abstraction called "string",
> including a typedef and DatumGetString support macros.  Is that really
> necessary?  Shouldn't it be discussed specifically?  I don't necessarily
> oppose it as is, mainly because it's limited to within varlena.c for
> now, but I'm not sure it'd okay to make it more public.

Note that a similar abstraction for the "unknown" type also exists
only within varlena.c.  So, DatumGetStringP() and so on appear right
alongside DatumGetUnknownP() and so on.

The idea of the "string" abstraction is that is advertises that
certain functions could equally well apply to a variety of "varlena
header + some bytes" types. I thought about just using the varlena
type instead, but preferred the "string" abstraction.

> Also, there's a lot of churn in this patch to just remove tss to sss.
> Can't we just keep the original variable name?

I think that minimizing lines changed in a mechanical way by a commit
is overrated as a goal for Postgres patches, but I don't feel too
strongly about holding on to the "churn" in this patch. I attach a new
revision, which has the changes I outlined to code comments. I haven't
minimized the differences in the way you suggest just yet.

--
Peter Geoghegan

Вложения

Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

От
Andreas Karlsson
Дата:
Hi,

I have reviewed this now and I think this is a useful addition even 
though these indexes are less common. Consistent behavior is worth a lot 
in my mind and this patch is reasonably small.

The patch no longer applies due to 1) oid collisions and 2) a trivial 
conflict. When I fixed those two the test suite passed.

I tested this patch by building indexes with all the typess and got nice 
measurable speedups.

Logically I think the patch makes sense and the code seems to be 
correct, but I have some comments on it.

- You use two names a lot "string" vs "varstr". What is the difference 
between those? Is there any reason for not using varstr consistently?

- You have a lot of renaming as has been mentioned previously in this 
thread. I do not care strongly for it either way, but it did make it 
harder to spot what the patch changed. If it was my own project I would 
have considered splitting the patch into two parts, one renaming 
everything and another adding the new feature, but the PostgreSQL seem 
to often prefer having one commit per feature. Do as you wish here.

- I think the comment about NUL bytes in varstr_abbrev_convert makes it 
seem like the handling is much more complicated than it actually is. I 
did not understand it after a couple of readings and had to read the 
code understand what it was talking about.

Nice work, I like your sorting patches.

Andreas



Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

От
Robert Haas
Дата:
On Sun, Jan 31, 2016 at 10:59 PM, Andreas Karlsson <andreas@proxel.se> wrote:
> I have reviewed this now and I think this is a useful addition even though
> these indexes are less common. Consistent behavior is worth a lot in my mind
> and this patch is reasonably small.
>
> The patch no longer applies due to 1) oid collisions and 2) a trivial
> conflict. When I fixed those two the test suite passed.
>
> I tested this patch by building indexes with all the typess and got nice
> measurable speedups.
>
> Logically I think the patch makes sense and the code seems to be correct,
> but I have some comments on it.
>
> - You use two names a lot "string" vs "varstr". What is the difference
> between those? Is there any reason for not using varstr consistently?
>
> - You have a lot of renaming as has been mentioned previously in this
> thread. I do not care strongly for it either way, but it did make it harder
> to spot what the patch changed. If it was my own project I would have
> considered splitting the patch into two parts, one renaming everything and
> another adding the new feature, but the PostgreSQL seem to often prefer
> having one commit per feature. Do as you wish here.
>
> - I think the comment about NUL bytes in varstr_abbrev_convert makes it seem
> like the handling is much more complicated than it actually is. I did not
> understand it after a couple of readings and had to read the code understand
> what it was talking about.
>
> Nice work, I like your sorting patches.

Thanks for the review.  I fixed the OID conflict, tweaked a few
comments, and committed this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

От
Peter Geoghegan
Дата:
On Sun, Jan 31, 2016 at 7:59 PM, Andreas Karlsson <andreas@proxel.se> wrote:
> Nice work, I like your sorting patches.

Thanks. I like your reviews of my sorting patches. :-)

-- 
Peter Geoghegan



Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

От
Peter Geoghegan
Дата:
On Wed, Feb 3, 2016 at 11:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Thanks for the review.  I fixed the OID conflict, tweaked a few
> comments, and committed this.

Thanks. I noticed a tiny, preexisting typo in the string abbreviated
key code. The attached patch fixes it.

--
Peter Geoghegan

Вложения

Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

От
Robert Haas
Дата:
On Fri, Feb 5, 2016 at 6:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Feb 3, 2016 at 11:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Thanks for the review.  I fixed the OID conflict, tweaked a few
>> comments, and committed this.
>
> Thanks. I noticed a tiny, preexisting typo in the string abbreviated
> key code. The attached patch fixes it.

Gosh, I must have looked at that line 10 times without seeing that
mistake.  Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company