Обсуждение: Latest on CITEXT 2.0

Поиск
Список
Период
Сортировка

Latest on CITEXT 2.0

От
"David E. Wheeler"
Дата:
Howdy,

I just wanted to report the latest on my pet project: implementing a  
new case-insensitive text type, "citext", to be locale-aware and to  
build and run on PostgreSQL 8.3. I'm not much of a C programmer (this  
is only the second time I've written *anything* in C), so I also have  
a few questions about my code, best practices, coverage, etc. You can  
grab the latest here:
  https://svn.kineticode.com/citext/trunk/

BTW, the tests in sql/citext.sql use the pgtap.sql file to run TAP  
regression tests. So you can run them using `make installcheck` or  
`make test`. The latter requires that pg_prove be installed; you can  
get it here:
  https://svn.kineticode.com/pgtap/trunk/

Anyway, I think I've got it pretty close to done. The tests cover a  
lot of stuff -- nearly everything I could figure out, anyway. But  
there are a few gaps.

As a result, I'd appreciate a little help with these questions, all in  
the name of making this a solid data type suitable for use on  
production systems:

* There seem to still be some implicit CASTS to text that I'd like to  
duplicate. For example,  select '192.168.1.2'::cidr::text;` works, but  
`select '192.168.1.2'::cidr::citext;` does not. Where can I find the C  
functions that do these casts for TEXT so that I can put them to work  
for citext, too? The internal cast functions used in the old citext  
distribution don't exist at all on 8.3.

* There are casts from text that I'd also like to harness for use by  
citext, like `cidr(text)`. Where can I find these C functions as well?  
(The upshot of this and the previous points is that I'd like citext to  
be as compatible with TEXT as possible, and I just need to figure out  
how to fill in the gaps in that compatibility.)

* Regular expression and LIKE comparisons using the the operators  
properly work case-insensitively, but functions like replace() and  
regexp_replace() do not. Should they? and if so, how can I make them  
do so?

* The tests assume that LC_COLLATE is set to en_US.UTF-8. Does that  
work well for standard PostgreSQL regression tests? How are locale- 
sensitive tests run in core regression tests?

* As for my C programming, well, what's broken? I'm especially  
concerned that I pfree variables appropriately, but I'm not at all  
clear on what needs to be freed. Martijn mentioned before that btree  
comparison functions free memory, but I'm such a C n00b that I don't  
know what that actually means for my implementation. I'd actually  
appreciate a bit of pedantry here. :-)

* Am I in fact getting an appropriate nul-terminated string in my  
cilower() function using this code?
    char * str  = DatumGetCString(        DirectFunctionCall1( textout, PointerGetDatum( arg ) )    );

Those are all the questions I had about my implementation. I'd like to  
get this thing done and released soon, so that I can be done with this  
particular Yak and get back to what I'm *supposed* to be doing with my  
time.

BTW, would there be any interest in this code going into contrib/ in  
the distribution? I think that, if we can ensure that it works just  
like LOWER() = LOWER(), but without requiring that code, then it would  
be a great type to point people to to use instead of that SQL hack  
(with all the usual caveats about it being locale-sensitive and not  
canonically case-insensitive in the Unicode sense). If so, I'd be  
happy to make whatever changes are necessary to make it fit in with  
the coding and organization standards of the core and to submit it.

But please, don't expect a civarchar type from me anytime soon. ;-)

Many thanks,

David


Re: Latest on CITEXT 2.0

От
Martijn van Oosterhout
Дата:
On Wed, Jun 25, 2008 at 11:47:39AM -0700, David E. Wheeler wrote:
> * There seem to still be some implicit CASTS to text that I'd like to
> duplicate. For example,  select '192.168.1.2'::cidr::text;` works, but
> `select '192.168.1.2'::cidr::citext;` does not. Where can I find the C
> functions that do these casts for TEXT so that I can put them to work
> for citext, too? The internal cast functions used in the old citext
> distribution don't exist at all on 8.3.

Hmm, casts to/from text are somewhat "magic" in postgres. They are
implemented by calling the usual type input/output function. I have no
idea how to extend that to other types.

> * There are casts from text that I'd also like to harness for use by
> citext, like `cidr(text)`. Where can I find these C functions as well?
> (The upshot of this and the previous points is that I'd like citext to
> be as compatible with TEXT as possible, and I just need to figure out
> how to fill in the gaps in that compatibility.)

As above, they're probably not as seperate functions but a special hack
inthe casting code.

> * Regular expression and LIKE comparisons using the the operators
> properly work case-insensitively, but functions like replace() and
> regexp_replace() do not. Should they? and if so, how can I make them
> do so?

Regexes have case-insensetive modifiers, don't they? In which case I
don't think it'd be becessary.

> * As for my C programming, well, what's broken? I'm especially
> concerned that I pfree variables appropriately, but I'm not at all
> clear on what needs to be freed. Martijn mentioned before that btree
> comparison functions free memory, but I'm such a C n00b that I don't
> know what that actually means for my implementation. I'd actually
> appreciate a bit of pedantry here. :-)

When creating an index, your comparison functions are going ot be
called O(N log N) times. If they leak into a context that isn't
regularly freed you may have a problem. I'd suggest loking at how the
text comparisons do it. PG_FREE_IF_COPY() is probably a good idea
because the incoming tuples may be detoasted.

> * Am I in fact getting an appropriate nul-terminated string in my
> cilower() function using this code?
>
>     char * str  = DatumGetCString(
>         DirectFunctionCall1( textout, PointerGetDatum( arg ) )
>     );

Yes.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Latest on CITEXT 2.0

От
"David E. Wheeler"
Дата:
On Jun 26, 2008, at 03:28, Martijn van Oosterhout wrote:

> Hmm, casts to/from text are somewhat "magic" in postgres. They are
> implemented by calling the usual type input/output function. I have no
> idea how to extend that to other types.

Oh. Okay. Perhaps I won't worry about it just now, then.

> As above, they're probably not as seperate functions but a special  
> hack
> inthe casting code.

Okay.

> Regexes have case-insensetive modifiers, don't they? In which case I
> don't think it'd be becessary.

They do, but replace(), split_part(), strpos(), and translate() do not.

> When creating an index, your comparison functions are going ot be
> called O(N log N) times. If they leak into a context that isn't
> regularly freed you may have a problem. I'd suggest loking at how the
> text comparisons do it. PG_FREE_IF_COPY() is probably a good idea
> because the incoming tuples may be detoasted.

Okay. I'll have a look at varlena.c, then.

>> * Am I in fact getting an appropriate nul-terminated string in my
>> cilower() function using this code?
>>
>>    char * str  = DatumGetCString(
>>        DirectFunctionCall1( textout, PointerGetDatum( arg ) )
>>    );
>
> Yes.

Great, I thought so (since it made the failures go away). Many thanks.

David



Re: Latest on CITEXT 2.0

От
"David E. Wheeler"
Дата:
Thanks a million for your answers, Martijn. I just have some more
stupid questions, if you could bear with me.

On Jun 26, 2008, at 03:28, Martijn van Oosterhout wrote:

> When creating an index, your comparison functions are going ot be
> called O(N log N) times. If they leak into a context that isn't
> regularly freed you may have a problem. I'd suggest loking at how the
> text comparisons do it. PG_FREE_IF_COPY() is probably a good idea
> because the incoming tuples may be detoasted.

Okay, I see that text_cmp in varlena doesn't use PG_FREE_IF_COPY(),
and neither do text_smaller nor text_larger (which just dispatch to
text_cmp anyway).

The operator functions *do* use PG_FREE_IF_COPY(). So I'm guessing
it's these functions you're talking about. However, my implementation
just looks like this:

Datum citext_ne (PG_FUNCTION_ARGS) {    // Fast path for different-length inputs. Okay for canonical
equivalence?    if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1)))        PG_RETURN_BOOL( 1 );
PG_RETURN_BOOL(citextcmp( PG_ARGS ) != 0 ); 
}

I don't *thinkI any variables are copied there. citextcmp() is just
this:

int citextcmp (PG_FUNCTION_ARGS) {    // XXX These are all just references to existing structures, right?    text *
left = PG_GETARG_TEXT_P(0);    text * right = PG_GETARG_TEXT_P(1);    return varstr_cmp(        cilower( left ),
VARSIZE_ANY_EXHDR(left),       cilower( right ),        VARSIZE_ANY_EXHDR(right)    ); 
}

Again, no copying. cilower() does copy:
    int    index, len;    char * result;
    index  = 0;    len    = VARSIZE(arg) - VARHDRSZ;    result = (char *) palloc( strlen( str ) + 1 );
    for (index = 0; index <= len; index++) {        result[index] = tolower((unsigned char) str[index] );    }    //
XXXI don't need to pfree result if I'm returning it, right?    return result; 

But the copied value is returned. Hrm…it should probably be pfreed
somewhere, yes?

So I'm wondering if I should change citextcmp to pfree values?
Something like this:
    text * left  = PG_GETARG_TEXT_P(0);    text * right = PG_GETARG_TEXT_P(1);    char * lcstr = cilower( left  );
char* rcstr = cilower( right ); 
    int result = varstr_cmp(        cilower( left ),        VARSIZE_ANY_EXHDR(left),        cilower( right ),
VARSIZE_ANY_EXHDR(right)   ); 
    pfree( lcstr );    pfree( rcstr );    return result;

This is the only function that calls cilower(). And I *think* it's the
only place where values are copied or memory is allocated needing to
be freed. Does that sound right to you?

On a side note, I've implemented this pretty differently from how the
text functions are implemented in varlena.c, just to try to keep
things succinct. But I'm wondering now if I shouldn't switch back to
the style used by varlena.c, if only to keep the style the same, and
thus perhaps to increase the chances that citext would be a welcome
contrib addition. Thoughts?

Many thanks again. You're a great help to this C n00b.

Best,

David

Re: Latest on CITEXT 2.0

От
Alvaro Herrera
Дата:
David E. Wheeler wrote:

> The operator functions *do* use PG_FREE_IF_COPY(). So I'm guessing it's 
> these functions you're talking about. However, my implementation just 
> looks like this:
>
> Datum citext_ne (PG_FUNCTION_ARGS) {
>     // Fast path for different-length inputs. Okay for canonical  
> equivalence?
>     if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1)))
>         PG_RETURN_BOOL( 1 );
>     PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 );
> }

PG_GETARG_TEXT_P can detoast the datum, which creates a copy.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Latest on CITEXT 2.0

От
"David E. Wheeler"
Дата:
On Jun 26, 2008, at 09:19, Alvaro Herrera wrote:

> PG_GETARG_TEXT_P can detoast the datum, which creates a copy.

Thanks. I've just completely refactored things to look more like the  
approach taken by varlena.c, both in terms of when stuff gets freed  
and in terms of coding style. It's more verbose, but I feel much more  
comfortable with memory management now that I'm following a known  
implementation more closely. :-)

So now I've changed citextcmp to this:

static int
citextcmp (text * left, text * right)
{    char * lcstr, * rcstr;    int    result;
    lcstr = cilower( left  );    rcstr = cilower( right );
    result = varstr_cmp(        cilower( left ),        VARSIZE_ANY_EXHDR(left),        cilower( right ),
VARSIZE_ANY_EXHDR(right)   );
 
    pfree( lcstr );    pfree( rcstr );    return result;
}

And now all of the operator functions are freeing memory using  
PG_FREE_IF_COPY() like this:

Datum
citext_cmp(PG_FUNCTION_ARGS)
{text * left  = PG_GETARG_TEXT_PP(0);text * right = PG_GETARG_TEXT_PP(1);int32  result;
result = citextcmp(left, right);
PG_FREE_IF_COPY(left, 0);PG_FREE_IF_COPY(right, 1);
        PG_RETURN_INT32( result );
}


The only functions that don't do that are citext_smaller() and  
citext_larger():

Datum
citext_smaller(PG_FUNCTION_ARGS)
{text * left  = PG_GETARG_TEXT_PP(0);text * right = PG_GETARG_TEXT_PP(1);text * result;
result = citextcmp(left, right) < 0 ? left : right;PG_RETURN_TEXT_P(result);
}

This is just how varlena.c does it, but I am wondering if something  
*should* be freed there.

Thanks a bunch!

Best,

David



Re: Latest on CITEXT 2.0

От
Tom Lane
Дата:
"David E. Wheeler" <david@kineticode.com> writes:
> Datum citext_ne (PG_FUNCTION_ARGS) {
>      // Fast path for different-length inputs. Okay for canonical  
> equivalence?
>      if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1)))
>          PG_RETURN_BOOL( 1 );
>      PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 );
> }

BTW, I don't think you can use that same-length optimization for
citext.  There's no reason to think that upper/lowercase pairs will
have the same length all the time in multibyte encodings.
        regards, tom lane


Re: Latest on CITEXT 2.0

От
"David E. Wheeler"
Дата:
On Jun 26, 2008, at 10:02, Tom Lane wrote:

> BTW, I don't think you can use that same-length optimization for
> citext.  There's no reason to think that upper/lowercase pairs will
> have the same length all the time in multibyte encodings.

I was wondering about that. I had been thinking of canonically-
equivalent stings and combining marks. Doing a quick test it looks
like combining marks are not equivalent. For example, this returns
false:

   SELECT 'Ä'::text = 'Ä'::text;

At least with en_US.UTF-8. Hrm. It looks like my client makes them
both canonical, so I've attached a script demonstrating this issue.

Anyway, I was aware of different byte counts for canonical
equivalence, but not for differences between upper- and lowercase
characters. I'd certainly defer to your knowledge of how these things
truly work in PostgreSQL, Tom, and can of course easily remove that
optimization. So, are your certain about this?

Many thanks,

David



Вложения

Re: Latest on CITEXT 2.0

От
Josh Berkus
Дата:
David,

> Thanks. I've just completely refactored things to look more like the
> approach taken by varlena.c, both in terms of when stuff gets freed
> and in terms of coding style. It's more verbose, but I feel much more
> comfortable with memory management now that I'm following a known
> implementation more closely. :-)

Will this be ready for the July CommitFest?

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Latest on CITEXT 2.0

От
"David E. Wheeler"
Дата:
On Jun 26, 2008, at 12:03, Josh Berkus wrote:

> Will this be ready for the July CommitFest?

When is it due, July 1? If so, yes, it should be. I could use a close  
review by someone who knows this shit a whole lot better than I do.

Thanks,

David



Re: Latest on CITEXT 2.0

От
Josh Berkus
Дата:
David,

> When is it due, July 1? If so, yes, it should be. I could use a close
> review by someone who knows this shit a whole lot better than I do.

Well, that's what the commitfest is for.  Go ahead and add yourself once you 
post the new patch.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Latest on CITEXT 2.0

От
Tom Lane
Дата:
"David E. Wheeler" <david@kineticode.com> writes:
> So, are your certain about this?

See Turkish --- in that locale i and I are not an upper/lower pair,
instead they pair with some non-ASCII letters.  There are likely
other cases but that's the counterexample I remember.
        regards, tom lane


Re: Latest on CITEXT 2.0

От
"David E. Wheeler"
Дата:
On Jun 26, 2008, at 13:59, Tom Lane wrote:

> "David E. Wheeler" <david@kineticode.com> writes:
>> So, are your certain about this?
>
> See Turkish --- in that locale i and I are not an upper/lower pair,
> instead they pair with some non-ASCII letters.  There are likely
> other cases but that's the counterexample I remember.

Perfect, thank you. I was able to add a failing test and make it pass  
by removing the string length optimization.

For future reference of anyone reading the list, this page has a great  
description of the problem:
  http://www.i18nguy.com/unicode/turkish-i18n.html

Best,

David


Re: Latest on CITEXT 2.0

От
"Marko Kreen"
Дата:
On 6/26/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "David E. Wheeler" <david@kineticode.com> writes:
>  > Datum citext_ne (PG_FUNCTION_ARGS) {
>  >      // Fast path for different-length inputs. Okay for canonical
>  > equivalence?
>  >      if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1)))
>  >          PG_RETURN_BOOL( 1 );
>  >      PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 );
>  > }
>
> BTW, I don't think you can use that same-length optimization for
>  citext.  There's no reason to think that upper/lowercase pairs will
>  have the same length all the time in multibyte encodings.

What about this code in current str_tolower():
       /* Output workspace cannot have more codes than input bytes */       workspace = (wchar_t *) palloc((nbytes + 1)
*sizeof(wchar_t));
 

Bug?

-- 
marko


Re: Latest on CITEXT 2.0

От
Tom Lane
Дата:
"Marko Kreen" <markokr@gmail.com> writes:
> On 6/26/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, I don't think you can use that same-length optimization for
>> citext.  There's no reason to think that upper/lowercase pairs will
>> have the same length all the time in multibyte encodings.

> What about this code in current str_tolower():

>         /* Output workspace cannot have more codes than input bytes */
>         workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));

That's working with wchars, not bytes.
        regards, tom lane


Re: Latest on CITEXT 2.0

От
"Marko Kreen"
Дата:
On 7/1/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Marko Kreen" <markokr@gmail.com> writes:
>  > On 6/26/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> >> BTW, I don't think you can use that same-length optimization for
>  >> citext.  There's no reason to think that upper/lowercase pairs will
>  >> have the same length all the time in multibyte encodings.
>
>  > What about this code in current str_tolower():
>
>  >         /* Output workspace cannot have more codes than input bytes */
>  >         workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));
>
>
> That's working with wchars, not bytes.

Ah, I missed the point of char2wchar() line.

I'm rather unfamiliar with various MB API-s, sorry.

There's another thing I'm probably missing: does current code handle
multi-wchar codepoints?  Or is it guaranteed they don't happen?
(Wasn't wchar_t usually 16bit value?)

-- 
marko


Re: Latest on CITEXT 2.0

От
Bruce Momjian
Дата:
Marko Kreen wrote:
> On 7/1/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > "Marko Kreen" <markokr@gmail.com> writes:
> >  > On 6/26/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > >> BTW, I don't think you can use that same-length optimization for
> >  >> citext.  There's no reason to think that upper/lowercase pairs will
> >  >> have the same length all the time in multibyte encodings.
> >
> >  > What about this code in current str_tolower():
> >
> >  >         /* Output workspace cannot have more codes than input bytes */
> >  >         workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));
> >
> >
> > That's working with wchars, not bytes.
> 
> Ah, I missed the point of char2wchar() line.
> 
> I'm rather unfamiliar with various MB API-s, sorry.
> 
> There's another thing I'm probably missing: does current code handle
> multi-wchar codepoints?  Or is it guaranteed they don't happen?
> (Wasn't wchar_t usually 16bit value?)

If you want a simple example of wide character use look at
oracle_compat.c::upper() which calls str_toupper() in CVS HEAD.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Latest on CITEXT 2.0

От
"Marko Kreen"
Дата:
On 7/1/08, Bruce Momjian <bruce@momjian.us> wrote:
> If you want a simple example of wide character use look at
>  oracle_compat.c::upper() which calls str_toupper() in CVS HEAD.

ATM I'm looking at str_tolower/upper internal implementation.

They do:
 workspace[curr_char] = towlower(workspace[curr_char]);

where workspace is wchar_t but towlower() operates on wint_t.

Is such inconsistency ok?

-- 
marko


Re: Latest on CITEXT 2.0

От
Tom Lane
Дата:
"Marko Kreen" <markokr@gmail.com> writes:
> ATM I'm looking at str_tolower/upper internal implementation.
> They do:
>   workspace[curr_char] = towlower(workspace[curr_char]);
> where workspace is wchar_t but towlower() operates on wint_t.

IIRC this is exactly comparable to the type situation for the
traditional <ctype.h> macros.  The reason is that they are defined
to accept EOF in addition to actual char (or wchar) values.
        regards, tom lane


Re: Latest on CITEXT 2.0

От
Tom Lane
Дата:
"Marko Kreen" <markokr@gmail.com> writes:
> There's another thing I'm probably missing: does current code handle
> multi-wchar codepoints?  Or is it guaranteed they don't happen?

AFAIK we disallow multi-wchar situations (by rejecting the UTF8
combining codes).

> (Wasn't wchar_t usually 16bit value?)

Hmm.  It's unsigned int on my ancient HPUX box.  I think we could have a
problem on any machines whose mbstowcs doesn't support 4-byte UTF8
codes, though ... in particular, what about Windows?
        regards, tom lane


Re: Latest on CITEXT 2.0

От
"Marko Kreen"
Дата:
On 7/1/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Marko Kreen" <markokr@gmail.com> writes:
> > ATM I'm looking at str_tolower/upper internal implementation.
>  > They do:
>  >   workspace[curr_char] = towlower(workspace[curr_char]);
>  > where workspace is wchar_t but towlower() operates on wint_t.
>
> IIRC this is exactly comparable to the type situation for the
>  traditional <ctype.h> macros.  The reason is that they are defined
>  to accept EOF in addition to actual char (or wchar) values.

I read SUS v3 and there is no hint on multi-wchar anything,
so for unix systems you are right, wint_t == wchar_t.

Seems stories how Windows and Java operate have affected me too much.

Then I browsed MSDN:
 http://msdn.microsoft.com/en-us/library/dtxesf6k.aspx

and they seem to strongly hint that wchar_t == 16 bits and
UTF-16 is used internally.

Probably some Windows developer should look into it
and decide if there is a #ifdef WIN32 branch needed.

-- 
marko