Обсуждение: how is text-equality handled in postgresql?
hi, when i create an unique-constraint on a varchar field, how exactly does postgresql compare the texts? i'm asking because in UNICODE there are a lot of complexities about this.. or in other words, when are two varchars equal in postgres? when their bytes are? or some algorithm is applied? thanks, gabor
On 15/01/2014 10:10, Gábor Farkas wrote: > hi, > > when i create an unique-constraint on a varchar field, how exactly > does postgresql compare the texts? i'm asking because in UNICODE there > are a lot of complexities about this.. > > or in other words, when are two varchars equal in postgres? when their > bytes are? or some algorithm is applied? By default, it is "whatever the operating system thinks it's right". PostgreSQL doesn't have its own collation code, it uses the OS's locale support for this. (which breaks on certain systems which don't have complete UTF-8 support - I'm in favour of importing ICU at least as an optional dependancy, similar to what the FreeBSD's patch does: http://people.freebsd.org/~girgen/postgresql-icu/).
Вложения
On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras <ivoras@freebsd.org> wrote: > On 15/01/2014 10:10, Gábor Farkas wrote: >> hi, >> >> when i create an unique-constraint on a varchar field, how exactly >> does postgresql compare the texts? i'm asking because in UNICODE there >> are a lot of complexities about this.. >> >> or in other words, when are two varchars equal in postgres? when their >> bytes are? or some algorithm is applied? > > By default, it is "whatever the operating system thinks it's right". > PostgreSQL doesn't have its own collation code, it uses the OS's locale > support for this. > Just to add to this, whenever strcoll() (a locale aware comparator) says two strings are equal, postgres re-compares them using strcmp(). See following code snippet off src/backend/utils/adt/varlena.c:varstr_cmp() - #ifdef HAVE_LOCALE_T if (mylocale) result = strcoll_l(a1p, a2p, mylocale); else #endif result = strcoll(a1p, a2p); /* * In some locales strcoll() can claim that nonidentical strings are * equal. Believing that would be bad news for a number of reasons, * so we follow Perl's lead and sort "equal" strings according to * strcmp(). */ if (result == 0) result = strcmp(a1p, a2p); -- Amit Langote
On 15/01/2014 12:36, Amit Langote wrote: > On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras <ivoras@freebsd.org> wrote: >> On 15/01/2014 10:10, Gábor Farkas wrote: >>> hi, >>> >>> when i create an unique-constraint on a varchar field, how exactly >>> does postgresql compare the texts? i'm asking because in UNICODE there >>> are a lot of complexities about this.. >>> >>> or in other words, when are two varchars equal in postgres? when their >>> bytes are? or some algorithm is applied? >> >> By default, it is "whatever the operating system thinks it's right". >> PostgreSQL doesn't have its own collation code, it uses the OS's locale >> support for this. >> > > Just to add to this, whenever strcoll() (a locale aware comparator) > says two strings are equal, postgres re-compares them using strcmp(). > See following code snippet off > src/backend/utils/adt/varlena.c:varstr_cmp() - > /* > * In some locales strcoll() can claim that > nonidentical strings are > * equal. Believing that would be bad news for a > number of reasons, > * so we follow Perl's lead and sort "equal" strings > according to > * strcmp(). > */ > if (result == 0) > result = strcmp(a1p, a2p); That seems odd and inefficient. Why would it be necessary? I would think indexing (and other collation-sensitive operations) don't care what the actual collation result is for arbitrary blobs of strings, as long as they are stable?
Вложения
On Wed, Jan 15, 2014 at 9:02 PM, Ivan Voras <ivoras@freebsd.org> wrote: > On 15/01/2014 12:36, Amit Langote wrote: >> On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras <ivoras@freebsd.org> wrote: >>> On 15/01/2014 10:10, Gábor Farkas wrote: >>>> hi, >>>> >>>> when i create an unique-constraint on a varchar field, how exactly >>>> does postgresql compare the texts? i'm asking because in UNICODE there >>>> are a lot of complexities about this.. >>>> >>>> or in other words, when are two varchars equal in postgres? when their >>>> bytes are? or some algorithm is applied? >>> >>> By default, it is "whatever the operating system thinks it's right". >>> PostgreSQL doesn't have its own collation code, it uses the OS's locale >>> support for this. >>> >> >> Just to add to this, whenever strcoll() (a locale aware comparator) >> says two strings are equal, postgres re-compares them using strcmp(). >> See following code snippet off >> src/backend/utils/adt/varlena.c:varstr_cmp() - > >> /* >> * In some locales strcoll() can claim that >> nonidentical strings are >> * equal. Believing that would be bad news for a >> number of reasons, >> * so we follow Perl's lead and sort "equal" strings >> according to >> * strcmp(). >> */ >> if (result == 0) >> result = strcmp(a1p, a2p); > > That seems odd and inefficient. Why would it be necessary? I would think > indexing (and other collation-sensitive operations) don't care what the > actual collation result is for arbitrary blobs of strings, as long as > they are stable? > This is the behavior since quite some time introduced by this commit http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad -- Amit Langote
On 15/01/2014 13:29, Amit Langote wrote: > On Wed, Jan 15, 2014 at 9:02 PM, Ivan Voras <ivoras@freebsd.org> wrote: >> On 15/01/2014 12:36, Amit Langote wrote: >>> * In some locales strcoll() can claim that >>> nonidentical strings are >>> * equal. Believing that would be bad news for a >>> number of reasons, >>> * so we follow Perl's lead and sort "equal" strings >>> according to >>> * strcmp(). >>> */ >>> if (result == 0) >>> result = strcmp(a1p, a2p); >> >> That seems odd and inefficient. Why would it be necessary? I would think >> indexing (and other collation-sensitive operations) don't care what the >> actual collation result is for arbitrary blobs of strings, as long as >> they are stable? >> > > This is the behavior since quite some time introduced by this commit > > http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad Ok, the commit comment is: "Adjust string comparison so that only bitwise-equal strings are considered equal: if strcoll claims two strings are equal, check it with strcmp, and sort according to strcmp if not identical. This fixes inconsistent behavior under glibc's hu_HU locale, and probably under some other locales as well. Also, take advantage of the now-well-defined behavior to speed up texteq, textne, bpchareq, bpcharne: they may as well just do a bitwise comparison and not bother with strcoll at all." ... so it's just another workaround for OS specific locale issues - to me it looks like just another reason to use ICU.
Вложения
Ivan Voras <ivoras@freebsd.org> writes: > On 15/01/2014 12:36, Amit Langote wrote: >> Just to add to this, whenever strcoll() (a locale aware comparator) >> says two strings are equal, postgres re-compares them using strcmp(). > That seems odd and inefficient. Why would it be necessary? I would think > indexing (and other collation-sensitive operations) don't care what the > actual collation result is for arbitrary blobs of strings, as long as > they are stable? If we didn't do it like this, we could not use hashing techniques for text --- at least not unless we could find a hash function guaranteed to yield the same values for any two strings that strcoll() claims are equal. regards, tom lane
On Wed, Jan 15, 2014 at 4:10 AM, Gábor Farkas <gabor.farkas@gmail.com> wrote:
or in other words, when are two varchars equal in postgres? when their
bytes are? or some algorithm is applied?
On this topic, when I write my strings to the DB and search from the DB, should I canonicalize them first as NKFC (or some other), or just let the DB figure it out? In my specific case I use perl DBI with place holders to submit my queries.