Обсуждение: Backpatching of "Teach the regular expression functions to do case-insensitive matching"
Backpatching of "Teach the regular expression functions to do case-insensitive matching"
От
Andres Freund
Дата:
Hi, In my opinion this is actually a bug in < 9.0. As its a (imo) low impact fix thats constrained to two files it seems sensible to backpatch it now that the solution has proven itself in the field? The issue is hard to find and has come up several times in the field. And it has been slightly embarassing more than once ;) I am happy to provide patches for the supported releases < 9.0 if that helps the issue. Greetings, Andres
Andres Freund <andres@anarazel.de> writes: > In my opinion this is actually a bug in < 9.0. As its a (imo) low impact fix > thats constrained to two files it seems sensible to backpatch it now that the > solution has proven itself in the field? FWIW, I still don't trust that patch a lot (and I was the one who wrote it). The question is whether you believe that every platform uses Unicode code points directly as the wchar_t representation in UTF8-based locales. Although we've not heard any complaints suggesting that that assumption doesn't hold, I don't feel that 9.0 has been out long enough to prove much. The complaints about the old code were rare enough to make me think people just don't exercise the case too much, or don't notice if the behavior is wrong. So it likely hasn't had that much exercise in 9.0 either. In short, I'm pretty wary of back-patching it. regards, tom lane
On Thu, May 5, 2011 at 5:21 AM, Andres Freund <andres@anarazel.de> wrote: > In my opinion this is actually a bug in < 9.0. As its a (imo) low impact fix > thats constrained to two files it seems sensible to backpatch it now that the > solution has proven itself in the field? > > The issue is hard to find and has come up several times in the field. And it has > been slightly embarassing more than once ;) Can you share some more details about your experiences? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Backpatching of "Teach the regular expression functions to do case-insensitive matching"
От
Andres Freund
Дата:
On Friday, May 06, 2011 04:30:01 AM Robert Haas wrote: > On Thu, May 5, 2011 at 5:21 AM, Andres Freund <andres@anarazel.de> wrote: > > In my opinion this is actually a bug in < 9.0. As its a (imo) low impact > > fix thats constrained to two files it seems sensible to backpatch it now > > that the solution has proven itself in the field? > > The issue is hard to find and has come up several times in the field. And > > it has been slightly embarassing more than once ;) > Can you share some more details about your experiences? About the embarassing or hard to find part? One of the hard to find part parts involved a search (constraining word order after a tsearch search) where slightly fewer than usual search results were returned in production. Nobody had noticed during testing that case insensitive search worked for most things except multibyte chars as the tested case was something like: SELECT 'ÖFFENTLICHKEIT' ~* 'Öffentlichkeit' and the regex condition was only relevant when searching for multiple words. One of the emarassing examples was that I suggested moving away from a solution using several ILIKE rules to one case insenitive regular expression. Totally forgetting that I knew that this was only fixed in 9.0. This turned out to be faster. And it turned out to be wrong. In production :-(. Both sum up that the problem is often not noticed as most of the people realizing that that case could be a problem don't have a knowledge of the content and don't notice the problem until later... Andres
On Fri, May 6, 2011 at 9:22 AM, Andres Freund <andres@anarazel.de> wrote: > On Friday, May 06, 2011 04:30:01 AM Robert Haas wrote: >> On Thu, May 5, 2011 at 5:21 AM, Andres Freund <andres@anarazel.de> wrote: >> > In my opinion this is actually a bug in < 9.0. As its a (imo) low impact >> > fix thats constrained to two files it seems sensible to backpatch it now >> > that the solution has proven itself in the field? >> > The issue is hard to find and has come up several times in the field. And >> > it has been slightly embarassing more than once ;) >> Can you share some more details about your experiences? > About the embarassing or hard to find part? > > One of the hard to find part parts involved a search (constraining word order > after a tsearch search) where slightly fewer than usual search results were > returned in production. > Nobody had noticed during testing that case insensitive search worked for most > things except multibyte chars as the tested case was something like: SELECT > 'ÖFFENTLICHKEIT' ~* 'Öffentlichkeit' and the regex condition was only relevant > when searching for multiple words. > > One of the emarassing examples was that I suggested moving away from a > solution using several ILIKE rules to one case insenitive regular expression. > Totally forgetting that I knew that this was only fixed in 9.0. This turned out > to be faster. And it turned out to be wrong. In production :-(. > > > Both sum up that the problem is often not noticed as most of the people > realizing that that case could be a problem don't have a knowledge of the > content and don't notice the problem until later... After mulling this over a bit more, I guess I''m a little skeptical of back-patching this because it is clearly a behavior change. It seems unlikely, but not impossible, that someone is relying on the current behavior, and changing it in a minor release might be considered unfriendly. On the flip side, the risk of it flat-out blowing up seems pretty small. For someone to invent their own version of wchar_t that uses something other than Unicode code points would be pretty much pure masochism, wouldn't it? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On the flip side, the risk of it flat-out blowing up seems pretty > small. For someone to invent their own version of wchar_t that uses > something other than Unicode code points would be pretty much pure > masochism, wouldn't it? Well, no, that's not clear. The C standard has pretty carefully avoided constraining the wchar_t representation, so implementors are free to do whatever is most convenient from the standpoint of their library routines. I could easily see somebody deciding to do something that wasn't quite Unicode because it let him re-use lookup tables designed for some other encoding, or some such. Now it's also perfectly possible, maybe even likely, that nobody's done that on any platform we care about. But I don't believe we know that with any degree of certainty. We definitely have not made any effort to establish whether it's true --- for example, we have no regression tests that address the point. (I think that collate.linux.utf8 touches on it, but we're a long way from being able to run that on non-glibc platforms...) regards, tom lane
On Sat, May 7, 2011 at 12:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On the flip side, the risk of it flat-out blowing up seems pretty >> small. For someone to invent their own version of wchar_t that uses >> something other than Unicode code points would be pretty much pure >> masochism, wouldn't it? > > Well, no, that's not clear. The C standard has pretty carefully avoided > constraining the wchar_t representation, so implementors are free to do > whatever is most convenient from the standpoint of their library routines. > I could easily see somebody deciding to do something that wasn't quite > Unicode because it let him re-use lookup tables designed for some other > encoding, or some such. > > Now it's also perfectly possible, maybe even likely, that nobody's done > that on any platform we care about. But I don't believe we know that > with any degree of certainty. We definitely have not made any effort to > establish whether it's true --- for example, we have no regression tests > that address the point. (I think that collate.linux.utf8 touches on it, > but we're a long way from being able to run that on non-glibc > platforms...) Well, since any problems in this are are going to bite us eventually in 9.0+ even without any further action on our part, maybe it would be wise to think up something we could add to the regression tests. That would give us some immediate feedback from the buildfarm, and also significantly improve the odds of someone compiling on a weird platform noticing if things are broken. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Well, since any problems in this are are going to bite us eventually > in 9.0+ even without any further action on our part, maybe it would be > wise to think up something we could add to the regression tests. That > would give us some immediate feedback from the buildfarm, and also > significantly improve the odds of someone compiling on a weird > platform noticing if things are broken. No objection here, but how will we do that? The regression tests are designed to work in any locale/encoding, and would become significantly less useful if they weren't. regards, tom lane
On Mon, May 9, 2011 at 10:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Well, since any problems in this are are going to bite us eventually >> in 9.0+ even without any further action on our part, maybe it would be >> wise to think up something we could add to the regression tests. That >> would give us some immediate feedback from the buildfarm, and also >> significantly improve the odds of someone compiling on a weird >> platform noticing if things are broken. > > No objection here, but how will we do that? The regression tests are > designed to work in any locale/encoding, and would become significantly > less useful if they weren't. I'm just shooting from the hip here, but maybe we could have a separate (probably smaller) set of tests that are only designed to work in a limited range of locales and/or encodings. I'm really pleased that we now have the src/test/isolation stuff, and I think some more auxilliary test suites would be quite excellent. Even if people didn't always want to run every single one when doing things manually, the buildfarm certainly could. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, May 9, 2011 at 10:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> No objection here, but how will we do that? �The regression tests are >> designed to work in any locale/encoding, and would become significantly >> less useful if they weren't. > I'm just shooting from the hip here, but maybe we could have a > separate (probably smaller) set of tests that are only designed to > work in a limited range of locales and/or encodings. I'm really > pleased that we now have the src/test/isolation stuff, and I think > some more auxilliary test suites would be quite excellent. Even if > people didn't always want to run every single one when doing things > manually, the buildfarm certainly could. Hmm. We don't need new infrastructure like the isolation tests do, so another subdirectory seems like overkill. I am thinking about a new target "installcheck-collations" in src/test/regress/GNUmakefile that creates a UTF8-encoding database and runs a different test schedule than the regular tests. The problem we'd have is that there's no way (at present) to make such a test pass on every platform. Windows has its own set of locale names (which initdb fails to install as collations anyway) and we also have the problem that OS X can be counted on to get UTF8 sorting wrong. (It might be okay for case-folding though; not sure.) Possibly we could just provide an alternate expected file for OS X, but I don't see a decent workaround for Windows --- it would pretty much have to have its very own test case. Andrew, what kinds of options have we got for persuading the buildfarm to run such tests only on a subset of platforms? regards, tom lane
Re: Backpatching of "Teach the regular expression functions to do case-insensitive matching"
От
"Kevin Grittner"
Дата:
Tom Lane <tgl@sss.pgh.pa.us> wrote: > I am thinking about a new target "installcheck-collations" in > src/test/regress/GNUmakefile that creates a UTF8-encoding database > and runs a different test schedule than the regular tests. I don't know the best way to do this (or how many people agree we should), but I found that running the standard `make installcheck` against a database where the postgresql.conf file set default_transaction_isolation = serializable helped provide some baseline testing of SSI. None of the results change, but it helps protect against certain classes of dumb error. (I know this from personal experience.) I know *I'd* feel better if at least a few buildfarm animals were set up to do this. -Kevin
Re: Backpatching of "Teach the regular expression functions to do case-insensitive matching"
От
Peter Eisentraut
Дата:
On mån, 2011-05-09 at 10:56 -0400, Robert Haas wrote: > I'm just shooting from the hip here, but maybe we could have a > separate (probably smaller) set of tests that are only designed to > work in a limited range of locales and/or encodings. I'm really > pleased that we now have the src/test/isolation stuff, and I think > some more auxilliary test suites would be quite excellent. Even if > people didn't always want to run every single one when doing things > manually, the buildfarm certainly could. Well, the result of "people don't always run them" is the rest of src/test/. How much of that stuff even works anymore?
Re: Backpatching of "Teach the regular expression functions to do case-insensitive matching"
От
Peter Eisentraut
Дата:
On mån, 2011-05-09 at 12:42 -0400, Tom Lane wrote: > The problem we'd have is that there's no way (at present) to make such > a test pass on every platform. Windows has its own set of locale names > (which initdb fails to install as collations anyway) and we also have > the problem that OS X can be counted on to get UTF8 sorting wrong. > (It might be okay for case-folding though; not sure.) Possibly we could > just provide an alternate expected file for OS X, but I don't see a > decent workaround for Windows --- it would pretty much have to have its > very own test case. Windows >=Vista has locale names similar to Linux, and my cursory testing with some hacked up test suite indicates that it produces the same results as the Linux expected file, modulo some error message differences. So I think this could be made to work, it just needs someone to implement a few bits.
Peter Eisentraut <peter_e@gmx.net> writes: > On mån, 2011-05-09 at 12:42 -0400, Tom Lane wrote: >> The problem we'd have is that there's no way (at present) to make such >> a test pass on every platform. Windows has its own set of locale names >> (which initdb fails to install as collations anyway) and we also have >> the problem that OS X can be counted on to get UTF8 sorting wrong. >> (It might be okay for case-folding though; not sure.) Possibly we could >> just provide an alternate expected file for OS X, but I don't see a >> decent workaround for Windows --- it would pretty much have to have its >> very own test case. > Windows >=Vista has locale names similar to Linux, and my cursory > testing with some hacked up test suite indicates that it produces the > same results as the Linux expected file, modulo some error message > differences. So I think this could be made to work, it just needs > someone to implement a few bits. Well, that would be great, but the "someone" is not going to be me; I don't do Windows. I'd be willing to take responsibility for putting in the regression test once the necessary Windows-specific code was committed, though. regards, tom lane
On Tue, May 10, 2011 at 3:09 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On mån, 2011-05-09 at 10:56 -0400, Robert Haas wrote: >> I'm just shooting from the hip here, but maybe we could have a >> separate (probably smaller) set of tests that are only designed to >> work in a limited range of locales and/or encodings. I'm really >> pleased that we now have the src/test/isolation stuff, and I think >> some more auxilliary test suites would be quite excellent. Even if >> people didn't always want to run every single one when doing things >> manually, the buildfarm certainly could. > > Well, the result of "people don't always run them" is the rest of > src/test/. How much of that stuff even works anymore? I don't know. But I'm not sure I see your point. The fact that we haven't yet succeeded in doing something doesn't mean that it's either impossible or unimportant. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Backpatching of "Teach the regular expression functions to do case-insensitive matching"
От
Peter Eisentraut
Дата:
On tis, 2011-05-10 at 15:17 -0400, Tom Lane wrote: > Well, that would be great, but the "someone" is not going to be me; > I don't do Windows. Yeah, me neither. At least not for this release.
Re: Backpatching of "Teach the regular expression functions to do case-insensitive matching"
От
Peter Eisentraut
Дата:
On tis, 2011-05-10 at 15:48 -0400, Robert Haas wrote: > On Tue, May 10, 2011 at 3:09 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > > Well, the result of "people don't always run them" is the rest of > > src/test/. How much of that stuff even works anymore? > > I don't know. But I'm not sure I see your point. The fact that we > haven't yet succeeded in doing something doesn't mean that it's either > impossible or unimportant. Yes, but doing the same thing again in hope of different results isn't the right thing either. I'm all for more test suites, but we should make them as widely accessible and accessed as possible so that they get maintained.
Peter Eisentraut <peter_e@gmx.net> writes: > I'm all for more test suites, but we should make them as widely > accessible and accessed as possible so that they get maintained. Yeah. My preference would really be to push something like collate.linux.utf8 into the standard regression tests, but we'd first have to get it to where there was only one .sql file and not more than three or so variant expected files (one of which would be the one for platforms that don't support locale_t, analogous to the no-support expected file for the xml test). If we were at that point, then instead of having a separate make target, I'd be very strongly tempted to include the test in the standard tests by the expedient of having it create and \c to a separate database with suitable values of ENCODING, LC_COLLATE, etc. The lack of initdb support for getting more-or-less-standard collation entries into pg_collation on Windows seems to be the major missing piece from here (dunno if Peter is aware of others). If we don't fix that before release, we're going to regret it anyway IMO, because of the inevitable tide of questions/complaints from Windows users trying to use the collation feature. We've already seen at least one such from a beta tester. regards, tom lane
Re: Backpatching of "Teach the regular expression functions to do case-insensitive matching"
От
Peter Eisentraut
Дата:
On tis, 2011-05-10 at 18:05 -0400, Tom Lane wrote: > The lack of initdb support for getting more-or-less-standard collation > entries into pg_collation on Windows seems to be the major missing > piece from here (dunno if Peter is aware of others). If we don't fix > that before release, we're going to regret it anyway IMO, because of > the inevitable tide of questions/complaints from Windows users trying > to use the collation feature. We've already seen at least one such > from a beta tester. Well, someone who wants it will have to do it. It's pretty simple, but not simple enough to code it blindly. If someone wants to do it, I can tell them exactly what to do.
Peter Eisentraut <peter_e@gmx.net> writes: > On tis, 2011-05-10 at 18:05 -0400, Tom Lane wrote: >> The lack of initdb support for getting more-or-less-standard collation >> entries into pg_collation on Windows seems to be the major missing >> piece from here (dunno if Peter is aware of others). If we don't fix >> that before release, we're going to regret it anyway IMO, because of >> the inevitable tide of questions/complaints from Windows users trying >> to use the collation feature. We've already seen at least one such >> from a beta tester. > Well, someone who wants it will have to do it. It's pretty simple, but > not simple enough to code it blindly. If someone wants to do it, I can > tell them exactly what to do. Hm, do you know how to enumerate the available locales on Windows? (Still not volunteering, since I couldn't test it, but that's the only missing piece of information AFAIK.) regards, tom lane
Re: Backpatching of "Teach the regular expression functions to do case-insensitive matching"
От
Peter Eisentraut
Дата:
On ons, 2011-05-11 at 16:47 -0400, Tom Lane wrote: > Hm, do you know how to enumerate the available locales on Windows? EnumSystemLocalesEx() Reference: http://msdn.microsoft.com/en-us/library/dd317829(v=vs.85).aspx Example: http://msdn.microsoft.com/en-us/library/dd319091(v=vs.85).aspx As you can see in the example, this returns names like "en-US" and "es-ES". I would imagine we normalize this to the usual "en_US", "es_ES" (but we could also install the not normalized names, just like we install "en_US.utf8"). But you need to rearrange the code in initdb a bit because this thing works with callbacks. There is an older interface EnumSystemLocales() which returns locale IDs, which you then have to look up and convert into a name manually. There is code for that in the old installer CVS on pgfoundry. But it's very ugly, so I'd rather skip that and just concentrate on supporting the newer interface.
Peter Eisentraut <peter_e@gmx.net> writes: > On ons, 2011-05-11 at 16:47 -0400, Tom Lane wrote: >> Hm, do you know how to enumerate the available locales on Windows? > EnumSystemLocalesEx() > Reference: > http://msdn.microsoft.com/en-us/library/dd317829(v=vs.85).aspx > Example: http://msdn.microsoft.com/en-us/library/dd319091(v=vs.85).aspx Doesn't look too bad ... > There is an older interface EnumSystemLocales() which returns locale > IDs, which you then have to look up and convert into a name manually. > There is code for that in the old installer CVS on pgfoundry. But it's > very ugly, so I'd rather skip that and just concentrate on supporting > the newer interface. I guess the question is what happens on pre-Vista Windows if we use EnumSystemLocalesEx. I don't object to just not populating pg_collation in that case, but we probably don't want it to fail entirely. regards, tom lane
Peter Eisentraut <peter_e@gmx.net> writes: > On ons, 2011-05-11 at 16:47 -0400, Tom Lane wrote: >> Hm, do you know how to enumerate the available locales on Windows? > EnumSystemLocalesEx() > Reference: > http://msdn.microsoft.com/en-us/library/dd317829(v=vs.85).aspx > Example: http://msdn.microsoft.com/en-us/library/dd319091(v=vs.85).aspx > As you can see in the example, this returns names like "en-US" and > "es-ES". I would imagine we normalize this to the usual "en_US", > "es_ES" (but we could also install the not normalized names, just like > we install "en_US.utf8"). I poked around in Microsoft's documentation a bit. It's not entirely clear to me that the locale names enumerated by this API match up with the names accepted by setlocale() --- in particular, http://msdn.microsoft.com/en-us/library/hzz3tw78(v=VS.90).aspx does not show any two-letter abbreviations for most of the languages. That would need some testing, but it seems likely that what we'd have to do is construct the longer-form locale name using info obtained from GetLocaleInfoEx. Also, the locale names from EnumSystemLocalesEx definitely don't include any code page identifier. We could probably just enter the alias under UTF8, and again under the ANSI code page number from GetLocaleInfoEx, if the locale has one. > There is an older interface EnumSystemLocales() which returns locale > IDs, which you then have to look up and convert into a name manually. > There is code for that in the old installer CVS on pgfoundry. But it's > very ugly, so I'd rather skip that and just concentrate on supporting > the newer interface. Another thing I found out from the docs is that locale IDs aren't unique in Vista and later: they have "supplemental" locales that have distinct names but share the same ID as the primary. It's entirely unclear how that maps to setlocale names, but it does seem like we probably don't want to use EnumSystemLocales. regards, tom lane