Обсуждение: unicode searches failing that use % and LIKE operators
Dear all, I have the following problem: a compound search, involving 2 wildcarded character search terms, in which one search term consists of Latin characters and the other, of UTF-8 unicode Greek characters, fails. This is strange, because similar searches in which both terms are either unicode Greek or Latin characters succeed. Both terms query a column of type text. Searches of this kind fail both via JDBC and the PSQL interface via an xterm window (which handles unicode properly). The JDBC search will have put both terms in UTF-8. The column is called metadatafulltext. The search uses the LIKE operator in conjunction with the wildcard character: For example, the following search fails: SELECT .. FROM metadatafulltext... WHERE metadatafulltext LIKE '%Jones%' AND metadatafulltext LIKE '%ALPHABETA%' (where ALPHABETA is actually a unicode Greek string (\u03b1\u03b2). whereas searches using all Greek characters succeed: WHERE metadatafulltext LIKE '%BETAEPSILONDELTA%' AND metadatafulltext LIKE '%ALPHABETA%' and equally, all-Latin searches also succeed: WHERE metadatafulltext LIKE '%Jones%' AND metadatafulltext LIKE '%Smith%' What must I do to ensure that mixed-term searches of the first kind succeed? Thanks in advance, Ben Weaver -- Benjamin Weaver Faculty Research Associate, Imaging Papyri Projects, Herculaneum Society, Oxford email: benjamin.weaver@classics.ox.ac.uk phone: (0)1865 610236
Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes: > I have the following problem: a compound search, involving 2 wildcarded > character search terms, in which one search term consists of Latin characters > and the other, of UTF-8 unicode Greek characters, fails. This is strange, > because similar searches in which both terms are either unicode Greek or Latin > characters succeed. That's pretty strange all right. What PG version is this, what database encoding are you using, and what are the server's lc_collate and lc_ctype settings? regards, tom lane
Tom, Thanks. I am running: Postgres 8.1.4 server_encoding UTF8 lc_collate en_GB.UTF-8 lc_ctype en_GB.UTF-8 Ben In message <19842.1193077201@sss.pgh.pa.us> Tom Lane <tgl@sss.pgh.pa.us> writes: > Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes: > > I have the following problem: a compound search, involving 2 wildcarded > > character search terms, in which one search term consists of Latin characters > > and the other, of UTF-8 unicode Greek characters, fails. This is strange, > > because similar searches in which both terms are either unicode Greek or Latin > > characters succeed. > > That's pretty strange all right. What PG version is this, what database > encoding are you using, and what are the server's lc_collate and > lc_ctype settings? > > regards, tom lane -- Benjamin Weaver Faculty Research Associate, Imaging Papyri Projects, Herculaneum Society, Oxford email: benjamin.weaver@classics.ox.ac.uk phone: (0)1865 610236
Tom, To be more precise, the mixed queries "fail" in that they return hits of 0 rows, when they should return more than 0 rows. Ben In message <19842.1193077201@sss.pgh.pa.us> Tom Lane <tgl@sss.pgh.pa.us> writes: > Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes: > > I have the following problem: a compound search, involving 2 wildcarded > > character search terms, in which one search term consists of Latin characters > > and the other, of UTF-8 unicode Greek characters, fails. This is strange, > > because similar searches in which both terms are either unicode Greek or Latin > > characters succeed. > > That's pretty strange all right. What PG version is this, what database > encoding are you using, and what are the server's lc_collate and > lc_ctype settings? > > regards, tom lane -- Benjamin Weaver Faculty Research Associate, Imaging Papyri Projects, Herculaneum Society, Oxford email: benjamin.weaver@classics.ox.ac.uk phone: (0)1865 610236
Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes: > Postgres 8.1.4 > server_encoding UTF8 > lc_collate en_GB.UTF-8 > lc_ctype en_GB.UTF-8 Hmph, nothing strange-looking there. I tried to reproduce the problem here, without success. Now I was using 8.1.10 on Linux (I gather your platform is not Linux from the spelling of the locale names), but there hasn't been any change in the LIKE code in the 8.1 branch. One thing that would be worth trying is to turn on query logging (set log_statement = 'all') and verify that the problem query appears in the postmaster log as you expect; this is just to eliminate any doubt about whether the string encoding is getting messed up somewhere between client and server. Assuming you don't find a smoking gun that way, can you put together a self-contained test case? There must be some detail of what you're doing that I failed to reproduce. regards, tom lane
"Benjamin Weaver" <benjamin.weaver@classics.ox.ac.uk> writes: > Tom, > > Thanks. I am running: > > Postgres 8.1.4 > server_encoding UTF8 > lc_collate en_GB.UTF-8 > lc_ctype en_GB.UTF-8 Hm, I wonder what the en_GB locale on your machine does when it sees characters unused in English such as Greek characters. Is this Windows or some form of Unix? Do you have a particular LIKE pattern and a particular example for which it fails? Or does it work reasonably when you test individual values but fail when you perform large searches? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Dear Greg, Tom, I AM in fact running the db on Linux. Redhat 9. Are the encoding parameters wrong for Linux? I am sending the queries via JDBC from a windows machine. But I have also gotten the same results via psql when sending the queries from one local redhat 9 box to the redhat 9 database server box. Ben In message <87bqaqhme9.fsf@oxford.xeocode.com> Gregory Stark <stark@enterprisedb.com> writes: > "Benjamin Weaver" <benjamin.weaver@classics.ox.ac.uk> writes: > > > Tom, > > > > Thanks. I am running: > > > > Postgres 8.1.4 > > server_encoding UTF8 > > lc_collate en_GB.UTF-8 > > lc_ctype en_GB.UTF-8 > > Hm, I wonder what the en_GB locale on your machine does when it sees > characters unused in English such as Greek characters. Is this Windows or some > form of Unix? > > Do you have a particular LIKE pattern and a particular example for which it > fails? Or does it work reasonably when you test individual values but fail > when you perform large searches? > > -- > Gregory Stark > EnterpriseDB http://www.enterprisedb.com -- Benjamin Weaver Faculty Research Associate, Imaging Papyri Projects, Herculaneum Society, Oxford email: benjamin.weaver@classics.ox.ac.uk phone: (0)1865 610236
Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes: > I AM in fact running the db on Linux. Redhat 9. Are the encoding parameters > wrong for Linux? Hmm ... RH 9 is awfully old. It's at least conceivable that you're getting bit by some glibc bug. However, if these are just plain LIKE calls and not ILIKE then I don't think that theory is very viable --- AFAICS the regular LIKE code doesn't depend on anything from the platform. A slightly more viable theory is that you're getting bit by a gcc bug. Did you build the Postgres executables in-house, and if not where did you get them from? It might be helpful if you'd show us the output of "pg_config". Again, if you could put together a self-contained test case (preferably a short psql script) it would be helpful so we could try the case on other machines. It's not at all clear at this point whether the bug is Postgres' fault or something about the underlying platform. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes: >> Postgres 8.1.4 >> server_encoding UTF8 >> lc_collate en_GB.UTF-8 >> lc_ctype en_GB.UTF-8 > > Hmph, nothing strange-looking there. I tried to reproduce the problem > here, without success. Now I was using 8.1.10 on Linux (I gather your > platform is not Linux from the spelling of the locale names) Really? On my Debian system the locales are named precisely like that. What do they look like on Red Hat? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: >> Hmph, nothing strange-looking there. I tried to reproduce the problem >> here, without success. Now I was using 8.1.10 on Linux (I gather your >> platform is not Linux from the spelling of the locale names) > Really? On my Debian system the locales are named precisely like that. > What do they look like on Red Hat? $ locale -a | grep en_GB en_GB en_GB.iso88591 en_GB.iso885915 en_GB.utf8 This is on Fedora Core 6. It appears that you can *set* the locale to "en_GB.UTF-8" successfully, but it doesn't seem to be the preferred spelling. regards, tom lane
Thanks, Tom and Greg, for all your help. I agree that Redhat 9 is a bit creaky and that we here should upgrade. In the meantime I will put together a tiny test case that will reproduce the problem (at least on my platform). Ben In message <26870.1193091184@sss.pgh.pa.us> Tom Lane <tgl@sss.pgh.pa.us> writes: > Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes: > > I AM in fact running the db on Linux. Redhat 9. Are the encoding parameters > > wrong for Linux? > > Hmm ... RH 9 is awfully old. It's at least conceivable that you're > getting bit by some glibc bug. However, if these are just plain LIKE > calls and not ILIKE then I don't think that theory is very viable > --- AFAICS the regular LIKE code doesn't depend on anything from the > platform. > > A slightly more viable theory is that you're getting bit by a gcc bug. > Did you build the Postgres executables in-house, and if not where did > you get them from? It might be helpful if you'd show us the output > of "pg_config". > > Again, if you could put together a self-contained test case (preferably > a short psql script) it would be helpful so we could try the case on > other machines. It's not at all clear at this point whether the bug > is Postgres' fault or something about the underlying platform. > > regards, tom lane -- Benjamin Weaver Faculty Research Associate, Imaging Papyri Projects, Herculaneum Society, Oxford email: benjamin.weaver@classics.ox.ac.uk phone: (0)1865 610236
Tom, Greg, Please accept my considerable apologies. The fault was my own program that loaded metadata into the text column, metadatafulltext. My reason (not a sufficient excuse!) for overlooking the problem is that the Unicode is invisible in psql windows, therefore it was not a simple matter of displaying each row of metadatafulltext to examine its contents. Some rows got loaded with Latin metadata, others with Greek, whereas each and every row needed both. You both have been very helpful. In the future I will be careful as I can to ensure any problems don't exist with my down read-write-create-selete software! Thanks again for your help, Ben In message <26870.1193091184@sss.pgh.pa.us> Tom Lane <tgl@sss.pgh.pa.us> writes: > Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes: > > I AM in fact running the db on Linux. Redhat 9. Are the encoding parameters > > wrong for Linux? > > Hmm ... RH 9 is awfully old. It's at least conceivable that you're > getting bit by some glibc bug. However, if these are just plain LIKE > calls and not ILIKE then I don't think that theory is very viable > --- AFAICS the regular LIKE code doesn't depend on anything from the > platform. > > A slightly more viable theory is that you're getting bit by a gcc bug. > Did you build the Postgres executables in-house, and if not where did > you get them from? It might be helpful if you'd show us the output > of "pg_config". > > Again, if you could put together a self-contained test case (preferably > a short psql script) it would be helpful so we could try the case on > other machines. It's not at all clear at this point whether the bug > is Postgres' fault or something about the underlying platform. > > regards, tom lane -- Benjamin Weaver Faculty Research Associate, Imaging Papyri Projects, Herculaneum Society, Oxford email: benjamin.weaver@classics.ox.ac.uk phone: (0)1865 610236
I had the same problem. I solved it by doing query using the pattern matching operator ~*.
I paste an example query
select id_edb,transcription from epigraph where transcription ~* 'εἰρήνῃ'
and it works
best
Fabio