Discussion: SET client_encoding = 'UTF8'
Hello dear developers, The command SET client_encoding = 'UTF8' throws an exception in the driver, because the driver expects UNICODE. I understand exceptions for other encodings, but this is IMHO a must-have. Our database scripts should contain this line at the beginning, so that they can also be loaded into the server manually, and it is actually more correct than a line that sets the encoding to 'UNICODE'. Thanks in advance and with best regards, Daniel Migowski
Daniel Migowski <dmigowski@ikoffice.de> writes: > The command > SET client_encoding = 'UTF8' > throws an exception in the driver, because the driver expects UNICODE. Er, what driver exactly? Perhaps you need a more up-to-date version of said driver? 'UTF8' has been our standard spelling of this encoding's name for quite some time now. regards, tom lane
On Sun, 18 May 2008, Daniel Migowski wrote: > The command SET client_encoding = 'UTF8' > > throws an exception in the driver, because the driver expects UNICODE. This has been discussed before and the problem is that there are too many ways to say UTF8 [1]. You can say UTF8, UTF-8, UTF -- 8, and so on. Perhaps we should strip all spaces and dashes prior to comparison? [1] http://archives.postgresql.org/pgsql-jdbc/2008-02/threads.php#00174 Kris Jurka
Tom Lane wrote: > Daniel Migowski <dmigowski@ikoffice.de> writes: >> The command >> SET client_encoding = 'UTF8' >> throws an exception in the driver, because the driver expects UNICODE. > > Er, what driver exactly? Perhaps you need a more up-to-date version > of said driver? 'UTF8' has been our standard spelling of this > encoding's name for quite some time now. The driver requests client_encoding = UNICODE in the startup packet, and expects client_encoding to stay as UNICODE throughout. If client code goes off and manually sets it to UTF8 then the JDBC driver complains, because it doesn't know that UNICODE is equivalent to UTF8. -O
Kris Jurka wrote: > On Sun, 18 May 2008, Daniel Migowski wrote: >> The command SET client_encoding = 'UTF8' >> >> throws an exception in the driver, because the driver expects UNICODE. > This has been discussed before and the problem is that there are too > many ways to say UTF8 [1]. You can say UTF8, UTF-8, UTF -- 8, and so > on. Perhaps we should strip all spaces and dashes prior to comparison? This would be correct in my opinion. I think no one dares to declare a charset name that relies on characters other than 0-9 and a-z to be identifiable. IMHO we should just accept whatever postgres itself accepts (we could dig into the parsing code of postgres). I tried at the command line, and got the following: set client_encoding='foobar'; ERROR: invalid value for parameter "client_encoding": "foobar" set client_encoding='utf8'; OK set client_encoding='utf-8'; OK set client_encoding='utf -- 8'; OK set client_encoding='Utf -- 8'; OK set client_encoding='Utf -- 98'; ERROR: invalid value for parameter "client_encoding": "Utf -- 98" set client_encoding='Utf_8'; OK But I think we should be fine with userencoding.toLowerCase().replaceAll("[^0-9a-z]","").equals("utf8"); // untested prototype code or something like this. > > [1] http://archives.postgresql.org/pgsql-jdbc/2008-02/threads.php#00174 Thanks for the link. With best regards, Daniel Migowski
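For readers following the thread, the normalization proposed above can be sketched as a small Java helper. This is only an illustration of the idea in the message: the class and method names (EncodingNames, looksLikeUtf8) are made up here, and nothing in it is taken from the driver or from the server's own name-matching code, which may well behave differently for other inputs.

```java
import java.util.Locale;

// Hypothetical helper, not part of the JDBC driver.
final class EncodingNames {
    private EncodingNames() {}

    /**
     * Lower-cases the name, drops every character outside 0-9 and a-z,
     * and checks whether what remains is exactly "utf8".
     */
    static boolean looksLikeUtf8(String userEncoding) {
        if (userEncoding == null) {
            return false;
        }
        String normalized = userEncoding
                .toLowerCase(Locale.ROOT)        // locale-independent lower-casing
                .replaceAll("[^0-9a-z]", "");   // strip spaces, dashes, underscores, ...
        return normalized.equals("utf8");
    }
}
```

With this sketch, the spellings from the psql session ('utf8', 'utf-8', 'utf -- 8', 'Utf -- 8', 'Utf_8') are accepted, while 'Utf -- 98' and 'foobar' are rejected. Note that 'UNICODE' is not matched by this check and would still need separate handling.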
Daniel Migowski <dmigowski@ikoffice.de> writes: > Kris Jurka wrote: >> On Sun, 18 May 2008, Daniel Migowski wrote: >>> The command SET client_encoding = 'UTF8' >>> throws an exception in the driver, because the driver expects UNICODE. >> This has been discussed before and the problem is that there are too >> many ways to say UTF8 [1]. You can say UTF8, UTF-8, UTF -- 8, and so >> on. Perhaps we should strip all spaces and dashes prior to comparison? Perhaps we should make the backend return the values of client_encoding and server_encoding in canonical form (i.e., "UTF8") regardless of the spelling variant the user used. I'm not thrilled with having JDBC thinking it knows the conversion algorithm the backend uses. Of course, such a change would break code relying on the older behavior :-( regards, tom lane
Tom Lane wrote: > Daniel Migowski <dmigowski@ikoffice.de> writes: >> Kris Jurka wrote: >>> On Sun, 18 May 2008, Daniel Migowski wrote: >>>> The command SET client_encoding = 'UTF8' >>> throws an exception in the driver, because the driver expects UNICODE. >>> This has been discussed before and the problem is that there are too >>> many ways to say UTF8 [1]. You can say UTF8, UTF-8, UTF -- 8, and so >>> on. Perhaps we should strip all spaces and dashes prior to comparison? > > Perhaps we should make the backend return the values of client_encoding > and server_encoding in canonical form (i.e., "UTF8") regardless of the > spelling variant the user used. I'm not thrilled with having JDBC > thinking it knows the conversion algorithm the backend uses. > > Of course, such a change would break code relying on the older behavior > :-( Not sure if this is a big enough issue to warrant a server change. It only happens when a JDBC client issues a manual SET client_encoding to an encoding that's UTF8 but isn't spelled "UNICODE". That's going to be a no-op anyway, so I'm not entirely clear why the client needs to be sending it in the first place. It sounds like the root cause might be something like "let's feed pg_dump output to JDBC". So we could add a special case in the driver to allow exactly "UTF8" as well as "UNICODE", if that's the canonical way the server spells it these days. -O
Oliver Jowett <oliver@opencloud.com> writes: > It sounds like the root cause might be something like "let's feed > pg_dump output to JDBC". So we could add a special case in the driver to > allow exactly "UTF8" as well as "UNICODE", if that's the canonical way > the server spells it these days. +1 for that in any case, because UNICODE hasn't been the canonical spelling since 8.1. regards, tom lane
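The special case agreed on above (accept exactly "UTF8" in addition to "UNICODE") might look roughly like the following Java sketch. Everything here is an assumption made for illustration: the class ClientEncodingCheck, both method names, and the use of IllegalStateException are hypothetical and do not show the driver's real code or exception type.

```java
// Hypothetical sketch of the proposed special case, not the driver's actual code.
final class ClientEncodingCheck {
    private ClientEncodingCheck() {}

    /** Accepts only the two exact spellings discussed in the thread. */
    static boolean isAcceptedClientEncoding(String reported) {
        return "UTF8".equals(reported) || "UNICODE".equals(reported);
    }

    /**
     * Called with the client_encoding value reported by the server;
     * rejects anything other than the two accepted spellings.
     */
    static void checkReportedClientEncoding(String reported) {
        if (!isAcceptedClientEncoding(reported)) {
            // The real driver would raise its own exception type here (assumption).
            throw new IllegalStateException(
                    "client_encoding changed to unsupported value: " + reported);
        }
    }
}
```

The comparison is deliberately exact, as suggested in the message: other spellings such as "utf-8" are still rejected, so the driver does not have to guess how the backend normalizes encoding names.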
On Mon, 19 May 2008, Tom Lane wrote: > Oliver Jowett <oliver@opencloud.com> writes: >> So we could add a special case in the driver to allow exactly "UTF8" as >> well as "UNICODE", if that's the canonical way the server spells it >> these days. > > +1 for that in any case, because UNICODE hasn't been the canonical > spelling since 8.1. > OK, I'll make this happen. A workaround for the immediate problem is to use the URL parameter allowEncodingChanges=true. Kris Jurka http://jdbc.postgresql.org/documentation/83/connect.html#connection-parameters
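As a rough illustration of the workaround mentioned above, the parameter is simply appended to the JDBC connection URL. The helper below is a hypothetical sketch: the class and method names are invented, and the host, port and database name in the usage note are placeholders.

```java
// Hypothetical helper that builds a connection URL with the workaround applied.
final class EncodingWorkaround {
    private EncodingWorkaround() {}

    /** Builds a PostgreSQL JDBC URL with allowEncodingChanges=true appended. */
    static String urlWithEncodingChanges(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database
                + "?allowEncodingChanges=true";
    }
}
```

For example, urlWithEncodingChanges("localhost", 5432, "mydb") yields jdbc:postgresql://localhost:5432/mydb?allowEncodingChanges=true, which can then be passed to java.sql.DriverManager.getConnection together with the user name and password.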
On Mon, 19 May 2008, Kris Jurka wrote: > On Mon, 19 May 2008, Tom Lane wrote: > >> Oliver Jowett <oliver@opencloud.com> writes: >>> So we could add a special case in the driver to allow exactly "UTF8" as >>> well as "UNICODE", if that's the canonical way the server spells it these >>> days. >> >> +1 for that in any case, because UNICODE hasn't been the canonical >> spelling since 8.1. >> > > OK, I'll make this happen. A work around for the immediate problem is to use > the URL parameter allowEncodingChanges=true. > Done.