Discussion: Is this error correct/possible?
Hi, I have a database created in PostgreSQL 8.0.3 which is filled with a PostgreSQL 7.4.7 database backup. Both databases were created with SQL_ASCII. Running a query against one of the tables gives this error (with the pg74.216.jdbc3.jar and postgresql-8.0-312.jdbc3.jar drivers): "Invalid character data was found. This is most likely caused by stored data containing characters that are invalid for the character set the database was created in. The most common example of this is storing 8bit data in a SQL_ASCII database." Inspection of the row (using pgadmin3) shows that there is the char "ü" in a char(40) column. Questions: 1. Is a "ü" allowed in a SQL_ASCII database and a column of char(40)? 2. If so, is this a JDBC bug? 3. If not, is this a PostgreSQL bug, allowing a non-allowed character in a column? TIA Joost
Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes: > I have a database created in PostgreSQL 8.0.3 which is filled > with a PostgreSQL 7.4.7 database backup. Both databases were created > with SQL_ASCII. > > Running a query against one of the tables gives this error (with the > pg74.216.jdbc3.jar and postgresql-8.0-312.jdbc3.jar drivers): > > "Invalid character data was found. This is most likely caused by stored > data containing characters that are invalid for the character set the > database was created in. The most common example of this is storing > 8bit data in a SQL_ASCII database." Could you copy/paste the complete exception and stacktrace? > Inspection of the row (using pgadmin3) shows that there is the char "ü" > in a char(40) column. > > Questions: > > 1. Is a "ü" allowed in a SQL_ASCII database and a column of char(40)? From a very recent discussion on pgsql-odbc (check the archive), it is not allowed. Not allowed "anymore", from what I gathered. Which is quite sensible considering that ü is not ASCII: how would you sort it, uppercase it, etc.? I did not catch whether this has been recently enforced by the ODBC driver, the engine, or both... check the archive. > 2. If so, is this a JDBC bug? The stacktrace and such debug output would help track down what JDBC is doing here. > 3. If not, is this a PostgreSQL bug, allowing a non-allowed character in > a column? Maybe it was allowed in 7.4.7, but not anymore in 8.0.3?
On Tue, 23 Aug 2005, Joost Kraaijeveld wrote: > I have a database created in PostgreSQL 8.0.3 which is filled > with a PostgreSQL 7.4.7 database backup. Both databases were created > with SQL_ASCII. > > "Invalid character data was found. This is most likely caused by stored > data containing characters that are invalid for the character set the > database was created in. The most common example of this is storing > 8bit data in a SQL_ASCII database." > > Inspection of the row (using pgadmin3) shows that there is the char "ü" > in a char(40) column. > > Questions: > > 1. Is a "ü" allowed in a SQL_ASCII database and a column of char(40)? It is allowed to be stored in the database because SQL_ASCII is not a real encoding. SQL_ASCII allows you to store anything you want and doesn't require you to tell the server what character set it actually is. The problem is on the return end: the JDBC driver asks the server to always return data in UTF-8 by setting client_encoding appropriately. The server has no idea what the original encoding of the data was, so it has no means of converting it to Unicode. It may happen to look like u-double-dot in your particular pgadmin3 client's encoding, but if that client's encoding were different it would show up as a different character. This is why the JDBC driver bails out instead of just picking a random character. > 2. If so, is this a JDBC bug? No. The JDBC documentation clearly states not to choose a SQL_ASCII database for your data. http://jdbc.postgresql.org/documentation/80/your-database.html > 3. If not, is this a PostgreSQL bug, allowing a non-allowed character in > a column?
This is how the SQL_ASCII encoding works, for better or worse (mostly worse). The problem is that you've likely had two different clients connect with different client_encodings, which ends up storing two different encodings in the database, and that will then break the other client when it tries to display the data. Kris Jurka
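Kris's explanation can be reproduced in miniature outside the driver. The sketch below uses plain Python (not the JDBC driver itself) to show why the decode step must fail: a "ü" written by a Latin-1 client is stored in a SQL_ASCII database as the single raw byte 0xFC, and that byte announces the start of a multi-byte UTF-8 sequence that never arrives — which matches the "UTF-8 string representation was truncated" cause in the stack trace later in this thread.

```python
# A "ü" from a Latin-1 client lands in a SQL_ASCII database as the raw
# byte 0xFC. Decoding that byte as UTF-8 (which is what the JDBC driver
# requests via client_encoding) cannot succeed, because 0xFC is not a
# valid start of a complete UTF-8 sequence.
latin1_bytes = "ü".encode("latin-1")
assert latin1_bytes == b"\xfc"

try:
    latin1_bytes.decode("utf-8")  # what the driver effectively attempts
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

The driver has no way to know the byte was originally Latin-1 — that information was never stored — so refusing to guess is the only safe behavior.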
On Tue, 2005-08-23 at 15:42 +0200, Marc Herbert wrote: > Could you copy/paste the complete exception and stacktrace?

Invalid character data was found. This is most likely caused by stored data containing characters that are invalid for the character set the database was created in. The most common example of this is storing 8bit data in a SQL_ASCII database.

org.postgresql.util.PSQLException: Invalid character data was found. This is most likely caused by stored data containing characters that are invalid for the character set the database was created in. The most common example of this is storing 8bit data in a SQL_ASCII database.
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:1904)
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:2203)
    at nl.muntpost.MuntpostApp.main(MuntpostApp.java:42)
Caused by: java.io.IOException: UTF-8 string representation was truncated
    at org.postgresql.core.Encoding.decodeUTF8(Encoding.java:304)
    at org.postgresql.core.Encoding.decode(Encoding.java:179)
    at org.postgresql.core.Encoding.decode(Encoding.java:193)
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:1900)
    ... 2 more

> > Inspection of the row (using pgadmin3) shows that there is the char "ü" > > in a char(40) column. > > > > Questions: > > > > 1. Is a "ü" allowed in a SQL_ASCII database and a column of char(40)? > > From a very recent discussion on pgsql-odbc (check the archive), it is > not allowed. Not allowed "anymore" from what I got. Which is quite > sensible considering that ü is not ASCII: how would you sort it, upper > it, etc.? I did not get if this has been recently enforced by the ODBC > driver, the engine, both... check the archive. OK, this is what I expected. > > > 2. If so, is this a JDBC bug? > > The stacktrace and such debug stuff would help tracking what JDBC is > doing here. See above. > > > 3. If not, is this a PostgreSQL bug, allowing a non-allowed character in > > a column? > > Maybe it was allowed in 7.4.7, not anymore in 8.0.3? From another mail I understood that this is just the way it is. Joost
On Tue, 23 Aug 2005, Joost Kraaijeveld wrote: > What if you don't have a choice? E.g. the database is already in > SQL_ASCII? Can I convert the database from SQL_ASCII to UNICODE? Yes, but this requires a dump and restore. Additionally, since you have some non-Unicode data in there, it may require running something like iconv over the data before reloading it. This can be especially messy if you've got multiple encodings in there already. It all depends on the uniformity of your existing data. Kris Jurka
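The re-encoding step Kris mentions can be sketched as follows. This assumes the dumped data really is uniformly Latin-1; the file names and the `transcode` helper are illustrative, not part of any PostgreSQL tooling. It is the Python equivalent of running `iconv -f LATIN1 -t UTF-8 dump.sql > dump.utf8.sql` between the `pg_dump` and the restore into a freshly created UNICODE database.

```python
# Illustrative sketch: rewrite a pg_dump text file from Latin-1 to
# UTF-8, the same job as `iconv -f LATIN1 -t UTF-8`. This is only safe
# if the existing data is uniformly Latin-1; mixed encodings need
# per-table or per-column cleanup first.
def transcode(src_path, dst_path, src_enc="latin-1", dst_enc="utf-8"):
    with open(src_path, "r", encoding=src_enc) as src, \
         open(dst_path, "w", encoding=dst_enc, newline="") as dst:
        for line in src:
            dst.write(line)
```

If the database contains several encodings mixed together (the messy case Kris warns about), no single conversion pass like this can be correct; the data has to be segregated by its true encoding first.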
On Tue, 2005-08-23 at 09:34 -0500, Kris Jurka wrote: > It is allowed to be stored in the database because SQL_ASCII is not a real > encoding. SQL_ASCII allows you to store anything you want and doesn't > require you to tell the server what character set it actually is. The > problem is on the return end, the JDBC driver asks the server to always > return data in UTF-8 by setting the client_encoding appropriately. The > server has no idea what the original encoding of the data was, so it has > no means of converting it to unicode. It may happen to look like > u-double-dot in your particular pgadmin3 client's encoding, but if that > client's encoding was different it would show up as a different character. > This is why the JDBC driver bails out instead of just picking a > random character. OK. > > 2. If so, is this a JDBC bug? > > No. The JDBC documentation clearly states not to choose a SQL_ASCII > database for your data. > > http://jdbc.postgresql.org/documentation/80/your-database.html What if you don't have a choice? E.g. the database is already in SQL_ASCII? Can I convert the database from SQL_ASCII to UNICODE? TIA Joost
On Tue, 2005-08-23 at 09:55 -0500, Kris Jurka wrote: > Yes, but this requires a dump and restore. Additionally since you have > some non-unicode data in there, it may require running something like > iconv over the data before reloading it. This can be especially messy if > you've got multiple encodings in there already. It all depends on the > uniformity of your existing data. Can you give me a pointer to where I can find info about such a procedure? TIA Joost
Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes: > On Tue, 2005-08-23 at 09:55 -0500, Kris Jurka wrote: >> Yes, but this requires a dump and restore. Additionally since you have >> some non-unicode data in there, it may require running something like >> iconv over the data before reloading it. This can be especially messy if >> you've got multiple encodings in there already. It all depends on the >> uniformity of your existing data. > Can you give me a pointer where I can find any info about such > procedure? GNU recode is a simple but great tool for converting text files; you may find it useful. It seems to be similar to iconv in purpose (I don't know iconv).
My client has a standard that only a subset of the 7 bit ASCII character set is to be allowed in character columns in their database (decimal 32 to 126 in most columns, decimal 10 also allowed in some). They would prefer to see exceptions from the JDBC driver on attempts to insert or retrieve any character outside the ASCII range than to have it silently written or returned. In this particular case, do you see a problem with using the SQL_ASCII encoding? Thanks, -Kevin >>> Kris Jurka <books@ejurka.com> 08/23/05 9:34 AM >>> The JDBC documentation clearly states not to choose a SQL_ASCII database for your data. http://jdbc.postgresql.org/documentation/80/your-database.html
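For a standard like the one Kevin describes, an application-side check is straightforward regardless of what the database encoding enforces. A minimal sketch (the `is_clean` name and interface are made up for illustration, not part of any driver API):

```python
# Accept only printable 7-bit ASCII (decimal 32-126), optionally also
# allowing LF (decimal 10), matching the client's column standard.
def is_clean(text, allow_newline=False):
    allowed = set(range(32, 127))
    if allow_newline:
        allowed.add(10)
    return all(ord(ch) in allowed for ch in text)
```

Checking values like this before they ever reach the driver gives the "exception on insert" behavior the client wants, independent of what the server or JDBC layer would accept.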
On Thu, 25 Aug 2005, Kevin Grittner wrote: > My client has a standard that only a subset of the 7 bit ASCII character > set is to be allowed in character columns in their database (decimal 32 > to 126 in most columns, decimal 10 also allowed in some). They would > prefer to see exceptions from the JDBC driver on attempts to insert or > retrieve any character outside the ASCII range than to have it silently > written or returned. The JDBC driver will throw an Exception upon reading data that it receives that is not valid UTF-8 data. This will not detect all problems because it always sends valid UTF-8 data to the server and expects it to convert it to the server encoding or complain. This means if you've got a non-ASCII character the JDBC driver will send it to the database and read it from the database as UTF-8 and will not complain. Problems arise when: 1) Another client adds data of another encoding to the database. Then the JDBC driver tries to read it, finds it isn't UTF-8, and bails out. 2) Sending non-ASCII data in as UTF-8 can convert one character into multiple bytes. In a real encoding the server realizes this and converts the multiple bytes back into one character for checks like VARCHAR(N). SQL_ASCII stores the multiple bytes as multiple characters and this check can fail when you don't expect it to. > In this particular case, do you see a problem with using the SQL_ASCII > encoding? > The question is, "why would you want to?" Why not select a real encoding that is a superset of ASCII, like, say, LATIN1. It provides a much more reasonable fallback behavior if you ever do add non-ASCII data. Kris Jurka
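Kris's second point — that SQL_ASCII counts bytes where a real encoding counts characters — can be shown with two lines. A hedged sketch: if a UTF-8 client writes the 4-character string below into a SQL_ASCII database, the server sees 5 "characters", so a CHAR(4) or VARCHAR(4) length check could reject a value that looks like it fits.

```python
# Byte count vs. character count: "müde" is 4 characters, but 5 bytes
# once "ü" is UTF-8 encoded. A real server encoding counts characters;
# SQL_ASCII effectively counts the bytes.
s = "müde"
assert len(s) == 4                    # character count
assert len(s.encode("utf-8")) == 5    # byte count seen by SQL_ASCII
```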