Discussion: Is this error correct/possible?

Is this error correct/possible?

From
Joost Kraaijeveld
Date:
Hi,

I have a database created in PostgreSQL 8.0.3 that was filled from a
PostgreSQL 7.4.7 database backup. Both databases were created with
SQL_ASCII.

Running a query against one of the tables gives this error (with both the
pg74.216.jdbc3.jar and postgresql-8.0-312.jdbc3.jar drivers):

"Invalid character data was found.  This is most likely caused by stored
data containing characters that are invalid for the character set the
database was created in.  The most common example of this is storing
8bit data in a SQL_ASCII database."

Inspection of the row (using pgadmin3) shows the character "ü"
in a char(40) column.

Questions:

1. Is a "ü" allowed in a SQL_ASCII database and a column of char(40)?
2. If so, is this a JDBC bug?
3. If not, is this a PostgreSQL bug, allowing a disallowed character in
a column?

TIA

Joost

Re: Is this error correct/possible?

From
Marc Herbert
Date:
Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:

> I have a database created in PostgreSQL 8.0.3 that was filled from a
> PostgreSQL 7.4.7 database backup. Both databases were created with
> SQL_ASCII.
>
> Running a query against one of the tables gives this error (with the
> pg74.216.jdbc3.jar and postgresql-8.0-312.jdbc3.jar drivers):
>
> "Invalid character data was found.  This is most likely caused by stored
> data containing characters that are invalid for the character set the
> database was created in.  The most common example of this is storing
> 8bit data in a SQL_ASCII database."

Could you copy/paste the complete exception and stacktrace?


> Inspection of the row (using pgadmin3) shows the character "ü"
> in a char(40) column.
>
> Questions:
>
> 1. Is a "ü" allowed in a SQL_ASCII database and a column of char(40)?

From a very recent discussion on pgsql-odbc (check the archive), it is
not allowed, or not allowed "anymore" as far as I understood. That is
quite sensible, considering that ü is not ASCII: how would you sort it,
uppercase it, etc.? I could not tell whether this was recently enforced
by the ODBC driver, the engine, or both; check the archive.


> 2. If so, is this a JDBC bug?

The stack trace and similar debugging output would help track down what
JDBC is doing here.


> 3. If not, is this a PostgreSQL bug, allowing a non-allowed character in
> a column?

Maybe it was allowed in 7.4.7 but not anymore in 8.0.3?


Re: Is this error correct/possible?

From
Kris Jurka
Date:

On Tue, 23 Aug 2005, Joost Kraaijeveld wrote:

> I have a database created in PostgreSQL 8.0.3 that was filled from a
> PostgreSQL 7.4.7 database backup. Both databases were created with
> SQL_ASCII.
>
> "Invalid character data was found.  This is most likely caused by stored
> data containing characters that are invalid for the character set the
> database was created in.  The most common example of this is storing
> 8bit data in a SQL_ASCII database."
>
> Inspection of the row (using pgadmin3) shows the character "ü"
> in a char(40) column.
>
> Questions:
>
> 1. Is a "ü" allowed in a SQL_ASCII database and a column of char(40)?

It is allowed to be stored in the database because SQL_ASCII is not a real
encoding.  SQL_ASCII allows you to store anything you want and doesn't
require you to tell the server what character set it actually is.  The
problem is on the return end: the JDBC driver asks the server to always
return data in UTF-8 by setting client_encoding appropriately.  The
server has no idea what the original encoding of the data was, so it has
no means of converting it to Unicode.  It may happen to look like
u-double-dot in your particular pgadmin3 client's encoding, but if that
client's encoding were different it would show up as a different character.
This is why the JDBC driver bails out instead of just picking a
random character.
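The driver's refusal can be reproduced outside PostgreSQL with a strict UTF-8 decoder. A minimal sketch (the class name is made up; 0xFC is simply the LATIN1 byte for "ü"):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        // A LATIN1 client stores "ü" as the single byte 0xFC.
        byte[] stored = {(byte) 0xFC};

        // A strict UTF-8 decoder (REPORT on malformed input, which is
        // also what the JDBC driver effectively does) refuses to guess.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(stored));
            System.out.println("decoded");
        } catch (CharacterCodingException e) {
            System.out.println("malformed UTF-8: " + e.getClass().getSimpleName());
        }
    }
}
```

With REPORT set, the decoder throws instead of substituting a replacement character, which is the same trade-off Kris describes the driver making.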

> 2. If so, is this a JDBC bug?

No.  The JDBC documentation clearly states not to choose a SQL_ASCII
database for your data.

http://jdbc.postgresql.org/documentation/80/your-database.html

> 3. If not, is this a PostgreSQL bug, allowing a non-allowed character in
> a column?
>

This is how the SQL_ASCII encoding works, for better or worse (mostly
worse).  The problem is that you've likely had two different clients
connect with different client_encodings, which ends up storing two
different encodings in the database; that will then break the other
client when it tries to display the data.
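The mixed-encoding scenario can be sketched in miniature (class name made up): the same character, sent by clients using two different client_encodings, arrives as two different byte sequences, and SQL_ASCII stores whichever bytes arrive verbatim.

```java
import java.nio.charset.StandardCharsets;

public class TwoClients {
    public static void main(String[] args) {
        String s = "\u00FC";  // "ü"
        // A LATIN1 client sends one byte...
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1)));
        // ...while a UTF-8 client sends two; SQL_ASCII keeps both as-is.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));
    }

    static String hex(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (byte x : b) sb.append(String.format("%02x", x));
        return sb.toString();
    }
}
```

Once both byte sequences sit in the same column, no single client_encoding can render the whole table correctly.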

Kris Jurka

Re: Is this error correct/possible?

From
Joost Kraaijeveld
Date:
On Tue, 2005-08-23 at 15:42 +0200, Marc Herbert wrote:
> Could you copy/paste the complete exception and stacktrace?

Invalid character data was found.  This is most likely caused by stored
data containing characters that are invalid for the character set the
database was created in.  The most common example of this is storing
8bit data in a SQL_ASCII database.
org.postgresql.util.PSQLException: Invalid character data was found.
This is most likely caused by stored data containing characters that are
invalid for the character set the database was created in.  The most
common example of this is storing 8bit data in a SQL_ASCII database.
    at
org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:1904)
    at
org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:2203)
    at nl.muntpost.MuntpostApp.main(MuntpostApp.java:42)
Caused by: java.io.IOException: UTF-8 string representation was
truncated
    at org.postgresql.core.Encoding.decodeUTF8(Encoding.java:304)
    at org.postgresql.core.Encoding.decode(Encoding.java:179)
    at org.postgresql.core.Encoding.decode(Encoding.java:193)
    at
org.postgresql.jdbc2.AbstractJdbc2ResultSet.getString(AbstractJdbc2ResultSet.java:1900)
    ... 2 more

> > Inspection of the row (using pgadmin3) shows the character "ü"
> > in a char(40) column.
> >
> > Questions:
> >
> > 1. Is a "ü" allowed in a SQL_ASCII database and a column of char(40)?
>
> From a very recent discussion on pgsql-odbc (check the archive), it is
> not allowed, or not allowed "anymore" as far as I understood. That is
> quite sensible, considering that ü is not ASCII: how would you sort it,
> uppercase it, etc.? I could not tell whether this was recently enforced
> by the ODBC driver, the engine, or both; check the archive.
OK, this is what I expected.
>
> > 2. If so, is this a JDBC bug?
>
> The stack trace and similar debugging output would help track down
> what JDBC is doing here.
See above.

>
> > 3. If not, is this a PostgreSQL bug, allowing a non-allowed character in
> > a column?
>
> Maybe it was allowed in 7.4.7 but not anymore in 8.0.3?
From another mail I understood that this is just the way it is.

Joost

Re: Is this error correct/possible?

From
Kris Jurka
Date:

On Tue, 23 Aug 2005, Joost Kraaijeveld wrote:

> What if you don't have a choice? E.g. the database is already in
> SQL_ASCII. Can I convert the database from SQL_ASCII to UNICODE?
>

Yes, but this requires a dump and restore.  Additionally since you have
some non-unicode data in there, it may require running something like
iconv over the data before reloading it.  This can be especially messy if
you've got multiple encodings in there already.  It all depends on the
uniformity of your existing data.
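What an iconv step would do to such a dump can be sketched as follows (class name made up; LATIN1 is a hypothetical guess here, your data may be in a different encoding): reinterpret the stored bytes under the encoding you believe they are in, then re-encode them as UTF-8. This illustrates the per-byte transformation only, not a migration script.

```java
import java.nio.charset.StandardCharsets;

public class ReencodeSketch {
    public static void main(String[] args) {
        // Bytes as dumped from the SQL_ASCII database; 0xFC is "ü" in LATIN1
        // (an assumption: the dump could contain any encoding).
        byte[] dumped = {(byte) 0xFC};

        // iconv -f LATIN1 -t UTF-8 amounts to: decode as LATIN1, encode as UTF-8.
        String text = new String(dumped, StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.printf("%d byte(s) -> %d byte(s): %02x %02x%n",
                dumped.length, utf8.length, utf8[0] & 0xFF, utf8[1] & 0xFF);
    }
}
```

The catch Kris mentions is visible here: the transformation is only correct if the source encoding you assume matches what each row actually contains.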

Kris Jurka

Re: Is this error correct/possible?

From
Joost Kraaijeveld
Date:
On Tue, 2005-08-23 at 09:34 -0500, Kris Jurka wrote:
> It is allowed to be stored in the database because SQL_ASCII is not a real
> encoding.  SQL_ASCII allows you to store anything you want and doesn't
> require you to tell the server what character set it actually is.  The
> problem is on the return end: the JDBC driver asks the server to always
> return data in UTF-8 by setting the client_encoding appropriately.  The
> server has no idea what the original encoding of the data was, so it has
> no means of converting it to unicode.  It may happen to look like
> u-double-dot in your particular pgadmin3 client's encoding, but if that
> client's encoding was different it would show up as a different character.
> This is why the JDBC driver bails out instead of just picking a
> random character.
OK.

> > 2. If so, is this a JDBC bug?
>
> No.  The JDBC documentation clearly states not to choose a SQL_ASCII
> database for your data.
>
> http://jdbc.postgresql.org/documentation/80/your-database.html
What if you don't have a choice? E.g. the database is already in SQL_ASCII.
Can I convert the database from SQL_ASCII to UNICODE?

TIA

Joost

Re: Is this error correct/possible?

From
Joost Kraaijeveld
Date:
On Tue, 2005-08-23 at 09:55 -0500, Kris Jurka wrote:
> Yes, but this requires a dump and restore.  Additionally since you have
> some non-unicode data in there, it may require running something like
> iconv over the data before reloading it.  This can be especially messy if
> you've got multiple encodings in there already.  It all depends on the
> uniformity of your existing data.
Can you give me a pointer to where I can find information about such a
procedure?

TIA

Joost

Re: Is this encoding error correct/possible?

From
Marc Herbert
Date:
Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:

> On Tue, 2005-08-23 at 09:55 -0500, Kris Jurka wrote:
>> Yes, but this requires a dump and restore.  Additionally since you have
>> some non-unicode data in there, it may require running something like
>> iconv over the data before reloading it.  This can be especially messy if
>> you've got multiple encodings in there already.  It all depends on the
>> uniformity of your existing data.

> Can you give me a pointer to where I can find information about such a
> procedure?

GNU recode is a simple but great tool for converting text files; you may
find it useful.  It seems to be similar to iconv in purpose (I don't
know iconv).


Re: Is this error correct/possible?

From
"Kevin Grittner"
Date:
My client has a standard that only a subset of the 7-bit ASCII character
set is allowed in character columns in their database (decimal 32 to 126
in most columns, decimal 10 also allowed in some).  They would prefer to
see exceptions from the JDBC driver on attempts to insert or retrieve any
character outside the ASCII range rather than have it silently written or
returned.

In this particular case, do you see a problem with using the SQL_ASCII encoding?

Thanks,

-Kevin


>>> Kris Jurka <books@ejurka.com> 08/23/05 9:34 AM >>>

The JDBC documentation clearly states not to choose a SQL_ASCII
database for your data.

http://jdbc.postgresql.org/documentation/80/your-database.html


Re: Is this error correct/possible?

From
Kris Jurka
Date:

On Thu, 25 Aug 2005, Kevin Grittner wrote:

> My client has a standard that only a subset of the 7 bit ASCII character
> set is to be allowed in character columns in their database (decimal 32
> to 126 in most columns, decimal 10 also allowed in some).  They would
> prefer to see exceptions from the JDBC driver on attempts to insert or
> retrieve any character outside the ASCII range than to have it silently
> written or returned.

The JDBC driver will throw an exception upon reading data that it
receives that is not valid UTF-8.  This will not detect all problems
because it always sends valid UTF-8 data to the server and expects it to
convert it to the server encoding or complain.  This means that if
you've got a non-ASCII character, the JDBC driver will send it to the
database and read it back from the database as UTF-8 without complaint.
Problems arise when:

1) Another client adds data in another encoding to the database.  Then
the JDBC driver tries to read it, finds it isn't UTF-8, and bails out.

2) Sending non-ASCII data in as UTF-8 encodes one character as multiple
bytes.  With a real encoding the server realizes this and converts the
multiple bytes back into one character for checks like VARCHAR(N).
SQL_ASCII stores the multiple bytes as multiple characters, so this
check can fail when you don't expect it to.
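Point 2 can be seen in miniature (class name made up): one character becomes two UTF-8 bytes, so a length check that counts bytes, as SQL_ASCII effectively does, disagrees with one that counts characters.

```java
import java.nio.charset.StandardCharsets;

public class LengthCheck {
    public static void main(String[] args) {
        String s = "\u00FC";  // "ü", a single character
        int chars = s.length();                                 // what VARCHAR(N) should count
        int bytes = s.getBytes(StandardCharsets.UTF_8).length;  // what SQL_ASCII ends up counting
        System.out.println(chars + " char, " + bytes + " bytes");
    }
}
```

So a VARCHAR(40) column in a SQL_ASCII database may reject a 40-character string whenever any of its characters take more than one UTF-8 byte.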

> In this particular case, do you see a problem with using the SQL_ASCII
> encoding?
>

The question is, "why would you want to?"  Why not select a real encoding
that is a superset of ASCII, such as LATIN1?  It provides much more
reasonable fallback behavior if you ever do add non-ASCII data.

Kris Jurka