Обсуждение: JDBC to load UTF8@psql to latin1@mysql
Good morning, Is there a simple way to load UTF8 data in psql to mysql(with latin1 encoding) through JDBC? Thanks a lot! Emi
> Is there a simple way to load UTF8 data in psql to mysql(with latin1 > encoding) through JDBC? > JAVA codes work for most of characters, but not "-È". Someone knows why the following codes cannot load "-È" to mysql@latin1? Thanks a lot! -- public static String utf8_to_latin1(String str) throws Exception { try { String stringToConvert = str; byte[] convertStringToByte = stringToConvert.getBytes("UTF-8"); return new String(convertStringToByte, "ISO-8859-1"); }catch(Exception e) { log.error("utf8_to_latin1 Error: " + e.getMessage()); log.error(e); throw e; } }
I don't think your Java code does what you think it does. You should read some more about how Java handles string encodings. Here is a method I wrote some years ago that might also help you. It converts streams, not strings, but what you need should be pretty close (and simpler):
/**
* Interprets in according to encIn, and converts it to encOut,
* writing to out. Allocates buffer for the buffer size.
*
* @param encIn The input encoding.
* @param encOut The output encoding.
* @param in The data to convert.
* @param out Where to send the converted data.
* @param buffer The size of the buffer or 0 for the default.
*
* @throws IOException
*/
public void run(String encIn, String encOut, InputStream in, OutputStream out, int buffer) throws IOException {
Reader r = null;
Writer w = null;
int len;
char[] b;
try {
if (buffer > 0) {
r = new BufferedReader(new InputStreamReader(in, encIn), buffer);
w = new BufferedWriter(new OutputStreamWriter(out, encOut), buffer);
} else {
r = new BufferedReader(new InputStreamReader(in, encIn));
w = new BufferedWriter(new OutputStreamWriter(out, encOut));
buffer = DEFAULT_BUFFER_SIZE;
}
b = new char[buffer];
while ((len = r.read(b, 0, buffer)) != -1) {
w.write(b, 0, len);
}
} finally {
try {
if (r != null) r.close();
} finally {
if (w != null) w.close();
}
}
}
Btw, none of this has anything to do with Postgres. :-)
Paul
--
_________________________________
Pulchritudo splendor veritatis.
/**
* Interprets in according to encIn, and converts it to encOut,
* writing to out. Allocates buffer for the buffer size.
*
* @param encIn The input encoding.
* @param encOut The output encoding.
* @param in The data to convert.
* @param out Where to send the converted data.
* @param buffer The size of the buffer or 0 for the default.
*
* @throws IOException
*/
public void run(String encIn, String encOut, InputStream in, OutputStream out, int buffer) throws IOException {
Reader r = null;
Writer w = null;
int len;
char[] b;
try {
if (buffer > 0) {
r = new BufferedReader(new InputStreamReader(in, encIn), buffer);
w = new BufferedWriter(new OutputStreamWriter(out, encOut), buffer);
} else {
r = new BufferedReader(new InputStreamReader(in, encIn));
w = new BufferedWriter(new OutputStreamWriter(out, encOut));
buffer = DEFAULT_BUFFER_SIZE;
}
b = new char[buffer];
while ((len = r.read(b, 0, buffer)) != -1) {
w.write(b, 0, len);
}
} finally {
try {
if (r != null) r.close();
} finally {
if (w != null) w.close();
}
}
}
Btw, none of this has anything to do with Postgres. :-)
Paul
On Wed, Dec 12, 2012 at 10:19 AM, Emi Lu <emilu@encs.concordia.ca> wrote:
JAVA codes work for most of characters, but not "-È". Someone knows why the following codes cannot load "-È" to mysql@latin1?Is there a simple way to load UTF8 data in psql to mysql(with latin1
encoding) through JDBC?
Thanks a lot!
--
public static String utf8_to_latin1(String str)
throws Exception
{
try
{
String stringToConvert = str;
byte[] convertStringToByte = stringToConvert.getBytes("UTF-8");
return new String(convertStringToByte, "ISO-8859-1");
}catch(Exception e)
{
log.error("utf8_to_latin1 Error: " + e.getMessage());
log.error(e);
throw e;
}
}
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
--
_________________________________
Pulchritudo splendor veritatis.
Emi Lu wrote on 12.12.2012 17:17: > Good morning, > > Is there a simple way to load UTF8 data in psql to mysql(with latin1 encoding) through JDBC? All you need to do is to query the source database, then use ResultSet.getString() to obtain the data and use a PreparedStatementand PreparedStatement.setString() to insert/update the data on the target database. The JDBC drivers will handle all the conversion. Do NOT manually convert the data. getString() and setString() will do everything correctly. Thomas
>> Is there a simple way to load UTF8 data in psql to mysql(with latin1 >> encoding) through JDBC? > > All you need to do is to query the source database, then use > ResultSet.getString() to obtain the data and use a PreparedStatement and > PreparedStatement.setString() to insert/update the data on the target > database. > > The JDBC drivers will handle all the conversion. > Do NOT manually convert the data. > > getString() and setString() will do everything correctly. I am not using stmt directly but through Mybatis for all db transactions. So, this approach will not work. Thanks. -- Emi
Em 13/12/2012 12:00, Emi Lu escreveu: > >>> Is there a simple way to load UTF8 data in psql to mysql(with latin1 >>> encoding) through JDBC? >> >> All you need to do is to query the source database, then use >> ResultSet.getString() to obtain the data and use a PreparedStatement and >> PreparedStatement.setString() to insert/update the data on the target >> database. >> >> The JDBC drivers will handle all the conversion. >> Do NOT manually convert the data. >> >> getString() and setString() will do everything correctly. > > I am not using stmt directly but through Mybatis for all db transactions. Should not this a Mybatis problem instead? As stated, JDBC drivers does all the conversion needed automatically, but if you have a middleware messing with your enconding, then the problem is the middleware, not databases or drivers. Edson > > So, this approach will not work. > > Thanks. > -- > Emi > > > >
> I don't think your Java code does what you think it does. You should > read some more about how Java handles string encodings. Here is a method > I wrote some years ago that might also help you. It converts streams, > not strings, but what you need should be pretty close (and simpler): > /** > * Interprets in according to encIn, and converts it to encOut, > * writing to out. Allocates buffer for the buffer size. > * @param encIn The input encoding. > * @param encOut The output encoding. > * @param in The data to convert. > * @param out Where to send the converted data. > * @param buffer The size of the buffer or 0 for the default. > * @throws IOException > */ > public void run(String encIn, String encOut, InputStream in, > OutputStream out, int buffer) throws IOException { > Reader r = null; > Writer w = null; > int len; > char[] b; > try { > if (buffer > 0) { > r = new BufferedReader(new InputStreamReader(in, encIn), buffer); > w = new BufferedWriter(new OutputStreamWriter(out, encOut), > buffer); > } else { > r = new BufferedReader(new InputStreamReader(in, encIn)); > w = new BufferedWriter(new OutputStreamWriter(out, encOut)); > buffer = DEFAULT_BUFFER_SIZE; > } > b = new char[buffer]; > > while ((len = r.read(b, 0, buffer)) != -1) { > w.write(b, 0, len); > } > } finally { > try { > if (r != null) r.close(); > } finally { > if (w != null) w.close(); > } > } > } > Btw, none of this has anything to do with Postgres. :-) Thank you for the code first. I will try it later. The problem I had as mentioned in the subject is: (1) psql@utf8 (2) mysql@latin1 When I load data from (1) to (2) through Mybatis, french characters could not be mapped correctly in (2). I was thinking that psql may have methods could help this. But it seems that I have to try from java coding side :-( -- Emi
> The JDBC drivers will handle all the conversion.
> Do NOT manually convert the data.
Yeah, I agree this is the right answer here, since you're using JDBC. By the time you get a String from the MySQL driver, it's already in Java's 2-bytes-per-char format. And the Postgres driver will deal with the encoding on the output side. So the code I provided won't help you. I'm afraid I don't know about Mybatis, but if it's built on JDBC I'd think you've just got a configuration problem with what encoding the client expects at either end.
Paul
--
_________________________________
Pulchritudo splendor veritatis.
> Do NOT manually convert the data.
Yeah, I agree this is the right answer here, since you're using JDBC. By the time you get a String from the MySQL driver, it's already in Java's 2-bytes-per-char format. And the Postgres driver will deal with the encoding on the output side. So the code I provided won't help you. I'm afraid I don't know about Mybatis, but if it's built on JDBC I'd think you've just got a configuration problem with what encoding the client expects at either end.
Paul
--
_________________________________
Pulchritudo splendor veritatis.
Emi Lu wrote on 13.12.2012 15:00: > >>> Is there a simple way to load UTF8 data in psql to mysql(with latin1 >>> encoding) through JDBC? >> >> All you need to do is to query the source database, then use >> ResultSet.getString() to obtain the data and use a PreparedStatement and >> PreparedStatement.setString() to insert/update the data on the target >> database. >> >> The JDBC drivers will handle all the conversion. >> Do NOT manually convert the data. >> >> getString() and setString() will do everything correctly. > > I am not using stmt directly but through Mybatis for all db transactions. > > So, this approach will not work. Then it's a problem of that MyBatis thing. Thomas
Paul Jungwirth <pj@illuminatedcomputing.com> writes: > Yeah, I agree this is the right answer here, since you're using JDBC. By > the time you get a String from the MySQL driver, it's already in Java's > 2-bytes-per-char format. And the Postgres driver will deal with the > encoding on the output side. So the code I provided won't help you. I'm > afraid I don't know about Mybatis, but if it's built on JDBC I'd think > you've just got a configuration problem with what encoding the client > expects at either end. I was wondering if the problem wasn't lots simpler than that. Is the character the OP is trying to convert actually part of LATIN1? regards, tom lane
>> Yeah, I agree this is the right answer here, since you're using JDBC. By >> the time you get a String from the MySQL driver, it's already in Java's >> 2-bytes-per-char format. And the Postgres driver will deal with the >> encoding on the output side. So the code I provided won't help you. I'm >> afraid I don't know about Mybatis, but if it's built on JDBC I'd think >> you've just got a configuration problem with what encoding the client >> expects at either end. > From: Tom lane > I was wondering if the problem wasn't lots simpler than that. Is the > character the OP is trying to convert actually part of LATIN1? First, the data loading is from psql(unix) to mysql(Unix). Second, DB transactions are through JAVA+MyBatis. Steps: (1) Read utf8.data@psql from psql.xml into java.ArrayList<Bean> (2) For each list.rec, save into mysql@latin through mysql.xml Tried "jdbc:mysql://.../mysql_db?...unicode...encoding...=ISO..." No. This does not work. For now, through the following method, all letters are correctly transformed except "È". What does OP stand for? Emi -- public static String utf8_to_latin1(String str) throws Exception { try { if(str.indexOf("È")>=0) { str = str.replaceAll("È", "E"); } byte[] convertStringToByte = str.getBytes("UTF-8"); str = new String(convertStringToByte, "ISO-8859-1"); return str; }catch(Exception e) { log.error("utf8_to_latin1 Error: " + e.getMessage()); log.error(e); throw e; } }
On 12/14/2012 06:06 AM, Emi Lu wrote: > > What does OP stand for? Original Poster. > > Emi > -- > -- Adrian Klaver adrian.klaver@gmail.com
On 12/14/2012 09:49 AM, Adrian Klaver wrote: > Original Poster Thanks. And more info: Mysql +--------------------------+--------+ | Variable_name | Value | +--------------------------+--------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | latin1 | | character_set_system | utf8 | +--------------------------+--------+ SHOW VARIABLES LIKE "character\_set\_database"; +------------------------+--------+ | Variable_name | Value | +------------------------+--------+ | character_set_database | latin1 | +------------------------+--------+ Psql everywhere is utf8. Where could be the problem located? Also by using the java encoding methods, all characters except "È" are transformed correctly. Thanks alot! Emi -- public static String utf8_to_latin1(String str) throws Exception { try { if(str.indexOf("È")>=0) { str = str.replaceAll("È", "E"); } byte[] convertStringToByte = str.getBytes("UTF-8"); str = new String(convertStringToByte, "ISO-8859-1"); return str; }catch(Exception e) { log.error("utf8_to_latin1 Error: " + e.getMessage()); log.error(e); throw e; } }
Emi Lu <emilu@encs.concordia.ca> writes: > For now, through the following method, all letters are correctly > transformed except "�". Meh. That character renders as \310 in your mail, which is not an assigned code in ISO 8859-1. The numerically corresponding Unicode value would be U+0090, which is an unspecified control character. I surmise that your source data is not actually either Unicode or ISO 8859-1, but one of the random "extended" character sets that Microsoft has loosed upon the world, perhaps windows-1252 http://en.wikipedia.org/wiki/Windows-1252 The conversion code that you're using is quite right to reject the character as not being valid LATIN1. What you need to do is figure out what the data actually is and correct its encoding. It's evidently stored wrong in the UTF8 data, if you believe that this code is a letter. regards, tom lane
I wrote: > Meh. That character renders as \310 in your mail, which is not an > assigned code in ISO 8859-1. The numerically corresponding Unicode > value would be U+0090, which is an unspecified control character. Oh, scratch that, apparently I can't do hex/octal arithmetic in my head first thing in the morning. It's really U+00C8 which is perfectly valid. I can't see a reason why that character and only that character would be problematic --- have you done systematic testing to confirm that that's the only should-be-LATIN1 character that fails? regards, tom lane
On 12/14/2012 07:35 AM, Tom Lane wrote: > I wrote: >> Meh. That character renders as \310 in your mail, which is not an >> assigned code in ISO 8859-1. The numerically corresponding Unicode >> value would be U+0090, which is an unspecified control character. > > Oh, scratch that, apparently I can't do hex/octal arithmetic in my > head first thing in the morning. It's really U+00C8 which is perfectly > valid. I can't see a reason why that character and only that character > would be problematic --- have you done systematic testing to confirm > that that's the only should-be-LATIN1 character that fails? This is where I am confused, in one of the original posts the OP said: "JAVA codes work for most of characters, but not "-È"." > > regards, tom lane > > -- Adrian Klaver adrian.klaver@gmail.com
Hello All, >> Meh. That character renders as \310 in your mail, which is not an >> assigned code in ISO 8859-1. The numerically corresponding Unicode >> value would be U+0090, which is an unspecified control character. > > Oh, scratch that, apparently I can't do hex/octal arithmetic in my > head first thing in the morning. It's really U+00C8 which is perfectly > valid. I can't see a reason why that character and only that character > would be problematic --- have you done systematic testing to confirm > that that's the only should-be-LATIN1 character that fails? Finally, the problem is resolved: SHOW VARIABLES LIKE "character\_set\_%"; +--------------------------+--------+ | Variable_name | Value | +--------------------------+--------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | latin1 | | character_set_system | utf8 | -- here mysql uses utf8 for character_set_system. Change my java code to: ======================== public static String utf8_to_mysql(String str) throws Exception { try { byte[] convertStringToByte = str.getBytes("UTF-8"); str = new String(convertStringToByte, "UTF-8"); return str; }catch(Exception e) { log.error("utf8_to_latin1 Error: " + e.getMessage()); log.error(e); throw e; } Have to explicitly specify "UTF-8", but cannot leave as empty. Larry's comments(from MyBatis mailing list) and I tried both "from/to" by "UTF8". It works. This is still little bit strange to me. But it works! >> My guess is that it's correct but the client you're using is messing >> it up. If not, then you need to look at your connection strings to >> the 2 databases to make sure they are handling the encodings >> correctly.Unless you set them specifically, I suspect they are using >> your default system encoding - so both may be using utf8 or iso8859. Thank you very much for all of your help for this! Emi
On 12/14/2012 01:37 PM, Emi Lu wrote: > Hello All, >>> Meh. That character renders as \310 in your mail, which is not an >>> assigned code in ISO 8859-1. The numerically corresponding Unicode >>> value would be U+0090, which is an unspecified control character. >> >> Oh, scratch that, apparently I can't do hex/octal arithmetic in my >> head first thing in the morning. It's really U+00C8 which is perfectly >> valid. I can't see a reason why that character and only that character >> would be problematic --- have you done systematic testing to confirm >> that that's the only should-be-LATIN1 character that fails? > > Finally, the problem is resolved: > > SHOW VARIABLES LIKE "character\_set\_%"; > +--------------------------+--------+ > | Variable_name | Value | > +--------------------------+--------+ > | character_set_client | latin1 | > | character_set_connection | latin1 | > | character_set_database | latin1 | > | character_set_filesystem | binary | > | character_set_results | latin1 | > | character_set_server | latin1 | > | character_set_system | utf8 | -- here mysql uses utf8 for > character_set_system. Another try is that if I change my client tool encoding set, I do not even need my java transition. All right, good to learn from this. Emi