Discussion: utf8 issue
Hello everybody. I recently converted my db from ASCII encoding to UTF8 (we have a lot of spanish-speaking members, and need the extra character support). Everything was working great, but I noticed this error, while trying to COPY one of our tables from a csv.

ERROR: invalid byte sequence for encoding "UTF8": 0xb9
SQL state: 22021
Hint: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
Context: COPY transaction_import, line 59358

Is there anything I can do to look for these sorts of errors and fix them? I'm really not great with character encodings and I'm not sure where to go on this. Any help?

--
Tom Hart
IT Specialist
Cooperative Federal
723 Westcott St.
Syracuse, NY 13210
(315) 471-1116 ext. 202
(315) 476-0567 (fax)
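One way to look for these errors before running COPY is to scan the file for lines that do not decode as UTF-8. The following is a minimal sketch, not a definitive tool: the function name `find_invalid_utf8` and the sample data are made up for illustration, and it assumes one CSV record per physical line (a quoted field containing a newline would make its line numbers drift from the ones COPY reports).

```python
def find_invalid_utf8(lines):
    """Yield (line_number, byte_offset, offending_byte) for each line that is not valid UTF-8.

    `lines` is any iterable of bytes objects, e.g. a file opened in binary mode.
    """
    for line_number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first byte the decoder rejected.
            yield line_number, exc.start, raw[exc.start]


# Small in-memory demonstration; 0xb9 on its own is not a valid UTF-8 sequence.
sample = [b"id,name\n", b"1,Jos\xc3\xa9\n", b"2,x\xb9\n"]
bad_lines = list(find_invalid_utf8(sample))
for line_number, offset, byte in bad_lines:
    print(f"line {line_number}, offset {offset}: 0x{byte:02x}")
```

For a real file, the same function can be fed a binary file handle, for example `open("transaction_import.csv", "rb")` (the file name here is hypothetical). Only the first bad byte on each line is reported.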
Tom Hart wrote:
> Hello everybody. I recently converted my db from ASCII encoding to UTF8
> (we have a lot of spanish-speaking members, and need the extra character
> support). Everything was working great, but I noticed this error, while
> trying to COPY one of our tables from a csv.
>
> ERROR: invalid byte sequence for encoding "UTF8": 0xb9

Not converted, I fear.

Why not use latin9 rather than utf8 - that should cover everything and is probably what it's in anyway.

> Is there anything I can do to look for these sorts of errors and fix
> them? I'm really not great with character encodings and I'm not sure
> where to go on this. Any help?

Google a bit for iconv (and postgresql) to get your characters valid.

--
Richard Huxton
Archonet Ltd
Tom Hart wrote:
> Hello everybody. I recently converted my db from ASCII encoding to UTF8
> (we have a lot of spanish-speaking members, and need the extra character
> support). Everything was working great, but I noticed this error, while
> trying to COPY one of our tables from a csv.
>
> ERROR: invalid byte sequence for encoding "UTF8": 0xb9
> SQL state: 22021
> Hint: This error can also happen if the byte sequence does not match the
> encoding expected by the server, which is controlled by
> "client_encoding".
> Context: COPY transaction_import, line 59358

Did you try setting client_encoding beforehand? I think a simple

PGOPTIONS="-c client_encoding=latin9" pg_restore ...

should suffice.

Now, if you have mixed UTF8 and Latin1/Latin9 data, you are hosed (meaning you'll have to wade through the stuff and figure out what data is in what encoding, and fix it).

--
Alvaro Herrera
http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
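For the mixed UTF8 and Latin1/Latin9 case, a common heuristic is to keep every line that already decodes as UTF-8 and convert only the rest. The sketch below illustrates that idea under stated assumptions: `to_utf8` is a hypothetical helper, and treating every non-UTF-8 line as Latin-9 is a guess rather than a detection. A Latin-9 line whose bytes happen to form valid UTF-8 would slip through unconverted, so the results still need a manual check.

```python
def to_utf8(raw, fallback="iso8859_15"):
    """Return `raw` as UTF-8 bytes, plus the encoding it was read as.

    Lines that already decode as UTF-8 are returned unchanged; anything
    else is assumed (not detected) to be in `fallback`.
    """
    try:
        raw.decode("utf-8")
        return raw, "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8"), fallback


# The first line is already UTF-8, the second is the same text in Latin-9.
mixed = [b"Jos\xc3\xa9\n", b"Jos\xe9\n"]
converted = [to_utf8(line) for line in mixed]
for data, source_encoding in converted:
    print(source_encoding, data)
```

Returning the encoding alongside the data makes it easy to log which lines were rewritten and review them afterwards.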
Richard Huxton wrote:
> Tom Hart wrote:
>> Hello everybody. I recently converted my db from ASCII encoding to
>> UTF8 (we have a lot of spanish-speaking members, and need the extra
>> character support). Everything was working great, but I noticed this
>> error, while trying to COPY one of our tables from a csv.
>>
>> ERROR: invalid byte sequence for encoding "UTF8": 0xb9
>
> Not converted, I fear.
>
> Why not use latin9 rather than utf8 - that should cover everything and
> is probably what it's in anyway.

I was specifically requested to use UTF-8, and the work's already been done, so converting again is not my favorite option.

>> Is there anything I can do to look for these sorts of errors and fix
>> them? I'm really not great with character encodings and I'm not sure
>> where to go on this. Any help?
>
> Google a bit for iconv (and postgresql) to get your characters valid.

I already have a php script that does some data scrubbing before the copy. I added this line to the script and things seem to be working better now

$line = iconv("ISO-8859-1", "UTF-8", $line);

Thanks for the help guys :-)

--
Tom Hart
IT Specialist
Cooperative Federal
723 Westcott St.
Syracuse, NY 13210
(315) 471-1116 ext. 202
(315) 476-0567 (fax)
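One caveat with an unconditional ISO-8859-1 to UTF-8 conversion like the iconv line above: ISO-8859-1 assigns a character to every byte value, so the conversion never fails, even when the input is already UTF-8. In that case it silently double-encodes the text. The sketch below shows the effect in Python rather than PHP; `latin1_to_utf8` is a hypothetical stand-in for the iconv call, not the original script.

```python
def latin1_to_utf8(raw):
    """Reinterpret ISO-8859-1 bytes as text and re-encode them as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


once = latin1_to_utf8(b"Jos\xe9")   # Latin-1 input, converted correctly
twice = latin1_to_utf8(once)        # already-UTF-8 input, converted a second time
print(once)
print(twice)
```

The second result is valid UTF-8, so COPY would accept it without complaint, but it displays as "JosÃ©" instead of "José". This only matters if some of the input is already UTF-8 when it reaches the script.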
On 2008-02-26 13:04, Tom Hart wrote:
> I already have a php script that does some data scrubbing before the
> copy. I added this line to the script and things seem to be working
> better now
>
> $line = iconv("ISO-8859-1", "UTF-8", $line);
>
> Thanks for the help guys :-)

Read up on the difference between PostgreSQL's server_encoding and client_encoding. The "server_encoding" is how the data is stored in the server, and can be anything compatible (UTF-8, ISO-8859-1, whatever will hold your character set). The "client_encoding" is how the incoming (or outgoing) data is treated/assumed. PostgreSQL does the necessary conversion for you.

You can set or change the "client_encoding" in several ways, which gives you a lot of flexibility. In order of increasing priority:

1. You can set it as the default for any database (see ALTER DATABASE ...).
2. You can set it in an environment variable, which means the client utilities (and I believe the libraries) use that.
3. In psql, you can set it with the "\encoding" command (which applies to the session or until changed), or with "SET [SESSION | LOCAL] client_encoding TO ...", which will set it for the session or just the current transaction.

I just went through this, and while I initially used "iconv" to get up and running, I've removed most of those calls from my scripts and just use the PostgreSQL conversion instead.

--
Mail to my list address MUST be sent via the mailing list.
All other mail to my list address will bounce.
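As an illustration of the environment-variable route, a script can set libpq's PGCLIENTENCODING variable before calling psql, so that the server converts the incoming data itself. This is only a sketch: the helper `psql_copy_invocation`, the database name `mydb`, and the file name `import.csv` are hypothetical, and LATIN1 is assumed to be the file's real encoding. The helper builds the command and environment without executing anything, and it interpolates the table and file names unescaped, so it is not safe for untrusted input.

```python
import os


def psql_copy_invocation(table, csv_path, client_encoding, dbname):
    """Build the argv and environment for a client-side \\copy via psql.

    Nothing is executed here; the result is meant for subprocess.run().
    """
    env = dict(os.environ)
    # libpq reads PGCLIENTENCODING and uses it as the session's client_encoding.
    env["PGCLIENTENCODING"] = client_encoding
    meta_command = f"\\copy {table} FROM '{csv_path}' WITH CSV"
    argv = ["psql", "-d", dbname, "-c", meta_command]
    return argv, env


argv, env = psql_copy_invocation("transaction_import", "import.csv", "LATIN1", "mydb")
print(argv)
print(env["PGCLIENTENCODING"])
```

Running it would then be a matter of `subprocess.run(argv, env=env, check=True)` on a machine where psql can reach the database.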