Discussion: not valid character for Unicode
Hi,

I'm trying to upgrade from 7.4 -> 8.1 but it is failing with Unicode errors. The offending character is the Greek character mu (often used for micro). Here is an offending string: "BµG@S" (in case it doesn't appear in the email, the mu is between the B and the G).

Any ideas why this character is not valid in Unicode?

Thanks for any help,

adam

--
This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.

On Fri, Jun 09, 2006 at 03:59:52PM +0100, Adam Witney wrote:
> I'm trying to upgrade from 7.4 -> 8.1 but it is failing with Unicode
> errors. The offending character is the Greek character mu (often used
> for micro). Here is an offending string "BµG@S" (in case it doesn't
> appear in the email, the mu is between the B and the G)
>
> Any ideas why this character is not valid in Unicode?

It's a valid Unicode character; it's just that you haven't encoded it in Unicode. It's probably in Latin-1. In that case, you need to specify that in the client encoding...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

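To make the distinction between "valid character" and "valid encoding" concrete, here is a minimal Python sketch (not from the thread). It assumes the character in the sample string is U+00B5 MICRO SIGN, which is what Latin-1 holds at byte 0xB5; the helper name `is_valid_utf8` is made up for illustration.

```python
# Sample string from the thread, assuming the "mu" is U+00B5 MICRO SIGN.
text = "B\u00b5G@S"

# The same character is represented by different bytes in each encoding.
latin1_bytes = text.encode("latin-1")  # one byte for the micro sign: 0xB5
utf8_bytes = text.encode("utf-8")      # two bytes for the micro sign: 0xC2 0xB5


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_bytes, is_valid_utf8(latin1_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
```

A lone 0xB5 byte is a UTF-8 continuation byte with nothing before it, so a strict UTF-8 check rejects the Latin-1 bytes even though the character itself exists in Unicode.
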
Martijn van Oosterhout wrote:
> It's a valid Unicode character; it's just that you haven't encoded it in
> Unicode. It's probably in Latin-1. In that case, you need to specify that
> in the client encoding...

Hi Martijn,

Thanks for your quick response.

OK, I am a bit confused by all this encoding stuff... I don't really know how to encode it in Unicode. This is a text string that is extracted from a text file; I just put it in an INSERT statement.

I have to replace fields containing this character with a valid string that will load into 8.1. Do you know how I would do the conversion?

Thanks,

adam

On Fri, Jun 09, 2006 at 04:17:50PM +0100, Adam Witney wrote:
> OK, I am a bit confused by all this encoding stuff... I don't really know
> how to encode it in Unicode. This is a text string that is extracted
> from a text file; I just put it in an INSERT statement.

The database will do the encoding for you; you just have to tell it what encoding your data is in. By default it assumes you're using the same encoding as the backend. So:

    # set client_encoding='latin1';
    -- Now all my strings are considered to be in latin1
    # set client_encoding='sjis';
    -- Now my strings are SJIS
    # set client_encoding='unicode';
    -- Now my strings need to be utf-8

> I have to replace fields containing this character with a valid string that
> will load into 8.1. Do you know how I would do the conversion?

The database will do it for you. Note that the client encoding affects input *and* output. So if you set it to latin1, the database will convert all strings to latin1 before sending them to you...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

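The input-and-output conversion described above can be imitated in a few lines of Python. This is only a toy model of the idea, not how the server is implemented: it assumes a UTF-8 ("UNICODE") database, and the function names `client_to_server` and `server_to_client` are invented for the illustration.

```python
SERVER_ENCODING = "utf-8"  # assumption: the database encoding is UNICODE (UTF-8)


def client_to_server(raw: bytes, client_encoding: str) -> bytes:
    """Interpret incoming bytes in the client encoding, re-encode for storage."""
    return raw.decode(client_encoding).encode(SERVER_ENCODING)


def server_to_client(stored: bytes, client_encoding: str) -> bytes:
    """Convert stored bytes back to the client encoding on output."""
    return stored.decode(SERVER_ENCODING).encode(client_encoding)


raw = b"B\xb5G@S"  # the sample string as Latin-1 bytes

# Declaring the client encoding as Latin-1: input is converted, output is converted back.
stored = client_to_server(raw, "latin-1")
returned = server_to_client(stored, "latin-1")
print(stored, returned)

# Leaving the client encoding as UTF-8: the same bytes are rejected on input.
try:
    client_to_server(raw, "utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)
```

In this model the bytes never change on the client side; only the declared encoding changes, which is why setting `client_encoding` is enough for the load to go through.
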
Adam Witney wrote:
> OK, I am a bit confused by all this encoding stuff... I don't really know
> how to encode it in Unicode. This is a text string that is extracted
> from a text file; I just put it in an INSERT statement.
>
> I have to replace fields containing this character with a valid string that
> will load into 8.1. Do you know how I would do the conversion?

What did you use to extract it from the text file? If you're using a text editor, ensure that it is set to UTF-8.

brian

On Fri, June 9, 2006 11:17 am, Adam Witney wrote:
> I have to replace fields containing this character with a valid string that
> will load into 8.1. Do you know how I would do the conversion?

For migration, you should use pg_dump; it's not clear from your email whether you are doing that. If you typed up some SQL in Windows which you want to load into Postgres, you might try:

    set client_encoding to 'LATIN1';

at the top of your script.

-M

>> I have to replace fields containing this character with a valid string that
>> will load into 8.1. Do you know how I would do the conversion?
>
> The database will do it for you. Note that the client encoding affects
> input *and* output. So if you set it to latin1, the database will
> convert all strings to latin1 before sending them to you...

OK, so my current database (7.4.12) is UNICODE, but from psql when I run this:

    show client_encoding;
     client_encoding
    -----------------
     UNICODE

    SELECT identifier from dba_data_base where bioassay_id = 1291 and identifier ilike '%G@S%';
      identifier
    --------------
     BG@S (0A11)

the mu character is not showing up. So I'm not sure if the database is converting the output? (Sorry, I am probably sounding very dim here!)

Thanks again for your help,

adam

> For migration, you should use pg_dump; it's not clear from your email whether
> you are doing that. If you typed up some SQL in Windows which you want to
> load into Postgres, you might try:
> set client_encoding to 'LATIN1';
> at the top of your script.

Yes, this was how I spotted the problem. If I pg_dump from 7.4 and then try to load into 8.1, these characters cause errors. This data was generated on Windows though, as you say.

On Fri, Jun 09, 2006 at 04:32:35PM +0100, Adam Witney wrote:
> OK, so my current database (7.4.12) is UNICODE, but from psql when I run
> this

<snip>

> SELECT identifier from dba_data_base where bioassay_id = 1291 and
> identifier ilike '%G@S%';
>   identifier
> --------------
>  BG@S (0A11)
>
> the mu character is not showing up. So I'm not sure if the database is
> converting the output?

Is the character actually there? Do a length(identifier) on it to see how many characters there are. When doing an interactive session it's important that the client_encoding matches your display, otherwise you might find it dropping characters or messing up in other ways.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

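The reasoning behind the length check can be sketched in Python: if the character is stored but merely not displayed, the length will be one more than what the terminal shows. The stored value below is an assumption (the displayed text with U+00B5 reinserted after the B); the real row may differ, and how 7.4 counts characters in invalidly encoded data is not something this sketch models.

```python
displayed = "BG@S (0A11)"        # what psql showed in the session above
stored = "B\u00b5G@S (0A11)"     # assumption: the stored value with the mu present

shown_length = len(displayed)
stored_length = len(stored)
stored_bytes = len(stored.encode("utf-8"))  # byte count if stored as UTF-8

print(shown_length, stored_length, stored_bytes)
```

A character count one higher than the visible text would suggest the terminal is dropping the character rather than the database losing it.
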
Martijn van Oosterhout wrote:
> Is the character actually there? Do a length(identifier) on it to see
> how many characters there are. When doing an interactive session it's
> important that the client_encoding matches your display, otherwise you
> might find it dropping characters or messing up in other ways.

Yep, it is there: when I display the data from the application (PHP), it shows the character on the web page. It also causes errors when I dump from 7.4 and try to load into 8.1 (I've read that the UNICODE checking became more stringent in 8).

So basically 8.1 won't accept this character... I'm just not entirely sure what to do about that.

Thanks again for your help,

adam

On Saturday 10 June 2006 05:31, Adam Witney wrote:
> Yep, it is there: when I display the data from the application (PHP), it
> shows the character on the web page. It also causes errors when I dump
> from 7.4 and try to load into 8.1 (I've read that the UNICODE checking
> became more stringent in 8).
>
> So basically 8.1 won't accept this character... I'm just not entirely
> sure what to do about that.

Are you on a Unix/Linux machine? You can dump the file there and run "file dump.sql" to see what type of file it reports. If it reports anything other than a description containing "text" and "utf-8", then you can either edit the dump manually, set the client encoding to whatever was reported, and try restoring it, or you can run "iconv" on the file and see if the conversion to UTF-8 works.

--
Jorge Godoy <jgodoy@gmail.com>

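For anyone without "file" and "iconv" at hand, the check-then-convert step can be approximated in Python. This is a rough sketch, not a replacement for either tool: the function name `convert_dump_to_utf8`, the file names, and the Latin-1 fallback are all assumptions, and the demo runs on a throwaway file rather than a real dump.

```python
import tempfile
from pathlib import Path


def convert_dump_to_utf8(src: Path, dst: Path, fallback: str = "latin-1") -> str:
    """Write a UTF-8 copy of src to dst and return the encoding that was assumed.

    If src already decodes as UTF-8 it is copied unchanged; otherwise every
    byte is interpreted in the fallback encoding (an assumption, not a detection).
    """
    data = src.read_bytes()
    try:
        text = data.decode("utf-8")
        assumed = "utf-8"
    except UnicodeDecodeError:
        text = data.decode(fallback)
        assumed = fallback
    dst.write_bytes(text.encode("utf-8"))
    return assumed


# Demo on a temporary file standing in for the real dump.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "dump.sql"
    dst = Path(tmp) / "dump.utf8.sql"
    src.write_bytes(b"INSERT INTO t VALUES ('B\xb5G@S');\n")
    assumed = convert_dump_to_utf8(src, dst)
    converted = dst.read_bytes()

print(assumed)
print(converted)
```

One caveat: if a dump mixes correctly encoded UTF-8 rows with stray Latin-1 bytes, a whole-file fallback like this would re-encode the good multi-byte sequences as well, so it is only safe when the file is uniformly in one encoding.
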