Обсуждение: error while trying to change the database encoding on a database

Поиск
Список
Период
Сортировка

error while trying to change the database encoding on a database

От
Geoffrey Myers
Дата:
We need to change the database encoding on our databases as they were
created with the wrong encoding.  They were created as SQL_ASCII and we
are changing them to UTF8.

When testing this Friday, I received the following error:

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 5225; 0 16990 TABLE
DATA cust postgres
pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
for encoding "UTF8": 0xb0
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT:  COPY cust, line 778

Is there any easy way to figure out which record caused this error?

Thanks.

--
Until later, Geoffrey

"I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them."
- Thomas Jefferson

Re: error while trying to change the database encoding on a database

От
Adrian Klaver
Дата:
On Monday 24 January 2011 6:38:55 am Geoffrey Myers wrote:
> We need to change the database encoding on our databases as they were
> created with the wrong encoding.  They were created as SQL_ASCII and we
> are changing them to UTF8.
>
> When testing this Friday, I received the following error:
>
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 5225; 0 16990 TABLE
> DATA cust postgres
> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
> for encoding "UTF8": 0xb0
> HINT:  This error can also happen if the byte sequence does not match
> the encoding expected by the server, which is controlled by
> "client_encoding".
> CONTEXT:  COPY cust, line 778
                        ^^^^^^^ In the COPY command for that table.

>
> Is there any easy way to figure out which record caused this error?
>
> Thanks.
>
> --
> Until later, Geoffrey




--
Adrian Klaver
adrian.klaver@gmail.com

Re: error while trying to change the database encoding on a database

От
Geoffrey Myers
Дата:
Adrian Klaver wrote:
> On Monday 24 January 2011 6:38:55 am Geoffrey Myers wrote:
>> We need to change the database encoding on our databases as they were
>> created with the wrong encoding.  They were created as SQL_ASCII and we
>> are changing them to UTF8.
>>
>> When testing this Friday, I received the following error:
>>
>> pg_restore: [archiver (db)] Error while PROCESSING TOC:
>> pg_restore: [archiver (db)] Error from TOC entry 5225; 0 16990 TABLE
>> DATA cust postgres
>> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
>> for encoding "UTF8": 0xb0
>> HINT:  This error can also happen if the byte sequence does not match
>> the encoding expected by the server, which is controlled by
>> "client_encoding".
>> CONTEXT:  COPY cust, line 778
>                         ^^^^^^^ In the COPY command for that table.

I picked up ont that, but the dump is binary, thus I can not view the
actual code.

>
>> Is there any easy way to figure out which record caused this error?
>>
>> Thanks.
>>
>> --
>> Until later, Geoffrey
>
>
>
>


--
Until later, Geoffrey

"I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them."
- Thomas Jefferson

Re: error while trying to change the database encoding on a database

От
Adrian Klaver
Дата:
On Monday 24 January 2011 7:57:52 am Geoffrey Myers wrote:
> Adrian Klaver wrote:
> > On Monday 24 January 2011 6:38:55 am Geoffrey Myers wrote:
> >> We need to change the database encoding on our databases as they were
> >> created with the wrong encoding.  They were created as SQL_ASCII and we
> >> are changing them to UTF8.
> >>
> >> When testing this Friday, I received the following error:
> >>
> >> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> >> pg_restore: [archiver (db)] Error from TOC entry 5225; 0 16990 TABLE
> >> DATA cust postgres
> >> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
> >> for encoding "UTF8": 0xb0
> >> HINT:  This error can also happen if the byte sequence does not match
> >> the encoding expected by the server, which is controlled by
> >> "client_encoding".
> >> CONTEXT:  COPY cust, line 778
> >
> >                         ^^^^^^^ In the COPY command for that table.
>
> I picked up ont that, but the dump is binary, thus I can not view the
> actual code.

Actually you can :) I should have mentioned it before. You can have pg_restore
restore to a file instead of a database by using the -f switch. When you do
that it creates plain text output. You could restore the entire dump to the
file or use the -t switch to get only the table you need.

>
> >> Is there any easy way to figure out which record caused this error?
> >>
> >> Thanks.
> >>
> >> --
> >> Until later, Geoffrey
>
> --
> Until later, Geoffrey
>
> "I predict future happiness for America if they can prevent
> the government from wasting the labors of the people under
> the pretense of taking care of them."
> - Thomas Jefferson



--
Adrian Klaver
adrian.klaver@gmail.com

Re: error while trying to change the database encoding on a database

От
Geoffrey Myers
Дата:
Adrian Klaver wrote:
> On Monday 24 January 2011 7:57:52 am Geoffrey Myers wrote:
>> Adrian Klaver wrote:
>>> On Monday 24 January 2011 6:38:55 am Geoffrey Myers wrote:
>>>> We need to change the database encoding on our databases as they were
>>>> created with the wrong encoding.  They were created as SQL_ASCII and we
>>>> are changing them to UTF8.
>>>>
>>>> When testing this Friday, I received the following error:
>>>>
>>>> pg_restore: [archiver (db)] Error while PROCESSING TOC:
>>>> pg_restore: [archiver (db)] Error from TOC entry 5225; 0 16990 TABLE
>>>> DATA cust postgres
>>>> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
>>>> for encoding "UTF8": 0xb0
>>>> HINT:  This error can also happen if the byte sequence does not match
>>>> the encoding expected by the server, which is controlled by
>>>> "client_encoding".
>>>> CONTEXT:  COPY cust, line 778
>>>                         ^^^^^^^ In the COPY command for that table.
>> I picked up ont that, but the dump is binary, thus I can not view the
>> actual code.
>
> Actually you can :) I should have mentioned it before. You can have pg_restore
> restore to a file instead of a database by using the -f switch. When you do
> that it creates plain text output. You could restore the entire dump to the
> file or use the -t switch to get only the table you need.

Thanks for the suggestion.  As it stands, we are getting different
errors for different hex characters, thus the solution we need is the
ability to identify the characters that won't convert from SQL_ASCII to
UTF8.  Is there a resource that would identify these characters?

>
>>>> Is there any easy way to figure out which record caused this error?
>>>>
>>>> Thanks.
>>>>
>>>> --
>>>> Until later, Geoffrey
>> --
>> Until later, Geoffrey
>>
>> "I predict future happiness for America if they can prevent
>> the government from wasting the labors of the people under
>> the pretense of taking care of them."
>> - Thomas Jefferson
>
>
>


--
Until later, Geoffrey

"I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them."
- Thomas Jefferson

Re: error while trying to change the database encoding on a database

От
Adrian Klaver
Дата:
On Monday 24 January 2011 8:06:38 am Geoffrey Myers wrote:
> Adrian Klaver wrote:
> > On Monday 24 January 2011 7:57:52 am Geoffrey Myers wrote:
> >> Adrian Klaver wrote:
> >>> On Monday 24 January 2011 6:38:55 am Geoffrey Myers wrote:
> >>>> We need to change the database encoding on our databases as they were
> >>>> created with the wrong encoding.  They were created as SQL_ASCII and
> >>>> we are changing them to UTF8.
> >>>>
> >>>> When testing this Friday, I received the following error:
> >>>>
> >>>> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> >>>> pg_restore: [archiver (db)] Error from TOC entry 5225; 0 16990 TABLE
> >>>> DATA cust postgres
> >>>> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
> >>>> for encoding "UTF8": 0xb0
> >>>> HINT:  This error can also happen if the byte sequence does not match
> >>>> the encoding expected by the server, which is controlled by
> >>>> "client_encoding".
> >>>> CONTEXT:  COPY cust, line 778
> >>>
> >>>                         ^^^^^^^ In the COPY command for that table.
> >>
> >> I picked up ont that, but the dump is binary, thus I can not view the
> >> actual code.
> >
> > Actually you can :) I should have mentioned it before. You can have
> > pg_restore restore to a file instead of a database by using the -f
> > switch. When you do that it creates plain text output. You could restore
> > the entire dump to the file or use the -t switch to get only the table
> > you need.
>
> Thanks for the suggestion.  As it stands, we are getting different
> errors for different hex characters, thus the solution we need is the
> ability to identify the characters that won't convert from SQL_ASCII to
> UTF8.  Is there a resource that would identify these characters?
>

Well the issue is that SQL_ASCII is not an encoding. From the docs:
http://www.postgresql.org/docs/9.0/interactive/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED
"Thus, this setting is not so much a declaration that a specific encoding is in
use, as a declaration of ignorance about the encoding. In most cases, if you
are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting
because PostgreSQL will be unable to help you by converting or validating
non-ASCII characters. "

What you need to do is determine what applications where putting data into the
database and what encoding they are using. I ran into this a couple of years
back with an app that was using WIN1252 for data being inserted into a couple
of tables in a SQL_ASCII database . Once I knew the encoding I dumped the table
schema only for those tables into a new UTF8 database. Using psql I set the
client_encoding to WIN1252 and then used \i to pull in a plain text data only
dump for each table.


>
> --
> Until later, Geoffrey
>
> "I predict future happiness for America if they can prevent
> the government from wasting the labors of the people under
> the pretense of taking care of them."
> - Thomas Jefferson



--
Adrian Klaver
adrian.klaver@gmail.com

Re: error while trying to change the database encoding on a database

От
Geoffrey Myers
Дата:
Adrian Klaver wrote:
> On Monday 24 January 2011 8:06:38 am Geoffrey Myers wrote:
>> Adrian Klaver wrote:
>>> On Monday 24 January 2011 7:57:52 am Geoffrey Myers wrote:
>>>> Adrian Klaver wrote:
>>>>> On Monday 24 January 2011 6:38:55 am Geoffrey Myers wrote:
>>>>>> We need to change the database encoding on our databases as they were
>>>>>> created with the wrong encoding.  They were created as SQL_ASCII and
>>>>>> we are changing them to UTF8.
>>>>>>
>>>>>> When testing this Friday, I received the following error:
>>>>>>
>>>>>> pg_restore: [archiver (db)] Error while PROCESSING TOC:
>>>>>> pg_restore: [archiver (db)] Error from TOC entry 5225; 0 16990 TABLE
>>>>>> DATA cust postgres
>>>>>> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
>>>>>> for encoding "UTF8": 0xb0
>>>>>> HINT:  This error can also happen if the byte sequence does not match
>>>>>> the encoding expected by the server, which is controlled by
>>>>>> "client_encoding".
>>>>>> CONTEXT:  COPY cust, line 778
>>>>>                         ^^^^^^^ In the COPY command for that table.
>>>> I picked up ont that, but the dump is binary, thus I can not view the
>>>> actual code.
>>> Actually you can :) I should have mentioned it before. You can have
>>> pg_restore restore to a file instead of a database by using the -f
>>> switch. When you do that it creates plain text output. You could restore
>>> the entire dump to the file or use the -t switch to get only the table
>>> you need.
>> Thanks for the suggestion.  As it stands, we are getting different
>> errors for different hex characters, thus the solution we need is the
>> ability to identify the characters that won't convert from SQL_ASCII to
>> UTF8.  Is there a resource that would identify these characters?
>>
>
> Well the issue is that SQL_ASCII is not an encoding. From the docs:
> http://www.postgresql.org/docs/9.0/interactive/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED
> "Thus, this setting is not so much a declaration that a specific encoding is in
> use, as a declaration of ignorance about the encoding. In most cases, if you
> are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting
> because PostgreSQL will be unable to help you by converting or validating
> non-ASCII characters. "
>
> What you need to do is determine what applications where putting data into the
> database and what encoding they are using. I ran into this a couple of years
> back with an app that was using WIN1252 for data being inserted into a couple
> of tables in a SQL_ASCII database . Once I knew the encoding I dumped the table
> schema only for those tables into a new UTF8 database. Using psql I set the
> client_encoding to WIN1252 and then used \i to pull in a plain text data only
> dump for each table.

We hope to identify the characters and fix them in the existing
database, then convert.  It appears to be very limited, but it would
help if there was some way to identify these characters outside of
simply doing the reload of the data and finding the errors.

Hence the reason I asked about a resource that might identify the
characters.

>
>
>> --
>> Until later, Geoffrey
>>
>> "I predict future happiness for America if they can prevent
>> the government from wasting the labors of the people under
>> the pretense of taking care of them."
>> - Thomas Jefferson
>
>
>


--
Until later, Geoffrey

"I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them."
- Thomas Jefferson

Re: error while trying to change the database encoding on a database

От
Martijn van Oosterhout
Дата:
On Mon, Jan 24, 2011 at 12:16:46PM -0500, Geoffrey Myers wrote:
> We hope to identify the characters and fix them in the existing
> database, then convert.  It appears to be very limited, but it would
> help if there was some way to identify these characters outside of
> simply doing the reload of the data and finding the errors.
>
> Hence the reason I asked about a resource that might identify the
> characters.

Short answer, any byte with the high bit set.

You're going to need to assign them a meaning. Additionally you're
going to have to fix your code to only output correct encoded data.

The suggestion to simply reload the database as if all the current data
was WIN1251 or Latin-9 is a fairly easy way to getting the database
into a reasonable format. The data would have to be checked though.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Вложения

Re: error while trying to change the database encoding on a database

От
Adrian Klaver
Дата:
On 01/24/2011 09:16 AM, Geoffrey Myers wrote:

>
> We hope to identify the characters and fix them in the existing
> database, then convert. It appears to be very limited, but it would help
> if there was some way to identify these characters outside of simply
> doing the reload of the data and finding the errors.
>
> Hence the reason I asked about a resource that might identify the
> characters.

The problem is that from the standpoint of the SQL_ASCII database there
is nothing wrong with the characters per se. AFAIK there is no built in
function to validate characters. The reason is that valid is determined
by the encoding and if you know the encoding then you really don't need
to determine validity. If you want to see one way others have tackled
this, search on iconv in the mailing list archive. This requires working
on an external copy of the data and knowing something about the
encodings involved. The nearest I could ever find to an encoding
detector is:

http://chardet.feedparser.org/

It is a Python program and the encodings it detects are limited but it
might work for you.

Given all the above, when I was faced with the problem you are facing I
found it easiest to make an educated guess as to the original encoding
and then do test restores with client_encoding set to my guess.

>
>>
>>
>>> --
>>> Until later, Geoffrey



--
Adrian Klaver
adrian.klaver@gmail.com

Re: error while trying to change the database encoding on a database

От
Geoffrey Myers
Дата:
Adrian Klaver wrote:
> On 01/24/2011 09:16 AM, Geoffrey Myers wrote:
>
>>
>> We hope to identify the characters and fix them in the existing
>> database, then convert. It appears to be very limited, but it would help
>> if there was some way to identify these characters outside of simply
>> doing the reload of the data and finding the errors.
>>
>> Hence the reason I asked about a resource that might identify the
>> characters.
>
> The problem is that from the standpoint of the SQL_ASCII database there
> is nothing wrong with the characters per se. AFAIK there is no built in
> function to validate characters. The reason is that valid is determined
> by the encoding and if you know the encoding then you really don't need
> to determine validity. If you want to see one way others have tackled
> this, search on iconv in the mailing list archive. This requires working
> on an external copy of the data and knowing something about the
> encodings involved. The nearest I could ever find to an encoding
> detector is:
>
> http://chardet.feedparser.org/
>
> It is a Python program and the encodings it detects are limited but it
> might work for you.
>
> Given all the above, when I was faced with the problem you are facing I
> found it easiest to make an educated guess as to the original encoding
> and then do test restores with client_encoding set to my guess.

Understood.  We had figured the problem to be small, and it appears it
is and thus felt we could address it a character at a time.  Then get
this error:

pg_restore: [archiver (db)] Error from TOC entry 5258; 0 17549 TABLE
DATA fax postgres
pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
for encoding "UTF8": 0xe28053

That hex value doesn't translate to a single character.  I've dumped the
data to a file as you suggested, but reviewing the identified line
brings no joy.

--
Until later, Geoffrey

"I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them."
- Thomas Jefferson

Re: error while trying to change the database encoding on a database

От
Adrian Klaver
Дата:
On 01/24/2011 10:57 AM, Geoffrey Myers wrote:
> Adrian Klaver wrote:
>> On 01/24/2011 09:16 AM, Geoffrey Myers wrote:
>>
>>>
>>> We hope to identify the characters and fix them in the existing
>>> database, then convert. It appears to be very limited, but it would help
>>> if there was some way to identify these characters outside of simply
>>> doing the reload of the data and finding the errors.
>>>
>>> Hence the reason I asked about a resource that might identify the
>>> characters.
>>
>> The problem is that from the standpoint of the SQL_ASCII database
>> there is nothing wrong with the characters per se. AFAIK there is no
>> built in function to validate characters. The reason is that valid is
>> determined by the encoding and if you know the encoding then you
>> really don't need to determine validity. If you want to see one way
>> others have tackled this, search on iconv in the mailing list archive.
>> This requires working on an external copy of the data and knowing
>> something about the encodings involved. The nearest I could ever find
>> to an encoding detector is:
>>
>> http://chardet.feedparser.org/
>>
>> It is a Python program and the encodings it detects are limited but it
>> might work for you.
>>
>> Given all the above, when I was faced with the problem you are facing
>> I found it easiest to make an educated guess as to the original
>> encoding and then do test restores with client_encoding set to my guess.
>
> Understood. We had figured the problem to be small, and it appears it is
> and thus felt we could address it a character at a time. Then get this
> error:
>
> pg_restore: [archiver (db)] Error from TOC entry 5258; 0 17549 TABLE
> DATA fax postgres
> pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence
> for encoding "UTF8": 0xe28053
>
> That hex value doesn't translate to a single character. I've dumped the
> data to a file as you suggested, but reviewing the identified line
> brings no joy.
>

The only thing I can think of is to use iconv like:

iconv -c -t utf8 -f utf8 -o converted_txt.txt 'original.txt'

where original.txt is your plain text data dump. The -c switch causes
iconv not to convert any illegal characters.

You could then run a diff against converted_txt.txt and 'original.txt'
to see what characters in the original text are causing the problem.


--
Adrian Klaver
adrian.klaver@gmail.com

Re: error while trying to change the database encoding on a database

От
Jasen Betts
Дата:
On 2011-01-24, Geoffrey Myers <lists@serioustechnology.com> wrote:
> Adrian Klaver wrote:

> Thanks for the suggestion.  As it stands, we are getting different
> errors for different hex characters, thus the solution we need is the
> ability to identify the characters that won't convert from SQL_ASCII to
> UTF8.  Is there a resource that would identify these characters?

use "iconv" to strip out the invalid chacaters from the SQL and then
compare before and after.

 I think the iconv command is

  iconv -f UTF8 -t UTF8 -c


--
⚂⚃ 100% natural