Обсуждение: COPY 'invalid byte sequence for encoding "UTF8": 0xff'

Поиск
Список
Период
Сортировка

COPY 'invalid byte sequence for encoding "UTF8": 0xff'

От
Chris Worley
Дата:
Hello,

I get the following error when running a sql script containing a COPY command:

ERROR:  invalid byte sequence for encoding "UTF8": 0xff
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".

What I have done is write a perl script that selects out certain data
from a database and places it tab separated output between a line with
COPY from stdin and .\.  Just like the pg_dump output looks.

The data I have contains binary data from a tcp dump.  After running
my perl script I put the result through this command, to escape the
octets.  (thought i needed to do this)

cat data.dump | iconv -c -f utf8 -t utf8 > fixed.dump

Does anybody know how the dump pulls a column with binary data?  It is
a varchar column.

-chris worley

Re: COPY 'invalid byte sequence for encoding "UTF8": 0xff'

От
Steve Crawford
Дата:
Chris Worley wrote:
> Hello,
>
> I get the following error when running a sql script containing a COPY command:
>
> ERROR:  invalid byte sequence for encoding "UTF8": 0xff...
>
> The data I have contains binary data from a tcp dump....
>
> Does anybody know how the dump pulls a column with binary data?  It is
> a varchar column
No, if it it binary data, you need a column of type bytea and you need
to escape the appropriate characters in your input. For example, if I
create a table with a single bytea column called "foo" and insert the a
record with value "^F^O^O" then dump it, the dump will have the following:

COPY byteatest (foo) FROM stdin;
\\006\\017\\017
\.

See http://www.postgresql.org/docs/8.3/static/datatype-binary.html for
the characters that need to be escaped.

Cheers,
Steve


Re: COPY 'invalid byte sequence for encoding "UTF8": 0xff'

От
Chris Worley
Дата:
On Tue, Apr 21, 2009 at 1:39 PM, Steve Crawford
<scrawford@pinpointresearch.com> wrote:
> Chris Worley wrote:
>>
>> Hello,
>>
>> I get the following error when running a sql script containing a COPY
>> command:
>>
>> ERROR:  invalid byte sequence for encoding "UTF8": 0xff...
>>
>> The data I have contains binary data from a tcp dump....
>>
>> Does anybody know how the dump pulls a column with binary data?  It is
>> a varchar column
>
> No, if it it binary data, you need a column of type bytea and you need to
> escape the appropriate characters in your input. For example, if I create a
> table with a single bytea column called "foo" and insert the a record with
> value "^F^O^O" then dump it, the dump will have the following:
>
> COPY byteatest (foo) FROM stdin;
> \\006\\017\\017
> \.

How does pg_dump and pg_restonre handle everything with no errors?
The column is basically telnet buffers which contains escape sequences
so i have bytes such as 0xff, 0x1b...   Piping the output through
iconv helped formatting some of the data, but it appears I am still
left with some, such as 0xff.

It would be nice to know what sql pg_dump uses to create the contents
inside of the copy command.  Any idea what that is?

-chris worley

>
> See http://www.postgresql.org/docs/8.3/static/datatype-binary.html for the
> characters that need to be escaped.
>
> Cheers,
> Steve
>
>

Re: COPY 'invalid byte sequence for encoding "UTF8": 0xff'

От
Steve Crawford
Дата:
Chris Worley wrote:
> ...
>> For example, if I create a
>> table with a single bytea column called "foo" and insert the a record with
>> value "^F^O^O" then dump it, the dump will have the following:
>>
>> COPY byteatest (foo) FROM stdin;
>> \\006\\017\\017
>> \.
>>
>
> How does pg_dump and pg_restonre handle everything with no errors?
>
See above. It escapes the data so it can be represented in a query. The
Ctrl-F in the bytea column is converted for insertion purposes to its
3-digit octal equivalent of 006 represented as single escaped character
in the data as \\006. The Ctrl-O is \\017. Both \\006 and \\017 are
single characters. You _can_ escape all characters - ie instead of 'A'
you could use '\\101' but you are only _required_ to escape those
certain characters listed in the documentation. Again, see
http://www.postgresql.org/docs/8.3/static/datatype-binary.html.
> The column is basically telnet buffers which contains escape sequences
> so i have bytes such as 0xff, 0x1b...   Piping the output through
> iconv helped formatting some of the data, but it appears I am still
> left with some, such as 0xff.
>
This makes no sense. Binary data could be an executable, a video, a
photo, audio data, etc. but it is not linguistic characters where a
character-set interpretation is appropriate. You may pipe it through
iconv and find there are fewer errors in the attempted import but that
does not imply that output is in any way improved by the transformation.
(I can also feed sox to a JPEG photo, tell it that the data is mu-law,
and have it create an MP3 but the result would just be painful noise.)
> It would be nice to know what sql pg_dump uses to create the contents
> inside of the copy command.  Any idea what that is?
>

Yes, it escapes those bytea characters that require escaping and leaves
the others alone. Here is the same example from before with the string
"^F^O^Obar" (foo in control-characters, bar in lowercase) in the bytea
column:

COPY byteatest (foo) FROM stdin;
\\006\\017\\017bar
\.

You need to do the same in your program. Most client interfaces have an
escape-bytea function that will handle it for you. In C you could use
PQescapeByteaConn while PHP uses pg_escape_bytea. Find the appropriate
function for your language.

Cheers,
Steve


Re: COPY 'invalid byte sequence for encoding "UTF8": 0xff'

От
Steve Crawford
Дата:
> hmm, I was shelling out and using psql and piping the data to another
> file.  Not using the dbi stuff with perl.
>
> Guess i can use a regular expression and review the link you sent me
> and escape them my self.
>
Don't *ahem* quote me on this as I haven't been using Perl for a while
but it should be something similar to:
$quoted_data = $dbh->quote($data, PG_BYTEA)

If not, I'm sure the documentation for the "quote" function will set
things straight.

Cheers,
Steve