Обсуждение: Hex characters in COPY input

Поиск
Список
Период
Сортировка

Hex characters in COPY input

От
Melvin Call
Дата:
Good evening list,

I am taking the output of a MySQL query and saving it into a text file for COPY
input into a PostgreSQL database. The query gives me a list of addresses. One of
the addresses is in Montreal, and was input using the correct spelling of
Montreal where the e is an accented e. The output ends up in the text file as
Montr\xe9al, where the xe9 is a single character. When I try to copy that into
my PostgreSQL table, I get an error "ERROR:  invalid byte sequence for encoding
"UTF8": 0xe9616c", which makes sense since the hex character has not been
sanitized.

Now if I run the output through sed first, and substitute the \xe9 character for
something, say the word TEST, I end up with MontrTESTal in my input file, and
naturally that imports just fine. So this tells me that I can perform a
substitute on the hex character. But I have been pulling my hair out trying to
figure out how to substitute in a properly escaped representation of the
accented e. For instance, this: s/\(\xe9\)/U\&'\1'/g gives me MontrU&'\xe9'al
in my input file, but that just causes a different invalid byte sequence error.

So my question is, how do I sanitize the hex character in the middle of a word
to be able to copy in Montreal with an accented e? Or am I going about this at
the wrong point?

Thanks,
Melvin


Re: Hex characters in COPY input

От
Vick Khera
Дата:

On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@gmail.com> wrote:
Montreal where the e is an accented e. The output ends up in the text file as
Montr\xe9al, where the xe9 is a single character. When I try to copy that into
my PostgreSQL table, I get an error "ERROR:  invalid byte sequence for encoding

Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if you're not.

Re: Hex characters in COPY input

От
Melvin Call
Дата:
On 2/26/15, Vick Khera <vivek@khera.org> wrote:
> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@gmail.com>
> wrote:
>
>> Montreal where the e is an accented e. The output ends up in the text
>> file
>> as
>> Montr\xe9al, where the xe9 is a single character. When I try to copy that
>> into
>> my PostgreSQL table, I get an error "ERROR:  invalid byte sequence for
>> encoding
>>
>
> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if
> you're not.
>
Thank you Vic, adding the ENCODING 'latin1' option to the COPY command worked
perfectly.

If you don't mind a follow up to your reply, I have tried to understand the
different character sets and collations, but I guess I still have a lot to
learn. Your suggestion did not even come close to crossing my mind because the
MySQL table and database are encoded in UTF8. I assume the conversion to latin1
happened because I was putting the MySQL query output into a locally stored
text file? Regardless, can you point me to some reading that would have clued
me in that e9 is not a UTF8 character? Or is the clue the fact that it was not
preceeded with 0x00?

Regards,
Melvin


Re: Hex characters in COPY input

От
Melvin Call
Дата:
On 2/27/15, Adam Hooper <adam@adamhooper.com> wrote:
> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@gmail.com>
> wrote:
>
>> So my question is, how do I sanitize the hex character in the middle of a
>> word
>> to be able to copy in Montreal with an accented e? Or am I going about
>> this at
>> the wrong point?
>
> Hi Melvin,
>
> This is not a Postgres problem, and it is not a regex problem. So yes,
> you're going about it at the wrong point: you're trying to modify a
> _character_ at a time, but you _should_ be trying to modify a _byte_
> at a time. Text replacement cannot do what you want it to do.
>
> If you're on Linux or Mac, uconv will work -- for instance, `iconv
> --from-code=windows-1252 --to-code=utf-8 < input-file.txt >
> output-file.txt`
>
> Otherwise, you can use a text editor. Be sure to open the file
> properly (such that é appears) and then save it as utf-8.
>
> Alternatively, you could tell Postgres to use your existing encoding
> -- judging from the \xe9, any of "windows-1252", "iso-8859-15" or
> "iso-8859-1" will be accurate. But I always prefer my data to be
> stored as "utf-8", and you should, too.
>
> Read up on character sets here:
> http://www.joelonsoftware.com/articles/Unicode.html
>
> Enjoy life,
> Adam


Thank you Adam. I was able to make this work by adding the ENCODING 'latin1'
option to the COPY command per Vic's suggestion, and as you correctly pointed
out as well. However iconv would probably do the trick too, now that I know
where the problem actually lies. I failed to realize that I was not dealing
with UTF8 because the MySQL data is encoded in UTF8, but you saw what I wasn't
seeing. Your suggested reading is also most appreciated. Maybe one of these
days I will actually make sense of this encoding issue. Thanks for the
link.

Regards,
Melvin


Re: Hex characters in COPY input

От
Adrian Klaver
Дата:
On 02/27/2015 06:39 AM, Melvin Call wrote:
> On 2/26/15, Vick Khera <vivek@khera.org> wrote:
>> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@gmail.com>
>> wrote:
>>
>>> Montreal where the e is an accented e. The output ends up in the text
>>> file
>>> as
>>> Montr\xe9al, where the xe9 is a single character. When I try to copy that
>>> into
>>> my PostgreSQL table, I get an error "ERROR:  invalid byte sequence for
>>> encoding
>>>
>>
>> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if
>> you're not.
>>
> Thank you Vic, adding the ENCODING 'latin1' option to the COPY command worked
> perfectly.
>
> If you don't mind a follow up to your reply, I have tried to understand the
> different character sets and collations, but I guess I still have a lot to
> learn. Your suggestion did not even come close to crossing my mind because the
> MySQL table and database are encoded in UTF8. I assume the conversion to latin1
> happened because I was putting the MySQL query output into a locally stored
> text file? Regardless, can you point me to some reading that would have clued
> me in that e9 is not a UTF8 character? Or is the clue the fact that it was not
> preceeded with 0x00?

For UTF8 characters see here:

http://www.utf8-chartable.de/


For the MySQL part, you are going to detail how you got the data out?

>
> Regards,
> Melvin
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Hex characters in COPY input

От
Melvin Call
Дата:
On Fri, Feb 27, 2015 at 8:58 AM, Adam Hooper <adam@adamhooper.com> wrote:
> On Fri, Feb 27, 2015 at 9:39 AM, Melvin Call <melvincall979@gmail.com> wrote:
>> On 2/26/15, Vick Khera <vivek@khera.org> wrote:
>
>>> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if
>>> you're not.
>>>
>> Thank you Vic, adding the ENCODING 'latin1' option to the COPY command worked
>> perfectly.
>>
>> Your suggestion did not even come close to crossing my mind because the
>> MySQL table and database are encoded in UTF8. I assume the conversion to latin1
>> happened because I was putting the MySQL query output into a locally stored
>> text file?
>
> The reality is confusing: when moving text around, every step of the
> process involves a character set conversion.
>
> I'm guessing your MySQL client is defaulting to character_set_client =
> 'iso-8859-1' or some-such. Depending on your client, that could be
> because your _terminal_ is set to iso-8859-1 encoding. (If you're on
> Unix, type `locale` and if you don't see lots of "UTF-8"s your
> terminal probably isn't using UTF-8.)

Ah, that makes perfect sense.

>
> But really, there are so many variables it's only an issue if you're
> trying to change the way the MySQL client is behaving. And I don't
> think this is the list for that.

Agreed. I didn't realize that the MySQL client would be the issue, but I think
you've hit it. This is in preparation for moving away from MySQL so no need to
pursue it much further. I just need to get the information out once, and y'all
have helped me get there. Thanks!

>
> MySQL's encoding logic is written up here:
> http://dev.mysql.com/doc/refman/5.7/en/charset-connection.html. (The
> way the data is stored is in a completely separate web page:
> http://dev.mysql.com/doc/refman/5.7/en/charset-applications.html ...
> which only has a tiny hint at the bottom of the page about UTF-8 in
> the MySQL command-line client.)
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-613-986-3339
> http://adamhooper.com


Re: Hex characters in COPY input

От
Melvin Call
Дата:
On Fri, Feb 27, 2015 at 9:03 AM, Adrian Klaver
<adrian.klaver@aklaver.com> wrote:
> On 02/27/2015 06:39 AM, Melvin Call wrote:
>>
>> On 2/26/15, Vick Khera <vivek@khera.org> wrote:
>>>
>>> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@gmail.com>
>>> wrote:
>>>
>>>> Montreal where the e is an accented e. The output ends up in the text
>>>> file
>>>> as
>>>> Montr\xe9al, where the xe9 is a single character. When I try to copy
>>>> that
>>>> into
>>>> my PostgreSQL table, I get an error "ERROR:  invalid byte sequence for
>>>> encoding
>>>>
>>>
>>> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8
>>> if
>>> you're not.
>>>
>> Thank you Vic, adding the ENCODING 'latin1' option to the COPY command
>> worked
>> perfectly.
>>
>> If you don't mind a follow up to your reply, I have tried to understand
>> the
>> different character sets and collations, but I guess I still have a lot to
>> learn. Your suggestion did not even come close to crossing my mind because
>> the
>> MySQL table and database are encoded in UTF8. I assume the conversion to
>> latin1
>> happened because I was putting the MySQL query output into a locally
>> stored
>> text file? Regardless, can you point me to some reading that would have
>> clued
>> me in that e9 is not a UTF8 character? Or is the clue the fact that it was
>> not
>> preceeded with 0x00?
>
>
> For UTF8 characters see here:
>
> http://www.utf8-chartable.de/

Thank you for the link. Bookmarked.

>
>
> For the MySQL part, you are going to detail how you got the data out?

This is in preparation of moving away from MySQL. I inherited this MySQL
database and it is in horrible shape, no referential integrity, no constraints
other than arbitrarily chosen VARCHAR lengths, no indexes, and inconsistent
entity and attribute naming. I have to pull the fields out through a query that
is being redirected to a local file, as opposed to a dump, because I'm having
to filter out a lot of useless rows (TestCity in a production system!?). I just
realized I could have put the usable data into similar tables and then used
mysqldump with the encoding specified, but no need now. I have my extraction
and import done, so this should be the last I need to touch the MySQL system.

>
>>
>> Regards,
>> Melvin
>>
>>
>
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com


Re: Hex characters in COPY input

От
Adrian Klaver
Дата:
On 02/27/2015 07:55 AM, Melvin Call wrote:
> On Fri, Feb 27, 2015 at 9:03 AM, Adrian Klaver
> <adrian.klaver@aklaver.com> wrote:
>> On 02/27/2015 06:39 AM, Melvin Call wrote:
>>>
>>> On 2/26/15, Vick Khera <vivek@khera.org> wrote:
>>>>
>>>> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@gmail.com>
>>>> wrote:
>>>>
>>>>> Montreal where the e is an accented e. The output ends up in the text
>>>>> file
>>>>> as
>>>>> Montr\xe9al, where the xe9 is a single character. When I try to copy
>>>>> that
>>>>> into
>>>>> my PostgreSQL table, I get an error "ERROR:  invalid byte sequence for
>>>>> encoding
>>>>>
>>>>
>>>> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8
>>>> if
>>>> you're not.
>>>>
>>> Thank you Vic, adding the ENCODING 'latin1' option to the COPY command
>>> worked
>>> perfectly.
>>>
>>> If you don't mind a follow up to your reply, I have tried to understand
>>> the
>>> different character sets and collations, but I guess I still have a lot to
>>> learn. Your suggestion did not even come close to crossing my mind because
>>> the
>>> MySQL table and database are encoded in UTF8. I assume the conversion to
>>> latin1
>>> happened because I was putting the MySQL query output into a locally
>>> stored
>>> text file? Regardless, can you point me to some reading that would have
>>> clued
>>> me in that e9 is not a UTF8 character? Or is the clue the fact that it was
>>> not
>>> preceeded with 0x00?
>>
>>
>> For UTF8 characters see here:
>>
>> http://www.utf8-chartable.de/
>
> Thank you for the link. Bookmarked.
>
>>
>>
>> For the MySQL part, you are going to detail how you got the data out?
>
> This is in preparation of moving away from MySQL. I inherited this MySQL
> database and it is in horrible shape, no referential integrity, no constraints
> other than arbitrarily chosen VARCHAR lengths, no indexes, and inconsistent
> entity and attribute naming. I have to pull the fields out through a query that
> is being redirected to a local file, as opposed to a dump, because I'm having
> to filter out a lot of useless rows (TestCity in a production system!?). I just
> realized I could have put the usable data into similar tables and then used
> mysqldump with the encoding specified, but no need now. I have my extraction
> and import done, so this should be the last I need to touch the MySQL system.


Gotcha, I recently did something similar.

>
>>
>>>
>>> Regards,
>>> Melvin
>>>
>>>
>>
>>
>> --
>> Adrian Klaver
>> adrian.klaver@aklaver.com
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Hex characters in COPY input

От
Adam Hooper
Дата:
On Fri, Feb 27, 2015 at 9:39 AM, Melvin Call <melvincall979@gmail.com> wrote:
> On 2/26/15, Vick Khera <vivek@khera.org> wrote:
>> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@gmail.com>
>> wrote:
>>
>>> I get an error "ERROR:  invalid byte sequence for
>>> encoding "UTF8": 0xe9616c"
>>
>> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if
>> you're not.
>
> Regardless, can you point me to some reading that would have clued
> me in that e9 is not a UTF8 character? Or is the clue the fact that it was not
> preceeded with 0x00?

The error message Postgres gave you is a pretty darned good clue :).

But your question has an error, and it's worth expounding a bit. 0xe9
is not a UTF8 character because 0xe9 is not a _character_. 0xe9 is a
_byte_. Characters are not bytes. Characters can be _encoded_ into
bytes, and that's not the same thing.

UTF-8 encoding is a bit confusing: any byte in the range [0x00-0x7f]
represents the same character as in ASCII encoding (an extremely
popular encoding). Any byte in the range [0x80-0xff] is the "leading
byte" in a sequence of bytes that represents a single character.
"Continuation" bytes are in the range [0x80-0xbf]. (Why not the range
[0x00-0xff]? Because UTF-8 was designed to generate errors when fed
non-UTF8 byte sequences.) The first four bits of the leading byte
describe how many continuation bytes there are. If you care to read up
on the how and why of UTF-8 (a red herring in this discussion), try:
See http://en.wikipedia.org/wiki/UTF-8

Back to 0xe9. 0xe9 is '1110 1001' in binary. Postgres' UTF-8 decoder
sees that initial '1110' and determines that it needs to inspect three
bytes to read one character. The second byte is 0x61, which is not in
the range [0x80-0xbf], so Postgres reports an invalid byte sequence.
Hooray: it produces exactly the error message it should.

You don't need to concern yourself with the complications of UTF-8.
You only need to know that bytes are not characters; if you don't know
the encoding of a sequence of bytes, you've made a logic error.
Postgres told you the error, though it didn't hint at how to fix it.
(Text editors and web browsers use heuristics to guess file encodings,
and they're usually right, though it's impossible to create a perfect
heuristic. See
http://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and
for further discussion there.)

If you're looking for take-away lessons, the main one is: "read the
error message" :).

Next time you have the "wrong encoding" problem, you have two options:
1) figure out the encoding and tell Postgres; or 2) regenerate the
file in the correct encoding (UTF-8). The "why" is here:
http://www.joelonsoftware.com/articles/Unicode.html

We on this list jumped strait to option 1. We've memorized 0xe9 in
particular, because we've been through your pain before. In the
Americas and Western Europe, if a file contains the byte 0xe9 it
probably contains the character "é" encoded as
windows-1252/ISO-8859-1/ISO-8859-15. That's very common. MySQL in
particular is a ghastly Internet denizen, in that it defaults to
ISO-8859-15 in an apparent crusade against globalization and modern
standards.

Enjoy life,
Adam

--
Adam Hooper
+1-613-986-3339
http://adamhooper.com


Re: Hex characters in COPY input

От
Melvin Call
Дата:
On Fri, Feb 27, 2015 at 12:02 PM, Adam Hooper <adam@adamhooper.com> wrote:
> On Fri, Feb 27, 2015 at 9:39 AM, Melvin Call <melvincall979@gmail.com> wrote:
>> On 2/26/15, Vick Khera <vivek@khera.org> wrote:
>>> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@gmail.com>
>>> wrote:
>>>
>>>> I get an error "ERROR:  invalid byte sequence for
>>>> encoding "UTF8": 0xe9616c"
>>>
>>> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if
>>> you're not.
>>
>> Regardless, can you point me to some reading that would have clued
>> me in that e9 is not a UTF8 character? Or is the clue the fact that it was not
>> preceeded with 0x00?
>
> The error message Postgres gave you is a pretty darned good clue :).
>
> But your question has an error, and it's worth expounding a bit. 0xe9
> is not a UTF8 character because 0xe9 is not a _character_. 0xe9 is a
> _byte_. Characters are not bytes. Characters can be _encoded_ into
> bytes, and that's not the same thing.
>
> UTF-8 encoding is a bit confusing: any byte in the range [0x00-0x7f]
> represents the same character as in ASCII encoding (an extremely
> popular encoding). Any byte in the range [0x80-0xff] is the "leading
> byte" in a sequence of bytes that represents a single character.
> "Continuation" bytes are in the range [0x80-0xbf]. (Why not the range
> [0x00-0xff]? Because UTF-8 was designed to generate errors when fed
> non-UTF8 byte sequences.) The first four bits of the leading byte
> describe how many continuation bytes there are. If you care to read up
> on the how and why of UTF-8 (a red herring in this discussion), try:
> See http://en.wikipedia.org/wiki/UTF-8
>
> Back to 0xe9. 0xe9 is '1110 1001' in binary. Postgres' UTF-8 decoder
> sees that initial '1110' and determines that it needs to inspect three
> bytes to read one character. The second byte is 0x61, which is not in
> the range [0x80-0xbf], so Postgres reports an invalid byte sequence.
> Hooray: it produces exactly the error message it should.
>
> You don't need to concern yourself with the complications of UTF-8.
> You only need to know that bytes are not characters; if you don't know
> the encoding of a sequence of bytes, you've made a logic error.
> Postgres told you the error, though it didn't hint at how to fix it.
> (Text editors and web browsers use heuristics to guess file encodings,
> and they're usually right, though it's impossible to create a perfect
> heuristic. See
http://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and
> for further discussion there.)
>
> If you're looking for take-away lessons, the main one is: "read the
> error message" :).
>
> Next time you have the "wrong encoding" problem, you have two options:
> 1) figure out the encoding and tell Postgres; or 2) regenerate the
> file in the correct encoding (UTF-8). The "why" is here:
> http://www.joelonsoftware.com/articles/Unicode.html

Timing is everything: I was just starting to read this in fact.

Thank you for taking the time to provide this information. It is proving to be
most useful. And yes, I did misspeak. At least the one part I have learned
about all of this is that more than one byte is required to represent the
majority of characters out there. So yeah, I did know that e9 was a byte, but I
appreciate you re-iterating that for me. Hoping to represent each character
with a single byte is an artifact of most early computer development being
dominated by English speaking persons way back when, eh?

I must say, at least as far as the MySQL outputting goes, I like option 1. It
was easy enough to find and fix that way once I knew what to look for. The
MySQL documentation is sometimes difficult to wade through. But no
documentation is ever perfect, is it?

One of the things that has endeared me to PostgreSQL is the fact that it seems
to strive for global usefulness while remaining standards compliant. Or are
those one and the same?

Regards

>
> We on this list jumped strait to option 1. We've memorized 0xe9 in
> particular, because we've been through your pain before. In the
> Americas and Western Europe, if a file contains the byte 0xe9 it
> probably contains the character "é" encoded as
> windows-1252/ISO-8859-1/ISO-8859-15. That's very common. MySQL in
> particular is a ghastly Internet denizen, in that it defaults to
> ISO-8859-15 in an apparent crusade against globalization and modern
> standards.
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-613-986-3339
> http://adamhooper.com