Обсуждение: Importing data - possible UTF8 import bug?

Поиск
Список
Период
Сортировка

Importing data - possible UTF8 import bug?

От
"Mikel Lindsaar"
Дата:
Hi all,

I am importing a bunch of data exported from an Oracle database.

This came out as a flat file, tab delimited.

I have parsed this through a ruby script that does the following:

For each line
  split it at the tab
  for each column
    parse it through Iconv and convert from UTF8 to UTF8
      on failure dump the line into a log for inspection and continue
    check to see if the column value ends in a single \
      if so, add another \ to escape it and not the tab that will follow
    check to see if the column value is blank
      if so, replace with \N
  join the array back together with tabs
  write it out to the import file.


This works quite well.  The ruby script handles about 10,000 lines
every 15 seconds or so.  Theoretically no invalid utf-8 sequences
could end up in the output file as everything is passed through Iconv
and if it raises an except it dumps the offending line into a separate
file for manual handling.

But I am hitting a problem.  When I try to import the result with COPY
I am getting:

ERROR: invalid byte sequence for encoding "UTF8": 0xa2

And gives me a line reference.

I then grabbed that line out of the file, unpacked each byte into a
decimal representation and 162 does not appear anywhere.

Additionally I did:

Iconv.iconv('UTF-8', 'UTF-8', line).join == line

Which takes the line and converts it from UTF8 to UTF8 using the Iconv
library (--f UTF-8 -t UTF-8) and this returns true.

Interestingly, the following also returns true:

Iconv.iconv('ASCII', 'UTF-8', line).join == line #=> true

So there can't be any non ASCII characters in there.

Unpacking the line and uniq'ing it and sorting it gives the following:

@result.unpack('U'*@result.length).uniq.sort
=> [9, 10, 32, 45, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 65, 66,
67, 68, 69, 70, 72, 73, 74, 76, 77, 78, 79, 80, 82, 83, 84, 85, 87,
91, 92, 93, 97, 98, 99, 101, 109, 110, 111, 114, 116, 117, 121]

Anyone got any ideas how I can hunt down what is in this row, visually
looking at it hasn't helped?

Mikel


--
http://lindsaar.net/
Rails, RSpec and Life blog....

Re: Importing data - possible UTF8 import bug?

От
"Mikel Lindsaar"
Дата:
> Anyone got any ideas how I can hunt down what is in this row, visually
> looking at it hasn't helped?

As a followup, I am getting this same sort of error at various points
on the file.

So I tried taking the line, and cleaning it with Iconv and turning it
into an ASCII sequence and inserting it back into the file.

I still get the same error with the same invalid character.... too weird.

I have a few lines that fail, all with different invalid character encodings:

0xe92042
0xe42050
0xa2
0xc220
0xdb20
0xe42050
0xe92042

It's at the rate of about one every 50,000 rows... so no real big
deal... but still...

--
http://lindsaar.net/
Rails, RSpec and Life blog....

Re: Importing data - possible UTF8 import bug?

От
Tom Lane
Дата:
"Mikel Lindsaar" <raasdnil@gmail.com> writes:
> But I am hitting a problem.  When I try to import the result with COPY
> I am getting:
> ERROR: invalid byte sequence for encoding "UTF8": 0xa2
> And gives me a line reference.

> I then grabbed that line out of the file, unpacked each byte into a
> decimal representation and 162 does not appear anywhere.

Sure you're counting the lines right?  I believe the number COPY
gives you is the N'th line of COPY data (ie, N lines after the
COPY command) --- not the N'th line of the whole file.  Another
thing to check is whether embedded newlines &such are throwing
off your count.

            regards, tom lane

Re: Importing data - possible UTF8 import bug?

От
"Mikel Lindsaar"
Дата:
OK, I'm mailing the list the results of my problem so future people can find it.

The error was

ERROR: invalid byte sequence for encoding "UTF8": 0xa2

with many different types of 0x... lines.

The problem was indeed a bug, but one that sat between the keyboard
and screen (that is, me), not with the COPY command.  I didn't read
the COPY docs well enough, in there it clearly states that a backslash
followed by digits will be interpreted as a character with that
numeric code (in the table).

As the data I was importing contained addresses, it had a unit number
and street number, like this; 2\554, so this was being interpreted as
the number 2 followed by a character represented by \554 which was an
invalid sequence and so rightly so, Copy failed and complained about
an invalid char sequence.

Going through the data set and replacing the backslashes with forward
slashes (which works in my case) or if you need to be non destructive,
replcaing the single backslash with a double backslash, handles the
problem.

Sorry all for the noise.

Mikel


--
http://lindsaar.net/
Rails, RSpec and Life blog....