Обсуждение: Error "invalid byte sequence for encoding UTF8" on insert into BYTEA column

Поиск
Список
Период
Сортировка

Error "invalid byte sequence for encoding UTF8" on insert into BYTEA column

От
Alan Millington
Дата:
I am running Postgres 8.4.1 on Windows XP Professional Service Pack 3. My database is UTF8. My program code is written in Python, and to interface to Postgres I use mxODBC 3.0 and the PostgreSQL Unicode driver PSQLODBCW.DLL version 8.01.02.00 dated 31/01/2006.
 
Today for the first time since upgrading to Postgres 8.4.1 I tried out part of the code which inserts some binary data into a table. The insert failed with the error "invalid byte sequence for encoding UTF8". That is odd, because the column into which the insert was made is of type bytea: the data is meant to be binary data, not UTF8-encoded Unicode.
 
When I was using Postgres 8.1.4 the same code worked. My code, the mxODBC code and the driver are all unaltered: only the Postgres version has changed. Is there something I can tweak to get it to work on Postgres 8.4.1?
 
I confirm that the driver option "bytea as LO" is checked. That has always been necessary in the past.
 

Re: Error "invalid byte sequence for encoding UTF8" on insert into BYTEA column

От
Martijn van Oosterhout
Дата:
On Thu, Jan 21, 2010 at 11:54:32AM -0800, Alan Millington wrote:
> Today for the first time since upgrading to Postgres 8.4.1 I tried
> out part of the code which inserts some binary data into a table. The
> insert failed with the error "invalid byte sequence for encoding
> UTF8". That is odd, because the column into which the insert was made
> is of type bytea: the data is meant to be binary data, not
> UTF8-encoded Unicode.

Inserting in bytea needs an extra level of escaping when the parameters
are sent inline. See

http://www.postgresql.org/docs/8.4/interactive/datatype-binary.html

> When I was using Postgres 8.1.4 the same code worked. My code, the
> mxODBC code and the driver are all unaltered: only the Postgres
> version has changed.. Is there something I can tweak to get it to
> work on Postgres 8.4.1?

Very odd. The syntax for this hasn't changed in a long time. And I
would have thought you'd be sending your paramters out of line anyway.
Can you check that?

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Вложения

Re: Error "invalid byte sequence for encoding UTF8" on insert into BYTEA column

От
Alan Millington
Дата:
>> Today for the first time since upgrading to Postgres 8.4.1 I tried
>> out part of the code which inserts some binary data into a table. The
>> insert failed with the error "invalid byte sequence for encoding
>> UTF8". That is odd, because the column into which the insert was made
>> is of type bytea: the data is meant to be binary data, not
>> UTF8-encoded Unicode.
 
>Inserting in bytea needs an extra level of escaping when the parameters
>are sent inline. See
 
>http://www.postgresql.org/docs/8.4/interactive/datatype-binary.html
 
I have compared the 8.4 documentation for bytea with the 8.1 documentation, and as far as I can see, nothing of significance has changed.
 
>> When I was using Postgres 8.1.4 the same code worked. My code, the
>> mxODBC code and the driver are all unaltered: only the Postgres
>> version has changed.. Is there something I can tweak to get it to
>> work on Postgres 8.4.1?
 
>Very odd. The syntax for this hasn't changed in a long time. And I
>would have thought you'd be sending your paramters out of line anyway.
>Can you check that?
 
The difficulty here is that several layers of software are involved. It is not as though I were writing in C and calling Postgres interface functions directly.
 
What I send to mxODBC is the command as a string containing placeholders (question marks), e.g. "insert into mytable (intcol, byteacol) values (?, ?)", plus the actual values as separate arguments, e.g. 1, data (where 1 is a literal and data is a Python variable containing binary data). What mxODBC does with this I have no idea. It is a black box as far as I am concerned.
 
However, when I was running Postgres 8.1.4 it used to work. If I knew that something had changed in Postgres between 8.1.4 and 8.4.1, I could go to eGenix and ask whether the latest mxODBC version supports the change. However, if you are telling me that nothing in this area has changed in Postgres, I am baffled!

Re: Error "invalid byte sequence for encoding UTF8" on insert into BYTEA column

От
Alan Millington
Дата:
Having done some investigation, I can shed further light on the problem.
 
Within an interactive Python session I connected to the database using mxODBC: the variable csr is an mxODBC cursor object. I made the following assignments:
 
sql = 'insert into mytable (seq_num, data) values (?, ?)'
data = '\x00\x00\x00'
 
i.e. data is three NULLs.
 
I then executed the command
 
csr.execute(sql, (1, data))
 
The insert succeeded, but when I selected length(data) back from the table (using pgAdmin III) it was zero.
 
I tried again, using data = '\x01\x01\x00\x01' and seq_num 2. This time, when I selected back the length of the data it was 2, namely the first two bytes.
 
So it seems that at some point the string is being parsed as though it were a C-style NULL-terminated string. Having checked the dates on all my files, I think I may have downloaded a new mxODBC version since I last successfully did a binary insert, so it may be that mxODBC is the culprit, but it would be helpful if you could confirm that the Postgres parsing has not changed. I recall that I came across a similar problem when I tried to write a PL/Python function which returned binary data to a PL/pgSQL function: in order to get it to work I had to convert the data to base64Binary in the Python function and then convert it back in the PL/pgSQL function. That option is not open to me in this case.
 
 

 

Re: Error "invalid byte sequence for encoding UTF8" on insert into BYTEA column

От
Tom Lane
Дата:
Alan Millington <admillington@yahoo.co.uk> writes:
> What I send to mxODBC is the command as a string containing placeholders (question marks), e.g. "insert into mytable
(intcol,byteacol) values (?, ?)", plus the actual values as separate arguments, e.g. 1, data (where 1 is a literal and
datais a Python variable containing binary data). What mxODBC does with this I have no idea. It is a black box as far
asI am concerned. 

You probably need to ask the mxODBC developers (who AFAIK don't hang out
on this list) what they are doing with that data.  It sounds fairly
likely to me that the bytea value is just being sent as a string without
any special encoding.  That would explain both the null sensitivity you
mention later in the thread, and the encoding validity complaints ---
PG 8.1 was much less picky about string encoding validity than recent
versions are.

There are basically two ways that you could make this work reliably:
arrange for the bytea value to be sent as an out-of-line binary
parameter, or encode it using backslash sequences (eg, '\000' for a
null).  Whether the former is possible with mxODBC I dunno.  The latter
might be something that mxODBC will do for you if it knows the value
is supposed to be bytea, but without that knowledge I don't see how it
could.  You might end up having to do the encoding yourself.

            regards, tom lane

Re: Error "invalid byte sequence for encoding UTF8" on insert into BYTEA column

От
Alan Millington
Дата:
>You probably need to ask the mxODBC developers (who AFAIK don't hang out
>on this list) what they are doing with that data.  It sounds fairly
>likely to me that the bytea value is just being sent as a string without
>any special encoding.  That would explain both the null sensitivity you
>mention later in the thread, and the encoding validity complaints ---
>PG 8.1 was much less picky about string encoding validity than recent
>versions are.
>There are basically two ways that you could make this work reliably:
>arrange for the bytea value to be sent as an out-of-line binary
>parameter, or encode it using backslash sequences (eg, '\000' for a
>null).  Whether the former is possible with mxODBC I dunno.  The latter
>might be something that mxODBC will do for you if it knows the value
>is supposed to be bytea, but without that knowledge I don't see how it
>could.  You might end up having to do the encoding yourself.

Preliminary notes:
 
1. I have now confirmed that at some point I upgraded from mxODBC 3.0 to 3.0.3. The statement in my original posting that my mxODBC installation had not changed was wrong.
 
2. The Python 'str' datatype is used for any sequence of single bytes, like C's array of char. One cannot tell from the datatype what these bytes are intended to represent: it could be ASCII characters, characters in any single-byte encoding, Unicode in any encoding, or binary data.
 
I have discovered a workaround, which is to pass the data to mxODBC in a Python buffer object, which clearly identifies the data as binary.
 
I wrote to eGenix about this as follows:

For some years I have used a Postgres 8.1.4 database (UTF8) with the Postgres Unicode ODBC driver. At first I used mxODBC 2.0 as my interface with Python 2.4, then I upgraded to mxODBC 3.0 for Python 2.5. I am now on Python 2.6.
  
With mxODBC 2.0 I had to use the Binary() function to convert binary data to a buffer object if I was to insert it successfully into a bytea column. With mxODBC 3.0 I found that that was no longer necessary, though now I had to check the “bytea as LO” option of the driver.
  
At some point I upgraded to mxODBC 3.0.3, and recently I upgraded to Postgres 8.4.1. On trying to insert into a bytea column yesterday I found I was no longer able to do so. Either I received an "invalid byte sequence for encoding UTF8" error from Postgres, or the data was truncated at the first NULL, as though it was a C-style string.
  
I find, however, that if I use Binary() again, I can insert the data correctly.
  
It seems to make no difference whether I use BIND_USING_SQLTYPE or BIND_USING_PYTHONTYPE.
  
Possibilities:
  
i) Postgres has changed its processing of bytea input.
ii) mxODBC has changed the way it handles such data.
iii) Both have changed.
  
Can you shed any light on this?

Marc-Andre Lemburg replied as follows:

mxODBC has not changed in this respect between 3.0.0 and 3.0.3,
so it must be the PostgreSQL code that has changed.

 

Note that changing the bind type doesn't help in this case, since
the PostgreSQL ODBC driver does not provide the necessary APIs
to do SQL type binding. As a result, mxODBC will always use
Python type binding.

 

By using the Binary() wrapper for binary data (which really is just
an alias for buffer()), you tell mxODBC to send the data as
binary data to PostgreSQL. Without the wrapper, mxODBC will send
the data as text data and since it is not possible to store
binary data as UTF-8 you get the errors you are seeing.

What puzzles me is hinted at in the last sentence: why does Postgres 8.4.1 (though apparently not 8.1.4) try to interpret the bytes as UTF8 when they are being sent to a column that is typed as bytea?

 

I apologise if this posting is excessively long, but I like to understand the reasons for things, and others may find the information useful.