Thread: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3


Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3

From:
"Karl O. Pinc"
Date:
Hi,

I'm writing a program which takes a text file from the user,
constructs an INSERT statement (the table and column names are not
known in advance), and uses the constructed statement
to insert rows of the text file into PG.

psycopg3 v3.1.18 (installed in a venv)
PG v15.5
Python 3.11.2
Debian 12.5 (stable, bookworm)

Should the user supply text that cannot be converted to the
server's encoding I want to report exactly where in the data
the problem lies.  The sample code below contains (some of)
what I want to report when there is a problem.

Because I want to know more than just "this particular
row of data caused a problem", after an error is
discovered the code below calls a psycopg3 dumper
on each data value to try to determine which data
value is problematic.

This does not work.  What is wrong with what I'm doing
and how do I do what I want?  (And how am I supposed to
know why this does not work and what works?)  I call the
dumper because I want to rely on psycopg3's mechanisms
and not have to query the server for its encoding
and figure out the PG->Python encoding mappings myself.

Thanks for the help.


Setup:

Client side encoding is UTF-8.

create database test_latin1 template template0 encoding 'LATIN1' locale 'C';

-- In the latin1 db
create table foo (col1 text, col2 text, col3 text);


Mock up of desired code (only slightly mangled by my email client):
-----------------<snip>-----------
#!/usr/bin/python3
#
# Test handling of database insertion encoding exception

import psycopg
import psycopg.adapt
import sys

# Latin1 does not have the euro symbol
data_row = ("ascii", "euro € symbol", "more ascii")

# Connect to an existing database
with psycopg.connect("dbname=test_latin1 user=test") as conn:
    text_dumper = conn.adapters.get_dumper(
                      str, psycopg.adapt.PyFormat.TEXT)

    # Open a cursor to perform database operations
    with conn.cursor() as cur:
        try:
            cur.execute(
                """
                insert into foo (col1, col2, col3)
                  values (%s, %s, %s)
                """,
                data_row,
            )
        except UnicodeEncodeError:
            # Call the text dumper on each data value to find which
            # caused the problem and provide a detailed report.
            col = 1
            for col_val in data_row:
                try:
                    text_dumper.dump(col_val)
                except UnicodeEncodeError as err:
                    print(
                        f"Column number {col} ({col_val}) contains a"
                        " character"
                        f" ({col_val[err.start, err.end]}), position"
                        f" {err.start + 1} through {err.end}, that"
                        " cannot be converted to the server-side"
                        " encoding"
                        f" ({err.encoding})"
                    )
                    print(str(err))
                    sys.exit()
                col += 1
            print("Cannot find cause of error")
            sys.exit()

print("Unexpectedly got no error")
sys.exit()
-----------------<snip>-----------

The relevant traceback (after the traceback of the initial
exception that is caught) is:

Traceback (most recent call last):
  File "/home/kop/projects/devel/test_encoding.py", line 32, in <module>
    text_dumper.dump(col_val)
TypeError: _StrDumper.dump() missing 1 required positional argument: 'obj'

Regards,

Karl <kop@karlpinc.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein



Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3

From:
Daniele Varrazzo
Date:
Hello Karl,

On Wed, 14 Feb 2024 at 02:37, Karl O. Pinc <kop@karlpinc.com> wrote:

> This does not work.  What is wrong with what I'm doing
> and how do I do what I want?  (And how am I supposed to
> know why this does not work and what works?)  I call the
> dumper because I want to rely on psycopg3's mechanisms
> and not have to query the server for its encoding
> and figure out the PG->Python encoding mappings myself.

Keep in mind that you are playing with objects that are somewhat
internal, so it wouldn't be impossible that these interfaces will
change in the future. It's not planned at the moment and it wouldn't
happen in a minor version anyway.

However, the main problem I see there is that
`conn.adapters.get_dumper()` returns a class. If you want a dumper you
must instantiate it. The following works as you expect:

    >>> conn.execute("set client_encoding to 'latin1'")
    >>> dumper = conn.adapters.get_dumper(str, psycopg.adapt.PyFormat.TEXT)(str, conn)
    >>> dumper.dump('€')
    ...
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)

Note however that if you just want to know the Python codec you can
find it in `conn.info.encoding`
(https://www.psycopg.org/psycopg3/docs/api/objects.html#psycopg.ConnectionInfo.encoding):

    >>> conn.info.encoding
    'iso8859-1'
    >>> "€".encode(conn.info.encoding)
    ...
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
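To pinpoint the failing column without instantiating a dumper at all, the per-column check can be factored into a plain function that only needs the codec name; in real code you would pass it `conn.info.encoding` (a sketch, not tested against a live connection):

```python
def find_bad_column(row, encoding):
    """Return (index, UnicodeEncodeError) for the first value in row
    that cannot be encoded, or None if every value encodes cleanly."""
    for i, val in enumerate(row):
        try:
            val.encode(encoding)
        except UnicodeEncodeError as err:
            return i, err
    return None
```

With your sample row this would report column index 1, and `err.start`/`err.end` give the exact character positions.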

Hope this helps

-- Daniele



Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3

From:
"Karl O. Pinc"
Date:
Hi Daniele,

On Wed, 14 Feb 2024 15:30:33 +0100
Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:

> Note however that if you just want to know the Python codec you can
> find it in `conn.info.encoding`
> (https://www.psycopg.org/psycopg3/docs/api/objects.html#psycopg.ConnectionInfo.encoding):
>
>     >>> conn.info.encoding
>     'iso8859-1'
>     >>> "€".encode(conn.info.encoding)
>     ...
>     UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)

Thanks very much for the help.  Working directly with the encoding
of the server side, translated to python, is indeed a more
direct approach.

I did not use conn.info.encoding because the docs say that it
contains the _client_ encoding, not the server-side encoding
used to store the db content.

From the link above:

```
encoding

The Python codec name of the connection’s client encoding.

The value returned is always normalized to the Python codec name:

conn.execute("SET client_encoding TO LATIN9")
conn.info.encoding
'iso8859-15'
```
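As an aside, the normalization the excerpt describes can be reproduced from Python's codec registry alone, no connection involved (this is plain stdlib behaviour, not psycopg):

```python
import codecs

# Python aliases LATIN9 to the iso8859-15 codec, matching the
# normalized name shown in the psycopg docs excerpt above.
print(codecs.lookup("latin9").name)   # iso8859-15
print(codecs.lookup("latin1").name)   # iso8859-1
```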

Confirming the encodings, connecting to the "latin1" db with psql shows:

```
$ psql -U kop latin1
psql (15.5 (Debian 15.5-0+deb12u1))
Type "help" for help.

kop_latin1=> show client_encoding;
 client_encoding
-----------------
 UTF8
(1 row)

kop_latin1=> show server_encoding;
 server_encoding
-----------------
 LATIN1
(1 row)
```

But, conn.info.encoding does return iso8859-1.

So I think your documentation has confused client
and server in this case.  If you can confirm this
for me I'll go ahead and use conn.info.encoding.

Thanks again.

Regards,

Karl <kop@karlpinc.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein



Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3

From:
Daniele Varrazzo
Date:
Hello,

On Wed, 14 Feb 2024 at 16:42, Karl O. Pinc <kop@karlpinc.com> wrote:

> I did not use conn.info.encoding because the docs say that it
> contains the _client_ encoding, not the server-side encoding
> used to store the db content.
> [...]
> Confirming the encodings, connecting to the  "latin1" db with psql shows:

> kop_latin1=> show client_encoding;
>  UTF8
>
> kop_latin1=> show server_encoding;
>  LATIN1
>
> But, conn.info.encoding does return iso8859-1.
>
> So I think your documentation has confused client
> and server in this case.  If you can confirm this
> for me I'll go ahead and use conn.info.encoding.

No, I am pretty sure that this is the client encoding that is
reported. It comes from here:


https://github.com/psycopg/psycopg/blob/ef6941df5b94997f79b429347c5d9b84e600bdd3/psycopg/psycopg/_encodings.py#L100-L101

which is a wrapper for PQparameterStatus:
https://www.postgresql.org/docs/current/libpq-status.html#LIBPQ-PQPARAMETERSTATUS
(so that the setting can be retrieved without running a query).

Maybe the way you are connecting via psql sets the client_encoding?
Can you try to get the result of `SHOW client_encoding` from psycopg?

From psycopg PoV, the client encoding is more important, because it's
how strings must be encoded to send them to the server; the server
encoding is relatively less important. So what you can actually store
is the smallest set of characters between server encoding and client
encoding. What you could do is to set the client encoding equal to the
server's:

    SELECT set_config('client_encoding', current_setting('server_encoding'), false);

and then proceed using `conn.info.encoding`.

Cheers

-- Daniele



Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3

From:
"Karl O. Pinc"
Date:
On Wed, 14 Feb 2024 19:28:57 +0100
Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:

> On Wed, 14 Feb 2024 at 16:42, Karl O. Pinc <kop@karlpinc.com> wrote:
>
> > I did not use conn.info.encoding because the docs say that it
> > contains the _client_ encoding, not the server-side encoding
> > used to store the db content.
> > [...]
> > Confirming the encodings, connecting to the  "latin1" db with psql
> > shows:
>
> > kop_latin1=> show client_encoding;
> >  UTF8
> >
> > kop_latin1=> show server_encoding;
> >  LATIN1
> >
> > But, conn.info.encoding does return iso8859-1.
> >
> > So I think your documentation has confused client
> > and server in this case.

> No, I am pretty sure that this is the client encoding that is
> reported.

You are right.

> Maybe the way you are connecting via psql sets the client_encoding?
> Can you try to get the result of `SHOW client_encoding` from psycopg?

psycopg, as well as libpq called directly from C, reports LATIN1
as the client encoding.  More on this below.

> From psycopg PoV, the client encoding is more important, because it's
> how strings must be encoded to send them to the server; the server
> encoding is relatively less important. So what you can actually store
> is the smallest set of characters between server encoding and client
> encoding. What you could do is to set the client encoding equal to the
> server's:
>
>     SELECT set_config('client_encoding', current_setting('server_encoding'), false);
>
> and then proceed using `conn.info.encoding`.

This is what is happening:

Unless you open a connection with a connection string (etc.)
that does client_encoding=auto, or otherwise set a specific
client encoding, the connection defaults to setting the
client encoding to the server encoding.

This means to me that there are 2(+) ways to use psycopg.
You can have psycopg do all the encoding conversion, client
side, as is probably typical because I don't expect that
people are setting client_encoding.  In this case
you get a UnicodeEncodeError raised when there's a
problem.  Or you use client_encoding=auto, the server
does the converting, and you get a
psycopg.errors.UntranslatableCharacter when there's
a problem.

(I was surprised when I first wrote my test code to
get a UnicodeEncodeError instead of an
UntranslatableCharacter exception.)


This leads me to a problem with your documentation.
https://www.psycopg.org/psycopg3/docs/api/adapt.html#psycopg.adapt.Dumper.dump
Says:
abstract dump(obj: Any) → Union[bytes, bytearray, memoryview]

Convert the object obj to PostgreSQL representation.

When in fact dump() additionally converts the object to the
connection's client encoding.  To me "PostgreSQL representation"
means server encoding.  Even if it means something else to you,
it wouldn't hurt to be as specific as possible and write
"to the PostgreSQL representation, in the connection's
client encoding when this is text" to avoid ambiguity.  There's 2 things
going on here, encoding and "representation", seemingly
kinda separate.


Anyway, your docs talk "type" and "representation"
and I think it'd be helpful to talk "type", "representation"
and also "encoding" when the representation/presentation
is text.  I find the current docs unclear
when it comes to exactly what encoding or decoding might
happen where and what affects what can happen where.


Perhaps something as simple as, in the "Passing Parameters
to SQL Queries" section,
https://www.psycopg.org/psycopg3/docs/basic/params.html#passing-parameters-to-sql-queries
above the "Binary Parameters and Results" section,
https://www.psycopg.org/psycopg3/docs/basic/params.html#binary-parameters-and-results
you have a "Text Parameters and Results" section:

  Once upon a time, when you bought a computer it came
with only one set of characters and only one way to represent,
encode, each of those characters into exactly one byte.  These days
life is not so simple and character encoding is more complicated.

  Today there is no substitute for knowing the encoding of the
text your application obtains from the outside world.
This can be highly system dependent because when reading
files open()-ed as text, Python decodes (into UTF-8) the bytes read.
By default decoding from the system locale's character encoding.
And when writing files open()-ed as text Python encodes (from UTF-8)
the bytes written.  Again by default encoding into the
system locale's character encoding.

  Encoding is also application dependent.  Your application
might have to accept multiple text encodings, or at least
handle exceptions when given text in an unexpected encoding.

  No matter how you get your data, to put your data into
the database as text, its bytes must first have their external
encoding decoded to UTF-8.  Because Python strings are
UTF-8.  You must manage this initial conversion into Python
because only you know the encoding of the text supplied to your
application.  Although there are conventions, markers, etc.,
there is no way to automatically know an encoding just by
looking at the bytes input.

  Once in Python, psycopg converts the UTF-8 text to the database
connection's client encoding and the PostgreSQL server converts
that to the server encoding.  Reverse the entire process when
retrieving textual data from PostgreSQL and outputting it.

  All text is passed to and received from the server in the
connection's client encoding.  Unless the client_encoding
connection parameter is set, PostgreSQL defaults to the database's
encoding, the server encoding.  Setting client_encoding=auto
automatically sets the client encoding to the client
locale's character encoding.

  Adaptation converts all text to the client encoding when sent,
and back when received.  So when the client encoding is the
server encoding, all conversion is done on the client-side.
Any exceptions raised should the conversion fail are Python
UnicodeError exceptions, or one of its encode or decode
subclasses.  Whereas when client and server encoding differ,
encoding conversion is also done server-side.  An
UntranslatableCharacter exception is raised when server-side
conversion fails.  This is a subclass of the standard DBAPI
DataException.  Setting client_encoding=UTF8,
the same as Python's encoding, covers the final use-case
where all encoding conversion, except, possibly, the initial
reading of the text into Python, is done server-side.

  See also: PEP 686 & PEP 540.
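
Such a section might close with a short, self-contained illustration of the whole chain (my sketch, not proposed wording; iso8859-15 appears only because it contains the euro sign):

```python
# Bytes arrive from the outside world in some known external encoding.
raw = "euro € symbol".encode("utf-8")

# Step 1: the application decodes them into a Python str.
text = raw.decode("utf-8")

# Step 2: psycopg (or the application) encodes the str into the
# client encoding.  iso8859-15 (LATIN9) contains the euro sign,
# so this succeeds...
wire = text.encode("iso8859-15")

# ...while iso8859-1 (LATIN1) does not, which is exactly the
# failure my test program provokes:
try:
    text.encode("iso8859-1")
except UnicodeEncodeError as err:
    print(err.start, err.end)   # 5 6
```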


To be clear, I'm not an expert here.  But I'm not entirely
ignorant and this is my understanding.  It's important
to get the encoding right so I think it'd be good to
talk about it.  There's a lot of Python-language
explanation in the above.  And some PostgreSQL
specifics too.  All this might be extraneous
but if the Python or pg side is gotten wrong it's easy
to have odd, occasional, bugs when suddenly somebody
inputs an unexpected character.  So better to cover
the subject from end-to-end and provide an integrated
understanding.  This will only become more important
as UTF-8 becomes more prevalent, because even though
non-UTF-8 encodings will always be with us, there will
be less familiarity with these encodings as time passes
and encoding bugs will become even more mysterious.


What this means for my problem at hand is that calling
Dumper.dump() does me no good at all when it comes to
looking for the exact source of the exceptions raised when
converting between the client and server encoding.
dump() does not convert between client and server encodings.
And that means, please correct me if I'm wrong,
that psycopg does not presently expose
the server encoding so that I can do my own checking
and pinpoint the exact problem after catching an encoding
related exception.  (Short of using psycopg to query the
server and ask for the server encoding, or omitting
client_encoding when making a connection, etc.)
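
For now, a small helper that queries the server and translates the PostgreSQL name to a Python codec would do.  The mapping table below is partial and hand-made, covering only the encodings I care about; psycopg keeps a complete one internally (in the _encodings.py module linked earlier):

```python
import codecs

# Partial PostgreSQL-name -> Python-codec map; extend as needed.
PG_TO_PY = {
    "UTF8": "utf-8",
    "LATIN1": "iso8859-1",
    "LATIN9": "iso8859-15",
    "SQL_ASCII": "ascii",
}

def server_codec(conn):
    """Query the server encoding, return the normalized Python codec name."""
    pg_name = conn.execute("SHOW server_encoding").fetchone()[0]
    return codecs.lookup(PG_TO_PY[pg_name]).name
```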


FYI.  `man psql` says:
       If both standard input and standard output are a terminal, then psql
       sets the client encoding to “auto”, which will detect the appropriate
       client encoding from the locale settings (LC_CTYPE environment variable
       on Unix systems).

Thanks for the help.  Apologies for the lengthy reply.

Regards,

Karl <kop@karlpinc.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein



Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3

From:
Karsten Hilbert
Date:
Am Thu, Feb 15, 2024 at 11:45:15PM -0600 schrieb Karl O. Pinc:

>   Today there is no substitute for knowing the encoding of the
> text your application obtains from the outside world.
> This can be highly system dependent because when reading
> files open()-ed as text, Python decodes (into UTF-8) the bytes read.

Not quite. Python assumes the bytes in the file *are* encoded
by whatever encoding is passed to open() (including, if so
specified, UTF-8). It then decodes said bytes into *unicode code
points*. If we want them back as UTF-8 we need to encode them
as such.

> By default decoding from the system locale's character encoding.
> And when writing files open()-ed as text Python encodes (from UTF-8)

again, from unicode, that is:

    https://docs.python.org/3/howto/unicode.html
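
The distinction shows up immediately at the REPL: a str is a sequence of code points, and bytes exist only after an explicit encode:

```python
s = "€"                                   # one code point, U+20AC
assert ord(s) == 0x20AC
assert len(s) == 1                        # length in code points
assert len(s.encode("utf-8")) == 3        # three bytes in UTF-8
assert s.encode("iso8859-15") == b"\xa4"  # one byte in LATIN9
```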

>   No matter how you get your data, to put your data into
> the database as text, its bytes must first have their external
> encoding decoded to UTF-8.  Because Python strings are
> UTF-8.

unicode code points, but, yeah

>   Once in Python, psycopg converts the UTF-8 text to the database

unicode

> It's important to get the encoding right so I think it'd be
> good to talk about it.

+1

Karsten
--
GPG  40BE 5B0E C98E 1713 AFA6  5BC0 3BEA AC80 7D4F C89B