Discussion: handling unconvertible error messages


handling unconvertible error messages

From
Peter Eisentraut
Date:
Example: I have a database cluster initialized with --locale=ru_RU.UTF-8
(built with NLS).  Let's say for some reason, I have client encoding set
to LATIN1.  All error messages come back like this:

test=> select * from notthere;
ERROR:  character with byte sequence 0xd0 0x9e in encoding "UTF8" has no
equivalent in encoding "LATIN1"

There is no straightforward way for the client to learn that there is a
real error message, but it could not be converted.
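
To make the failure concrete, here is a minimal Python sketch of the same conversion done outside the server. It assumes the translated message starts with the Cyrillic word "ОШИБКА" (which matches the 0xd0 0x9e bytes in the output above); the rest of the message text and the helper name `convert_for_client` are illustrative and not taken from the PostgreSQL source.

```python
def convert_for_client(message: str, client_encoding: str) -> bytes:
    """Encode a server message for the client, failing on unmappable characters."""
    try:
        return message.encode(client_encoding)
    except UnicodeEncodeError as exc:
        # Report the first offending character as its byte sequence
        # in the source (UTF-8) encoding, similar to the server's wording.
        bad = message[exc.start].encode("utf-8")
        seq = " ".join(f"0x{b:02x}" for b in bad)
        raise ValueError(
            f'character with byte sequence {seq} in encoding "UTF8" '
            f'has no equivalent in encoding "{client_encoding.upper()}"'
        ) from exc


# Assumed translated text; only the leading Cyrillic letter matters here.
translated = 'ОШИБКА:  отношение "notthere" не существует'

try:
    convert_for_client(translated, "latin1")
    conversion_error = None
except ValueError as err:
    # The real message is lost; only the conversion failure is reported.
    conversion_error = str(err)

print(conversion_error)
```

The point of the sketch is that the client only ever sees the secondary error, with no hint of what the primary message was.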

I think ideally we could make this better in two ways:

1) Send the original error message untranslated.  That would require
saving the original error message in errmsg(), errdetail(), etc.  That
would be a lot of work for only the occasional use.  But it would also
facilitate an occasionally-requested feature of writing untranslated
error messages into the server log or the csv log, while sending
translated messages to the client (or some variant thereof).

2) Send an indication that there was an encoding problem.  Maybe a
NOTICE, or an error context?  Wiring all this into elog.c looks a bit
tricky, however.

Ideas?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: handling unconvertible error messages

From
Craig Ringer
Date:
On 25 July 2016 at 22:43, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> Example: I have a database cluster initialized with --locale=ru_RU.UTF-8
> (built with NLS).  Let's say for some reason, I have client encoding set
> to LATIN1.  All error messages come back like this:
>
> test=> select * from notthere;
> ERROR:  character with byte sequence 0xd0 0x9e in encoding "UTF8" has no
> equivalent in encoding "LATIN1"
>
> There is no straightforward way for the client to learn that there is a
> real error message, but it could not be converted.
>
> I think ideally we could make this better in two ways:
>
> 1) Send the original error message untranslated.  That would require
> saving the original error message in errmsg(), errdetail(), etc.  That
> would be a lot of work for only the occasional use.  But it would also
> facilitate an occasionally-requested feature of writing untranslated
> error messages into the server log or the csv log, while sending
> translated messages to the client (or some variant thereof).
>
> 2) Send an indication that there was an encoding problem.  Maybe a
> NOTICE, or an error context?  Wiring all this into elog.c looks a bit
> tricky, however.


We have a similar problem with the server logs. But there there's also an additional problem: if there isn't any character mapping issue we just totally ignore text encoding concerns and log in whatever encoding the client asked the backend to use into the log files. So log files can be a line-by-line mix of UTF-8, ISO-8859-1, and whatever other fun encodings someone asks for. There is *no* way to correctly read such a file since lines don't have any marking as to their encoding and no tools out there support line-by-line differently encoded text files anyway.
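
The mixed-encoding log problem can be reproduced in a few lines. This is only a sketch: the word "café" and the two encodings are arbitrary choices for illustration, not something taken from a real server log.

```python
# Two sessions log the same word using different client encodings.
line_utf8 = "café".encode("utf-8")
line_latin1 = "café".encode("latin-1")
log_bytes = line_utf8 + b"\n" + line_latin1 + b"\n"


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if the whole byte string is valid in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Reading the whole file as UTF-8 fails on the LATIN1 line ...
utf8_ok = decodes_cleanly(log_bytes, "utf-8")

# ... while reading it as LATIN1 never fails, but garbles the UTF-8 line.
as_latin1 = log_bytes.decode("latin-1").splitlines()

print(utf8_ok)
print(as_latin1)
```

Neither reading is correct for the file as a whole, and nothing in the bytes says which line used which encoding.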

I'm not sure how closely it ties in to the issue you mention, but I think it's at least related enough to keep in mind while considering the client_encoding issue.

I suggest (3) "log the message with unmappable characters masked". Though I would definitely like to be able to also send the raw original, along with a field indicating the encoding of the original since it won't be the client_encoding, since we need some way to get to the info.
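
A rough Python sketch of what (3) could look like follows. The function name `mask_unmappable` and the field names in the returned dictionary are hypothetical, chosen only to illustrate "masked message plus raw original plus its encoding"; the Russian text is an assumed translation.

```python
def mask_unmappable(message: str, server_encoding: str, client_encoding: str) -> dict:
    """Build a client-safe message and keep the raw original next to it."""
    return {
        # errors="replace" puts "?" in place of every unmappable character.
        "message": message.encode(client_encoding, errors="replace"),
        # Hypothetical extra fields: the raw bytes and the encoding they are in.
        "original": message.encode(server_encoding),
        "original_encoding": server_encoding,
    }


translated = 'ОШИБКА:  отношение "notthere" не существует'
result = mask_unmappable(translated, "utf-8", "latin1")
print(result["message"].decode("latin1"))
```

For a message that is entirely Cyrillic apart from the identifier, the masked form keeps only punctuation and the quoted name.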

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: handling unconvertible error messages

From
Kyotaro HORIGUCHI
Date:
Hello,

At Wed, 27 Jul 2016 19:53:01 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in
<CAMsr+YFL0b1886tMYF9RPeDdpWryG1cr8ew3pYfiXgrJofpHjA@mail.gmail.com>
> On 25 July 2016 at 22:43, Peter Eisentraut <peter.eisentraut@2ndquadrant.com
> > wrote:
> 
> > Example: I have a database cluster initialized with --locale=ru_RU.UTF-8
> > (built with NLS).  Let's say for some reason, I have client encoding set
> > to LATIN1.  All error messages come back like this:
> >
> > test=> select * from notthere;
> > ERROR:  character with byte sequence 0xd0 0x9e in encoding "UTF8" has no
> > equivalent in encoding "LATIN1"
> >
> > There is no straightforward way for the client to learn that there is a
> > real error message, but it could not be converted.
> >
> > I think ideally we could make this better in two ways:
> >
> > 1) Send the original error message untranslated.  That would require
> > saving the original error message in errmsg(), errdetail(), etc.  That
> > would be a lot of work for only the occasional use.  But it would also
> > facilitate an occasionally-requested feature of writing untranslated
> > error messages into the server log or the csv log, while sending
> > translated messages to the client (or some variant thereof).
> >
> > 2) Send an indication that there was an encoding problem.  Maybe a
> > NOTICE, or an error context?  Wiring all this into elog.c looks a bit
> > tricky, however.
> >
> >
> We have a similar problem with the server logs. But there there's also an
> additional problem: if there isn't any character mapping issue we just
> totally ignore text encoding concerns and log in whatever encoding the
> client asked the backend to use into the log files. So log files can be a
> line-by-line mix of UTF-8, ISO-8859-1, and whatever other fun encodings
> someone asks for. There is *no* way to correctly read such a file since
> lines don't have any marking as to their encoding and no tools out there
> support line-by-line differently encoded text files anyway.

Cyrillic messages that hit such a conversion failure look like just a
series of '?' delimited by spaces. The same occurs for Japanese
(or CJK in general, as a group of similar scripts), which contains
(almost) no letters compatible with ASCII characters. We are
sometimes obliged to count the '?'s to identify messages
like the following:

> $ LANG=C postgres
> ?????????:  ??????? ?? ???? ?????????: 2016-07-28 14:08:32 JST
> ?????????:  ?????? ?? ????????? ???????????????? ?????? ????????
> ?????????:  ??????? ?? ?????? ????????? ???????????
> ?????????:  ??????? ??????? ??????????? ??????

> I'm not sure how closely it ties in to the issue you mention, but I think
> it's at least related enough to keep in mind while considering the
> client_encoding issue.

The issue this thread is about is a failure of the character-code
conversion performed by backend code, while the other one is
gettext(3)'s behavior according to LC_CTYPE.

I think that data in tables *must* follow the specified encoding
and should result in error for incompatible characters, but I
don't think so for messages from PostgreSQL.

We Japanese already get such log messages very early in
postmaster startup.

> LOG:  データベースシステムは 2016-07-28 14:14:06 JST にシャットダウンしました
> LOG:  MultiXact member wraparound protections are now enabled
> LOG:  データベースシステムの接続受付準備が整いました。

The reason for the second line is that it just doesn't have a
corresponding translation in ja.po. It is far more acceptable than the
sequence of question marks shown above.

> I suggest (3) "log the message with unmappable characters masked". Though I
> would definitely like to be able to also send the raw original, along with
> a field indicating the encoding of the original since it won't be the
> client_encoding, since we need some way to get to the info.

So, I don't think (3) will do much for these
languages. I prefer (1) for this issue. Putting aside the log
issue, the error system of PostgreSQL is already doing a very similar
thing in err_sendstring for error-recursion cases.

It seems possible to add silent fallback for conversion-failure
there.
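
A silent fallback of this kind might look roughly like the following Python sketch. It is not the code in elog.c; the function name `message_for_client` is hypothetical, and it assumes both the translated and the untranslated text are still available at send time, which is exactly the extra bookkeeping (1) asks for.

```python
def message_for_client(translated: str, untranslated: str, client_encoding: str) -> bytes:
    """Send the translated message if it converts, else the untranslated one."""
    try:
        return translated.encode(client_encoding)
    except UnicodeEncodeError:
        # Silent fallback: untranslated messages are essentially ASCII,
        # so they convert to any client encoding; stray non-ASCII
        # characters (e.g. in identifiers) are masked rather than fatal.
        return untranslated.encode(client_encoding, errors="replace")


translated = 'ОШИБКА:  отношение "notthere" не существует'
untranslated = 'ERROR:  relation "notthere" does not exist'

print(message_for_client(translated, untranslated, "latin1"))
print(message_for_client(translated, untranslated, "cp1251"))
```

With LATIN1 the client gets the English text; with a Cyrillic-capable encoding such as Windows-1251 it still gets the translation.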

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: handling unconvertible error messages

From
Victor Wagner
Date:
On Mon, 25 Jul 2016 10:43:44 -0400
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

> Example: I have a database cluster initialized with
> --locale=ru_RU.UTF-8 (built with NLS).  Let's say for some reason, I
> have client encoding set to LATIN1.  All error messages come back
> like this:
> 
> test=> select * from notthere;  
> ERROR:  character with byte sequence 0xd0 0x9e in encoding "UTF8" has
> no equivalent in encoding "LATIN1"
> 
> There is no straightforward way for the client to learn that there is
> a real error message, but it could not be converted.

Actually, the situation is a bit worse. There is at least one case where
the error message arrives unreadable at the client even if the encodings are
compatible.

For example, the server default locale is ru_RU.UTF-8 and the client requests
encoding WIN1251, which is able to handle Cyrillic.

If an error occurs during processing of the StartupMessage protocol message,
e.g. the client requests a connection to a nonexistent database, the
ErrorResponse would contain the message in the server default locale,
despite the client encoding being specified in the StartupMessage.

If the session is correctly established with such parameters, error
messages are displayed correctly.

I haven't yet investigated whether it is just delayed initialization of
the backend locale system, or whether the backend is not yet forked at the time
this message is generated and the wrongly encoded message is sent by the
postmaster.






Re: handling unconvertible error messages

From
Victor Wagner
Date:
On Mon, 25 Jul 2016 10:43:44 -0400
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

> There is no straightforward way for the client to learn that there is
> a real error message, but it could not be converted.
> 
> I think ideally we could make this better in two ways:
> 
> 1) Send the original error message untranslated.  That would require
> saving the original error message in errmsg(), errdetail(), etc.  That
> would be a lot of work for only the occasional use.  But it would also
> facilitate an occasionally-requested feature of writing untranslated
> error messages into the server log or the csv log, while sending
> translated messages to the client (or some variant thereof).
> 
> 2) Send an indication that there was an encoding problem.  Maybe a
> NOTICE, or an error context?  Wiring all this into elog.c looks a bit
> tricky, however.
> 
> Ideas?

I think there are two more ways:

(3 was in Craig's message)

4. At session startup, try to reinitialize the LC_MESSAGES locale
category with the combination
of the server (or better, client-sent) language and region and the
client-supplied encoding, and if this fails, use untranslated error
messages. Obviously, an attempt to set the locale to ru_RU.ISO8859-1 would fail,
so if a client asked a server with the ru_RU.UTF-8 default locale to use
LATIN1 encoding, the server would fall back to untranslated messages.

This approach would have problems on Windows, where the locale is strictly
tied to the ANSI encoding of the given language/territory. Even if we
made UTF-8 a special case, an attempt to connect with encoding KOI8 or
LATIN5 to a Windows PostgreSQL server running in the
Russian_Russia.1251 locale would result in the fallback to untranslated
messages. But I think that this case is marginal, and it is better to present
untranslated messages to the people (or applications) that require a
non-default 8-bit encoding, even if it is possible to represent
translated messages in this encoding, than to present unreadable
translated messages to anybody.

5. Use transliteration in case of an encoding problem. Some iconv
implementations (such as Linux glibc iconv and GNU portable libiconv)
support a //TRANSLIT suffix for the encoding and, if this suffix is specified,
replace unrepresentable symbols with a phonetically similar approximation.
I don't know how well it would work for Japanese, but for Russian it is
definitely better than lots of question marks.
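
To give a feel for option 5, here is a small Python sketch of transliteration on conversion failure. It does not call iconv; the `TRANSLIT` table is a hand-written stand-in covering only a few Cyrillic letters, and the output of a real //TRANSLIT conversion may well differ.

```python
# Hand-written, partial stand-in for an iconv //TRANSLIT table (an assumption).
TRANSLIT = {"О": "O", "Ш": "SH", "И": "I", "Б": "B", "К": "K", "А": "A"}


def transliterate(message: str, client_encoding: str) -> bytes:
    """Replace characters the client encoding lacks with a Latin approximation."""
    out = []
    for ch in message:
        try:
            ch.encode(client_encoding)
            out.append(ch)  # representable as-is
        except UnicodeEncodeError:
            # Unknown characters still degrade to "?".
            out.append(TRANSLIT.get(ch, "?"))
    return "".join(out).encode(client_encoding)


print(transliterate("ОШИБКА: notthere", "latin1"))
```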





Re: handling unconvertible error messages

From
Tom Lane
Date:
Victor Wagner <vitus@wagner.pp.ru> writes:
> If error occurs during processing of StartMessage protocol message,
> i.e. client request connection to unexisting database,
> ErrorResponse would contain message in the server default locale,
> despite of client encoding being specified in the StartMessage.

Yeah.  I'm inclined to think that we should reset the message locale
to C as soon as we've forked away from the postmaster, and leave it
that way until we've absorbed settings from the startup packet.
Sending messages of this sort in English isn't great, but it's better
than sending completely-unreadable ones.  Or is that just my
English-centricity showing?
        regards, tom lane



Re: handling unconvertible error messages

From
Victor Wagner
Date:
On Thu, 04 Aug 2016 09:42:10 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Victor Wagner <vitus@wagner.pp.ru> writes:
> > If error occurs during processing of StartMessage protocol message,
> > i.e. client request connection to unexisting database,
> > ErrorResponse would contain message in the server default locale,
> > despite of client encoding being specified in the StartMessage.
>
> Yeah.  I'm inclined to think that we should reset the message locale
> to C as soon as we've forked away from the postmaster, and leave it
> that way until we've absorbed settings from the startup packet.
> Sending messages of this sort in English isn't great, but it's better
> than sending completely-unreadable ones.  Or is that just my
> English-centricity showing?

From my Russian point of view, English messages are definitely better
than a transliteration of Russian with Latin letters (although it is
not completely unreadable), not to mention the wrong encoding or lots of
question marks.

Actually, if this response is sent after the backend has been forked, the
problem can probably be fixed in a better way: the StartupMessage contains
information about the desired client encoding, so this information just
needs to be processed earlier than any other information from this
message that can cause errors (such as the database name).

If these errors are sent from the postmaster itself, things are worse,
because I don't think the locale subsystem is designed to be
reinitialized many times in the same process.
But the postmaster itself can use non-localized messaging. Its messages in
the logs are typically analyzed by more or less qualified DBAs and
system administrators, not by end users.






Re: handling unconvertible error messages

From
Tom Lane
Date:
Victor Wagner <vitus@wagner.pp.ru> writes:
> On Thu, 04 Aug 2016 09:42:10 -0400
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah.  I'm inclined to think that we should reset the message locale
>> to C as soon as we've forked away from the postmaster, and leave it
>> that way until we've absorbed settings from the startup packet.

> Really, if this response is sent after backend has been forked, problem
> probably can be easily fixed better way - StartupMessage contain
> information about desired client encoding, so this information just
> should be processed earlier than any other information from this
> message, which can cause errors (such as database name).

I think that's wishful thinking.  There will *always* be errors that
come out before we can examine the contents of the startup message.
Moreover, until we've done authentication, we should be very wary of
applying client-specified settings at all: they might be malicious.
        regards, tom lane



Re: handling unconvertible error messages

From
Victor Wagner
Date:
On Thu, 04 Aug 2016 14:25:52 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Victor Wagner <vitus@wagner.pp.ru> writes:

> > Really, if this response is sent after backend has been forked,
> > problem probably can be easily fixed better way - StartupMessage
> > contain information about desired client encoding, so this
> > information just should be processed earlier than any other
> > information from this message, which can cause errors (such as
> > database name).  
> 
> I think that's wishful thinking.  There will *always* be errors that
> come out before we can examine the contents of the startup message.
> Moreover, until we've done authentication, we should be very wary of
> applying client-specified settings at all: they might be malicious.

I think that this case can be an exception from the rule "don't apply
settings from the untrusted source".

Let's consider possible threat model:

1. We anyway parse StartupMessage before authentication. There is
nothing we can do with it, so parser should be robust enough, to handle
untrusted input. As I can see from the quick glance, it is.

2. When encoding name is parsed, it is used to search in the array of
supported encoding. No possible attack here - either it is valid or not.

3. As far as I know, we don't allow client to change language, only
encoding, so it is not even possible that attacker could make messages
in the log unreadable for the system administrator.
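
Point 2 can be illustrated with a short Python sketch. The table below is an illustrative subset rather than the server's real list, and `parse_client_encoding` is a hypothetical name; the sketch only shows why a lookup in a fixed table has exactly two outcomes.

```python
from typing import Optional

# Illustrative subset; the server's real table of encodings is much longer.
SUPPORTED_ENCODINGS = ("UTF8", "LATIN1", "WIN1251", "KOI8R")


def parse_client_encoding(raw: str) -> Optional[str]:
    """Return the canonical encoding name, or None if it is not in the table."""
    name = raw.strip().upper()
    # The untrusted string is only compared against known names;
    # it is never interpreted or executed.
    return name if name in SUPPORTED_ENCODINGS else None


print(parse_client_encoding("latin1"))
print(parse_client_encoding("'; DROP TABLE users; --"))
```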

So, if we fix the problem reported by Peter Eisentraut at the
beginning of this thread, and fall back to untranslated messages
whenever the client-requested encoding is unable to represent messages in
the server default language, this solution would be no worse than
your solution.

There would be fallback to C locale in any case of doubt, but in the
case when NLS messages can be made readable, they would be readable.


Really, there is at least one case, when fallback to C locale should be
done unconditionally - a CancelRequest. In this case client cannot send
an encoding, so C locale should be used.

As far as I understand it is not the case with SSLRequest. Although it
doesn't contain encoding information as well as CancelRequest, errors
in subsequent SSL negotiations would be reported by client-side SSL
libraries, not by server.




Re: handling unconvertible error messages

From
Peter Eisentraut
Date:
On 8/4/16 2:45 AM, Victor Wagner wrote:
> 4. At the session startup try to reinitialize LC_MESSAGES locale
> category with the combination
> of the server (or better client-send) language and region and
> client-supplied encoding, and if this failed, use untranslated error
> message. Obviously, attempt to set locale to ru_RU.ISO8859-1 would fail.
> so, if client would ask server with ru_RU.UTF-8 default locale to use
> LATIN1 encoding, server would fallback to untranslated messages.

I think this is basically my solution (1), with the same problems.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: handling unconvertible error messages

From
Peter Eisentraut
Date:
On 8/4/16 9:42 AM, Tom Lane wrote:
> I'm inclined to think that we should reset the message locale
> to C as soon as we've forked away from the postmaster, and leave it
> that way until we've absorbed settings from the startup packet.
> Sending messages of this sort in English isn't great, but it's better
> than sending completely-unreadable ones.  Or is that just my
> English-centricity showing?

Well, most of the time this all works, only if there are different
client and server settings you might have problems.  We wouldn't want to
partially disable the NLS feature for the normal case.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: handling unconvertible error messages

From
Victor Wagner
Date:
On Fri, 5 Aug 2016 11:23:37 -0400
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

> On 8/4/16 9:42 AM, Tom Lane wrote:
> > I'm inclined to think that we should reset the message locale
> > to C as soon as we've forked away from the postmaster, and leave it
> > that way until we've absorbed settings from the startup packet.
> > Sending messages of this sort in English isn't great, but it's
> > better than sending completely-unreadable ones.  Or is that just my
> > English-centricity showing?  
> 
> Well, most of the time this all works, only if there are different
> client and server settings you might have problems.  We wouldn't want
> to partially disable the NLS feature for the normal case.
> 
There are cases, where client cannot tell server which encoding it
wants to use, and server cannot tell which encoding it uses, but it can
send error messages. For example, CancelRequest.

The only way to ensure that message is readable in this case is to fall
back to some encoding, definitely known by both client and server.
And for now it is US-ASCII. 

It is, as far as I understand, what Tom is proposing:
Fall back to the untranslated message at the beginning of session, and
return to NLS only when encoding is successfully negotiated between
client and server.

Maybe there is another solution: prepare the client to be able to
accept UTF-8 messages from server regardless of encoding, i.e. if
message starts with BOM marker (0xFEFF unicode char, EF BB BF byte
sequence in utf-8), interpret it as UTF-8. It would require client to
support some kind of encoding conversion, and in some 8-bit
environments pose problems with displaying these messages.
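
On the client side, the BOM idea could be sketched in Python as below. This is an assumption about how a client might behave, not existing libpq behavior, and `decode_server_message` is a hypothetical helper.

```python
import codecs


def decode_server_message(payload: bytes, client_encoding: str) -> str:
    """Decode a message, treating a leading UTF-8 BOM as an encoding override."""
    if payload.startswith(codecs.BOM_UTF8):  # the bytes EF BB BF
        return payload[len(codecs.BOM_UTF8):].decode("utf-8")
    return payload.decode(client_encoding)


marked = codecs.BOM_UTF8 + "ОШИБКА".encode("utf-8")
print(decode_server_message(marked, "latin1"))
print(decode_server_message(b"ERROR", "latin1"))
```

One caveat of this design: in an 8-bit encoding such as LATIN1 the bytes EF BB BF are themselves valid characters, so a message that legitimately starts with them would be misread.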






Re: handling unconvertible error messages

From
Victor Wagner
Date:
On Fri, 5 Aug 2016 11:21:44 -0400
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

> On 8/4/16 2:45 AM, Victor Wagner wrote:
> > 4. At the session startup try to reinitialize LC_MESSAGES locale
> > category with the combination
> > of the server (or better client-send) language and region and
> > client-supplied encoding, and if this failed, use untranslated error
> > message. Obviously, attempt to set locale to ru_RU.ISO8859-1 would
> > fail. so, if client would ask server with ru_RU.UTF-8 default
> > locale to use LATIN1 encoding, server would fallback to
> > untranslated messages.  
> 
> I think this is basically my solution (1), with the same problems.
> 



I think, that there is a big difference from server point of view.

You propose that both translated and untranslated message should be
passed around inside backend. It has some benefits, but requires
considerable reworking of server internals.

My solution doesn't require keeping both original message and
translated one during all call stack unwinding. It just checks if
combination of language and encoding is supported by the NLS subsystem,
and if not, falls back to untranslated message  for entire session.

It is much more local change and is comparable by complexity with one,
proposed by Tom Lane.
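
The per-session check described here might be sketched in Python as follows. The function names are hypothetical, and the locale probe is passed in as a parameter so the decision logic can be shown without depending on which locales happen to be installed. The default probe uses `locale.setlocale`, which is process-global state and is restored afterwards; `LC_MESSAGES` is not available on every platform (for example Windows).

```python
import locale
from typing import Callable


def locale_is_available(name: str) -> bool:
    """Return True if the C library accepts `name` for LC_MESSAGES."""
    saved = locale.setlocale(locale.LC_MESSAGES)  # query the current setting
    try:
        locale.setlocale(locale.LC_MESSAGES, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_MESSAGES, saved)  # leave process state unchanged


def choose_message_locale(
    language: str,
    client_encoding: str,
    probe: Callable[[str], bool] = locale_is_available,
) -> str:
    """Pick the session's message locale, or "C" for untranslated messages."""
    candidate = f"{language}.{client_encoding}"
    return candidate if probe(candidate) else "C"


# Usage with a fake probe, so the result does not depend on installed locales.
installed = {"ru_RU.UTF-8", "ru_RU.CP1251"}
print(choose_message_locale("ru_RU", "CP1251", installed.__contains__))
print(choose_message_locale("ru_RU", "ISO8859-1", installed.__contains__))
```

The decision is made once per session, so no message has to carry both its translated and untranslated form.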






Re: handling unconvertible error messages

From
Kyotaro HORIGUCHI
Date:
At Mon, 8 Aug 2016 10:19:10 +0300, Victor Wagner <vitus@wagner.pp.ru> wrote in
<20160808101910.49beeed6@fafnir.local.vm>
> On Fri, 5 Aug 2016 11:21:44 -0400
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> 
> > On 8/4/16 2:45 AM, Victor Wagner wrote:
> > > 4. At the session startup try to reinitialize LC_MESSAGES locale
> > > category with the combination
> > > of the server (or better client-send) language and region and
> > > client-supplied encoding, and if this failed, use untranslated error
> > > message. Obviously, attempt to set locale to ru_RU.ISO8859-1 would
> > > fail. so, if client would ask server with ru_RU.UTF-8 default
> > > locale to use LATIN1 encoding, server would fallback to
> > > untranslated messages.  
> 
> > I think this is basically my solution (1), with the same problems.
> 
> I think, that there is a big difference from server point of view.
> 
> You propose that both translated and untranslated message should be
> passed around inside backend. It has some benefits, but requires
> considerable reworking of server internals.

Agreed.

> My solution doesn't require keeping both original message and
> translated one during all call stack unwinding. It just checks if
> combination of language and encoding is supported by the NLS subsystem,
> and if not, falls back to untranslated message  for entire session.

Looking at check_client_encoding(), the comment says the following.

| * If we are not within a transaction then PrepareClientEncoding will not
| * be able to look up the necessary conversion procs.  If we are still
| * starting up, it will return "OK" anyway, and InitializeClientEncoding
| * will fix things once initialization is far enough along.  After

We should overcome this to implement a startup-time check for
conversion procs.

> It is much more local change and is comparable by complexity with one,
> proposed by Tom Lane.

I'm not sure what messages may be raised before authentication
but it can be a more generic-solution. (Adding check during
on-session.)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: handling unconvertible error messages

From
Kyotaro HORIGUCHI
Date:
At Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20160808.171821.100221089.horiguchi.kyotaro@lab.ntt.co.jp>
> Looking at check_client_encoding(), the comment says as following.
> 
> | * If we are not within a transaction then PrepareClientEncoding will not
> | * be able to look up the necessary conversion procs.  If we are still
> | * starting up, it will return "OK" anyway, and InitializeClientEncoding
> | * will fix things once initialization is far enough along.  After
> 
> We should overcome this to implement a startup-time check for
> conversion procs.

Somewhat wrong. The core problem is that the procedures offered by
PrepareClientEncoding are chosen only on an encoding->encoding
basis, without considering character set compatibility. So, currently
this is not detectable before actually doing conversion of a
character stream.

Conversely, providing a means to check character-set
compatibility will naturally fix this. A check at session startup
(an out-of-transaction check?) is still another problem.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: handling unconvertible error messages

From
Victor Wagner
Date:
On Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> 
> I'm not sure what messages may be raised before authentication
> but it can be a more generic-solution. (Adding check during
> on-session.)

Definitely, there can be an authentication error message, which is sent if
authentication didn't happen. Also, as far as I understand, the message
"Database ... doesn't exist" is also sent before authentication.


Also, there are CancelRequests, where normal authentication is not
used, and the server key provided in another session is used instead.






Re: handling unconvertible error messages

From
Kyotaro HORIGUCHI
Date:
At Mon, 08 Aug 2016 18:11:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20160808.181154.252052789.horiguchi.kyotaro@lab.ntt.co.jp>
> At Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
> <20160808.171821.100221089.horiguchi.kyotaro@lab.ntt.co.jp>
 
> > Looking at check_client_encoding(), the comment says as following.
> > 
> > | * If we are not within a transaction then PrepareClientEncoding will not
> > | * be able to look up the necessary conversion procs.  If we are still
> > | * starting up, it will return "OK" anyway, and InitializeClientEncoding
> > | * will fix things once initialization is far enough along.  After
> > 
> > We should overcome this to implement a startup-time check for
> > conversion procs.
> 
> Somewhat wrong. The core problem is that the procedures offered by
> PrepareClientEncoding are chosen only on an encoding->encoding
> basis, without considering character set compatibility. So, currently
> this is not detectable before actually doing conversion of a
> character stream.
> 
> Conversely, providing a means to check character-set
> compatibility will naturally fix this. A check at session startup
> (an out-of-transaction check?) is still another problem.

I don't see charset compatibility to be easily detectable,
because locale (or character set) is not a matter of PostgreSQL
(except for some encodings bound to one particular character
set)... So the conversion fallback might be the only available
solution.

Thoughts?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: handling unconvertible error messages

From
Victor Wagner
Date:
On Mon, 08 Aug 2016 18:11:54 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> At Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time), Kyotaro
> HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
> <20160808.171821.100221089.horiguchi.kyotaro@lab.ntt.co.jp>
> 
> Somewhat wrong. The core problem is that the procedures offered by
> PrepareClientEncoding are chosen only on an encoding->encoding
> basis, without considering character set compatibility. So, currently
> this is not detectable before actually doing conversion of a
> character stream.

Yes, my idea was to check language/encoding compatibility: make sure
that NLS messages can be represented in the client-specified encoding
in a readable way. As far as I know, there is no platform-independent
bulletproof way to do so.

On Unix you can try to initialize locale with given language and given
encoding, but it can fail even if encoding is compatible with language,
simply because corresponding locale is not generated on this system.

But this seems to be a problem of system administration and can be left
out to local sysadmins.

Once you have correctly initialized LC_MESSAGES, you don't need
encoding conversion routines for the NLS messages. You can use
bind_textdomain_codeset function to provide messages in the
client-desired encoding. (but this can cause problems with server logs,
where messages from different sessions would come in different
encodings)

On Windows things are more complicated. There is just one ANSI code
page, associated to given language, and locale initialization would
fail with any other codepage, including utf-8.







Re: handling unconvertible error messages

From
Victor Wagner
Date:
On Mon, 08 Aug 2016 18:28:57 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> 
> I don't see charset compatibility to be easily detectable,

In the worst case we can hardcode explicit compatibility table.
There is limited set of languages, which have translated error messages,
and limited (albeit wide) set of encodings, supported by PostgreSQL. So
it is possible to define complete list of encodings, compatible with
some translation. And fall back to untranslated messages if client
encoding is not in this list.
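
As a rough illustration of such a table, here is a Python sketch. The language-to-encoding entries are assumptions made up for the example and have not been checked against the actual translations or conversion tables; `use_translated_messages` is a hypothetical name.

```python
# Hypothetical, unverified table: client encodings assumed able to
# represent each translation's messages.
COMPATIBLE_ENCODINGS = {
    "ru": {"UTF8", "WIN1251", "KOI8R", "ISO_8859_5", "WIN866"},
    "ja": {"UTF8", "EUC_JP", "SJIS"},
    "de": {"UTF8", "LATIN1", "LATIN9", "WIN1252"},
}


def use_translated_messages(language: str, client_encoding: str) -> bool:
    """True if translated messages are known to fit the client encoding."""
    # Languages missing from the table fall back to untranslated messages.
    return client_encoding.upper() in COMPATIBLE_ENCODINGS.get(language, set())


print(use_translated_messages("ru", "WIN1251"))
print(use_translated_messages("ru", "LATIN1"))
```

A positive list like this errs on the side of English: anything not explicitly listed gets untranslated messages.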

> because locale (or character set) is not a matter of PostgreSQL
> (except for some encodings bound to one particular character
> set)... So the conversion-fallback might be a only available
> solution.

Conversion fallback may be a solution for data. For NLS-messages I think
it is better to fall back to English (untranslated) messages than use of
transliteration or something alike.

I think that for now we can assume that the best effort is already done
for the data, and think how to improve situation with messages.




Re: handling unconvertible error messages

From
Kyotaro HORIGUCHI
Date:
Hello,

(I've recovered the lost Cc recipients so far)

At Mon, 8 Aug 2016 12:52:11 +0300, Victor Wagner <vitus@wagner.pp.ru> wrote in
<20160808125211.1361cc0f@fafnir.local.vm>
> On Mon, 08 Aug 2016 18:28:57 +0900 (Tokyo Standard Time)
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > 
> > I don't see charset compatibility to be easily detectable,
> 
> In the worst case we can hardcode explicit compatibility table.

We could have the language lists compatible with some
language-bound encodings.  For example, LATIN1 (ISO/IEC 8859-1),
according to Wikipedia
(https://en.wikipedia.org/wiki/ISO/IEC_8859-1)

According to the list, we might have the following compatibility
list of locales, maybe without region.

{{"UTF8", "LATIN1"}, "af", "sq", "eu", "da", "en", "fo", "en"}... and so.

The biggest problem with this is that at least *I* cannot confirm the
validity of the list, both regarding the completeness of LATIN1's
coverage of all languages in the list and the omission of any
possibly coverable language. Nonetheless, we could use such lists
if we accept the possible imperfection, which would eventually
result in the original error (conversion failure) or an excess
fallback for possibly convertible languages, but unfortunately the
latter would be unacceptable for table data.

> There is limited set of languages, which have translated error messages,
> and limited (albeit wide) set of encodings, supported by PostgreSQL. So

Yes, we could have a negative list of languages already known to be
incompatible.

{{"UTF8", "LATIN1"}, "ru", .. er..what else?}

ISO 639-1 seems to have about 190 languages, and most of them are
apparently incompatible with the LATIN1 encoding. It doesn't seem
good to me to have a haphazardly made negative list.

> it is possible to define complete list of encodings, compatible with
> some translation. And fall back to untranslated messages if client
> encoding is not in this list.
> 
> > because locale (or character set) is not a matter of PostgreSQL
> > (except for some encodings bound to one particular character
> > set)... So the conversion-fallback might be a only available
> > solution.
> 
> Conversion fallback may be a solution for data. For NLS-messages I think
> it is better to fall back to English (untranslated) messages than use of
> transliteration or something alike.

I suppose that 'fallback' means "try the conversion, then use
English if it fails", so I think it is suitable for messages
rather than for data, and it doesn't need any a priori information
about compatibility. It seems to me that PostgreSQL refuses to
ignore or conceal conversion errors and return broken or unwanted
byte sequences for data.  Things are different for error messages:
it is preferable for them to be readable somehow than to be lost
entirely.
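
The "try, then use English" behaviour can be sketched as follows, in Python for brevity. The function name encode_message and both sample strings are assumptions for illustration; in particular the Russian text is only an example, not a string taken from the message catalog.

```python
def encode_message(translated, untranslated, client_codec):
    """Encode the translated message, or the English one if that fails."""
    try:
        return translated.encode(client_codec)
    except UnicodeEncodeError:
        # The untranslated message is assumed to be plain ASCII, which the
        # codecs used in this sketch can all represent.
        return untranslated.encode(client_codec)


# Illustrative messages (assumed, not taken from the real catalog).
russian = 'отношение "notthere" не существует'
english = 'relation "notthere" does not exist'

to_latin1 = encode_message(russian, english, "latin-1")  # falls back
to_koi8 = encode_message(russian, english, "koi8_r")     # keeps Russian
```

No a priori knowledge of language/encoding compatibility is needed here: the failed conversion itself triggers the fallback.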

> I think that for now we can assume that the best effort is already done
> for the data, and think how to improve situation with messages.

Is there any source from which to learn the compatibility of each
combination of language and encoding? We probably need some basis
for the list.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: handling unconvertible error messages

From
Victor Wagner
Date:
On Wed, 10 Aug 2016 11:08:43 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> Hello,
> 
> (I've recovered the lost Cc recipients so far)
> 
> At Mon, 8 Aug 2016 12:52:11 +0300, Victor Wagner <vitus@wagner.pp.ru>
> wrote in <20160808125211.1361cc0f@fafnir.local.vm>
> > On Mon, 08 Aug 2016 18:28:57 +0900 (Tokyo Standard Time)
> > Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:  
> > > 
> > > I don't see charset compatibility to be easily detectable,  
> > 
> > In the worst case we can hardcode explicit compatibility table.  
> 
> We could have the language lists compatible with some
> language-bound encodings.  For example, LATIN1 (ISO/IEC 8859-1),
> according to Wikipedia
> (https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
> 
> According to the list, we might have the following compatibility
> list of locales, maybe without region.
> 
> {{"UTF8", "LATIN1"}, "af", "sq", "eu", "da", "en", "fo", "en"}... and
> so.
> 
> The biggest problem for this is at least *I* cannot confirm the
> validity of the list. Both about perfectness of coverage of

I think that people from the localization team can. At least the authors
of a particular translation can tell which encodings support their language.

> ISO639-1 seems to have about 190 languages and most of them are

We don't have 190 message catalog translations in PostgreSQL.
So the problem with encodings for messages is quite limited.

> 
> I suppose that 'fallback' means "have a try then use English if
> failed" so I think it is sutable rather for message, not for
> data, and it doesn't need any a priori information about

Yes, I'm talking about messages, not about encoding conversion for
data. In my experience, data in PostgreSQL is converted in a more or
less predictable way. Maybe it could be improved, but it is possible
to set up the client and server so that it does the right job.

The situation with messages, especially ones returned before session
establishment completes (or when it fails), is currently somewhat
worse.

> compatibility. It seems to me that PostgreSQL refuses to ignore

Alas, it does. At least in the example given by Peter Eisentraut at the
start of this thread.

> or conceal conversion errors and return broken or unwanted byte
> sequence for data.  Things are different for error messages, it
> is preferable to be anyyhow readable than totally abandoned.
> 
> > I think that for now we can assume that the best effort is already
> > done for the data, and think how to improve situation with
> > messages.  
> 
> Is there any source to know the compatibility for any combination
> of language vs encoding? Maybe we need a ground for the list.
> 
> regards,
> 




Re: handling unconvertible error messages

From
Vladimir Sitnikov
Date:
Victor>We don't have 190 message  catalog translations in the PostgreSQL.
Victor>So problem with encoding for messages is quite limited.

Even though the number of translations is limited, there's a problem when trying to tell one single-byte encoding from another.
It would be so much better if ServerErrorMessages included the encoding right in the message itself.

For pgjdbc, I've implemented a workaround that relies on the following:
1) It knows what "FATAL" looks like in several translations, and it knows the encodings commonly used with those translations. For instance, for Russian it tries the CP1251, KOI8, and ALT encodings. It encodes "ВАЖНО" (Russian for FATAL) in each of those three encodings and searches for the resulting byte sequence in the error message. If there's a match, the encoding is identified.
2) Unfortunately, that does not help for Japanese, as "FATAL" there is translated as "FATAL". So I hard-coded several typical words like "database", "user", "role" (see [1]), so if those byte sequences are present, the message is assumed to be in Japanese. It would be great if someone could review those, as I do not speak Japanese.
3) Then it tries different LATIN encodings.
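
Step 1 can be sketched like this. It is a Python illustration of the idea, not the pgjdbc code itself; the name guess_encoding, the sample message text, and the mapping of "KOI8" to Python's koi8_r codec and "ALT" to cp866 are assumptions.

```python
# "ВАЖНО" is the Russian translation of FATAL mentioned above.
MARKER = "ВАЖНО"
# Python codec names assumed to correspond to CP1251, KOI8 and ALT.
CANDIDATE_CODECS = ["cp1251", "koi8_r", "cp866"]


def guess_encoding(raw_message):
    """Return the first candidate codec whose encoding of MARKER occurs
    in raw_message, or None if no candidate matches."""
    for codec in CANDIDATE_CODECS:
        if MARKER.encode(codec) in raw_message:
            return codec
    return None


# Illustrative server message (assumed text), sent in KOI8-R.
raw = "ВАЖНО:  база данных не существует".encode("koi8_r")
codec = guess_encoding(raw)
text = raw.decode(codec) if codec else raw.decode("ascii", errors="replace")
```

The check works because the three encodings map the marker word to different byte sequences, so a match on one of them identifies the encoding of the whole message.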

Here's the commit

Kyotaro> Is there any source to know the compatibility for any combination
Kyotaro> of language vs encoding? Maybe we need a ground for the list.

I use "locale -a" for that.

For instance, for Japanese it prints the following on my machine (OS X 10.11.6):
locale -a | grep ja
ja_JP
ja_JP.eucJP
ja_JP.SJIS
ja_JP.UTF-8


Vladimir

Re: handling unconvertible error messages

From
Victor Wagner
Date:
On Sat, 13 Aug 2016 09:24:47 +0000
Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:

> Victor>We don't have 190 message  catalog translations in the
> Victor>PostgreSQL. So problem with encoding for messages is quite
> Victor>limited.
> 
> Even though the number of translations is limited, there's a problem
> when trying to tell one "one-byte-encoding" from another "one-byte"
> one. It would be so much better if ServerErrorMessages included
> encoding right in the message itself.

I think it is better to avoid such a problem and fix the system so that
the server never sends a message in an encoding different from the
client's. It is not the client's job to convert encodings.

In most cases the server knows which encoding the client requests from
the very first protocol message (if it is a startup message).
So the server can easily tell whether it is able to convert NLS messages
into the client's desired encoding, and if not, fall back to untranslated
messages.





--                                   Victor Wagner <vitus@wagner.pp.ru>



Re: handling unconvertible error messages

From
Vladimir Sitnikov
Date:
Victor>It is not a client job to convert encodings.

Of course.

However, there is a vast number of old PG versions deployed in the wild that send wrong data to clients.

This indeed makes for a bad user experience, so it is worth doing two things:
1) Implement a proper solution in future PostgreSQL versions (e.g. include the encoding name right in the error message).
2) Implement a workaround in current drivers, so that clients get proper error messages even when trying to connect to an unpatched server.

Vladimir

Re: handling unconvertible error messages

From
Tom Lane
Date:
Victor Wagner <vitus@wagner.pp.ru> writes:
> I think it is better to avoid such a problem and fix system so server
> would never send a message in the encoding, different from client one.

Don't hold your breath waiting for that to happen.

Quite aside from the question of whether we want to trust GUC settings
from the startup packet before we've authenticated the user, there's a
small problem that the server *can't* translate *any* encoding until
it's successfully connected to a database and is able to read the
pg_conversion catalog.

We might be able to institute some rule like "examine the startup
packet and see if it specifies a client_encoding different from what
we inherited from the postmaster.  If not, continue with current
behavior (send messages localized per postmaster's settings).  If so,
fall back to English messages/C locale until startup is complete."
This would preserve current functionality in cases where it actually,
er, functions, while giving something at least passable in the cases
that are broken today.
        regards, tom lane
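
The rule described above could look roughly like this, sketched in Python rather than the server's C. The function and parameter names are assumptions, and the plain case-insensitive comparison stands in for whatever encoding-name normalization the server would really use.

```python
def startup_message_locale(startup_params, postmaster_encoding,
                           postmaster_locale):
    """Pick the locale for messages sent before startup completes.

    startup_params: options from the startup packet, as a dict.
    Returns the postmaster's locale, or "C" for English messages.
    """
    requested = startup_params.get("client_encoding")
    if requested is None or requested.upper() == postmaster_encoding.upper():
        # Nothing different was requested: keep current behaviour.
        return postmaster_locale
    # A different client_encoding was requested and cannot be honoured
    # yet, so send untranslated messages until startup is complete.
    return "C"
```

For a postmaster running with ru_RU.UTF-8, a startup packet asking for LATIN1 would get "C", while one that asks for UTF8 or omits client_encoding would keep the Russian messages.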



Re: handling unconvertible error messages

From
Vladimir Sitnikov
Date:
Tom> while giving something at least passable in the cases
Tom> that are broken today.

Would you mind adding an explicit "encoding" field to the error message?
At least it would give a clear explanation of how to parse that message without resorting to a guess dance.

The biggest problem is that the client has no idea how to parse backend error messages. If implementing client_encoding properly is too hard at this point in time, then I would rather have an "encoding" field right in the startup error message.

That "encoding" field would enable sending properly localized messages in the future if "pre-connect client_encoding" were implemented somehow.
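
To illustrate what a client could do with such a field, here is a sketch in Python. An ErrorResponse body is a sequence of fields, each a one-byte code followed by a zero-terminated string, ended by a zero byte. The field code "e", the name-to-codec mapping, and the sample message are all hypothetical; no such field exists in the current protocol.

```python
ENCODING_FIELD = b"e"  # hypothetical field code, not in the real protocol
# Assumed mapping from PostgreSQL encoding names to Python codec names.
PG_TO_PYTHON = {"UTF8": "utf-8", "LATIN1": "latin-1",
                "KOI8R": "koi8_r", "WIN1251": "cp1251"}


def parse_error_fields(body):
    """Split an ErrorResponse body into {field code: raw bytes}."""
    fields = {}
    pos = 0
    while pos < len(body) and body[pos] != 0:
        end = body.index(b"\x00", pos + 1)  # terminator of this field
        fields[body[pos:pos + 1]] = body[pos + 1:end]
        pos = end + 1
    return fields


def decode_error(body):
    """Decode all fields using the encoding named in the message itself."""
    fields = parse_error_fields(body)
    name = fields.pop(ENCODING_FIELD, b"").decode("ascii")
    codec = PG_TO_PYTHON.get(name, "ascii")
    return {code.decode("ascii"): value.decode(codec, errors="replace")
            for code, value in fields.items()}


# Illustrative body (assumed text), with every string encoded in KOI8-R.
message = 'роль "bob" не существует'
body = (b"S" + "ВАЖНО".encode("koi8_r") + b"\x00" + b"C28000\x00"
        + b"eKOI8R\x00" + b"M" + message.encode("koi8_r") + b"\x00\x00")
decoded = decode_error(body)
```

With the encoding stated in the message, the client decodes it directly and needs no guessing by marker words.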

Vladimir

Re: handling unconvertible error messages

From
Victor Wagner
Date:
On Sat, 13 Aug 2016 12:02:30 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Victor Wagner <vitus@wagner.pp.ru> writes:
> > I think it is better to avoid such a problem and fix system so
> > server would never send a message in the encoding, different from
> > client one.  
> 
> Don't hold your breath waiting for that to happen.
> 
> Quite aside from the question of whether we want to trust GUC settings
> from the startup packet before we've authenticated the user, there's a

What's wrong with trusting this particular setting? I cannot think of
any meaningful exploit. There are a lot of HTTP servers out there
that typically accept connections from anywhere (which is seldom the
case for PostgreSQL servers) and trust the Accept-Charset and
Accept-Language client headers without any authentication.

There can be attacks that exploit errors in message parsing,
but the startup message is parsed before authentication anyway.

> small problem that the server *can't* translate *any* encoding until
> it's successfully connected to a database and is able to read the
> pg_conversion catalog.

> 
> We might be able to institute some rule like "examine the startup
> packet and see if it specifies a client_encoding different from what
> we inherited from the postmaster.  If not, continue with current
> behavior (send messages localized per postmaster's settings).  If so,
> fall back to English messages/C locale until startup is complete."
> This would preserve current functionality in cases where it actually,
> er, functions, while giving something at least passable in the cases
> that are broken today.

I think that we can do a bit more than this. We use the GNU gettext
library to provide message translation. This library is able to
perform a limited set of encoding conversions itself.

So, we can have a two-stage fallback here:

1. If the encoding is different from, but compatible with, the language
inherited from the postmaster, ask gettext, via the
bind_textdomain_codeset function, to provide messages in this encoding.

2. If that is not possible, fall back to English messages, which are
compatible with any of the supported encodings. The same goes for a
session that doesn't specify an encoding at all (i.e. CancelRequest).



> 
>             regards, tom lane
> 
> 



--                                   Victor Wagner <vitus@wagner.pp.ru>