Discussion: handling unconvertible error messages
Example: I have a database cluster initialized with --locale=ru_RU.UTF-8
(built with NLS). Let's say for some reason, I have client encoding set
to LATIN1. All error messages come back like this:

test=> select * from notthere;
ERROR: character with byte sequence 0xd0 0x9e in encoding "UTF8" has no
equivalent in encoding "LATIN1"

There is no straightforward way for the client to learn that there is a
real error message, but it could not be converted.

I think ideally we could make this better in two ways:

1) Send the original error message untranslated. That would require
saving the original error message in errmsg(), errdetail(), etc. That
would be a lot of work for only the occasional use. But it would also
facilitate an occasionally-requested feature of writing untranslated
error messages into the server log or the csv log, while sending
translated messages to the client (or some variant thereof).

2) Send an indication that there was an encoding problem. Maybe a
NOTICE, or an error context? Wiring all this into elog.c looks a bit
tricky, however.

Ideas?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
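[Editorial sketch] The failure described above can be reproduced in miniature outside the server. A small Python stand-in for the backend's conversion step (the message text is illustrative, not an actual server string): a Cyrillic message simply has no LATIN1 representation, so the conversion itself raises an error, and that error is all the client ever sees.

```python
# The server-side failure in miniature: a Cyrillic message has no LATIN1
# representation, so the conversion step itself fails, replacing the real
# error message with a conversion error.
message = 'ОШИБКА: relation "notthere" does not exist'

try:
    message.encode("latin-1")
    converted = True
    offending = None
except UnicodeEncodeError as exc:
    converted = False
    # exc.start/exc.end locate the unmappable characters, much like the
    # byte sequence 0xd0 0x9e reported in the server's error message
    offending = message[exc.start:exc.end]

print(converted, repr(offending))
```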
On 25 July 2016 at 22:43, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> Example: I have a database cluster initialized with --locale=ru_RU.UTF-8
> (built with NLS). Let's say for some reason, I have client encoding set
> to LATIN1. All error messages come back like this:
>
> test=> select * from notthere;
> ERROR: character with byte sequence 0xd0 0x9e in encoding "UTF8" has no
> equivalent in encoding "LATIN1"
>
> There is no straightforward way for the client to learn that there is a
> real error message, but it could not be converted.
>
> I think ideally we could make this better in two ways:
>
> 1) Send the original error message untranslated. That would require
> saving the original error message in errmsg(), errdetail(), etc. That
> would be a lot of work for only the occasional use. But it would also
> facilitate an occasionally-requested feature of writing untranslated
> error messages into the server log or the csv log, while sending
> translated messages to the client (or some variant thereof).
>
> 2) Send an indication that there was an encoding problem. Maybe a
> NOTICE, or an error context? Wiring all this into elog.c looks a bit
> tricky, however.
We have a similar problem with the server logs. But there, there's also an additional problem: if there isn't any character-mapping issue, we just totally ignore text-encoding concerns and write into the log files whatever encoding the client asked the backend to use. So log files can be a line-by-line mix of UTF-8, ISO-8859-1, and whatever other fun encodings someone asks for. There is *no* way to correctly read such a file, since lines carry no marking as to their encoding, and no tools out there support line-by-line differently encoded text files anyway.
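[Editorial sketch] The unrecoverability of a mixed-encoding log can be illustrated with a small Python example: the same bytes decode without error under several single-byte encodings, yielding entirely different text, so nothing in the file tells a reader which reading was intended.

```python
# The byte string below is "Привет" encoded in cp1251, but nothing in a log
# file records that fact: the same bytes also decode "successfully" (into
# different nonsense) under KOI8-R, and LATIN1 never fails on any byte.
line = "Привет".encode("cp1251")

as_cp1251 = line.decode("cp1251")   # the intended reading
as_koi8r = line.decode("koi8_r")    # a different, equally "valid" reading
as_latin1 = line.decode("latin-1")  # latin-1 accepts any byte sequence

print(as_cp1251)                # Привет
print(as_cp1251 == as_koi8r)    # False
```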
I'm not sure how closely it ties in to the issue you mention, but I think it's at least related enough to keep in mind while considering the client_encoding issue.
I suggest (3) "log the message with unmappable characters masked". Though I would definitely like to be able to also send the raw original, along with a field indicating the original's encoding (it won't be the client_encoding), since we need some way to get at the info.
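[Editorial sketch] What suggestion (3) could mean, using Python's built-in "replace" error handler as a stand-in for server-side masking (the function name is hypothetical):

```python
def mask_unmappable(message: str, encoding: str) -> bytes:
    # Encode, substituting "?" for each character the target encoding cannot
    # represent -- per-character masking rather than outright failure.
    return message.encode(encoding, errors="replace")

masked = mask_unmappable('ОШИБКА: relation "notthere" does not exist', "latin-1")
print(masked)  # b'??????: relation "notthere" does not exist'
```

The ASCII part of the message survives intact; only the unmappable run is lost, which is why a field carrying the raw original would still be wanted.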
Hello,

At Wed, 27 Jul 2016 19:53:01 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in <CAMsr+YFL0b1886tMYF9RPeDdpWryG1cr8ew3pYfiXgrJofpHjA@mail.gmail.com>
> On 25 July 2016 at 22:43, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> > Example: I have a database cluster initialized with --locale=ru_RU.UTF-8
> > (built with NLS). Let's say for some reason, I have client encoding set
> > to LATIN1. All error messages come back like this:
> >
> > test=> select * from notthere;
> > ERROR: character with byte sequence 0xd0 0x9e in encoding "UTF8" has no
> > equivalent in encoding "LATIN1"
> >
> > There is no straightforward way for the client to learn that there is a
> > real error message, but it could not be converted.
> >
> > I think ideally we could make this better in two ways:
> >
> > 1) Send the original error message untranslated. That would require
> > saving the original error message in errmsg(), errdetail(), etc. That
> > would be a lot of work for only the occasional use. But it would also
> > facilitate an occasionally-requested feature of writing untranslated
> > error messages into the server log or the csv log, while sending
> > translated messages to the client (or some variant thereof).
> >
> > 2) Send an indication that there was an encoding problem. Maybe a
> > NOTICE, or an error context? Wiring all this into elog.c looks a bit
> > tricky, however.
>
> We have a similar problem with the server logs. But there there's also an
> additional problem: if there isn't any character mapping issue we just
> totally ignore text encoding concerns and log in whatever encoding the
> client asked the backend to use into the log files. So log files can be a
> line-by-line mix of UTF-8, ISO-8859-1, and whatever other fun encodings
> someone asks for. There is *no* way to correctly read such a file since
> lines don't have any marking as to their encoding and no tools out there
> support line-by-line differently encoded text files anyway.
Cyrillic messages with such a conversion failure look like just a series of '?'s delimited with spaces. The same occurs for Japanese (or CJK as a group of similar alphabets), which contains (almost) no characters compatible with ASCII. We are sometimes obliged to count the '?'s to identify messages like the following :p

> $ LANG=C postgres
> ?????????: ??????? ?? ???? ?????????: 2016-07-28 14:08:32 JST
> ?????????: ?????? ?? ????????? ???????????????? ?????? ????????
> ?????????: ??????? ?? ?????? ????????? ???????????
> ?????????: ??????? ??????? ??????????? ??????

> I'm not sure how closely it ties in to the issue you mention, but I think
> it's at least related enough to keep in mind while considering the
> client_encoding issue.

The issue this thread stands for is a failure of the character-code conversion performed by backend code; the other one is gettext(3)'s behavior according to LC_CTYPE. I think that data in tables *must* follow the specified encoding and should result in an error for incompatible characters, but I don't think the same holds for messages from PostgreSQL. We Japanese already see such log messages very early in postmaster startup:

> LOG: データベースシステムは 2016-07-28 14:14:06 JST にシャットダウンしました
> LOG: MultiXact member wraparound protections are now enabled
> LOG: データベースシステムの接続受付準備が整いました。

The reason for the second line is that it just doesn't have a corresponding translation in ja.po. That is far more acceptable than the sequence of question marks shown above.

> I suggest (3) "log the message with unmappable characters masked". Though I
> would definitely like to be able to also send the raw original, along with
> a field indicating the encoding of the original since it won't be the
> client_encoding, since we need some way to get to the info.

So, I don't think this (3) will do much for these languages. I prefer (1) for this issue.
Putting aside the log issue, the error system of PostgreSQL is already doing a very similar thing in err_sendstring for error-recursion cases. It seems possible to add a silent fallback for conversion failures there.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, 25 Jul 2016 10:43:44 -0400 Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

> Example: I have a database cluster initialized with
> --locale=ru_RU.UTF-8 (built with NLS). Let's say for some reason, I
> have client encoding set to LATIN1. All error messages come back
> like this:
>
> test=> select * from notthere;
> ERROR: character with byte sequence 0xd0 0x9e in encoding "UTF8" has
> no equivalent in encoding "LATIN1"
>
> There is no straightforward way for the client to learn that there is
> a real error message, but it could not be converted.

Really, the situation is a bit worse. There is at least one case where the error message comes back unreadable to the client even if the encodings are compatible, i.e. if the server default locale is ru_RU.UTF-8 and the client requests the encoding WIN1251, which is able to handle Cyrillic.

If an error occurs during processing of the StartupMessage protocol message, e.g. the client requests a connection to a nonexistent database, the ErrorResponse will contain a message in the server default locale, despite the client encoding being specified in the StartupMessage. If a session is correctly established with the same parameters, error messages are displayed correctly.

I haven't yet investigated whether this is just delayed initialization of the backend locale system, or whether the backend is not yet forked at the time this message is generated and the wrongly encoded message is sent by the postmaster.
On Mon, 25 Jul 2016 10:43:44 -0400 Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

> There is no straightforward way for the client to learn that there is
> a real error message, but it could not be converted.
>
> I think ideally we could make this better in two ways:
>
> 1) Send the original error message untranslated. That would require
> saving the original error message in errmsg(), errdetail(), etc. That
> would be a lot of work for only the occasional use. But it would also
> facilitate an occasionally-requested feature of writing untranslated
> error messages into the server log or the csv log, while sending
> translated messages to the client (or some variant thereof).
>
> 2) Send an indication that there was an encoding problem. Maybe a
> NOTICE, or an error context? Wiring all this into elog.c looks a bit
> tricky, however.
>
> Ideas?

I think there are two more ways (3 was in Craig's message):

4. At session startup, try to reinitialize the LC_MESSAGES locale category with the combination of the server (or better, client-sent) language and region and the client-supplied encoding, and if this fails, use untranslated error messages. Obviously, an attempt to set the locale to ru_RU.ISO8859-1 would fail, so if a client asked a server with a ru_RU.UTF-8 default locale to use LATIN1 encoding, the server would fall back to untranslated messages.

This approach would have problems on Windows, where the locale is strictly tied to the ANSI encoding of a given language/territory. Even if we made UTF-8 a special case, an attempt to connect with encoding KOI8 or LATIN5 to a Windows PostgreSQL server running in the Russian_Russia.1251 locale would result in a fallback to untranslated messages. But I think that this case is marginal, and it is better to present untranslated messages to people (or applications) which require a non-default 8-bit encoding even when it would be possible to represent translated messages in that encoding, than to present unreadable translated messages to anybody.

5. Use transliteration in case of an encoding problem.
Some iconv implementations (such as Linux glibc iconv and GNU portable libiconv) support a //TRANSLIT suffix on the encoding name; if this suffix is specified, unrepresentable symbols are replaced with phonetically similar approximations. I don't know how well it would work for Japanese, but for Russian it is definitely better than lots of question marks.
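[Editorial sketch] For comparison, transliteration-style degradation can be approximated in Python, which has no iconv //TRANSLIT: NFKD decomposition plus stripping combining marks works for accented Latin scripts, while Cyrillic has no decomposition into ASCII and still degrades to question marks, mirroring the coverage caveats raised in this thread.

```python
import unicodedata

def translit_ascii(text: str) -> str:
    # NFKD splits accented letters into base letter + combining mark;
    # dropping the marks leaves an ASCII approximation. Cyrillic has no
    # such decomposition, so it still collapses to "?".
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", errors="replace").decode("ascii")

print(translit_ascii("café résumé"))  # cafe resume
print(translit_ascii("Ошибка"))       # ??????
```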
Victor Wagner <vitus@wagner.pp.ru> writes:
> If error occurs during processing of StartMessage protocol message,
> i.e. client request connection to unexisting database,
> ErrorResponse would contain message in the server default locale,
> despite of client encoding being specified in the StartMessage.

Yeah. I'm inclined to think that we should reset the message locale to C as soon as we've forked away from the postmaster, and leave it that way until we've absorbed settings from the startup packet. Sending messages of this sort in English isn't great, but it's better than sending completely-unreadable ones. Or is that just my English-centricity showing?

regards, tom lane
On Thu, 04 Aug 2016 09:42:10 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Yeah. I'm inclined to think that we should reset the message locale
> to C as soon as we've forked away from the postmaster, and leave it
> that way until we've absorbed settings from the startup packet.
> Sending messages of this sort in English isn't great, but it's better
> than sending completely-unreadable ones. Or is that just my
> English-centricity showing?

From my Russian point of view, English messages are definitely better than a transliteration of Russian in Latin letters (although that is not completely unreadable), not to mention a wrong encoding or lots of question marks.

Really, if this response is sent after the backend has been forked, the problem can probably be fixed a better way: the StartupMessage contains information about the desired client encoding, so this information just needs to be processed earlier than any other information from the message which can cause errors (such as the database name).

If these errors are sent from the postmaster itself, things are worse, because I don't think the locale subsystem is designed to be reinitialized lots of times in the same process. But the postmaster itself can use non-localized messaging: its messages in the logs are typically analyzed by more or less qualified DBAs and system administrators, not by end users.
Victor Wagner <vitus@wagner.pp.ru> writes:
> On Thu, 04 Aug 2016 09:42:10 -0400
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah. I'm inclined to think that we should reset the message locale
>> to C as soon as we've forked away from the postmaster, and leave it
>> that way until we've absorbed settings from the startup packet.

> Really, if this response is sent after backend has been forked, problem
> probably can be easily fixed better way - StartupMessage contain
> information about desired client encoding, so this information just
> should be processed earlier than any other information from this
> message, which can cause errors (such as database name).

I think that's wishful thinking. There will *always* be errors that come out before we can examine the contents of the startup message. Moreover, until we've done authentication, we should be very wary of applying client-specified settings at all: they might be malicious.

regards, tom lane
On Thu, 04 Aug 2016 14:25:52 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I think that's wishful thinking. There will *always* be errors that
> come out before we can examine the contents of the startup message.
> Moreover, until we've done authentication, we should be very wary of
> applying client-specified settings at all: they might be malicious.

I think that this case can be an exception from the rule "don't apply settings from an untrusted source". Let's consider the possible threat model:

1. We parse the StartupMessage before authentication anyway. There is nothing we can do about that, so the parser should be robust enough to handle untrusted input. As far as I can see from a quick glance, it is.

2. When the encoding name is parsed, it is used to search the array of supported encodings. No possible attack here: either it is valid or it is not.

3. As far as I know, we don't allow the client to change the language, only the encoding, so an attacker could not even make messages in the log unreadable for the system administrator.

So, if we fixed the problem reported by Peter Eisentraut at the beginning of this thread, and fell back to untranslated messages whenever the client-requested encoding is unable to represent messages in the server default language, this solution would be no worse than yours. There would be a fallback to the C locale in any case of doubt, but in the cases where NLS messages can be made readable, they would be readable.

Really, there is at least one case when the fallback to the C locale should be done unconditionally: a CancelRequest. In this case the client cannot send an encoding, so the C locale should be used. As far as I understand, this is not the case with SSLRequest: although it doesn't contain encoding information either, errors in subsequent SSL negotiations would be reported by client-side SSL libraries, not by the server.
On 8/4/16 2:45 AM, Victor Wagner wrote:
> 4. At the session startup try to reinitializie LC_MESSAGES locale
> category with the combination of the server (or better client-send)
> language and region and client-supplied encoding, and if this failed,
> use untranslated error message. Obvoisly, attempt to set locale to
> ru_RU.ISO8859-1 would fail. so, if client would ask server with
> ru_RU.UTF-8 default locale to use LATIN1 encoding, server would
> fallback to untranslated messages.

I think this is basically my solution (1), with the same problems.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 8/4/16 9:42 AM, Tom Lane wrote:
> I'm inclined to think that we should reset the message locale
> to C as soon as we've forked away from the postmaster, and leave it
> that way until we've absorbed settings from the startup packet.
> Sending messages of this sort in English isn't great, but it's better
> than sending completely-unreadable ones. Or is that just my
> English-centricity showing?

Well, most of the time this all works; you might have problems only if the client and server settings differ. We wouldn't want to partially disable the NLS feature for the normal case.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 5 Aug 2016 11:23:37 -0400 Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

> On 8/4/16 9:42 AM, Tom Lane wrote:
> > I'm inclined to think that we should reset the message locale
> > to C as soon as we've forked away from the postmaster, and leave it
> > that way until we've absorbed settings from the startup packet.
>
> Well, most of the time this all works, only if there are different
> client and server settings you might have problems. We wouldn't want
> to partially disable the NLS feature for the normal case.

There are cases where the client cannot tell the server which encoding it wants to use, and the server cannot tell which encoding it uses, yet it can still send error messages. For example, CancelRequest. The only way to ensure that the message is readable in such a case is to fall back to some encoding definitely known by both client and server, and for now that is US-ASCII. This is, as far as I understand, what Tom is proposing: fall back to the untranslated message at the beginning of the session, and return to NLS only once an encoding has been successfully negotiated between client and server.

Maybe there is another possible solution: prepare the client to accept UTF-8 messages from the server regardless of the session encoding, i.e. if a message starts with a BOM marker (the Unicode character 0xFEFF, the byte sequence EF BB BF in UTF-8), interpret it as UTF-8. This would require the client to support some kind of encoding conversion, and in some 8-bit environments it would pose problems with displaying these messages.
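[Editorial sketch] The BOM idea above could look roughly like this on the client side (Python; the function and the protocol convention are hypothetical, not an existing libpq behavior):

```python
import codecs

def decode_server_message(raw: bytes, session_encoding: str) -> str:
    # Hypothetical convention: a UTF-8 BOM prefix marks a message the
    # server could not convert and sent as raw UTF-8 instead.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    return raw.decode(session_encoding)

# A BOM-prefixed UTF-8 payload decodes correctly even in a LATIN1 session:
msg = decode_server_message(codecs.BOM_UTF8 + "Ошибка".encode("utf-8"), "latin-1")
print(msg)  # Ошибка
```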
On Fri, 5 Aug 2016 11:21:44 -0400 Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

> On 8/4/16 2:45 AM, Victor Wagner wrote:
> > 4. At the session startup try to reinitializie LC_MESSAGES locale
> > category with the combination of the server (or better client-send)
> > language and region and client-supplied encoding, and if this failed,
> > use untranslated error message. Obvoisly, attempt to set locale to
> > ru_RU.ISO8859-1 would fail. so, if client would ask server with
> > ru_RU.UTF-8 default locale to use LATIN1 encoding, server would
> > fallback to untranslated messages.
>
> I think this is basically my solution (1), with the same problems.

I think that there is a big difference from the server's point of view. You propose that both the translated and the untranslated message be passed around inside the backend. That has some benefits, but requires considerable reworking of server internals.

My solution doesn't require keeping both the original message and the translated one during all the call-stack unwinding. It just checks whether the combination of language and encoding is supported by the NLS subsystem, and if not, falls back to untranslated messages for the entire session. It is a much more local change, comparable in complexity to the one proposed by Tom Lane.
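[Editorial sketch] The session-startup check described here can be sketched in Python (a stand-in only; the real implementation would live in the backend's startup path, and locale.LC_MESSAGES is available on POSIX systems only):

```python
import locale

def pick_message_locale(candidates):
    # Probe each candidate locale; if none can be initialized, fall back
    # to "C", i.e. untranslated messages for the whole session.
    for name in candidates:
        try:
            locale.setlocale(locale.LC_MESSAGES, name)  # POSIX only
            return name
        except locale.Error:
            continue
    locale.setlocale(locale.LC_MESSAGES, "C")
    return "C"

# ru_RU with a LATIN1 codeset cannot represent Russian and is not generated
# on real systems, so this falls back to "C":
chosen = pick_message_locale(["ru_RU.ISO8859-1"])
print(chosen)  # C
```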
At Mon, 8 Aug 2016 10:19:10 +0300, Victor Wagner <vitus@wagner.pp.ru> wrote in <20160808101910.49beeed6@fafnir.local.vm>
> You propose that both translated and untranslated message should be
> passed around inside backend. It has some benefits, but requires
> considerable reworking of server internals.

Agreed.

> My solution doesn't require keeping both original message and
> translated one during all call stack unwinding. It just checks if
> combination of language and encoding is supported by the NLS subsystem,
> and if not, falls back to untranslated message for entire session.

Looking at check_client_encoding(), the comment says the following:

| * If we are not within a transaction then PrepareClientEncoding will not
| * be able to look up the necessary conversion procs. If we are still
| * starting up, it will return "OK" anyway, and InitializeClientEncoding
| * will fix things once initialization is far enough along.

We should overcome this to realize a startup-time check for the conversion procs.

> It is much more local change and is comparable by complexity with one,
> proposed by Tom Lane.

I'm not sure what messages may be raised before authentication, but this could be a more generic solution (adding the check in-session).

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160808.171821.100221089.horiguchi.kyotaro@lab.ntt.co.jp>
> Looking at check_client_encoding(), the comment says as following.
>
> | * If we are not within a transaction then PrepareClientEncoding will not
> | * be able to look up the necessary conversion procs. If we are still
> | * starting up, it will return "OK" anyway, and InitializeClientEncoding
> | * will fix things once initialization is far enough along.
>
> We shold overcome this to realize startup-time check for
> conversion procs.

Somewhat wrong. The core problem is that the procedures offered by PrepareClientEncoding are chosen only on an encoding-to-encoding basis, not taking character-set compatibility into account. So, currently, this is not detectable before actually converting a character stream.

Conversely, providing a means to check character-set compatibility would naturally fix this. A check at session startup (an out-of-transaction check?) is still another problem.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time) Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> I'm not sure what messages may be raised before authentication
> but it can be a more generic-solution. (Adding check during
> on-session.)

Definitely, there can be an authentication error message, which is sent if authentication does not succeed. Also, as far as I understand, the message "Database ... doesn't exist" is also sent before authentication. And there are CancelRequests, where normal authentication is not used and the server key, provided in another session, is used instead.
At Mon, 08 Aug 2016 18:11:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160808.181154.252052789.horiguchi.kyotaro@lab.ntt.co.jp>
> Somewhat wrong. The core problem is the procedures offered by
> PrepareClientEncoding is choosed only by encoding->encoding
> basis, not counting character set compatibility. So, currently
> this is not detectable before actually doing conversion of a
> character stream.
>
> Conversely, providing a means to check character-set
> compatibility will naturally fixes this. Check at session-startup
> (out-of-transaction check?) is still another problem.

I don't see character-set compatibility as easily detectable, because the locale (or character set) is not something PostgreSQL manages (except for some encodings bound to one particular character set)... So the conversion fallback might be the only available solution. Thoughts?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, 08 Aug 2016 18:11:54 +0900 (Tokyo Standard Time) Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> Somewhat wrong. The core problem is the procedures offered by
> PrepareClientEncoding is choosed only by encoding->encoding
> basis, not counting character set compatibility. So, currently
> this is not detectable before actually doing conversion of a
> character stream.

Yes, my idea was to check language/encoding compatibility: make sure that NLS messages can be represented in the client-specified encoding in a readable way. As far as I know, there is no platform-independent bulletproof way to do so.

On Unix you can try to initialize a locale with the given language and the given encoding, but this can fail even if the encoding is compatible with the language, simply because the corresponding locale has not been generated on that system. But this seems to be a system-administration problem and can be left to local sysadmins.

Once you have correctly initialized LC_MESSAGES, you don't need encoding-conversion routines for the NLS messages. You can use the bind_textdomain_codeset function to provide messages in the client-desired encoding. (But this can cause problems with server logs, where messages from different sessions would come in different encodings.)

On Windows things are more complicated. There is just one ANSI code page associated with a given language, and locale initialization would fail with any other code page, including UTF-8.
On Mon, 08 Aug 2016 18:28:57 +0900 (Tokyo Standard Time) Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> I don't see charset compatibility to be easily detectable,

In the worst case we can hardcode an explicit compatibility table. There is a limited set of languages which have translated error messages, and a limited (albeit wide) set of encodings supported by PostgreSQL. So it is possible to define a complete list of encodings compatible with each translation, and fall back to untranslated messages if the client encoding is not in that list.

> because locale (or character set) is not a matter of PostgreSQL
> (except for some encodings bound to one particular character
> set)... So the conversion-fallback might be a only available
> solution.

Conversion fallback may be a solution for data. For NLS messages I think it is better to fall back to English (untranslated) messages than to use transliteration or anything similar. I think that for now we can assume that the best effort has already been made for the data, and think about how to improve the situation with messages.
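[Editorial sketch] The hardcoded compatibility table suggested here might look roughly like this (Python; the entries are illustrative assumptions, since validating such a list is exactly the hard part discussed later in the thread):

```python
# Hypothetical compatibility table: encoding name -> languages whose
# translated messages it can represent. Entries are illustrative only.
COMPATIBLE_LANGS = {
    "LATIN1": {"af", "sq", "eu", "da", "en", "fo"},
    "KOI8R": {"en", "ru"},   # assumption, for illustration
    "UTF8": None,            # None: compatible with every translation
}

def use_translation(lang: str, client_encoding: str) -> bool:
    # True if NLS messages in `lang` are safe to send in this encoding;
    # unknown encodings fall back to untranslated messages.
    langs = COMPATIBLE_LANGS.get(client_encoding)
    if client_encoding in COMPATIBLE_LANGS and langs is None:
        return True   # universal encoding such as UTF8
    if langs is None:
        return False  # unknown encoding: fall back to untranslated
    return lang in langs

print(use_translation("ru", "LATIN1"))  # False -> send English messages
print(use_translation("da", "LATIN1"))  # True  -> Danish is representable
```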
Hello,

(I've recovered the lost Cc recipients so far)

At Mon, 8 Aug 2016 12:52:11 +0300, Victor Wagner <vitus@wagner.pp.ru> wrote in <20160808125211.1361cc0f@fafnir.local.vm>
> On Mon, 08 Aug 2016 18:28:57 +0900 (Tokyo Standard Time)
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >
> > I don't see charset compatibility to be easily detectable,
>
> In the worst case we can hardcode explicit compatibility table.

We could have lists of the languages compatible with certain language-bound encodings, for example LATIN1 (ISO/IEC 8859-1), according to Wikipedia (https://en.wikipedia.org/wiki/ISO/IEC_8859-1). Based on that list, we might have the following compatibility list of locales, maybe without region:

{{"UTF8", "LATIN1"}, "af", "sq", "eu", "da", "en", "fo"}... and so on.

The biggest problem with this is that at least *I* cannot confirm the validity of the list: neither the completeness of LATIN1's coverage of all the languages in the list, nor the omission of any possibly coverable language. Nonetheless, we could use such lists if we accept the possible imperfection, which would eventually result in either the original error (a conversion failure) or an excess fallback for possibly convertible languages; unfortunately the latter would be unacceptable for table data.

> There is limited set of languages, which have translated error messages,
> and limited (albeit wide) set of encodings, supported by PostgreSQL. So

Yes, we could instead have a negative list of combinations already known to be incompatible:

{{"UTF8", "LATIN1"}, "ru", .. er..what else?}

ISO 639-1 seems to have about 190 languages, and most of them are apparently incompatible with the LATIN1 encoding. It doesn't seem good to me to have a haphazardly made negative list.

> it is possible to define complete list of encodings, compatible with
> some translation. And fall back to untranslated messages if client
> encoding is not in this list.
> > because locale (or character set) is not a matter of PostgreSQL
> > (except for some encodings bound to one particular character
> > set)... So the conversion-fallback might be a only available
> > solution.
>
> Conversion fallback may be a solution for data. For NLS-messages I think
> it is better to fall back to English (untranslated) messages than use of
> transliteration or something alike.

I suppose that 'fallback' means "have a try, then use English if it failed", so I think it is suitable for messages rather than for data, and it doesn't need any a priori information about compatibility. It seems to me that PostgreSQL refuses to ignore or conceal conversion errors and return broken or unwanted byte sequences for data. Things are different for error messages: for them it is preferable to be somehow readable rather than totally abandoned.

> I think that for now we can assume that the best effort is already done
> for the data, and think how to improve situation with messages.

Is there any source giving the compatibility of every combination of language and encoding? We need some ground for the list.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, 10 Aug 2016 11:08:43 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> Hello,
>
> (I've recovered the lost Cc recipients so far)
>
> At Mon, 8 Aug 2016 12:52:11 +0300, Victor Wagner <vitus@wagner.pp.ru>
> wrote in <20160808125211.1361cc0f@fafnir.local.vm>
> > On Mon, 08 Aug 2016 18:28:57 +0900 (Tokyo Standard Time)
> > Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >
> > > > I don't see charset compatibility to be easily detectable,
> >
> > In the worst case we can hardcode explicit compatibility table.
>
> We could have the language lists compatible with some
> language-bound encodings. For example, LATIN1 (ISO/IEC 8859-1),
> according to Wikipedia
> (https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
>
> According to the list, we might have the following compatibility
> list of locales, maybe without region.
>
> {{"UTF8", "LATIN1"}, "af", "sq", "eu", "da", "en", "fo"}... and so.
>
> The biggest problem for this is at least *I* cannot confirm the
> validity of the list. Both about perfectness of coverage of

I think that people from the localization team can. At least the authors of a particular translation can tell which encodings support their language.

> ISO639-1 seems to have about 190 languages and most of them are

We don't have 190 message catalog translations in PostgreSQL, so the problem with encodings for messages is quite limited.

> I suppose that 'fallback' means "have a try then use English if
> failed" so I think it is suitable rather for messages, not for
> data, and it doesn't need any a priori information about

Yes, I'm talking about messages, not about encoding conversion for data. As far as my experience goes, data in PostgreSQL are converted in a more or less predictable way. Maybe it could be improved, but it is possible to set up client and server in such a way that they do the right job.

The situation with messages, especially ones which are returned before establishment of the session completes (or when it fails), is currently a bit worse.

> compatibility. It seems to me that PostgreSQL refuses to ignore

Alas, it does. At least with the example given by Peter Eisentraut at the start of this thread.

> or conceal conversion errors and return broken or unwanted byte
> sequence for data. Things are different for error messages, it
> is preferable to be anyhow readable than totally abandoned.
>
> > I think that for now we can assume that the best effort is already
> > done for the data, and think how to improve situation with
> > messages.
>
> Is there any source to know the compatibility for any combination
> of language vs encoding? Maybe we need a ground for the list.
Victor>We don't have 190 message catalog translations in the PostgreSQL.
Victor>So problem with encoding for messages is quite limited.
Even though the number of translations is limited, there's a problem when trying to tell one one-byte encoding from another one-byte encoding.
It would be so much better if ServerErrorMessages included the encoding right in the message itself.
For pgjdbc, I've implemented a workaround that relies on the following:
1) It knows what "FATAL" looks like in several translations, and it knows the encodings often used for those translations. For instance, for Russian it tries the CP1251, KOI8, and ALT encodings: it converts "ВАЖНО" (Russian for FATAL) using those three encodings and searches for each resulting byte sequence in the error message. If there's a match, the encoding is identified.
2) Unfortunately, that does not help for Japanese, as "FATAL" there is translated as "FATAL". So I hard-coded several typical words like "database", "user", "role" (see [1]); if those byte sequences are present, the message is assumed to be in Japanese. It would be great if someone could review those, as I do not speak Japanese.
3) Then it tries different LATIN encodings.
Here's the commit
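The guessing heuristic in step 1 can be sketched in a few lines. This is a hedged Python illustration of the idea, not the actual pgjdbc Java code; "ALT" is assumed here to mean CP866:

```python
# pgjdbc-style guessing heuristic: encode a known marker word ("FATAL"
# in Russian) in each candidate one-byte encoding and look for the
# resulting byte sequence in the raw, mislabeled server message.
CANDIDATES = ["cp1251", "koi8-r", "cp866"]  # "ALT" assumed to be CP866
MARKER = "ВАЖНО"  # Russian translation of "FATAL"

def guess_encoding(raw_message):
    """Return the first candidate encoding whose byte rendering of
    MARKER appears in raw_message, or None if nothing matches."""
    for enc in CANDIDATES:
        if MARKER.encode(enc) in raw_message:
            return enc
    return None
```

The same trick generalizes to other marker words (step 2's "database"/"user"/"role" list for Japanese) by adding more (word, encoding) pairs.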
Kyotaro> Is there any source to know the compatibility for any combination
Kyotaro> of language vs encoding? Maybe we need a ground for the list.
I use "locale -a" for that.
For instance, for Japanese it prints the following on my machine (OS X 10.11.6):
locale -a | grep ja
ja_JP
ja_JP.eucJP
ja_JP.SJIS
ja_JP.UTF-8
Vladimir
On Sat, 13 Aug 2016 09:24:47 +0000
Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:

> Victor>We don't have 190 message catalog translations in the
> Victor>PostgreSQL. So problem with encoding for messages is quite
> Victor>limited.
>
> Even though the number of translations is limited, there's a problem
> when trying to tell one "one-byte-encoding" from another "one-byte"
> one. It would be so much better if ServerErrorMessages included
> encoding right in the message itself.

I think it is better to avoid such a problem entirely and fix the system so that the server never sends a message in an encoding different from the client's. It is not the client's job to convert encodings.

In most cases the server knows which encoding the client requests from the very first protocol message (if it is a startup message). So the server can easily tell whether it is able to convert NLS messages into the client's desired encoding, and if not, fall back to untranslated messages.

--
Victor Wagner <vitus@wagner.pp.ru>
Victor>It is not a client job to convert encodings.
Of course.
However, there is a vast number of old PG versions deployed in the wild that send wrong data to clients.
This makes for a bad user experience, so it is worth doing two things:
1) Implement a proper solution in future PostgreSQL versions (e.g. include the encoding name right in the error message).
2) Implement a workaround in current drivers, so clients get proper error messages even when connecting to an unpatched server.
Vladimir
Victor Wagner <vitus@wagner.pp.ru> writes:
> I think it is better to avoid such a problem and fix system so server
> would never send a message in the encoding, different from client one.

Don't hold your breath waiting for that to happen.

Quite aside from the question of whether we want to trust GUC settings from the startup packet before we've authenticated the user, there's a small problem that the server *can't* translate *any* encoding until it's successfully connected to a database and is able to read the pg_conversion catalog.

We might be able to institute some rule like "examine the startup packet and see if it specifies a client_encoding different from what we inherited from the postmaster. If not, continue with current behavior (send messages localized per postmaster's settings). If so, fall back to English messages/C locale until startup is complete." This would preserve current functionality in cases where it actually, er, functions, while giving something at least passable in the cases that are broken today.

regards, tom lane
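The rule proposed above reduces to a small decision function. A hedged Python sketch of the logic only (the actual change would of course live in the server's C startup code; the names here are illustrative):

```python
def startup_message_mode(postmaster_encoding, startup_client_encoding):
    """Decide how to localize messages during startup, per the rule above.

    Returns "localized" to keep current behavior (messages localized per
    the postmaster's settings), or "english_c_locale" to fall back to
    English/C locale until startup is complete.
    """
    # No client_encoding in the startup packet, or one that matches what
    # we inherited from the postmaster: current behavior is safe.
    if (startup_client_encoding is None
            or startup_client_encoding == postmaster_encoding):
        return "localized"
    # Client asked for something else: we cannot convert yet, since
    # pg_conversion is not readable before connecting to a database.
    return "english_c_locale"
```

This keeps today's working cases working and only changes behavior for the combinations that are broken today.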
Tom> while giving something at least passable in the cases
Tom> that are broken today.

Would you mind adding an explicit "encoding" field to the error message?
At least it would give a clear explanation of how to parse that message without resorting to a guessing dance.
The biggest problem is that the client has no idea how to parse backend error messages. If implementing client_encoding properly is too hard at this point in time, then I would rather have an "encoding" field right in the startup error message.
That "encoding" field would also enable sending properly localized messages in the future if "pre-connect client_encoding" were implemented somehow.
Vladimir
On Sat, 13 Aug 2016 12:02:30 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Victor Wagner <vitus@wagner.pp.ru> writes:
> > I think it is better to avoid such a problem and fix system so
> > server would never send a message in the encoding, different from
> > client one.
>
> Don't hold your breath waiting for that to happen.
>
> Quite aside from the question of whether we want to trust GUC settings
> from the startup packet before we've authenticated the user, there's a

What's wrong with trusting this particular setting? I cannot think of any meaningful exploit. Really, there are lots of HTTP servers out there which typically accept connections from anywhere (seldom the case for PostgreSQL servers) and trust the Accept-Charset and Accept-Language client headers without any authentication. There can be attacks that exploit errors in message parsing, but the startup message is parsed before authentication anyway.

> small problem that the server *can't* translate *any* encoding until
> it's successfully connected to a database and is able to read the
> pg_conversion catalog.
>
> We might be able to institute some rule like "examine the startup
> packet and see if it specifies a client_encoding different from what
> we inherited from the postmaster. If not, continue with current
> behavior (send messages localized per postmaster's settings). If so,
> fall back to English messages/C locale until startup is complete."
> This would preserve current functionality in cases where it actually,
> er, functions, while giving something at least passable in the cases
> that are broken today.

I think that we can do a bit more than this. We use the GNU gettext library to provide message translation. That library is able to perform a limited set of encoding conversions itself. So we can have a two-stage fallback here:

1. If the encoding is different from, but compatible with, the language inherited from the postmaster, ask gettext via the bind_textdomain_codeset function to provide messages in this encoding.

2. If that is not possible, fall back to English messages, which are compatible with any of the supported encodings.

The same goes for a session which doesn't specify an encoding at all (i.e. CancelMessage).

--
Victor Wagner <vitus@wagner.pp.ru>
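The two-stage fallback above can be sketched as a control-flow illustration. A hedged Python sketch only: in the server this would use GNU gettext's bind_textdomain_codeset() in C, and the can_convert callback here is a hypothetical stand-in for asking gettext whether it can recode the catalog into a given encoding:

```python
def pick_message_strategy(client_encoding, can_convert):
    """Two-stage fallback for NLS messages, per the scheme above.

    can_convert(encoding) stands in for asking gettext (via
    bind_textdomain_codeset) whether it can recode the message
    catalog's translations into the requested encoding.
    """
    # Stage 1: gettext itself recodes translated messages into the
    # client encoding. A session that never specified an encoding
    # (e.g. a CancelRequest) skips straight to stage 2.
    if client_encoding is not None and can_convert(client_encoding):
        return "translated_recoded"
    # Stage 2: untranslated English, valid in every supported encoding.
    return "english"
```

The key point is that stage 2 is always available, so the client never receives an unconvertible byte sequence.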