Thread: main log encoding problem


main log encoding problem

From
yi huang
Date:
I'm using postgresql 9.1.3 from debian squeeze-backports with the zh_CN.UTF-8 locale. I find that my main log (which is "/var/log/postgresql/postgresql-9.1-main.log") contains "???", which indicates some sort of charset encoding problem.
But error messages related to pgsql are fine; only other system messages have this problem. For example:

2012-05-19 16:06:12 CST ??:  ?????????? 2012-05-19 16:06:10 CST
2012-05-19 16:06:12 CST ??:  ???????????
2012-05-19 16:06:12 CST ??:  ???autovacuum
2012-05-19 16:06:12 CST ??:  ???????
2012-05-19 16:07:16 CST 错误:  角色"postgres" 已经存在    (in English: ERROR: role "postgres" already exists)
2012-05-19 16:07:16 CST 语句:  CREATE ROLE postgres;    (语句 = STATEMENT)
2012-05-19 16:07:16 CST 错误:  语言 "plpgsql" 已经存在     (in English: ERROR: language "plpgsql" already exists)
2012-05-19 16:07:16 CST 语句:  CREATE PROCEDURAL LANGUAGE plpgsql;    (语句 = STATEMENT)
2012-05-19 16:08:23 CST ????:  ?? "huangyi" ???
2012-05-19 16:08:52 CST ????:  ?? "huangyi" ???
2012-05-19 16:09:01 CST ??:  ???????(zlfund)????????(huangyi) ???
2012-05-19 16:09:01 CST ????:  Peer authentication failed for user "zlfund"
2012-05-19 16:09:34 CST ??:  ???????(zlfund)????????(huangyi) ???
2012-05-19 16:09:34 CST ????:  Peer authentication failed for user "zlfund"

I guess it has something to do with a packaging problem rather than postgresql itself, but it would be great if you could give me some clue as to where the problem might be.

My best regards.
Yi Huang.

Re: main log encoding problem

From
Craig Ringer
Date:
On 05/23/2012 09:15 AM, yi huang wrote:
> I'm using postgresql 9.1.3 from debian squeeze-backports with
> zh_CN.UTF-8 locale, i find my main log (which is
> "/var/log/postgresql/postgresql-9.1-main.log") contains "???" which
> indicate some sort of charset encoding problem.

It's a known issue, I'm afraid. The PostgreSQL postmaster logs in the
system locale, and the PostgreSQL backends log in whatever encoding
their database is in. They all write to the same log file, producing a
log file full of mixed encoding data that'll choke many text editors.

If you force your editor to re-interpret the file according to the
encoding your database(s) are in, this may help.
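Outside an editor, the same re-interpretation can be done programmatically. Here is a rough illustrative sketch (not from the thread): re-decode each line of a mixed-encoding log, trying UTF-8 first and falling back to GBK, which is an assumed legacy encoding for a zh_CN setup.

```python
# Sketch: decode a mixed-encoding PostgreSQL log line by line.
# Each line is tried as UTF-8 first, then as GBK (an assumption for a
# zh_CN system); bytes that fit neither become U+FFFD replacement marks.
def decode_line(raw: bytes, fallbacks=("utf-8", "gbk")) -> str:
    for enc in fallbacks:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")  # last resort

# Usage (path from the original report):
# with open("/var/log/postgresql/postgresql-9.1-main.log", "rb") as f:
#     for raw in f:
#         print(decode_line(raw), end="")
```

Note this is only a heuristic: some GBK byte sequences happen to be valid UTF-8, so lines can still be mis-decoded.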

In the future it's possible that this may be fixed by logging output to
different files on a per-database basis or by converting the text
encoding of log messages, but no agreement has been reached on the
correct approach and nobody has stepped up to implement it.

--
Craig Ringer

Re: main log encoding problem

From
Alexander Law
Date:
Hello!

May I propose a solution and step up?

I've read a discussion of the bug #5800 and here is my 2 cents.
To make things clear let me give an example.
I am a PostgreSQL hosting provider and I let my customers create any
databases they wish.
I have clients all over the world (so they can create databases with
different encodings).

The question is: what do I (as admin) want to see in my postgresql log
containing errors from all the databases?
IMHO we should consider two requirements for the log.
First, the file should be readable with a generic text viewer. Second,
it should be as useful and complete as possible.

Now I see the following solutions.
A. We have different logfiles for each database with different encodings.
Then all our logs will be readable, but we have to look at them one by
one, which is inconvenient at least.
Moreover, our log reader needs to know which encoding to use for each
file.

B. We have one logfile with the operating system encoding.
The first downside is that the logs can differ between OSes.
The second is that Windows has a non-Unicode system encoding,
and such an encoding can't represent all the national characters. So at
best I will get "???" in the log.

C. We have one logfile with UTF-8.
Pros: Log messages of all our clients can fit in it. We can use any
generic editor/viewer to open it.
Nothing changes for Linux (and other OSes with UTF-8 encoding).
Cons: All the strings written to the log file should go through some
conversion function.

I think that the last solution is the right one. What is your opinion?

In fact, the problem exists even with a simple installation on Windows
when you use a non-English locale,
so the solution would be useful for many of us.

Best regards,
Alexander

P.S. sorry for the wrong subject in my previous message sent to
pgsql-general




Re: [BUGS] main log encoding problem

From
Tatsuo Ishii
Date:
> C. We have one logfile with UTF-8.
> Pros: Log messages of all our clients can fit in it. We can use any
> generic editor/viewer to open it.
> Nothing changes for Linux (and other OSes with UTF-8 encoding).
> Cons: All the strings written to the log file should go through some
> conversion function.
>
> I think that the last solution is the solution. What is your opinion?

I am thinking about a variant of C.

The problem with C is that converting from other encodings to UTF-8 is not
cheap, because it requires huge conversion tables. This may be a
serious problem for a busy server. Also, it is possible that some information
is lost in this conversion, because there's no
guarantee of a one-to-one mapping between UTF-8 and other
encodings. Another problem with UTF-8 is that you have to choose *one*
locale when using your editor. This may or may not affect the handling of
strings in your editor.

My idea is to use the mule-internal encoding for the log file instead of
UTF-8. There are several advantages:

1) Conversion to mule-internal encoding is cheap because no conversion
   table is required. Also, no information loss happens in this
   conversion.

2) The mule-internal encoding can be handled by emacs, one of the most
   popular editors in the world.

3) No need to worry about locale. The mule-internal encoding has enough
   information about language.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: [BUGS] main log encoding problem

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
> My idea is using mule-internal encoding for the log file instead of
> UTF-8. There are several advantages:

> 1) Conversion to mule-internal encoding is cheap because no conversion
>    table is required. Also, no information loss happens in this
>    conversion.

> 2) Mule-internal encoding can be handled by emacs, one of the most
>    popular editors in the world.

> 3) No need to worry about locale. Mule-internal encoding has enough
>    information about language.

Um ... but ...

(1) nothing whatsoever can read MULE, except emacs and xemacs.

(2) there is more than one version of MULE (emacs versus xemacs,
not to mention any possible cross-version discrepancies).

(3) from a log volume standpoint, this could be pretty disastrous.

I'm not for a write-only solution, which is pretty much what this
would be.

            regards, tom lane

Re: [BUGS] main log encoding problem

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <ishii@postgresql.org> writes:
>> My idea is using mule-internal encoding for the log file instead of
>> UTF-8. There are several advantages:
>
>> 1) Conversion to mule-internal encoding is cheap because no conversion
>>    table is required. Also, no information loss happens in this
>>    conversion.
>
>> 2) Mule-internal encoding can be handled by emacs, one of the most
>>    popular editors in the world.
>
>> 3) No need to worry about locale. Mule-internal encoding has enough
>>    information about language.
>
> Um ... but ...
>
> (1) nothing whatsoever can read MULE, except emacs and xemacs.
>
> (2) there is more than one version of MULE (emacs versus xemacs,
> not to mention any possible cross-version discrepancies).
>
> (3) from a log volume standpoint, this could be pretty disastrous.
>
> I'm not for a write-only solution, which is pretty much what this
> would be.

I'm not sure how long xemacs will survive (the last stable release of
xemacs was in 2009). Anyway, I'm not too worried about your
points, since it's easy to convert back from a mule-internal encoded
log file to the original mixed-encoding log file. No information
will be lost. Even converting to UTF-8 should be possible. My point
is, once the log file is converted to UTF-8, there's no way to convert
back to the original encoding log file.

Probably we would treat mule-internal encoded log files as an internal
format, and have a utility which converts from mule-internal to
UTF-8.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: main log encoding problem

From
Craig Ringer
Date:
On 07/18/2012 11:16 PM, Alexander Law wrote:
> Hello!
>
> May I to propose a solution and to step up?
>
> I've read a discussion of the bug #5800 and here is my 2 cents.
> To make things clear let me give an example.
> I am a PostgreSQL hosting provider and I let my customers to create
> any databases they wish.
> I have clients all over the world (so they can create databases with
> different encoding).
>
> The question is - what I (as admin) want to see in my postgresql log,
> containing errors from all the databases?
> IMHO we should consider two requirements for the log.
> First, The file should be readable with a generic text viewer. Second,
> It should be useful and complete as possible.
>
> Now I see following solutions.
> A. We have different logfiles for each database with different encodings.
> Then all our logs will be readable, but we have to look at them one by
> one and it's inconvenient at least.
> Moreover, our log reader should understand what encoding to use for
> each file.
>
> B. We have one logfile with the operating system encoding.
> First downside is that the logs can be different for different OSes.
> The second is that Windows has non-Unicode system encoding.
> And such an encoding can't represent all the national characters. So
> at best I will get ??? in the log.
>
> C. We have one logfile with UTF-8.
> Pros: Log messages of all our clients can fit in it. We can use any
> generic editor/viewer to open it.
> Nothing changes for Linux (and other OSes with UTF-8 encoding).
> Cons: All the strings written to the log file should go through some
> conversion function.
>
> I think that the last solution is the solution. What is your opinion?

Implementing any of these isn't trivial - especially making sure
messages emitted to stderr from things like segfaults and dynamic linker
messages are always correct. Ensuring that the logging collector knows
when setlocale() has been called to change the encoding and translation
of system messages, handling the different logging output methods, etc -
it's going to be fiddly.

I have some performance concerns about the transcoding required for (b)
or (c), but realistically it's already the norm to convert all the data
sent to and from clients. Conversion for logging should not be a
significant additional burden. Conversion can be short-circuited out
when source and destination encodings are the same for the common case
of logging in utf-8 or to a dedicated file.

I suspect the eventual choice will be "all of the above":

- Default to (b) or (c), both have pros and cons. I favour (c) with a
UTF-8 BOM to warn editors, but (b) is nice for people whose DBs are all
in the system locale.

- Allow (a) for people who have many different DBs in many different
encodings, do high volume logging, and want to avoid conversion
overhead. Let them deal with the mess, just provide an additional % code
for the encoding so they can name their per-DB log files to indicate the
encoding.

The main issue is just that code needs to be prototyped, cleaned up, and
submitted. So far nobody's cared enough to design it, build it, and get
it through patch review. I've just foolishly volunteered myself to work
on an automated crash-test system for virtual plug-pull testing, so I'm
not stepping up.

--
Craig Ringer



Re: [BUGS] main log encoding problem

From
Alexander Law
Date:
Hello,

>> C. We have one logfile with UTF-8.
>> Pros: Log messages of all our clients can fit in it. We can use any
>> generic editor/viewer to open it.
>> Nothing changes for Linux (and other OSes with UTF-8 encoding).
>> Cons: All the strings written to the log file should go through some
>> conversion function.
>>
>> I think that the last solution is the right one. What is your opinion?
>
> I am thinking about a variant of C.
>
> The problem with C is that converting from other encodings to UTF-8 is not
> cheap, because it requires huge conversion tables. This may be a
> serious problem for a busy server. Also, it is possible that some information
> is lost in this conversion, because there's no
> guarantee of a one-to-one mapping between UTF-8 and other
> encodings. Another problem with UTF-8 is that you have to choose *one*
> locale when using your editor. This may or may not affect the handling of
> strings in your editor.
>
> My idea is to use the mule-internal encoding for the log file instead of
> UTF-8. There are several advantages:
>
> 1) Conversion to mule-internal encoding is cheap because no conversion
>    table is required. Also, no information loss happens in this conversion.
>
> 2) The mule-internal encoding can be handled by emacs, one of the most
>    popular editors in the world.
>
> 3) No need to worry about locale. The mule-internal encoding has enough
>    information about language.

I believe that postgres has such conversion functions anyway, and they are used for data conversion when we have clients (and databases) with different encodings. So if they can be used for data, why not use them for the relatively small amount of log messages?
And regarding the mule internal encoding: reading about Mule at http://www.emacswiki.org/emacs/UnicodeEncoding I found:
"In future (probably Emacs 22), Mule will use an internal encoding which is a UTF-8 encoding of a superset of Unicode."
So I still see UTF-8 as a common denominator for all the encodings.
I am not aware of any characters absent from Unicode. Can you please provide some examples of these that can result in lossy conversion?
Choosing UTF-8 in a viewer/editor is no big deal either. Most of them detect UTF-8 automagically, and for the others a BOM can be added.
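For reference, adding a BOM is a one-liner in most languages; a small Python sketch (the file name and message are illustrative):

```python
# Sketch: write a log file with a UTF-8 BOM so that editors which rely
# on the marker auto-detect the encoding ("utf-8-sig" prepends it).
with open("example.log", "w", encoding="utf-8-sig") as f:
    f.write('2012-05-19 16:07:16 CST 错误: 角色 "postgres" 已经存在\n')

# The first three bytes of the file are now EF BB BF, the UTF-8 BOM.
```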

Best regards,
Alexander

Re: main log encoding problem

From
Alexander Law
Date:
Hello,
>
> Implementing any of these isn't trivial - especially making sure
> messages emitted to stderr from things like segfaults and dynamic
> linker messages are always correct. Ensuring that the logging
> collector knows when setlocale() has been called to change the
> encoding and translation of system messages, handling the different
> logging output methods, etc - it's going to be fiddly.
>
> I have some performance concerns about the transcoding required for
> (b) or (c), but realistically it's already the norm to convert all the
> data sent to and from clients. Conversion for logging should not be a
> significant additional burden. Conversion can be short-circuited out
> when source and destination encodings are the same for the common case
> of logging in utf-8 or to a dedicated file.
>
The initial issue was that the log file contains messages in different
encodings. So transcoding is performed already, but it's not consistent,
and in my opinion this is the main problem.

> I suspect the eventual choice will be "all of the above":
>
> - Default to (b) or (c), both have pros and cons. I favour (c) with a
> UTF-8 BOM to warn editors, but (b) is nice for people whose DBs are
> all in the system locale.
As I understand it, UTF-8 is the default encoding for databases. And even
when a database is in the system encoding, translated postgres messages
still come in UTF-8 and go through a UTF-8 -> system locale
conversion within gettext.
>
> - Allow (a) for people who have many different DBs in many different
> encodings, do high volume logging, and want to avoid conversion
> overhead. Let them deal with the mess, just provide an additional %
> code for the encoding so they can name their per-DB log files to
> indicate the encoding.
>
I think that the (a) solution can be an evolution of the logging mechanism
if there is a need for it.
> The main issue is just that code needs to be prototyped, cleaned up,
> and submitted. So far nobody's cared enough to design it, build it,
> and get it through patch review. I've just foolishly volunteered
> myself to work on an automated crash-test system for virtual plug-pull
> testing, so I'm not stepping up.
>
I see your point, and I can prepare a prototype if the proposed (c)
solution seems reasonable enough and can be accepted.

Best regards,
Alexander


Re: [BUGS] main log encoding problem

From
Tatsuo Ishii
Date:
>> I am thinking about variant of C.
>>
>> The problem with C is that converting from other encodings to UTF-8 is not
>> cheap, because it requires huge conversion tables. This may be a
>> serious problem for a busy server. Also, it is possible that some information
>> is lost in this conversion, because there's no
>> guarantee of a one-to-one mapping between UTF-8 and other
>> encodings. Another problem with UTF-8 is that you have to choose *one*
>> locale when using your editor. This may or may not affect the handling of
>> strings in your editor.
>>
>> My idea is using mule-internal encoding for the log file instead of
>> UTF-8. There are several advantages:
>>
>> 1) Conversion to mule-internal encoding is cheap because no conversion
>>     table is required. Also, no information loss happens in this
>>     conversion.
>>
>> 2) Mule-internal encoding can be handled by emacs, one of the most
>>     popular editors in the world.
>>
>> 3) No need to worry about locale. Mule-internal encoding has enough
>>     information about language.
>> --
>>
> I believe that postgres has such conversion functions anyway. And they
> used for data conversion when we have clients (and databases) with
> different encodings. So if they can be used for data, why not to use
> them for relatively little amount of log messages?

Frontend/backend encoding conversion only happens when the encodings
differ, while conversion for logs would *always* happen. A busy database
can produce tons of log output (it is not unusual to log all SQL
statements for auditing purposes).

> And regarding mule internal encoding - reading about Mule
> http://www.emacswiki.org/emacs/UnicodeEncoding I found:
> /In future (probably Emacs 22), Mule will use an internal encoding
> which is a UTF-8 encoding of a superset of Unicode. /
> So I still see UTF-8 as a common denominator for all the encodings.
> I am not aware of any characters absent in Unicode. Can you please
> provide some examples of these that can results in lossy conversion?

You can google for "encoding "EUC_JP" has no equivalent in "UTF8"" or
some such to find such an example. In this case PostgreSQL just throws
an error. For frontend/backend encoding conversion this is fine, but
what should we do for logs? Apparently we cannot throw an error here.

"Unification" is another problem. Some kanji characters of CJK are
"unified" in Unicode. The idea of unification is: if kanji A in China,
B in Japan, and C in Korea look "similar", unify A, B, and C into D. This
is a great space saving:-) The price of this is the inability to
round-trip: you can convert A, B, or C to D, but you cannot
convert D back to A/B/C.

BTW, I'm not stuck on the mule-internal encoding. What we need here is a
"super" encoding which can include any existing encoding without
information loss. For this purpose, I think we could even invent a new
encoding (maybe something like the very first proposal of ISO/IEC
10646?). However, using UTF-8 for this purpose seems like a
disaster to me.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: [BUGS] main log encoding problem

From
Tatsuo Ishii
Date:
> Hello,
>>
>> Implementing any of these isn't trivial - especially making sure
>> messages emitted to stderr from things like segfaults and dynamic
>> linker messages are always correct. Ensuring that the logging
>> collector knows when setlocale() has been called to change the
>> encoding and translation of system messages, handling the different
>> logging output methods, etc - it's going to be fiddly.
>>
>> I have some performance concerns about the transcoding required for
>> (b) or (c), but realistically it's already the norm to convert all the
>> data sent to and from clients. Conversion for logging should not be a
>> significant additional burden. Conversion can be short-circuited out
>> when source and destination encodings are the same for the common case
>> of logging in utf-8 or to a dedicated file.
>>
> The initial issue was that log file contains messages in different
> encodings. So transcoding is performed already, but it's not

This is not true. Transcoding happens only when PostgreSQL is built
with the --enable-nls option (the default is no NLS).

> consistent and in my opinion this is the main problem.
>
>> I suspect the eventual choice will be "all of the above":
>>
>> - Default to (b) or (c), both have pros and cons. I favour (c) with a
>> - UTF-8 BOM to warn editors, but (b) is nice for people whose DBs are
>> - all in the system locale.
> As I understand UTF-8 is the default encoding for databases. And even
> when a database is in the system encoding, translated postgres
> messages still come in UTF-8 and will go through UTF-8 -> System
> locale conversion within gettext.

Again, this is not always true.

>> - Allow (a) for people who have many different DBs in many different
>> - encodings, do high volume logging, and want to avoid conversion
>> - overhead. Let them deal with the mess, just provide an additional %
>> - code for the encoding so they can name their per-DB log files to
>> - indicate the encoding.
>>
> I think that (a) solution can be an evolvement of the logging
> mechanism if there will be a need for it.
>> The main issue is just that code needs to be prototyped, cleaned up,
>> and submitted. So far nobody's cared enough to design it, build it,
>> and get it through patch review. I've just foolishly volunteered
>> myself to work on an automated crash-test system for virtual plug-pull
>> testing, so I'm not stepping up.
>>
> I see you point and I can prepare a prototype if the proposed (c)
> solution seems reasonable enough and can be accepted.
>
> Best regards,
> Alexander
>
>
> --

Re: [BUGS] main log encoding problem

From
Alexander Law
Date:
>> The initial issue was that log file contains messages in different
>> encodings. So transcoding is performed already, but it's not
> This is not true. Transcoding happens only when PostgreSQL is built
> with --enable-nls option (default is no nls).
I'll restate the initial issue as I see it.
I have Windows and I'm installing PostgreSQL for Windows (latest
version, downloaded from EnterpriseDB). Then I create a database with
default settings (with UTF-8 encoding), do something wrong in my DB, and
get a log file with two different encodings (UTF-8 and
Windows-1251 (ANSI)) and with localized postgres messages.


Re: [BUGS] main log encoding problem

From
Alexander Law
Date:
>> And regarding mule internal encoding - reading about Mule
>> http://www.emacswiki.org/emacs/UnicodeEncoding I found:
>> /In future (probably Emacs 22), Mule will use an internal encoding
>> which is a UTF-8 encoding of a superset of Unicode. /
>> So I still see UTF-8 as a common denominator for all the encodings.
>> I am not aware of any characters absent in Unicode. Can you please
>> provide some examples of these that can results in lossy conversion?
> You can google by "encoding "EUC_JP" has no equivalent in "UTF8"" or
> some such to find such an example. In this case PostgreSQL just throw
> an error. For frontend/backend encoding conversion this is fine. But
> what should we do for logs? Apparently we cannot throw an error here.
>
> "Unification" is another problem. Some kanji characters of CJK are
> "unified" in Unicode. The idea of unification is, if kanji A in China,
> B in Japan, C in Korea looks "similar" unify ABC to D. This is a great
> space saving:-) The price of this is inablity of
> round-trip-conversion. You can convert A, B or C to D, but you cannot
> convert D to A/B/C.
>
> BTW, I'm not stick with mule-internal encoding. What we need here is a
> "super" encoding which could include any existing encodings without
> information loss. For this purpose, I think we can even invent a new
> encoding(maybe something like very first prposal of ISO/IEC
> 10646?). However, using UTF-8 for this purpose seems to be just a
> disaster to me.
>
Ok, maybe the time of a truly universal encoding has not yet come. Then
maybe we should just add a new parameter, "log_encoding" (UTF-8 by default),
to postgresql.conf, and use this encoding consistently within the
logging collector.
If this encoding is not available, then fall back to 7-bit ASCII.
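The proposed parameter (hypothetical; no such GUC exists in PostgreSQL at the time of this thread) would look something like this in postgresql.conf:

```
# Hypothetical setting sketched in this thread, not an actual GUC:
log_encoding = 'UTF-8'    # encoding used consistently by the logging collector
```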


Re: [BUGS] main log encoding problem

From
Tatsuo Ishii
Date:
>> You can google by "encoding "EUC_JP" has no equivalent in "UTF8"" or
>> some such to find such an example. In this case PostgreSQL just throw
>> an error. For frontend/backend encoding conversion this is fine. But
>> what should we do for logs? Apparently we cannot throw an error here.
>>
>> "Unification" is another problem. Some kanji characters of CJK are
>> "unified" in Unicode. The idea of unification is, if kanji A in China,
>> B in Japan, C in Korea looks "similar" unify ABC to D. This is a great
>> space saving:-) The price of this is inablity of
>> round-trip-conversion. You can convert A, B or C to D, but you cannot
>> convert D to A/B/C.
>>
>> BTW, I'm not stick with mule-internal encoding. What we need here is a
>> "super" encoding which could include any existing encodings without
>> information loss. For this purpose, I think we can even invent a new
>> encoding(maybe something like very first prposal of ISO/IEC
>> 10646?). However, using UTF-8 for this purpose seems to be just a
>> disaster to me.
>>
> Ok, maybe the time of real universal encoding has not yet come. Then
> we maybe just should add a new parameter "log_encoding" (UTF-8 by
> default) to postgresql.conf. And to use this encoding consistently
> within logging_collector.
> If this encoding is not available then fall back to 7-bit ASCII.

What do you mean by "not available"?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: [BUGS] main log encoding problem

From
Alexander Law
Date:
>> Ok, maybe the time of real universal encoding has not yet come. Then
>> we maybe just should add a new parameter "log_encoding" (UTF-8 by
>> default) to postgresql.conf. And to use this encoding consistently
>> within logging_collector.
>> If this encoding is not available then fall back to 7-bit ASCII.
> What do you mean by "not available"?
Sorry, that was an inaccurate phrase. I meant "if the conversion to this
encoding is not available". For example, when we have a database in EUC_JP
and log_encoding set to Latin1. I think that we could even fall back to
UTF-8, since we can convert all encodings to it (with the exceptions that
you noted).


Re: [BUGS] main log encoding problem

From
Tatsuo Ishii
Date:
> Sorry, it was inaccurate phrase. I mean "if the conversion to this
> encoding is not avaliable". For example, when we have database in
> EUC_JP and log_encoding set to Latin1. I think that we can even fall
> back to UTF-8 as we can convert all encodings to it (with some
> exceptions that you noticed).

So, what you wanted to say here is:

"If the conversion to this encoding is not available, then fall back to
UTF-8."

Am I correct?

Also, is it possible to completely disable the feature?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: [BUGS] main log encoding problem

From
Alban Hertroys
Date:
On 19 July 2012 10:40, Alexander Law <exclusion@gmail.com> wrote:
>>> Ok, maybe the time of real universal encoding has not yet come. Then
>>> we maybe just should add a new parameter "log_encoding" (UTF-8 by
>>> default) to postgresql.conf. And to use this encoding consistently
>>> within logging_collector.
>>> If this encoding is not available then fall back to 7-bit ASCII.
>>
>> What do you mean by "not available"?
>
> Sorry, it was inaccurate phrase. I mean "if the conversion to this encoding
> is not avaliable". For example, when we have database in EUC_JP and
> log_encoding set to Latin1. I think that we can even fall back to UTF-8 as
> we can convert all encodings to it (with some exceptions that you noticed).

I like Craig's idea of adding the client encoding to the log lines. A
possible problem with that (I'm not an encoding expert) is that a log
line like that will contain data about the database server meta-data
(log time, client encoding, etc) in the database default encoding and
database data (the logged query and user-supplied values) in the
client encoding. One option would be to use the client encoding for
the entire log line, but would that result in legible meta-data in
every encoding?

It appears that the primarly here is that SQL statements and
user-supplied data are being logged, while the log-file is a text file
in a fixed encoding.
Perhaps another solution would be to add the ability to log certain
types of information (not the core database server log info, of
course!) to a database/table so that each record can be stored in its
own encoding?
That way the transcoding doesn't have to take place until someone is
reading the log, you'd know what to transcode the data to (namely the
client_encoding of the reading session) and there isn't any issue of
transcoding errors while logging statements.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

Re: [BUGS] main log encoding problem

From
Alban Hertroys
Date:
Yikes, messed up my grammar a bit I see!

On 19 July 2012 10:58, Alban Hertroys <haramrae@gmail.com> wrote:

> I like Craig's idea of adding the client encoding to the log lines. A
> possible problem with that (I'm not an encoding expert) is that a log
> line like that will contain data about the database server meta-data
> (log time, client encoding, etc) in the database default encoding and

...will contain meta-data about the database server (log time...

> It appears that the primarly here is that SQL statements and

It appears the primary issue here...

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

Re: [BUGS] main log encoding problem

From
Alexander Law
Date:
>> Sorry, it was inaccurate phrase. I mean "if the conversion to this
>> encoding is not avaliable". For example, when we have database in
>> EUC_JP and log_encoding set to Latin1. I think that we can even fall
>> back to UTF-8 as we can convert all encodings to it (with some
>> exceptions that you noticed).
> So, what you wanted to say here is:
>
> "If the conversion to this encoding is not available then fall back to
> UTF-8"
>
> Am I correct?
>
> Also is it possible to completely disable the feature?
>
Yes, you're right.  I think it could be disabled by setting log_encoding='',
but if the parameter is missing then the feature should be enabled (with
UTF-8).
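The rule being converged on here can be sketched as follows (editor's illustration; `log_encoding` is the proposed parameter, and the function and predicate names are hypothetical, not PostgreSQL source):

```python
def choose_log_encoding(log_encoding, conversion_available):
    """Editor's sketch of the proposed fallback rule.

    log_encoding         -- value of the proposed parameter; '' disables
                            the feature, None means it is missing.
    conversion_available -- predicate: can the server convert the message's
                            source encoding into the given target?
    """
    if log_encoding == "":
        return None                    # feature disabled
    target = log_encoding or "UTF-8"   # missing parameter -> UTF-8 default
    if conversion_available(target):
        return target
    return "UTF-8"                     # everything converts to UTF-8, so fall back

# The example from the thread: database in EUC_JP, log_encoding = LATIN1,
# and no EUC_JP -> LATIN1 conversion exists, so the server falls back:
print(choose_log_encoding("LATIN1", lambda enc: enc == "UTF-8"))  # → UTF-8
```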



Re: [BUGS] main log encoding problem

From
Alexander Law
Date:
> I like Craig's idea of adding the client encoding to the log lines. A
> possible problem with that (I'm not an encoding expert) is that a log
> line like that will contain data about the database server meta-data
> (log time, client encoding, etc) in the database default encoding and
> database data (the logged query and user-supplied values) in the
> client encoding. One option would be to use the client encoding for
> the entire log line, but would that result in legible meta-data in
> every encoding?
I think we would then get non-human-readable logs. We would need one more
tool to open and convert the log (and to omit the excessive encoding
specification on each line).
> It appears that the primarly here is that SQL statements and
> user-supplied data are being logged, while the log-file is a text file
> in a fixed encoding.
Yes, and in my opinion there is nothing unusual about it. XML/HTML
are examples of text files with a fixed encoding that can contain
multi-language strings. UTF-8 is the default encoding for XML, and when
it's not good enough (as Tatsuo noticed), you can still switch to another.
> Perhaps another solution would be to add the ability to log certain
> types of information (not the core database server log info, of
> course!) to a database/table so that each record can be stored in its
> own encoding?
> That way the transcoding doesn't have to take place until someone is
> reading the log, you'd know what to transcode the data to (namely the
> client_encoding of the reading session) and there isn't any issue of
> transcoding errors while logging statements.
I don't think it would be the simplest solution to the existing problem.
It could be another branch of evolution, but it doesn't answer the
question: what encoding should the core database server log use?


Re: [BUGS] main log encoding problem

From
Alban Hertroys
Date:
On 19 July 2012 13:50, Alexander Law <exclusion@gmail.com> wrote:
>> I like Craig's idea of adding the client encoding to the log lines. A
>> possible problem with that (I'm not an encoding expert) is that a log
>> line like that will contain data about the database server meta-data
>> (log time, client encoding, etc) in the database default encoding and
>> database data (the logged query and user-supplied values) in the
>> client encoding. One option would be to use the client encoding for
>> the entire log line, but would that result in legible meta-data in
>> every encoding?
>
> I think then we get non-human readable logs. We will need one more tool to
> open and convert the log (and omit excessive encoding specification in each
> line).

Only the parts that contain user-supplied data in very different
encodings would not be "human readable", similar to what we already
have.

>> It appears that the primarly here is that SQL statements and
>> user-supplied data are being logged, while the log-file is a text file
>> in a fixed encoding.
>
> Yes, and in in my opinion there is nothing unusual about it. XML/HTML are
> examples of a text files with fixed encoding that can contain multi-language
> strings. UTF-8 is the default encoding for XML. And when it's not good
> enough (as Tatsou noticed), you still can switch to another.

Yes, but in those examples it is acceptable that the application fails
to write the output. That, and the output needs to be converted to
various different client encodings (namely that of the visitor's
browser) anyway, so it does not really add any additional overhead.

This doesn't hold true for database server log files. Ideally, writing
those has to be reliable (how are you going to catch errors
otherwise?) and should not impact the performance of the database
server in a significant way (the less the better). The end result will
probably be somewhere in the middle.
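The reliability requirement — a log write that must never fail on encoding — maps onto what codec libraries call a lossy error handler. A sketch (editor's illustration; the helper name is invented):

```python
import io

def write_log_line(f, text, target="utf-8"):
    # Never raises on encoding: characters the target encoding cannot
    # represent degrade to escape sequences instead of aborting the write.
    f.write(text.encode(target, errors="backslashreplace") + b"\n")

buf = io.BytesIO()
write_log_line(buf, "ERROR: роль уже существует", target="latin-1")
print(buf.getvalue())  # Cyrillic becomes \uXXXX escapes, but the write succeeds
```

The trade-off is exactly the one discussed above: the write always succeeds and never blocks the server, at the price of mangling data that doesn't fit the target encoding.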

>> Perhaps another solution would be to add the ability to log certain
>> types of information (not the core database server log info, of
>> course!) to a database/table so that each record can be stored in its
>> own encoding?
>> That way the transcoding doesn't have to take place until someone is
>> reading the log, you'd know what to transcode the data to (namely the
>> client_encoding of the reading session) and there isn't any issue of
>> transcoding errors while logging statements.
>
> I don't think it would be the simplest solution of the existing problem. It
> can be another branch of evolution, but it doesn't answer the question -
> what encoding to use for the core database server log?

It makes that problem much easier. If you need the "human-readable"
logs, you can write those to a different log (namely one in the
database). The result is that the server can use pretty much any
encoding (or a mix of multiple!) to write its log files.

You'll need a query to read the human-readable logs of course, but
since they're in the database, all the tools you need are already
available to you.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

Re: [BUGS] main log encoding problem

From
Craig Ringer
Date:
On 07/19/2012 03:24 PM, Tatsuo Ishii wrote:
> BTW, I'm not stick with mule-internal encoding. What we need here is a
> "super" encoding which could include any existing encodings without
> information loss. For this purpose, I think we can even invent a new
> encoding (maybe something like the very first proposal of ISO/IEC
> 10646?). However, using UTF-8 for this purpose seems to be just a
> disaster to me.

Good point re unified chars. That was always a bad idea, and that's just
one of the issues it causes.

I think these difficult encodings are where logging to a dedicated file
per database is useful.

I'm not convinced that a weird and uncommon encoding is the answer. I
guess it could work as an alternative for people who find it useful, if
it's low-cost in terms of complexity/maintenance/etc...
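For the lossless-conversion half of the argument, byte-level round-trips through UTF-8 are easy to check (editor's sketch); the Han-unification objection lives one level above the bytes:

```python
# Editor's sketch: common server encodings round-trip through UTF-8
# without losing any bytes.
samples = {
    "euc_jp":  "データベース",
    "gb18030": "数据库",
    "latin-1": "café",
}
for enc, text in samples.items():
    raw = text.encode(enc)
    back = raw.decode(enc).encode("utf-8").decode("utf-8").encode(enc)
    assert back == raw  # lossless in both directions

# The objection above is not about bytes: once EUC_JP and GB18030 text
# sit in the same UTF-8 stream, unified Han codepoints no longer record
# which source character set (and glyph tradition) they came from.
print("all round-trips lossless")
```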

--
Craig Ringer

Re: [BUGS] main log encoding problem

From
Craig Ringer
Date:
On 07/19/2012 04:58 PM, Alban Hertroys wrote:
> On 19 July 2012 10:40, Alexander Law <exclusion@gmail.com> wrote:
>>>> Ok, maybe the time of real universal encoding has not yet come. Then
>>>> we maybe just should add a new parameter "log_encoding" (UTF-8 by
>>>> default) to postgresql.conf. And to use this encoding consistently
>>>> within logging_collector.
>>>> If this encoding is not available then fall back to 7-bit ASCII.
>>> What do you mean by "not available"?
>> Sorry, it was an inaccurate phrase. I mean "if the conversion to this encoding
>> is not available". For example, when we have a database in EUC_JP and
>> log_encoding set to Latin1. I think that we can even fall back to UTF-8 as
>> we can convert all encodings to it (with some exceptions that you noticed).
> I like Craig's idea of adding the client encoding to the log lines.
Nonono! Log *file* *names* when one-file-per-database is in use.

Encoding as a log line prefix is a terrible idea for all sorts of reasons.
--
Craig Ringer