Re: BUG #5661: The character encoding in logfile is confusing.

From Craig Ringer
Subject Re: BUG #5661: The character encoding in logfile is confusing.
Date
Msg-id 4C9D6D6F.4050806@postnewspapers.com.au
In reply to Re: BUG #5661: The character encoding in logfile is confusing.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 09/22/2010 09:55 PM, Tom Lane wrote:
> Peter Eisentraut<peter_e@gmx.net>  writes:
>> On ons, 2010-09-22 at 19:25 +0800, Craig Ringer wrote:
>>> I still wonder if, rather than making this configurable, the right
>>> choice is to force logging to UTF-8 (with BOM) across the board,
>
>> I don't think this would make things better or easier.  At some point
>> you're going to have to insert a recode call, and it doesn't matter much
>> whether the destination argument is a constant or a variable.
>
> It'd avoid the problem of having possibly-unconvertable messages ...
> at the cost of pissing off users who have a uniform server encoding
> selection already and don't see why they should be forced to deal with
> UTF8 in the log.
>
> It's pretty much just one step from here to deciding that the server
> should work exclusively in UTF8 and never mind all those other legacy
> encodings.  We've resisted that attitude for quite some years now,
> and are probably not really ready to adopt it for the log either.

Fair enough. The current approach is broken, though. Mis-encoded 
messages the user can't read are of little more use to them than messages 
that are never logged.

I see four options here (two of which are practical IMO), plus the 
status quo listed as (0):

(1) Log in UTF-8, convert everything to UTF-8. Better for admin tools & 
apps, sucks for OS utilities/grep/etc on non-utf-8 locales. Preserves 
all messages no matter what the database and system encodings are.

(2) Log in default encoding for locale, convert all messages to that 
encoding. Where characters cannot be represented in the target encoding 
replace them with a placeholder (? or something). Better - but far from 
good - for OS utilities/grep/etc, sucks for admin tools and apps. 
Doesn't preserve all messages properly if user has databases in 
encodings other than the system encoding.

(3) Have a log for the postmaster in the default locale for the system. 
Have a log file for each database that's in the encoding for that 
database. IMO this is the worst of both worlds, but it does preserve 
original encodings without transcoding or forcing a particular encoding 
and does preserve messages. Horribly complicated for admin tools, 
inconsistent and horrid for grep etc.

(4) Keep things much as they are, but log an encoding identifier prefix 
for each line. Lets GUI/admin tools post-process the logs into something 
sane, permits automated log processing because line encodings are known. 
Sucks for shell tools, which can't tell which lines are which; we'd need 
to provide a "pggrep" and "pgless" for reliable log search! Preserves 
all messages, but not in a reliably searchable manner.

(0) Change nothing. Log all messages in the original encoding they were 
generated in. Perform no conversion. Logs contain mixed encodings. 
Horrible for admin/gui tools (broken text). Horrible for shell 
utilities/OS tools (can't trust grep results etc). Automatic log 
processing impossible as the encoding for each line isn't known and 
can't be reliably discovered.
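
To make the trade-off between (1) and (2) concrete, here is a minimal Python sketch of the recode step. It is an illustration only, not server code: the function name `recode`, the sample message, and the choice of EUC-JP as the database encoding and Latin-1 as the locale encoding are all assumptions made for the example.

```python
def recode(raw: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Convert one log message from its source encoding to the log's encoding.

    Characters the target encoding cannot represent are replaced by '?',
    which is the placeholder behaviour described for option (2).
    """
    text = raw.decode(source_encoding)
    return text.encode(target_encoding, errors="replace")


# Hypothetical message produced by a database whose encoding is EUC-JP.
message = 'ERROR: relation "顧客" does not exist'
raw = message.encode("euc_jp")

# Option (1): always log in UTF-8. Nothing is lost.
as_utf8 = recode(raw, "euc_jp", "utf-8")

# Option (2): log in the locale's encoding (Latin-1 assumed here).
# The identifier cannot be represented and turns into placeholders.
as_latin1 = recode(raw, "euc_jp", "latin-1")

print(as_utf8.decode("utf-8"))
print(as_latin1.decode("latin-1"))
```

With a UTF-8 target the message survives whatever the source encoding is; with a narrower target the line stays readable by tools in that locale, but the identifier is gone.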


As far as I'm concerned, (3) is out. It's horrible. I don't think the 
status quo (0) is OK either; it's producing broken log files. (4) is 
pretty awful too, but it's the smallest change that kind-of fixes the 
issue, to the point where it's at least possible for PgAdmin etc to 
convert the logs into a consistent encoding.
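
As a rough idea of what that post-processing for (4) could look like, here is a Python sketch. The line format (`[encoding] ` followed by the message bytes) is purely hypothetical, as are the helper names `tag_line` and `normalize_log`; nothing like this exists in the server today.

```python
def tag_line(message: str, encoding: str) -> bytes:
    """Build one log line: b'[<encoding>] ' plus the message in that encoding."""
    return b"[" + encoding.encode("ascii") + b"] " + message.encode(encoding)


def normalize_log(raw_log: bytes) -> str:
    """Decode each tagged line using its own prefix and return one Unicode string."""
    out = []
    for line in raw_log.split(b"\n"):
        if not line:
            continue
        tag, sep, payload = line.partition(b"] ")
        if not sep or not tag.startswith(b"["):
            # Untagged line: keep it visible rather than guess its encoding.
            out.append(line.decode("ascii", errors="replace"))
            continue
        encoding = tag[1:].decode("ascii")
        out.append(payload.decode(encoding, errors="replace"))
    return "\n".join(out)


# Two hypothetical databases with different encodings writing to one log.
raw_log = b"\n".join([
    tag_line('ERROR: relation "顧客" does not exist', "euc_jp"),
    tag_line('ERROR: relation "café" does not exist', "latin-1"),
])

print(normalize_log(raw_log))
```

A tool that knows the prefix can recover every line. A plain grep for "café" from a UTF-8 terminal would still miss the second line, because the bytes in the file are Latin-1, which is why shell tools lose out under this option.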

IMO it's down to (1) and (2). There's no clear consensus between those 
two, so I'd be inclined to offer the admin the choice between them as a 
config option, depending on the trade-off they prefer to make.

For sensible systems in a utf-8 locale (1) and (2) are equivalent, and 
(2) is fine for systems where the database encoding is always the same 
as the system encoding. It's only for systems with a non-utf-8 locale 
that use databases in encodings other than the system locale's encoding 
that problems arise. In this case they're going to get suboptimal 
results one way or the other; it's just a matter of letting them pick how.

Thoughts?

I should ask on the various language-specific mailing lists and see what 
people there have to say about it. Maybe it doesn't affect people enough 
in practice for them to care.

--
Craig Ringer

