Re: Unicode escapes with any backend encoding
From | Chapman Flack |
---|---|
Subject | Re: Unicode escapes with any backend encoding |
Date | |
Msg-id | ef2648e8-66dc-c95c-c5ad-72ff05191c2c@anastigmatix.net |
In reply to | Re: Unicode escapes with any backend encoding (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On 1/14/20 4:25 PM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack <chap@anastigmatix.net> wrote:
>>> On 1/14/20 10:10 AM, Tom Lane wrote:
>>>> to me that this error is just useless pedantry. As long as the DB
>>>> encoding can represent the desired character, it should be transparent
>>>> to users.
>
>>> That's my position too.
>
>> and mine.
>
> I'm confused --- yesterday you seemed to be against this idea.
> Have you changed your mind?
>
> I'll gladly go change the patch if people are on board with this.

Hmm, well, let me clarify for my own part what I think I'm agreeing
with ... perhaps it's misaligned with something further upthread.

In an ideal world (which may be ideal in more ways than are in scope
for the present discussion) I would expect to see these principles:

1. On input, whether a Unicode escape is or isn't allowed should
   not depend on any encoding settings. It should be lexically
   allowed always, and if it represents a character that exists
   in the server encoding, it should mean that character. If it's
   not representable in the storage format, it should produce an
   error that says that.

2. If it happens that the character is representable in both the
   storage encoding and the client encoding, it shouldn't matter
   whether it arrives literally as an é or as an escape. Either
   should get stored on disk as the same bytes.

3. On output, as long as the character is representable in the client
   encoding, there is nothing to worry about. It will be sent as its
   representation in the client encoding (which may be different bytes
   than its representation in the server encoding).

4. If a character to be output isn't in the client encoding, it
   will be datatype-dependent whether there is any way to escape.
   For example, xml_out could produce &#x????; forms, and json_out
   could produce \u???? forms.

5. If the datatype being output has no escaping rules available
   (as would be the case for an ordinary text column, say), then
   the unrepresentable character has to be reported in an error.
   (Encoding conversions often have the option of substituting
   a replacement character like ? but I don't believe a DBMS has
   any business making such changes to data, unless by explicit
   opt-in. If it can't give you the data you wanted, it should
   say "here's why I can't give you that.")

6. While 'text' in general provides no escaping mechanism, some
   functions that produce text may still have that option. For
   example, quote_literal and quote_ident could conceivably
   produce the U&'...' or U&"..." forms, respectively, if
   the argument contains characters that won't go in the client
   encoding.

I understand that on the way from 1 to 6 I will have drifted
further from what's discussed in this thread; for example, I bet
that quote_literal/quote_ident never produce U& forms now, and
that no one is proposing to change that, and I'm pretending not
to notice the question of how astonishing such behavior could be.
(Not to mention, how would they know whether they are returning
a value that's destined to go across the client encoding, rather
than to be used in a purely server-side expression? Maybe distinct
versions of those functions could take an encoding argument, and
produce the U& forms when the content won't go in the specified
encoding. That would avoid astonishing changes to existing functions.)

Regards,
-Chap
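
For illustration only, principles 1 through 5 above can be sketched in a few lines of Python. This is a toy model, not PostgreSQL code: the names store, send, xml_escape and json_escape are hypothetical, Python's codecs stand in for the server and client encodings, and the escape is assumed to have been turned into a code point by the lexer already.

```python
from typing import Callable, Optional


def store(text: str, server_encoding: str) -> bytes:
    """Principles 1 and 2: accept any code point lexically, then fail only
    if the server (storage) encoding cannot represent it."""
    try:
        return text.encode(server_encoding)
    except UnicodeEncodeError as exc:
        bad = text[exc.start]
        raise ValueError(
            f"character U+{ord(bad):04X} is not representable in "
            f"server encoding {server_encoding}"
        ) from exc


def xml_escape(ch: str) -> str:
    """Numeric character reference, as an xml_out-like function might emit."""
    return f"&#x{ord(ch):X};"


def json_escape(ch: str) -> str:
    """\\uXXXX form, using a surrogate pair outside the BMP."""
    cp = ord(ch)
    if cp > 0xFFFF:
        cp -= 0x10000
        return f"\\u{0xD800 + (cp >> 10):04x}\\u{0xDC00 + (cp & 0x3FF):04x}"
    return f"\\u{cp:04x}"


def send(
    text: str,
    client_encoding: str,
    escape: Optional[Callable[[str], str]] = None,
) -> bytes:
    """Principles 3 to 5: send representable characters as-is, escape the
    rest if the datatype has an escape rule, otherwise report an error.

    Encoding one character at a time is fine for the stateless encodings
    used here (ascii, latin-1, utf-8); it is not a general technique.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(client_encoding)
        except UnicodeEncodeError:
            if escape is None:
                # Principle 5: no silent substitution of '?'.
                raise ValueError(
                    f"character U+{ord(ch):04X} is not representable in "
                    f"client encoding {client_encoding}"
                )
            out += escape(ch).encode(client_encoding)
    return bytes(out)
```

In this model, store("\u00e9", "latin-1") and store("é", "latin-1") give the same bytes (principle 2), send("café", "ascii", xml_escape) falls back to a character reference (principle 4), and send("café", "ascii") with no escape rule raises instead of substituting a replacement character (principle 5).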