Обсуждение: [PATCH] Reject ENCODING option for COPY TO FORMAT JSON

Поиск
Список
Период
Сортировка

[PATCH] Reject ENCODING option for COPY TO FORMAT JSON

От
Ayush Tiwari
Дата:
Hi hackers,

COPY TO FORMAT JSON silently accepts the ENCODING option but doesn't
perform encoding conversion(?)  CopyToJsonOneRow() sends the output of
composite_to_json() via CopySendData() without calling
pg_server_to_any(), unlike the text and CSV paths.

  COPY t TO '/tmp/out.json' WITH (FORMAT json, ENCODING 'LATIN1');

On a UTF-8 server this produces UTF-8 output, not LATIN1.

RFC 8259 says JSON text must be UTF-8, so arguably JSON output
should never be converted.  But even under that interpretation,
silently accepting the option and ignoring it looks wrong, the user
explicitly asked for LATIN1 and got something else.  The same issue
also affects COPY TO STDOUT when client_encoding differs from the
server encoding, since the default file_encoding is the client
encoding and CopyToJsonOneRow never checks need_transcoding.

The attached patch rejects the explicit ENCODING option for JSON
mode, consistent with how DELIMITER, NULL, DEFAULT, and HEADER are
already rejected.  The implicit client_encoding case is a separate
design question (should COPY TO JSON always emit UTF-8 regardless
of client_encoding?) that maybe we should address separately and not as 
part of v19.

Introduced by 7dadd38cda9 (json format for COPY TO). I've attached a patch
for rejecting the ENCODING option. Thoughts?
Вложения

Re: [PATCH] Reject ENCODING option for COPY TO FORMAT JSON

От
Daniel Gustafsson
Дата:
> On 20 Apr 2026, at 08:06, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:

> The attached patch rejects the explicit ENCODING option for JSON
> mode, consistent with how DELIMITER, NULL, DEFAULT, and HEADER are
> already rejected.

Given that we reject other incompatible parameters it makes sense to reject
this one as well, however I think we can expand the comment a little and
explain why.

--
Daniel Gustafsson




Re: [PATCH] Reject ENCODING option for COPY TO FORMAT JSON

От
Ayush Tiwari
Дата:
Hi,

On Mon, 20 Apr 2026 at 13:56, Daniel Gustafsson <daniel@yesql.se> wrote:
> On 20 Apr 2026, at 08:06, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:

> The attached patch rejects the explicit ENCODING option for JSON
> mode, consistent with how DELIMITER, NULL, DEFAULT, and HEADER are
> already rejected.

Given that we reject other incompatible parameters it makes sense to reject
this one as well, however I think we can expand the comment a little and
explain why.


Thanks Daniel. Agreed, v2 attached with an expanded comment
explaining why the option is rejected, I've tried to make it small, because rest
rejected ones did not have relevant comments.

  /*
   * Reject ENCODING for JSON format.  JSON output is produced as
   * a whole by composite_to_json(), so the per-attribute encoding
   * conversion done for text and CSV output is not applied.
   */

Regards,
Ayush

Вложения

Re: [PATCH] Reject ENCODING option for COPY TO FORMAT JSON

От
Tom Lane
Дата:
Ayush Tiwari <ayushtiwari.slg01@gmail.com> writes:
> COPY TO FORMAT JSON silently accepts the ENCODING option but doesn't
> perform encoding conversion(?)  CopyToJsonOneRow() sends the output of
> composite_to_json() via CopySendData() without calling
> pg_server_to_any(), unlike the text and CSV paths.

>   COPY t TO '/tmp/out.json' WITH (FORMAT json, ENCODING 'LATIN1');

> On a UTF-8 server this produces UTF-8 output, not LATIN1.

Seems to me the correct thing here is to make it work like the other
cases, ie perform pg_server_to_any().  I have exactly no sympathy for
the argument about the RFC saying it must be UTF-8, not least because
that's not in fact what is implemented (what if the server encoding
isn't UTF-8?).

Rejecting this option altogether doesn't improve anything, not
functionally, not specs-compliance-wise, nor according to the
principle of least surprise.

> The attached patch rejects the explicit ENCODING option for JSON
> mode, consistent with how DELIMITER, NULL, DEFAULT, and HEADER are
> already rejected.  The implicit client_encoding case is a separate
> design question (should COPY TO JSON always emit UTF-8 regardless
> of client_encoding?) that maybe we should address separately and not as
> part of v19.

No, you don't get to punt this till later.  Once we ship v19 there's
going to be a strong expectation of backwards compatibility.

The idea of sending UTF-8 to a client that's set client_encoding to
something else would be risible, if it weren't a security hazard.

            regards, tom lane



Re: [PATCH] Reject ENCODING option for COPY TO FORMAT JSON

От
Ayush Tiwari
Дата:
Hi,

On Mon, 20 Apr 2026 at 19:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ayush Tiwari <ayushtiwari.slg01@gmail.com> writes:
> COPY TO FORMAT JSON silently accepts the ENCODING option but doesn't
> perform encoding conversion(?)  CopyToJsonOneRow() sends the output of
> composite_to_json() via CopySendData() without calling
> pg_server_to_any(), unlike the text and CSV paths.

>   COPY t TO '/tmp/out.json' WITH (FORMAT json, ENCODING 'LATIN1');

> On a UTF-8 server this produces UTF-8 output, not LATIN1.

Seems to me the correct thing here is to make it work like the other
cases, ie perform pg_server_to_any().  I have exactly no sympathy for
the argument about the RFC saying it must be UTF-8, not least because
that's not in fact what is implemented (what if the server encoding
isn't UTF-8?).

Agreed. I initially thought rejecting the option was the safer route 
given the RFC, but as you pointed out, we aren't enforcing 
UTF-8 strictly on the server side anyway. 


Rejecting this option altogether doesn't improve anything, not
functionally, not specs-compliance-wise, nor according to the
principle of least surprise.
 
Makes sense. Implementing the conversion properly 
keeps JSON format consistent with how the text and CSV formats behave.


> The attached patch rejects the explicit ENCODING option for JSON
> mode, consistent with how DELIMITER, NULL, DEFAULT, and HEADER are
> already rejected.  The implicit client_encoding case is a separate
> design question (should COPY TO JSON always emit UTF-8 regardless
> of client_encoding?) that maybe we should address separately and not as
> part of v19.

No, you don't get to punt this till later.  Once we ship v19 there's
going to be a strong expectation of backwards compatibility.

The idea of sending UTF-8 to a client that's set client_encoding to
something else would be risible, if it weren't a security hazard.

I agree sending unconverted bytes to a mismatched 
client encoding is clearly a security hazard that needs addressing. Did
not consider the backward compatibility part, my bad.

Was trying out adding  pg_server_to_any() to the json_buf after 
composite_to_json() returns, 
correctly covering both explicit ENCODING option specifications and 
implicit client_encoding mismatches. 

Let me send a patch with code and associated test cases.

Regards,
Ayush

 

Re: [PATCH] Reject ENCODING option for COPY TO FORMAT JSON

От
Ayush Tiwari
Дата:
Hi,

On Mon, 20 Apr 2026 at 19:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
 
Seems to me the correct thing here is to make it work like the other
cases, ie perform pg_server_to_any().  I have exactly no sympathy for
the argument about the RFC saying it must be UTF-8, not least because
that's not in fact what is implemented (what if the server encoding
isn't UTF-8?).

Agreed. I initially thought rejecting the option was the safer route 
given the RFC, but as you pointed out, we aren't enforcing 
UTF-8 strictly on the server side anyway. 


Rejecting this option altogether doesn't improve anything, not
functionally, not specs-compliance-wise, nor according to the
principle of least surprise.
 
Makes sense. Implementing the conversion properly 
keeps JSON format consistent with how the text and CSV formats behave.

No, you don't get to punt this till later.  Once we ship v19 there's
going to be a strong expectation of backwards compatibility.

The idea of sending UTF-8 to a client that's set client_encoding to
something else would be risible, if it weren't a security hazard.

I agree sending unconverted bytes to a mismatched 
client encoding is clearly a security hazard that needs addressing. Did
not consider the backward compatibility part, my bad.

Was trying out adding  pg_server_to_any() to the json_buf after 
composite_to_json() returns, 
correctly covering both explicit ENCODING option specifications and 
implicit client_encoding mismatches. 

Let me send a patch with code and associated test cases.

 
Attached patch with round trip test case. Please review and let me 
know if it's in the right direction.

Regards,
Ayush 
Вложения

Re: [PATCH] Reject ENCODING option for COPY TO FORMAT JSON

От
Ayush Tiwari
Дата:
Hi,

On Mon, 20 Apr 2026 at 20:31, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:
Hi,

On Mon, 20 Apr 2026 at 19:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
 
Seems to me the correct thing here is to make it work like the other
cases, ie perform pg_server_to_any().  I have exactly no sympathy for
the argument about the RFC saying it must be UTF-8, not least because
that's not in fact what is implemented (what if the server encoding
isn't UTF-8?).

Agreed. I initially thought rejecting the option was the safer route 
given the RFC, but as you pointed out, we aren't enforcing 
UTF-8 strictly on the server side anyway. 


Rejecting this option altogether doesn't improve anything, not
functionally, not specs-compliance-wise, nor according to the
principle of least surprise.
 
Makes sense. Implementing the conversion properly 
keeps JSON format consistent with how the text and CSV formats behave.

No, you don't get to punt this till later.  Once we ship v19 there's
going to be a strong expectation of backwards compatibility.

The idea of sending UTF-8 to a client that's set client_encoding to
something else would be risible, if it weren't a security hazard.

I agree sending unconverted bytes to a mismatched 
client encoding is clearly a security hazard that needs addressing. Did
not consider the backward compatibility part, my bad.

Was trying out adding  pg_server_to_any() to the json_buf after 
composite_to_json() returns, 
correctly covering both explicit ENCODING option specifications and 
implicit client_encoding mismatches. 

Let me send a patch with code and associated test cases.

 
Attached patch with round trip test case. Please review and let me 
know if it's in the right direction.

I have registered this patch set in the CommitFest for tracking: 

Please let me know if the patch looks good, and if I need to add it
in the open items list for PG 19.

Regards,
Ayush  

Re: [PATCH] Reject ENCODING option for COPY TO FORMAT JSON

От
Andrew Dunstan
Дата:


On 2026-04-29 We 12:49 PM, Ayush Tiwari wrote:
Hi,

On Mon, 20 Apr 2026 at 20:31, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:
Hi,

On Mon, 20 Apr 2026 at 19:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
 
Seems to me the correct thing here is to make it work like the other
cases, ie perform pg_server_to_any().  I have exactly no sympathy for
the argument about the RFC saying it must be UTF-8, not least because
that's not in fact what is implemented (what if the server encoding
isn't UTF-8?).

Agreed. I initially thought rejecting the option was the safer route 
given the RFC, but as you pointed out, we aren't enforcing 
UTF-8 strictly on the server side anyway. 


Rejecting this option altogether doesn't improve anything, not
functionally, not specs-compliance-wise, nor according to the
principle of least surprise.
 
Makes sense. Implementing the conversion properly 
keeps JSON format consistent with how the text and CSV formats behave.

No, you don't get to punt this till later.  Once we ship v19 there's
going to be a strong expectation of backwards compatibility.

The idea of sending UTF-8 to a client that's set client_encoding to
something else would be risible, if it weren't a security hazard.

I agree sending unconverted bytes to a mismatched 
client encoding is clearly a security hazard that needs addressing. Did
not consider the backward compatibility part, my bad.

Was trying out adding  pg_server_to_any() to the json_buf after 
composite_to_json() returns, 
correctly covering both explicit ENCODING option specifications and 
implicit client_encoding mismatches. 

Let me send a patch with code and associated test cases.

 
Attached patch with round trip test case. Please review and let me 
know if it's in the right direction.

I have registered this patch set in the CommitFest for tracking: 

Please let me know if the patch looks good, and if I need to add it
in the open items list for PG 19.



Basically good, I think. I have modified your test a bit, testing more directly for the presence of the LATIN-1 encoded character and the absence of the UTF-8 encoded character, by reading in the file with pg_read_binary_file, and adding a test for implicit encoding by setting client_encoding.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Вложения

Re: [PATCH] Reject ENCODING option for COPY TO FORMAT JSON

От
Ayush Tiwari
Дата:
Hi,

On Mon, 4 May 2026 at 19:49, Andrew Dunstan <andrew@dunslane.net> wrote:

Basically good, I think. I have modified your test a bit, testing more directly for the presence of the LATIN-1 encoded character and the absence of the UTF-8 encoded character, by reading in the file with pg_read_binary_file, and adding a test for implicit encoding by setting client_encoding.


The revised tests look better to me.  Checking the raw bytes with
pg_read_binary_file() directly verifies that LATIN1 output does not contain 
the UTF-8 sequence, and the added implicit client_encoding case too 
looks good.

Thanks for improving the test coverage.

Regards,
Ayush