Обсуждение: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
--
See the related bug report: https://www.postgresql.org/message-id/CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.gmail.com

Currently PostgreSQL accepts structurally well-formed EUC_CN byte sequences such as 0xA2A3 into text columns. The value round-trips when client_encoding is EUC_CN, but fails when client_encoding is UTF8 because euc_cn_to_utf8 has no mapping.

If this behavior is intentional for compatibility, the documentation should explicitly say that validation for some legacy encodings is byte-structure validation, not mapping-table validation.
If it is not intentional, stricter validation could reject unassigned byte positions at input time.
--
Zhongpu Chen
--
The issue is not specific to E'\\x..' literals. A normal COPY FROM data file with ENCODING 'EUC_CN' can create text rows that later cannot be retrieved with SELECT.

This suggests that input validation for EUC_CN is only structural, while the EUC_CN-to-UTF8 conversion table is stricter.
Thanks for the clarification.
I agree that validation on every input raises runtime-cost concerns, but the cost can be kept well under control. For example, MySQL applies finer-grained checking for EUC-CN (i.e., GB2312) in https://github.com/mysql/mysql-server/blob/trunk/strings/ctype-gb2312.cc:
```
static int func_gb2312_uni_onechar(int code) {
  if ((code >= 0x2121) && (code <= 0x2658))
    return (tab_gb2312_uni0[code - 0x2121]);
  if ((code >= 0x2721) && (code <= 0x296F))
    return (tab_gb2312_uni1[code - 0x2721]);
  if ((code >= 0x3021) && (code <= 0x777E))
    return (tab_gb2312_uni2[code - 0x3021]);
  return (0);
}
```
where `code` is obtained by subtracting 0x8080 from the two-byte EUC value. Of course, MySQL's checking can also be enhanced further.
Anyway, it is reasonable to note these details in the documentation.
On Friday, May 1, 2026, Zhongpu Chen <chenloveit@gmail.com> wrote:
> The issue is not specific to E'\\x..' literals. A normal COPY FROM data file with ENCODING 'EUC_CN' can create text rows that later cannot be retrieved with SELECT.
> This suggests that input validation for EUC_CN is only structural, while the EUC_CN-to-UTF8 conversion table is stricter.
I suspect a lack of desire to maintain and ensure that specific values are verified, or to accept the runtime cost of doing so. It is indeed structural. This point should probably be documented better. But it's hard to feel too bad if the input claims it is providing verifiable EUC_CN data and then proceeds to supply data that lacks meaning in reality. We are happy to just store and return your data to you, but it's unreasonable to ask for it to be converted. It would be nice for the database to provide an extra layer of protection, so I'm not against the idea, either automatically or at least by providing a function that could, say, be called in a trigger for opt-in. But it definitely feels like a problematic benefit-to-cost proposition.

David J.
--
```
gb2312::is_gb2312_iconv time: [2.5621 ms 2.5681 ms 2.5740 ms]
change: [-2.6404% -2.3284% -2.0023%] (p = 0.00 < 0.05)
Performance has improved.

gb2312::is_gb2312_icu time: [3.2427 ms 3.2589 ms 3.2771 ms]
change: [-1.5710% -1.0409% -0.4387%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

gb2312::is_gb2312_rs time: [2.8157 ms 2.8229 ms 2.8303 ms]
change: [-1.6985% -1.2165% -0.7501%] (p = 0.00 < 0.05)
Change within noise threshold.

Benchmarking gb2312::is_gb2312_range: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.3s, enable flat sampling, or reduce sample count to 50.
gb2312::is_gb2312_range time: [1.6237 ms 1.6294 ms 1.6351 ms]
change: [+3.8720% +4.2901% +4.6933%] (p = 0.00 < 0.05)
Performance has regressed.

gb2312::is_gb2312_lookup time: [488.12 µs 490.04 µs 491.97 µs]
change: [+0.9273% +2.2343% +3.2599%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

gb2312::is_gb2312_simd time: [181.00 µs 181.77 µs 182.53 µs]
change: [-4.4563% -3.6971% -3.0260%] (p = 0.00 < 0.05)
Performance has improved.

gb2312:is_gb2312_ranges_pg time: [467.69 µs 469.27 µs 470.82 µs]

Benchmarking gb2312:is_gb2312_ranges_mysql: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
gb2312:is_gb2312_ranges_mysql time: [1.2611 ms 1.2667 ms 1.2724 ms]
```
--
Zhongpu Chen
--
For the reported EUC-CN cases, this is exactly the point in question. These byte sequences are structurally well-formed EUC-CN byte pairs, but they fall into reserved or unassigned positions of the GB2312 code table. For example, byte sequences with first byte 0xAA correspond to row 10 of GB2312, which is reserved/unassigned. Therefore, these cases are not merely valid legacy characters that happen to lack Unicode mappings. Rather, under strict GB2312/EUC-CN semantics, they are not assigned to any character at all, and thus should not be considered valid GB2312 characters.
So my concern is not that every legacy-encoded character must be convertible to UTF8. The concern is that PostgreSQL's write-time validation accepts a structural superset of EUC-CN byte pairs as text, while some of these byte pairs are not valid assigned GB2312 characters and PostgreSQL's own later conversion path cannot assign character semantics to them.
On 02.05.26 04:31, Zhongpu Chen wrote:
> See the related bug report https://www.postgresql.org/message-id/
> CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.gmail.com
> <https://www.postgresql.org/message-id/
> CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.gmail.com>
>
> Currently PostgreSQL accepts structurally well-formed EUC_CN byte
> sequences such as 0xA2A3 into text columns. The value round-trips when
> client_encoding is EUC_CN, but fails when client_encoding is UTF8
> because euc_cn_to_utf8 has no mapping.
>
> If this behavior is intentional for compatibility, the documentation
> should explicitly say that validation for some legacy encodings is byte-
> structure validation, not mapping-table validation.
> If it is not intentional, stricter validation could reject unassigned
> byte positions at input time.
It is in general not necessarily required that all text in all non-UTF8
encodings must be convertible to UTF8.
(This is also a result of history: These encodings were implemented in
PostgreSQL before Unicode.)
That said, I can see how different behaviors might be desirable.
My first question would be, are these non-convertible byte sequences
just characters that don't map to Unicode, or are they invalid within
the definition of the EUC-* encodings themselves? If the latter, then
we should just reject them (modulo some backward compatibility), similar
to how we reject certain Unicode code points that exist "structurally"
but are not valid for one reason or another.
Alternatively, if these byte sequences are valid characters but they
just didn't end up in Unicode for some reason, then rejecting them might
break valid uses.
(I don't know much about EUC-* to be able to answer these.)
--