Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From | Nathan Bossart |
---|---|
Subject | Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails |
Date | |
Msg-id | Zz9OTyWAvATeeHev@nathan |
In response to | Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails (Bruce Momjian <bruce@momjian.us>) |
Responses | Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails |
List | pgsql-bugs |
On Thu, Nov 21, 2024 at 09:47:56AM -0500, Bruce Momjian wrote:
> On Thu, Nov 21, 2024 at 02:35:50PM +0000, Bertrand Drouvot wrote:
>> On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
>> > I don't understand this logic. Why are two bytes important? If we knew
>> > it was UTF8 we could check for non-first bytes always starting with
>> > bits 10, but we can't know that.
>>
>> I think this is because this is a reliable way to detect if the truncation happened
>> in the middle of a character, without needing to know the specifics of the encoding.
>>
>> My understanding is that the key insight is that in any multibyte encoding, all
>> bytes within a multibyte character will have their high bits set.
>>
>> That's just my understanding from the code and Tom's previous explanations: I
>> might be wrong as not an expert in this area.
>
> But the logic doesn't make sense. Why would two bytes be any different
> than one?

Tom provided a concise explanation upthread [0]. My understanding is the
same as Bertrand's, i.e., this is an easy way to rule out a bunch of cases
where we know that we couldn't possibly have truncated in the middle of a
multi-byte character. This allows us to avoid doing multiple pg_database
lookups.

> I assumed you would just remove all trailing high-bit bytes
> and stop and the first non-high-bit byte.

I think this risks truncating more than one multi-byte character, which
would cause the login path to truncate differently than the CREATE/ALTER
DATABASE path (which is encoding-aware).

> Also, do we really expect
> there to be trailing multi-byte characters and then some ASCII before
> it? Isn't it likely it will be all ASCII or all multi-byte characters?
> I guess for Latin1, it would work fine, but I assume for Asian
> languages, it will be almost all multi-byte characters. I guess digits
> would be ASCII.

All of these seem within the realm of possibility to me.

> This all just seems very unfocused.

I see the following options:

* Try to do multibyte-aware truncation (the patch at hand).

* Only truncate for all-ASCII identifiers for historical purposes. Folks
  using non-ASCII characters in database names will need to specify the
  datname exactly during login.

* ERROR for long identifiers instead of automatically truncating (upthread
  this was considered a non-starter since this behavior has been around for
  so long).

* Revert the patch, leaving multibyte database names potentially broken
  (AFAIK Bertrand's initial report is the only one).

* Do nothing, so folks who previously relied on the truncation will now
  have to specify the datname exactly during login as of >= v17.

[0] https://postgr.es/m/158506.1732120196%40sss.pgh.pa.us

-- 
nathan
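The two-byte check discussed above can be sketched roughly as follows. This is only an illustration of the reasoning in the thread, not the code from the patch: the function name `truncation_may_split_char` and the constant `NAME_MAX_BYTES` are hypothetical, and the sketch assumes the property stated by Bertrand, namely that every byte of a multibyte character has its high bit set. If either the last byte kept or the first byte dropped is plain ASCII, the cut necessarily falls on a character boundary, so a single lookup with the byte-truncated name would suffice.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed limit: 63 bytes, i.e. NAMEDATALEN - 1 in a default build. */
#define NAME_MAX_BYTES 63

/* Nonzero if the byte has its high bit set (i.e. it is not plain ASCII). */
#define HIGHBIT_SET(ch) (((unsigned char) (ch)) & 0x80)

/*
 * Hypothetical helper: returns true if cutting "name" down to
 * NAME_MAX_BYTES bytes MIGHT land in the middle of a multibyte character.
 *
 * A split is only possible when both the last byte kept and the first
 * byte dropped belong to the same multibyte character, and under the
 * stated assumption both of those bytes would have their high bit set.
 * A true result is not proof of a split: two adjacent multibyte
 * characters also pass this test. A false result rules a split out.
 */
static bool
truncation_may_split_char(const char *name)
{
    size_t len = strlen(name);

    if (len <= NAME_MAX_BYTES)
        return false;           /* nothing would be truncated */

    return HIGHBIT_SET(name[NAME_MAX_BYTES - 1]) &&
           HIGHBIT_SET(name[NAME_MAX_BYTES]);
}
```

Checking one byte alone would not be enough: a high-bit byte just before the cut could be the final byte of a complete character, which is why the byte on the other side of the cut is examined as well.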