Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From | Bruce Momjian |
---|---|
Subject | Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails |
Date | |
Msg-id | Zz9jfOkVmlYcYHSy@momjian.us |
In response to | Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails (Nathan Bossart <nathandbossart@gmail.com>) |
Responses | Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails |
List | pgsql-bugs |
On Thu, Nov 21, 2024 at 09:14:23AM -0600, Nathan Bossart wrote:
> On Thu, Nov 21, 2024 at 09:47:56AM -0500, Bruce Momjian wrote:
> > On Thu, Nov 21, 2024 at 02:35:50PM +0000, Bertrand Drouvot wrote:
> >> On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
> >> > I don't understand this logic.  Why are two bytes important?  If we knew
> >> > it was UTF8 we could check for non-first bytes always starting with
> >> > bits 10, but we can't know that.
> >>
> >> I think this is because this is a reliable way to detect if the truncation happened
> >> in the middle of a character, without needing to know the specifics of the encoding.
> >>
> >> My understanding is that the key insight is that in any multibyte encoding, all
> >> bytes within a multibyte character will have their high bits set.
> >>
> >> That's just my understanding from the code and Tom's previous explanations: I
> >> might be wrong as not an expert in this area.
> >
> > But the logic doesn't make sense.  Why would two bytes be any different
> > than one?
>
> Tom provided a concise explanation upthread [0].  My understanding is the
> same as Bertrand's, i.e., this is an easy way to rule out a bunch of cases
> where we know that we couldn't possibly have truncated in the middle of a
> multi-byte character.  This allows us to avoid doing multiple pg_database
> lookups.

Where does Tom mention anything about checking two bytes?  He is
basically saying remove all trailing high-bit characters until you get a
match, because once you get a match, you have found the point of valid
truncation for the encoding.  In fact, here, he specifically talks about
MAX_MULTIBYTE_CHAR_LEN-1:

	https://www.postgresql.org/message-id/3796535.1732044807%40sss.pgh.pa.us

This text:

	 * If the original name is too long and we see two consecutive bytes
	 * with their high bits set at the truncation point, we might have
	 * truncated in the middle of a multibyte character. In multibyte
	 * encodings, every byte of a multibyte character has its high bit
	 * set. So if IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and
	 * NAMEDATALEN-2, we know we're in the middle of a multibyte
	 * character. We need to try truncating one more byte back to find the
	 * start of the next character.

needs to be fixed, at a minimum, specifically, "So if IS_HIGHBIT_SET is
true for both NAMEDATALEN-1 and NAMEDATALEN-2, we know we're in the
middle of a multibyte character."

> > I assumed you would just remove all trailing high-bit bytes
> > and stop and the first non-high-bit byte.
>
> I think this risks truncating more than one multi-byte character, which
> would cause the login path to truncate differently than the CREATE/ALTER
> DATABASE path (which is encoding-aware).

True, we can stop at MAX_MULTIBYTE_CHAR_LEN-1, and know there is no
match.

> * Try to do multibyte-aware truncation (the patch at hand).

Yes, I am fine with that, but we need to do more than the patch does to
accomplish this, unless I am totally confused.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means
  "Am I going to die soon?"
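
The bounded back-off discussed in the message above can be sketched as follows. This is only an illustration in Python, not the patch under review and not PostgreSQL's C code: `candidate_names`, `lookup`, and the `existing` set (standing in for pg_database lookups) are hypothetical names, and the constants assume a default build (NAMEDATALEN of 64, MAX_MULTIBYTE_CHAR_LEN of 4). It relies on the premise stated in the thread that every byte of a multibyte character has its high bit set.

```python
NAMEDATALEN = 64            # assumed default build: names hold at most 63 bytes
MAX_MULTIBYTE_CHAR_LEN = 4  # assumed upper bound on bytes per character


def is_highbit_set(byte: int) -> bool:
    """Python stand-in for the IS_HIGHBIT_SET check quoted in the comment."""
    return (byte & 0x80) != 0


def candidate_names(raw: bytes):
    """Yield names the encoding-aware path could have stored, longest first.

    The first candidate is the blind byte truncation. After that, at most
    MAX_MULTIBYTE_CHAR_LEN - 1 trailing high-bit bytes are dropped, one at a
    time, stopping early at the first byte without its high bit set.
    """
    limit = NAMEDATALEN - 1
    if len(raw) <= limit:
        yield raw
        return
    name = raw[:limit]
    yield name
    for _ in range(MAX_MULTIBYTE_CHAR_LEN - 1):
        if not name or not is_highbit_set(name[-1]):
            break  # an ASCII byte cannot be part of a split character
        name = name[:-1]
        yield name


def lookup(raw: bytes, existing: set):
    """Return the first candidate found in `existing` (a hypothetical
    stand-in for the catalog), or None if no candidate matches."""
    for candidate in candidate_names(raw):
        if candidate in existing:
            return candidate
    return None


# A 2-byte UTF-8 character straddles the 63-byte limit, so the stored name
# is the 62 ASCII bytes before it.
long_name = b"a" * 62 + "\u00e9".encode("utf-8") + b"x"
assert lookup(long_name, {b"a" * 62}) == b"a" * 62

# Two consecutive high-bit bytes around the cut do not prove a split
# character: here the 63-byte prefix ends exactly on a character boundary.
boundary = ("a" + "\u00e9" * 40).encode("utf-8")
assert is_highbit_set(boundary[NAMEDATALEN - 2])
assert is_highbit_set(boundary[NAMEDATALEN - 1])
boundary[: NAMEDATALEN - 1].decode("utf-8")  # decodes without error
```

The second example is why the quoted comment's "we know we're in the middle of a multibyte character" reads as too strong: the high-bit test can rule out a split character, but it cannot confirm one without knowing the encoding.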