Re: proposal: change behavior on collation version mismatch

Поиск
Список
Период
Сортировка
От Jeremy Schneider
Тема Re: proposal: change behavior on collation version mismatch
Дата
Msg-id 88ab1cc5-2d6f-4d67-8f3b-3e10f2e52b3d@amazon.com
обсуждение исходный текст
Ответ на Re: proposal: change behavior on collation version mismatch  ("Daniel Verite" <daniel@manitou-mail.org>)
Список pgsql-hackers
On 11/28/23 2:12 AM, Daniel Verite wrote:
        Jeremy Schneider wrote:
1) "collation changes are uncommon" (which is relatively correct)
2) "most users would rather have ease-of-use than 100% safety, since
it's uncommon"

And I think this led to the current behavior of issuing a warning rather
than an error,
There's a technical reason for this being a warning.
If it was an error, any attempt to do anything with the collation
would fail, which includes REINDEX on indexes  using
that collation.
And yet that's precisely what you're supposed to do in that
situation.


Indexes are the most obvious and impactful corruption, so the focus is understandable, but there's a bit of a myth in the general public that REINDEX means you fixed your database.  I'm concerned that too many people believe this falsehood, and don't realize that things like constraints and partitions can also be affected by a major OS update when leaving PG data files in place.  Also there's a tendancy to use amcheck and validate btree indexes, but skip other index types.  And of course none of this is possible when people mistakenly use a different major OS for the hot standby (but Postgres willingly sends incorrect query results to users).

This is why my original proposal included an update to the ALTER ... REFRESH/COLLATION docs.  Today's conventional wisdom suggests this is a safe command.  It's really not, if you're using unicode (which everyone is). Fifteen years ago, you needed to buy a french keyboard to type french accented characters.  Today it's a quick tap on your phone to get chinese, russian, tibetan, emojis, and any other character you can dream of.  All of those surprising characters eventually get stored in Postres databases, often to the surprise of devs and admins, after they discover corruption from an OS upgrade.

And to recap some data about historical ICU versions from the torture test:

ICU Version | OS Version | en-US characters changed collation | zh-Hans-CN characters changed collation | fr-FR characters changed collation
55.1-7ubuntu0.5 | Ubuntu 16.04.7 LTS | 286,654 | 286,654 | 286,654
60.2-3ubuntu3.1 | Ubuntu 18.04.6 LTS | 23,741 | 24,415 | 23,741
63.1-6 | Ubuntu 19.04 | 688 | 688 | 688
66.1-2ubuntu2 | Ubuntu 20.04.3 LTS | 6,497 | 6,531 | 6,497
70.1-2 | Ubuntu 22.04 LTS | 879 | 887 | 879


The very clear trend here is that most changes are made in the root collation rules, affecting all locales.  This means that worrying about specific collation versions of different locales is really focusing on an irrelevant edge case.  In ICU development, all the locales tend to change.

If anyone thinks the Collation Apocalypse is bad now, I predict the Kubernetes wave will be mayhem.  Fifteen years ago it was rare to physically move PG datafiles to a new major OS.  Most people would dump and load their databases, sized in GBs.  Today's multi-TB Postgres databases have meant an increase of in-place OS upgrades in recent years.  People started to either detach/attach their storage, or they used a hot standby. Kubernetes will make these moves across major OS's a daily, effortless occurrence.

ICU doesn't fix anything directly.  We do need ICU - only because it finally enables us to compile that old version of ICU forever on every new OS we move to going forward. This was simply impossible with glibc. Over the past couple decades, not even Oracle or IBM has managed to deprecate a single version of ICU from a relational database, and not for lack of desire.

-Jeremy

-- 
Jeremy Schneider
Performance Engineer
Amazon Web Services

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jeff Davis
Дата:
Сообщение: Re: encoding affects ICU regex character classification
Следующее
От: Peter Smith
Дата:
Сообщение: Re: pg_upgrade and logical replication