Re: DB encoding, locale and indexes

Поиск
Список
Период
Сортировка
От Sterfield
Тема Re: DB encoding, locale and indexes
Дата
Msg-id CAPf6=kei-VXJ5AvtqxpLFyEpU--iM=gC6r4rukMvrpCxcakKvA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: DB encoding, locale and indexes  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-general
2015-02-05 15:56 GMT+01:00 Tom Lane <tgl@sss.pgh.pa.us>:
Sterfield <sterfield@gmail.com> writes:
> I'm a sysadmin working for an application that stores all its data in a PG
> database.
> Currently, the cluster has its encoding set to UTF-8, and the locale (both
> LC_COLLATE and LC_CTYPE) is set to 'en_US.UTF-8'.

> I discovered recently that the indexes created on varchar fields are not
> working for LIKE operator, as they are created without the correct class
> operator (as stated in
> http://www.postgresql.org/docs/9.2/static/indexes-types.html).

Right, because en_US.UTF-8 uses dictionary sort order rather than plain
byte-by-byte sort.

> The most straight-forward solution would be to create a second index on the
> same field but with the class operator, in order to have indexes both for
> =, >, < operators AND LIKE / regexp operators. Few additional indexes, some
> diskspace eaten, problem solved.

Yup.

> However, some people are saying that nothing has to change on the index,
> and that the only thing we have to do is to change the LC_COLLATE of each
> databases to 'C', in order for the indexes to work without the class
> operator.

Yes, that is another possible solution, and it's documented.  Keep in mind
though that you can *not* just reach into pg_database and tweak those
fields; if you did, all your indexes would be corrupt, because they'd no
longer match the sort order the system is expecting.  The only safe way to
get there would be to dump and reload into a new database set up this way.
(If you wanted to live dangerously, I guess you could manually tweak the
pg_database fields and then REINDEX every affected index ... but this
still involves substantial downtime, and I would not recommend doing it
without practicing on a test installation.)

Yeah, I'll not take the risk. For current databases, I'll probably create manually new indexes. For new environment, I'll change the LC_COLLATE to 'C'. I've spent some time re-creating a test environment, using encoding to UTF8, locale to 'en_US.UTF-8' except LC_COLLATE set to 'C'. Nothing special to report, the index is working as expected for LIKE operators, and I have correct answers if I'm doing a LIKE '<something>é%'.


You also have to ask whether any of your applications are expecting ORDER
BY some-text-field to produce dictionary order rather than ASCII order.

Indeed, the order of the results is not the same with a LC_COLLATE to 'en_US.UTF-8' or LC_COLLATE to 'C', but I highly doubt that the application is taking advantage of having an index already sorted.


>    - If I have unicode character stored in my database (for example 'é'),
>    and the LC_COLLATE set to 'C', how the index will behave if I do a query
>    with LIKE '<something>é%' ?

It's still the same character, but it will sort in a possibly unexpected
way.

                        regards, tom lane

Many thanks for your help, guys, especially on this non-trivial subject (at least, for me).

Cheers,

В списке pgsql-general по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: DB encoding, locale and indexes
Следующее
От: Rémi Cura
Дата:
Сообщение: Re: How do I bump a row to the front of sort efficiently