Обсуждение: Can we get sha* function over text, that could be used in index?

Поиск

Список

Период

Сортировка

Can we get sha* function over text, that could be used in index?

От

hubert depesz lubaczewski

Дата:

18 февраля, 17:58:36

Hi,
So, we have various sha* functions.

And I recently got asked about using them as a based for unique index
over long texts.
Normally one would do it with md5(text), but the person asking wanted to
use sha(). and these functions work only on bytea.

And apparently - we can't.

'text-value'::bytea won't work for some specific text values.
convert_to() isn't immutable.

I figured out that I can do something like:

SELECT
    sha256(
        string_agg( ascii( t )::text, ',' ORDER BY idx )::bytea
    )
FROM
    regexp_split_to_table( 'INPUT_STRING', '' ) WITH ORDINALITY AS x ( t, idx );

But that's hardly sane solution.

I've read bug report from 2008:
https://www.postgresql.org/message-id/flat/48D20645.1090503%40gmx.net#ce27df4802c9854a9eb77066a5c7cb05

And while I kinda undestand, create-conversion, server-encoding, I don't
really *grok* why we can't have immutable conversion to bytea. And/or
versions of sha* functions that simply work on text.

Is it doable? How does it work in md5()? Apparently it does also work in
pgcrypto/digest(), so there should be a way to get it in core sha*
functions?

Best regards,

depesz

Re: Can we get sha* function over text, that could be used in index?

От

Ron Johnson

Дата:

18 февраля, 18:07:22

On Wed, Feb 18, 2026 at 9:58 AM hubert depesz lubaczewski <depesz@depesz.com> wrote:

Hi,
So, we have various sha* functions.

And I recently got asked about using them as a based for unique index
over long texts.
Normally one would do it with md5(text), but the person asking wanted to
use sha().

I think I'd push back, asking them if they really need cryptographically-secure hashing (which they most probably don't).

Death to <Redacted>, and butter sauce.

Don't boil me, I'm still alive.

<Redacted> lobster!

Re: Can we get sha* function over text, that could be used in index?

От

Linus Heckemann

Дата:

18 февраля, 21:02:10

Hi, I'm the one who asked :)

Ron Johnson wrote:
> On Wed, Feb 18, 2026 at 9:58 AM hubert depesz lubaczewski <depesz(at)depesz(dot)com>
> wrote:
>
> > Hi,
> > So, we have various sha* functions.
> >
> > And I recently got asked about using them as a based for unique index
> > over long texts.
> > Normally one would do it with md5(text), but the person asking wanted to
> > use sha().
>
> I think I'd push back, asking them if they really need
> cryptographically-secure hashing (which they most probably don't).

I indeed don't need cryptographically-secure hashing in this
scenario, and I will be using md5.

Nevertheless, I think using md5 in any context that isn't "to verify
digests produced in the past" is a poor decision, and postgres should
make using other hashes just as easy!

The specific context of hashing aside, it seems weird to me that:

- there is a byte-array representation of text columns, which appears to
  be independent of database encoding
- there doesn't seem to be _any_ sane way to access this.

The obvious (to a naive user, like I was) approach, casting to bytea,
has exceptionally surprising behaviour: for many text strings, it does
exactly what the naive user might hope for, giving back the UTF-8
representation. But multiple distinct text strings, like '\033' and
'\x1b', convert to the same byte string! And text strings containing a
backslash that doesn't fit the bytea hex format or the bytea escape
format will fail to convert completely!

The fact that convert_to() is only stable and not immutable makes sense
to me given the effect that configuration can have on its behaviour, but
given that there does appear to be a trivially-accessible UTF-8
representation (as used by md5()) I think there should be an immutable
function that provides access to it? Is there a good reason not to? I'd
be willing to send a patch for it myself.

Linus

Re: Can we get sha* function over text, that could be used in index?

От

Peter Eisentraut

Дата:

19 февраля, 17:48:33

On 18.02.26 15:58, hubert depesz lubaczewski wrote:
> And while I kinda undestand, create-conversion, server-encoding, I don't
> really*grok* why we can't have immutable conversion to bytea. And/or
> versions of sha* functions that simply work on text.

Hash functions fundamentally work on a sequence of bytes, so bytea is 
the right type.  The encoding of text into bytes is complicated, so it 
seems better if you handle that yourself depending on the local 
requirements.

Re: Can we get sha* function over text, that could be used in index?

От

"Peter J. Holzer"

Дата:

20 февраля, 01:30:17

On 2026-02-19 15:48:33 +0100, Peter Eisentraut wrote:
> On 18.02.26 15:58, hubert depesz lubaczewski wrote:
> > And while I kinda undestand, create-conversion, server-encoding, I don't
> > really*grok* why we can't have immutable conversion to bytea. And/or
> > versions of sha* functions that simply work on text.
>
> Hash functions fundamentally work on a sequence of bytes, so bytea is the
> right type.  The encoding of text into bytes is complicated,

Maybe, but it needs to be done anyway, sicne text is ultimately stored
as a sequence of bytes on disk and sent as a sequence of bytes over the
wire. So the code should be present already.

Something like
    encode(s text, enc text) -> bytea
        Encodes s in encoding enc. E.g. encode('Tröt!', 'utf-8')
        produces \x5472c3b67421
    decode(d bytea, enc text) -> text
        Decodes d assuming encoding enc. E.g.
        decode('\x5472c3b67421'::bytea, 'utf-8') produces 'Tröt!'.
might be generally useful.

> so it seems better if you handle that yourself depending on the local
> requirements.

I would probably do that kind of processing in the application code,
but I can see that one might want it in the database.

        hjp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Вложения

signature.asc

Re: Can we get sha* function over text, that could be used in index?

От

"Daniel Verite"

Дата:

21 февраля, 00:40:37

    Linus Heckemann wrote:

> - there is a byte-array representation of text columns, which appears to
>  be independent of database encoding

Not sure what you're refering to. Both the on-disk and in-memory
representations of text/varchar are encoding-dependent.

> The obvious (to a naive user, like I was) approach, casting to bytea,
> has exceptionally surprising behaviour: for many text strings, it does
> exactly what the naive user might hope for, giving back the UTF-8
> representation. But multiple distinct text strings, like '\033' and
> '\x1b', convert to the same byte string! And text strings containing a
> backslash that doesn't fit the bytea hex format or the bytea escape
> format will fail to convert completely!

Yes. It seems a common mistake to forget or ignore that
backslashes are special in the input text representation of bytea.
It might be not obvious from reading the doc at [1]
but we just need to quote backslashes by doubling them.

AFAIK a working solution for the OP would be:
 sha256(replace(colname, '\', '\\')::bytea)

The result is encoding-dependent, but that does not matter
in the context of an expression. index.
If the database ever needs to change its encoding, it will have to
be recreated entirely anyway.

[1]
https://www.postgresql.org/docs/current/datatype-binary.html#DATATYPE-BINARY-BYTEA-ESCAPE-FORMAT

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

Re: Can we get sha* function over text, that could be used in index?

От

Linus Heckemann

Дата:

21 февраля, 12:54:53

"Daniel Verite" <daniel@manitou-mail.org> writes:
>> - there is a byte-array representation of text columns, which appears to
>>  be independent of database encoding
>
> Not sure what you're refering to. Both the on-disk and in-memory
> representations of text/varchar are encoding-dependent.

Huh, I thought I'd come to the opposite conclusion from testing the
other day (md5('ÿ') giving the same results for both a utf-8 database
and a latin1 database), but I must have been holding something wrong,
because trying again does confirm this. Sorry about that!


>> The obvious (to a naive user, like I was) approach, casting to bytea,
>> has exceptionally surprising behaviour: for many text strings, it does
>> exactly what the naive user might hope for, giving back the UTF-8
>> representation. But multiple distinct text strings, like '\033' and
>> '\x1b', convert to the same byte string! And text strings containing a
>> backslash that doesn't fit the bytea hex format or the bytea escape
>> format will fail to convert completely!
>
> Yes. It seems a common mistake to forget or ignore that
> backslashes are special in the input text representation of bytea.
> It might be not obvious from reading the doc at [1]
> but we just need to quote backslashes by doubling them.
>
> AFAIK a working solution for the OP would be:
>  sha256(replace(colname, '\', '\\')::bytea)

That's so simple that it feels like it should have been
obvious... Thanks! Nevertheless, it feels like a weird and potentially
computationally expensive contortion for getting from a text value to
its bytea representation.

Given their encoding-dependence, shouldn't md5() (and ::bytea, for that
matter) be marked only STABLE, not IMMUTABLE? Or is IMMUTABLE's

> guaranteed to return the same results given the same arguments forever

only valid within the context of a single database?


md5() and ::bytea aside, it seems like it would be good to have both

 - a function for accessing the bytea representation of a text value
   without any extra steps, and

 - an IMMUTABLE function for getting a UTF-8 (specifically UTF-8,
   because it's the only encoding with a character set encompassing all
   other supported character sets) bytea representation of a text value
   regardless of the database encoding?

Cheers
Linus

Re: Can we get sha* function over text, that could be used in index?

От

"Peter J. Holzer"

Дата:

21 февраля, 15:40:55

On 2026-02-19 23:30:17 +0100, Peter J. Holzer wrote:
> Something like
>     encode(s text, enc text) -> bytea
>     decode(d bytea, enc text) -> text
> might be generally useful.

Please ignore this email. These are basically the same as convert_to()
and convert_from() which already exist.

        hjp


--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Вложения

signature.asc

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Can we get sha* function over text, that could be used in index?

Вложения

Вложения