Обсуждение: Can we get sha* function over text, that could be used in index?
Hi,
So, we have various sha* functions.
And I recently got asked about using them as a based for unique index
over long texts.
Normally one would do it with md5(text), but the person asking wanted to
use sha(). and these functions work only on bytea.
And apparently - we can't.
'text-value'::bytea won't work for some specific text values.
convert_to() isn't immutable.
I figured out that I can do something like:
SELECT
sha256(
string_agg( ascii( t )::text, ',' ORDER BY idx )::bytea
)
FROM
regexp_split_to_table( 'INPUT_STRING', '' ) WITH ORDINALITY AS x ( t, idx );
But that's hardly sane solution.
I've read bug report from 2008:
https://www.postgresql.org/message-id/flat/48D20645.1090503%40gmx.net#ce27df4802c9854a9eb77066a5c7cb05
And while I kinda undestand, create-conversion, server-encoding, I don't
really *grok* why we can't have immutable conversion to bytea. And/or
versions of sha* functions that simply work on text.
Is it doable? How does it work in md5()? Apparently it does also work in
pgcrypto/digest(), so there should be a way to get it in core sha*
functions?
Best regards,
depesz
On Wed, Feb 18, 2026 at 9:58 AM hubert depesz lubaczewski <depesz@depesz.com> wrote:
Hi,
So, we have various sha* functions.
And I recently got asked about using them as a based for unique index
over long texts.
Normally one would do it with md5(text), but the person asking wanted to
use sha().
I think I'd push back, asking them if they really need cryptographically-secure hashing (which they most probably don't).
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!
Hi, I'm the one who asked :) Ron Johnson wrote: > On Wed, Feb 18, 2026 at 9:58 AM hubert depesz lubaczewski <depesz(at)depesz(dot)com> > wrote: > > > Hi, > > So, we have various sha* functions. > > > > And I recently got asked about using them as a based for unique index > > over long texts. > > Normally one would do it with md5(text), but the person asking wanted to > > use sha(). > > I think I'd push back, asking them if they really need > cryptographically-secure hashing (which they most probably don't). I indeed don't need cryptographically-secure hashing in this scenario, and I will be using md5. Nevertheless, I think using md5 in any context that isn't "to verify digests produced in the past" is a poor decision, and postgres should make using other hashes just as easy! The specific context of hashing aside, it seems weird to me that: - there is a byte-array representation of text columns, which appears to be independent of database encoding - there doesn't seem to be _any_ sane way to access this. The obvious (to a naive user, like I was) approach, casting to bytea, has exceptionally surprising behaviour: for many text strings, it does exactly what the naive user might hope for, giving back the UTF-8 representation. But multiple distinct text strings, like '\033' and '\x1b', convert to the same byte string! And text strings containing a backslash that doesn't fit the bytea hex format or the bytea escape format will fail to convert completely! The fact that convert_to() is only stable and not immutable makes sense to me given the effect that configuration can have on its behaviour, but given that there does appear to be a trivially-accessible UTF-8 representation (as used by md5()) I think there should be an immutable function that provides access to it? Is there a good reason not to? I'd be willing to send a patch for it myself. Linus
On 18.02.26 15:58, hubert depesz lubaczewski wrote: > And while I kinda undestand, create-conversion, server-encoding, I don't > really*grok* why we can't have immutable conversion to bytea. And/or > versions of sha* functions that simply work on text. Hash functions fundamentally work on a sequence of bytes, so bytea is the right type. The encoding of text into bytes is complicated, so it seems better if you handle that yourself depending on the local requirements.
On 2026-02-19 15:48:33 +0100, Peter Eisentraut wrote:
> On 18.02.26 15:58, hubert depesz lubaczewski wrote:
> > And while I kinda undestand, create-conversion, server-encoding, I don't
> > really*grok* why we can't have immutable conversion to bytea. And/or
> > versions of sha* functions that simply work on text.
>
> Hash functions fundamentally work on a sequence of bytes, so bytea is the
> right type. The encoding of text into bytes is complicated,
Maybe, but it needs to be done anyway, sicne text is ultimately stored
as a sequence of bytes on disk and sent as a sequence of bytes over the
wire. So the code should be present already.
Something like
encode(s text, enc text) -> bytea
Encodes s in encoding enc. E.g. encode('Tröt!', 'utf-8')
produces \x5472c3b67421
decode(d bytea, enc text) -> text
Decodes d assuming encoding enc. E.g.
decode('\x5472c3b67421'::bytea, 'utf-8') produces 'Tröt!'.
might be generally useful.
> so it seems better if you handle that yourself depending on the local
> requirements.
I would probably do that kind of processing in the application code,
but I can see that one might want it in the database.
hjp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Вложения
Linus Heckemann wrote: > - there is a byte-array representation of text columns, which appears to > be independent of database encoding Not sure what you're refering to. Both the on-disk and in-memory representations of text/varchar are encoding-dependent. > The obvious (to a naive user, like I was) approach, casting to bytea, > has exceptionally surprising behaviour: for many text strings, it does > exactly what the naive user might hope for, giving back the UTF-8 > representation. But multiple distinct text strings, like '\033' and > '\x1b', convert to the same byte string! And text strings containing a > backslash that doesn't fit the bytea hex format or the bytea escape > format will fail to convert completely! Yes. It seems a common mistake to forget or ignore that backslashes are special in the input text representation of bytea. It might be not obvious from reading the doc at [1] but we just need to quote backslashes by doubling them. AFAIK a working solution for the OP would be: sha256(replace(colname, '\', '\\')::bytea) The result is encoding-dependent, but that does not matter in the context of an expression. index. If the database ever needs to change its encoding, it will have to be recreated entirely anyway. [1] https://www.postgresql.org/docs/current/datatype-binary.html#DATATYPE-BINARY-BYTEA-ESCAPE-FORMAT Best regards, -- Daniel Vérité https://postgresql.verite.pro/
"Daniel Verite" <daniel@manitou-mail.org> writes:
>> - there is a byte-array representation of text columns, which appears to
>> be independent of database encoding
>
> Not sure what you're refering to. Both the on-disk and in-memory
> representations of text/varchar are encoding-dependent.
Huh, I thought I'd come to the opposite conclusion from testing the
other day (md5('ÿ') giving the same results for both a utf-8 database
and a latin1 database), but I must have been holding something wrong,
because trying again does confirm this. Sorry about that!
>> The obvious (to a naive user, like I was) approach, casting to bytea,
>> has exceptionally surprising behaviour: for many text strings, it does
>> exactly what the naive user might hope for, giving back the UTF-8
>> representation. But multiple distinct text strings, like '\033' and
>> '\x1b', convert to the same byte string! And text strings containing a
>> backslash that doesn't fit the bytea hex format or the bytea escape
>> format will fail to convert completely!
>
> Yes. It seems a common mistake to forget or ignore that
> backslashes are special in the input text representation of bytea.
> It might be not obvious from reading the doc at [1]
> but we just need to quote backslashes by doubling them.
>
> AFAIK a working solution for the OP would be:
> sha256(replace(colname, '\', '\\')::bytea)
That's so simple that it feels like it should have been
obvious... Thanks! Nevertheless, it feels like a weird and potentially
computationally expensive contortion for getting from a text value to
its bytea representation.
Given their encoding-dependence, shouldn't md5() (and ::bytea, for that
matter) be marked only STABLE, not IMMUTABLE? Or is IMMUTABLE's
> guaranteed to return the same results given the same arguments forever
only valid within the context of a single database?
md5() and ::bytea aside, it seems like it would be good to have both
- a function for accessing the bytea representation of a text value
without any extra steps, and
- an IMMUTABLE function for getting a UTF-8 (specifically UTF-8,
because it's the only encoding with a character set encompassing all
other supported character sets) bytea representation of a text value
regardless of the database encoding?
Cheers
Linus
On 2026-02-19 23:30:17 +0100, Peter J. Holzer wrote:
> Something like
> encode(s text, enc text) -> bytea
> decode(d bytea, enc text) -> text
> might be generally useful.
Please ignore this email. These are basically the same as convert_to()
and convert_from() which already exist.
hjp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"