Discussion: How to hash a large amount of data within Postgres?


How to hash a large amount of data within Postgres?

From
Thorsten Schöning
Date:
Hi all,

I'm one of those people still storing user-uploaded files within
Postgres instead of some file system, and over the years this has grown
to the point where individual uploads reach multiple GiB.

Some years ago I implemented some SQL to read all files, build a table
of SHA256 hashes and tell me how much data is redundant. The goal was
to have a look at which files share the same hash with different LOIDs
and optionally change that, so that in the end each of those files is
stored only once.

While the approach was pretty naive, because it simply read all files
into memory to calculate the hashes, I'm fairly sure it worked in
the past with Postgres 9.6. The executing server also had enough free
RAM available to process the files of up to ~4 GiB one after
another.

I tried that SQL today with Postgres 11 on Ubuntu 18.04 and it failed:

> [Code: 0, SQL State: XX000]  FEHLER: invalid memory alloc request size 1898107949
>   Wobei: PL/pgSQL-Funktion loid_hash_calc(oid,text)[...]

> PostgreSQL 11.12 (Ubuntu 11.12-1.pgdg18.04+1)
> PostgreSQL JDBC Driver 42.2.9

I searched regarding that issue and only found two relevant results:
corrupted rows for some reason, and simply size restrictions when
allocating memory. The latter is more likely than the former in my
case, as the restriction seems to be 1 GiB and I do have larger
files.

Currently I'm simply doing the following, because I didn't find any
interfaces that allow forwarding blocks of data, LOIDs, file descriptors
or anything like that which works with smaller buffers.

> fd      := lo_open( loid,  INV_READ);
> size    := lo_lseek(fd, 0, SEEK_END);
> PERFORM    lo_lseek(fd, 0, SEEK_SET);

> hashBin := digest(loread(fd, size), algorithm);
> hashHex := encode(hashBin,          'hex');

So, is there any way to work around the problem I have currently? Can
I increase the memory restriction somewhere in the config? Are there
any functions available that work on blocks of data which I'm missing? I
didn't find any state-maintaining hash calls.

Thanks!

Kind regards

Thorsten Schöning

--
AM-SoFT IT-Service - Bitstore Hameln GmbH
Member of the Bitstore Group - your full-service provider for IT and telecommunications

E-Mail: Thorsten.Schoening@AM-SoFT.de
Web:    http://www.AM-SoFT.de/

Tel:   05151-  9468- 0
Tel:   05151-  9468-55
Fax:   05151-  9468-88
Mobil:  0178-8 9468-04

AM-SoFT IT-Service - Bitstore Hameln GmbH, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 221853 - Geschäftsführer: Janine Galonska



Re: How to hash a large amount of data within Postgres?

From
"Peter J. Holzer"
Date:
On 2021-06-21 15:53:09 +0200, Thorsten Schöning wrote:
> Some years ago I implemented some SQL to read all files, build a table
> of SHA256 hashes and tell me how much data is redundant. The goal was
> to have a look at which files share the same hash with different LOIDs
> and optionally change that, so that in the end each of those files is
> stored only once.
>
> While the approach was pretty naive, because it simply read all files
> into memory to calculate the hashes, I'm fairly sure it worked in
> the past with Postgres 9.6. The executing server also had enough free
> RAM available to process the files of up to ~4 GiB one after
> another.
>
> I tried that SQL today with Postgres 11 on Ubuntu 18.04 and it failed:
>
> > [Code: 0, SQL State: XX000]  FEHLER: invalid memory alloc request size 1898107949
> >   Wobei: PL/pgSQL-Funktion loid_hash_calc(oid,text)[...]
[...]
> I searched regarding that issue and only found two relevant results:
> corrupted rows for some reason, and simply size restrictions when
> allocating memory. The latter is more likely than the former in my
> case, as the restriction seems to be 1 GiB and I do have larger
> files.

1 GB is the maximum size of quite a few data structures in PostgreSQL. I
don't use PL/pgSQL, but I wouldn't be surprised if it was the maximum
size of whatever loread() returns (a bytea?). I would be surprised if
this limit was higher in version 9.6 than it is in version 11, however.


> Currently I'm simply doing the following, because I didn't find any
> interfaces that allow forwarding blocks of data, LOIDs, file descriptors
> or anything like that which works with smaller buffers.
>
> > fd      := lo_open( loid,  INV_READ);
> > size    := lo_lseek(fd, 0, SEEK_END);
> > PERFORM    lo_lseek(fd, 0, SEEK_SET);
>
> > hashBin := digest(loread(fd, size), algorithm);
> > hashHex := encode(hashBin,          'hex');
>
> So, is there any way to work around the problem I have currently?

Normally, hash libraries have a way to feed chunks of data into a hash
computation to avoid having to keep the whole thing in RAM.
The pgcrypto extension seems to be lacking such functionality, however.

I would build something similar to a Merkle tree:

Choose a convenient chunk size (a few MB is probably ok), read the large
object in chunks of this size, computing the hash for each. Concatenate
all the hashes and compute the hash of that. Add intermediate levels if
the concatenated hashes are still too large to fit in memory.
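
For illustration, here is a minimal sketch of that last step in plain
SQL, assuming pgcrypto's digest() and a hypothetical table
chunk_hashes(loid oid, chunk_no int, chunk_hash bytea) that already
holds one digest per chunk:

    -- The per-file hash is the digest of the ordered concatenation of
    -- the (small) per-chunk digests; string_agg() works on bytea.
    SELECT loid,
           encode(digest(string_agg(chunk_hash, ''::bytea
                                    ORDER BY chunk_no),
                         'sha256'),
                  'hex') AS file_hash
    FROM   chunk_hashes
    GROUP  BY loid;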

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: How to hash a large amount of data within Postgres?

From
Tomas Vondra
Date:
On 6/23/21 7:39 PM, Peter J. Holzer wrote:
> On 2021-06-21 15:53:09 +0200, Thorsten Schöning wrote:
>> Some years ago I implemented some SQL to read all files, build a table
>> of SHA256 hashes and tell me how much data is redundant. The goal was
>> to have a look at which files share the same hash with different LOIDs
>> and optionally change that, so that in the end each of those files is
>> stored only once.
>>
>> While the approach was pretty naive, because it simply read all files
>> into memory to calculate the hashes, I'm fairly sure it worked in
>> the past with Postgres 9.6. The executing server also had enough free
>> RAM available to process the files of up to ~4 GiB one after
>> another.
>>
>> I tried that SQL today with Postgres 11 on Ubuntu 18.04 and it failed:
>>
>>> [Code: 0, SQL State: XX000]  FEHLER: invalid memory alloc request size 1898107949
>>>   Wobei: PL/pgSQL-Funktion loid_hash_calc(oid,text)[...]
> [...]
>> I searched regarding that issue and only found two relevant results:
>> corrupted rows for some reason, and simply size restrictions when
>> allocating memory. The latter is more likely than the former in my
>> case, as the restriction seems to be 1 GiB and I do have larger
>> files.
> 
> 1 GB is the maximum size of quite a few data structures in PostgreSQL. I
> don't use PL/pgSQL, but I wouldn't be surprised if it was the maximum
> size of whatever loread() returns (a bytea?). I would be surprised if
> this limit was higher in version 9.6 than it is in version 11, however.
> 

Well, it's actually a bit worse than that - the maximum allocation size
is (1GB - 1B), as it's defined like this:

#define MaxAllocSize    ((Size) 0x3fffffff) /* 1 gigabyte - 1 */

And this includes both the "user data" and a small "header" used for the
bytea value. Depending on what format you use to output the values there
may be additional limits (e.g. 'hex' requires 2 characters per byte, so
doubling the amount of memory needed).

For large objects this is not an issue, because we store them in small
chunks, not as one large bytea value.

> 
>> Currently I'm simply doing the following, because I didn't find any
>> interfaces that allow forwarding blocks of data, LOIDs, file descriptors
>> or anything like that which works with smaller buffers.
>>
>>> fd      := lo_open( loid,  INV_READ);
>>> size    := lo_lseek(fd, 0, SEEK_END);
>>> PERFORM    lo_lseek(fd, 0, SEEK_SET);
>>
>>> hashBin := digest(loread(fd, size), algorithm);
>>> hashHex := encode(hashBin,          'hex');
>>
>> So, is there any way to work around the problem I have currently?
> 
> Normally, hash libraries have a way to feed chunks of data into a hash
> computation to avoid having to keep the whole thing in RAM.
> The pgcrypto extension seems to be lacking such functionality, however.
> 
> I would build something similar to a Merkle tree:
> 
> Choose a convenient chunk size (a few MB is probably ok), read the large
> object in chunks of this size, computing the hash for each. Concatenate
> all the hashes and compute the hash of that. Add intermediate levels if
> the concatenated hashes are still too large to fit in memory.
> 

Not sure where you searched, but there definitely are interfaces to read
chunks of data from large objects - see this:

1) lo_get (loid, offset, length)
   https://www.postgresql.org/docs/13/lo-funcs.html

2) lo_seek() + lo_read()
   https://www.postgresql.org/docs/13/lo-interfaces.html

Obviously, you can't do "loread(fd, size)" because that's going to
attempt building one large bytea, failing because of the alloc limit.
You have to stream the data into the hash.

Doing that in plpgsql is possible, although possibly somewhat slow.
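
A minimal PL/pgSQL sketch of reading in chunks, assuming pgcrypto is
installed; since pgcrypto has no incremental digest API, it falls back
to hashing the concatenated per-chunk digests as suggested earlier in
the thread. The function name loid_hash_chunked and the 4 MiB chunk
size are arbitrary, and the result is not the plain SHA-256 of the
whole file, which is still fine for duplicate detection as long as
every file is hashed the same way:

    CREATE OR REPLACE FUNCTION loid_hash_chunked(p_loid oid,
                                                 p_algorithm text DEFAULT 'sha256')
    RETURNS text
    LANGUAGE plpgsql
    AS $$
    DECLARE
        c_chunk_size constant integer := 4 * 1024 * 1024;  -- read 4 MiB per call
        v_offset     bigint := 0;
        v_chunk      bytea;
        v_hashes     bytea  := ''::bytea;  -- concatenated per-chunk digests
    BEGIN
        LOOP
            -- lo_get() reads at most c_chunk_size bytes starting at v_offset,
            -- so no single allocation comes near the 1 GB limit
            v_chunk := lo_get(p_loid, v_offset, c_chunk_size);
            EXIT WHEN v_chunk IS NULL OR length(v_chunk) = 0;

            v_hashes := v_hashes || digest(v_chunk, p_algorithm);
            v_offset := v_offset + length(v_chunk);
        END LOOP;

        -- final digest over the small accumulated list of chunk digests
        RETURN encode(digest(v_hashes, p_algorithm), 'hex');
    END;
    $$;

It could then be called per LOID much like the original
loid_hash_calc(), e.g. SELECT loid_hash_chunked(loid) for each row
that references a large object.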


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [SPAM] Re: How to hash a large amount of data within Postgres?

From
Thorsten Schöning
Date:
Good day Tomas Vondra,
on Thursday, 24 June 2021 at 00:56 you wrote:

> Not sure where you searched, but there definitely are interfaces to read
> chunks of data from large objects - see this:

That wasn't the point; it was the lack of hash functions that work on
such read blocks of larger data while maintaining state internally. I
only see functions outputting a calculated hash for a given block of
data, not start/progress/end interfaces like in many other libs.

So, am I missing something, or do I need to build something based on
multiple individually computed hash results instead, as suggested?

Kind regards

Thorsten Schöning

--
AM-SoFT IT-Service - Bitstore Hameln GmbH
Member of the Bitstore Group - your full-service provider for IT and telecommunications

E-Mail: Thorsten.Schoening@AM-SoFT.de
Web:    http://www.AM-SoFT.de/

Tel:   05151-  9468- 0
Tel:   05151-  9468-55
Fax:   05151-  9468-88
Mobil:  0178-8 9468-04

AM-SoFT IT-Service - Bitstore Hameln GmbH, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 221853 - Geschäftsführer: Janine Galonska

