Discussion: md5(large_object_id)

md5(large_object_id)

From
Karsten Hilbert
Date:
I have searched the mailing list archives as well as the
internet at large but haven't been able to find a _streaming_
version of md5 hashing for large objects. The closest thing I
found is this thread

    http://www.postgresql.org/message-id/5d0f60990704081003p615530acx86257a70bff33fa4@mail.gmail.com

which eventually led to this code

    http://llg.cubic.org/pg-mdagg/

which, however, doesn't stream in the large object data
either. Essentially, for hashing _very_ large objects (on the
order of several GB per object, I am dealing with radiology
studies aka DICOM data) one would want an md5 function which
streams in parts of a large object piece by piece using
md5_update and md5_finalize or some such.

It didn't look like pgcrypto offers a streaming version either.

Anything I am overlooking ?

Thanks,
Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: md5(large_object_id)

From
Willy-Bas Loos
Date:
On Mon, Sep 21, 2015 at 10:17 PM, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> I am dealing with radiology
> studies aka DICOM data) one would want an md5 function which
> streams in parts of a large object piece by piece using
> md5_update and md5_finalize or some such.
>
> It didn't look like pgcrypto offers a streaming version either.
>
> Anything I am overlooking ?


Hi,

I don't think that it is possible to stream the result of a query anyway, or is this some neat new feature?
I'm not sure if there is much to be gained from storing such large objects in a database. Also, the max size of a value is 1GB.
Maybe it would be better to do some file administration in the database, but not the actual storage. Then you could use a tool that does what you want on the bare file and maybe store the results in the database.
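
A minimal sketch of what I mean (table and column names purely illustrative):

    -- keep only the file administration in the database; the bytes stay on disk
    CREATE TABLE dicom_file (
        id         bigserial PRIMARY KEY,
        file_path  text NOT NULL,   -- where the actual file lives
        size_bytes bigint,
        md5_hex    text             -- hash computed on the bare file by an external tool
    );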

HTH,

--
Willy-Bas Loos

Re: md5(large_object_id)

From
"Karsten Hilbert"
Date:
> I don't think that it is possible to stream the result of a query anyway,

I was unclear. I don't expect query results to stream out to the client.
I want the "SELECT md5(OID);" to return a single md5 hash value. It is
already possible to "SELECT md5(lo_read(OID));" but that will read the
entire large object into memory. I would like the function md5(oid) to
_internally_ feed consecutive calls to lo_read() into an updating md5
function and eventually return the final hash.
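
To illustrate the difference (the oid is made up; lo_md5() does not exist, it is just the shape I am after):

    -- possible today (for objects under 1 GB), but materializes the whole
    -- large object in memory:
    SELECT md5(lo_get(12345));

    -- desired: a function that internally loops over chunked lo_read()-style
    -- reads and feeds each chunk into an incremental md5 context:
    -- SELECT lo_md5(12345);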

> I'm not sure if there is much to be gained from storing such large
> objects in a database.

ACID. Permissions. Single access method.

> Also, the max size of a value is 1GB.

The maximum size for large objects is currently 4 terabytes.

Karsten Hilbert



Re: md5(large_object_id)

From
Kevin Grittner
Date:
Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:

> I am dealing with radiology studies aka DICOM data) one would
> want an md5 function which streams in parts of a large object
> piece by piece using md5_update and md5_finalize or some such.
It would certainly be possible to write a lo_md5(oid) function to do
this, but as far as I'm aware nobody has yet done so.  How are your
C skills?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: md5(large_object_id)

From
Karsten Hilbert
Date:
On Mon, Oct 05, 2015 at 03:27:26PM +0000, Kevin Grittner wrote:

> Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
>
> > I am dealing with radiology studies aka DICOM data) one would
> > want an md5 function which streams in parts of a large object
> > piece by piece using md5_update and md5_finalize or some such.
> It would certainly be possible to write a lo_md5(oid) function to do
> this, but as far as I'm aware nobody has yet done so.  How are your
> C skills?

Next to none, unfortunately :-/

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: md5(large_object_id)

From
"Daniel Verite"
Date:
    Karsten Hilbert wrote:

> On Mon, Oct 05, 2015 at 03:27:26PM +0000, Kevin Grittner wrote:
>
> > Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> >
> > > I am dealing with radiology studies aka DICOM data) one would
> > > want an md5 function which streams in parts of a large object
> > > piece by piece using md5_update and md5_finalize or some such.
> > It would certainly be possible to write a lo_md5(oid) function to do
> > this, but as far as I'm aware nobody has yet done so.  How are your
> > C skills?
>
> Next to none, unfortunately :-/

It would be nice to have this API exposed in pgcrypto,
but as a workaround, you could use plperl instead, or a
similarly "easy" PL that provides it.

Here's an example with plperlu, using the core
Digest module. It works with other hash types, not just MD5.
See http://perldoc.perl.org/Digest.html for the exact names
(warning: 'SHA-1' instead of pgcrypto's 'SHA1').

 CREATE OR REPLACE FUNCTION lo_digest(text, oid, int)
 RETURNS bytea AS
$$
  use Digest;
  use strict;
  use Data::Dumper;
  my $ctxt = Digest->new($_[0]);
  my $sz=$_[2];
  elog(ERROR, "Invalid chunk size: $sz") if ($sz<=0);
  my $sth = spi_query("SELECT lo_open($_[1], 262144) as fd");
  my $row = spi_fetchrow($sth);
  spi_cursor_close($sth);

  if ($row) {
     my $fd = $row->{fd};
     my $bytes;
     my $plan = spi_prepare("SELECT loread($fd, $sz) as chunk");
     do {
       $sth = spi_query_prepared($plan);
       $row = spi_fetchrow($sth);
       $bytes = decode_bytea($row->{chunk});
       $ctxt->add($bytes);
       spi_cursor_close($sth);
     } while (length($bytes)>0);
     spi_exec_query("select lo_close($fd)");
     spi_freeplan($plan);
  }
  return encode_bytea($ctxt->digest);
$$ LANGUAGE plperlu;

Example of use:
# select lo_digest('MD5', 2557608, 2048*256);

         lo_digest
------------------------------------
 \xa8447e145d0f8d9ca7fe7df1bbf06d75

2557608 is the oid and 2048*256 represents 512KB.
The lo blocksize is typically 2048, so a multiple of 2048 is ideal
to optimize reads.
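
For reference, LOBLKSIZE is BLCKSZ/4, so with the default 8 kB block size it is
2048 bytes; the server's block size can be checked with:

    SHOW block_size;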

I expect it to work on very large contents while requesting only modest
amounts of memory. But unfortunately it seems pretty slow.

On my desktop core i5-3470, it takes 2.5 seconds to digest
a 100 MB audio file already in cache. An independent Perl script
doing the equivalent processing on the same data takes 0.4 s.

It's not because of Digest::MD5 or loread(); it seems it is the
pipe in between, with its text->bytea decoding, that eats most of the
CPU cycles. That step is necessary because plperl lacks the ability
to consume bytea directly. Sigh...

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: md5(large_object_id)

From
Karsten Hilbert
Date:
On Mon, Oct 05, 2015 at 03:27:26PM +0000, Kevin Grittner wrote:

> Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
>
> > I am dealing with radiology studies aka DICOM data) one would
> > want an md5 function which streams in parts of a large object
> > piece by piece using md5_update and md5_finalize or some such.
> It would certainly be possible to write a lo_md5(oid) function to do
> this, but as far as I'm aware nobody has yet done so.  How are your
> C skills?

I had hoped someone was going to say: "Yeah, right, low
hanging fruit, let's just do it for 9.next" :-)

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: md5(large_object_id)

From
Karsten Hilbert
Date:
On Wed, Oct 07, 2015 at 12:55:38PM +0200, Karsten Hilbert wrote:

> > > I am dealing with radiology studies aka DICOM data) one would
> > > want an md5 function which streams in parts of a large object
> > > piece by piece using md5_update and md5_finalize or some such.
> > It would certainly be possible to write a lo_md5(oid) function to do
> > this, but as far as I'm aware nobody has yet done so.  How are your
> > C skills?
>
> I had hoped someone was going to say: "Yeah, right, low
> hanging fruit, let's just do it for 9.next" :-)

Someone _with_ C skills, that is.

Thanks for the plperlu suggestion, btw.

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: md5(large_object_id)

From
Pavel Stehule
Date:


2015-10-07 13:18 GMT+02:00 Karsten Hilbert <Karsten.Hilbert@gmx.net>:
> On Wed, Oct 07, 2015 at 12:55:38PM +0200, Karsten Hilbert wrote:
>
> > > > I am dealing with radiology studies aka DICOM data) one would
> > > > want an md5 function which streams in parts of a large object
> > > > piece by piece using md5_update and md5_finalize or some such.
> > > It would certainly be possible to write a lo_md5(oid) function to do
> > > this, but as far as I'm aware nobody has yet done so.  How are your
> > > C skills?
> >
> > I had hoped someone was going to say: "Yeah, right, low
> > hanging fruit, let's just do it for 9.next" :-)
>
> Someone _with_ C skills, that is.

If the size of the blob is less than 1 GB, then it should be possible in plpgsql too.

 postgres=# \lo_import ~/Desktop/001.jpg
lo_import 24577
postgres=# select md5(lo_get(24577));
┌──────────────────────────────────┐
│               md5                │
╞══════════════════════════════════╡
│ 610ccaab8c7c60e1168abfa799d1305d │
└──────────────────────────────────┘
(1 row)


Regards

Pavel


> Thanks for the plperlu suggestion, btw.
>
> Karsten
> --
> GPG key ID E4071346 @ eu.pool.sks-keyservers.net
> E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346



Re: md5(large_object_id)

From
Karsten Hilbert
Date:
On Wed, Oct 07, 2015 at 01:30:24PM +0200, Pavel Stehule wrote:

> > > > > I am dealing with radiology studies aka DICOM data) one would
> > > > > want an md5 function which streams in parts of a large object
> > > > > piece by piece using md5_update and md5_finalize or some such.
> > > > It would certainly be possible to write a lo_md5(oid) function to do
> > > > this, but as far as I'm aware nobody has yet done so.  How are your
> > > > C skills?
> > >
> > > I had hoped someone was going to say: "Yeah, right, low
> > > hanging fruit, let's just do it for 9.next" :-)
> >
> > Someone _with_ C skills, that is.
> >
>
> if the size of blobs is less than 1GB, then it should be possible in
> plpgsql too.
>
>  postgres=# \lo_import ~/Desktop/001.jpg
> lo_import 24577
> postgres=# select md5(lo_get(24577));
> ┌──────────────────────────────────┐
> │               md5                │
> ╞══════════════════════════════════╡
> │ 610ccaab8c7c60e1168abfa799d1305d │
> └──────────────────────────────────┘
> (1 row)

It is, I know. But I am after exactly the > 1 GB use case.

Thanks,
Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: md5(large_object_id)

From
Karsten Hilbert
Date:
For the record - I have also devised another solution to the
underlying problem (md5(bytea) cannot process large amounts
of input), a chunked md5():

    create or replace function lo_chunked_md5(oid, int)
        returns text
        language 'plpgsql'
        stable strict
        as '
    DECLARE
        lo_id alias for $1;
        chunk_size alias for $2;
        _lo_fd integer;
        _lo_size bigint;
        _chunk_count integer;
        _remainder integer;
        _md5_concat text;
        INV_READ  constant integer := x''40000''::integer;
        SEEK_SET  constant integer := 0;
        SEEK_END  constant integer := 2;
    BEGIN
        -- check for existence of lo_id

        _lo_fd := lo_open(lo_id, INV_READ);
        -- get size (lo_lseek64 so that objects larger than 2 GB are handled)
        _lo_size := lo_lseek64(_lo_fd, 0, SEEK_END);
        PERFORM lo_close(_lo_fd);
        -- calculate chunks and remainder
        _chunk_count := _lo_size / chunk_size;
        _remainder := _lo_size % chunk_size;
        -- loop over chunks
        _md5_concat := '''';
        FOR _chunk_id in 1.._chunk_count LOOP
            _md5_concat := _md5_concat || md5(lo_get(lo_id, (_chunk_id - 1)::bigint * chunk_size, chunk_size));
        END LOOP;
        -- add remainder, unless the size is an exact multiple of chunk_size
        -- (keeps the result in sync with a client loop that stops on an empty read)
        IF _remainder > 0 THEN
            _md5_concat := _md5_concat || md5(lo_get(lo_id, _chunk_count::bigint * chunk_size, _remainder));
        END IF;
        return md5(_md5_concat);
    END;';
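
A usage sketch (the oid is made up; 2048*256 = 512 kB, a multiple of the 2048 byte large object block size):

    select lo_chunked_md5(24577, 2048*256);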

This can easily be mirrored by a client-side function, say,

    # imports and module-level helpers assumed by the snippet
    # (_MB and _log are illustrative stand-ins for the real module globals):
    import hashlib
    import logging

    _MB = 1024 * 1024
    _log = logging.getLogger(__name__)

    def file2chunked_md5(filename=None, chunk_size=500*_MB):
        _log.debug('chunked_md5(%s, %s)', filename, chunk_size)
        md5_concat = u''
        f = open(filename, 'rb')
        while True:
            md5 = hashlib.md5()
            data = f.read(chunk_size)
            if not data:
                break
            md5.update(data)
            md5_concat += md5.hexdigest()
        f.close()

        md5 = hashlib.md5()
        md5.update(md5_concat.encode('utf8'))  # encode so this also runs under Python 3
        hex_digest = md5.hexdigest()

        _log.debug('md5(%s): %s', md5_concat, hex_digest)

        return hex_digest

in Python.

Still, I'd welcome a native, streaming md5(loid) which is
bound to be more optimized by design.

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: md5(large_object_id)

From
Jim Nasby
Date:
On 10/7/15 10:34 AM, Karsten Hilbert wrote:
> Still, I'd welcome a native, streaming md5(loid) which is
> bound to be more optimized by design.

It would be nice if we had an interface to TOAST that allowed for
streaming (well, really chunking) data to a function. That wouldn't help
in this particular case, but it would significantly expand the
usefulness of a streaming version of md5 and all the other hash operators.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: md5(large_object_id)

From
Karsten Hilbert
Date:
On Wed, Oct 07, 2015 at 01:01:41PM -0500, Jim Nasby wrote:

> >Still, I'd welcome a native, streaming md5(loid) which is
> >bound to be more optimized by design.
>
> It would be nice if we had an interface to TOAST that allowed for streaming
> (well, really chunking) data to a function. That wouldn't help in this
> particular case, but it would significantly expand the usefulness of a
> streaming version of md5 and all the other hash operators.

Sure enough. A native md5(oid_of_large_object), which
internally uses lo_read(), didn't seem all that much work to
me.

Maybe I'm wrong on that side of things.

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346