Re: MD5 aggregate

Поиск
Список
Период
Сортировка
Искать

Re: MD5 aggregate

От:
Stephen Frost <sfrost@snowman.net>
Дата:

Re: MD5 aggregate

От:
Stephen Frost <sfrost@snowman.net>
Дата:

Re: MD5 aggregate

От:
Andres Freund <andres@2ndquadrant.com>
Дата:

Re: MD5 aggregate

От:
Marko Kreen <markokr@gmail.com>
Дата:

Review [was Re: MD5 aggregate]

От:
David Fetter <david@fetter.org>
Дата:

Re: Review [was Re: MD5 aggregate]

От:
David Fetter <david@fetter.org>
Дата:

Re: MD5 aggregate

От:
Noah Misch <noah@leadboat.com>
Дата:

Re: MD5 aggregate

От:
Noah Misch <noah@leadboat.com>
Дата:

Re: MD5 aggregate

От:
Peter Eisentraut <peter_e@gmx.net>
Дата:

Re: MD5 aggregate

От:
Andrew Dunstan <andrew@dunslane.net>
Дата:

Re: MD5 aggregate

От:
Hannu Krosing <hannu@2ndQuadrant.com>
Дата:

Re: MD5 aggregate

От:
Craig Ringer <craig@2ndquadrant.com>
Дата:

Re: MD5 aggregate

От:
Craig Ringer <craig@2ndquadrant.com>
Дата:

Re: MD5 aggregate

От:
Peter Eisentraut <peter_e@gmx.net>
Дата:

Re: MD5 aggregate

От:
Peter Eisentraut <peter_e@gmx.net>
Дата:

Re: MD5 aggregate

От:
Tom Lane <tgl@sss.pgh.pa.us>
Дата:

Re: MD5 aggregate

От:
Tom Lane <tgl@sss.pgh.pa.us>
Дата:

Re: MD5 aggregate

От:
Robert Haas <robertmhaas@gmail.com>
Дата:

Re: MD5 aggregate

От:
Marko Kreen <markokr@gmail.com>
Дата:

Re: MD5 aggregate

От:
Marko Kreen <markokr@gmail.com>
Дата:

Re: MD5 aggregate

От:
Benedikt Grundmann <bgrundmann@janestreet.com>
Дата:



On Fri, Jun 14, 2013 at 2:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Marko Kreen <markokr@gmail.com> writes:
> On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>> Attached is a patch implementing a new aggregate function md5_agg() to
>> compute the aggregate MD5 sum across a number of rows.

> It's more efficient to calculate per-row md5, and then sum() them.
> This avoids the need for ORDER BY.

Good point.  The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

                        regards, tom lane


xor works but only if each row is different (e.g. at the very least all columns together make a unique key). 


 

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: MD5 aggregate

От:
Dean Rasheed <dean.a.rasheed@gmail.com>
Дата:

Re: MD5 aggregate

От:
Dean Rasheed <dean.a.rasheed@gmail.com>
Дата:

MD5 aggregate

От:
Dean Rasheed <dean.a.rasheed@gmail.com>
Дата:
Hi,

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows. This is
something I've wished for a number of times. I think the primary use
case is to do a quick check that 2 tables, possibly on different
servers, contain the same data, using a query like

  SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

or

  SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

these would be equivalent to

  SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo;

but without the excessive memory consumption for the intermediate
concatenated string, and the resulting 1GB table size limit.

I've added 2 variants: md5_agg(text) and md5_agg(bytea) to match the 2
variants of md5(), so pure binary data can also be checksummed.

In passing, I've tidied up and optimised the code in md5.c a bit ---
specifically I've removed the malloc()/memcpy()/free() code that was
unnecessarily making a copy of the entire input data just to pad it
and append the bit count. This reduces the memory consumption of the
existing md5() functions for large inputs, and gives a modest
performance boost. As a result, the md5() function can no longer throw
an out-of-memory error.

Regards,
Dean

Re: MD5 aggregate

От:
Dean Rasheed <dean.a.rasheed@gmail.com>
Дата:

Re: MD5 aggregate

От:
Dean Rasheed <dean.a.rasheed@gmail.com>
Дата:

Re: MD5 aggregate

От:
Dean Rasheed <dean.a.rasheed@gmail.com>
Дата:
On 15 June 2013 10:22, Dean Rasheed  wrote:
> There seem to be 2 separate directions that this could go, which
> really meet different requirements:
>
> 1). Produce an unordered sum for SQL to compare 2 tables regardless of
> the order in which they are scanned. A possible approach to this might
> be something like an aggregate
>
> md5_total(text/bytea) returns text
>
> that returns the sum of the md5 values of each input value, treating
> each md5 value as an unsigned 128-bit integer, and then producing the
> hexadecimal representation of the final sum. This should out-perform a
> solution based on numeric addition, and in typical cases, the result
> wouldn't be much longer than a regular md5 sum, and so would be easy
> to eyeball for differences.
>

I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
  FROM (SELECT * FROM foo ORDER BY id) foo;

 50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

 02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row-order making it easier to use and better performing if there is no
usable index for ordering.

Note, however, that if there is an index that can be used for
ordering, the performance is not necessarily better than md5_agg(), as
this example shows. There is a small additional overhead per row for
initialising the MD5 sums, and adding the results to the total, but I
think the biggest factor is that md5_total() is processing more data.
The reason is that MD5 works on 64-byte blocks, so the total amount of
data going through the core MD5 algorithm is each row's size is
rounded up to a multiple of 64. In this simple case it ends up
processing around 1.5 times as much data:

SELECT sum(length(foo.*::text)) AS md5_agg,
       sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;

  md5_agg   |  md5_total
------------+-------------
 8103815438 | 12799909248

although of course that overhead won't be as large on wider tables,
and even in this case the overall performance is still on a par with
md5_agg().

ISTM that both aggregates are potentially useful in different
situations. I would probably typically use md5_total() because of its
simplicity/order-independence and consistent performance, but
md5_agg() might also be useful when comparing with external data.

Regards,
Dean

Re: Review [was Re: MD5 aggregate]

От:
Dean Rasheed <dean.a.rasheed@gmail.com>
Дата:

Re: MD5 aggregate

От:
Dean Rasheed <dean.a.rasheed@gmail.com>
Дата:

Re: MD5 aggregate

От:
Dean Rasheed <dean.a.rasheed@gmail.com>
Дата:

Re: MD5 aggregate

От:
Dean Rasheed <dean.a.rasheed@gmail.com>
Дата:

Re: MD5 aggregate

От:
Dean Rasheed <dean.a.rasheed@gmail.com>
Дата:
FAQ