Re: MD5 aggregate

Поиск
Список
Период
Сортировка
От Dean Rasheed
Тема Re: MD5 aggregate
Дата
Msg-id CAEZATCUFUL9FoYYraE0MSqJdcmya6=SObbh37bfyp23p=vU=Cg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: MD5 aggregate  (Noah Misch <noah@leadboat.com>)
Ответы Re: MD5 aggregate  (Peter Eisentraut <peter_e@gmx.net>)
Список pgsql-hackers
On 26 June 2013 22:48, Noah Misch <noah@leadboat.com> wrote:
> On Wed, Jun 26, 2013 at 09:04:34PM +0100, Dean Rasheed wrote:
>> On 26 June 2013 19:32, Noah Misch <noah@leadboat.com> wrote:
>> > On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
>
>> > md5_agg() is well-defined and not cryptographically novel, and your use case
>> > is credible.  However, not every useful-sometimes function belongs in core; we
>> > mostly stick to functions with ubiquitous value and functions that would be
>> > onerous to implement externally.  (We do carry legacy stuff that wouldn't make
>> > the cut today.)  In my estimation, md5_agg() does not make that cut.  The
>> > variety of intuitive md5_agg() definitions attested upthread doesn't bode well
>> > for its broad applicability.  It could just as well live in an extension
>> > published on PGXN.  Mine is just one opinion, though; I won't be horrified if
>> > the community wants an md5_agg() in core.
>>
>> I disagree with this though. I see md5_agg() as a natural extension of
>> the already-in-core md5() functions and underlying code, satisfying a
>> well-defined use-case, and providing a checksum comparable with
>> externally generated md5 sums.
>
> All true, but I don't see those facts justifying core inclusion.
>
>> A quick google search reveals several people asking for something like
>> this, and people recommending md5(string_agg(...)) or
>> md5(string_agg(md5(...))) based solutions, which are doomed to failure
>> on larger tables. So I think that there is a case for having md5_agg()
>> in core as an alternative to such hacks, while having more
>> sophisticated crypto functions available as extensions.
>
> My nine Google hits for "md5(string_agg" included one Stack Overflow answer,
> several mirrors of that answer, and a few posts on this thread.
>

I found more than that using Google. It's not uncommon for people to
use Google and then pick the first "accepted" answer on Stack
Overflow, in which case they might well pick one of the answers here
[1] or here [2]. Or they might find this [3]. Or if they came to our
lists they might find this [4], or deduce from this [5] that it isn't
possible.

[1] http://stackoverflow.com/questions/4020033/how-can-i-get-a-hash-of-an-entire-table-in-postgresql

[2] http://stackoverflow.com/questions/13554333/finding-out-the-hash-value-of-a-group-of-rows

[3] https://ralphholz.de/?q=node/298

[4] http://www.postgresql.org/message-id/4E5F469C.5070104@compulab.co.il

[5] http://www.postgresql.org/message-id/e94d85500801301153u6b976e31m89e311c7134a0160@mail.gmail.com

I'd say there are clearly people who want it, and the nature of some
of those answers suggests to me that we ought to have a better answer
in core.


>> I haven't looked at pgcrypto because, as I said, performance wasn't my
>> primary criterion either, but removing the unnessary data copy just
>> seemed like good sense.
>>
>> I'll take a look at it, and then as you and Peter suggest, look to
>> split the patch into two. I think I will withdraw md5_total() because
>> I was never entirely happy with that anyway.
>
> Understood; feel free to back off from any performance improvements in which
> you didn't wish to involve yourself.
>
> If we do end up moving forward with md5_agg(), I note that the pgcrypto
> version already has an initialize/accumulate/finalize API division.  A patch
> importing that code largely-unchanged would be easier to verify than a patch
> restructuring the src/backend/libpq/md5.c implementation.  The two patches
> would then be a "use pgcrypto md5.c in core" patch and an md5_agg() patch.
>

I'll take a look at it, although I think there is still the potential
for bugs either way.

What I've done with libpq's md5.c is just to rearrange the internal
pieces, without touching the core algorithm code or the higher level
callers. So the most likely types of bugs are boundary/size errors. If
I'd broken any of pg_md5_hash(), pg_md5_binary() or pg_md5_crypt(),
then I'd have broken them all. It's possible to get a reasonable level
of confidence in those changes using queries like this on HEAD and
with the patch:

SELECT md5(string_agg(md5(str) || repeat(' ', i), '')) FROM ( SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str
FROMgenerate_series(0,10000) i,        generate_series(0,i) j  GROUP BY i
 
) t;

and these with the patch:

SELECT md5_agg(md5(str) || repeat(' ', i)) FROM ( SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str   FROM
generate_series(0,10000)i,        generate_series(0,i) j  GROUP BY i
 
) t;

SELECT md5_agg(md5_agg || repeat(' ', i)) FROM ( SELECT i, md5_agg(chr(32+(i+j*31)%224))   FROM
generate_series(0,10000)i,        generate_series(0,i) j  GROUP BY i
 
) t;

which all produce the same overall sum.

Regards,
Dean



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Atri Sharma
Дата:
Сообщение: Group Commits Vs WAL Writes
Следующее
От: Dean Rasheed
Дата:
Сообщение: Re: MD5 aggregate