Review [was Re: MD5 aggregate]

Поиск
Список
Период
Сортировка
От David Fetter
Тема Review [was Re: MD5 aggregate]
Дата
Msg-id 20130621174835.GI5395@fetter.org
обсуждение исходный текст
Ответ на Re: MD5 aggregate  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Ответы Re: Review [was Re: MD5 aggregate]  (David Fetter <david@fetter.org>)
Список pgsql-hackers
On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
> On 15 June 2013 10:22, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> > There seem to be 2 separate directions that this could go, which
> > really meet different requirements:
> >
> > 1). Produce an unordered sum for SQL to compare 2 tables regardless of
> > the order in which they are scanned. A possible approach to this might
> > be something like an aggregate
> >
> > md5_total(text/bytea) returns text
> >
> > that returns the sum of the md5 values of each input value, treating
> > each md5 value as an unsigned 128-bit integer, and then producing the
> > hexadecimal representation of the final sum. This should out-perform a
> > solution based on numeric addition, and in typical cases, the result
> > wouldn't be much longer than a regular md5 sum, and so would be easy
> > to eyeball for differences.
> >
> 
> I've been playing around with the idea of an aggregate that computes
> the sum of the md5 hashes of each of its inputs, which I've called
> md5_total() for now, although I'm not particularly wedded to that
> name. Comparing it with md5_agg() on a 100M row table (see attached
> test script) produces interesting results:
> 
> SELECT md5_agg(foo.*::text)
>   FROM (SELECT * FROM foo ORDER BY id) foo;
> 
>  50bc42127fb9b028c9708248f835ed8f
> 
> Time: 92960.021 ms
> 
> SELECT md5_total(foo.*::text) FROM foo;
> 
>  02faea7fafee4d253fc94cfae031afc43c03479c
> 
> Time: 96190.343 ms
> 
> Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
> result is longer) but it seems like it would be very useful for
> quickly comparing data in SQL, since its value is not dependent on the
> row-order making it easier to use and better performing if there is no
> usable index for ordering.
> 
> Note, however, that if there is an index that can be used for
> ordering, the performance is not necessarily better than md5_agg(), as
> this example shows. There is a small additional overhead per row for
> initialising the MD5 sums, and adding the results to the total, but I
> think the biggest factor is that md5_total() is processing more data.
> The reason is that MD5 works on 64-byte blocks, so the total amount of
> data going through the core MD5 algorithm is each row's size is
> rounded up to a multiple of 64. In this simple case it ends up
> processing around 1.5 times as much data:
> 
> SELECT sum(length(foo.*::text)) AS md5_agg,
>        sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
> 
>   md5_agg   |  md5_total
> ------------+-------------
>  8103815438 | 12799909248
> 
> although of course that overhead won't be as large on wider tables,
> and even in this case the overall performance is still on a par with
> md5_agg().
> 
> ISTM that both aggregates are potentially useful in different
> situations. I would probably typically use md5_total() because of its
> simplicity/order-independence and consistent performance, but
> md5_agg() might also be useful when comparing with external data.
> 
> Regards,
> Dean

Submission review:
   Is the patch in a patch format which has context? (eg: context diff format)
       Yes.
   Does it apply cleanly to the current git master?
       Yes.      Does it include reasonable tests, necessary doc patches, etc? 
       Yes.

Usability review:
   Does the patch actually implement that?
       Yes.
   Do we want that?
       Enough do.
   Do we already have it?
       No.
   Does it follow SQL spec, or the community-agreed behavior?
       No, and yes, respectively.
   Does it include pg_dump support (if applicable)?
       N/A
   Are there dangers?
       Patch changes the MD5 implementation, which could       theoretically result in backward incompatibility if the
    changes are not 100% backward-compatible.
 
   Have all the bases been covered? 
       Yes.

Feature test:
   Does the feature work as advertised?
       Yes.
   Are there corner cases the author has failed to consider?
       Not that I've found so far.
   Are there any assertion failures or crashes?
       No.

Performance review (skills needed: ability to time performance)
   Does the patch slow down simple tests?
       Yes.  For an MD5 checksum of some random data, here are       results from master:
           shackle@postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM
generate_series(1,10000)),          postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
postgres-#select md5(a::text) FROM t2;           shackle@postgres:5493=# \timing            Timing is on.
shackle@postgres:5493=#\g           Time: 955.393 ms           shackle@postgres:5493=# \g           Time: 950.909 ms
      shackle@postgres:5493=# \g           Time: 947.955 ms           shackle@postgres:5493=# \g           Time:
946.193ms           shackle@postgres:5493=# \g           Time: 950.591 ms           shackle@postgres:5493=# \g
Time: 952.346 ms           shackle@postgres:5493=# \g           Time: 948.623 ms           shackle@postgres:5493=# \g
       Time: 939.819 ms
 
       and here from master + the patch:
           WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
   t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))           select md5(a::text) FROM t2;           Time:
1129.178ms           shackle@postgres:5494=# \g           Time: 1134.013 ms           shackle@postgres:5494=# \g
  Time: 1124.387 ms           shackle@postgres:5494=# \g           Time: 1119.733 ms           shackle@postgres:5494=#
\g          Time: 1104.906 ms           shackle@postgres:5494=# \g           Time: 1137.055 ms
shackle@postgres:5494=#\g           Time: 1128.977 ms           shackle@postgres:5494=# \g           Time: 1143.619 ms
 
           Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms           Avg, StddevPopulation with patch: 1127.73
ms,11.06 ms
 
   If it claims to improve performance, does it?
       Possibly for the aggregate.
   Does it slow down other things? 
       Not that I've found.

Coding review:
   Does it follow the project coding guidelines?
       Yes.
   Are there portability issues?
       Not that I can see.
   Will it work on Windows/BSD etc?
       Not yet tested.
   Are the comments sufficient and accurate?
       Yes.
   Does it do what it says, correctly?
       As far as I can tell.
   Does it produce compiler warnings?
       No.
   Can you make it crash? 
       My efforts so far have failed.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Josh Berkus
Дата:
Сообщение: Re: Why can't I use windowing functions over ordered aggregates?
Следующее
От: Thom Brown
Дата:
Сообщение: Unaccent performance