Re: Review [was Re: MD5 aggregate]

Поиск
Список
Период
Сортировка
От Dean Rasheed
Тема Re: Review [was Re: MD5 aggregate]
Дата
Msg-id CAEZATCVC_3kRiRwpNJrTSL9Gt90dNjL=pcXdXboEDBaOu-k4QQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Review [was Re: MD5 aggregate]  (David Fetter <david@fetter.org>)
Список pgsql-hackers
On 21 June 2013 21:04, David Fetter <david@fetter.org> wrote:
> On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote:
>> On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
>> > On 15 June 2013 10:22, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>> > > There seem to be 2 separate directions that this could go, which
>> > > really meet different requirements:
>> > >
>> > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of
>> > > the order in which they are scanned. A possible approach to this might
>> > > be something like an aggregate
>> > >
>> > > md5_total(text/bytea) returns text
>> > >
>> > > that returns the sum of the md5 values of each input value, treating
>> > > each md5 value as an unsigned 128-bit integer, and then producing the
>> > > hexadecimal representation of the final sum. This should out-perform a
>> > > solution based on numeric addition, and in typical cases, the result
>> > > wouldn't be much longer than a regular md5 sum, and so would be easy
>> > > to eyeball for differences.
>> > >
>> >
>> > I've been playing around with the idea of an aggregate that computes
>> > the sum of the md5 hashes of each of its inputs, which I've called
>> > md5_total() for now, although I'm not particularly wedded to that
>> > name. Comparing it with md5_agg() on a 100M row table (see attached
>> > test script) produces interesting results:
>> >
>> > SELECT md5_agg(foo.*::text)
>> >   FROM (SELECT * FROM foo ORDER BY id) foo;
>> >
>> >  50bc42127fb9b028c9708248f835ed8f
>> >
>> > Time: 92960.021 ms
>> >
>> > SELECT md5_total(foo.*::text) FROM foo;
>> >
>> >  02faea7fafee4d253fc94cfae031afc43c03479c
>> >
>> > Time: 96190.343 ms
>> >
>> > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
>> > result is longer) but it seems like it would be very useful for
>> > quickly comparing data in SQL, since its value is not dependent on the
>> > row-order making it easier to use and better performing if there is no
>> > usable index for ordering.
>> >
>> > Note, however, that if there is an index that can be used for
>> > ordering, the performance is not necessarily better than md5_agg(), as
>> > this example shows. There is a small additional overhead per row for
>> > initialising the MD5 sums, and adding the results to the total, but I
>> > think the biggest factor is that md5_total() is processing more data.
>> > The reason is that MD5 works on 64-byte blocks, so the total amount of
>> > data going through the core MD5 algorithm is each row's size is
>> > rounded up to a multiple of 64. In this simple case it ends up
>> > processing around 1.5 times as much data:
>> >
>> > SELECT sum(length(foo.*::text)) AS md5_agg,
>> >        sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
>> >
>> >   md5_agg   |  md5_total
>> > ------------+-------------
>> >  8103815438 | 12799909248
>> >
>> > although of course that overhead won't be as large on wider tables,
>> > and even in this case the overall performance is still on a par with
>> > md5_agg().
>> >
>> > ISTM that both aggregates are potentially useful in different
>> > situations. I would probably typically use md5_total() because of its
>> > simplicity/order-independence and consistent performance, but
>> > md5_agg() might also be useful when comparing with external data.
>> >
>> > Regards,
>> > Dean
>
>> Performance review (skills needed: ability to time performance)
>>
>>     Does the patch slow down simple tests?
>>
>>         Yes.  For an MD5 checksum of some random data, here are
>>         results from master:
>>
>>             shackle@postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM
generate_series(1,10000)),
>>             postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
>>             postgres-# select md5(a::text) FROM t2;
>>             shackle@postgres:5493=# \timing
>>             Timing is on.
>>             shackle@postgres:5493=# \g
>>             Time: 955.393 ms
>>             shackle@postgres:5493=# \g
>>             Time: 950.909 ms
>>             shackle@postgres:5493=# \g
>>             Time: 947.955 ms
>>             shackle@postgres:5493=# \g
>>             Time: 946.193 ms
>>             shackle@postgres:5493=# \g
>>             Time: 950.591 ms
>>             shackle@postgres:5493=# \g
>>             Time: 952.346 ms
>>             shackle@postgres:5493=# \g
>>             Time: 948.623 ms
>>             shackle@postgres:5493=# \g
>>             Time: 939.819 ms
>>
>>         and here from master + the patch:
>>
>>             WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
>>             t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
>>             select md5(a::text) FROM t2;
>>             Time: 1129.178 ms
>>             shackle@postgres:5494=# \g
>>             Time: 1134.013 ms
>>             shackle@postgres:5494=# \g
>>             Time: 1124.387 ms
>>             shackle@postgres:5494=# \g
>>             Time: 1119.733 ms
>>             shackle@postgres:5494=# \g
>>             Time: 1104.906 ms
>>             shackle@postgres:5494=# \g
>>             Time: 1137.055 ms
>>             shackle@postgres:5494=# \g
>>             Time: 1128.977 ms
>>             shackle@postgres:5494=# \g
>>             Time: 1143.619 ms
>>
>>             Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms
>>             Avg, StddevPopulation with patch: 1127.73 ms, 11.06 ms
>
> The above was with -O0.  Please find below with vanilla ./configure (-O2):
>
> Master:
> shackle@postgres:5493=# WITH t1 AS (SELECT
> string_agg(chr(floor(255*random()+1)::int),'') AS a FROM
> generate_series(1,10000)),
> t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
> select md5(a::text) FROM t2;
> Time: 469.101 ms
> shackle@postgres:5493=# \g
> Time: 464.122 ms
> shackle@postgres:5493=# \g
> Time: 461.411 ms
> shackle@postgres:5493=# \g
> Time: 458.222 ms
> shackle@postgres:5493=# \g
> Time: 463.616 ms
> shackle@postgres:5493=# \g
> Time: 463.983 ms
> shackle@postgres:5493=# \g
> Time: 453.770 ms
> shackle@postgres:5493=# \g
> Time: 456.729 ms
>
> MD5:
> WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a
> FROM generate_series(1,10000)),
> t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
> select md5(a::text) FROM t2;
> Time: 436.862 ms
> shackle@postgres:5494=# \t
> Showing only tuples.
> shackle@postgres:5494=# \t
> Tuples only is off.
> shackle@postgres:5494=# \g
> Time: 438.686 ms
> shackle@postgres:5494=# \g
> Time: 443.789 ms
> shackle@postgres:5494=# \g
> Time: 452.522 ms
> shackle@postgres:5494=# \g
> Time: 447.824 ms
> shackle@postgres:5494=# \g
> Time: 448.547 ms
> shackle@postgres:5494=# \g
> Time: 446.901 ms
> shackle@postgres:5494=# \g
> Time: 448.537 ms
>
> Average, stddev population for master: 461.36925, 4.5883631545
> Average, stddev population for md5:    445.4585,  4.9892286729
>
> so slightly faster on MD5, as expected.
>
> Cheers,
> David.

Thanks for the review and testing!

Regards,
Dean



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Dean Rasheed
Дата:
Сообщение: Re: FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]
Следующее
От: Kevin Grittner
Дата:
Сообщение: Re: wrong state of patch in CF