Thread: check for null value before looking up the hash function


check for null value before looking up the hash function

From:
Zhihong Yu
Date:
Hi,
I was looking at the code in hash_record() of src/backend/utils/adt/rowtypes.c

It seems if nulls[i] is true, we don't need to look up the hash function.

Please take a look at the patch.

Thanks
Attachments

Re: check for null value before looking up the hash function

From:
Tom Lane
Date:
Zhihong Yu <zyu@yugabyte.com> writes:
> I was looking at the code in hash_record()
> of src/backend/utils/adt/rowtypes.c
> It seems if nulls[i] is true, we don't need to look up the hash function.

I don't think this is worth changing.  It complicates the logic,
rendering it unlike quite a few other functions written in the same
style.  In cases where the performance actually matters, the hash
function is cached across multiple calls anyway.  You might save
something if you have many calls in a query and not one of them
receives a non-null input, but how likely is that?

            regards, tom lane



Re: check for null value before looking up the hash function

From:
Ranier Vilela
Date:
>Zhihong Yu <zyu(at)yugabyte(dot)com> writes:
>> I was looking at the code in hash_record()
>> of src/backend/utils/adt/rowtypes.c
>> It seems if nulls[i] is true, we don't need to look up the hash function.

>I don't think this is worth changing. It complicates the logic,
>rendering it unlike quite a few other functions written in the same
>style. In cases where the performance actually matters, the hash
>function is cached across multiple calls anyway. You might save
>something if you have many calls in a query and not one of them
>receives a non-null input, but how likely is that?

I disagree; I think it is worth changing. The fact that it complicates
the logic is irrelevant.
But maybe the attached v2 would be a little better.
My doubt is about the result calculation when nulls are true.

regards,
Ranier Vilela
Attachments

Re: check for null value before looking up the hash function

From:
Ranier Vilela
Date:
On Sat, May 21, 2022 at 10:06, Ranier Vilela <ranier.vf@gmail.com> wrote:
>> Zhihong Yu <zyu(at)yugabyte(dot)com> writes:
>>> I was looking at the code in hash_record()
>>> of src/backend/utils/adt/rowtypes.c
>>> It seems if nulls[i] is true, we don't need to look up the hash function.
>
>> I don't think this is worth changing. It complicates the logic,
>> rendering it unlike quite a few other functions written in the same
>> style. In cases where the performance actually matters, the hash
>> function is cached across multiple calls anyway. You might save
>> something if you have many calls in a query and not one of them
>> receives a non-null input, but how likely is that?
>
> I disagree; I think it is worth changing. The fact that it complicates
> the logic is irrelevant.
> But maybe the attached v2 would be a little better.

Or v3.
 
regards,
Ranier Vilela
Attachments

Re: check for null value before looking up the hash function

From:
Tomas Vondra
Date:

On 5/21/22 15:06, Ranier Vilela wrote:
>>Zhihong Yu <zyu(at)yugabyte(dot)com> writes:
>>> I was looking at the code in hash_record()
>>> of src/backend/utils/adt/rowtypes.c
>>> It seems if nulls[i] is true, we don't need to look up the hash function.
> 
>>I don't think this is worth changing. It complicates the logic,
>>rendering it unlike quite a few other functions written in the same
>>style. In cases where the performance actually matters, the hash
>>function is cached across multiple calls anyway. You might save
>>something if you have many calls in a query and not one of them
>>receives a non-null input, but how likely is that?
> 
> I disagree.
> I think that is worth changing. The fact of complicating the logic
> is irrelevant.

That's quite a bold claim, and yet you haven't supported it by any
argument whatsoever. Trade-offs between complexity and efficiency are a
crucial development task, so complicating the logic clearly does matter.

It might be outweighed by efficiency benefits, but as Tom suggested,
the cases that might benefit from this are extremely unlikely (data with
just NULL values). And even for those cases, no benchmark quantifying the
difference was presented.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: check for null value before looking up the hash function

From:
Ranier Vilela
Date:
On Sat, May 21, 2022 at 12:05, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 5/21/22 15:06, Ranier Vilela wrote:
>>> Zhihong Yu <zyu(at)yugabyte(dot)com> writes:
>>>> I was looking at the code in hash_record()
>>>> of src/backend/utils/adt/rowtypes.c
>>>> It seems if nulls[i] is true, we don't need to look up the hash function.
>>
>>> I don't think this is worth changing. It complicates the logic,
>>> rendering it unlike quite a few other functions written in the same
>>> style. In cases where the performance actually matters, the hash
>>> function is cached across multiple calls anyway. You might save
>>> something if you have many calls in a query and not one of them
>>> receives a non-null input, but how likely is that?
>>
>> I disagree; I think it is worth changing. The fact that it complicates
>> the logic is irrelevant.
>
> That's quite a bold claim, and yet you haven't supported it by any
> argument whatsoever. Trade-offs between complexity and efficiency are a
> crucial development task, so complicating the logic clearly does matter.

What I meant is that complicating the logic in search of efficiency is
worth it, and that's what everyone in this thread is looking for.
Likewise, declining to complicate the logic and losing a little
efficiency, applied across all the code, adds up to a big loss of
efficiency. In other words, I never miss an opportunity to gain
efficiency.

regards,
Ranier Vilela

Re: check for null value before looking up the hash function

From:
"David G. Johnston"
Date:
On Sat, May 21, 2022 at 8:32 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
> On Sat, May 21, 2022 at 12:05, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> On 5/21/22 15:06, Ranier Vilela wrote:
>>>> Zhihong Yu <zyu(at)yugabyte(dot)com> writes:
>>>>> I was looking at the code in hash_record()
>>>>> of src/backend/utils/adt/rowtypes.c
>>>>> It seems if nulls[i] is true, we don't need to look up the hash function.
>>>
>>>> I don't think this is worth changing. It complicates the logic,
>>>> rendering it unlike quite a few other functions written in the same
>>>> style. In cases where the performance actually matters, the hash
>>>> function is cached across multiple calls anyway. You might save
>>>> something if you have many calls in a query and not one of them
>>>> receives a non-null input, but how likely is that?
>>>
>>> I disagree; I think it is worth changing. The fact that it complicates
>>> the logic is irrelevant.
>>
>> That's quite a bold claim, and yet you haven't supported it by any
>> argument whatsoever. Trade-offs between complexity and efficiency are a
>> crucial development task, so complicating the logic clearly does matter.
>
> What I meant is that complicating the logic in search of efficiency is
> worth it, and that's what everyone in this thread is looking for.
> Likewise, declining to complicate the logic and losing a little
> efficiency, applied across all the code, adds up to a big loss of
> efficiency. In other words, I never miss an opportunity to gain
> efficiency.

I disliked the fact that 80% of the patch was adding indentation.
Instead, remove the indentation for the normal-flow case and move the
loop short-circuit to the top of the loop, where it is traditionally
found.

This seems like a win on both efficiency and complexity grounds. Having
the "/* see hash_array() */" comment and logic twice is a downside, but a
minor one that could be replaced with a function call if desired.

diff --git a/src/backend/utils/adt/rowtypes.c b/src/backend/utils/adt/rowtypes.c
index db843a0fbf..0bc28d1742 100644
--- a/src/backend/utils/adt/rowtypes.c
+++ b/src/backend/utils/adt/rowtypes.c
@@ -1838,6 +1838,13 @@ hash_record(PG_FUNCTION_ARGS)
        TypeCacheEntry *typentry;
        uint32      element_hash;
 
+       if (nulls[i])
+       {
+           /* see hash_array() */
+           result = (result << 5) - result + 0;
+           continue;
+       }
+
        att = TupleDescAttr(tupdesc, i);
 
        if (att->attisdropped)
@@ -1860,24 +1867,16 @@ hash_record(PG_FUNCTION_ARGS)
            my_extra->columns[i].typentry = typentry;
        }
 
-       /* Compute hash of element */
-       if (nulls[i])
-       {
-           element_hash = 0;
-       }
-       else
-       {
-           LOCAL_FCINFO(locfcinfo, 1);
+       LOCAL_FCINFO(locfcinfo, 1);
 
-           InitFunctionCallInfoData(*locfcinfo, &typentry->hash_proc_finfo, 1,
-                                    att->attcollation, NULL, NULL);
-           locfcinfo->args[0].value = values[i];
-           locfcinfo->args[0].isnull = false;
-           element_hash = DatumGetUInt32(FunctionCallInvoke(locfcinfo));
+       InitFunctionCallInfoData(*locfcinfo, &typentry->hash_proc_finfo, 1,
+                                   att->attcollation, NULL, NULL);
+       locfcinfo->args[0].value = values[i];
+       locfcinfo->args[0].isnull = false;
+       element_hash = DatumGetUInt32(FunctionCallInvoke(locfcinfo));
 
-           /* We don't expect hash support functions to return null */
-           Assert(!locfcinfo->isnull);
-       }
+       /* We don't expect hash support functions to return null */
+       Assert(!locfcinfo->isnull);
 
        /* see hash_array() */
        result = (result << 5) - result + element_hash;

Re: check for null value before looking up the hash function

From:
Tom Lane
Date:
Ranier Vilela <ranier.vf@gmail.com> writes:
> On Sat, May 21, 2022 at 12:05, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> That's a quite bold claim, and yet you haven't supported it by any
>> argument whatsoever. Trade-offs between complexity and efficiency are a
>> crucial development task, so complicating the logic clearly does matter.

> What I meant is that complicating the logic in search of efficiency is
> worth it, and that's what everyone is looking for in this thread.
> Likewise, not complicating the logic, losing a little bit of efficiency,
> applied to all the code, leads to a big loss of efficiency.
> In other words, I never miss an opportunity to gain efficiency.

[ shrug... ]  You quietly ignored Tomas' main point, which is that no
evidence has been provided that there's actually any efficiency gain.

(1) Sure, in the case where only null values are encountered during a
query, we can save a cache lookup, but will that be even micro-measurable
compared to general query overhead?  Seems unlikely, especially if this is
changed in only one place.  That ties into my complaint about how this is
just one instance of a fairly widely used coding pattern.

(2) What are the effects when we *do* eventually encounter a non-null
value?  The existing coding will perform all the necessary lookups
at first call, but with the proposed change those may be spread across
query execution.  It's not implausible that that leads to a net loss
of efficiency, due to locality-of-access effects.

I'm also concerned that this increases the size of the "state space"
of this function, in that there are now more possible states of its
cached information.  While that probably doesn't add any new bugs,
it does add complication and make things harder to reason about.
So the bottom line remains that I don't think it's worth changing.

            regards, tom lane



Re: check for null value before looking up the hash function

From:
Ranier Vilela
Date:
On Sat, May 21, 2022 at 13:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ranier Vilela <ranier.vf@gmail.com> writes:
>> On Sat, May 21, 2022 at 12:05, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>> That's quite a bold claim, and yet you haven't supported it by any
>>> argument whatsoever. Trade-offs between complexity and efficiency are a
>>> crucial development task, so complicating the logic clearly does matter.
>
>> What I meant is that complicating the logic in search of efficiency is
>> worth it, and that's what everyone in this thread is looking for.
>> Likewise, declining to complicate the logic and losing a little
>> efficiency, applied across all the code, adds up to a big loss of
>> efficiency.
>> In other words, I never miss an opportunity to gain efficiency.
>
> [ shrug... ]  You quietly ignored Tomas' main point, which is that no
> evidence has been provided that there's actually any efficiency gain.

IMHO, the point here is that for non-committers everything needs
benchmarks. But I have seen many commits without benchmarks or any
evidence of gains. And of course, having benchmarks is better, but for
micro-optimizations it doesn't seem to be needed that much.

> (1) Sure, in the case where only null values are encountered during a
> query, we can save a cache lookup, but will that be even micro-measurable
> compared to general query overhead?  Seems unlikely, especially if this is
> changed in only one place.  That ties into my complaint about how this is
> just one instance of a fairly widely used coding pattern.

Of course, changing only one place, the gain is tiny, but step by step
the coding pattern changes too, becoming the new "fairly widely used" one.

> (2) What are the effects when we *do* eventually encounter a non-null
> value?  The existing coding will perform all the necessary lookups
> at first call, but with the proposed change those may be spread across
> query execution.  It's not implausible that that leads to a net loss
> of efficiency, due to locality-of-access effects.

Well, the current code tests the null branch first, and most of the time
it is not taken. So the current code has poor branch prediction, at
least, and what I proposed improves the branch prediction, at least.

> I'm also concerned that this increases the size of the "state space"
> of this function, in that there are now more possible states of its
> cached information.  While that probably doesn't add any new bugs,
> it does add complication and make things harder to reason about.
> So the bottom line remains that I don't think it's worth changing.

Of course, your arguments are all valid.
That would all be clarified with benchmarks; maybe the OP is interested
in doing them.

regards,
Ranier Vilela

Re: check for null value before looking up the hash function

From:
"David G. Johnston"
Date:
On Sat, May 21, 2022 at 10:04 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
> On Sat, May 21, 2022 at 13:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ranier Vilela <ranier.vf@gmail.com> writes:
>>> On Sat, May 21, 2022 at 12:05, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> That's quite a bold claim, and yet you haven't supported it by any
>>>> argument whatsoever. Trade-offs between complexity and efficiency are a
>>>> crucial development task, so complicating the logic clearly does matter.
>>
>>> What I meant is that complicating the logic in search of efficiency is
>>> worth it, and that's what everyone in this thread is looking for.
>>> Likewise, declining to complicate the logic and losing a little
>>> efficiency, applied across all the code, adds up to a big loss of
>>> efficiency.
>>> In other words, I never miss an opportunity to gain efficiency.
>>
>> [ shrug... ]  You quietly ignored Tomas' main point, which is that no
>> evidence has been provided that there's actually any efficiency gain.
>
> IMHO, the point here is that for non-committers everything needs
> benchmarks. But I have seen many commits without benchmarks or any
> evidence of gains. And of course, having benchmarks is better, but for
> micro-optimizations it doesn't seem to be needed that much.

Mostly because committers don't tend to do this kind of drive-by patching
that changes long-established code, which is fairly categorized as
premature optimization.

>> (1) Sure, in the case where only null values are encountered during a
>> query, we can save a cache lookup, but will that be even micro-measurable
>> compared to general query overhead?  Seems unlikely, especially if this is
>> changed in only one place.  That ties into my complaint about how this is
>> just one instance of a fairly widely used coding pattern.
>
> Of course, changing only one place, the gain is tiny, but step by step
> the coding pattern changes too, becoming the new "fairly widely used" one.

Agreed, but that isn't what was done here; there was no investigation of
the overall coding practice and no suggestion to change all instances to
the improved form.

>> (2) What are the effects when we *do* eventually encounter a non-null
>> value?  The existing coding will perform all the necessary lookups
>> at first call, but with the proposed change those may be spread across
>> query execution.  It's not implausible that that leads to a net loss
>> of efficiency, due to locality-of-access effects.
>
> Well, the current code tests the null branch first, and most of the time
> it is not taken. So the current code has poor branch prediction, at
> least, and what I proposed improves the branch prediction, at least.

Per my other reply, the v3 proposal did not, IMHO, do a good job with
branch prediction either.

I find an improvement on code-complexity grounds to be warranted, though
the benefit seems unlikely to outweigh the cost of doing it everywhere
(fixing only one place actually increases the cost component).

Even without the plausible locality-of-access argument, the benefit here
is likely to be a micro-optimization that provides only minimal benefit.
Evidence to the contrary is welcomed but, yes, the burden is going to be
placed squarely on the patch author(s) to demonstrate that the benefit
accrued from the code churn is worth the cost.

David J.

Re: check for null value before looking up the hash function

From:
Tomas Vondra
Date:
On 5/21/22 19:24, David G. Johnston wrote:
> On Sat, May 21, 2022 at 10:04 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
>> On Sat, May 21, 2022 at 13:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Ranier Vilela <ranier.vf@gmail.com> writes:
>>>> On Sat, May 21, 2022 at 12:05, Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>> That's quite a bold claim, and yet you haven't supported it by any
>>>>> argument whatsoever. Trade-offs between complexity and efficiency
>>>>> are a crucial development task, so complicating the logic clearly
>>>>> does matter.
>>>
>>>> What I meant is that complicating the logic in search of efficiency
>>>> is worth it, and that's what everyone in this thread is looking for.
>>>> Likewise, declining to complicate the logic and losing a little
>>>> efficiency, applied across all the code, adds up to a big loss of
>>>> efficiency.
>>>> In other words, I never miss an opportunity to gain efficiency.
>>>
>>> [ shrug... ]  You quietly ignored Tomas' main point, which is that no
>>> evidence has been provided that there's actually any efficiency gain.
>>
>> IMHO, the point here is that for non-committers everything needs
>> benchmarks. But I have seen many commits without benchmarks or any
>> evidence of gains. And of course, having benchmarks is better, but for
>> micro-optimizations it doesn't seem to be needed that much.
>
> Mostly because committers don't tend to do this kind of drive-by
> patching that changes long-established code, which is fairly categorized
> as premature optimization.


FWIW I find the argument that committers are somehow absolved of having
to demonstrate the benefits of a change rather misleading. Perhaps even
offensive, as it hints committers are less demanding/careful when it
comes to their own patches. Which goes directly against my experience
and understanding of what being a committer is.

I'm not going to claim every "optimization" patch submitted by a
committer had a benchmark - some patches certainly are pushed without
it, with just some general reasoning why the change is beneficial.

But I'm sure that when someone suggests the reasoning is wrong, it's
taken seriously - the discussion continues, there's a benchmark, etc. And
I can't recall a committer suggesting it's fine because some other patch
didn't have a benchmark either.

>>> (1) Sure, in the case where only null values are encountered during a
>>> query, we can save a cache lookup, but will that be even
>>> micro-measurable compared to general query overhead?  Seems unlikely,
>>> especially if this is changed in only one place.  That ties into my
>>> complaint about how this is just one instance of a fairly widely used
>>> coding pattern.
>>
>> Of course, changing only one place, the gain is tiny, but step by step
>> the coding pattern changes too, becoming the new "fairly widely used"
>> one.
>
> Agreed, but that isn't what was done here; there was no investigation of
> the overall coding practice and no suggestion to change all instances to
> the improved form.

Right. If we think the coding pattern is an improvement, we should tweak
all the places, not just one (and hope the other places will magically
switch on their own).

More importantly, though, I kinda doubt tweaking more places will
actually make the difference more significant (assuming it actually does
improve things). How likely is it you need to hash the same data type
multiple times, with just NULL values? And even if you do, improvements
like this tend to sum, not multiply - i.e. if the improvement is 1% for
one place, it's still 1% even if the query hits 10 such places.

>>> (2) What are the effects when we *do* eventually encounter a non-null
>>> value?  The existing coding will perform all the necessary lookups at
>>> first call, but with the proposed change those may be spread across
>>> query execution.  It's not implausible that that leads to a net loss
>>> of efficiency, due to locality-of-access effects.
>>
>> Well, the current code tests the null branch first, and most of the
>> time it is not taken. So the current code has poor branch prediction,
>> at least, and what I proposed improves the branch prediction, at least.
>
> Per my other reply, the v3 proposal did not, IMHO, do a good job with
> branch prediction either.
>
> I find an improvement on code-complexity grounds to be warranted, though
> the benefit seems unlikely to outweigh the cost of doing it everywhere
> (fixing only one place actually increases the cost component).
>
> Even without the plausible locality-of-access argument, the benefit here
> is likely to be a micro-optimization that provides only minimal benefit.
> Evidence to the contrary is welcomed but, yes, the burden is going to be
> placed squarely on the patch author(s) to demonstrate that the benefit
> accrued from the code churn is worth the cost.

IMHO questions like this are exactly why some actual benchmark results
would be the best thing to move this forward. I certainly can't look at
C code and say how good it is for branch prediction.


regards
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company