Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Дата
Msg-id 20220701172639.ty4iu5almspueriu@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?  (Lukas Fittl <lukas@fittl.com>)
Ответы Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?  (Maciek Sakrejda <m.sakrejda@gmail.com>)
Список pgsql-hackers
Hi,

On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
> On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres@anarazel.de> wrote:
> 
> > The attached patches are really just a prototype. I'm also not really
> > planning to work on getting this into a "production ready" patchset
> > anytime soon. I developed it primarily because I found it the overhead
> > made it too hard to nail down in which part of a query tree performance
> > changed.  If somebody else wants to continue from here...
> >
> > I do think it's be a pretty significant improvement if we could reduce
> > the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
> > bunch of low-level code.
> >
> 
> Based on an off-list conversation with Andres, I decided to dust off this
> old
> patch for using rdtsc directly. The significant EXPLAIN ANALYZE performance
> improvements (especially when using rdtsc instead of rdtsc*p*) seem to
> warrant
> giving this a more thorough look.
> 
> See attached an updated patch (adding it to the July commitfest), with a few
> changes:
> 
> - Keep using clock_gettime() as a fallback if we decide to not use rdtsc

Yep.


> - Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work

I suspect that this might not be needed anymore. Seems like it'd be ok to just
fall back to clock_gettime() in that case.


> - In an abundance of caution, for now I've decided to only enable this if we
>   are on Linux/x86, and the current kernel clocksource is TSC (the kernel
> has
>   quite sophisticated logic around making this decision, see [1])

I think our requirements are a bit lower than the kernel's - we're not
tracking wall clock over an extended period...


> Note that if we implemented the decision logic ourselves (instead of relying
> on the Linux kernel), I'd be most worried about older virtualization
> technology. In my understanding getting this right is notably more
> complicated
> than just checking cpuid, see [2].


> Known WIP problems with this patch version:
> 
> * There appears to be a timing discrepancy I haven't yet worked out, where
>   the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
>   reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
>   higher for \timing than for the EXPLAIN ANALYZE time reported on the
> server
>   side, only when rdtsc measurement is used -- its likely there is a problem
>   somewhere with how we perform the cycles to time conversion

Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.


> * Possibly related, the floating point handling for the cycles_to_sec
> variable
>   is problematic in terms of precision (see FIXME, taken over from Andres'
> POC)

And probably also performance...


> Open questions from me:
> 
> 1) Do we need to account for different TSC offsets on different CPUs in SMP
>    systems? (the Linux kernel certainly has logic to that extent, but [3]
>    suggests this is no longer a problem on Nehalem and newer chips, i.e.
> those
>    having an invariant TSC)

I don't think we should cater to systems where we need that.


> 2) Should we have a setting "--with-tsc" for configure? (instead of always
>    enabling it when on Linux/x86 with a TSC clocksource)

Probably not worth it.


> 3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
>    current instructions to finish -- the prior discussion seemed to suggest
>    we don't want it for node instruction measurements, but possibly we do
> want
>    this in other cases?)

I was wondering about that too... Perhaps we should add a
INSTR_TIME_SET_CURRENT_BARRIER() or such?


> 4) Should we support using the "mrs" instruction on ARM? (which is similar
> to
>    rdtsc, see [4])

I'd leave that for later personally.



>  #define NS_PER_S INT64CONST(1000000000)
>  #define US_PER_S INT64CONST(1000000)
>  #define MS_PER_S INT64CONST(1000)
> @@ -95,17 +104,37 @@ typedef int64 instr_time;
>  
>  #define INSTR_TIME_SET_ZERO(t)    ((t) = 0)
>  
> -static inline instr_time pg_clock_gettime_ns(void)
> +extern double cycles_to_sec;
> +
> +bool use_rdtsc;

This should be extern and inside the ifdef below.


> +#if defined(__x86_64__) && defined(__linux__)
> +extern void pg_clock_gettime_initialize_rdtsc(void);
> +#endif
> +
> +static inline instr_time pg_clock_gettime_ref_cycles(void)
>  {
>      struct timespec tmp;
>  
> +#if defined(__x86_64__) && defined(__linux__)
> +    if (use_rdtsc)
> +        return __rdtsc();
> +#endif
> +
>      clock_gettime(PG_INSTR_CLOCK, &tmp);
>  
>      return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
>  }
>  

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Mark Wong
Дата:
Сообщение: Re: real/float example for testlibpq3
Следующее
От: Andres Freund
Дата:
Сообщение: Re: EINTR in ftruncate()