Обсуждение: Re: [PERFORM] EXPLAIN ANALYZE on 8.2

Поиск
Список
Период
Сортировка

Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
"Simon Riggs"
Дата:
On Thu, 2006-12-14 at 19:05 -0500, Tom Lane wrote:
> "Kelly Burkhart" <kelly.burkhart@gmail.com> writes:
> > I hope this isn't a "crummy mainboard" but I can't seem to affect
> > things by changing clock source (kernel 2.6.16 SLES10).  I tried
> > kernel command option clock=XXX where XXX in
> > (cyclone,hpet,pmtmr,tsc,pit), no option was significantly better than
> > the default.
>
> I believe that on machines where gettimeofday is really nice and fast,
> it doesn't require entry to the kernel at all; there's some hack that
> makes the clock readable from userspace.  (Obviously a switch to kernel
> mode would set you back a lot of the cycles involved here.)  So it's not
> so much the kernel that you need to teach as glibc.  How you do that is
> beyond my expertise, but maybe that will help you google for the right
> thing ...

Until we work out a better solution we can fix this in two ways:

1. EXPLAIN ANALYZE [ [ WITH | WITHOUT ] TIME STATISTICS ] ...

2. enable_analyze_timer = off | on (default) (USERSET)

A performance drop of 4x-10x is simply unacceptable when trying to tune
queries where the current untuned time is already too high. Tying down
production servers for hours on end when we know for certain all they
are doing is calling gettimeofday millions of times is not good. This
quickly leads to the view from objective people that PostgreSQL doesn't
have a great optimizer, whatever we say in its defence. I don't want to
leave this alone, but I don't want to spend a month fixing it either.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Martijn van Oosterhout
Дата:
On Fri, Dec 15, 2006 at 10:28:08AM +0000, Simon Riggs wrote:
> Until we work out a better solution we can fix this in two ways:
>
> 1. EXPLAIN ANALYZE [ [ WITH | WITHOUT ] TIME STATISTICS ] ...
>
> 2. enable_analyze_timer = off | on (default) (USERSET)

What exactly would this do? Only count actual rows or something? I
wrote a patch that tried statistical sampling, but the figures were too
far off for people's liking.

> A performance drop of 4x-10x is simply unacceptable when trying to tune
> queries where the current untuned time is already too high. Tying down
> production servers for hours on end when we know for certain all they
> are doing is calling gettimeofday millions of times is not good. This
> quickly leads to the view from objective people that PostgreSQL doesn't
> have a great optimizer, whatever we say in its defence. I don't want to
> leave this alone, but I don't want to spend a month fixing it either.

I think the best option is setitimer(), but it's not POSIX so
platform support is going to be patchy.

BTW, doing gettimeofday() without kernel entry is not really possible.
You could use the cycle counter but it has the problem that if you have
multiple CPUs you need to calibrate the result. If the CPU goes to
sleep, there's is no way for the userspace process to know. Only the
kernel has all the relevent information about what "time" is to get a
reasonable result.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Dimitri Fontaine
Дата:
Hi list,

Le vendredi 15 décembre 2006 11:50, Martijn van Oosterhout a écrit :
> BTW, doing gettimeofday() without kernel entry is not really possible.
> You could use the cycle counter but it has the problem that if you have
> multiple CPUs you need to calibrate the result. If the CPU goes to
> sleep, there's is no way for the userspace process to know. Only the
> kernel has all the relevent information about what "time" is to get a
> reasonable result.

I remember having played with intel RDTSC (time stamp counter) for some timing
measurement, but just read from several sources (including linux kernel
hackers considering its usage for gettimeofday()  implementation) that TSC is
not an accurate method to have elapsed time information.

May be some others method than gettimeofday() are available (Lamport
Timestamps, as PGDG may have to consider having a distributed processing
ready EA in some future), cheaper and accurate?
After all, the discussion, as far as I understand it, is about having a
accurate measure of duration of events, knowing when they occurred in the day
does not seem to be the point.

My 2¢, hoping this could be somehow helpfull,
--
Dimitri Fontaine
http://www.dalibo.com/

Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Gregory Stark
Дата:
"Martijn van Oosterhout" <kleptog@svana.org> writes:

> BTW, doing gettimeofday() without kernel entry is not really possible.

That's too strong a conclusion. Doing gettimeofday() without some help from
the kernel isn't possible but it isn't necessary to enter the kernel for each
call.

There are various attempts at providing better timing infrastructure at low
overhead but I'm not sure what's out there currently. I expect to do this what
we'll have to do is invent a pg_* abstraction that has various implementations
on different architectures. On Solaris it can use DTrace internally, on Linux
it might have something else (or more likely several different options
depending on the age and config options of the kernel). 

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
"Simon Riggs"
Дата:
On Fri, 2006-12-15 at 11:50 +0100, Martijn van Oosterhout wrote:
> On Fri, Dec 15, 2006 at 10:28:08AM +0000, Simon Riggs wrote:
> > Until we work out a better solution we can fix this in two ways:
> >
> > 1. EXPLAIN ANALYZE [ [ WITH | WITHOUT ] TIME STATISTICS ] ...
> >
> > 2. enable_analyze_timer = off | on (default) (USERSET)
>
> What exactly would this do? Only count actual rows or something?

Yes. It's better to have this than nothing at all.

> I
> wrote a patch that tried statistical sampling, but the figures were too
> far off for people's liking.

Well, I like your ideas, so if you have any more...

Maybe sampling every 10 rows will bring things down to an acceptable
level (after the first N). You tried less than 10 didn't you?

Maybe we can count how many real I/Os were required to perform each
particular row, so we can adjust the time per row based upon I/Os. ISTM
that sampling at too low a rate means we can't spot the effects of cache
and I/O which can often be low frequency but high impact.

> I think the best option is setitimer(), but it's not POSIX so
> platform support is going to be patchy.

Don't understand that. I thought that was to do with alarms and signals.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Martijn van Oosterhout
Дата:
On Fri, Dec 15, 2006 at 12:15:59PM +0000, Gregory Stark wrote:
> There are various attempts at providing better timing infrastructure at low
> overhead but I'm not sure what's out there currently. I expect to do this what
> we'll have to do is invent a pg_* abstraction that has various implementations
> on different architectures. On Solaris it can use DTrace internally, on Linux
> it might have something else (or more likely several different options
> depending on the age and config options of the kernel).

I think we need to move to a sampling approach. setitimer is good,
except it doesn't tell you if signals have been lost. Given they are
most likely to be lost during high disk I/O, they're actually
significant. I'm trying to think of a way around that. Then you don't
need a cheap gettimeofday at all...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Вложения

Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Martijn van Oosterhout
Дата:
On Fri, Dec 15, 2006 at 12:20:46PM +0000, Simon Riggs wrote:
> > I
> > wrote a patch that tried statistical sampling, but the figures were too
> > far off for people's liking.
>
> Well, I like your ideas, so if you have any more...
>
> Maybe sampling every 10 rows will bring things down to an acceptable
> level (after the first N). You tried less than 10 didn't you?

Yeah, it reduced the number of calls as the count got larger. It broke
somewhere, though I don't quite remember why.

> Maybe we can count how many real I/Os were required to perform each
> particular row, so we can adjust the time per row based upon I/Os. ISTM
> that sampling at too low a rate means we can't spot the effects of cache
> and I/O which can often be low frequency but high impact.

One idea is to sample at fixed interval. Where the I/O cost is high,
there'll be a lot of sampling points. The issue being that the final
result are not totally accurate anymore.

> > I think the best option is setitimer(), but it's not POSIX so
> > platform support is going to be patchy.
>
> Don't understand that. I thought that was to do with alarms and signals.

On my system the manpage say setitimer() conforms to: SVr4, 4.4BSD. I'm
unsure how many of the supported platforms fall under that
catagorisation.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Florian Weimer
Дата:
* Simon Riggs:

>> I think the best option is setitimer(), but it's not POSIX so
>> platform support is going to be patchy.
>
> Don't understand that. I thought that was to do with alarms and
> signals.

You could use it for sampling.  Every few milliseconds, you record
which code is executing (possibly using a global variable which is set
and reset accordingly).  But this might lead to tons of interrupted
system calls, and not all systems mask them, so this might not be an
option for the PostgreSQL code base.

On the other hand, if performance in more recent Linux releases (from
kernel.org) are acceptable, you should assume that the problem will
eventually fix itself.  FWIW, I see the 9x overhead on something that
is close to 2.6.17 (on AMD64/Opteron), so this could be wishful
thinking. 8-(

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Peter Eisentraut
Дата:
Am Freitag, 15. Dezember 2006 11:28 schrieb Simon Riggs:
> Until we work out a better solution we can fix this in two ways:
>
> 1. EXPLAIN ANALYZE [ [ WITH | WITHOUT ] TIME STATISTICS ] ...
>
> 2. enable_analyze_timer = off | on (default) (USERSET)

The second one is enough in my mind.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Tom Lane
Дата:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Fri, Dec 15, 2006 at 12:20:46PM +0000, Simon Riggs wrote:
>> Maybe sampling every 10 rows will bring things down to an acceptable
>> level (after the first N). You tried less than 10 didn't you?

> Yeah, it reduced the number of calls as the count got larger. It broke
> somewhere, though I don't quite remember why.

The fundamental problem with it was the assumption that different
executions of a plan node will have the same timing.  That's not true,
in fact not even approximately true.  IIRC the patch did realize
that first-time-through is not a predictor for the rest, but some of
our plan nodes have enormous variance even after the first time.
I think the worst case is batched hash joins.

            regards, tom lane

Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Tom Lane
Дата:
Gregory Stark <stark@enterprisedb.com> writes:
> There are various attempts at providing better timing infrastructure at low
> overhead but I'm not sure what's out there currently. I expect to do this what
> we'll have to do is invent a pg_* abstraction that has various implementations
> on different architectures.

You've got to be kidding.  Surely it's glibc's responsibility, not ours,
to implement gettimeofday correctly for the hardware.

            regards, tom lane

Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Martijn van Oosterhout
Дата:
On Fri, Dec 15, 2006 at 09:56:57AM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > On Fri, Dec 15, 2006 at 12:20:46PM +0000, Simon Riggs wrote:
> >> Maybe sampling every 10 rows will bring things down to an acceptable
> >> level (after the first N). You tried less than 10 didn't you?
>
> > Yeah, it reduced the number of calls as the count got larger. It broke
> > somewhere, though I don't quite remember why.
>
> The fundamental problem with it was the assumption that different
> executions of a plan node will have the same timing.  That's not true,
> in fact not even approximately true.  IIRC the patch did realize
> that first-time-through is not a predictor for the rest, but some of
> our plan nodes have enormous variance even after the first time.
> I think the worst case is batched hash joins.

It didn't assume that because that's obviously bogus. It assumed the
durations would be spread as a normal distribution. Which meant that
over time the average of the measured iterations would approch the
actual average. It tried to take enough measurements to try and keep
expected error small, but it's statistics, you can only say "this will
give the right answer >95% of the time".

You are correct though, the error was caused by unexpectedly large
variations, or more likely, an unexpected distribution curve.
Statistically, we took enough samples to not be affected significantly
by large variations. Even if it looked more like a gamma distribution
it should not have been as far off as it was.

Looking at alternative approaches, like sampling with a timer, you end
up with the same problem: sometimes the calculations will fail and
produce something strange. The simplest example being than a 100Hz
timer is not going to produce any useful information for queries in the
millisecond range. A higher frequency timer than that is not going to
be available portably.

You could probably throw more effort into refining the statistics behind
it, but at some point we're going to have to draw a line and say: it's
going to be wrong X% of the time, deal with it. If we're not willing to
say that, then there's no point going ahead with any statistical
approach.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Вложения

Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Tom Lane
Дата:
Peter Eisentraut <peter_e@gmx.net> writes:
> Am Freitag, 15. Dezember 2006 11:28 schrieb Simon Riggs:
>> Until we work out a better solution we can fix this in two ways:
>>
>> 1. EXPLAIN ANALYZE [ [ WITH | WITHOUT ] TIME STATISTICS ] ...
>> 2. enable_analyze_timer = off | on (default) (USERSET)

> The second one is enough in my mind.

I don't see any point in either one.  If you're not going to collect
timing data then the only useful info EXPLAIN ANALYZE could provide is
knowledge of which rowcount estimates are badly off ... and to get that,
you have to wait for the query to finish, which may well be impractical
even without the gettimeofday overhead.  We had discussed upthread the
idea of having an option to issue a NOTICE as soon as any actual
rowcount exceeds the estimate by some-configurable-percentage, and that
seems to me to be a much more useful response to the problem of
"E.A. takes too long" than removing gettimeofday.

One thing that's not too clear to me though is how the NOTICE would
identify the node at which the rowcount was exceeded...

            regards, tom lane

Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Gregory Stark
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
> > There are various attempts at providing better timing infrastructure at low
> > overhead but I'm not sure what's out there currently. I expect to do this what
> > we'll have to do is invent a pg_* abstraction that has various implementations
> > on different architectures.
>
> You've got to be kidding.  Surely it's glibc's responsibility, not ours,
> to implement gettimeofday correctly for the hardware.

Except for two things:

a) We don't really need gettimeofday. That means we don't need something
sensitive to adjustments made by ntp etc. In fact that would be actively bad.
Currently if the user runs "date" to reset his clock back a few days I bet
interesting things happen to a large explain analyze that's running.

In fact we don't need something that represents any absolute time, only time
elapsed since some other point we choose. That might be easier to implement
than what glibc has to do to implement gettimeofday fully.

b) glibc may not want to endure an overhead on every syscall and context
switch to make gettimeofday faster on the assumption that gettimeofday is a
rare call and it should pay the price rather than imposing an overhead on
everything else.

Postgres knows when it's running an explain analyze and a 1% overhead would be
entirely tolerable, especially if it affected the process pretty much evenly
unlike the per-gettimeofday-overhead which can get up as high as 100% on some
types of subplans and is negligible on others. And more to the point Postgres
wouldn't have to endure this overhead at all when it's not needed whereas
glibc has no idea when you're going to need gettimeofday next.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Ron
Дата:
At 10:45 AM 12/15/2006, Tom Lane wrote:
>Gregory Stark <stark@enterprisedb.com> writes:
> > There are various attempts at providing better timing infrastructure at low
> > overhead but I'm not sure what's out there currently. I expect to
> do this what
> > we'll have to do is invent a pg_* abstraction that has various
> implementations
> > on different architectures.
>
>You've got to be kidding.  Surely it's glibc's responsibility, not ours,
>to implement gettimeofday correctly for the hardware.
>
>                         regards, tom lane

I agree with Tom on this.  Perhaps the best compromise is for the pg
community to make thoughtful suggestions to the glibc community?

Ron Peacetree


Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
"Simon Riggs"
Дата:
On Fri, 2006-12-15 at 10:57 -0500, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > Am Freitag, 15. Dezember 2006 11:28 schrieb Simon Riggs:
> >> Until we work out a better solution we can fix this in two ways:
> >>
> >> 1. EXPLAIN ANALYZE [ [ WITH | WITHOUT ] TIME STATISTICS ] ...
> >> 2. enable_analyze_timer = off | on (default) (USERSET)
>
> > The second one is enough in my mind.
>
> I don't see any point in either one.  If you're not going to collect
> timing data then the only useful info EXPLAIN ANALYZE could provide is
> knowledge of which rowcount estimates are badly off ... and to get that,
> you have to wait for the query to finish, which may well be impractical
> even without the gettimeofday overhead.

On a different part of this thread, you say:

On Fri, 2006-12-15 at 09:56 -0500, Tom Lane wrote:

> The fundamental problem with it was the assumption that different
> executions of a plan node will have the same timing.  That's not true,
> in fact not even approximately true.

It doesn't make sense to me to claim that the timing is so important
that we cannot do without it, at the same time as saying it isn't even
approximately true that is highly variable.

>  We had discussed upthread the
> idea of having an option to issue a NOTICE as soon as any actual
> rowcount exceeds the estimate by some-configurable-percentage, and that
> seems to me to be a much more useful response to the problem of
> "E.A. takes too long" than removing gettimeofday.

> One thing that's not too clear to me though is how the NOTICE would
> identify the node at which the rowcount was exceeded...

We'd have to output the whole EXPLAIN as a NOTICE for it to make any
sense. If we can't do without the timings, then half an EXPLAIN would be
even worse.

We'd need to take account of non-linear nodes. Hash nodes react badly
beyond a certain point, HashAgg even worse. Sort performs poorly after
the end of memory, as does Materialize. Other nodes are more linear so
would need a different percentage. I don't like the sound of a whole
gaggle of GUCs to describe that. Any ideas?

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] EXPLAIN ANALYZE on 8.2

От
Tom Lane
Дата:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Fri, 2006-12-15 at 09:56 -0500, Tom Lane wrote:
>> The fundamental problem with it was the assumption that different
>> executions of a plan node will have the same timing.  That's not true,
>> in fact not even approximately true.

> It doesn't make sense to me to claim that the timing is so important
> that we cannot do without it, at the same time as saying it isn't even
> approximately true that is highly variable.

Huh?  What I said was that successive executions of the same plan node
may take considerably different amounts of time, and the proposed
sampling patch failed to handle that situation with acceptable accuracy.

            regards, tom lane