Re: RFC: replace pg_stat_activity.waiting with something more descriptive

From: Alexander Korotkov
Subject: Re: RFC: replace pg_stat_activity.waiting with something more descriptive
Msg-id: CAPpHfdtYWKZq5iKHtYLWnba8oc98eN27uyN5NQN835B=wjd43g@mail.gmail.com
In reply to: Re: RFC: replace pg_stat_activity.waiting with something more descriptive  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: RFC: replace pg_stat_activity.waiting with something more descriptive  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Mon, Sep 14, 2015 at 3:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Sep 14, 2015 at 5:32 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> In order to build the consensus we need the roadmap for waits monitoring.
> Would single byte in PgBackendStatus be the only way for tracking wait
> events? Could we have pluggable infrastructure in waits monitoring: for
> instance, hooks for wait event begin and end?

No, it's not the only way of doing it.  I proposed doing that way
because it's simple and cheap, but I'm not hell-bent on it.  My basic
concern here is about the cost of this.  I think that the most data we
can report without some kind of synchronization protocol is one 4-byte
integer.  If we want to report anything more than that, we're going to
need something like the st_changecount protocol, or a lock, and that's
going to add very significantly - and in my view unacceptably - to the
cost.  I care very much about having this facility be something that
we can use in lots of places, even extremely frequent operations like
buffer reads and contended lwlock acquisition.

Yes, the major question is cost. But I think we should validate our thoughts with experiments, given that there are other possible synchronization protocols. Ildus posted an implementation of a double-buffering approach that showed quite low cost.
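
For illustration, here is a minimal sketch of the double-buffering idea (not Ildus's actual patch; all names are hypothetical): the writer fills the spare copy and then flips an atomic index, so readers normally get a consistent snapshot without taking a lock.

    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct WaitSnapshot
    {
        uint32_t wait_event;      /* current wait event id */
        uint64_t wait_start_us;   /* when the wait began */
    } WaitSnapshot;

    typedef struct WaitSlot
    {
        WaitSnapshot buf[2];      /* two copies of the data */
        _Atomic int  current;     /* index of the copy readers should use */
    } WaitSlot;

    /* Writer: fill the spare copy, then publish it with a single flip. */
    static void
    wait_slot_publish(WaitSlot *slot, const WaitSnapshot *snap)
    {
        int next = 1 - atomic_load_explicit(&slot->current, memory_order_relaxed);

        slot->buf[next] = *snap;
        atomic_store_explicit(&slot->current, next, memory_order_release);
    }

    /* Reader: read whichever copy is currently published.  A reader could
     * still see a torn value if the writer flips twice while it is reading,
     * so a real implementation needs a retry loop or changecount on top. */
    static WaitSnapshot
    wait_slot_read(WaitSlot *slot)
    {
        int cur = atomic_load_explicit(&slot->current, memory_order_acquire);

        return slot->buf[cur];
    }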

I think that there may be some *kinds of waits* for which it's
practical to report additional detail.  For example, suppose that when
a heavyweight lock wait first happens, we just report the lock type
(relation, tuple, etc.) but then when the deadlock detector expires,
if we're still waiting, we report the entire lock tag.  Well, that's
going to happen infrequently enough, and is expensive enough anyway,
that the cost doesn't matter.  But if, every time we read a disk
block, we take a lock (or bump a changecount and do a write barrier),
dump the whole block tag in there, release the lock (or do another
write barrier and bump the changecount again) that sounds kind of
expensive to me.  Maybe we can prove that it doesn't matter on any
workload, but I doubt it.  We're fighting for every cycle in some of
these code paths, and there's good evidence that we're burning too
many of them compared to competing products already.

Yes, but some competing products provide comprehensive waits monitoring as well. That makes me think it should be possible for us too.
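
For reference, the bump-changecount / write-barrier sequence described above is essentially a seqlock, which is why it costs noticeably more than a single 4-byte store. A minimal sketch (hypothetical names, not the actual st_changecount macros; a production version would also need explicit barriers around the payload accesses):

    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct BlockTag
    {
        uint32_t spcOid;
        uint32_t dbOid;
        uint32_t relNumber;
        uint32_t blockNum;
    } BlockTag;

    typedef struct WaitDetail
    {
        _Atomic uint32_t changecount;   /* odd while an update is in progress */
        BlockTag         tag;           /* the detail being published */
    } WaitDetail;

    static void
    wait_detail_write(WaitDetail *d, const BlockTag *tag)
    {
        atomic_fetch_add(&d->changecount, 1);   /* now odd: update in progress */
        d->tag = *tag;
        atomic_fetch_add(&d->changecount, 1);   /* now even: update complete */
    }

    static BlockTag
    wait_detail_read(WaitDetail *d)
    {
        BlockTag result;
        uint32_t before, after;

        do
        {
            before = atomic_load(&d->changecount);
            result = d->tag;
            after = atomic_load(&d->changecount);
        } while (before != after || (before & 1));   /* retry on concurrent write */

        return result;
    }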

I am not a big fan of hooks as a way of resolving disagreements about
the design.  We may find that there are places where it's useful to
have hooks so that different extensions can do different things, and
that is fine.  But we shouldn't use that as a way of punting the
difficult questions.  There isn't enough common understanding here of
what we're all trying to get done and why we're trying to do it in
particular ways rather than in other ways to jump to the conclusion
that a hook is the right answer.  I'd prefer to have a nice, built-in
system that everyone agrees represents a good set of trade-offs than
an extensible system.

I think the reason for hooks could be not only disagreements about the design, but platform-dependent issues too.
The next step after we have a view with current wait events will be gathering statistics about them. We can contrast at least two approaches here:
1) Periodic sampling of the current wait events.
2) Measuring each wait event's duration. We could collect statistics locally for a short period and update a shared memory structure periodically (using some synchronization protocol; see the sketch below).
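
The sketch below illustrates approach (2) with hypothetical names: durations are accumulated in backend-local memory, and shared memory is only touched at flush time, so the per-wait cost is essentially one clock_gettime() call and a couple of additions.

    #include <stdint.h>
    #include <string.h>
    #include <time.h>

    #define NUM_WAIT_EVENTS    128
    #define FLUSH_INTERVAL_US  100000        /* flush every 100 ms, say */

    typedef struct WaitStats
    {
        uint64_t count[NUM_WAIT_EVENTS];     /* number of waits per event */
        uint64_t total_us[NUM_WAIT_EVENTS];  /* accumulated duration, microseconds */
    } WaitStats;

    static WaitStats local_stats;            /* backend-local, no locking needed */
    static uint64_t  last_flush_us;

    static uint64_t
    now_us(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t) ts.tv_sec * 1000000 + (uint64_t) ts.tv_nsec / 1000;
    }

    /* Stub: a real version would copy into shared memory under whatever
     * synchronization protocol is chosen. */
    static void
    flush_to_shared_memory(const WaitStats *stats)
    {
        (void) stats;
    }

    /* Called when a wait ends; 'start_us' was taken when the wait began. */
    static void
    wait_finished(int event, uint64_t start_us)
    {
        uint64_t now = now_us();

        local_stats.count[event]++;
        local_stats.total_us[event] += now - start_us;

        /* Touch shared memory only rarely, not on every wait. */
        if (now - last_flush_us >= FLUSH_INTERVAL_US)
        {
            flush_to_shared_memory(&local_stats);
            memset(&local_stats, 0, sizeof(local_stats));
            last_flush_us = now;
        }
    }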

In the previous attempt to gather lwlock statistics, you predicted that sampling could have significant overhead [1]. In contrast, on many systems time measurements are cheap. We have implemented both approaches, and the results show that sampling every 1 millisecond produces higher overhead than measuring the duration of each wait event individually. We can share another version of waits monitoring based on sampling to make these results reproducible for everybody. However, cheap time measurements are not available on every platform. For instance, ISTM that on Windows time measurements are too expensive [2].

That makes me think that we need a pluggable solution, at least for statistics: direct measurement of event durations for the majority of systems, and sampling for the others as the lesser harm.
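
If pluggability is wanted, the minimum would be something like begin/end hooks around wait events, so that either a duration-measuring or a sampling extension could plug in. A sketch with hypothetical names, not a proposal for the exact interface:

    #include <stddef.h>
    #include <stdint.h>

    typedef void (*wait_event_begin_hook_type) (uint32_t wait_event_info);
    typedef void (*wait_event_end_hook_type) (uint32_t wait_event_info);

    /* An extension would set these at load time. */
    wait_event_begin_hook_type wait_event_begin_hook = NULL;
    wait_event_end_hook_type   wait_event_end_hook = NULL;

    static inline void
    wait_event_begin(uint32_t wait_event_info)
    {
        if (wait_event_begin_hook)
            wait_event_begin_hook(wait_event_info);
    }

    static inline void
    wait_event_end(uint32_t wait_event_info)
    {
        if (wait_event_end_hook)
            wait_event_end_hook(wait_event_info);
    }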

I think it's reasonable to consider reporting this data in the PGPROC
using a 4-byte integer rather than reporting it through a single byte
in the backend status structure.  I believe that addresses the
concerns about reporting from auxiliary processes, and it also allows
a little more data to be reported.  For anything in excess of that, I
think we should think rather harder.  Most likely, such additional
detail should be reported only for certain types of wait events, or on
a delay, or something like that, so that the core mechanism remains
really, really fast.

That sounds reasonable. There are many pending questions, but it seems like a step forward to me.
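
For concreteness, a sketch of what a single 4-byte wait field in PGPROC could look like (field, macro and function names are hypothetical, not an actual patch): the class goes in the high byte and the specific event in the low bytes, and since an aligned 4-byte store is atomic on the platforms we support, no lock or changecount is needed for this much data.

    #include <stdint.h>

    #define WAIT_CLASS_LWLOCK       0x01000000U
    #define WAIT_CLASS_HEAVYWEIGHT  0x02000000U
    #define WAIT_CLASS_BUFFER_PIN   0x03000000U

    typedef struct ExamplePGPROC
    {
        volatile uint32_t wait_event_info;  /* 0 means "not waiting" */
        /* ... other fields ... */
    } ExamplePGPROC;

    static inline void
    report_wait_start(ExamplePGPROC *proc, uint32_t classId, uint32_t eventId)
    {
        proc->wait_event_info = classId | eventId;
    }

    static inline void
    report_wait_end(ExamplePGPROC *proc)
    {
        proc->wait_event_info = 0;
    }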


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
