Re: bg worker: general purpose requirements

From: Markus Wanner
Subject: Re: bg worker: general purpose requirements
Date:
Msg-id: 4C93896A.8080908@bluegap.ch
In reply to: Re: bg worker: patch 1 of 6 - permanent process  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: bg worker: general purpose requirements  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 09/16/2010 07:47 PM, Robert Haas wrote:
> It would be nice if there were a way to create
> a general facility here that we could then build various applications
> on, but I'm not sure whether that's the case.  We had some
> back-and-forth about what is best for replication vs. what is best for
> vacuum vs. what is best for parallel query.  If we could somehow
> conceive of a system that could serve all of those needs without
> introducing any more configuration complexity than what we have now,
> that would of course be very interesting.

Let's think about this again from a little distance. We have the existing 
autovacuum infrastructure and the Postgres-R project. Then there are the 
potential features 'parallel querying' and 'autonomous transactions', both 
of which could in principle benefit from the bgworker infrastructure.

For all of those, one could head for a multi-threaded, a multi-process, 
or an async, event-based approach. Multi-threading seems to be out of the 
question for Postgres. We don't have much of an async event framework 
anywhere, so at least for parallel querying that approach seems out of the 
question as well. Only the 'autonomous transactions' feature seems simple 
enough to be doable within a single process. That approach would still miss 
the isolation that a separate process provides (not sure that's required, 
but 'autonomous' sounds like it could be a good thing to have).

So let's assume we use the multi-process approach provided by bgworkers for 
both potential features. What are the requirements?

autovacuum: only very few jobs at a time, not very resource intensive, 
not passing around lots of data

Postgres-R: lots of concurrent jobs, easily more than normal backends, 
depending on the amount of nodes in the cluster and read/write ratio, 
lots of data to be passed around

parallel querying: a couple dozen concurrent jobs (by number of CPUs or 
spindles available?), more doesn't help, lots of data to be passed around

autonomous transactions: max. one per normal backend (correct?), way 
fewer should suffice in most cases, only control data to be passed around


So, for both potential features as well as for autovacuum, a ratio of 
1:10 (or even less) for max_bgworkers:max_connections would suffice. 
Postgres-R is clearly the outlier here. It needs special configuration 
anyway, so I'd have no problem with defaults that target 
the other use cases.
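
To put numbers on that ratio: a configuration along the following lines 
should cover the non-Postgres-R use cases (the values are purely 
illustrative, and 'max_bgworkers' just stands for whatever the GUC ends 
up being called):

  max_connections = 100
  max_bgworkers   = 10    # a 1:10 ratio; plenty for autovacuum, parallel
                          # querying and autonomous transactions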

All of the potential users of bgworkers benefit from a pre-connected 
bgworker. That means having at least one spare bgworker around per database 
could be beneficial, potentially more depending on how often load spikes 
occur. As long as there are only a few databases, it's easily possible to 
have at least one spare process around per database, but with thousands 
of databases that might get prohibitively expensive (I'm not sure where the 
boundary between win and loss lies, though: the cost of idle backends vs. 
the cost of connecting).

Nonetheless, bgworkers would make the above features easier to 
implement, as they provide the controlled background worker process 
infrastructure, including job handling (and even queuing) in the 
coordinator process. Having spare workers available is not a prerequisite 
for using bgworkers, it's just an optimization.

Autovacuum could possibly benefit from bgworkers by enabling a 
finer-grained choice of which database and table to vacuum when. I didn't 
look too much into that, though.

Regarding the additional configuration overhead of the bgworkers patch: 
max_autovacuum_workers gets turned into max_background_workers, so the 
only additional GUCs currently are min_spare_background_workers and 
max_spare_background_workers (sorry, I thought I had named them idle 
workers; it looks like I went with spare workers for the GUCs).
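
For illustration only, the relevant postgresql.conf fragment could then 
look something like this (the values are made up, not proposed defaults):

  max_background_workers       = 10   # replaces max_autovacuum_workers
  min_spare_background_workers = 1    # pre-fork at least one idle worker per database
  max_spare_background_workers = 4    # reap idle workers beyond this, per database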

These two GUCs control and limit (in both directions) the number of 
spare workers per database. It's the simplest possible variant I could 
think of, but I'm open to other mechanisms, especially ones that require 
less configuration. Simply keeping spare workers around for a given 
timeout *could* be a replacement and would save us one GUC.

However, I feel that such a timeout gives less control over how the 
bgworkers are used. For example, I'd prefer to be able to prevent the 
system from allocating all bgworkers to a single database at once. And as 
mentioned above, it also makes sense to pre-fork some bgworkers in advance, 
as long as there are still enough available. The timeout approach doesn't 
take care of that; it just assumes that the past is a good indicator of 
future use.
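
To illustrate what I mean by control, here is a minimal sketch (in C, but 
purely hypothetical; this is not the patch's actual code and all names 
besides the GUCs are made up) of how the coordinator could decide whether 
to pre-fork or reap spare workers for one database, honoring both 
spare-worker GUCs as well as the global max_background_workers limit:

  /*
   * Hypothetical sketch, not the patch's code: keep one database's spare
   * pool within the configured bounds without exceeding the cluster-wide
   * worker limit.
   */
  #include <stdio.h>

  /* proposed GUCs; the values here are just example settings */
  static int max_background_workers = 10;
  static int min_spare_background_workers = 1;
  static int max_spare_background_workers = 4;

  static int total_workers = 3;    /* currently forked, across all databases */

  typedef struct PerDatabasePool
  {
      int busy;    /* workers currently executing a job for this database */
      int spare;   /* idle, pre-connected workers kept around for it */
  } PerDatabasePool;

  /*
   * Number of workers to fork (positive) or terminate (negative) for this
   * database. The per-database maximum and the global limit together keep
   * a single database from hogging the pool with idle workers.
   */
  static int
  spare_pool_adjustment(const PerDatabasePool *db)
  {
      if (db->spare < min_spare_background_workers)
      {
          int wanted = min_spare_background_workers - db->spare;
          int room = max_background_workers - total_workers;

          if (room <= 0)
              return 0;
          return (wanted < room) ? wanted : room;
      }

      if (db->spare > max_spare_background_workers)
          return -(db->spare - max_spare_background_workers);

      return 0;    /* within bounds, nothing to do */
  }

  int
  main(void)
  {
      PerDatabasePool db = { 2, 0 };   /* two busy workers, no spares yet */

      printf("adjustment: %+d worker(s)\n", spare_pool_adjustment(&db));
      return 0;
  }

The actual bookkeeping would of course live in the coordinator's shared 
state; the point is only that the two GUCs translate into a very simple 
clamping rule.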

Hope that sheds some more light on how bgworkers could be useful. Maybe 
I just need to describe the job handling features of the coordinator 
better as well? (Simon also requested better documentation...)

Regards

Markus Wanner

