Discussion: [GENERAL] # of connections and architecture design

[GENERAL] # of connections and architecture design

From:
Moreno Andreo
Date:
Hi all,
     As many of you have read last Friday (and many have tried to help,
too, and I still thank you very much), I had a bad service outage.
I was advised to reduce the maximum number of connections by using a
pooler, and that's what I'm building in the test lab, but I'm wondering
if there's something I can do with my overall architecture design.
At the moment we host one database per customer (about 400 now), and
every customer has two points of access to the data:
- Directly to the database, via rubyrep, to replicate the database he
has on his own machine
- Via WCF self-hosted web services, to read other customers' data
Every customer can access (and replicate) his database from a number of
different locations (max 3).
Customers are organized in groups (max 10 per group), and there is the
chance that someone accesses someone else's data via WCF.
For example, take a group of 5: everyone running rubyrep with only one
location enabled, and reading data from the others' databases.
If I'm not wrong, that's 5 connections (rubyrep) plus 5*4 (everyone
connecting to everyone else's database) for WCF, so 25 connections.
Now imagine a group of 10...
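Just to make that arithmetic explicit (my own back-of-the-envelope
formula: n rubyrep connections plus n*(n-1) WCF connections for a group
of n, ignoring the extra locations), a quick check from psql:

SELECT n, n + n * (n - 1) AS connections
FROM (VALUES (5), (10)) AS groups(n);

-- n = 5  ->  25 connections
-- n = 10 -> 100 connections per group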
Last Friday I was told that 350 connections is quite a big number and
that things can begin to slow down. OK. When something slows down, I'm
used to searching for and finding the bottleneck (CPU, RAM, I/O, etc.).
If everything was apparently running fine (CPU < 10%, RAM used < 20%,
I/O rarely over 20%), how can I tell there's a bottleneck that's slowing
things down? Am I missing something?
Another thing: on a customer server (with a similar, smaller
architecture) I _do_ have a connection leak problem that's under
investigation, but when things begin to slow down there I simply run
pg_terminate_backend on all connections older than 10 minutes and
everything goes back to normal. On my server, last Friday, that did not
help, so I thought it was not the main problem.
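For reference, that cleanup is essentially a query along these lines (a
sketch only; I'm taking "age" here as time since backend_start, and
skipping the current session):

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
  AND backend_start < now() - interval '10 minutes';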
I have no problem splitting this architecture across as many servers as
I need, but I think I need some tips on how to design it, in order to
avoid major issues in the near future (ask for details if needed).

The current PostgreSQL 9.5.6 server is an 8-core VM with 52 GB RAM,
running Debian 8.
The WCF server is Windows 2012 R2, 4 cores, 16 GB RAM.

While we were facing the issue, neither of them showed any sign of
overload and their logs were clean.

I'm a bit scared it could happen again...

Thanks again
Moreno



Re: [GENERAL] # of connections and architecture design

From:
Jeff Janes
Date:
On Tue, Apr 18, 2017 at 2:42 AM, Moreno Andreo <moreno.andreo@evolu-s.it> wrote:
[...]

While we were facing the issue, neither of them showed any sign of overload and their logs were clean.

I'm a bit scared it could happen again...

The logs being clean doesn't help much if your log settings are too terse.

Is log_lock_waits on?  log_checkpoints?  track_io_timing (doesn't show up in the logs, you have to query database views)?  

Is log_min_duration_statement set to a reasonable value?  log_autovacuum_min_duration?

Are you using pg_stat_statements (also doesn't show up in the logs, you have to query it), and perhaps auto_explain?
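For what it's worth, all of those can be checked in one go from psql (a
sketch; the list just mirrors the settings named above, and the
pg_stat_database timing columns only get populated once track_io_timing
is on):

SELECT name, setting
FROM pg_settings
WHERE name IN ('log_lock_waits', 'log_checkpoints', 'track_io_timing',
               'log_min_duration_statement', 'log_autovacuum_min_duration');

-- With track_io_timing enabled, cumulative I/O wait time shows up here:
SELECT datname, blk_read_time, blk_write_time
FROM pg_stat_database
ORDER BY blk_read_time + blk_write_time DESC;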

Cheers,

Jeff


Re: [GENERAL] # of connections and architecture design

From:
Moreno Andreo
Date:
On 18/04/2017 18:51, Jeff Janes wrote:
[...]

The logs being clean doesn't help much if your log settings are too terse.

Is log_lock_waits on?
It's off
 log_checkpoints?
off
 track_io_timing (doesn't show up in the logs, you have to query database views)? 
Off (I've never set it)

Is log_min_duration_statement set to a reasonable value? 
default value (it's commented out)
log_autovacuum_min_duration?
default value (commented)

Are you using pg_stat_statements (also doesn't show up in the logs, you have to query it),
I'm using pg_stat_activity to keep track of the connections, their origin, and whether anything is waiting on a lock (waiting = true)... obviously, at any given moment I have only a maximum of 2 or 3 with state = 'active'; all the others are 'idle'.
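Concretely, it's a query along these lines (just a sketch; on 9.5
pg_stat_activity still has the boolean "waiting" column):

SELECT pid, datname, client_addr, state, waiting,
       now() - backend_start AS connection_age
FROM pg_stat_activity
ORDER BY backend_start;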
and perhaps auto_explain?
Never used it (I'll check the docs).
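For the record, all of the settings above can be changed without a
restart, e.g. via ALTER SYSTEM (available since 9.4); an illustrative
sketch, where the thresholds are arbitrary placeholders rather than
recommendations:

ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET track_io_timing = on;
ALTER SYSTEM SET log_min_duration_statement = '1s';
ALTER SYSTEM SET log_autovacuum_min_duration = '10s';
SELECT pg_reload_conf();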

Thanks
Moreno.
