Discussion: Problems with PG 9.3


Problems with PG 9.3

From: Dhruv Shukla
Date:
Hello,
We quite recently moved our production database from 8.4 to 9.3. It's been just two months, and the parallel backup and recovery features have been of great use to us.
But with 9.3 we have also run into a lot of trouble.

Some background:
We run loaders to load data into the DB. These loaders are Perl programs that run on a different server from the DB server. The amount of data loaded each day is quite high.

Problem:
Some of the loaders are performing badly: they get stuck, or an EOF occurs on the client connection and everything just dies. One piece gets stuck daily, but once killed and restarted it runs normally.

In one episode today, a loader got stuck for 2 hours 15 minutes. I killed the whole process, restarted it, and it finished within 10 minutes. What I observed was that the DB load hadn't changed or shifted significantly between the kill and the end of the program. At first we thought it was the loaders, but now it looks as if it's the Postgres database.

The piece that was stuck for 2 hours was trying to insert 37K rows into a 500K-row table, which is pretty small considering PostgreSQL's potential. But we can't solve this bottleneck.


Kindly help me find the solution, or the right place to dig to find the exact problem; it has been troubling some of our loaders for quite some time now.

Please let me know if you need further information regarding the settings on our DB server.

--
Regards
Dhruv

Re: Problems with PG 9.3

From: jayknowsunix@gmail.com
Date:
Have you tried attaching strace to one of the loaders? That might give a clue where it's getting stuck.
-
Jay

Sent from my iPad



Re: Problems with PG 9.3

From: Dhruv Shukla
Date:
Right now all we are doing is logging: on each statement of the SQL function we raise an INFO message.

This way we know at which step it is getting stuck. Today it got stuck on the INSERT statement.

And when we ran the same SQL function after killing the process, it ran normally without any issues.

-Dhruv





--
Regards
Dhruv
404-551-2578

Re: Problems with PG 9.3

From: Albe Laurenz
Date:
Dhruv Shukla wrote:
> Right now we are doing is only logs where on each statement for a sql function we are raising an info.
> 
> This way we know which step its getting stuck. Today it got stuck on the insert statement.
> 
> And we ran the same SQL function after killing the process, it ran normal without any issues.

That sounds a lot like locking problems.

Examine pg_locks when a process gets stuck and see if it is waiting for a lock
and who is holding the lock.

Yours,
Laurenz Albe
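For 9.3 (which predates pg_blocking_pids()), a query along these lines, adapted from the wiki's Lock_Monitoring page, shows who is waiting and who holds the conflicting lock; a sketch, joining on the lock columns relevant here:

```sql
-- Blocked sessions and the sessions holding the conflicting locks (PG 9.3):
SELECT blocked.pid           AS blocked_pid,
       blocked_act.query     AS blocked_query,
       blocking.pid          AS blocking_pid,
       blocking_act.query    AS blocking_query
FROM pg_locks blocked
JOIN pg_stat_activity blocked_act  ON blocked_act.pid = blocked.pid
JOIN pg_locks blocking
  ON  blocking.locktype       = blocked.locktype
  AND blocking.database       IS NOT DISTINCT FROM blocked.database
  AND blocking.relation       IS NOT DISTINCT FROM blocked.relation
  AND blocking.transactionid  IS NOT DISTINCT FROM blocked.transactionid
  AND blocking.pid           <> blocked.pid
JOIN pg_stat_activity blocking_act ON blocking_act.pid = blocking.pid
WHERE NOT blocked.granted     -- the waiter
  AND blocking.granted;       -- the holder
```

An empty result while a backend is stuck means it is not waiting on a heavyweight lock, which points elsewhere (I/O, network, or CPU).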

Re: Problems with PG 9.3

From: Dhruv Shukla
Date:
We have a script that emails us if any process is waiting on a lock, but nothing has come through.
Here is one PID that has been stuck since midnight and is still running; it doesn't look like it will finish anytime soon. I ran it manually under strace.

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
64144 postgres  20   0 81.2g 4.1g 4.1g R 100.0  1.1 537:59.86 postmaster

I even checked by running each of the lock-monitoring queries by hand, but nothing came through.
https://wiki.postgresql.org/wiki/Lock_Monitoring

CentOS 6.3, PostgreSQL 9.3.3, PostGIS.





--
Regards
Dhruv
404-551-2578

Re: Problems with PG 9.3

From: Dhruv Shukla
Date:
Other details about our configuration:
shared_buffers = 80GB                   # min 128kB
                                        # (change requires restart)
temp_buffers = 2GB                      # min 800kB
work_mem = 2GB                          # min 64kB
maintenance_work_mem = 16GB             # min 1MB

checkpoint_segments = 64                # in logfile segments, min 1, 16MB each
#wal_buffers = -1                       # min 32kB, -1 sets based on shared_buffers
#checkpoint_timeout = 5min              # range 30s-1h
#checkpoint_completion_target = 0.5     # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s               # 0 disables
effective_cache_size = 130GB


Let me know if you need more information.





--
Regards
Dhruv
404-551-2578

Re: Problems with PG 9.3

From: Kevin Grittner
Date:
Dhruv Shukla <dhruvshukla82@gmail.com> wrote:

> Other informational details about configurations are:

> shared_buffers = 80GB
> temp_buffers = 2GB
> work_mem = 2GB
> maintenance_work_mem = 16GB

Well, at the default max_connections of 100 that could easily
result in the server trying to allocate about 567GB of RAM.  If
your number of connections is more than 100, adjust proportionally
higher.  If the server doesn't have that, you could see extreme
swapping or various other problems as the OS tries to survive.

> effective_cache_size = 130GB

And the planner will be generating plans based on the assumption
that the combination of shared_buffers and the OS cache will have
this much RAM available for caching, so if you have max_connections
= 100 this configuration would only make sense on a machine with
RAM of 617GB or more, plus whatever is needed for the OS and
anything besides PostgreSQL that you want to run on the machine.

How much RAM is on the machine (or VM)?
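Kevin's back-of-the-envelope total can be sketched as a quick calculation. This is a rough lower bound under the stated assumptions, not his exact 567GB figure, since a single query can make several work_mem-sized allocations:

```sql
-- Rough worst-case RAM commitment (GB) for these settings at the
-- default max_connections = 100:
SELECT 80            -- shared_buffers
     + 100 * 2       -- temp_buffers: per connection, never released
     + 100 * 2       -- work_mem: at least one allocation per connection
     + 16            -- maintenance_work_mem
  AS worst_case_gb;  -- 496 GB, before the OS cache or anything else
```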

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Problems with PG 9.3

From: Dhruv Shukla
Date:
Kevin,
Currently we have max_connections set to 1000.

The server has 384GB of RAM.

4 octa-core CPUs, hyperthreaded, so 64 logical cores in total.


-Dhruv





--
Regards
Dhruv
404-551-2578

Re: Problems with PG 9.3

From: Dhruv Shukla
Date:
Some more information about memory:

$free -m
             total       used       free     shared    buffers     cached
Mem:        387551     295326      92224          0         49     207903
-/+ buffers/cache:      87373     300177
Swap:         4095         46       4049


$ cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/dm-1                               partition       4194296 48100   -1


The DB resides on its own server and isn't hosted on a VM.


-Dhruv





--
Regards
Dhruv
404-551-2578

Re: Problems with PG 9.3

From: Kevin Grittner
Date:
Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
> Kevin Grittner <kgrittn@ymail.com> wrote:
>> Dhruv Shukla <dhruvshukla82@gmail.com> wrote:

>>> Other informational details about configurations are:
>>> shared_buffers = 80GB
>>> temp_buffers = 2GB
>>> work_mem = 2GB
>>> maintenance_work_mem = 16GB

>> How much RAM is on the machine (or VM)?

> Currently we have a max connection setting for 1000 connections.

The temp_buffers setting is a limit *per connection*, and once RAM
for a temporary table is allocated to a connection it is never
released; so even when all connections are idle they could be
setting aside 2TB of RAM for possible use for caching temporary
tables.

The work_mem setting is an attempt to limit RAM per node of an
executing query; one connection can create many allocations of the
size set for work_mem.  Since not all queries have nodes that
require such allocations, and not all connections are necessarily
active at the same time, a general rule of thumb is to allow for
one work_mem allocation per connection allowed by max_connections.
So these settings can easily result in another 2TB of allocations,
beyond the temp_buffers mentioned above.

A high shared_buffers setting can result in "write gluts" at the OS
level when a high percentage of that memory becomes dirty and is
dumped to the OS in a short time.  This can result in the OS
appearing to "lock up" for as long as it takes it to flush all of
the dirty data to disk.  I don't know what your system is capable
of, but most database servers I've seen fall between 20MB/second
and 200MB/second.  That would correspond to apparent stalls lasting
between 6.8 minutes and 1.1 hours.  By tweaking the bgwriter
settings and the OS dirty settings you can drastically reduce this,
but I don't think you've indicated having done that, so 80GB can be
expected to cause apparent lockups of those durations.

A high shared_buffers setting makes you more vulnerable to long
stalls because of transparent huge page operations of the OS.

> And RAM on server is 384GB RAM.

And overcommitting RAM by a factor of more than 10x is not going to
be pretty in any event.

If I could not reduce max_connections, I would set work_mem to no
more than 100MB and temp_buffers to no more than 50MB.  I would
drop maintenance_work_mem to 2GB.  I would probably drastically
reduce shared_buffers and would most certainly make autovacuum and
bgwriter more aggressive than the default.

If you make those changes and still see a problem, only then is it
worth looking at other possible causes.
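As a postgresql.conf sketch, Kevin's suggestions might look like the following. The shared_buffers, bgwriter, and autovacuum values are illustrative assumptions, since he gives no exact numbers for those:

```
max_connections = 200              # lower still if a pooler is possible
work_mem = 100MB                   # Kevin's suggested ceiling
temp_buffers = 50MB                # Kevin's suggested ceiling
maintenance_work_mem = 2GB
shared_buffers = 8GB               # "drastically reduce"; exact value assumed
bgwriter_delay = 100ms             # more aggressive background writer (assumed)
bgwriter_lru_maxpages = 1000       # (assumed)
autovacuum_naptime = 15s           # more aggressive autovacuum (assumed)
```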

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Problems with PG 9.3

From: Dhruv Shukla
Date:
Kevin,
This has been valuable information. Thanks a lot for it.

I was able to reduce max_connections down to 200, but I wanted to run the other parameters by you before I go ahead and make changes.

Here is my dilemma: in the morning we run memory-intensive processes that easily use 4-10GB of resident memory each. I thought of using work_mem = 1GB, which would commit around 200GB of RAM. There are also other processes that store around 200K records in temporary tables.

A temp_buffers setting of 1GB sounds good, but I am slightly worried it could slow down those memory-intensive processes.

Planning to bring shared_buffers down from 80GB to 60GB.

OS:
vm.dirty_background_ratio = 10

Regarding vm.dirty_background_ratio, what value do you think would be reasonable for this kind of scenario?

Thanks,
Dhruv 





--
Regards
Dhruv
404-551-2578

Re: Problems with PG 9.3

From: Dhruv Shukla
Date:
Hello All,
We changed the settings to:

max_connections = 200
work_mem = 1GB
temp_buffers = 1GB
shared_buffers = 70GB
maintenance_work_mem = 2GB
effective_cache_size = 180GB
max_locks_per_transaction = 128
autovacuum_max_workers = 12
checkpoint_segments = 256

We also ran strace on one of the programs that was getting stuck and dying by itself. Here is the strace log (run from another machine):

poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0LSINFO\0C00000\0M            -"..., 16384, 0, NULL, NULL) = 77
write(2, "INFO:              ---> did 2\n", 30INFO:              ---> did 2
) = 30
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0MSINFO\0C00000\0M            -"..., 16384, 0, NULL, NULL) = 78
write(2, "INFO:              ---> did 14\n", 31INFO:              ---> did 14
) = 31
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>

Looking at the server logs on the Postgres server, we found the error:
FATAL: connection to client lost

There have been some improvements; a few of the DB functions that used to get stuck now run fine. But the main ones are still failing.

-Dhruv





--
Regards
Dhruv
404-551-2578

Re: Problems with PG 9.3

From: Scott Marlowe
Date:
On Tue, Aug 26, 2014 at 9:27 AM, Dhruv Shukla <dhruvshukla82@gmail.com> wrote:

> On viewing server logs on postgres server found the error as
> FATAL: connection to client lost
>
> There has been some improvements, like few of the getting stuck DB functions
> ran good. But the main ones are still failing.

This sounds like a firewall issue to me, but it could also be a backend
running out of memory and the Linux OOM killer killing it. I believe
the OOM killer logs to /var/log/syslog or /var/log/messages. Got
anything in those logs around that time from the OOM killer?

If not, it could be a firewall, if you have one, between the app and
DB boxes. If that's the case you can stop it from happening by
lowering the TCP keepalive settings.


Re: Problems with PG 9.3

From: Scott Marlowe
Date:

Please note that the OOM killer is deadly to a Postgres box, as it can
kill the postmaster, which is a bad thing. It should be disabled by
setting vm.overcommit_memory=2 in /etc/sysctl.conf and running
sysctl -p.

--
To understand recursion, one must first understand recursion.


Re: Problems with PG 9.3

From: Dhruv Shukla
Date:
Scott,

Nothing appeared in /var/log/messages about the OOM killer on either box, so it could well be a firewall issue; I will look into it and let you know. The current kernel TCP keepalive settings we have are:

tcp_keepalive_time=7200
tcp_keepalive_intvl=75
tcp_keepalive_probes=9

Thanks,
Dhruv Shukla.







--
Regards
Dhruv
404-551-2578

Re: Problems with PG 9.3

From: Scott Marlowe
Date:

In my experience, dropping the keepalive time to 300 and the retries to 2
will keep connections alive without sending out a flood of keepalive pings.
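On Linux those land in /etc/sysctl.conf; a sketch (the probe interval is an assumption, since Scott specifies only the idle time and the retry count):

```
# Detect dead peers in minutes rather than hours:
net.ipv4.tcp_keepalive_time = 300     # seconds idle before the first probe
net.ipv4.tcp_keepalive_probes = 2     # failed probes before the peer is declared dead
net.ipv4.tcp_keepalive_intvl = 30     # seconds between probes (assumed value)
# apply with: sysctl -p
```

PostgreSQL also exposes per-connection equivalents (tcp_keepalives_idle, tcp_keepalives_interval, tcp_keepalives_count) that can be set in postgresql.conf without touching the kernel defaults.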


Re: Problems with PG 9.3

From: Dhruv Shukla
Date:
It's now 15 hours since the DB was restarted, and things have started to get stuck again; they are apparently taking too long to finish with these settings. Any further suggestions?


-Dhruv.





--
Regards
Dhruv
404-551-2578

Re: Problems with PG 9.3

From: Scott Marlowe
Date:
On Tue, Aug 26, 2014 at 12:22 PM, Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
> Its 15 hours now ... that the DB was restarted and things have started to
> get stuck. Apparently taking too long to finish with these settings.... any
> further suggesstions??

Troubleshoot it while it's stuck. If your app isn't stopping /
erroring out when it loses its connection, then it's broken and someone
needs to code real error handling into it (or you're using a language
that's fundamentally broken in terms of handling network errors),
especially because with a lower TCP keepalive the app should be told
that the connection died in under 10 minutes.

So I'm going on the assumption that you're losing the connection. YOU
need to figure out why; tools like netstat and strace are useful here.
If a backend is crashing, there'll be an entry in the pg logs; if
networking is killing it, then maybe a firewall will have logs; if the
OOM killer is killing it, then the Linux logs on the db server will
say so. Use tools like sar, sysstat, Zabbix, and other monitoring
packages to see if you're running out of RAM and the OOM killer is
killing processes.

I assume you've lowered work_mem etc. down to something more
reasonable, like 16MB, and that you restarted the server after
dropping max_connections down to 200. Note that 200 is still far too
many, and you need to look into a connection pooler to reduce that
number to < 2x CPU cores. Anything over that is counterproductive and
likely to cause performance issues.

Using netstat -an, can you find matching connections from the stalled
machine to the DB server? If not, you've lost the network connection. If
there's no obvious cause in the PostgreSQL or system logs on the DB
server, then it's the network.
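A concrete form of that check (a sketch; 192.0.2.x addresses are placeholders for the real loader and DB hosts). The filtering is shown against a simulated netstat line so the pattern is clear:

```shell
# On the loader host, the real check would be:
#   netstat -an | grep '192.0.2.10:5432.*ESTABLISHED'
# Simulated netstat output line, for illustration only:
sample='tcp  0  0 192.0.2.20:41234  192.0.2.10:5432  ESTABLISHED'
echo "$sample" | grep -c '5432.*ESTABLISHED'   # prints 1: connection present
```

A count of 0 while the loader believes it is still connected means the TCP session is gone and the client never noticed.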


Re: Problems with PG 9.3

From
Dhruv Shukla
Date:
Scott,
Thanks for the valuable information. I will have a look into it and debug it.

But now things have really started slowing down. Some programs have started to throw errors like:

ERROR: DB COPY failed: DBD::Pg::db pg_endcopy
failed: ERROR:  out of shared memory
HINT:  You might need to increase max_locks_per_transaction.


These were things that worked before, with the higher memory settings. And some are just running slow: literally slow, even though netstat shows them still connected.

Thanks,
Dhruv






--
Regards
Dhruv
404-551-2578

Re: Problems with PG 9.3

From
Scott Marlowe
Date:
On Tue, Aug 26, 2014 at 1:32 PM, Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
> Scott,
> Thanks for such a valuable information. I will have a look into it and have
> it debugged.
>
> But now things have started slowing down really. Some programs have started
> to throw errors like
>
> ERROR: DB COPY failed: DBD::Pg::db pg_endcopy
> failed: ERROR:  out of shared memory
> HINT:  You might need to increase max_locks_per_transaction.
>
>
> These were things which were working before with higher memory settings. And
> some are just running slow, slow means literally slow even though they are
> connected via netstat and all.

OK, increasing max_locks_per_transaction is pretty safe. What did you
change shared_buffers from and to? Setting it a bit higher isn't terribly
dangerous, the way work_mem x max_connections can be.
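For reference on the "out of shared memory" HINT: PostgreSQL sizes its shared lock table as roughly max_locks_per_transaction * (max_connections + max_prepared_transactions) slots. With the pre-change values from this thread (max_prepared_transactions assumed 0, as it defaults to):

```shell
# Approximate number of object-lock slots in shared memory
max_locks_per_transaction=64
max_connections=200
max_prepared_transactions=0
echo $(( max_locks_per_transaction * (max_connections + max_prepared_transactions) ))
# prints 12800
```

A load job that touches many tables or partitions in one transaction can exhaust those slots, which is why raising max_locks_per_transaction (a restart-requiring setting) is the standard fix.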


Re: Problems with PG 9.3

From
Dhruv Shukla
Date:
Here are all the changes:

max_connections           = 1000  -> 200
work_mem                  = 4GB   -> 1GB
temp_mem                  = 2GB   -> 1GB
shared_buffers            = 80GB  -> 70GB
maintenance_work_mem      = 16GB  -> 2GB
effective_cache_size      = 130GB -> 180GB
max_locks_per_transaction = 64    -> 128
autovacuum_max_workers    = 3     -> 12
checkpoint_segments       = 64    -> 256

-Dhruv





--
Regards
Dhruv
404-551-2578

Re: Problems with PG 9.3

From
Scott Marlowe
Date:
On Tue, Aug 26, 2014 at 1:44 PM, Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
> Here are all the changes
>
> max_connections               =1000 -> 200
> work_mem                        =4GB -> 1GB
> temp_mem                        =2GB -> 1GB
> shared_buffers                    =80GB -> 70GB
> maintenance_work_mem     = 16GB -> 2GB
> effective_cache_size           =130GB -> 180GB
> max_locks_per_transaction =64 -> 128
> autovacuum_max_workers   = 3 -> 12
> checkpoint_segments         = 64 -> 256

OK, so while this is happening, what does memory look like on the DB box?
Is it running low on memory, or does it have plenty left over for the
kernel cache? Is it releasing a lot of kernel cache to make room for
work_mem? You need to profile your server while this is happening.

vmstat
iostat
htop

are good tools to start with.

Are you running out of memory, CPU, or IO?
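For the memory question specifically, `vmstat 5` is a quick first pass: the si/so columns (swap-in/swap-out per second) being nonzero while a loader is stuck points at memory pressure. A sketch of pulling those columns out; the data line below is made up for illustration:

```shell
# Real check: vmstat 5   (watch columns 7 and 8, si/so)
# Simulated vmstat data line:
line=' 2  0      0 123456  7890 456789    0    0     5    20  100  200  5  1 93  1  0'
echo "$line" | awk '{print $7, $8}'   # si so; "0 0" means no swapping
```

For IO, `iostat -x 5` per-device %util near 100 means the disks are the bottleneck rather than memory or CPU.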


Re: Problems with PG 9.3

From
Scott Marlowe
Date:
So have you had any luck so far?