Discussion: vacuumdb not letting me connect to db
Hi,

I have 160 GB of RAM, and postgres 9.6 is running on the server. After the upgrade I ran the command below:

"/usr/edb/as9.6/bin/vacuumdb" -U 'enterprisedb' --verbose --all -j 300 --analyze-only

After running that command I was not able to connect to the database using psql for a few minutes. After 20-30 minutes I was able to connect to the db, and at that time I checked pg_stat_activity; the active connections had dropped to 27. My max_connections is set to 700. I tried to find out the reason I could not connect to the db (while the 300 jobs were running) but still have no answer, so suggestions are welcome.
On Thu, 2021-02-04 at 16:56 +0530, Atul Kumar wrote:
> I have 160 GB of RAM, postgres 9.6 is running on the server.
>
> after upgrade I ran the below command:
>
> "/usr/edb/as9.6/bin/vacuumdb" -U 'enterprisedb' --verbose --all -j 300
> --analyze-only
>
> after running that command I was not able to connect the database
> using psql for few minutes.

That is to be expected. If you have 300 processes performing I/O and using CPU, your machine will certainly be overloaded.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
Hi,
The CPU and RAM are normal even with 300 jobs (only 1-4% consumption), but I don’t understand one thing here: if max_connections is set to 700, then why am I not able to connect to the db? The running jobs (300) are fewer than half of max_connections.
Regards
Atul
On Thursday, February 4, 2021, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Thu, 2021-02-04 at 16:56 +0530, Atul Kumar wrote:
> I have 160 GB of RAM, postgres 9.6 is running on the server.
>
> after upgrade I ran the below command:
>
> "/usr/edb/as9.6/bin/vacuumdb" -U 'enterprisedb' --verbose --all -j 300
> --analyze-only
>
> after running that command I was not able to connect the database
> using psql for few minutes.
That is to be expected.
If you have 300 processes performing I/O and using CPU, your machine
will certainly be overloaded.
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
On 2/4/21 5:26 AM, Atul Kumar wrote:
> Hi, I have 160 GB of RAM, postgres 9.6 is running on the server. after upgrade I ran the below command: "/usr/edb/as9.6/bin/vacuumdb" -U 'enterprisedb' --verbose --all -j 300 --analyze-only after running that command I was not able to connect the database using psql for few minutes.
What's the exact error message?
> After 20-30 minutes i was able to connect to the db and at that time I checked pg_stat_activity, the active connections was reduced to 27. my max_connections is set to 700. I tried to find out the reason for not being abled to connect the db (when 300 jobs were running) but still not got the answer. So
Connect to the cluster before running "vacuumdb -j300", and start looking at pg_stat_activity while vacuumdb is running.
--
Angular momentum makes the world go 'round.
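Ron's suggestion above (open a session before starting vacuumdb, then watch pg_stat_activity) could look roughly like the sketch below. The query runs against the standard pg_stat_activity view; the `enterprisedb` role comes from the thread, while the `postgres` database name and the watch interval are assumptions, and the psql command is only echoed here rather than executed:

```shell
# Sketch: monitor connection states while vacuumdb runs.
# pg_stat_activity is a standard PostgreSQL view; the role and
# database names below are assumptions based on the thread.
WATCH_SQL="SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
# In a pre-opened session you could run this every few seconds, e.g.:
#   watch -n 5 'psql -U enterprisedb -d postgres -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"'
echo "psql -U enterprisedb -d postgres -c \"${WATCH_SQL}\""
```

Because the session is opened before vacuumdb grabs its 300 connections, it keeps working even while new connection attempts stall.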
What about disk IO? That's what really gets saturated when running 300 threads.
On 2/4/21 11:00 AM, Atul Kumar wrote:
Hi,
The CPU and RAM are normal even on 300 jobs ( only 1-4% of consumption) but I don’t understand one thing here that if max_connections is set to 700 then why I am not able to connect the db. As the running jobs (300) are lesser than half of max_connections.
Regards,
Atul

On Thursday, February 4, 2021, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Thu, 2021-02-04 at 16:56 +0530, Atul Kumar wrote:
> I have 160 GB of RAM, postgres 9.6 is running on the server.
>
> after upgrade I ran the below command:
>
> "/usr/edb/as9.6/bin/vacuumdb" -U 'enterprisedb' --verbose --all -j 300
> --analyze-only
>
> after running that command I was not able to connect the database
> using psql for few minutes.
That is to be expected.
If you have 300 processes performing I/O and using CPU, your machine
will certainly be overloaded.
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
--
Angular momentum makes the world go 'round.
>The CPU and RAM are normal even on 300 jobs ( only 1-4% of consumption)
>but I don’t understand one thing here that if max_connections is set to 700 then
>why I am not able to connect the db. As the running jobs (300) are lesser than
>half of max_connections.

Please paste the error message.
There is no error message, when I try to connect the database while running vacuumdb with 300 jobs, it gets stuck.
On Thursday, February 4, 2021, Ravi Krishna <rkrishna_pg@aol.com> wrote:
>The CPU and RAM are normal even on 300 jobs ( only 1-4% of consumption)
>but I don’t understand one thing here that if max_connections is set to 700 then
>why I am not able to connect the db. As the running jobs (300) are lesser than
>half of max_connections.
Please paste the error message
On Thursday, February 4, 2021, Ron <ronljohnsonjr@gmail.com> wrote:
On 2/4/21 5:26 AM, Atul Kumar wrote:
> Hi, I have 160 GB of RAM, postgres 9.6 is running on the server. after upgrade I ran the below command: "/usr/edb/as9.6/bin/vacuumdb" -U 'enterprisedb' --verbose --all -j 300 --analyze-only after running that command I was not able to connect the database using psql for few minutes.

What's the exact error message?

> After 20-30 minutes i was able to connect to the db and at that time I checked pg_stat_activity, the active connections was reduced to 27. my max_connections is set to 700. I tried to find out the reason for not being abled to connect the db (when 300 jobs were running) but still not got the answer. So

Connect to the cluster before running "vacuumdb -j300", and start looking at pg_stat_activity while vacuumdb is running.
--
Angular momentum makes the world go 'round.
Your problem screams "IO saturation".
On 2/4/21 12:07 PM, Atul Kumar wrote:
There is no error message, when I try to connect the database while running vacuumdb with 300 jobs, it gets stuck.
On Thursday, February 4, 2021, Ravi Krishna <rkrishna_pg@aol.com> wrote:
>The CPU and RAM are normal even on 300 jobs ( only 1-4% of consumption)
>but I don’t understand one thing here that if max_connections is set to 700 then
>why I am not able to connect the db. As the running jobs (300) are lesser than
>half of max_connections.
Please paste the error message
--
Angular momentum makes the world go 'round.
>There is no error message, when I try to connect the database while
>running vacuumdb with 300 jobs, it gets stuck.

But you mentioned max_connections, which now seems to be a red herring. Based on your description, the impression I got was that you were getting a "sorry, too many clients already" error. As others have pointed out, the db is probably saturated with I/O.
Ok, how do I resolve it? Any suggestions?

On 2/5/21, Ron <ronljohnsonjr@gmail.com> wrote:
> Your problem screams "IO saturation".
>
> On 2/4/21 12:07 PM, Atul Kumar wrote:
>> There is no error message, when I try to connect the database while
>> running vacuumdb with 300 jobs, it gets stuck.
>>
>> On Thursday, February 4, 2021, Ravi Krishna <rkrishna_pg@aol.com> wrote:
>>
>> >The CPU and RAM are normal even on 300 jobs ( only 1-4% of consumption)
>> >but I don’t understand one thing here that if max_connections is set to 700 then
>> >why I am not able to connect the db. As the running jobs (300) are lesser than
>> >half of max_connections.
>>
>> Please paste the error message
>
> --
> Angular momentum makes the world go 'round.
Obviously... don't use 300 threads.

On 2/5/21 2:15 AM, Atul Kumar wrote:
> ok, How do I resolve it ?
>
> Any suggestions ?
>
> On 2/5/21, Ron <ronljohnsonjr@gmail.com> wrote:
>> Your problem screams "IO saturation".
>>
>> On 2/4/21 12:07 PM, Atul Kumar wrote:
>>> There is no error message, when I try to connect the database while
>>> running vacuumdb with 300 jobs, it gets stuck.
>>>
>>> On Thursday, February 4, 2021, Ravi Krishna <rkrishna_pg@aol.com> wrote:
>>>
>>> >The CPU and RAM are normal even on 300 jobs ( only 1-4% of consumption)
>>> >but I don’t understand one thing here that if max_connections is set to 700 then
>>> >why I am not able to connect the db. As the running jobs (300) are lesser than
>>> >half of max_connections.
>>>
>>> Please paste the error message
>>
>> --
>> Angular momentum makes the world go 'round.

--
Angular momentum makes the world go 'round.
On 2/5/21 9:11 AM, Ron wrote:
> Obviously... don't use 300 threads.

No, no Ron. Clearly the answer is more CPUs
On 2/5/21 10:22 AM, Rob Sargent wrote:
> On 2/5/21 9:11 AM, Ron wrote:
>> Obviously... don't use 300 threads.
>
> No, no Ron. Clearly the answer is more CPUs

I hope you're being sarcastic.

--
Angular momentum makes the world go 'round.
On 6 Feb 2021, at 3:37, Ron wrote:
> On 2/5/21 10:22 AM, Rob Sargent wrote:
>> On 2/5/21 9:11 AM, Ron wrote:
>>> Obviously... don't use 300 threads.
>>
>> No, no Ron. Clearly the answer is more CPUs
>
> I hope you're being sarcastic.

A reasonable conjecture… though there is the consideration that 300 CPU-intensive tasks spread across a given number of CPUs is going to waste some resources with context switching, i.e., need more CPUs :)

Basically, if there is plenty of wait time for I/O completion then CPU task switching can get more total work done. So far so obvious. In this thread I can see where it is disappointing to have a system considered capable of 700 connections getting saturated by a “mere” 300 threads. But this is only a “problem” if connections are equated to threads. PG max connection count is about external users having access to the resources needed to get a task done. Like all resource allocations this relies on estimated average usage, i.e., each connection only asks for a lot of CPU in brief bursts, and then the result is transmitted with a time lag before the connection makes another CPU demand. The system designer should use estimations about usage and load to budget and configure the system, and monitor it all against actual performance in the real world. Of course estimates are a standing request for outliers, and the system will show stress under an unexpected load.

So far I have not seen an analysis of where the bottleneck has occurred: CPU, RAM, HD, and/or the data bus connecting these. Some of these hardware resources may have maxed out to the extent that the system would not immediately pick up an additional work unit. As I see it, the OP started 300 CPU-intensive tasks on hardware intended for 700 connections. If the connection count was designed with, say, 50% CPU-intensive time per connection, you would expect this hardware to be fully saturated with 300 CPU-intensive tasks.

More than that, doing the task with 300 threads would probably take longer than (say) 200 threads, as the increased CPU context-swapping time is just wasted effort. The OP now has a choice: decrease threads or (seriously) upgrade the hardware. We in the gallery would love to see a plot of total time to completion as a function of threads invoked (50-300 in increments of 50), assuming the starting conditions are the same :)

Gavan Schneider
——
Gavan Schneider, Sodwalls, NSW, Australia
Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.
— H. L. Mencken, 1920
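Gavan's suggested measurement (total time to completion for thread counts from 50 to 300 in steps of 50) could be scripted along these lines. In this sketch the vacuumdb command (using the binary path quoted in the thread) is only echoed, not executed, and it assumes each round starts from the same database state:

```shell
# Sketch of Gavan's experiment: wall-clock time per --jobs value.
# Replace the echo with the real vacuumdb invocation, and restore
# the database to the same starting state between rounds.
for JOBS in 50 100 150 200 250 300; do
    START=$(date +%s)
    echo "/usr/edb/as9.6/bin/vacuumdb -U enterprisedb --all --analyze-only -j ${JOBS}"
    END=$(date +%s)
    echo "jobs=${JOBS} elapsed=$((END - START))s"
done
```

Plotting the elapsed times against the job counts should show where adding more workers stops helping.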
Hi Gavan,
Thanks for providing the details. I need more clarification on this: how should I work out the ideal number of connections to set, to avoid IO overhead, based on the available hardware resources?
How do I do this calculation?
Note: even during 300 threads, my RAM utilisation is totally normal.
Regards
Atul
On Saturday, February 6, 2021, Gavan Schneider <list.pg.gavan@pendari.org> wrote:
On 6 Feb 2021, at 3:37, Ron wrote:
> On 2/5/21 10:22 AM, Rob Sargent wrote:
>> On 2/5/21 9:11 AM, Ron wrote:
>>> Obviously... don't use 300 threads.
>>
>> No, no Ron. Clearly the answer is more CPUs
>
> I hope you're being sarcastic.

A reasonable conjecture… though there is the consideration that 300 CPU intensive tasks spread across a given number of CPUs is going to waste some resources with context switching, i.e., need more CPUs :)
Basically if there is plenty of wait time for I/O completion then CPU task switching can get more total work done. So far so obvious. In this thread I can see where it is disappointing to have a system considered capable of 700 connections getting saturated by a “mere” 300 threads. But this is only a “problem” if connections are equated to threads. PG max connection count is about external users having access to resources needed to get a task done. Like all resource allocations this relies on estimated average usage, i.e., each connection only asks for a lot of CPU in brief bursts and then the result is transmitted with a time lag before the connection makes another CPU demand. The system designer should use estimations about usage and load to budget and configure the system, and, monitor it all against actual performance in the real world. Of course estimates are a standing request for outliers and the system will show stress under an unexpected load.
So far I have not seen an analysis of where the bottleneck has occurred: CPU, RAM, HD, and/or the data bus connecting these. Some of these hardware resources may have maxed out to the extent that the system would not immediately pick up an additional work unit. As I see it, the OP started 300 CPU intensive tasks on hardware intended for 700 connections. If the connection count was designed with say 50% CPU intensive time per connection you would expect this hardware to be fully saturated with 300 CPU intensive tasks. More than that, doing the task with 300 threads would probably take longer than (say) 200 threads as the increased CPU context-swapping time is just wasted effort.
OP now has a choice: decrease threads or (seriously) upgrade the hardware. We in the gallery would love to see a plot of total time to completion as a function of threads invoked (50-300 increments of 50) assuming the starting conditions are the same :)
Gavan Schneider
——
Gavan Schneider, Sodwalls, NSW, Australia
Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.
— H. L. Mencken, 1920
On 6 Feb 2021, at 23:06, Atul Kumar wrote:
> Thanks for providing the details, I need more clarification on this as how
> should I analyze that what should be ideal no. of connections should we set
> to avoid IO overhead based on the available hardware resources.

How to do this analysis properly is outside my expertise and likely off topic for this list. But you can get an overall idea doing what I suggested: measure the performance (i.e., total time to completion) when using different numbers of threads. In an ideal world the time will get less as you invoke more threads, but there will be a point where there is a bottleneck (or some resource hits its limit) and the time will be more or less the same no matter how many threads.

> How to do this calculation ?

That’s likely too hard. Measurement will tell you what you need to know. Once you know how many threads it takes to saturate, you will know you need to use a lesser number. This may, or may not, get you to revisit the max 700 connections setting, but once you know there is a performance limit, and you decide it’s not good enough, then you can ask the hardware people for help.

> Note: even during 300 threads, my RAM utilisation is totally normal.

That’s not unexpected. The CPU activity should be using the data held in RAM, not spending all those cycles allocating RAM.

Gavan Schneider
——
Gavan Schneider, Sodwalls, NSW, Australia
Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.
— H. L. Mencken, 1920
> On Feb 5, 2021, at 9:37 AM, Ron <ronljohnsonjr@gmail.com> wrote:
>
>> On 2/5/21 10:22 AM, Rob Sargent wrote:
>>
>>> On 2/5/21 9:11 AM, Ron wrote:
>>> Obviously... don't use 300 threads.
>>
>> No, no Ron. Clearly the answer is more CPUs
>
> I hope you're being sarcastic.
>
> --
> Angular momentum makes the world go 'round.

Ron, I was being sarcastic, but I’ve gone over the thread and I do not see any mention of the number of cores on the server. If it has a single quad-core chip then the queue for each processor could be, on average, 75 deep at the outset, all of them anxiously awaiting data. -j connections are processes, correct? Not threads.

Atul, if you list the tools you know and have used for watching system performance, then others can perhaps suggest alternatives (or re-interpret results).
On 2/6/21 6:06 AM, Atul Kumar wrote:
Hi Gavan,
Thanks for providing the details, I need more clarification on this as how should I analyze that what should be ideal no. of connections should we set to avoid IO overhead based on the available hardware resources.
How to do this calculation ?
Run "iotop -o -u postgres", and then compare that with the total bandwidth available to the system. If it's (even almost) saturated, then everything else will be starved.
The "--jobs=" value should AT MOST be some *small* multiple of the number of CPUs (like 1x, 1.5x, or maybe 2x if the core count is low, and nothing else is running on the system).
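Ron's rule of thumb can be turned into a small calculation. A sketch, assuming GNU coreutils' nproc is available (the fallback of 4 cores and the choice of the 1.5x multiplier are illustration only):

```shell
# Sketch: derive --jobs from the core count rather than picking a
# large arbitrary value. 1.5x is the middle of Ron's suggested range.
CORES=$(nproc 2>/dev/null || echo 4)   # fall back to 4 if nproc is absent
JOBS=$(( (CORES * 3) / 2 ))            # integer ~1.5x the core count
[ "$JOBS" -ge 1 ] || JOBS=1
echo "suggested: vacuumdb --all --analyze-only -j ${JOBS}"
```

On the thread's hardware this would give a --jobs value in the tens at most, far below 300.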
Note: even during 300 threads, my RAM utilisation is totally normal.
Regards,
Atul
On Saturday, February 6, 2021, Gavan Schneider <list.pg.gavan@pendari.org> wrote:
On 6 Feb 2021, at 3:37, Ron wrote:
> On 2/5/21 10:22 AM, Rob Sargent wrote:
>> On 2/5/21 9:11 AM, Ron wrote:
>>> Obviously... don't use 300 threads.
>>
>> No, no Ron. Clearly the answer is more CPUs
>
> I hope you're being sarcastic.

A reasonable conjecture… though there is the consideration that 300 CPU intensive tasks spread across a given number of CPUs is going to waste some resources with context switching, i.e., need more CPUs :)
Basically if there is plenty of wait time for I/O completion then CPU task switching can get more total work done. So far so obvious. In this thread I can see where it is disappointing to have a system considered capable of 700 connections getting saturated by a “mere” 300 threads. But this is only a “problem” if connections are equated to threads. PG max connection count is about external users having access to resources needed to get a task done. Like all resource allocations this relies on estimated average usage, i.e., each connection only asks for a lot of CPU in brief bursts and then the result is transmitted with a time lag before the connection makes another CPU demand. The system designer should use estimations about usage and load to budget and configure the system, and, monitor it all against actual performance in the real world. Of course estimates are a standing request for outliers and the system will show stress under an unexpected load.
So far I have not seen an analysis of where the bottleneck has occurred: CPU, RAM, HD, and/or the data bus connecting these. Some of these hardware resources may have maxed out to the extent that the system would not immediately pick up an additional work unit. As I see it, the OP started 300 CPU intensive tasks on hardware intended for 700 connections. If the connection count was designed with say 50% CPU intensive time per connection you would expect this hardware to be fully saturated with 300 CPU intensive tasks. More than that, doing the task with 300 threads would probably take longer than (say) 200 threads as the increased CPU context-swapping time is just wasted effort.
OP now has a choice: decrease threads or (seriously) upgrade the hardware. We in the gallery would love to see a plot of total time to completion as a function of threads invoked (50-300 increments of 50) assuming the starting conditions are the same :)
Gavan Schneider
——
Gavan Schneider, Sodwalls, NSW, Australia
Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.
— H. L. Mencken, 1920
--
Angular momentum makes the world go 'round.
On 2021-02-06 12:54:11 -0600, Ron wrote:
> On 2/6/21 6:06 AM, Atul Kumar wrote:
>> Hi Gavan,
>> Thanks for providing the details, I need more clarification on this as how
>> should I analyze that what should be ideal no. of connections should we set
>> to avoid IO overhead based on the available hardware resources.
>> How to do this calculation ?
>
> Run "iotop -o -u postgres", and then compare that with the total bandwidth
> available to the system. If it's (even almost) saturated, then everything else
> will be starved.
>
> The "--jobs=" value should AT MOST be some *small* multiple of the number of
> CPUs (like 1x, 1.5x or maybe 2x if the core count is low, and nothing else is
> running on the system).

While this is generally good advice, Atul reported earlier in the thread that CPU consumption is only 1 to 4%. So the problem is clearly not related to CPU usage, but very probably (as some have already noted) to I/O.

You could use a similar rule of thumb for disks ("a small multiple of the number of disks"), but with SSDs this might not be very accurate. I think the only method to get to the optimal number of vacuums that can be run in parallel is to determine it experimentally: 300 is clearly too high, so try 150, 75, 37, etc. For each number, try to do some other work: is the performance acceptable? If yes, note the time until vacuum is finished. Finally, among those where the performance was acceptable, choose the value which was fastest.

(Note: If you do this on the same database, subsequent runs will benefit from work already done, so take the results with a grain of salt.)

hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
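Peter's halving experiment might be outlined like this. The vacuumdb call is echoed rather than executed in this sketch, and judging whether other work performs acceptably at each step remains a manual check:

```shell
# Sketch: try 300, 150, 75, 37, ... jobs, timing each run.
# Replace the echo with the real vacuumdb call, and remember Peter's
# caveat that repeat runs on the same database benefit from work
# already done in earlier rounds.
JOBS=300
while [ "$JOBS" -ge 1 ]; do
    START=$(date +%s)
    echo "vacuumdb --all --analyze-only -j ${JOBS}"   # real run goes here
    echo "jobs=${JOBS} took $(( $(date +%s) - START ))s"
    JOBS=$(( JOBS / 2 ))
done
```

Among the job counts where interactive performance stayed acceptable, the fastest run wins.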