Discussion: Scalability in postgres

From:
Fabrix
Date:


Hi.

Has anyone had experience of bad performance with PostgreSQL on a server with many processors?

I have a server with 4 dual-core CPUs that gives me very good performance, but I have experienced problems with another server that has 8 quad-core CPUs (32 cores). The second one only gives me about 1.5x the performance of the first.

Monitoring (nmon, htop, vmstat) shows that everything is fine (memory, disk, network, etc.) except that the processors regularly climb to 100%.

I can see that the processes are waiting for CPU time:

vmstat 1

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 47123020 117320 17141932    0    0     8   416 1548 2189  1  1 99  0  0
 0  0      0 47121904 117328 17141940    0    0     8   148 1428 2107  1  1 98  0  0
 0  0      0 47123144 117336 17141956    0    0     8   172 1391 1930  1  0 99  0  0
 0  0      0 47124756 117352 17141940    0    0     8   276 1327 2171  1  1 98  0  0
 0  0      0 47118556 117360 17141956    0    0     0   100 1452 2254  1  1 98  0  0
 2  0      0 47120364 117380 17141952    0    0     8   428 1526 2477  1  0 99  0  0
 1  0      0 47119372 117388 17141972    0    0     0   452 1581 2662  1  1 98  0  0
 0  0      0 47117948 117396 17141988    0    0    16   468 1705 3243  1  1 97  0  0
 0  0      0 47116708 117404 17142020    0    0     0   268 1610 2115  1  1 99  0  0
 0  0      0 47119688 117420 17142044    0    0     0   200 1545 1810  1  1 98  0  0
318  0      0 47116464 117440 17142052    0    0     0   532 1416 2396  1  0 99  0  0
500  0      0 47115224 117440 17142052    0    0     0     0 1118 322144 91  5  4  0  0
440  0      0 47114728 117440 17142044    0    0     0     0 1052 333137 90  5  5  0  0
339  0      0 47114484 117440 17142048    0    0     0     0 1061 337528 85  4 11  0  0
179  0      0 47114112 117440 17142048    0    0     0     0 1066 312873 71  4 25  0  0
 5  1      0 47122180 117468 17142028    0    0   192  3128 1958 136804 23  2 75  1  0
 3  0      0 47114264 117476 17142968    0    0   608  5828 2688 4684  7  2 89  2  0
 0  1      0 47109940 117484 17142876    0    0   512  5084 2248 3727  3  1 94  2  0
 0  1      0 47119692 117500 17143816    0    0   520  4976 2231 2941  2  1 95  2  0
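
For reference, the spike rows in the output above can be picked out mechanically; a sketch that replays three of the sample lines through awk (the thresholds, 32 runnable processes and 100k context switches/s, are illustrative choices, not values from this thread):

```shell
# Count vmstat rows where the run queue (col 1) exceeds the core count
# or context switches (col 12) explode, using an excerpt from the sample above.
flag_spikes() {
  awk '$1 ~ /^[0-9]+$/ && ($1 > 32 || $12 > 100000) { n++ } END { print n+0 }'
}
spikes=$(flag_spikes <<'EOF'
  0  0      0 47119688 117420 17142044    0    0     0   200 1545 1810  1  1 98  0  0
500  0      0 47115224 117440 17142052    0    0     0     0 1118 322144 91  5  4  0  0
  5  1      0 47122180 117468 17142028    0    0   192  3128 1958 136804 23  2 75  1  0
EOF
)
echo "$spikes spike rows"   # the second row trips both tests, the third trips the cs test
```

The same filter can be attached to a live `vmstat 1` stream to log only the problem seconds.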


Does PostgreSQL have locking problems or performance degradation with many CPUs?
Any comments?

RHEL 5.2, PG 8.3.6, 64-bit.

Thanks for the reply.
Regards...
From:
David Rees
Date:

On Thu, May 28, 2009 at 11:50 AM, Fabrix <> wrote:
> Monitoring (nmon, htop, vmstat) see that everything is fine (memory, HD,
> eth, etc) except that processors regularly climb to 100%.

What kind of load are you putting the server under when this happens?

> I can see that the processes are waiting for CPU time:
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  0  0      0 47119688 117420 17142044    0    0     0   200 1545 1810  1  1 98  0  0
> 318  0      0 47116464 117440 17142052    0    0     0   532 1416 2396  1  0 99  0  0
> 500  0      0 47115224 117440 17142052    0    0     0     0 1118 322144 91  5  4  0  0
> 440  0      0 47114728 117440 17142044    0    0     0     0 1052 333137 90  5  5  0  0
> 339  0      0 47114484 117440 17142048    0    0     0     0 1061 337528 85  4 11  0  0
> 179  0      0 47114112 117440 17142048    0    0     0     0 1066 312873 71  4 25  0  0
>  5  1      0 47122180 117468 17142028    0    0   192  3128 1958 136804 23  2 75  1  0
>  3  0      0 47114264 117476 17142968    0    0   608  5828 2688 4684  7  2 89  2  0

Wow, that's some serious context-switching right there - 300k context
switches a second mean that the processors are spending a lot of their
time fighting for CPU time instead of doing any real work.

It appears that you have the server configured with a very high number
of connections as well?  My first suggestion would be to look at a way
to limit the number of active connections to the server at a time
(pgPool or similar).
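
A minimal sketch of what such a pooling layer can look like, using pgBouncer (named later in this thread) in transaction-pooling mode; the database name, file paths, and pool sizes below are illustrative assumptions, not values from this setup:

```ini
; pgbouncer.ini -- illustrative values only
[databases]
; "mydb" is a placeholder database name
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction      ; hand out a server connection per transaction
max_client_conn = 5000       ; thousands of clients may stay connected...
default_pool_size = 50       ; ...while only this many reach PostgreSQL per db/user
```

Applications then connect to port 6432 instead of 5432, and PostgreSQL's own max_connections can drop accordingly.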

-Dave

From:
Scott Marlowe
Date:

On Thu, May 28, 2009 at 12:50 PM, Fabrix <> wrote:
>
> HI.
>
> Someone had some experience of bad performance with postgres in some server
> with many processors?

Seems to depend on the processors and chipset a fair bit.

> I have a server with 4 CPUS dual core  and gives me a very good performance
> but I have experienced problems with another server that has 8 CPUS quad
> core (32 cores). The second one only gives me about 1.5 of performance of
> the first one.

What model CPUs and chipset on the mobo I wonder?

> Monitoring (nmon, htop, vmstat) see that everything is fine (memory, HD,
> eth, etc) except that processors regularly climb to 100%.
>
> I can see that the processes are waiting for CPU time:
>
> vmstat 1
>
> procs -----------memory---------- ---swap-- -----io---- --system--
> -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id
> wa st
>  0  0      0 47123020 117320 17141932    0    0     8   416 1548 2189  1  1
> 99  0  0
>  0  0      0 47121904 117328 17141940    0    0     8   148 1428 2107  1  1
> 98  0  0
>  0  0      0 47123144 117336 17141956    0    0     8   172 1391 1930  1  0
> 99  0  0
>  0  0      0 47124756 117352 17141940    0    0     8   276 1327 2171  1  1
> 98  0  0
>  0  0      0 47118556 117360 17141956    0    0     0   100 1452 2254  1  1
> 98  0  0
>  2  0      0 47120364 117380 17141952    0    0     8   428 1526 2477  1  0
> 99  0  0
>  1  0      0 47119372 117388 17141972    0    0     0   452 1581 2662  1  1
> 98  0  0
>  0  0      0 47117948 117396 17141988    0    0    16   468 1705 3243  1  1
> 97  0  0
>  0  0      0 47116708 117404 17142020    0    0     0   268 1610 2115  1  1
> 99  0  0
>  0  0      0 47119688 117420 17142044    0    0     0   200 1545 1810  1  1
> 98  0  0
> 318  0      0 47116464 117440 17142052    0    0     0   532 1416 2396  1  0
> 99  0  0
> 500  0      0 47115224 117440 17142052    0    0     0     0 1118 322144 91
> 5  4  0  0
> 440  0      0 47114728 117440 17142044    0    0     0     0 1052 333137 90
> 5  5  0  0
> 339  0      0 47114484 117440 17142048    0    0     0     0 1061 337528 85
> 4 11  0  0
> 179  0      0 47114112 117440 17142048    0    0     0     0 1066 312873 71
> 4 25  0  0
>  5  1      0 47122180 117468 17142028    0    0   192  3128 1958 136804 23
> 2 75  1  0
>  3  0      0 47114264 117476 17142968    0    0   608  5828 2688 4684  7  2
> 89  2  0
>  0  1      0 47109940 117484 17142876    0    0   512  5084 2248 3727  3  1
> 94  2  0
>  0  1      0 47119692 117500 17143816    0    0   520  4976 2231 2941  2  1
> 95  2  0
>
>
> Have postgres problems of lock or degradation of performance with many
> CPU's?
> Any comments?

Looks like a context switch storm, which was pretty common on older
Xeon CPUs.  I imagine with enough pg processes running on enough CPUs
it could still be a problem.

From:
Fabrix
Date:

Thanks David...



2009/5/28 David Rees <>
On Thu, May 28, 2009 at 11:50 AM, Fabrix <> wrote:
> Monitoring (nmon, htop, vmstat) see that everything is fine (memory, HD,
> eth, etc) except that processors regularly climb to 100%.

What kind of load are you putting the server under when this happens?

I have many Windows clients connecting to the database over ODBC; they select, insert, and update data. All of these operations normally answer very quickly, in less than 1 second, and are well optimized. But when the processors go up to 100%, the same kinds of queries take from 10 to 18 seconds.


> I can see that the processes are waiting for CPU time:
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  0  0      0 47119688 117420 17142044    0    0     0   200 1545 1810  1  1 98  0  0
> 318  0      0 47116464 117440 17142052    0    0     0   532 1416 2396  1  0 99  0  0
> 500  0      0 47115224 117440 17142052    0    0     0     0 1118 322144 91  5  4  0  0
> 440  0      0 47114728 117440 17142044    0    0     0     0 1052 333137 90  5  5  0  0
> 339  0      0 47114484 117440 17142048    0    0     0     0 1061 337528 85  4 11  0  0
> 179  0      0 47114112 117440 17142048    0    0     0     0 1066 312873 71  4 25  0  0
>  5  1      0 47122180 117468 17142028    0    0   192  3128 1958 136804 23  2 75  1  0
>  3  0      0 47114264 117476 17142968    0    0   608  5828 2688 4684  7  2 89  2  0

Wow, that's some serious context-switching right there - 300k context
switches a second mean that the processors are spending a lot of their
time fighting for CPU time instead of doing any real work.

It appears that you have the server configured with a very high number
of connections as well?  My first suggestion would be to look at a way
to limit the number of active connections to the server at a time
(pgPool or similar).

Yes, I have max_connections = 5000.
I can lower it, but I need at least 3500 connections.
 


-Dave

From:
Scott Mead
Date:

On Thu, May 28, 2009 at 4:53 PM, Fabrix <> wrote:



Wow, that's some serious context-switching right there - 300k context
switches a second mean that the processors are spending a lot of their
time fighting for CPU time instead of doing any real work.

  There is a bug in the quad-core chips under a massive number of connections that will cause all cores to go to 100% utilization with no work getting done.  I'm digging to find links, but if I remember correctly, the only way to fix it was to disable the 4th core in Linux (it involved some black magic in /proc).  You really need to lower the number of processes you're forcing each processor bus to switch through (or switch to AMD's HyperTransport bus).
 


It appears that you have the server configured with a very high number
of connections as well?  My first suggestion would be to look at a way
to limit the number of active connections to the server at a time
(pgPool or similar).

Yes, I have max_connections = 5000.
I can lower it, but I need at least 3500 connections.

Typically, it's a bad idea to run with anything over 1000 connections (many will suggest lower than that).  If you need that many connections, you'll want to look at a connection pool like pgBouncer or pgPool.

--Scott

 
From:
Scott Marlowe
Date:

On Thu, May 28, 2009 at 2:53 PM, Fabrix <> wrote:
> yes, i have max_connections = 5000
> can lower, but at least i need 3500 connections


Whoa, that's a lot.  Can you look into connection pooling of some sort?

From:
Fabrix
Date:

Thanks Scott

2009/5/28 Scott Marlowe <>
On Thu, May 28, 2009 at 12:50 PM, Fabrix <> wrote:
>
> HI.
>
> Someone had some experience of bad performance with postgres in some server
> with many processors?

Seems to depend on the processors and chipset a fair bit.

> I have a server with 4 CPUS dual core  and gives me a very good performance
> but I have experienced problems with another server that has 8 CPUS quad
> core (32 cores). The second one only gives me about 1.5 of performance of
> the first one.

What model CPUs and chipset on the mobo I wonder?

I have two servers of this type that I tested, an HP and an IBM. The HP gives me better performance; the IBM has already been discarded.

Server HP:
cat /proc/cpuinfo
processor    : 0
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 2
model name    : Quad-Core AMD Opteron(tm) Processor 8360 SE
stepping    : 3
cpu MHz        : 2500.091
cache size    : 512 KB
physical id    : 0
siblings    : 4
core id        : 0
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips    : 5004.94
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]


Server IBM:
 cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Xeon(R) CPU           X7350  @ 2.93GHz
stepping    : 11
cpu MHz        : 2931.951
cache size    : 4096 KB
physical id    : 3
siblings    : 4
core id        : 0
cpu cores    : 4
apicid        : 12
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips    : 5867.00
clflush size    : 64
cache_alignment    : 64
address sizes    : 40 bits physical, 48 bits virtual
power management:



> Monitoring (nmon, htop, vmstat) see that everything is fine (memory, HD,
> eth, etc) except that processors regularly climb to 100%.
>
> I can see that the processes are waiting for CPU time:
>
> vmstat 1
>
> procs -----------memory---------- ---swap-- -----io---- --system--
> -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id
> wa st
>  0  0      0 47123020 117320 17141932    0    0     8   416 1548 2189  1  1
> 99  0  0
>  0  0      0 47121904 117328 17141940    0    0     8   148 1428 2107  1  1
> 98  0  0
>  0  0      0 47123144 117336 17141956    0    0     8   172 1391 1930  1  0
> 99  0  0
>  0  0      0 47124756 117352 17141940    0    0     8   276 1327 2171  1  1
> 98  0  0
>  0  0      0 47118556 117360 17141956    0    0     0   100 1452 2254  1  1
> 98  0  0
>  2  0      0 47120364 117380 17141952    0    0     8   428 1526 2477  1  0
> 99  0  0
>  1  0      0 47119372 117388 17141972    0    0     0   452 1581 2662  1  1
> 98  0  0
>  0  0      0 47117948 117396 17141988    0    0    16   468 1705 3243  1  1
> 97  0  0
>  0  0      0 47116708 117404 17142020    0    0     0   268 1610 2115  1  1
> 99  0  0
>  0  0      0 47119688 117420 17142044    0    0     0   200 1545 1810  1  1
> 98  0  0
> 318  0      0 47116464 117440 17142052    0    0     0   532 1416 2396  1  0
> 99  0  0
> 500  0      0 47115224 117440 17142052    0    0     0     0 1118 322144 91
> 5  4  0  0
> 440  0      0 47114728 117440 17142044    0    0     0     0 1052 333137 90
> 5  5  0  0
> 339  0      0 47114484 117440 17142048    0    0     0     0 1061 337528 85
> 4 11  0  0
> 179  0      0 47114112 117440 17142048    0    0     0     0 1066 312873 71
> 4 25  0  0
>  5  1      0 47122180 117468 17142028    0    0   192  3128 1958 136804 23
> 2 75  1  0
>  3  0      0 47114264 117476 17142968    0    0   608  5828 2688 4684  7  2
> 89  2  0
>  0  1      0 47109940 117484 17142876    0    0   512  5084 2248 3727  3  1
> 94  2  0
>  0  1      0 47119692 117500 17143816    0    0   520  4976 2231 2941  2  1
> 95  2  0
>
>
> Have postgres problems of lock or degradation of performance with many
> CPU's?
> Any comments?

Looks like a context switch storm, which was pretty common on older
Xeon CPUs.  I imagine with enough pg processes running on enough CPUs
it could still be a problem.

These CPUs are very new.

From:
Fabrix
Date:



2009/5/28 Scott Mead <>
On Thu, May 28, 2009 at 4:53 PM, Fabrix <> wrote:



Wow, that's some serious context-switching right there - 300k context
switches a second mean that the processors are spending a lot of their
time fighting for CPU time instead of doing any real work.

  There is a bug in the quad-core chips under a massive number of connections that will cause all cores to go to 100% utilization with no work getting done.  I'm digging to find links, but if I remember correctly, the only way to fix it was to disable the 4th core in Linux (it involved some black magic in /proc).  You really need to lower the number of processes you're forcing each processor bus to switch through (or switch to AMD's HyperTransport bus).

The HP server already has AMD processors.
The server with 4 dual-core CPUs had max_connections = 5000 too, but the maximum number of connections at a time was 1800, and it worked very well.

If you find the link about the quad-core bug I would greatly appreciate it.
 


It appears that you have the server configured with a very high number
of connections as well?  My first suggestion would be to look at a way
to limit the number of active connections to the server at a time
(pgPool or similar).

Yes, I have max_connections = 5000.
I can lower it, but I need at least 3500 connections.

Typically, it's a bad idea to run with anything over 1000 connections (many will suggest lower than that).  If you need that many connections, you'll want to look at a connection pool like pgBouncer or pgPool.


PostgreSQL does not support more than 1000, even if the server is very robust?
I will try to lower it... and I already have a pool (not pgPool and not pgBouncer). I have distributed all the connections across three servers :).

--Scott

 

From:
Flavio Henrique Araque Gurgel
Date:

----- "Scott Marlowe" <> wrote:
> On Thu, May 28, 2009 at 12:50 PM, Fabrix <> wrote:
> >
> > HI.
> >
> > Someone had some experience of bad performance with postgres in some server
> > with many processors?

I had.

> > but I have experienced problems with another server that has 8 CPUS quad
> > core (32 cores). The second one only gives me about 1.5 of performance of
> > the first one.

I have had problems with 4 dual-core CPUs with Hyper-Threading (16 logical CPUs).

> What model CPUs and chipset on the mobo I wonder?
>
> > Monitoring (nmon, htop, vmstat) see that everything is fine (memory, HD,
> > eth, etc) except that processors regularly climb to 100%.
> >
> > I can see that the processes are waiting for CPU time:

> > Have postgres problems of lock or degradation of performance with many
> > CPU's?
> > Any comments?
>
> Looks like a context switch storm, which was pretty common on older
> Xeon CPUs.  I imagine with enough pg processes running on enough CPUs
> it could still be a problem.

I would ask for your kernel version. uname -a please?

It was possible to make the context work better with 2.4.24 with kswapd patched around here. 1600 connections working fine at this moment.
Try to lower your memory requirements too. Linux kernel needs some space to page and scale up. Install some more memory otherwise.

Flavio

From:
Fabrix
Date:



2009/5/28 Flavio Henrique Araque Gurgel <>
----- "Scott Marlowe" <> wrote:
> On Thu, May 28, 2009 at 12:50 PM, Fabrix <> wrote:
> >
> > HI.
> >
> > Someone had some experience of bad performance with postgres in some server
> > with many processors?

I had.


> > but I have experienced problems with another server that has 8 CPUS quad
> > core (32 cores). The second one only gives me about 1.5 of performance of
> > the first one.

I have had problems with 4 dual-core CPUs with Hyper-Threading (16 logical CPUs).


> What model CPUs and chipset on the mobo I wonder?
>
> > Monitoring (nmon, htop, vmstat) see that everything is fine (memory, HD,
> > eth, etc) except that processors regularly climb to 100%.
> >
> > I can see that the processes are waiting for CPU time:

> > Have postgres problems of lock or degradation of performance with many
> > CPU's?
> > Any comments?
>
> Looks like a context switch storm, which was pretty common on older
> Xeon CPUs.  I imagine with enough pg processes running on enough CPUs
> it could still be a problem.

I would ask for your kernel version. uname -a please?

Sure, and thanks for your answer, Flavio...

uname -a
Linux SERVIDOR-A 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
 
cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.2 (Tikanga)



It was possible to make the context work better with 2.4.24 with kswapd patched around here. 1600 connections working fine at this moment.

2.4 is very old, isn't it?

Try to lower your memory requirements too. Linux kernel needs some space to page and scale up. Install some more memory otherwise.

How much?
I already have a lot of memory installed in the server: 128GB.


Flavio


From:
Scott Marlowe
Date:

On Thu, May 28, 2009 at 7:04 PM, Fabrix <> wrote:
>> I would ask for your kernel version. uname -a please?
>
> sure, and thanks for you answer Flavio...
>
> uname -a
> Linux SERVIDOR-A 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64
> x86_64 x86_64 GNU/Linux
>
> cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 5.2 (Tikanga)

I'm running the same thing for an 8 core opteron (dual x4) and I have
gotten better numbers from later distros (ubuntu 8.04 and 8.10) but I
just don't trust them yet as server OSes for a database.

>> It was possible to make the context work better with 2.4.24 with kswapd
>> patched around here. 1600 connections working fine at this moment.
>
> 2.4 is very old, or not?

I'm sure he meant 2.6.24

There's been a LOT of work on the stuff that was causing context
switch storms from the 2.6.18 to the latest 2.6 releases.

From:
Flavio Henrique Araque Gurgel
Date:

I would ask for your kernel version. uname -a please?

> sure, and thanks for you answer Flavio...
>

> uname -a
> Linux SERVIDOR-A 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>  
> cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 5.2 (Tikanga)
>

I had the same problem you're describing with Debian Etch 2.6.18 when the system needed more than 1000 connections.

>
>
>
> It was possible to make the context work better with 2.4.24 with kswapd patched around here. 1600 connections working fine at this moment.

> 2.4 is very old, or not?

My mistake. It is 2.6.24.
We had to apply the kswapd patch also. It's especially important if you see your system % going as high as 99% in top and you are losing control of the machine. I have read that 2.6.28 had this patch accepted in mainline.

>
> Try to lower your memory requirements too. Linux kernel needs some space to page and scale up. Install some more memory otherwise.
>

> how much?
> already I have a lot of  memory installed in the server 128GB.

Here we have 16GB. I had to limit PostgreSQL's memory requirements (shared_buffers + max_connections * work_mem) to about 40% of RAM. effective_cache_size was not an issue, and about 30% of RAM is working fine; of course the right cache figure depends on your context.
Since we have fast queries with low memory requirements for sorting or nested loops, 1.5MB for work_mem was enough around here. 2GB of shared_buffers worked like a charm, but it's too low for the indexes I work with and I'm planning to increase it when I have more RAM.
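
Flavio's budget (shared_buffers + max_connections * work_mem vs. total RAM) works out roughly like this for the figures he gives; a sketch in shell arithmetic, using KB to stay in integers:

```shell
# Flavio's figures: 16GB RAM, 2GB shared_buffers, 1600 connections, 1.5MB work_mem
ram_kb=$((16 * 1024 * 1024))
shared_buffers_kb=$((2 * 1024 * 1024))
work_mem_kb=1536                      # 1.5MB
connections=1600
# Worst case: every connection uses a full work_mem allocation at once
budget_kb=$((shared_buffers_kb + connections * work_mem_kb))
pct=$((100 * budget_kb / ram_kb))
echo "worst-case budget: ${pct}% of RAM"   # ~27%, under his 40% ceiling
```

Note that work_mem is a per-sort-or-hash limit, so a single complex query can use it more than once; this is a floor on the true worst case, not an exact bound.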

Flavio
>
From:
Greg Smith
Date:

On Thu, 28 May 2009, Flavio Henrique Araque Gurgel wrote:

> It is 2.6.24 We had to apply the kswapd patch also. It's important
> specially if you see your system % going as high as 99% in top and
> loosing the machine's control. I have read something about 2.6.28 had
> this patch accepted in mainstream.

It would help if you gave more specific information about what you're
talking about.  I know there was a bunch of back and forth on the "kswapd
should only wait on IO if there is IO" patch, where it was committed and
then reverted etc, but it's not clear to me if that's what you're talking
about--and if so, what that has to do with the context switch problem.

Back to Fabrix's problem.  You're fighting a couple of losing battles
here.  Let's go over the initial list:

1) You have 32 cores.  You think they should be allowed to schedule
>3500 active connections across them.  That doesn't work, and what happens
is exactly the sort of context switch storm you're showing data for.
Think about it for a minute:  how many of those can really be doing work
at any time?  32, that's how many.  Now, you need some multiple of the
number of cores to try to make sure everybody is always busy, but that
multiple should be closer to 10X the number of cores rather than 100X.
You need to adjust the connection pool ratio so that the PostgreSQL
max_connections is closer to 500 than 5000, and this is by far the most
critical thing for you to do.  The PostgreSQL connection handler is known
to be bad at handling high connection loads compared to the popular
pooling projects, so you really shouldn't throw this problem at it.
While kernel problems stack on top of that, you really shouldn't start at
kernel fixes; nail the really fundamental and obvious problem first.
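
As a back-of-the-envelope version of the sizing rule above (a sketch; 32 cores and the 10x multiple are the numbers from this message):

```shell
# Greg's rule of thumb: a connection limit closer to ~10x cores than ~100x.
cores=32
multiple=10
pool_size=$((cores * multiple))
echo "target max_connections: roughly $pool_size"   # -> roughly 320, i.e. closer to 500 than 5000
```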

2) You have very new hardware and a very old kernel.  Once you've done the
above, if you're still not happy with performance, at that point you
should consider using a newer one.  It's fairly simple to build a Linux
kernel using the same basic kernel parameters as the stock RedHat one.
2.6.28 is six months old now, is up to 2.6.28.10, and has gotten a lot
more testing than most kernels due to it being the Ubuntu 9.04 default.
I'd suggest you try out that version.

3) A system with 128GB of RAM is in a funny place where by using the
defaults or the usual rules of thumb for a lot of parameters ("set
shared_buffers to 1/4 of RAM") are all bad ideas.  shared_buffers seems to
top out its usefulness around 10GB on current generation
hardware/software, and some Linux memory tunables have defaults on 2.6.18
that are insane for your system; vm_dirty_ratio at 40 comes to mind as the
one I run into most.  Some of that gets fixed just by moving to a newer
kernel, some doesn't.  Again, these aren't the problems you're having now
though; they're the ones you'll have in the future *if* you fix the more
fundamental problems first.
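
In sysctl terms the tunable Greg mentions is vm.dirty_ratio; a hedged sketch of checking and lowering it (the value 10 is a common suggestion for large-memory boxes, an assumption rather than a figure from this thread):

```shell
# Show the current value (40 was the 2.6.18-era default mentioned above)
sysctl vm.dirty_ratio
# Lower it so writeback starts before an enormous amount of dirty RAM piles up
sysctl -w vm.dirty_ratio=10
# Make the setting survive a reboot
echo 'vm.dirty_ratio = 10' >> /etc/sysctl.conf
```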

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
Grzegorz Jaśkiewicz
Date:

On Fri, May 29, 2009 at 2:54 AM, Greg Smith <> wrote:

>  The PostgreSQL connection handler is known to be bad at handling high
> connection loads compared to the popular pooling projects, so you really
> shouldn't throw this problem at it. While kernel problems stack on top of
> that, you really shouldn't start at kernel fixes; nail the really
> fundamental and obvious problem first.

if it is implemented somewhere else better, shouldn't that make it
obvious that postgresql should solve it internally ? It is really
annoying to hear all the time that you should add additional path of
execution to already complex stack, and rely on more code to handle
something (poolers).


--
GJ

From:
Scott Marlowe
Date:

2009/5/29 Grzegorz Jaśkiewicz <>:
> On Fri, May 29, 2009 at 2:54 AM, Greg Smith <> wrote:
>
>>  The PostgreSQL connection handler is known to be bad at handling high
>> connection loads compared to the popular pooling projects, so you really
>> shouldn't throw this problem at it. While kernel problems stack on top of
>> that, you really shouldn't start at kernel fixes; nail the really
>> fundamental and obvious problem first.
>
> if it is implemented somewhere else better, shouldn't that make it
> obvious that postgresql should solve it internally ? It is really
> annoying to hear all the time that you should add additional path of
> execution to already complex stack, and rely on more code to handle
> something (poolers).

Well Oracle I know suffers from the same issue under any real load.
On a large transactional system where I last worked, we had to keep
the live connections to the db down to under 100 to keep it running
well.  The complexity has to go somewhere, and poolers are usually a
better choice.

From:
Scott Marlowe
Date:

2009/5/29 Grzegorz Jaśkiewicz <>:
> On Fri, May 29, 2009 at 2:54 AM, Greg Smith <> wrote:
>
>>  The PostgreSQL connection handler is known to be bad at handling high
>> connection loads compared to the popular pooling projects, so you really
>> shouldn't throw this problem at it. While kernel problems stack on top of
>> that, you really shouldn't start at kernel fixes; nail the really
>> fundamental and obvious problem first.
>
> if it is implemented somewhere else better, shouldn't that make it
> obvious that postgresql should solve it internally ? It is really
> annoying to hear all the time that you should add additional path of
> execution to already complex stack, and rely on more code to handle
> something (poolers).

OTOH, you're always free to submit a patch.

From:
Grzegorz Jaśkiewicz
Date:

2009/5/29 Scott Marlowe <>:

>> if it is implemented somewhere else better, shouldn't that make it
>> obvious that postgresql should solve it internally ? It is really
>> annoying to hear all the time that you should add additional path of
>> execution to already complex stack, and rely on more code to handle
>> something (poolers).
>
> OTOH, you're always free to submit a patch.
:P

I thought that's mostly where the difference between PostgreSQL and Oracle
is: the ability to handle more transactions and better scalability.




--
GJ

From:
Scott Marlowe
Date:

2009/5/29 Grzegorz Jaśkiewicz <>:
> 2009/5/29 Scott Marlowe <>:
>
>>> if it is implemented somewhere else better, shouldn't that make it
>>> obvious that postgresql should solve it internally ? It is really
>>> annoying to hear all the time that you should add additional path of
>>> execution to already complex stack, and rely on more code to handle
>>> something (poolers).
>>
>> OTOH, you're always free to submit a patch.
> :P
>
> I thought that's where the difference is between postgresql and oracle
> mostly, ability to handle more transactions and better scalability .

Both Oracle and PostgreSQL have fairly heavy backend processes, and
running hundreds of them on either database is a mistake.    Sure,
Oracle can handle more transactions and scales a bit better, but no
one wants to have to buy a 128 way E15K to handle the load rather than
implementing connection pooling.  Show me an Oracle server with 5000
live, active connections and I'll show you a VERY large and expensive
cluster of machines.

From:
Grzegorz Jaśkiewicz
Date:

2009/5/29 Scott Marlowe <>:

>
> Both Oracle and PostgreSQL have fairly heavy backend processes, and
> running hundreds of them on either database is a mistake.    Sure,
> Oracle can handle more transactions and scales a bit better, but no
> one wants to have to buy a 128 way E15K to handle the load rather than
> implementing connection pooling.  Show me an Oracle server with 5000
> live, active connections and I'll show you a VERY large and expensive
> cluster of machines.

yes, because for that, oracle has nicer set of features that allows
you to create cluster on cheaper machines, instead of buying one ;)

But other thing, worth noticing from my own experience is that you
have to pay for Oracle so much, just to be able to enjoy it for a bit,
people tend to buy better servers.
It feels more pro if you have to pay for it. That's the observation
from UK, at least.



--
GJ

From:
Scott Marlowe
Date:

2009/5/29 Grzegorz Jaśkiewicz <>:
> 2009/5/29 Scott Marlowe <>:
>
>>
>> Both Oracle and PostgreSQL have fairly heavy backend processes, and
>> running hundreds of them on either database is a mistake.    Sure,
>> Oracle can handle more transactions and scales a bit better, but no
>> one wants to have to buy a 128 way E15K to handle the load rather than
>> implementing connection pooling.  Show me an Oracle server with 5000
>> live, active connections and I'll show you a VERY large and expensive
>> cluster of machines.
>
> yes, because for that, oracle has nicer set of features that allows
> you to create cluster on cheaper machines, instead of buying one ;)

OTOH, I can buy a rather large Solaris box for the price of the
licenses on a RAC cluster.  Then I can take what's leftover and go on
vacation, then hire Tom Lane for a year to hack pgsql to run even
faster on my big sun server.  And then buy everyone at my office an
espresso machine.  Meanwhile, a team of several Oracle DBAs will still
be trying to get RAC up and running reliably.  But yes, they'll be
using a few $20k servers to do it.

> But another thing worth noticing from my own experience is that because
> you have to pay so much for Oracle, just to be able to enjoy it at all,
> people tend to buy better servers.

In my experience that's not really true.  Since Oracle charges per CPU
it's not uncommon to see people running it on a machine with the
absolute minimum # of CPUs to do the job.

> It feels more pro if you have to pay for it. That's the observation
> from the UK, at least.

I don't think you can rightly speak for the whole of the UK on the
subject, any more than I can speak for the whole of the US. :)

I think that professional is as professional does.  The real costs for
Oracle are the salaries you have to pay to the team of DBAs to make it
work and stay up.  I've never dealt with a database that needs as much
constant hand holding as Oracle seems to.  And Oracle DBAs tend to do
one thing, and that one thing really well, since they're used to being
on a team.  So you've got one guy to do production patching and
support, another guy to do query tuning, another guy to plan
deployments, and another guy to write PL/SQL.  Then one or two other
folks for your app / db interfacing.

Much like a Formula One car, Oracle is an impressive bit of
technology.  But it's damned expensive to buy and more expensive to
operate.  Oracle's not some magic pixie dust that fixes all your DB
problems, it's got its own set of issues that it brings to the table.

From:
Grzegorz Jaśkiewicz
Date:

damn I agree with you Scott. I wish I had enough cash here to employ
Tom and other pg magicians to improve performance for all of us ;)

Thing is, though, PostgreSQL is mostly used by companies that either
don't have that sort of cash but still want the performance, or
companies that have a 'why pay if it is free' policy.

Now, about the UK: combined with Ireland it's still a bit smaller than the US ;)
I don't know about the US, but in the UK most companies still believe that
MCSE personnel, a few Windows servers, and MSSQL are the best thing you
can get... so.

Oracle is really used by a very, very small minority.

From:
Scott Marlowe
Date:

2009/5/29 Grzegorz Jaśkiewicz <>:
> damn I agree with you Scott. I wish I had enough cash here to employ
> Tom and other pg magicians to improve performance for all of us ;)
>
> Thing is, though, PostgreSQL is mostly used by companies that either
> don't have that sort of cash but still want the performance, or
> companies that have a 'why pay if it is free' policy.

It really is two very different kinds of companies.  I have a friend
who used to work somewhere that the boss was a total cheapskate, and
for that reason, and that reason alone, had chosen PostgreSQL.
Because of his overly cheap ways, the company just sucked the life out
of its employees.

OTOH, I now work for a company that uses PostgreSQL because it's the
best fit solution that allows great performance on reasonable hardware
for little money.  If Oracle provided a serious competitive advantage
we'd switch.  But it really doesn't for us, and for 90% of the db folk
out there it doesn't either.

> Now, about the UK: combined with Ireland it's still a bit smaller than the US ;)
> I don't know about the US, but in the UK most companies still believe that
> MCSE personnel, a few Windows servers, and MSSQL are the best thing you
> can get... so.

There's still plenty of that mentality here in the old USA as well.
Last company I was at was taken over by managers with a MS and only MS
mentality.  They spent 4 years replacing a pair of Linux servers that
provided web services and LDAP with about 40 MS machines.  CIO swore
there would be no more Linux in the company.  Since then they've
bought up a couple dozen smaller companies, most of which were running
on Linux, and they've had to hire back a lot of Linux talent to keep
it running.

> Oracle is really used by a very, very small minority.

And sadly, an awful lot of those installations are Oracle "by default"
not because it's the best choice.

From:
Greg Smith
Date:

On Fri, 29 May 2009, Grzegorz Jaśkiewicz wrote:

> if it is implemented somewhere else better, shouldn't that make it
> obvious that postgresql should solve it internally ?

Opening a database connection has some overhead to it that can't go away
without losing *something* in the process that you want the database to
handle.  That something usually impacts either security or crash-safety.
This is why every serious database product in the world suggests using
connection pooling; examples:

http://blogs.oracle.com/opal/2006/10/oracle_announces_new_connectio.html
http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/conn/c0006170.htm
http://msdn.microsoft.com/en-us/library/8xx3tyca.aspx
http://dev.mysql.com/tech-resources/articles/connection_pooling_with_connectorj.html

The only difference here is that some of the commercial products bundle
the connection pooler into the main program.  In most cases, you're still
stuck with configuring a second piece of software, the only difference is
that said software might already be installed for you by the big DB
installer. Since this project isn't in the business of bundling every
piece of additional software that might be useful with the database, it's
never going to make it into the core code when it works quite happily
outside of it.  The best you could hope for is that people who bundle
large chunks of other stuff along with their PostgreSQL installer, like
Enterprise DB does, might include one of the popular poolers one day.

And that's how we got to here.  There are plenty of PostgreSQL problems
one might run into that there are no usable solutions to, but that other
database vendors have already solved nicely.  From a pragmatic standpoint,
I'd rather see people work on those, rather than try and forge new ground
on a problem everyone else in the industry has failed to solve.

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
Fabrix
Date:



2009/5/28 Greg Smith <>
On Thu, 28 May 2009, Flavio Henrique Araque Gurgel wrote:

It is 2.6.24. We had to apply the kswapd patch also. It's important especially if you see your system % going as high as 99% in top and losing control of the machine. I have read that 2.6.28 had this patch accepted in mainline.

It would help if you gave more specific information about what you're talking about.  I know there was a bunch of back and forth on the "kswapd should only wait on IO if there is IO" patch, where it was committed and then reverted etc., but it's not clear to me if that's what you're talking about--and if so, what that has to do with the context switch problem.

Back to Fabrix's problem.  You're fighting a couple of losing battles here.  Let's go over the initial list:

1) You have 32 cores.  You think they should be allowed to schedule
3500 active connections across them.  That doesn't work, and what happens
is exactly the sort of context switch storm you're showing data for. Think about it for a minute:  how many of those can really be doing work at any time?  32, that's how many.  Now, you need some multiple of the number of cores to try to make sure everybody is always busy, but that multiple should be closer to 10X the number of cores rather than 100X. You need to adjust the connection pool ratio so that the PostgreSQL max_connections is closer to 500 than 5000, and this is by far the most critical thing for you to do.  The PostgreSQL connection handler is known to be bad at handling high connection loads compared to the popular pooling projects, so you really shouldn't throw this problem at it. While kernel problems stack on top of that, you really shouldn't start at kernel fixes; nail the really fundamental and obvious problem first.
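To put numbers on the sizing rule above, here is a back-of-the-envelope sketch (the helper name is a hypothetical illustration, not something from this thread):

```python
def recommended_pool_size(cores, multiple=10):
    """Pool sizing per the advice above: some multiple of the core
    count, closer to 10x than 100x."""
    return cores * multiple

# The 32-core server under discussion:
print(recommended_pool_size(32))   # -> 320, vs. the ~3500 connections seen
```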

The application is not closing its connections; the development team is making the change to close the connection after the job is done. So most connections are in the idle state.  How much would this help? Could this be the real problem?



2) You have very new hardware and a very old kernel.  Once you've done the above, if you're still not happy with performance, at that point you should consider using a newer one.  It's fairly simple to build a Linux kernel using the same basic kernel parameters as the stock RedHat one. 2.6.28 is six months old now, is up to 2.6.28.10, and has gotten a lot more testing than most kernels due to it being the Ubuntu 9.04 default. I'd suggest you try out that version.

OK, I'll test whether updating the kernel improves this.



3) A system with 128GB of RAM is in a funny place where by using the defaults or the usual rules of thumb for a lot of parameters ("set shared_buffers to 1/4 of RAM") are all bad ideas.  shared_buffers seems to top out its usefulness around 10GB on current generation hardware/software, and some Linux memory tunables have defaults on 2.6.18 that are insane for your system; vm_dirty_ratio at 40 comes to mind as the one I run into most.  Some of that gets fixed just by moving to a newer kernel, some doesn't.  Again, these aren't the problems you're having now though; they're the ones you'll have in the future *if* you fix the more fundamental problems first.
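As a concrete sketch of the knobs Greg names (the specific values below are illustrative assumptions for a 128GB box, not recommendations from this thread):

```ini
# postgresql.conf -- don't apply "1/4 of RAM" on a 128GB machine;
# shared_buffers usefulness tops out around 10GB per the note above
shared_buffers = 8GB

# /etc/sysctl.conf -- the 2.6.18 default of vm.dirty_ratio = 40 lets an
# enormous amount of dirty memory accumulate on a box this large;
# lowering it makes writeback start sooner
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
```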

 


--
* Greg Smith http://www.gregsmith.com Baltimore, MD

From:
Greg Smith
Date:

On Fri, 29 May 2009, Fabrix wrote:

> The application is not closing its connections; the development team
> is making the change to close the connection after the job is done.
> So most connections are in the idle state.  How much would this help?
> Could this be the real problem?

Ah, now you're getting somewhere.  This is actually another subtle problem
with making max_connections really high.  It lets developers get away with
being sloppy in ways that waste large amount of server resources.

Fix that problem, re-test, and then think about changing other things.
There's no need to go through kernel changes and the lot if you can nail
the problem at its true source and get acceptable performance.

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
Scott Mead
Date:

2009/5/29 Greg Smith <>
On Fri, 29 May 2009, Grzegorz Jaśkiewicz wrote:

if it is implemented somewhere else better, shouldn't that make it
obvious that postgresql should solve it internally ?

Opening a database connection has some overhead to it that can't go away without losing *something* in the process that you want the database to handle.  That something usually impacts either security or crash-safety. This is why every serious database product in the world suggests using connection pooling; examples:

http://blogs.oracle.com/opal/2006/10/oracle_announces_new_connectio.html
http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/conn/c0006170.htm
http://msdn.microsoft.com/en-us/library/8xx3tyca.aspx
http://dev.mysql.com/tech-resources/articles/connection_pooling_with_connectorj.html



  Exactly, here's the thing: if you have an open transaction somewhere in the system, there may be a REALLY good reason for it.  If your app or dev team is keeping those open, it's very possible that 'reaping' them is going to cause some kind of data integrity issue in your database.  I would investigate the application and make sure that everything is actually rolling back or committing.  If you're using an ORM, make sure that it's using autocommit; this usually makes the issue go away.
 As to the context switching point -- a connection pooler is what you need.  Why make your database server dedicate cycles to figuring out who gets on the CPU next?  Why not lower the number of connections and let a connection pool decide what to use?  That usually helps with your open transactions too (if they indeed are just abandoned by the application).

 

The only difference here is that some of the commercial products bundle the connection pooler into the main program.  In most cases, you're still stuck with configuring a second piece of software, the only difference is that said software might already be installed for you by the big DB installer. Since this project isn't in the business of bundling every piece of additional software that might be useful with the database, it's never going to make it into the core code when it works quite happily outside of it.  The best you could hope for is that people who bundle large chunks of other stuff along with their PostgreSQL installer, like Enterprise DB does, might include one of the popular poolers one day.

 This sounds like a dirty plug (sorry sorry sorry, it's for informative purposes only)... 

Open Source:
  
      One-Click installers :    No connection pool bundled  (will be included in 8.4 one-click installers)
      PostgresPlus Standard Edition :  pgBouncer is bundled

Proprietary:

      PostgresPlus Advanced Server: pgBouncer is bundled

  That being said, the well known connection pools for postgres are pretty small and easy to download / build / configure and get up and running.


--Scott

From:
Scott Marlowe
Date:

On Fri, May 29, 2009 at 12:20 PM, Scott Mead
<> wrote:
>  This sounds like a dirty plug (sorry sorry sorry, it's for informative
> purposes only)...

(Commercial applications mentioned deleted for brevity.)

Just sounded like useful information to me.  I'm not anti commercial,
just anti-marketing speak. :)  Like a lot of folks on this list
really.

From:
Fabrix
Date:



2009/5/29 Scott Mead <>
2009/5/29 Greg Smith <>

On Fri, 29 May 2009, Grzegorz Jaśkiewicz wrote:

if it is implemented somewhere else better, shouldn't that make it
obvious that postgresql should solve it internally ?

Opening a database connection has some overhead to it that can't go away without losing *something* in the process that you want the database to handle.  That something usually impacts either security or crash-safety. This is why every serious database product in the world suggests using connection pooling; examples:

http://blogs.oracle.com/opal/2006/10/oracle_announces_new_connectio.html
http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/conn/c0006170.htm
http://msdn.microsoft.com/en-us/library/8xx3tyca.aspx
http://dev.mysql.com/tech-resources/articles/connection_pooling_with_connectorj.html



  Exactly, here's the thing: if you have an open transaction somewhere in the system, there may be a REALLY good reason for it.  If your app or dev team is keeping those open, it's very possible that 'reaping' them is going to cause some kind of data integrity issue in your database.  I would investigate the application and make sure that everything is actually rolling back or committing.  If you're using an ORM, make sure that it's using autocommit; this usually makes the issue go away.
 As to the context switching point -- a connection pooler is what you need.  Why make your database server dedicate cycles to figuring out who gets on the CPU next?  Why not lower the number of connections and let a connection pool decide what to use?  That usually helps with your open transactions too (if they indeed are just abandoned by the application).

 

The only difference here is that some of the commercial products bundle the connection pooler into the main program.  In most cases, you're still stuck with configuring a second piece of software, the only difference is that said software might already be installed for you by the big DB installer. Since this project isn't in the business of bundling every piece of additional software that might be useful with the database, it's never going to make it into the core code when it works quite happily outside of it.  The best you could hope for is that people who bundle large chunks of other stuff along with their PostgreSQL installer, like Enterprise DB does, might include one of the popular poolers one day.

 This sounds like a dirty plug (sorry sorry sorry, it's for informative purposes only)... 

Open Source:
  
      One-Click installers :    No connection pool bundled  (will be included in 8.4 one-click installers)
      PostgresPlus Standard Edition :  pgBouncer is bundled

Proprietary:

      PostgresPlus Advanced Server: pgBouncer is bundled

  That being said, the well known connection pools for postgres are pretty small and easy to download / build / configure and get up and running.


Which is better and more complete, and which has more features?
What do you recommend, pgbouncer or pgpool?

--Scott


From:
Scott Mead
Date:

On Fri, May 29, 2009 at 3:45 PM, Fabrix <> wrote:

Which is better and more complete, and which has more features?
What do you recommend, pgbouncer or pgpool?

   In your case, where you're looking to just get the connection overhead off of the machine, pgBouncer is probably going to be more efficient.  It's small and very lightweight, and you don't have to worry about a lot of extra features.  It is a '... to the wall' connection pool.

   pgPool is definitely more feature-full, but honestly, you probably don't need the ability (at least now) to balance selects against replicated servers, or to have the pooler write to multiple servers for H/A.  Both these things would take more time to implement.

pgPool is really an all-around H/A / scalability architecture decision, whereas pgBouncer is a nice, lightweight and quick way to:

   *) Lower the number of connections to the dbserver
   *) Avoid connect / disconnect overhead
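A minimal pgbouncer.ini sketch of that setup (the database name, port, and pool numbers are illustrative assumptions, not settings from this thread):

```ini
[databases]
; clients connect to pgBouncer, which fans in to the real server
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
pool_mode = transaction      ; server connections returned at commit/rollback
max_client_conn = 5000       ; the app may still open thousands of sockets...
default_pool_size = 300      ; ...but only ~10x cores ever hit PostgreSQL
```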

--Scott

From:
Scott Carey
Date:

On 5/28/09 6:54 PM, "Greg Smith" <> wrote:

> 2) You have very new hardware and a very old kernel.  Once you've done the
> above, if you're still not happy with performance, at that point you
> should consider using a newer one.  It's fairly simple to build a Linux
> kernel using the same basic kernel parameters as the stock RedHat one.
> 2.6.28 is six months old now, is up to 2.6.28.10, and has gotten a lot
> more testing than most kernels due to it being the Ubuntu 9.04 default.
> I'd suggest you try out that version.


Comparing RedHat's 2.6.18, heavily patched, fix backported kernel to the
original 2.6.18 is really hard.  Yes, much of it is old, but a lot of stuff
has been backported.
I have no idea if things related to this case have been backported.  Virtual
memory management is complex and only bug fixes would likely go in however.
But RedHat 5.3 for example put all the new features for Intel's latest
processor in the release (which may not even be in 2.6.28!).

There are operations/IT people who won't touch Ubuntu etc with a ten foot pole
yet for production.  That may be irrational, but such paranoia exists.  The
latest postgres release is generally a hell of a lot safer than the latest
linux kernel, and people get paranoid about their DB.

If you told someone who has to wake up at 3AM by page if the system has an
error that "oh, we patched our own kernel build into the RedHat OS" they
might not be ok with that.

It's a good test to see if this problem is fixed in the kernel. I've seen
CentOS 5.2 go completely nuts with system CPU time and context switches with
kswapd many times before.  I haven't put the system under the same stress
with 5.3 yet however.


From:
Greg Smith
Date:

On Fri, 29 May 2009, Scott Carey wrote:

> There are operations/IT people who won't touch Ubuntu etc with a ten foot pole
> yet for production.

The only thing I was suggesting is that because 2.6.28 is the latest
Ubuntu kernel, that means it's gotten a lot more exposure and testing
than, say, other options like 2.6.27 or 2.6.29.

I build a fair number of RedHat/CentOS systems with an upgraded kernel
based on mature releases from kernel.org, and a config as close as
possible to the original RedHat one, with the generic kernel defaults for
all the new settings.  I keep liking that combination better than just
using an Ubuntu version with a newer kernel.  I've seen a couple of odd
kernel setting choices in Ubuntu releases before that motivate that
choice; the scheduler trainwreck described at
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/188226 comes to mind.

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
Scott Marlowe
Date:

On Sat, May 30, 2009 at 9:41 PM, Greg Smith <> wrote:
> On Fri, 29 May 2009, Scott Carey wrote:
>
>> There are operations/IT people who won't touch Ubuntu etc with a ten foot pole
>> yet for production.
>
> The only thing I was suggesting is that because 2.6.28 is the latest Ubuntu
> kernel, that means it's gotten a lot more exposure and testing than, say,
> other options like 2.6.27 or 2.6.29.
>
> I build a fair number of RedHat/CentOS systems with an upgraded kernel based
> on mature releases from kernel.org, and a config as close as possible to the
> original RedHat one, with the generic kernel defaults for all the new
> settings.  I keep liking that combination better than just using an Ubuntu
> version with a newer kernel.  I've seen a couple of odd kernel setting
> choices in Ubuntu releases before that motivate that choice; the scheduler
> trainwreck described at
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/188226 comes to mind.

8.04  was a frakking train wreck in many ways.  It wasn't until 8.04.2
came out that it was even close to useable as a server OS, and even
then, not for databases yet.  It's still got broken bits and pieces
marked "fixed in 8.10"...  Uh, hello, it's your LTS release, fixes
should be made there as a priority.  There's a reason my dbs run on
Centos / RHEL.  It's not the fastest release ever, but it doesn't go
down on me and it just works.

From:
Fabrix
Date:



2009/5/29 Scott Carey <>

On 5/28/09 6:54 PM, "Greg Smith" <> wrote:

> 2) You have very new hardware and a very old kernel.  Once you've done the
> above, if you're still not happy with performance, at that point you
> should consider using a newer one.  It's fairly simple to build a Linux
> kernel using the same basic kernel parameters as the stock RedHat one.
> 2.6.28 is six months old now, is up to 2.6.28.10, and has gotten a lot
> more testing than most kernels due to it being the Ubuntu 9.04 default.
> I'd suggest you try out that version.


Comparing RedHat's 2.6.18, heavily patched, fix backported kernel to the
original 2.6.18 is really hard.  Yes, much of it is old, but a lot of stuff
has been backported.
I have no idea if things related to this case have been backported.  Virtual
memory management is complex and only bug fixes would likely go in however.
But RedHat 5.3 for example put all the new features for Intel's latest
processor in the release (which may not even be in 2.6.28!).

There are operations/IT people who won't touch Ubuntu etc with a ten foot pole
yet for production.  That may be irrational, but such paranoia exists.  The
latest postgres release is generally a hell of a lot safer than the latest
linux kernel, and people get paranoid about their DB.

If you told someone who has to wake up at 3AM by page if the system has an
error that "oh, we patched our own kernel build into the RedHat OS" they
might not be ok with that.

It's a good test to see if this problem is fixed in the kernel. I've seen
CentOS 5.2 go completely nuts with system CPU time and context switches with
kswapd many times before.  I haven't put the system under the same stress
with 5.3 yet however.

One of the servers is an Intel Xeon X7350 2.93GHz, RH 5.3 and kernel 2.6.18-128.el5,
and the performance is bad too, so I don't think the problem is the kernel.

The two servers that I tested (HP-785 Opteron and IBM x3950 M2 Xeon) have a NUMA architecture, and I thought the problem was caused by NUMA.

http://archives.postgresql.org/pgsql-admin/2008-11/msg00157.php

I'm trying another server, an HP BL 680 blade with Xeon E7450 (4 CPUs x 6 cores = 24 cores) without a NUMA architecture, but the CPUs are also climbing.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 46949972 116908 17032964    0    0    15    31    2    2  1  0 98  0  0
 2  0      0 46945880 116916 17033068    0    0    72   140 2059 3140  1  1 97  0  0
329  0      0 46953260 116932 17033208    0    0    24   612 1435 194237 44  3 53  0  0
546  0      0 46952912 116940 17033208    0    0     4   136 1090 327047 96  4  0  0  0
562  0      0 46951052 116940 17033224    0    0     0     0 1095 323034 95  4  0  0  0
514  0      0 46949200 116952 17033212    0    0     0   224 1088 330178 96  3  1  0  0
234  0      0 46948456 116952 17033212    0    0     0     0 1106 315359 91  5  4  0  0
 4  0      0 46958376 116968 17033272    0    0    16   396 1379 223499 47  3 49  0  0
 1  1      0 46941644 116976 17033224    0    0   152  1140 2662 5540  4  2 93  1  0
 1  0      0 46943196 116984 17033248    0    0   104   604 2307 3992  4  2 94  0  0
 1  1      0 46931544 116996 17033568    0    0   104  4304 2318 3585  1  1 97  1  0
 0  0      0 46943572 117004 17033568    0    0    32   204 2007 2986  1  1 98  0  0


 Now I don't think the problem is NUMA.


The developer team will fix the application and then I will test again.

I believe that the problem could be solved once the application closes its connections, and then 16 cores in a server would do the work instead of 32 or 24.


Regards...

--Fabrix


From:
Scott Carey
Date:

On 5/31/09 9:37 AM, "Fabrix" <> wrote:

>
>
> 2009/5/29 Scott Carey <>
>>
>> On 5/28/09 6:54 PM, "Greg Smith" <> wrote:
>>
>>> 2) You have very new hardware and a very old kernel.  Once you've done the
>>> above, if you're still not happy with performance, at that point you
>>> should consider using a newer one.  It's fairly simple to build a Linux
>>> kernel using the same basic kernel parameters as the stock RedHat one.
>>> 2.6.28 is six months old now, is up to 2.6.28.10, and has gotten a lot
>>> more testing than most kernels due to it being the Ubuntu 9.04 default.
>>> I'd suggest you try out that version.
>>
>>
>> Comparing RedHat's 2.6.18, heavily patched, fix backported kernel to the
>> original 2.6.18 is really hard.  Yes, much of it is old, but a lot of stuff
>> has been backported.
>> I have no idea if things related to this case have been backported.  Virtual
>> memory management is complex and only bug fixes would likely go in however.
>> But RedHat 5.3 for example put all the new features for Intel's latest
>> processor in the release (which may not even be in 2.6.28!).
>>
>> There are operations/IT people who won't touch Ubuntu etc with a ten foot pole
>> yet for production.  That may be irrational, but such paranoia exists.  The
>> latest postgres release is generally a hell of a lot safer than the latest
>> linux kernel, and people get paranoid about their DB.
>>
>> If you told someone who has to wake up at 3AM by page if the system has an
>> error that "oh, we patched our own kernel build into the RedHat OS" they
>> might not be ok with that.
>>
>> It's a good test to see if this problem is fixed in the kernel. I've seen
>> CentOS 5.2 go completely nuts with system CPU time and context switches with
>> kswapd many times before.  I haven't put the system under the same stress
>> with 5.3 yet however.
>
> One of the servers is an Intel Xeon X7350 2.93GHz, RH 5.3 and kernel
> 2.6.18-128.el5,
> and the performance is bad too, so I don't think the problem is the kernel
>
> The two servers that I tested (HP-785 Opteron and IBM x3950 M2 Xeon) have a NUMA
> architecture, and I thought the problem was caused by NUMA.
>
> http://archives.postgresql.org/pgsql-admin/2008-11/msg00157.php
>
> I'm trying another server, an HP BL 680 blade with Xeon E7450 (4 CPUs x 6
> cores = 24 cores) without a NUMA architecture, but the CPUs are also climbing.
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  1  0      0 46949972 116908 17032964    0    0    15    31    2    2  1  0 98  0  0
>  2  0      0 46945880 116916 17033068    0    0    72   140 2059 3140  1  1 97  0  0
> 329  0      0 46953260 116932 17033208    0    0    24   612 1435 194237 44  3 53  0  0
> 546  0      0 46952912 116940 17033208    0    0     4   136 1090 327047 96  4  0  0  0
> 562  0      0 46951052 116940 17033224    0    0     0     0 1095 323034 95  4  0  0  0
> 514  0      0 46949200 116952 17033212    0    0     0   224 1088 330178 96  3  1  0  0
> 234  0      0 46948456 116952 17033212    0    0     0     0 1106 315359 91  5  4  0  0
>  4  0      0 46958376 116968 17033272    0    0    16   396 1379 223499 47  3 49  0  0
>  1  1      0 46941644 116976 17033224    0    0   152  1140 2662 5540  4  2 93  1  0
>  1  0      0 46943196 116984 17033248    0    0   104   604 2307 3992  4  2 94  0  0
>  1  1      0 46931544 116996 17033568    0    0   104  4304 2318 3585  1  1 97  1  0
>  0  0      0 46943572 117004 17033568    0    0    32   204 2007 2986  1  1 98  0  0
>
>
>  Now I don't think the problem is NUMA.
>
>
> The developer team will fix the application and then I will test again.
>
> I believe that the problem could be solved once the application closes its
> connections, and then 16 cores in a server would do the work instead of 32 or 24.

Hidden in the above data is that the context switch craziness is not
correlated with system CPU time, but user CPU time -- so this is not likely
related to the kswapd context switch stuff which is associated with high
system CPU use.

It's probably locking in Postgres.
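That correlation can be checked mechanically against a vmstat capture; a sketch (the 50000 threshold is an arbitrary assumption, and the two sample rows are taken from the output above):

```python
# Parse captured `vmstat 1` rows and flag samples where the context-switch
# rate (cs, column 12) explodes, showing whether the time went to user (us)
# or system (sy) CPU.  In practice, feed it the saved output of `vmstat 1`.
sample = """\
 0  0 0 47121904 117328 17141940 0 0 8 148 1428 2107  1  1 98  0 0
500 0 0 47115224 117440 17142052 0 0 0   0 1118 322144 91 5  4 0 0
"""

for line in sample.splitlines():
    f = line.split()
    cs, us, sy = int(f[11]), int(f[12]), int(f[13])
    if cs > 50000:
        # high cs paired with high *user* time, not system time
        print(f"cs={cs} us={us} sy={sy}")
```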


>
>
> Regards...
>
> --Fabrix
>
>
>


From:
Ron Mayer
Date:

Grzegorz Jaśkiewicz wrote:
>
> I thought that's where the difference is between postgresql and oracle
> mostly, the ability to handle more transactions and better scalability.
>

Which were you suggesting had this "better scalability"?

I recall someone summarizing to a CFO where I used to work:
"Oracle may scale technologically, but it doesn't scale well financially."



From:
Greg Smith
Date:

On Sat, 30 May 2009, Scott Marlowe wrote:

> 8.04  was a frakking train wreck in many ways.  It wasn't until 8.04.2
> came out that it was even close to useable as a server OS, and even
> then, not for databases yet.  It's still got broken bits and pieces
> marked "fixed in 8.10"...  Uh, hello, it's your LTS release, fixes
> should be made there as a priority.

Ubuntu doesn't really have LTS releases, they just have ones they claim
are supported longer than others.  But as you've also noticed, they really
aren't.  All they really offer is long-term critical security fixes for
those releases, that's it.  The longest I've ever gotten an Ubuntu box to
last before becoming overwhelmed by bugs that were only fixed in later
versions and not backported was around 2 years.

...but now we're wandering way off topic.

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
James Mansion
Date:

Greg Smith wrote:
> 3500 active connections across them.  That doesn't work, and what happens
> is exactly the sort of context switch storm you're showing data for.
> Think about it for a minute:  how many of those can really be doing
> work at any time?  32, that's how many.  Now, you need some multiple
> of the number of cores to try to make sure everybody is always busy,
> but that multiple should be closer to 10X the number of cores rather
> than 100X.
That's surely overly simplistic.  There is inherently nothing problematic
about having a lot of compute processes waiting for their timeslice, nor
about having IO- or semaphore-blocked processes waiting, and it doesn't
cause a context switch storm - this is a problem with Postgres scalability,
not (inherently) lots of connections.  I'm sure most of us evaluating
Postgres from a background in Sybase or SQL Server would regard 5000
connections as no big deal.

This has the sniff of a badly contended spin-and-yield, doesn't it?

I'd guess that if the yield were a sleep for a couple of milliseconds then
the lock holder would run and free everything up.
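A toy sketch of that idea (purely illustrative, not Postgres internals): instead of spinning with a bare yield, sleep a couple of milliseconds between attempts, so the lock holder actually gets scheduled and can release the lock:

```python
import threading
import time

lock = threading.Lock()

def acquire_with_backoff(lock, sleep_s=0.002):
    """Spin-with-sleep: instead of spin-and-yield (burning CPU and causing
    context switches), sleep ~2ms between attempts so the lock holder
    gets a chance to run and free everything up."""
    while not lock.acquire(blocking=False):
        time.sleep(sleep_s)   # give up the CPU for a real interval

# demo: another thread holds the lock and releases it after 10ms
lock.acquire()
threading.Timer(0.01, lock.release).start()
acquire_with_backoff(lock)
print("acquired")
```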


From:
"Kevin Grittner"
Date:

James Mansion <> wrote:

> I'm sure most of us evaluating Postgres from a background in Sybase
> or SQLServer would regard 5000 connections as no big deal.

Sure, but the architecture of those products is based around all the
work being done by "engines" which try to establish affinity to
different CPUs, and loop through the various tasks to be done.  You
don't get a context switch storm because you normally have the number
of engines set at or below the number of CPUs.  The down side is that
they spend a lot of time spinning around queue access to see if
anything has become available to do -- which causes them not to play
nice with other processes on the same box.

If you do connection pooling and queue requests, you get the best of
both worlds.  If that could be built into PostgreSQL, it would
probably reduce the number of posts requesting support for bad
configurations, and help with benchmarks which don't use proper
connection pooling for the product; but it would actually not add any
capability which isn't there if you do your homework....

-Kevin

От:
Dimitri
Дата:

Few weeks ago tested a customer application on 16 cores with Oracle:
  - 20,000 sessions in total
  - 70,000 queries/sec

without any problem on a mid-range Sun box + Solaris 10..

Rgds,
-Dimitri

On 6/3/09, Kevin Grittner <> wrote:
> James Mansion <> wrote:
>
>> I'm sure most of us evaluating Postgres from a background in Sybase
>> or SQLServer would regard 5000 connections as no big deal.
>
> Sure, but the architecture of those products is based around all the
> work being done by "engines" which try to establish affinity to
> different CPUs, and loop through the various tasks to be done.  You
> don't get a context switch storm because you normally have the number
> of engines set at or below the number of CPUs.  The down side is that
> they spend a lot of time spinning around queue access to see if
> anything has become available to do -- which causes them not to play
> nice with other processes on the same box.
>
> If you do connection pooling and queue requests, you get the best of
> both worlds.  If that could be built into PostgreSQL, it would
> probably reduce the number of posts requesting support for bad
> configurations, and help with benchmarks which don't use proper
> connection pooling for the product; but it would actually not add any
> capability which isn't there if you do your homework....
>
> -Kevin
>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

От:
"Kevin Grittner"
Дата:

Dimitri <> wrote:
> Few weeks ago tested a customer application on 16 cores with Oracle:
>   - 20,000 sessions in total
>   - 70,000 queries/sec
>
> without any problem on a mid-range Sun box + Solaris 10..

I'm not sure what point you are trying to make.  Could you elaborate?

(If it's that Oracle doesn't need an external connection pool, then
are you advocating that PostgreSQL include that in the base product?)

-Kevin

От:
Scott Carey
Дата:


On 6/3/09 10:45 AM, "Kevin Grittner" <> wrote:

> Dimitri <> wrote:
>> Few weeks ago tested a customer application on 16 cores with Oracle:
>>   - 20,000 sessions in total
>>   - 70,000 queries/sec
>>
>> without any problem on a mid-range Sun box + Solaris 10..
>
> I'm not sure what point you are trying to make.  Could you elaborate?
>
> (If it's that Oracle doesn't need an external connection pool, then
> are you advocating that PostgreSQL include that in the base product?)
>

Here is how I see it -- not speaking for Dimitri.

Although Oracle's connections are heavyweight and expensive to create,
having many of them around and idle does not affect scalability much if at
all.

Postgres could fix its connection scalability issues -- that is entirely
independent of connection pooling.

In most other databases (all others that I have used), pooling merely
prevents the expense of connection creation/destruction and helps save some
RAM and not much else.
The fact that it affects scalability and performance beyond that so
dramatically in Postgres is a problem with Postgres.



> -Kevin
>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>


От:
Dimitri
Дата:

Just to say that you don't need a mega server to keep thousands of
connections with Oracle (it's just trivial), nor the CPU affinity and
other stuff you may or may not need with Sybase :-)

Regarding PostgreSQL, I think it can only benefit from an integrated
connection pooler, as it would make all populations happy anyway:
  - those who don't like the idea can always disable it :-)
  - those who have a lot of mostly inactive sessions will be happy to
simplify their session pooling
  - those who are really seeking the most optimal workload on their
servers will be happy twice: if there are any PG scalability limits, an
integrated pooler will in most cases be more performant than an external
one; if there are no PG scalability limits, it will still help to size
PG most optimally according to HW or OS capacities..

Rgds,
-Dimitri


On 6/3/09, Kevin Grittner <> wrote:
> Dimitri <> wrote:
>> Few weeks ago tested a customer application on 16 cores with Oracle:
>>   - 20,000 sessions in total
>>   - 70,000 queries/sec
>>
>> without any problem on a mid-range Sun box + Solaris 10..
>
> I'm not sure what point you are trying to make.  Could you elaborate?
>
> (If it's that Oracle doesn't need an external connection pool, then
> are you advocating that PostgreSQL include that in the base product?)
>
> -Kevin
>

От:
Robert Haas
Дата:

On Wed, Jun 3, 2009 at 2:12 PM, Scott Carey <> wrote:
> Postgres could fix its connection scalability issues -- that is entirely
> independent of connection pooling.

Really?  I'm surprised.  I thought the two were very closely related.
Could you expand on your thinking here?

...Robert

От:
Scott Carey
Дата:


On 6/3/09 11:39 AM, "Robert Haas" <> wrote:

> On Wed, Jun 3, 2009 at 2:12 PM, Scott Carey <> wrote:
>> Postgres could fix its connection scalability issues -- that is entirely
>> independent of connection pooling.
>
> Really?  I'm surprised.  I thought the two were very closely related.
> Could you expand on your thinking here?
>

They are closely related only by coincidence of Postgres' flaws.
If Postgres did not scale so poorly as idle connections increase (or as
active ones increase), they would rarely be needed at all.

Most connection pools in clients (JDBC, ODBC, for example) are designed to
limit the connection create/close count, not the number of idle connections.
They reduce creation/deletion specifically by leaving connections idle for a
while to allow re-use. . .

Other things that can be called "connection concentrators" differ in that
they are additionally trying to put a band-aid over server design flaws that
make idle connections hurt scalability.  Or to prevent resource consumption
issues that the database doesn't have enough control over on its own (again,
a flaw -- a server should be as resilient to bad client behavior and its
resource consumption as possible).


Most 'modern' server designs throttle active actions internally.  Apache's
(very old, and truly somewhat 1995-ish) process or thread per connection
model is being abandoned for event driven models in the next version, so it
can scale like the higher performing web servers to 20K+ keep-alive
connections with significantly fewer threads / processes.

SQL is significantly more complicated than HTTP and requires a lot more
state which dictates a very different design, but nothing about it requires
idle connections to cause reduced SMP scalability.

In addition to making sure idle connections have almost no impact on
performance (just eat up some RAM), scalability as active queries increase
is important.  Although the OS is responsible for a lot of this, there are
many things that the application can do to help out.  If Postgres had a
"max_active_connections" parameter for example, then the memory used by
work_mem would be related to this value and not max_connections.  This would
further make connection poolers/concentrators less useful from a performance
and resource management perspective.

Once the above is done, connection pooling, whether integrated or provided
by a third party, would mostly only have value for clients who cannot pool
or cache connections on their own.  This is the state of connection pooling
with most other DB's today.

> ...Robert
>


От:
Marc Cousin
Дата:

It's not that trivial with Oracle either. I guess you had to use shared
servers to get to that number of sessions. Most of the time they're not
activated by default (dispatchers is at 0).

Granted, they are part of the 'main' product, so you just have to set up
dispatchers, shared servers, circuits, etc., but there is still setup to
do: dispatchers are (if I recall correctly) a completely manual parameter
(and every dispatcher can only drive a certain number of sessions,
dependent on the operating system), whereas shared servers are a bit more
dynamic, but still use processes (so you may have to tweak max processes
also).

What I mean to say is that Oracle does something quite like PostgreSQL +
a connection pooler, even if it's more advanced (it's a shared memory
structure that is used to send messages between dispatchers and shared
servers).

Or did you mean that you had thousands of sessions in dedicated mode?



On Wednesday 03 June 2009 20:13:39 Dimitri wrote:
> Just to say you don't need a mega server to keep thousands connections
> with Oracle, it's just trivial, nor CPU affinity and other stuff you
> may or may not need with Sybase :-)
>
> Regarding PostgreSQL, I think it'll only benefit to have an integrated
> connection pooler as it'll make happy all populations anyway:
>   - those who don't like the idea may always disable it :-)
>   - those who have a lot but mostly inactive sessions will be happy to
> simplify session pooling
>   - those who really seeking for the most optimal workload on their
> servers will be happy twice: if there are any PG scalability limits,
> integrated pooler will be in most cases more performant than external;
> if there are no PG scalability limits - it'll still help to size PG
> most optimally according a HW or OS capacities..
>
> Rgds,
> -Dimitri
>
> On 6/3/09, Kevin Grittner <> wrote:
> > Dimitri <> wrote:
> >> Few weeks ago tested a customer application on 16 cores with Oracle:
> >>   - 20,000 sessions in total
> >>   - 70,000 queries/sec
> >>
> >> without any problem on a mid-range Sun box + Solaris 10..
> >
> > I'm not sure what point you are trying to make.  Could you elaborate?
> >
> > (If it's that Oracle doesn't need an external connection pool, then
> > are you advocating that PostgreSQL include that in the base product?)
> >
> > -Kevin



От:
Robert Haas
Дата:

On Wed, Jun 3, 2009 at 5:09 PM, Scott Carey <> wrote:
> On 6/3/09 11:39 AM, "Robert Haas" <> wrote:
>> On Wed, Jun 3, 2009 at 2:12 PM, Scott Carey <> wrote:
>>> Postgres could fix its connection scalability issues -- that is entirely
>>> independent of connection pooling.
>>
>> Really?  I'm surprised.  I thought the two were very closely related.
>> Could you expand on your thinking here?
>
> They are closely related only by coincidence of Postgres' flaws.
> If Postgres did not scale so poorly as idle connections increase (or as
> active ones increased), they would be rarely needed at all.
>
> Most connection pools in clients (JDBC, ODBC, for example) are designed to
> limit the connection create/close count, not the number of idle connections.
> They reduce creation/deletion specifically by leaving connections idle for a
> while to allow re-use. . .
>
> Other things that can be called "connection concentrators" differ in that
> they are additionally trying to put a band-aid over server design flaws that
> make idle connections hurt scalability.  Or to prevent resource consumption
> issues that the database doesn't have enough control over on its own (again,
> a flaw -- a server should be as resilient to bad client behavior and its
> resource consumption as possible).
>
>
> Most 'modern' server designs throttle active actions internally.  Apache's
> (very old, and truly somewhat 1995-ish) process or thread per connection
> model is being abandoned for event driven models in the next version, so it
> can scale like the higher performing web servers to 20K+ keep-alive
> connections with significantly fewer threads / processes.
>
> SQL is significantly more complicated than HTTP and requires a lot more
> state which dictates a very different design, but nothing about it requires
> idle connections to cause reduced SMP scalability.
>
> In addition to making sure idle connections have almost no impact on
> performance (just eat up some RAM), scalability as active queries increase
> is important.  Although the OS is responsible for a lot of this, there are
> many things that the application can do to help out.  If Postgres had a
> "max_active_connections" parameter for example, then the memory used by
> work_mem would be related to this value and not max_connections.  This would
> further make connection poolers/concentrators less useful from a performance
> and resource management perspective.
>
> Once the above is done, connection pooling, whether integrated or provided
> by a third party, would mostly only have value for clients who cannot pool
> or cache connections on their own.  This is the state of connection pooling
> with most other DB's today.

I think I see the distinction you're drawing here.  IIUC, you're
arguing that other database products use connection pooling to handle
rapid connect/disconnect cycles and to throttle the number of
simultaneous queries, but not to cope with the possibility of large
numbers of idle sessions.  My limited understanding of why PostgreSQL
has a problem in this area is that it has to do with the size of the
process array which must be scanned to derive an MVCC snapshot.  I'd
be curious to know if anyone thinks that's correct, or not.

Assuming for the moment that it's correct, databases that don't use
MVCC won't have this problem, but they give up a significant amount of
scalability in other areas due to increased blocking (in particular,
writers will block readers).  So how do other databases that *do* use
MVCC mitigate this problem?  The only one that we've discussed here is
Oracle, which seems to get around the problem by having a built-in
connection pooler.  That gets me back to thinking that the two issues
are related, unless there's some other technique for dealing with the
need to derive snapshots.

...Robert

От:
Scott Carey
Дата:

On 6/4/09 3:57 AM, "Robert Haas" <> wrote:

> On Wed, Jun 3, 2009 at 5:09 PM, Scott Carey <> wrote:
>> On 6/3/09 11:39 AM, "Robert Haas" <> wrote:
>>> On Wed, Jun 3, 2009 at 2:12 PM, Scott Carey <> wrote:
>>>> Postgres could fix its connection scalability issues -- that is entirely
>>>> independent of connection pooling.
>>>
>>> Really?  I'm surprised.  I thought the two were very closely related.
>>> Could you expand on your thinking here?
>>
>> They are closely related only by coincidence of Postgres' flaws.
>> If Postgres did not scale so poorly as idle connections increase (or as
>> active ones increased), they would be rarely needed at all.
>>
>> Most connection pools in clients (JDBC, ODBC, for example) are designed to
>> limit the connection create/close count, not the number of idle connections.
>> They reduce creation/deletion specifically by leaving connections idle for a
>> while to allow re-use. . .
>>
>> Other things that can be called "connection concentrators" differ in that
>> they are additionally trying to put a band-aid over server design flaws that
>> make idle connections hurt scalability.  Or to prevent resource consumption
>> issues that the database doesn't have enough control over on its own (again,
>> a flaw -- a server should be as resilient to bad client behavior and its
>> resource consumption as possible).
>>
>>
>> Most 'modern' server designs throttle active actions internally.  Apache's
>> (very old, and truly somewhat 1995-ish) process or thread per connection
>> model is being abandoned for event driven models in the next version, so it
>> can scale like the higher performing web servers to 20K+ keep-alive
>> connections with significantly fewer threads / processes.
>>
>> SQL is significantly more complicated than HTTP and requires a lot more
>> state which dictates a very different design, but nothing about it requires
>> idle connections to cause reduced SMP scalability.
>>
>> In addition to making sure idle connections have almost no impact on
>> performance (just eat up some RAM), scalability as active queries increase
>> is important.  Although the OS is responsible for a lot of this, there are
>> many things that the application can do to help out.  If Postgres had a
>> "max_active_connections" parameter for example, then the memory used by
>> work_mem would be related to this value and not max_connections.  This would
>> further make connection poolers/concentrators less useful from a performance
>> and resource management perspective.
>>
>> Once the above is done, connection pooling, whether integrated or provided
>> by a third party, would mostly only have value for clients who cannot pool
>> or cache connections on their own.  This is the state of connection pooling
>> with most other DB's today.
>
> I think I see the distinction you're drawing here.  IIUC, you're
> arguing that other database products use connection pooling to handle
> rapid connect/disconnect cycles and to throttle the number of
> simultaneous queries, but not to cope with the possibility of large
> numbers of idle sessions.  My limited understanding of why PostgreSQL
> has a problem in this area is that it has to do with the size of the
> process array which must be scanned to derive an MVCC snapshot.  I'd
> be curious to know if anyone thinks that's correct, or not.
>
> Assuming for the moment that it's correct, databases that don't use
> MVCC won't have this problem, but they give up a significant amount of
> scalability in other areas due to increased blocking (in particular,
> writers will block readers).  So how do other databases that *do* use
> MVCC mitigate this problem?  The only one that we've discussed here is
> Oracle, which seems to get around the problem by having a built-in
> connection pooler.  That gets me back to thinking that the two issues
> are related, unless there's some other technique for dealing with the
> need to derive snapshots.
>
> ...Robert
>

To clarify if needed:

I'm not saying the two issues are unrelated.  I'm saying that the
relationship between connection pooling and a database is multi-dimensional,
and the scalability improvement does not have a hard dependency on
connection pooling.

On one spectrum, you have the raw performance improvement by caching
connections so they do not need to be created and destroyed frequently.
This is a universal benefit to all databases, though some have higher
overhead of connection creation than others.  Any book on databases
mentioning connection pools will list this benefit.

On another spectrum, a connection pool can act as a concurrency throttle.
The benefit of such a thing varies greatly from database to database, but
the trend for each DB out there has been to solve this issue internally and
not trust client or third party tools to prevent concurrency/scalability
related disasters.

The latter should be treated separately, a solution to it does not have to
address the connection creation/destruction efficiency -- almost all clients
these days can do that part, and third party tools are simpler if they only
have to meet that goal and not also try and reduce idle connection count.

So a fix to the connection scalability issues only optionally involves what
most would call connection pooling.

-------
Postgres' MVCC nature has something to do with it, but I'm sure there are
ways to significantly improve the current situation.  Locks and processor
cache-line behavior on larger SMP systems are often strangely behaving
beasts.


От:
Robert Haas
Дата:

On Thu, Jun 4, 2009 at 2:04 PM, Scott Carey <> wrote:
> To clarify if needed:
>
> I'm not saying the two issues are unrelated.  I'm saying that the
> relationship between connection pooling and a database is multi-dimensional,
> and the scalability improvement does not have a hard dependency on
> connection pooling.
>
> On one spectrum, you have the raw performance improvement by caching
> connections so they do not need to be created and destroyed frequently.
> This is a universal benefit to all databases, though some have higher
> overhead of connection creation than others.  Any book on databases
> mentioning connection pools will list this benefit.
>
> On another spectrum, a connection pool can act as a concurrency throttle.
> The benefit of such a thing varies greatly from database to database, but
> the trend for each DB out there has been to solve this issue internally and
> not trust client or third party tools to prevent concurrency/scalability
> related disasters.
>
> The latter should be treated separately, a solution to it does not have to
> address the connection creation/destruction efficiency -- almost all clients
> these days can do that part, and third party tools are simpler if they only
> have to meet that goal and not also try and reduce idle connection count.
>
> So a fix to the connection scalability issues only optionally involves what
> most would call connection pooling.
>
> -------
> Postgres' MVCC nature has something to do with it, but I'm sure there are
> ways to significantly improve the current situation.  Locks and processor
> cache-line behavior on larger SMP systems are often strangely behaving
> beasts.

I think in the particular case of PostgreSQL the only suggestions I've
heard for improving performance with very large numbers of
simultaneous connections are (1) connection caching, not so much
because of the overhead of creating the connection as because it
involves creating a whole new process whose private caches start out
cold, (2) finding a way to reduce ProcArrayLock contention, and (3)
reducing the cost of deriving a snapshot.  I think (2) and (3) are
related but I'm not sure how closely.  As far as I know, Simon is the
only one to submit a patch in this area and I think I'm not being
unfair if I say that that particular patch is mostly nibbling around
the edges of the problem.  There was a discussion a few months ago on
some possible changes to the lock modes of ProcArrayLock, based I
believe on some ideas from Tom (might have been Heikki), but I don't
think anyone has coded that or tested it.

We probably won't be able to make significant improvements in this
area unless someone comes up with some new, good ideas.   I agree with
you that there are probably ways to significantly improve the current
situation, but I'm not sure anyone has figured out with any degree of
specificity what they are.

...Robert

От:
James Mansion
Дата:

Kevin Grittner wrote:
> Sure, but the architecture of those products is based around all the
> work being done by "engines" which try to establish affinity to
> different CPUs, and loop through the various tasks to be done.  You
> don't get a context switch storm because you normally have the number
> of engines set at or below the number of CPUs.  The down side is that
> they spend a lot of time spinning around queue access to see if
> anything has become available to do -- which causes them not to play
> nice with other processes on the same box.
>
This is just misleading at best.  I'm sorry, but (in particular) UNIX
systems have routinely managed large numbers of runnable processes where
the run queue lengths are long without such an issue.  This is not an
issue with the number of runnable threads, but with the way that they
wait and what they do.

The context switch rate reported does not indicate processes using their
timeslices productively, unless the load is from a continuous stream of
trivial RPCs, and that doesn't stack up with the good performance and
then problematic load that the OP reported.



От:
"Kevin Grittner"
Дата:

James Mansion <> wrote:
> Kevin Grittner wrote:
>> Sure, but the architecture of those products is based around all
>> the work being done by "engines" which try to establish affinity to
>> different CPUs, and loop through the various tasks to be done.  You
>> don't get a context switch storm because you normally have the
>> number of engines set at or below the number of CPUs.  The down
>> side is that they spend a lot of time spinning around queue access
>> to see if anything has become available to do -- which causes them
>> not to play nice with other processes on the same box.
>>
> This is just misleading at best.

What part?  Last I checked, Sybase ASE and SQL Server worked as I
described.  Those are the products I was describing.  Or is it
misleading to say that you aren't likely to get a context switch storm
if you keep your active thread count at or below the number of CPUs?

> I'm sorry, but (in particular) UNIX systems have routinely
> managed large numbers of runnable processes where the run queue
> lengths are long without such an issue.

Well, the OP is looking at tens of thousands of connections.  If we
have a process per connection, how many tens of thousands can we
handle before we get into problems with exhausting possible pid
numbers (if nothing else)?

> This is not an issue with the number of runnable threads,
> but with the way that they wait and what they do.

Well, I rather think it's about both.  From a description earlier in
this thread, it sounds like Oracle effectively builds a connection pool
into its product, which gets used by default.  The above-referenced
products use a more extreme method of limiting active threads.
Perhaps they're silly to do so; perhaps not.

I know that if you do use a large number of threads, you have to be
pretty adaptive.  In our Java app that pulls data from 72 sources and
replicates it to eight, plus feeding it to filters which determine
what publishers for interfaces might be interested, the Sun JVM does
very poorly, but the IBM JVM handles it nicely.  It seems they use
very different techniques for the monitors on objects which
synchronize the activity of the threads, and the IBM technique does
well when no one monitor is dealing with a very large number of
blocking threads.  They got complaints from people who had thousands
of threads blocking on one monitor, so they now keep a count and
switch techniques for an individual monitor if the count gets too
high.

Perhaps something like that (or some other new approach) might
mitigate the effects of tens of thousands of processes competing
for a few resources, but it fundamentally seems unwise to turn those
loose to compete if requests can be queued in some way.

-Kevin

От:
"Kevin Grittner"
Дата:

James Mansion <> wrote:

>> they spend a lot of time spinning around queue access to see if
>> anything has become available to do -- which causes them not to
>> play nice with other processes on the same box.

> UNIX systems have routinely managed large numbers of runnable
> processes where the run queue lengths are long without such an
> issue.

Hmmm...  Did you think the queues I mentioned were OS run queues?  In
case that's a source of misunderstanding, let me clarify.

Sybase ASE (and derivatives) have a number of queues to schedule work.
When something needs doing, it's put on a queue.  The "engines" cycle
through checking these queues for work, using non-blocking methods for
I/O where possible.  There is a configurable parameter for how many
times an engine should check all queues without finding any work to do
before it voluntarily yields its CPU.  This number was always a tricky
one to configure, as it would starve other processes on the box if set
too high, and would cause the DBMS to context switch too much if set
too low.  Whenever a new release came out, or we changed the other
processes running on a database server, we had to benchmark to see
where the "sweet spot" was.  We ranged from 16 to 20000 for this value
at various times.

Those are the queues I meant.

-Kevin

От:
Mark Mielke
Дата:

Kevin Grittner wrote:
> James Mansion <> wrote:
>> Kevin Grittner wrote:
>>> Sure, but the architecture of those products is based around all
>>> the work being done by "engines" which try to establish affinity to
>>> different CPUs, and loop through the various tasks to be done.  You
>>> don't get a context switch storm because you normally have the
>>> number of engines set at or below the number of CPUs.  The down
>>> side is that they spend a lot of time spinning around queue access
>>> to see if anything has become available to do -- which causes them
>>> not to play nice with other processes on the same box.
>> This is just misleading at best.
>
> What part?  Last I checked, Sybase ASE and SQL Server worked as I
> described.  Those are the products I was describing.  Or is it
> misleading to say that you aren't likely to get a context switch storm
> if you keep your active thread count at or below the number of CPUs?

Context switch storm is about how the application and runtime implements concurrent accesses to shared resources, not about the potentials of the operating system. For example, if threads all spin every time a condition or event is raised, then yes, a context storm probably occurs if there are thousands of threads. But, it doesn't have to work that way. At its very simplest, this is the difference between "wake one thread" (which is then responsible for waking the next thread) vs "wake all threads". This isn't necessarily the best solution - but it is one alternative. Other solutions might involve waking the *right* thread. For example, if I know that a particular thread is waiting on my change and it has the highest priority - perhaps I only need to wake that one thread. Or, if I know that 10 threads are waiting on my results and can act on it, I only need to wake these specific 10 threads. Any system which actually wakes all threads will probably exhibit scaling limitations.

The operating system itself only needs to keep threads in the run queue if they have work to do. Having thousands of idle threads does not need to cost *any* cpu time, if they're kept in an idle thread collection separate from the run queue.

>> I'm sorry, but (in particular) UNIX systems have routinely
>> managed large numbers of runnable processes where the run queue
>> lengths are long without such an issue.
>
> Well, the OP is looking at tens of thousands of connections.  If we
> have a process per connection, how many tens of thousands can we
> handle before we get into problems with exhausting possible pid
> numbers (if nothing else)?

This depends if it is 16-bit pid numbers or 32-bit pid numbers. I believe Linux supports 32-bit pid numbers although I'm not up-to-date on what the default configurations are for all systems in use today. In particular, Linux 2.6 added support for the O(1) task scheduler, with the express requirement of supporting hundreds of thousands of (mostly idle) threads. The support exists. Is it activated or in proper use? I don't know.

> I know that if you do use a large number of threads, you have to be
> pretty adaptive.  In our Java app that pulls data from 72 sources and
> replicates it to eight, plus feeding it to filters which determine
> what publishers for interfaces might be interested, the Sun JVM does
> very poorly, but the IBM JVM handles it nicely.  It seems they use
> very different techniques for the monitors on objects which
> synchronize the activity of the threads, and the IBM technique does
> well when no one monitor is dealing with a very large number of
> blocking threads.  They got complaints from people who had thousands
> of threads blocking on one monitor, so they now keep a count and
> switch techniques for an individual monitor if the count gets too
> high.

Could be, and if so then Sun JVM should really address the problem. However, having thousands of threads waiting on one monitor probably isn't a scalable solution, regardless of whether the JVM is able to optimize around your usage pattern or not. Why have thousands of threads waiting on one monitor? That's a bit insane. :-)

You should really only have 1X or 2X as many threads as there are CPUs waiting on one monitor. Beyond that is waste. The idle threads can be pooled away, and only activated (with individual monitors which can be far more easily and effectively optimized) when the other threads become busy.

Perhaps something like that (or some other new approach) might
mitigate the effects of tens of thousands of processes competing for
a few resources, but it fundamentally seems unwise to turn those
loose to compete if requests can be queued in some way. 

An alternative approach might be: 1) Idle processes not currently running a transaction do not need to be consulted for their snapshot (and other related expenses) - if they are idle for a period of time, they "unregister" from the actively used processes list - if they become active again, they "register" in the actively used process list, and 2) Processes could be reusable across different connections - they could stick around for a period after disconnect, and make themselves available again to serve the next connection.

Still heavy-weight in terms of memory utilization, but cheap in terms of other impacts. Without the cost of connection "pooling" in the sense of requests always being indirect through a proxy of some sort.

Cheers,
mark

-- 
Mark Mielke <>
From:
david@lang.hm
Date:

On Thu, 4 Jun 2009, Robert Haas wrote:

> On Wed, Jun 3, 2009 at 5:09 PM, Scott Carey <> wrote:
>> On 6/3/09 11:39 AM, "Robert Haas" <> wrote:
>>> On Wed, Jun 3, 2009 at 2:12 PM, Scott Carey <> wrote:
>>>> Postgres could fix its connection scalability issues -- that is entirely
>>>> independent of connection pooling.
>
> I think I see the distinction you're drawing here.  IIUC, you're
> arguing that other database products use connection pooling to handle
> rapid connect/disconnect cycles and to throttle the number of
> simultaneous queries, but not to cope with the possibility of large
> numbers of idle sessions.  My limited understanding of why PostgreSQL
> has a problem in this area is that it has to do with the size of the
> process array which must be scanned to derive an MVCC snapshot.  I'd
> be curious to know if anyone thinks that's correct, or not.
>
> Assuming for the moment that it's correct, databases that don't use
> MVCC won't have this problem, but they give up a significant amount of
> scalability in other areas due to increased blocking (in particular,
> writers will block readers).  So how do other databases that *do* use
> MVCC mitigate this problem?  The only one that we've discussed here is
> Oracle, which seems to get around the problem by having a built-in
> connection pooler.  That gets me back to thinking that the two issues
> are related, unless there's some other technique for dealing with the
> need to derive snapshots.

if this is the case, how hard would it be to have threads add and remove
themselves from some list as they get busy/become idle?

I've been puzzled, as I've been watching this conversation, about what
internal locking/lookup is happening that is causing the problems with
idle threads (the pure memory overhead isn't enough to account for the
problems that are being reported).

David Lang

From:
Scott Carey
Date:

On 6/4/09 3:08 PM, "Kevin Grittner" <> wrote:

> James Mansion <> wrote:

>> I'm sorry, but (in particular) UNIX systems have routinely
>> managed large numbers of runnable processes where the run queue
>> lengths are long without such an issue.
>
> Well, the OP is looking at tens of thousands of connections.  If we
> have a process per connection, how many tens of thousands can we
> handle before we get into problems with exhausting possible pid
> numbers (if nothing else)?

Well, the connections are idle much of the time.  The OS doesn't really care
about these threads until they are ready to run, and even if they were all
runnable, there is little overhead in scheduling.

A context switch storm will only happen if too many threads are woken up
that must yield soon after getting to run on the CPU.  If you wake up 10,000
threads, and they all can get significant work done before yielding no
matter what order they run, the system will scale extremely well.

How the lock data structures are built to avoid cache-line collisions and
minimize cache line updates can also make or break a concurrency scheme and
is a bit hardware dependent.


> I know that if you do use a large number of threads, you have to be
> pretty adaptive.  In our Java app that pulls data from 72 sources and
> replicates it to eight, plus feeding it to filters which determine
> what publishers for interfaces might be interested, the Sun JVM does
> very poorly, but the IBM JVM handles it nicely.  It seems they use
> very different techniques for the monitors on objects which
> synchronize the activity of the threads, and the IBM technique does
> well when no one monitor is dealing with a very large number of
> blocking threads.  They got complaints from people who had thousands
> of threads blocking on one monitor, so they now keep a count and
> switch techniques for an individual monitor if the count gets too
> high.
>

A generic locking solution must be adaptive, yes.  But specific solutions
tailored to specific use cases rarely need to be adaptive.  I would think
that the 4 or 5 most important locks or concurrency coordination points in
Postgres have very specific, unique properties.

> Perhaps something like that (or some other new approach) might
> mitigate the effects of tens of thousands of processes competing for
> a few resources, but it fundamentally seems unwise to turn those
> loose to compete if requests can be queued in some way.
>
> -Kevin
>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

There's a bunch of useful blog posts about locks, concurrency, etc and how
they relate to low level hardware here:
http://blogs.sun.com/dave/

In particular, these are interesting references (not only for java):

http://blogs.sun.com/dave/entry/seqlocks_in_java
http://blogs.sun.com/dave/entry/biased_locking_in_hotspot
http://blogs.sun.com/dave/entry/java_thread_priority_revisted_in
http://blogs.sun.com/dave/entry/hardware_assisted_transactional_read_set


From:
david@lang.hm
Date:

On Thu, 4 Jun 2009, Mark Mielke wrote:

> Kevin Grittner wrote:
>> James Mansion <> wrote:
>
>> I know that if you do use a large number of threads, you have to be
>> pretty adaptive.  In our Java app that pulls data from 72 sources and
>> replicates it to eight, plus feeding it to filters which determine
>> what publishers for interfaces might be interested, the Sun JVM does
>> very poorly, but the IBM JVM handles it nicely.  It seems they use
>> very different techniques for the monitors on objects which
>> synchronize the activity of the threads, and the IBM technique does
>> well when no one monitor is dealing with a very large number of
>> blocking threads.  They got complaints from people who had thousands
>> of threads blocking on one monitor, so they now keep a count and
>> switch techniques for an individual monitor if the count gets too
>> high.
>>
> Could be, and if so then Sun JVM should really address the problem. However,
> having thousands of threads waiting on one monitor probably isn't a scalable
> solution, regardless of whether the JVM is able to optimize around your usage
> pattern or not. Why have thousands of threads waiting on one monitor? That's
> a bit insane. :-)
>
> You should really only have as 1X or 2X many threads as there are CPUs
> waiting on one monitor. Beyond that is waste. The idle threads can be pooled
> away, and only activated (with individual monitors which can be far more
> easily and effectively optimized) when the other threads become busy.

sometimes the decrease in complexity in the client makes it worthwhile to
'brute force' things.

this actually works well for the vast majority of services (including many
databases)

the question is how much complexity (if any) it adds to postgres to handle
this condition better, and what those changes are.

>> Perhaps something like that (or some other new approach) might
>> mitigate the effects of tens of thousands of processes competing for
>> a few resources, but it fundamentally seems unwise to turn those
>> loose to compete if requests can be queued in some way.
>>
>
> An alternative approach might be: 1) Idle processes not currently running a
> transaction do not need to be consulted for their snapshot (and other related
> expenses) - if they are idle for a period of time, they "unregister" from the
> actively used processes list - if they become active again, they "register"
> in the actively used process list,

how expensive is this register/unregister process? if it's cheap enough do
it all the time and avoid the complexity of having another config option
to tweak.

> and 2) Processes could be reusable across
> different connections - they could stick around for a period after
> disconnect, and make themselves available again to serve the next connection.

depending on what criteria you have for the re-use, this could be a
significant win (if you manage to re-use the per-process cache much),
but this is far more complex.

> Still heavy-weight in terms of memory utilization, but cheap in terms of
> other impacts. Without the cost of connection "pooling" in the sense of
> requests always being indirect through a proxy of some sort.

it would seem to me that the cost of making the extra hop through the
external pooler would be significantly more than the overhead of idle
processes marking themselves as such so that they don't get consulted
for MVCC decisions.

David Lang

From:
Robert Haas
Date:

On Thu, Jun 4, 2009 at 8:51 PM, <> wrote:
> if this is the case, how hard would it be to have threads add and remove
> themselves from some list as they get busy/become idle?
>
> I've been puzzled as I've been watching this conversation on what internal
> locking/lookup is happening that is causing the problems with idle threads
> (the pure memory overhead isn't enough to account for the problems that are
> being reported)

That's because this thread has altogether too much theory and
altogether too little gprof.  :-)

...Robert

From:
Mark Mielke
Date:

 wrote:
> On Thu, 4 Jun 2009, Mark Mielke wrote:
>> You should really only have as 1X or 2X many threads as there are
>> CPUs waiting on one monitor. Beyond that is waste. The idle threads
>> can be pooled away, and only activated (with individual monitors
>> which can be far more easily and effectively optimized) when the
>> other threads become busy.
> sometimes the decrease in complexity in the client makes it worthwhile
> to 'brute force' things.
> this actually works well for the vast majority of services (including
> many databases)
> the question is how much complexity (if any) it adds to postgres to
> handle this condition better, and what those changes are.

Sure. Locks that are not generally contended, for example, don't deserve
the extra complexity. Locks that have any expected frequency of a
"context storm" though, probably make good candidates.

>> An alternative approach might be: 1) Idle processes not currently
>> running a transaction do not need to be consulted for their snapshot
>> (and other related expenses) - if they are idle for a period of time,
>> they "unregister" from the actively used processes list - if they
>> become active again, they "register" in the actively used process list,
> how expensive is this register/unregister process? if it's cheap
> enough do it all the time and avoid the complexity of having another
> config option to tweak.

Not really relevant if you look at the "idle for a period of time". An
active process would not unregister/register. An inactive process,
though, after it is not in a commit, and after it hits some time that is
many times more than the cost of unregister + register, would free up
other processes from having to take this process into account, allowing
for better scaling. For example, let's say it doesn't unregister itself
for 5 seconds.

>> and 2) Processes could be reusable across different connections -
>> they could stick around for a period after disconnect, and make
>> themselves available again to serve the next connection.
> depending on what criteria you have for the re-use, this could be a
> significant win (if you manage to re-use the per process cache much.
> but this is far more complex.

Does it need to be? From a naive perspective - what's the benefit of a
PostgreSQL process dying, and a new connection getting a new PostgreSQL
process? I suppose bugs in PostgreSQL don't have the opportunity to
affect later connections, but overall, this seems like an unnecessary
cost. I was thinking of either: 1) The Apache model, where a PostgreSQL
process waits on accept(), or 2) When the PostgreSQL process is done, it
does connection cleanup and then it waits for a file descriptor to be
transferred to it through IPC and just starts over using it. Too hand
wavy? :-)

>> Still heavy-weight in terms of memory utilization, but cheap in terms
>> of other impacts. Without the cost of connection "pooling" in the
>> sense of requests always being indirect through a proxy of some sort.
> it would seem to me that the cost of making the extra hop through the
> external pooler would be significantly more than the overhead of idle
> processes marking themselves as such so that they don't get consulted
> for MVCC decisions

They're separate ideas to be considered separately on the complexity vs
benefit merit.

For the first - I think we already have an "external pooler", in the
sense of the master process which forks to manage a connection, so it
already involves a possible context switch to transfer control. I guess
the question is whether or not we can do better than fork(). In
multi-threaded programs, it's definitely possible to outdo fork using
thread pools. Does the same remain true of a multi-process program that
communicates using IPC? I'm not completely sure, although I believe
Apache does achieve this by having the working processes do accept()
rather than some master process that spawns off new processes on each
connection. Apache re-uses the process.

Cheers,
mark

--
Mark Mielke <>


From:
Greg Smith
Date:

On Thu, 4 Jun 2009, Robert Haas wrote:

> That's because this thread has altogether too much theory and
> altogether too little gprof.

But running benchmarks and profiling is actual work; that's so much less
fun than just speculating about what's going on!

This thread reminds me of Jignesh's "Proposal of tunable fix for
scalability of 8.4" thread from March, except with only a fraction of the
real-world detail.  There are multiple high-profile locks causing
scalability concerns at quadruple-digit user counts in the PostgreSQL
code base; finding them is easy.  Shoot, I know exactly where a couple
are, and I didn't have to think about it at all--just talked with Jignesh
a couple of times, led me right to them.  Fixing them without causing
regressions in low client count cases, now that's the hard part.  No
amount of theoretical discussion advances that any until you're at least
staring at a very specific locking problem you've already characterized
extensively via profiling.  And even then, profiling trumps theory every
time.  This is why I stay out of these discussions and work on boring
benchmark tools instead.

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
david@lang.hm
Date:

On Thu, 4 Jun 2009, Mark Mielke wrote:

>  wrote:
>> On Thu, 4 Jun 2009, Mark Mielke wrote:

>>> An alternative approach might be: 1) Idle processes not currently running
>>> a transaction do not need to be consulted for their snapshot (and other
>>> related expenses) - if they are idle for a period of time, they
>>> "unregister" from the actively used processes list - if they become active
>>> again, they "register" in the actively used process list,
>> how expensive is this register/unregister process? if it's cheap enough do
>> it all the time and avoid the complexity of having another config option to
>> tweak.
>
> Not really relevant if you look at the "idle for a period of time". An active
> process would not unregister/register. An inactive process, though, after it
> is not in a commit, and after it hits some time that is many times more than
> the cost of unregister + register, would free up other processes from having
> to take this process into account, allowing for better scaling. For example,
> let's say it doesn't unregister itself for 5 seconds.

to do this you need to have the process set an alarm to wake it up. if
instead it just checks "anything else for me to do?, no?, ok I'll go
inactive until something comes in" you have a much simpler application.

the time needed to change its status from active to inactive should be
_extremely_ small; even a tenth of a second should be a few orders of
magnitude longer than the time it needs to change its status.

this does have potential for thrashing if you have lots of short delays on
a lot of threads, but you would need to have multiple CPUs changing status
at the same time.

>>> and 2) Processes could be reusable across different connections - they
>>> could stick around for a period after disconnect, and make themselves
>>> available again to serve the next connection.
>> depending on what criteria you have for the re-use, this could be a
>> significant win (if you manage to re-use the per process cache much. but
>> this is far more complex.
>
> Does it need to be? From a naive perspective - what's the benefit of a
> PostgreSQL process dying, and a new connection getting a new PostgreSQL
> process? I suppose bugs in PostgreSQL don't have the opportunity to affect
> later connections, but overall, this seems like an unnecessary cost. I was
> thinking of either: 1) The Apache model, where a PostgreSQL process waits on
> accept(), or 2) When the PostgreSQL process is done, it does connection
> cleanup and then it waits for a file descriptor to be transferred to it
> through IPC and just starts over using it. Too hand wavy? :-)

if the contents of the cache are significantly different for different
processes (say you are servicing queries to different databases), sending
the new request to a process that has the correct hot cache could result
in a very significant speed-up compared to the simple 'first available'
approach.

>>> Still heavy-weight in terms of memory utilization, but cheap in terms of
>>> other impacts. Without the cost of connection "pooling" in the sense of
>>> requests always being indirect through a proxy of some sort.
>> it would seem to me that the cost of making the extra hop through the
>> external pooler would be significantly more than the overhead of idle
> processes marking themselves as such so that they don't get consulted for
>> MVCC decisions
>
> They're separate ideas to be considered separately on the complexity vs
> benefit merit.
>
> For the first - I think we already have an "external pooler", in the sense of
> the master process which forks to manage a connection, so it already involves
> a possible context switch to transfer control. I guess the question is
> whether or not we can do better than fork(). In multi-threaded programs, it's
> definitely possible to outdo fork using thread pools. Does the same remain
> true of a multi-process program that communicates using IPC? I'm not
> completely sure, although I believe Apache does achieve this by having the
> working processes do accept() rather than some master process that spawns off
> new processes on each connection. Apache re-uses the process.

the current limits of postgres are nowhere close to being the limits of
fork.

on my firewalls I use a forking proxy (each new connection forks a new
process to handle that connection), I can get connection rates of tens of
thousands per second on linux (other systems may be slower).

but the issue here isn't the cost of establishing a new connection, it's
the cost of having idle connections in the system.

David Lang

From:
david@lang.hm
Date:

On Fri, 5 Jun 2009, Greg Smith wrote:

> On Thu, 4 Jun 2009, Robert Haas wrote:
>
>> That's because this thread has altogether too much theory and
>> altogether too little gprof.
>
> But running benchmarks and profiling is actual work; that's so much less fun
> than just speculating about what's going on!
>
> This thread reminds me of Jignesh's "Proposal of tunable fix for scalability
> of 8.4" thread from March, except with only a fraction of the real-world
> detail.  There are multiple high-profile locks causing scalability concerns
> at quadruple digit high user counts in the PostgreSQL code base, finding them
> is easy.  Shoot, I know exactly where a couple are, and I didn't have to
> think about it at all--just talked with Jignesh a couple of times, led me
> right to them.  Fixing them without causing regressions in low client count
> cases, now that's the hard part.  No amount of theoretical discussion
> advances that any until you're at least staring at a very specific locking
> problem you've already characterized extensively via profiling.  And even
> then, profiling trumps theory every time.  This is why I stay out of these
> discussions and work on boring benchmark tools instead.

actually, as I see it we are a step before that.

it seems that people are arguing that there is no need to look for and fix
this sort of thing, on the basis that anyone who trips over these problems
is doing something wrong to start with and needs to change the behavior of
their app.

David Lang

From:
Mark Mielke
Date:

Greg Smith wrote:
> This thread reminds me of Jignesh's "Proposal of tunable fix for
> scalability of 8.4" thread from March, except with only a fraction of
> the real-world detail.  There are multiple high-profile locks causing
> scalability concerns at quadruple digit high user counts in the
> PostgreSQL code base, finding them is easy.  Shoot, I know exactly
> where a couple are, and I didn't have to think about it at all--just
> talked with Jignesh a couple of times, led me right to them.  Fixing
> them without causing regressions in low client count cases, now that's
> the hard part.  No amount of theoretical discussion advances that any
> until you're at least staring at a very specific locking problem
> you've already characterized extensively via profiling.  And even
> then, profiling trumps theory every time.  This is why I stay out of
> these discussions and work on boring benchmark tools instead.

I disagree that profiling trumps theory every time. Profiling is useful
for identifying places where the existing architecture exhibits the best
and worst behaviour. It doesn't tell you whether a different
architecture (even a slightly different architecture) would work better
or worse. It might help identify architecture problems. It does not
provide you with architectural solutions.

I think it would be more correct to say that prototyping trumps theory.
That is, if somebody has a theory, and they invest time into a
proof-of-concept patch, and post actual results to show you that "by
changing this code over here to that, I get a N% improvement when using
thousands of connections, at no measurable cost for the single
connection case", these results will be far more compelling than theory.

Still, it has to involve theory, as not everybody has the time to run
off and prototype every wild idea. Discussion can determine whether an
idea has enough merit to be worth investing in a prototype.

I think several valuable theories have been discussed, many of which
directly apply to the domain that PostgreSQL fits within. The question
isn't about how valuable these theories are - they ARE valuable. The
question is how much support from the team can be gathered to bring
about change, and how willing the team is to accept or invest in
architectural changes that might take PostgreSQL to the next level. The
real problem here is the words "invest" and "might". That is, people are
not going to invest on a "might" - people need to be convinced, and for
people that don't have a problem today, the motivation to make the
investment is far less.

In my case, all I have to offer you is theory at this time. I don't have
the time to work on PostgreSQL, and I have not invested the time to
learn the internals of PostgreSQL well enough to comfortably and
effectively make changes to implement a theory I might have. I want to
get there - but there are so many other projects and ideas to pursue,
and I only have a few hours a day to decide what to spend it on.

You can tell me "sorry, your contribution of theory isn't welcome". In
fact, that looks like exactly what you have done. :-)

If the general community agrees with you, I'll stop my contributions of
theories. :-)

I think, though, that some of the PostgreSQL architecture is "old
theory". I have this silly idea that PostgreSQL could one day be better
than Oracle (in terms of features and performance - PostgreSQL already
beats Oracle on cost :-) ). It won't get there without some significant
changes. In only the last few years, I have watched as some pretty
significant changes were introduced into PostgreSQL that significantly
improved its performance and feature set. Many of these changes might
have started with profiling - but the real change came from applied
theory, not from profiling. Bitmap indexes are an example of this.
Profiling tells you what - that large joins involving OR are slow? It
takes theory to answer "why" and "so, what do we do about it?"

Cheers,
mark

--
Mark Mielke <>


From:
"Kevin Grittner"
Date:

Mark Mielke <> wrote:
> Kevin Grittner wrote:
>> James Mansion <> wrote:
>>> Kevin Grittner wrote:
>>>
>>>> Sure, but the architecture of those products is based around all
>>>> the work being done by "engines" which try to establish affinity
>>>> to different CPUs, and loop through the various tasks to be done.
>>>> You don't get a context switch storm because you normally have
>>>> the number of engines set at or below the number of CPUs.
>>>>
>>> This is just misleading at best.
>>
>> What part?  Last I checked, Sybase ASE and SQL Server worked as I
>> described.  Those are the products I was describing.  Or is it
>> misleading to say that you aren't likely to get a context switch
>> storm if you keep your active thread count at or below the number
>> of CPUs?
>
> Context switch storm is about how the application and runtime
> implements concurrent accesses to shared resources, not about the
> potentials of the operating system.

I'm really not following how that's responsive to my questions or
points, at all.  You're making pretty basic and obvious points about
other ways to avoid the problem, but the fact is that the other
databases people point to as examples of handling large numbers of
connections have (so far at least) been ones which solve the problems
in other ways than what people seem to be proposing.  That doesn't
mean that the techniques used by these other products are the only way
to solve the issue, or even that they are the best ways; but it does
mean that pointing to those other products doesn't prove anything
relative to what lock optimization is likely to buy us.

> For example, if threads all spin every time a condition or event is
> raised, then yes, a context storm probably occurs if there are
> thousands of threads. But, it doesn't have to work that way. At its
> very simplest, this is the difference between "wake one thread"
> (which is then responsible for waking the next thread) vs "wake all
> threads". This isn't necessarily the best solution - but it is one
> alternative. Other solutions might involve waking the *right*
> thread.  For example, if I know that a particular thread is waiting
> on my change and it has the highest priority - perhaps I only need
> to wake that one thread. Or, if I know that 10 threads are waiting
> on my results and can act on it, I only need to wake these specific
> 10 threads. Any system which actually wakes all threads will
> probably exhibit scaling limitations.

I would be surprised if any of this is not obvious to all on the list.

>>> I'm sorry, but (in particular) UNIX systems have routinely
>>> managed large numbers of runnable processes where the run queue
>>> lengths are long without such an issue.
>>>
>> Well, the OP is looking at tens of thousands of connections.  If we
>> have a process per connection, how many tens of thousands can we
>> handle before we get into problems with exhausting possible pid
>> numbers (if nothing else)?
>
> This depends if it is 16-bit pid numbers or 32-bit pid numbers. I
> believe Linux supports 32-bit pid numbers although I'm not up-to-date
> on what the default configurations are for all systems in use today.
> In particular, Linux 2.6 added support for the O(1) task scheduler,
> with the express requirement of supporting hundreds of thousands of
> (mostly idle) threads. The support exists. Is it activated or in
> proper use? I don't know.

Interesting.  I'm running the latest SuSE Enterprise on a 64 bit
system with 128 GB RAM and 16 CPUs, yet my pids and port numbers are
16 bit.  Since I only use a tiny fraction of the available numbers
using current techniques, I don't need to look at this yet, but I'll
keep it in mind.

>> I know that if you do use a large number of threads, you have to be
>> pretty adaptive.  In our Java app that pulls data from 72 sources and
>> replicates it to eight, plus feeding it to filters which determine
>> what publishers for interfaces might be interested, the Sun JVM does
>> very poorly, but the IBM JVM handles it nicely.  It seems they use
>> very different techniques for the monitors on objects which
>> synchronize the activity of the threads, and the IBM technique does
>> well when no one monitor is dealing with a very large number of
>> blocking threads.  They got complaints from people who had thousands
>> of threads blocking on one monitor, so they now keep a count and
>> switch techniques for an individual monitor if the count gets too
>> high.
>>
> Could be, and if so then Sun JVM should really address the problem.

I wish they would.

> However, having thousands of threads waiting on one monitor probably
> isn't a scalable solution, regardless of whether the JVM is able to
> optimize around your usage pattern or not. Why have thousands of
> threads waiting on one monitor? That's a bit insane. :-)

Agreed.  We weren't the ones complaining to IBM.  :-)

>> Perhaps something like that (or some other new approach) might
>> mitigate the effects of tens of thousands of processes competing for
>> a few resources, but it fundamentally seems unwise to turn those
>> loose to compete if requests can be queued in some way.
>
> An alternative approach might be: 1) Idle processes not currently
> running a transaction do not need to be consulted for their snapshot
> (and other related expenses) - if they are idle for a period of time,
> they "unregister" from the actively used processes list - if they
> become active again, they "register" in the actively used process
> list, and 2) Processes could be reusable across different connections -
> they could stick around for a period after disconnect, and make
> themselves available again to serve the next connection.
>
> Still heavy-weight in terms of memory utilization, but cheap in terms
> of other impacts. Without the cost of connection "pooling" in the
> sense of requests always being indirect through a proxy of some sort.

Just guessing here, but I would expect the cost of such forwarding to
be pretty insignificant compared to the cost of even parsing the
query, much less running it.  That would be especially true if the
pool was integrated into the DBMS in a way similar to what was
described as the Oracle default.

-Kevin
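
The register/unregister idea proposed above can be sketched as follows.
All names are invented; this models the proposal, not PostgreSQL's
actual ProcArray:

```python
import threading

class ActiveBackendList:
    """Sketch of the quoted proposal: backends register when they begin
    a transaction and unregister when they go idle, so deriving a
    snapshot only scans backends actually in a transaction, not every
    open connection."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = {}  # backend_id -> current transaction id

    def begin_transaction(self, backend_id, xid):
        with self._lock:
            self._active[backend_id] = xid      # register as active

    def end_transaction(self, backend_id):
        with self._lock:
            self._active.pop(backend_id, None)  # unregister when idle

    def snapshot(self):
        # Cost is proportional to *active* backends, not total sessions.
        with self._lock:
            return sorted(self._active.values())
```

Ten thousand idle connections would then cost memory but would no
longer inflate the list every snapshot must scan.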

From:
"Kevin Grittner"
Date:

Scott Carey <> wrote:

> If you wake up 10,000 threads, and they all can get significant work
> done before yielding no matter what order they run, the system will
> scale extremely well.

But with roughly twice the average response time you would get by
throttling active requests to the minimum needed to keep all resources
busy.  (Admittedly a hard point to find with precision.)

> I would think that the 4 or 5 most important locks or concurrency
> coordination points in Postgres have very specific, unique
> properties.

Given the wide variety of uses I'd be cautious about such assumptions.

> In particular, these are interesting references (not only for Java):

With this wealth of opinion, perhaps they can soon approach IBM's JVM
in their ability to support a large number of threads.  I'm rooting
for them.

-Kevin

From:
Robert Haas
Date:

On Fri, Jun 5, 2009 at 12:33 AM, <> wrote:
> On Fri, 5 Jun 2009, Greg Smith wrote:
>
>> On Thu, 4 Jun 2009, Robert Haas wrote:
>>
>>> That's because this thread has altogether too much theory and
>>> altogether too little gprof.
>>
>> But running benchmarks and profiling is actual work; that's so much less
>> fun than just speculating about what's going on!
>>
>> This thread reminds me of Jignesh's "Proposal of tunable fix for
>> scalability of 8.4" thread from March, except with only a fraction of the
>> real-world detail.  There are multiple high-profile locks causing
>> scalability concerns at quadruple-digit user counts in the PostgreSQL
>> code base; finding them is easy.  Shoot, I know exactly where a couple are,
>> and I didn't have to think about it at all--just talked with Jignesh a
>> couple of times, led me right to them.  Fixing them without causing
>> regressions in low client count cases, now that's the hard part.  No amount
>> of theoretical discussion advances that any until you're at least staring at
>> a very specific locking problem you've already characterized extensively via
>> profiling.  And even then, profiling trumps theory every time.  This is why
>> I stay out of these discussions and work on boring benchmark tools instead.
>
> actually, as I see it we are a step before that.
>
> it seems that people are arguing that there is no need to look for and fix
> this sort of thing, on the basis that anyone who trips over these problems
> is doing something wrong to start with and needs to change the behavior of
> their app.

I have a slightly different take on that.  I don't think there's
actually resistance to improving this situation if someone (or some
group of people) comes up with a good proposal for doing it and writes
a patch and tests it and shows that it helps that case without hurting
other cases that people care about.  And there is clearly great
willingness to tell people what to do until that happens: use
connection pooling.  But if you come back and say, well, I shouldn't
have to use connection pooling because it should work without
connection pooling, well, OK, but...

...Robert

From:
Craig James
Date:

Greg Smith wrote:
> No amount of theoretical discussion advances that any until
> you're at least staring at a very specific locking problem you've
> already characterized extensively via profiling.  And even then,
> profiling trumps theory every time.

In theory, there is no difference between theory and practice.  In practice, there is a great deal of difference.

Craig

From:
Greg Smith
Date:

On Fri, 5 Jun 2009, Mark Mielke wrote:

> I disagree that profiling trumps theory every time.

That's an interesting theory.  Unfortunately, profiling shows it doesn't
work that way.

Let's see if I can summarize the state of things a bit better here:

1) PostgreSQL stops working as efficiently with >1000 active connections

2) Profiling suggests the first barrier that needs to be resolved to fix
that is how the snapshots needed to support MVCC are derived

3) There are multiple patches around that aim to improve that specific
situation, but only being tested aggressively by one contributor so far
(that I'm aware of)

4) Those patches might cause a regression for other workloads, and the
section of code involved was very hard to get working well initially.
Before any change here is accepted there needs to be a lot of data
proving it both does what's expected and doesn't introduce a regression.

5) Few people are motivated to get their hands dirty doing the boring
benchmarking work to resolve this specific problem because "use a
connection pool" is a quite good workaround

6) Many other database vendors admit this problem is so hard to solve
that they also just suggest using a connection pool

If anyone wants to work on improving things here, (4) is the sticking
point that could use more hands.  Not much theory involved, but there is a
whole lot of profiling.

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
Greg Smith
Date:

On Thu, 4 Jun 2009, Mark Mielke wrote:

> At its very simplest, this is the difference between "wake one thread"
> (which is then responsible for waking the next thread) vs "wake all
> threads"....Any system which actually wakes all threads will probably
> exhibit scaling limitations.

The prototype patch we got from Jignesh improved his specific workload by
waking more waiting processes than were being notified in the current
code.  The bottleneck that's been best examined so far at high client
counts is not because of too much waking, it's caused by not enough.

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
Robert Haas
Date:

On Fri, Jun 5, 2009 at 1:02 PM, Greg Smith<> wrote:
> On Fri, 5 Jun 2009, Mark Mielke wrote:
>> I disagree that profiling trumps theory every time.
> That's an interesting theory.  Unfortunately, profiling shows it doesn't
> work that way.

I had a laugh when I read this, but I can see someone being offended
by it.  Hopefully no one took it that way.

> Let's see if I can summarize the state of things a bit better here:
>
> 1) PostgreSQL stops working as efficiently with >1000 active connections
>
> 2) Profiling suggests the first barrier that needs to be resolved to fix
> that is how the snapshots needed to support MVCC are derived
>
> 3) There are multiple patches around that aim to improve that specific
> situation, but only being tested aggressively by one contributor so far
> (that I'm aware of)

I am actually aware of only two forays into this area that have been
reduced to code.  I am pretty much convinced that Jignesh's
wake-all-waiters patch is fundamentally - dare I say theoretically -
unsound, however much it may improve performance for his particular
workload.  The other is Simon's patch which AIUI is a fast-path for
the case where nothing has changed.  Are you aware of any others?

Thanks for the summary.

...Robert

From:
Jeff Janes
Date:

On Thu, 4 Jun 2009 06:57:57 -0400, Robert Haas <> wrote
in http://archives.postgresql.org/pgsql-performance/2009-06/msg00065.php :

> I think I see the distinction you're drawing here.  IIUC, you're
> arguing that other database products use connection pooling to handle
> rapid connect/disconnect cycles and to throttle the number of
> simultaneous queries, but not to cope with the possibility of large
> numbers of idle sessions.  My limited understanding of why PostgreSQL
> has a problem in this area is that it has to do with the size of the
> process array which must be scanned to derive an MVCC snapshot.  I'd
> be curious to know if anyone thinks that's correct, or not.
>
> Assuming for the moment that it's correct, databases that don't use
> MVCC won't have this problem, but they give up a significant amount of
> scalability in other areas due to increased blocking (in particular,
> writers will block readers).  So how do other databases that *do* use
> MVCC mitigate this problem?

I apologize if it is bad form to respond to a message that is two
months old, but I did not see this question answered elsewhere and
thought it would be helpful to have it answered.  This is my rough
understanding.  Oracle never "takes" a snapshot; it computes one on
the fly, if and when it is needed.  It maintains a structure of
recently committed transactions, with the XID for when they committed.
If a process runs into a tuple that is neither from the future nor
from the deep past, it consults this structure to see if that
transaction has committed, and if so whether it did so before or after
the current query was started.  The structure is partitionable, so it
does not have one global lock to serialize on, and the lock is short
as it only gets the info it needs, not the entire set of global open
transactions.
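
A minimal model of this scheme (invented names; not Oracle's actual
internals) might look like:

```python
# Instead of taking a snapshot up front, visibility of a recently
# written tuple is decided lazily by looking up the writer's commit
# point (an SCN-like counter) in a partitioned map of recently
# committed transactions.  All names here are invented.

N_PARTITIONS = 16  # partitioned so there is no single global lock

# each partition maps xid -> commit_scn
commit_map = [dict() for _ in range(N_PARTITIONS)]

def record_commit(xid, commit_scn):
    commit_map[xid % N_PARTITIONS][xid] = commit_scn

def tuple_visible(writer_xid, query_start_scn):
    """A tuple written by writer_xid is visible to a query that began
    at query_start_scn iff the writer committed before the query began.
    Only one partition is consulted, and only for this one xid."""
    part = commit_map[writer_xid % N_PARTITIONS]
    commit_scn = part.get(writer_xid)
    if commit_scn is None:
        # Not in the recent-commit map: still running or rolled back.
        # (A real system also knows very old xids are long committed.)
        return False
    return commit_scn < query_start_scn
```

The lock held during `tuple_visible` covers one partition and one
lookup, rather than the whole set of open transactions.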

> The only one that we've discussed here is
> Oracle, which seems to get around the problem by having a built-in
> connection pooler.

There are several reasons to have something like Oracle's shared
server (or whatever they call it now), and I don't think global
serialization on snapshots is high among them, at least not for
Oracle.  With shared server, you can (theoretically) control memory
usage so that 10,000 independent processes don't all decide to do a
large in-memory sort or hash join at the same time.

It is also a bit more than a standard connection pooler, because
multiple connections can be in the middle of non-read-only
transactions on the same backend at the same time.  I don't think
client-based pools allow that.

Jeff

From:
Tom Lane
Date:

Jeff Janes <> writes:
> I apologize if it is bad form to respond to a message that is two
> months old, but I did not see this question answered elsewhere and
> thought it would be helpful to have it answered.  This is my rough
> understanding.  Oracle never "takes" a snapshot; it computes one on
> the fly, if and when it is needed.  It maintains a structure of
> recently committed transactions, with the XID for when they
> committed.  If a process runs into a tuple that is neither from the
> future nor from the deep past, it consults this structure to see if
> that transaction has committed, and if so whether it did so before
> or after the current query was started.  The structure is
> partitionable, so it does not have one global lock to serialize on,
> and the lock is short as it only gets the info it needs, not the
> entire set of global open transactions.

Are you sure it's partitionable?  I've been told that Oracle's
transaction log is a serious scalability bottleneck.  (But I think
I first heard that in 2001, so maybe they've improved it in recent
releases.)  We know that Postgres' WAL log is a bottleneck --- check
for recent discussions involving XLogInsert.  But the WAL log is
only touched by read-write transactions, whereas in Oracle even
read-only transactions often have to go to the transaction log.

            regards, tom lane

From:
"simon@2ndquadrant.com"
Date:


On 14 August 2009 at 03:18 Jeff Janes <> wrote:

> This is my rough understanding.  Oracle never "takes" a snapshot; it
> computes one on the fly, if and when it is needed.  It maintains a
> structure of recently committed transactions, with the XID for when
> they committed.  If a process runs into a tuple that is neither from
> the future nor from the deep past, it consults this structure to see
> if that transaction has committed, and if so whether it did so before
> or after the current query was started.  The structure is
> partitionable, so it does not have one global lock to serialize on,
> and the lock is short as it only gets the info it needs, not the
> entire set of global open transactions.

If this is the way Oracle does it then the data structure you describe would need to be updated on both transaction start and transaction commit, as well as being locked while it was read. Transaction commits would need to be serialized so that the commit order was maintained. 

The Oracle structure would be read much less often, yet updated twice as often: at snapshot point and at commit. It could be partitionable, but that would increase the conflict associated with reads of the data structure.

Oracle's structure works well for an "ideal workload" such as TPC-C where the data is physically separated and so the reads on this structure are almost nil. It would work very badly on data that continuously conflicted, which may account for the fact that no Oracle benchmark has been published on TPC-E. This bears out the experience of many Oracle DBAs, including myself. I certainly wouldn't assume Oracle have solved every problem.

The Postgres procarray structure is read often, yet only exclusively locked during commit. As Tom said, we optimize away the lock at xid assignment and also optimize away many xid assignments altogether. We don't have any evidence that the size of the procarray reduces the speed of reads, but we do know that the increased queue length you get from having many concurrent sessions increases time to record commit.

We might be able to do something similar to Oracle with Postgres, but it would require significant changes and much complex thought. The reason for doing so would be to reduce the number of reads on the "MVCC structure", making mild partitioning more palatable. The good thing about the current Postgres structure is that it doesn't increase contention when accessing concurrently updated data.

On balance it would appear that Oracle gains a benchmark win by giving up some real world usefulness. That's never been something anybody here has been willing to trade. 

Further thought in this area could prove useful, but it seems a lower priority for development simply because of the code complexity required to make this sort of change.

Best Regards, Simon Riggs
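
For contrast, the Postgres-style approach Simon describes scans the
whole array of per-backend slots each time a snapshot is taken, so the
read cost grows with the number of sessions.  A toy model, with
invented names:

```python
# Toy model of a Postgres-style snapshot: every snapshot scans the
# whole array of per-backend slots, so its cost is O(number of
# sessions) even if most of them are idle.  Names are invented; this
# is a simplification of the real procarray logic.

INVALID_XID = 0

def get_snapshot(proc_array):
    """proc_array: list of per-backend xids (INVALID_XID if the slot
    has no running transaction).  Returns (xmin, xmax, running), where
    'running' is the set of in-progress xids the snapshot must treat
    as invisible."""
    running = {xid for xid in proc_array if xid != INVALID_XID}
    xmax = max(running) + 1 if running else 1
    xmin = min(running) if running else xmax
    return xmin, xmax, running

def xid_visible(xid, snapshot):
    xmin, xmax, running = snapshot
    if xid >= xmax:
        return False           # started after the snapshot was taken
    return xid not in running  # visible unless still in progress
```

The scan itself is cheap per slot, but it runs under a shared lock for
every snapshot, which is where large session counts start to hurt.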
From:
Jeff Janes
Date:

On Fri, Aug 14, 2009 at 4:21 PM, Tom Lane<> wrote:
> Jeff Janes <> writes:
>> I apologize if it is bad form to respond to a message that is two
>> months old, but I did not see this question answered elsewhere and
>> thought it would be helpful to have it answered.  This is my rough
>> understanding.  Oracle never "takes" a snapshot; it computes one on
>> the fly, if and when it is needed.  It maintains a structure of
>> recently committed transactions, with the XID for when they
>> committed.  If a process runs into a tuple that is neither from the
>> future nor from the deep past, it consults this structure to see if
>> that transaction has committed, and if so whether it did so before
>> or after the current query was started.  The structure is
>> partitionable, so it does not have one global lock to serialize on,
>> and the lock is short as it only gets the info it needs, not the
>> entire set of global open transactions.
>
> Are you sure it's partitionable?

I don't have inside knowledge, but I'm pretty sure that that structure
is partitionable.  Each data block has in its header a list of
in-doubt transactions touching that block, and a link to where in the
rollback/UNDO to find info on each one.  The UNDO header knows that
transaction's status.
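
This per-block structure could be modeled roughly like the sketch
below (invented names; a sketch of the description above, not Oracle's
actual on-disk format):

```python
# Each block header carries a small "interested transaction list" of
# in-doubt transactions touching that block, each pointing at an undo
# segment header that knows the transaction's status.  A visibility
# check therefore consults only the relevant undo header, not a global
# list of open transactions -- the lookup partitions naturally.

undo_headers = {}  # undo_slot -> {"status": "committed" | "active", "scn": int}

class Block:
    def __init__(self):
        self.itl = {}  # interested-transaction list: xid -> undo_slot

def txn_commit_scn(block, xid):
    """Return the commit SCN of xid as seen from this block, or None
    if the transaction has not committed (or is unknown here)."""
    slot = block.itl.get(xid)
    if slot is None:
        return None
    hdr = undo_headers[slot]
    return hdr["scn"] if hdr["status"] == "committed" else None
```

Because each check touches one block header and one undo header, two
sessions reading unrelated blocks never contend on a shared structure.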

Of course there is still the global serialization on obtaining the
SCN, but the work involved in obtaining that (other than fighting over
the lock) is constant; it doesn't increase with the number of
backends.  Real Application Clusters must have solved that somehow; I
don't recall how.  But I think it does make compromises, like in read
committed mode a change made by another transaction might be invisible
to your simple select statements for up to 3 seconds or so.  I've
never had the opportunity to play with a RAC.

For all I know, the work of scanning the ProcArray is trivial compared
to the work of obtaining the lock, even if the array is large.  If I
had the talent to write and run stand-alone programs that could attach
themselves to the shared memory structure and then run my arbitrary
code, I would test that out.

> I've been told that Oracle's
> transaction log is a serious scalability bottleneck.  (But I think
> I first heard that in 2001, so maybe they've improved it in recent
> releases.)

Well, something always has to be the bottleneck.  Do you know at what
degree of scaling that became a major issue?  I don't think that there
is a point of global serialization, other than taking SCNs, but if
there is enough pair-wise fighting, it could still add up to a lot
of contention.

> We know that Postgres' WAL log is a bottleneck --- check
> for recent discussions involving XLogInsert.

Would these two be good places for me to start looking into that:

http://archives.postgresql.org/pgsql-hackers/2009-06/msg01205.php
http://archives.postgresql.org/pgsql-hackers/2009-06/msg01019.php

Or is bulk copy (but with WAL logging) too specific to apply findings
to the general case?

> But the WAL log is
> only touched by read-write transactions, whereas in Oracle even
> read-only transactions often have to go to the transaction log.

That's true, but any given read-only transaction shouldn't have to
make heavy use of the transaction log just to decide whether a
transaction has committed or not.  It should be able to look that up
once and cache it for the rest of that subtransaction.  Of course, if
it actually has to construct a consistent read from the UNDO on many
different buffers due to the same interfering transaction, that is
more work and more contention.

Cheers,

Jeff