Discussion: More than 1024 connections from the same c-backend
Hi!

We have an application running on Linux (SuSE 7.2, kernel 2.4.16) that opens lots of connections to a Postgres database and occasionally dies with a segfault. Trying to reproduce the crash, I came up with the following test code:

--------------------- pgsql-test.c ---------------------
#include <stdio.h>
#include <unistd.h>     /* for sleep() */
#include <libpq-fe.h>

int main(int argc, char **argv)
{
    PGconn *conn;
    int i;

    for (i = 0; i < 2048; i++) {
        conn = PQsetdbLogin("localhost", "5432", NULL, NULL,
                            "template1", "postgres", NULL);
        if (PQstatus(conn) == CONNECTION_BAD)
            printf("%5d: Connection to database FAILED\n", i+1);
        else
            printf("%5d: Connection to database OK\n", i+1);

        if (i > 1010) {
            sleep(10);
        }
        // PQfinish(conn);
    }
    // sleep(300);
    return 0;
}
--------------------------------------------------------

The test program segfaults after it opens 1020 connections. At that point it has exactly 1024 open file descriptors, including stdin, stdout, stderr and a file descriptor on /proc/sys/kernel/shmmax.

The system limit on open file descriptors is set to 65535 (both ulimit and /proc/sys/fs/file-max). It's not related to the max-backends limit in postmaster either: the test program crashes even if postmaster is not running at all.

The program seems to crash when it returns from pqWaitTimed(). As pqWaitTimed() uses select() to poll the file descriptors, I suppose the crash is related to the limit of 1024 file descriptors that an fd_set can hold. The weird thing is that it's not the select() that segfaults. The segfault occurs on return from pqWaitTimed(). It is 100% reproducible on one machine, but it doesn't crash on another one.

GDB can't show a backtrace from the core file:

(gdb) bt
#0  0x08049ab3 in connectDBComplete ()
Cannot access memory at address 0x1

When stepping through the program in gdb, I can see the "conn" pointer getting lost on the 1021st connect, when pqWaitTimed() returns. So it looks like the return stack gets corrupted or something like that.

Can anyone confirm this? Am I missing anything here?
Any idea how to get more than 1024 connections with one backend? Andi
Sorry, libpq was compiled without debug symbols in the previous email. Here's what gdb shows after compiling postgresql-7.3.4 with --enable-debug:

Core was generated by `./pgsql-test-static'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
#0  0x0804a3f3 in connectDBComplete (conn=???) at fe-connect.c:1124
1124            flag = PQconnectPoll(conn);
(gdb) bt
#0  0x0804a3f3 in connectDBComplete (conn=???) at fe-connect.c:1124
Cannot access memory at address 0x0

Andi
Andreas Muck <bb+list.pgsql-general@blitztrade.de> writes:
> We have an application running on Linux (SuSE 7.2, kernel 2.4.16) that
> opens lots of connections to a Postgres database and occasionaly dies
> with segfault.

> The program seems to crash when it returns from pqWaitTimed(). As
> pqWaitTimed uses select() to poll the file descriptors, I suppose the
> crash is related to the limit of 1024 file descriptors that fd_set can hold.

If that's the size of fd_set on your machine, then yes, this doesn't surprise me at all. The code that calls select() is no doubt clobbering some bit beyond the end of the fd_set array.

7.4 is designed to use poll() in preference to select() (if available), because of previous complaints about exactly this problem. Not sure if you want to update to 7.4 beta, but you could consider lifting the pqWait code out of 7.4.

regards, tom lane
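
[Editor's sketch] To show why poll() avoids the problem described above: poll() takes an array of struct pollfd rather than a fixed-size bitmap, so the numeric value of the descriptor is not bounded by FD_SETSIZE. The following is a minimal sketch of a poll()-based wait, not the actual pqWait code from 7.4; the names wait_readable_poll() and demo_pipe_readable() are hypothetical.

```c
#include <poll.h>
#include <unistd.h>
#include <errno.h>

/*
 * Hypothetical helper (not the 7.4 libpq code): wait until fd is
 * readable using poll(). Works for descriptors >= FD_SETSIZE as well.
 *
 * Returns 1 if poll() reported an event on fd (data, hangup or error),
 * 0 on timeout, -1 on failure (errno is set).
 */
int wait_readable_poll(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Simplification: a retry after EINTR restarts the full timeout. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}

/* Self-check: a pipe holding one unread byte must be reported readable. */
int demo_pipe_readable(void)
{
    int p[2];
    int rc;

    if (pipe(p) != 0)
        return -1;
    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    rc = wait_readable_poll(p[0], 1000);
    close(p[0]);
    close(p[1]);
    return rc;
}
```

With a wait routine of this shape, the number of connections a single client process can hold is limited by the per-process descriptor limit (ulimit) and the server's connection limit rather than by the size of fd_set.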
Tom Lane wrote:
> 7.4 is designed to use poll() in preference to select() (if available),
> because of previous complaints about exactly this problem. Not sure if
> you want to update to 7.4 beta, but you could consider lifting the
> pqWait code out of 7.4.

I see. I'll check it out if upgrading to 7.4 is an option. Thank you for the confirmation of the problem!

Regards,
Andreas