Discussion: FD_SETSIZE with large #s of files/ports in use
An interesting issue came up the other day. We are working with an application that opens a considerable number of files and TCP/UDP ports (>3000). Unfortunately, that means that the ODBC driver sometimes fails due to a corrupted stack. We eventually figured out what was causing the corrupted stack.
The SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count) function inside socket.c calls select(). Unfortunately, the socket file descriptor number can be quite large at this time. That means that the fd_set fds variable can be misused. The fd_set variable type only allows 1024 file descriptors to be used by the calling program on many Linux versions. This can be changed by setting FD_SETSIZE or __FD_SETSIZE to a larger number. We have run tests where we were able to change the __FD_SETSIZE value in /usr/src/...linuxversion../linux/include/linux/posix_types.h. The fix worked well.
Unfortunately, this isn't a good solution because a software update to another Linux version will invalidate our fix. We've tried various mechanisms to set FD_SETSIZE or __FD_SETSIZE in socket.c but with no luck. Has anyone else had this problem and come up with a good fix? Or is there a better solution?
Barry Nicholson
Niceng.com
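A note for readers of the archive: the stack corruption described above is what happens when FD_SET() is handed a descriptor number of FD_SETSIZE or more, because the fd_set has a fixed size. Below is a minimal sketch of a defensive range check. The helper names fd_fits_in_fd_set and safe_fd_set are hypothetical and are not part of psqlodbc.

```c
#include <stdbool.h>
#include <sys/select.h>

/* Hypothetical helper, not part of psqlodbc: reports whether fd can be
 * stored in an fd_set without writing outside the structure. */
static bool fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Adds fd to set only when it is in range. Returns false otherwise, so
 * the caller can report an error instead of corrupting memory. */
static bool safe_fd_set(int fd, fd_set *set)
{
    if (!fd_fits_in_fd_set(fd))
        return false;
    FD_SET(fd, set);
    return true;
}
```

A check like this only turns the corruption into a clean failure. It does not make select() usable with high descriptor numbers.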
Hi,

Could you please try the attached patch?

regards,
Hiroshi Inoue

*** socket.c.orig	2010-02-04 00:40:55.643000000 +0900
--- socket.c	2010-05-19 08:53:59.429000000 +0900
***************
*** 385,391 ****
  				FD_ZERO(&except_fds);
  				FD_SET(self->socket, &fds);
  				FD_SET(self->socket, &except_fds);
! 				ret = select((int) self->socket + 1, NULL, &fds, &except_fds, timeout > 0 ? &tm : NULL);
  				gerrno = SOCK_ERRNO;
  				if (0 < ret)
  					break;
--- 385,391 ----
  				FD_ZERO(&except_fds);
  				FD_SET(self->socket, &fds);
  				FD_SET(self->socket, &except_fds);
! 				ret = select(1, NULL, &fds, &except_fds, timeout > 0 ? &tm : NULL);
  				gerrno = SOCK_ERRNO;
  				if (0 < ret)
  					break;
***************
*** 497,503 ****
  			tm.tv_sec = retry_count;
  			tm.tv_usec = 0;
  		}
! 		ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout ? NULL: &tm);
  		gerrno = SOCK_ERRNO;
  	} while (ret < 0 && EINTR == gerrno);
  	if (retry_count < 0)
--- 497,503 ----
  			tm.tv_sec = retry_count;
  			tm.tv_usec = 0;
  		}
! 		ret = select(1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout ? NULL : &tm);
  		gerrno = SOCK_ERRNO;
  	} while (ret < 0 && EINTR == gerrno);
  	if (retry_count < 0)
Hiroshi Inoue wrote:
> Hi,
>
> Could you please try the attached patch?

Oops it doesn't seem to work.
Another way is to use poll() instead of select().

regards,
Hiroshi Inoue
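For the archive, the reason passing 1 as the first argument cannot work: select() only examines descriptors 0 through nfds - 1, so with nfds set to 1 it looks at descriptor 0 and ignores the socket. The following is a small sketch that shows the difference using a pipe. The names readable_with_nfds and nfds_demo are hypothetical and exist only for this illustration.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical demo helper: checks fd for readability without blocking,
 * using the caller-supplied nfds value. Returns select()'s result. */
static int readable_with_nfds(int fd, int nfds)
{
    fd_set rfds;
    struct timeval tm = {0, 0};   /* zero timeout: return immediately */

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    return select(nfds, &rfds, NULL, NULL, &tm);
}

/* Puts one byte into a pipe and compares both nfds choices. Returns true
 * when nfds = fd + 1 sees the data and nfds = 1 does not. */
static bool nfds_demo(void)
{
    int p[2];
    int rfd;
    bool ok = false;

    if (pipe(p) != 0)
        return false;
    /* Duplicate the read end onto a descriptor >= 3 so it is never fd 0. */
    rfd = fcntl(p[0], F_DUPFD, 3);
    if (rfd >= 3 && rfd < FD_SETSIZE && write(p[1], "x", 1) == 1)
        ok = readable_with_nfds(rfd, rfd + 1) == 1 &&
             readable_with_nfds(rfd, 1) == 0;
    if (rfd >= 0)
        close(rfd);
    close(p[0]);
    close(p[1]);
    return ok;
}
```

Shrinking nfds also does nothing about the real problem, since FD_SET() has already written outside the fd_set by the time select() is called.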
Hiroshi Inoue <inoue@tpf.co.jp> writes:
> Another way is to use poll() instead of select().

You really need to go in that direction. Changing FD_SETSIZE is completely unworkable --- it will break various libc ABI details.

regards, tom lane
Tom Lane wrote:
> Hiroshi Inoue <inoue@tpf.co.jp> writes:
>> Another way is to use poll() instead of select().
>
> You really need to go in that direction. Changing FD_SETSIZE is
> completely unworkable --- it will break various libc ABI details.

Thanks. I have already made a patch to use poll() if the function is available. I will post it later.

regards,
Hiroshi Inoue
Hiroshi Inoue wrote:
> Hiroshi Inoue wrote:
>> Hi,
>>
>> Could you please try the attached patch?
>
> Oops it doesn't seem to work.
> Another way is to use poll() instead of select().

OK I made a patch to use poll().
Please #define HAVE_POLL e.g. in config.h and try the attached patch.

regards,
Hiroshi Inoue

diff -c ../psqlodbc/socket.c ./socket.c
*** ../psqlodbc/socket.c	2010-01-11 09:56:18.605000000 +0900
--- ./socket.c	2010-05-19 17:03:10.874000000 +0900
***************
*** 350,357 ****
--- 350,362 ----
  		if (connect(self->socket, (struct sockaddr *) &(self->sadr_area), self->sadr_len) < 0)
  		{
  			int	ret, optval;
+ 			int	wait_sec = 0;
+ #ifdef	HAVE_POLL
+ 			struct pollfd	fds;
+ #else
  			fd_set	fds, except_fds;
  			struct	timeval	tm;
+ #endif /* HAVE_POLL */
  			socklen_t	optlen = sizeof(optval);
  			time_t	t_now, t_finish = 0;
  			BOOL	tm_exp = FALSE;
***************
*** 377,391 ****
  			{
  				t_now = time(NULL);
  				t_finish = t_now + timeout;
! 				tm.tv_sec = timeout;
! 				tm.tv_usec = 0;
  			}
  			do {
  				FD_ZERO(&fds);
  				FD_ZERO(&except_fds);
  				FD_SET(self->socket, &fds);
  				FD_SET(self->socket, &except_fds);
  				ret = select((int) self->socket + 1, NULL, &fds, &except_fds, timeout > 0 ? &tm : NULL);
  				gerrno = SOCK_ERRNO;
  				if (0 < ret)
  					break;
--- 382,404 ----
  			{
  				t_now = time(NULL);
  				t_finish = t_now + timeout;
! 				wait_sec = timeout;
  			}
  			do {
+ #ifdef	HAVE_POLL
+ 				fds.fd = self->socket;
+ 				fds.events = POLLOUT;
+ 				fds.revents = 0;
+ 				ret = poll(&fds, 1, timeout > 0 ? wait_sec * 1000 : -1);
+ #else
+ 				tm.tv_sec = wait_sec;
+ 				tm.tv_usec = 0;
  				FD_ZERO(&fds);
  				FD_ZERO(&except_fds);
  				FD_SET(self->socket, &fds);
  				FD_SET(self->socket, &except_fds);
  				ret = select((int) self->socket + 1, NULL, &fds, &except_fds, timeout > 0 ? &tm : NULL);
+ #endif /* HAVE_POLL */
  				gerrno = SOCK_ERRNO;
  				if (0 < ret)
  					break;
***************
*** 398,407 ****
  					if (t_now = time(NULL), t_now >= t_finish)
  						tm_exp = TRUE;
  					else
! 					{
! 						tm.tv_sec = (long) (t_finish - t_now);
! 						tm.tv_usec = 0;
! 					}
  				}
  			} while (!tm_exp);
  			if (tm_exp)
--- 411,417 ----
  					if (t_now = time(NULL), t_now >= t_finish)
  						tm_exp = TRUE;
  					else
! 						wait_sec = t_finish - t_now;
  				}
  			} while (!tm_exp);
  			if (tm_exp)
***************
*** 475,482 ****
--- 485,496 ----
  static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count)
  {
  	int	ret, gerrno;
+ #ifdef	HAVE_POLL
+ 	struct pollfd	fds;
+ #else
  	fd_set	fds, except_fds;
  	struct	timeval	tm;
+ #endif /* HAVE_POLL */
  	BOOL	no_timeout = TRUE;

  	if (0 == retry_count)
***************
*** 488,493 ****
--- 502,513 ----
  		no_timeout = TRUE;
  #endif /* USE_SSL */
  	do {
+ #ifdef	HAVE_POLL
+ 		fds.fd = sock->socket;
+ 		fds.events = output ? POLLOUT : POLLIN;
+ 		fds.revents = 0;
+ 		ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000);
+ #else
  		FD_ZERO(&fds);
  		FD_ZERO(&except_fds);
  		FD_SET(sock->socket, &fds);
***************
*** 498,503 ****
--- 518,524 ----
  			tm.tv_usec = 0;
  		}
  		ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout ? NULL: &tm);
+ #endif /* HAVE_POLL */
  		gerrno = SOCK_ERRNO;
  	} while (ret < 0 && EINTR == gerrno);
  	if (retry_count < 0)
diff -c ../psqlodbc/socket.h ./socket.h
*** ../psqlodbc/socket.h	2010-01-11 09:56:31.371000000 +0900
--- ./socket.h	2010-05-19 13:15:50.157000000 +0900
***************
*** 21,26 ****
--- 21,29 ----
  #ifndef WIN32
  #define WSAAPI
+ #ifdef	HAVE_POLL
+ #include <poll.h>
+ #endif /* HAVE_POLL_H */
  #include <sys/types.h>
  #include <sys/socket.h>
  #include <sys/un.h>
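For readers who want the idea of the patch without the surrounding driver code, here is a standalone sketch of a poll()-based wait on a single descriptor. It is not the psqlodbc function: the name wait_for_ready and its parameters are assumptions made for this illustration. The point is that struct pollfd stores the descriptor as a plain int, so there is no FD_SETSIZE ceiling on its value.

```c
#include <errno.h>
#include <poll.h>
#include <stdbool.h>

/* Hypothetical helper, not the psqlodbc function: waits until fd is
 * readable (output == false) or writable (output == true).
 * timeout_sec < 0 waits forever. Returns poll()'s result:
 * > 0 ready, 0 timed out, < 0 error with errno set.
 * Note: after EINTR the full timeout is started again. */
static int wait_for_ready(int fd, bool output, int timeout_sec)
{
    struct pollfd pfd;
    int ret;

    do {
        pfd.fd = fd;
        pfd.events = output ? POLLOUT : POLLIN;
        pfd.revents = 0;
        ret = poll(&pfd, 1, timeout_sec < 0 ? -1 : timeout_sec * 1000);
    } while (ret < 0 && errno == EINTR);
    return ret;
}
```

A caller that needs an exact deadline would have to recompute the remaining time after each interruption, as the connect() loop in the patch does with t_finish.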
Tom, what libc details will be broken by setting FD_SETSIZE to a larger number? I'm curious for my own knowledge base. I can see where it might cause 'data' sizes to grow, which might break things. Anything else?
Barry Nicholson
On 05/19/2010 06:44 PM, Hiroshi Inoue wrote:
> Hiroshi Inoue wrote:
>> Hiroshi Inoue wrote:
>>> Hi,
>>> Could you please try the attached patch?
>> Oops it doesn't seem to work.
>> Another way is to use poll() instead of select().
> OK I made a patch to use poll().
> Please #define HAVE_POLL e.g. in config.h and try the attached patch.
> regards,
> Hiroshi Inoue
"B. Nicholson" <b.nicholson@niceng.com> writes:
> Tom, what libc details will be broken by setting FD_SETSIZE to a larger
> number? I'm curious for my own knowledge base. I can see where it
> might cause 'data' sizes to grow, which might break things. Anything else?

I'm not too sure, honestly. I can tell you that this exact point came up recently on a Red Hat internal mailing list, and no less an authority than Ulrich Drepper said "you can't do that, it'll break things". He didn't say exactly what though. It's possible that on non-glibc-based platforms, you could get away with it.

regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not too sure, honestly. I can tell you that this exact point came up
> recently on a Red Hat internal mailing list, and no less an authority
> than Ulrich Drepper said "you can't do that, it'll break things". He
> didn't say exactly what though. It's possible that on non-glibc-based
> platforms, you could get away with it.

I'd guess that, as FD_SETSIZE is a macro used at compile time (including compile time of libc), changing it later without jumping through hoops in the implementation will cause inconsistencies between the size of structures or arrays passed between the application and libc.

At the risk of topic drift and providing more information than people want to know (but think of the archives! :-), here is some additional information.

Summary:

a) you can't rely on changing FD_SETSIZE for select(2)
b) poll(2) is preferable to select(2) for performance
c) interfaces that should perform better than either select(2) or poll(2) are:
   i. /dev/poll (Solaris, HP-UX)
   ii. epoll (Linux)
   iii. pollset (AIX)
   iv. kqueue (*BSDs)

There is some hope of maintaining portability across this newer, non-standardised set of interfaces with libevent.

PostgreSQL seems to use poll() if it's available in some places, and select() in others. (And I don't know about the Windows code.) For small numbers of file descriptors, especially on non-hot code paths, it's not going to matter. In general it would IMHO be nice to use poll() consistently when it's available and not emulated via select().

Whether there is a performance gain to be had by using the non-portable solutions I don't know. It would be interesting to see some measurements, but I wouldn't necessarily expect so: the newer interfaces (certainly /dev/poll) were driven by the needs of high performance web servers with high numbers of connections, which may be too different a use case from PostgreSQL to see a notable benefit.

Based on some micro benchmarks I did some years back (on now non-current OS releases which I shall not name), I would not assume that the relative performance of these interfaces (that is, select v. poll v. whatever alternative local enhancement has been created) would be consistent: you may find systems with relatively well performing select(2) implementations.

Re point (a):

For POSIX, FD_SETSIZE is not documented as being changeable by the application, implying that it shouldn't be altered by portable applications:

http://www.opengroup.org/onlinepubs/009695399/basedefs/sys/select.h.html

That's from POSIX a.k.a. IEEE Std 1003.1, 2004 Edition, a.k.a. the "Single Unix Specification Version 3".

I'm no Linux guru, so I'll take Tom's and Ulrich Drepper's word for the behaviour there.

Some operating systems _do_ allow applications to alter FD_SETSIZE, including at least HP-UX:

http://docs.hp.com/en/B2355-60130/select.2.html

Re point (b):

It is "well known" that poll(2) is more efficient than select(2), as the sets of file descriptors don't have to be reset before each call as they are in select(2). (Sorry, no good reference to hand, and I'm sure someone's had an exception somewhere, at least when poll() was emulated via select()!)

Re point (c):

i. /dev/poll was introduced in Solaris 7, and added some time later to HP-UX:
   http://developers.sun.com/solaris/articles/polling_efficient.html
   http://docs.hp.com/en/B2355-60130/poll.7.html
ii. Linux preferred to introduce epoll(7):
   http://www.kernel.org/doc/man-pages/online/pages/man4/epoll.4.html
iii. IBM preferred pollset for AIX:
   http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.basetechref/doc/basetrf1/pollset.htm
iv. The *BSDs have developed kqueue; originally in FreeBSD but adopted by NetBSD and OpenBSD:
   http://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2
   http://netbsd.gw.com/cgi-bin/man-cgi?kqueue++NetBSD-5.0
   http://www.openbsd.org/cgi-bin/man.cgi?query=kqueue&sektion=2
v. libevent:
   http://www.monkey.org/~provos/libevent/

Regards,
Giles

P.S. No, this isn't the nightmare. I just woke _up_ from the nightmare. :-) Now, back to sleep ...
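To illustrate point (b) for the archive: poll() reports results only in the revents field of each entry, so the fd and events fields set up once remain valid for later calls, whereas select() overwrites its fd_set arguments and they must be rebuilt every time. A small sketch follows, assuming a hypothetical helper named count_readable; it does not retry on EINTR.

```c
#include <poll.h>

/* Hypothetical helper: runs poll() over an array that the caller set up
 * once and returns how many entries are readable. poll() writes only the
 * revents fields, so fd and events survive from call to call.
 * Returns 0 on timeout and a negative value on error. */
static int count_readable(struct pollfd *fds, nfds_t n, int timeout_ms)
{
    int ready = poll(fds, n, timeout_ms);
    int count = 0;

    if (ready <= 0)
        return ready;
    for (nfds_t i = 0; i < n; i++) {
        if (fds[i].revents & POLLIN)
            count++;
    }
    return count;
}
```

The same array can be passed to count_readable repeatedly without re-initialising it, which is the saving described above.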
Works great! We've run thousands of tests through the system with your poll change. No problems.
We're going to run some more tests (just to make sure) in the morning. Our goal with the tests in the morning is to really load test the system (5000+ connections all doing selects randomly). Nice job.
Barry Nicholson