Thread: FD_SETSIZE with large #s of files/ports in use

FD_SETSIZE with large #s of files/ports in use

From
Barry Nicholson
Date:

An interesting issue came up the other day.  We are working with an application that opens a considerable number of files and TCP/UDP ports (>3000).  Unfortunately, that means that the ODBC driver sometimes fails due to a corrupted stack.  We eventually figured out what was causing the corruption.

The SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count) function inside socket.c calls select().  Unfortunately, the socket file descriptor number can be quite large by that point, which means that the fd_set fds variable can be misused: FD_SET on a descriptor numbered FD_SETSIZE or higher writes past the end of the fd_set.  On many Linux versions the fd_set type only allows 1024 file descriptors to be used by the calling program.  This can be changed by setting FD_SETSIZE or __FD_SETSIZE to a larger number.  We have run tests where we changed the __FD_SETSIZE value in /usr/src/...linuxversion../linux/include/linux/posix_types.h.  The fix worked well.

Unfortunately, this isn't a good solution because a software update to another Linux version will invalidate our fix.  We've tried various mechanisms to set FD_SETSIZE or __FD_SETSIZE in socket.c, but with no luck.  Has anyone else had this problem and come up with a good fix?  Or is there a better solution?

Barry Nicholson
Niceng.com
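
As an illustration of the overflow described above, here is a minimal sketch of a range check that could guard FD_SET.  It is not code from psqlodbc; the helper names fd_fits_in_fd_set and safe_fd_set are hypothetical.

```c
#include <sys/select.h>

/*
 * Hypothetical helper (not part of psqlodbc): report whether fd may be
 * passed to FD_SET/FD_ISSET.  An fd_set only has room for descriptors
 * 0 .. FD_SETSIZE-1; FD_SET on anything larger writes outside the
 * structure, which can corrupt the stack when the fd_set is a local.
 */
static int fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/*
 * Add fd to set only when it is in range.
 * Returns 0 on success, -1 if the descriptor cannot be used with select().
 */
static int safe_fd_set(int fd, fd_set *set)
{
    if (!fd_fits_in_fd_set(fd))
        return -1;
    FD_SET(fd, set);
    return 0;
}
```

A check like this turns silent memory corruption into a reportable error, but it does not make select() usable for large descriptor numbers.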

Re: FD_SETSIZE with large #s of files/ports in use

From
Hiroshi Inoue
Date:
Hi,

Could you please try the attached patch?

regards,
Hiroshi Inoue

*** socket.c.orig    2010-02-04 00:40:55.643000000 +0900
--- socket.c    2010-05-19 08:53:59.429000000 +0900
***************
*** 385,391 ****
              FD_ZERO(&except_fds);
              FD_SET(self->socket, &fds);
              FD_SET(self->socket, &except_fds);
!             ret = select((int) self->socket + 1, NULL, &fds, &except_fds, timeout > 0 ? &tm : NULL);
              gerrno = SOCK_ERRNO;
              if (0 < ret)
                  break;
--- 385,391 ----
              FD_ZERO(&except_fds);
              FD_SET(self->socket, &fds);
              FD_SET(self->socket, &except_fds);
!             ret = select(1, NULL, &fds, &except_fds, timeout > 0 ? &tm : NULL);
              gerrno = SOCK_ERRNO;
              if (0 < ret)
                  break;
***************
*** 497,503 ****
              tm.tv_sec = retry_count;
              tm.tv_usec = 0;
          }
!         ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout ? NULL: &tm);
          gerrno = SOCK_ERRNO;
      } while (ret < 0 && EINTR == gerrno);
      if (retry_count < 0)
--- 497,503 ----
              tm.tv_sec = retry_count;
              tm.tv_usec = 0;
          }
!         ret = select(1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout ? NULL : &tm);
          gerrno = SOCK_ERRNO;
      } while (ret < 0 && EINTR == gerrno);
      if (retry_count < 0)

Re: FD_SETSIZE with large #s of files/ports in use

From
Hiroshi Inoue
Date:
Hiroshi Inoue wrote:
> Hi,
>
> Could you please try the attached patch?

Oops it doesn't seem to work.
Another way is to use poll() instead of select().

regards,
Hiroshi Inoue

Re: FD_SETSIZE with large #s of files/ports in use

From
Tom Lane
Date:
Hiroshi Inoue <inoue@tpf.co.jp> writes:
> Another way is to use poll() instead of select().

You really need to go in that direction.  Changing FD_SETSIZE is
completely unworkable --- it will break various libc ABI details.

            regards, tom lane
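
To illustrate the poll() direction, here is a minimal sketch of waiting on a single descriptor.  The helper name wait_fd_ready and its parameters are hypothetical and are not taken from psqlodbc.

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper: wait until fd is readable (for_write == 0) or
 * writable (for_write != 0).  timeout_ms < 0 waits forever.
 * Returns >0 if ready, 0 on timeout, -1 on error (errno set by poll).
 * poll() takes the descriptor as a plain int inside struct pollfd, so
 * its numeric value is not limited by FD_SETSIZE.
 */
static int wait_fd_ready(int fd, int for_write, int timeout_ms)
{
    struct pollfd pfd;
    int ret;

    pfd.fd = fd;
    pfd.events = for_write ? POLLOUT : POLLIN;
    do {
        pfd.revents = 0;
        /* Note: a retry after EINTR restarts the full timeout. */
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);
    return ret;
}
```

Because only one struct pollfd is passed, the cost of the call does not grow with the descriptor number, unlike select(), whose first argument is the highest descriptor plus one.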

Re: FD_SETSIZE with large #s of files/ports in use

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> Hiroshi Inoue <inoue@tpf.co.jp> writes:
>> Another way is to use poll() instead of select().
>
> You really need to go in that direction.  Changing FD_SETSIZE is
> completely unworkable --- it will break various libc ABI details.

Thanks.
I have already made a patch that uses poll() if the function is available.
I will post it later.

regards,
Hiroshi Inoue


Re: FD_SETSIZE with large #s of files/ports in use

From
Hiroshi Inoue
Date:
Hiroshi Inoue wrote:
> Hiroshi Inoue wrote:
>> Hi,
>>
>> Could you please try the attached patch?
>
> Oops it doesn't seem to work.
> Another way is to use poll() instead of select().

OK I made a patch to use poll().
Please #define HAVE_POLL e.g. in config.h and try the attached patch.

regards,
Hiroshi Inoue
diff -c ../psqlodbc/socket.c ./socket.c
*** ../psqlodbc/socket.c    2010-01-11 09:56:18.605000000 +0900
--- ./socket.c    2010-05-19 17:03:10.874000000 +0900
***************
*** 350,357 ****
--- 350,362 ----
      if (connect(self->socket, (struct sockaddr *) &(self->sadr_area), self->sadr_len) < 0)
      {
          int    ret, optval;
+         int    wait_sec = 0;
+ #ifdef    HAVE_POLL
+         struct pollfd fds;
+ #else
          fd_set    fds, except_fds;
          struct    timeval    tm;
+ #endif /* HAVE_POLL */
          socklen_t    optlen = sizeof(optval);
          time_t    t_now, t_finish = 0;
          BOOL    tm_exp = FALSE;
***************
*** 377,391 ****
          {
              t_now = time(NULL);
              t_finish = t_now + timeout;
!             tm.tv_sec = timeout;
!             tm.tv_usec = 0;
          }
          do {
              FD_ZERO(&fds);
              FD_ZERO(&except_fds);
              FD_SET(self->socket, &fds);
              FD_SET(self->socket, &except_fds);
              ret = select((int) self->socket + 1, NULL, &fds, &except_fds, timeout > 0 ? &tm : NULL);
              gerrno = SOCK_ERRNO;
              if (0 < ret)
                  break;
--- 382,404 ----
          {
              t_now = time(NULL);
              t_finish = t_now + timeout;
!             wait_sec = timeout;
          }
          do {
+ #ifdef    HAVE_POLL
+             fds.fd = self->socket;
+             fds.events = POLLOUT;
+             fds.revents = 0;
+             ret = poll(&fds, 1, timeout > 0 ? wait_sec * 1000 : -1);
+ #else
+             tm.tv_sec = wait_sec;
+             tm.tv_usec = 0;
              FD_ZERO(&fds);
              FD_ZERO(&except_fds);
              FD_SET(self->socket, &fds);
              FD_SET(self->socket, &except_fds);
              ret = select((int) self->socket + 1, NULL, &fds, &except_fds, timeout > 0 ? &tm : NULL);
+ #endif /* HAVE_POLL */
              gerrno = SOCK_ERRNO;
              if (0 < ret)
                  break;
***************
*** 398,407 ****
                  if (t_now = time(NULL), t_now >= t_finish)
                      tm_exp = TRUE;
                  else
!                 {
!                     tm.tv_sec = (long) (t_finish - t_now);
!                     tm.tv_usec = 0;
!                 }
              }
          } while (!tm_exp);
          if (tm_exp)
--- 411,417 ----
                  if (t_now = time(NULL), t_now >= t_finish)
                      tm_exp = TRUE;
                  else
!                     wait_sec = t_finish - t_now;
              }
          } while (!tm_exp);
          if (tm_exp)
***************
*** 475,482 ****
--- 485,496 ----
  static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count)
  {
      int    ret, gerrno;
+ #ifdef    HAVE_POLL
+     struct pollfd    fds;
+ #else
      fd_set    fds, except_fds;
      struct    timeval    tm;
+ #endif /* HAVE_POLL */
      BOOL    no_timeout = TRUE;

      if (0 == retry_count)
***************
*** 488,493 ****
--- 502,513 ----
          no_timeout = TRUE;
  #endif /* USE_SSL */
      do {
+ #ifdef    HAVE_POLL
+         fds.fd = sock->socket;
+         fds.events = output ? POLLOUT : POLLIN;
+         fds.revents = 0;
+         ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000);
+ #else
          FD_ZERO(&fds);
          FD_ZERO(&except_fds);
          FD_SET(sock->socket, &fds);
***************
*** 498,503 ****
--- 518,524 ----
              tm.tv_usec = 0;
          }
          ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout ? NULL: &tm);
+ #endif /* HAVE_POLL */
          gerrno = SOCK_ERRNO;
      } while (ret < 0 && EINTR == gerrno);
      if (retry_count < 0)
diff -c ../psqlodbc/socket.h ./socket.h
*** ../psqlodbc/socket.h    2010-01-11 09:56:31.371000000 +0900
--- ./socket.h    2010-05-19 13:15:50.157000000 +0900
***************
*** 21,26 ****
--- 21,29 ----

  #ifndef WIN32
  #define    WSAAPI
+ #ifdef    HAVE_POLL
+ #include <poll.h>
+ #endif /* HAVE_POLL */
  #include <sys/types.h>
  #include <sys/socket.h>
  #include <sys/un.h>
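
The connect-wait logic in the patch above (poll for POLLOUT, then recompute the remaining time after an interruption) can be sketched as a standalone function.  This is an illustrative sketch under assumptions, not the psqlodbc code: the function name wait_connect_complete is hypothetical, and the SO_ERROR check is the usual way to learn the outcome of a non-blocking connect().

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <time.h>

/*
 * Hypothetical sketch: wait for a non-blocking connect() on sock to finish.
 * timeout_sec <= 0 means wait indefinitely.
 * Returns 0 when connected, -1 on failure or timeout (errno is set;
 * ETIMEDOUT on timeout).
 */
static int wait_connect_complete(int sock, int timeout_sec)
{
    struct pollfd pfd;
    time_t t_finish = 0;
    int wait_ms = -1;
    int ret, optval = 0;
    socklen_t optlen = sizeof(optval);

    if (timeout_sec > 0)
    {
        t_finish = time(NULL) + timeout_sec;
        wait_ms = timeout_sec * 1000;
    }
    for (;;)
    {
        pfd.fd = sock;
        pfd.events = POLLOUT;
        pfd.revents = 0;
        ret = poll(&pfd, 1, wait_ms);
        if (ret > 0)
            break;
        if (ret < 0 && errno != EINTR)
            return -1;
        if (timeout_sec > 0)
        {
            /* Timed out or interrupted: recompute the time left. */
            time_t t_now = time(NULL);

            if (ret == 0 || t_now >= t_finish)
            {
                errno = ETIMEDOUT;
                return -1;
            }
            wait_ms = (int) (t_finish - t_now) * 1000;
        }
    }
    /* The socket is writable or in error; SO_ERROR tells which. */
    if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &optval, &optlen) < 0)
        return -1;
    if (optval != 0)
    {
        errno = optval;
        return -1;
    }
    return 0;
}
```

The deadline is kept as an absolute time so that repeated signal interruptions cannot extend the total wait beyond timeout_sec.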

Re: FD_SETSIZE with large #s of files/ports in use

From
"B. Nicholson"
Date:
Yes, we'll try the patch in the morning.

Tom, what libc details will be broken by setting FD_SETSIZE to a larger number?  I'm curious for my own knowledge base.  I can see where it might cause 'data' sizes to grow, which might break things.  Anything else?

Barry Nicholson

Re: FD_SETSIZE with large #s of files/ports in use

From
Tom Lane
Date:
"B. Nicholson" <b.nicholson@niceng.com> writes:
> Tom, what libc details will be broken by setting FD_SETSIZE to a larger
> number?   I'm curious for my own knowledge base.   I can see where it
> might cause 'data' sizes to grow which might break things.  Anything else?

I'm not too sure, honestly.  I can tell you that this exact point came up
recently on a Red Hat internal mailing list, and no less an authority
than Ulrich Drepper said "you can't do that, it'll break things".  He
didn't say exactly what though.  It's possible that on non-glibc-based
platforms, you could get away with it.

            regards, tom lane

Re: FD_SETSIZE with large #s of files/ports in use

From
Giles Lean
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "B. Nicholson" <b.nicholson@niceng.com> writes:
> > Tom, what libc details will be broken by setting FD_SETSIZE to a larger
> > number?   I'm curious for my own knowledge base.   I can see where it
> > might cause 'data' sizes to grow which might break things.  Anything else?
>
> I'm not too sure, honestly.  I can tell you that this exact point came up
> recently on a Red Hat internal mailing list, and no less an authority
> than Ulrich Drepper said "you can't do that, it'll break things".  He
> didn't say exactly what though.  It's possible that on non-glibc-based
> platforms, you could get away with it.

I'd guess that, as FD_SETSIZE is a macro used at compile time
(including the compile time of libc), changing it later will
cause inconsistencies between the sizes of structures or arrays
passed between the application and libc, unless the
implementation jumps through hoops to cope.
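
The compile-time nature of the size can be seen directly.  The following is a minimal sketch; the function name fd_set_capacity_bits is hypothetical.

```c
#include <limits.h>
#include <stddef.h>
#include <sys/select.h>

/*
 * Number of descriptor bits an fd_set can hold as compiled.
 * sizeof(fd_set) is fixed when this translation unit is compiled, using
 * the FD_SETSIZE in effect at that time.  Code compiled elsewhere (libc
 * included) made the same decision independently with its own headers.
 */
static size_t fd_set_capacity_bits(void)
{
    return sizeof(fd_set) * CHAR_BIT;
}
```

With the 1024-descriptor limit mentioned earlier in the thread, this would typically report 1024 bits, that is, a 128-byte structure.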

At the risk of topic drift and providing more information than
people want to know (but think of the archives! :-), here is
some additional information.

Summary:

a) you can't rely on changing FD_SETSIZE for select(2)
b) poll(2) is preferable to select(2) for performance
c) interfaces that should perform better than either select(2)
   or poll(2) are:

   i.   /dev/poll (Solaris, HP-UX)
   ii.  epoll (Linux)
   iii. pollset (AIX)
   iv.  kqueue (*BSDs)

   There is some hope of maintaining portability across this
   newer, non-standardised set of interfaces with libevent.

PostgreSQL seems to use poll() if it's available in some
places, and select() in others.  (And I don't know about the
Windows code.)

For small numbers of file descriptors especially on non-hot
code paths, it's not going to matter.  In general it would be
IMHO nice to use poll() consistently when it's available
and not emulated via select().

Whether there is a performance gain to be had by using the
non-portable solutions I don't know.  It would be interesting
to see some measurements, but I wouldn't necessarily expect
one: the newer interfaces (certainly /dev/poll) were driven by
the needs of high performance web servers with high numbers of
connections, which may be too different a use case from
PostgreSQL's to show a notable benefit.

Based on some micro benchmarks I did some years back (on now
non-current OS releases which I shall not name) I would not
assume that the relative performance of these interfaces (that
is, select v. poll v. whatever alternative local enhancement
has been created) would be consistent: you may find systems
with relatively well performing select(2) implementations.

Re point (a):

For POSIX, FD_SETSIZE is not documented as being changeable by
the application, implying that it shouldn't be altered by
portable applications:

  http://www.opengroup.org/onlinepubs/009695399/basedefs/sys/select.h.html

That's from POSIX a.k.a. IEEE Std 1003.1, 2004 Edition, a.k.a. the
"Single Unix Specification Version 3".

I'm no Linux guru, so I'll take Tom's and Ulrich Drepper's
word for the behaviour there.

Some operating systems _do_ allow applications to alter
FD_SETSIZE, including at least HP-UX:

  http://docs.hp.com/en/B2355-60130/select.2.html

Re point (b):

It is "well known" that poll(2) is more efficient than
select(2) as the sets of file descriptors don't have to be
reset before each call as they are in select(2).

(Sorry, no good reference to hand, and I'm sure someone's had
an exception somewhere, at least when poll() was emulated via
select()!).
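
The difference in set handling can be sketched as follows.  This is an illustration only; the helper name count_ready is hypothetical.

```c
#include <poll.h>

/*
 * Hypothetical helper: poll an array that the caller fills in once and
 * then reuses.  poll() reports results in revents and leaves fd and
 * events untouched, so the same array can be passed again as-is.
 * select() overwrites its fd_set arguments with the results, so they
 * have to be rebuilt with FD_ZERO/FD_SET before each call.
 * Returns the number of entries with events pending, or -1 on error.
 */
static int count_ready(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    int ready = 0;
    nfds_t i;

    if (poll(fds, nfds, timeout_ms) < 0)
        return -1;
    for (i = 0; i < nfds; i++)
        if (fds[i].revents != 0)
            ready++;
    return ready;
}
```

A caller would set fd and events for each entry once, then call count_ready() in a loop without touching the array in between.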

Re point (c):

i. /dev/poll was introduced in Solaris 7, and added some time
later to HP-UX:

  http://developers.sun.com/solaris/articles/polling_efficient.html
  http://docs.hp.com/en/B2355-60130/poll.7.html

ii. Linux preferred to introduce epoll(7):

  http://www.kernel.org/doc/man-pages/online/pages/man4/epoll.4.html

iii. IBM preferred pollset for AIX:

  http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.basetechref/doc/basetrf1/pollset.htm

iv. The *BSDs have developed kqueue; originally in FreeBSD but
adopted by NetBSD and OpenBSD:

  http://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2
  http://netbsd.gw.com/cgi-bin/man-cgi?kqueue++NetBSD-5.0
  http://www.openbsd.org/cgi-bin/man.cgi?query=kqueue&sektion=2

v. libevent:

  http://www.monkey.org/~provos/libevent/
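
For comparison with poll(), here is a minimal Linux-only sketch of the epoll interface from item ii.  The function name epoll_wait_readable is hypothetical, and the sketch ignores EINTR handling for brevity.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Linux-only sketch: wait until fd is readable using epoll.
 * Returns 1 if an event is pending, 0 on timeout, -1 on error.
 * A real server would create the epoll instance once and keep it;
 * creating one per call, as here, gives up the performance benefit.
 */
static int epoll_wait_readable(int fd, int timeout_ms)
{
    struct epoll_event ev = {0};
    struct epoll_event out;
    int epfd, n;

    epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    /* Register interest once; the kernel remembers it across waits. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
    {
        close(epfd);
        return -1;
    }
    n = epoll_wait(epfd, &out, 1, timeout_ms);
    close(epfd);
    return n;
}
```

The registration step is what separates this family of interfaces from select() and poll(): the set of descriptors of interest lives in the kernel rather than being passed in on every call.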

Regards,

Giles

P.S. No, this isn't the nightmare.  I just woke _up_ from the
nightmare. :-)  Now, back to sleep ...

Re: FD_SETSIZE with large #s of files/ports in use

From
"B. Nicholson"
Date:
Hiroshi:

Works great!  We've run thousands of tests through the system with your poll change.  No problems.
We're going to run some more tests (just to make sure) in the morning.  Our goal with the tests in the morning is to really load test the system (5000+ connections all doing selects randomly).  Nice job.

Barry Nicholson  
