[stunnel-users] 4.14b1+0.9.7i client loop in s_poll_wait/scan_waiting_queue

Richard.Hall Richard.Hall at ingenta.com
Thu Oct 27 16:16:05 CEST 2005


Obviously stunnel problems are like London buses - you wait years for one,
then two come along inside a week ;-)

Having fixed my server-side problems by installing 4.14b1, I now have a
problem on the client side. Apologies for the voluminous diagnostics
below, but hopefully they will help make the problem clear.

I am running Apache + mod_jk, stunneling to a remote JBoss server. The
client is

   stunnel 4.14 on i686-pc-linux-gnu UCONTEXT+POLL+IPv4 with OpenSSL
   0.9.7i 14 Oct 2005

When things are running normally, everything is sweetness and light.
However, if I 'kill -HUP' the Apache server (eg to do a log rotation), the
mod_jk connection to stunnel gets closed, and stunnel goes into a loop. If
I enable DEBUG_UCONTEXT in network.c, and add a few lines of diagnostics:-

$ diff -c network.c.orig network.c.loop
*** network.c.orig      Thu Oct 20 08:37:47 2005
--- network.c.loop      Thu Oct 27 13:07:10 2005
***************
*** 31,37 ****
  #include "common.h"
  #include "prototypes.h"

! /* #define DEBUG_UCONTEXT */

  #ifndef USE_WIN32
  static int signal_pipe[2]={-1, -1};
--- 31,37 ----
  #include "common.h"
  #include "prototypes.h"

! #define DEBUG_UCONTEXT

  #ifndef USE_WIN32
  static int signal_pipe[2]={-1, -1};
***************
*** 156,167 ****
          for(i=0; i<ctx->fds->nfds; i++) {
              ctx->fds->ufds[i].revents=ufds[nfds].revents;
  #ifdef DEBUG_UCONTEXT
!             s_log(LOG_DEBUG, "CONTEXT %ld, FD=%d, (%s%s)->(%s%s)",
                  ctx->id, ufds[nfds].fd,
                  ufds[nfds].events & POLLIN ? "IN" : "",
                  ufds[nfds].events & POLLOUT ? "OUT" : "",
                  ufds[nfds].revents & POLLIN ? "IN" : "",
!                 ufds[nfds].revents & POLLOUT ? "OUT" : "");
  #endif
              if(ufds[nfds].revents & (POLLIN|POLLOUT))
                  ctx->ready++;
--- 156,170 ----
          for(i=0; i<ctx->fds->nfds; i++) {
              ctx->fds->ufds[i].revents=ufds[nfds].revents;
  #ifdef DEBUG_UCONTEXT
!             s_log(LOG_DEBUG, "CONTEXT %ld, FD=%d, (%s%s)->(%s%s%s%s%s)",
                  ctx->id, ufds[nfds].fd,
                  ufds[nfds].events & POLLIN ? "IN" : "",
                  ufds[nfds].events & POLLOUT ? "OUT" : "",
                  ufds[nfds].revents & POLLIN ? "IN" : "",
!                 ufds[nfds].revents & POLLOUT ? "OUT" : "",
!                 ufds[nfds].revents & POLLERR ? "ERR" : "",
!                 ufds[nfds].revents & POLLHUP ? "HUP" : "",
!                 ufds[nfds].revents & POLLNVAL ? "NVAL" : "");
  #endif
              if(ufds[nfds].revents & (POLLIN|POLLOUT))
                  ctx->ready++;

I get (with fd=9 being the mod_jk socket, fd=10 the SSL one, and
TIMEOUTidle set to 35 seconds for demo purposes) ...

2005.10.27 13:13:42 LOG7[1286:0]: Waiting 35 second(s) for 7 file
descriptor(s)
2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->()
2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->()
2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->()
2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->()
2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->()
2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 2, FD=9, (INOUT)->(OUT)
2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->(IN)
2005.10.27 13:13:42 LOG7[1286:2]: SSL_read returned WANT_READ: retrying
2005.10.27 13:13:42 LOG7[1286:0]: Waiting 35 second(s) for 7 file
descriptor(s)
2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->()
2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->()
2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->()
2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->()
2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->()
2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 2, FD=9, (IN)->(IN)
2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->()
2005.10.27 13:13:43 LOG7[1286:2]: Socket closed on read

... ie mod_jk has disconnected. Everything carries on for a while ...

2005.10.27 13:13:43 LOG7[1286:2]: SSL write shutdown
2005.10.27 13:13:43 LOG7[1286:0]: Waiting 60 second(s) for 6 file
descriptor(s)
2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->()
2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->()
2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->()
2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->()
2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->()
2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 2, FD=10, (INOUT)->(OUT)
2005.10.27 13:13:43 LOG7[1286:2]: SSL alert (write): warning: close notify
2005.10.27 13:13:43 LOG7[1286:2]: SSL_shutdown retrying
2005.10.27 13:13:43 LOG7[1286:2]: SSL doesn't need to read or write
2005.10.27 13:13:43 LOG7[1286:0]: Waiting 60 second(s) for 6 file
descriptor(s)
2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->()
2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->()
2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->()
2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->()
2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->()
2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->(IN)
2005.10.27 13:13:45 LOG7[1286:0]: Waiting 35 second(s) for 7 file
descriptor(s)
2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->()
2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->()
2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->()
2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->()
2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->()
2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->()
2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 2, FD=9, (OUT)->(OUT)
[...]
2005.10.27 13:13:46 LOG7[1286:0]: Waiting 60 second(s) for 6 file
descriptor(s)
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->(IN)
2005.10.27 13:13:46 LOG7[1286:2]: SSL_read returned WANT_READ: retrying
2005.10.27 13:13:46 LOG7[1286:0]: Waiting 60 second(s) for 6 file
descriptor(s)
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->(IN)
2005.10.27 13:13:46 LOG7[1286:0]: Waiting 35 second(s) for 7 file
descriptor(s)
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 2, FD=9, (OUT)->(ERRHUP)

... but eventually we start getting ERRHUP (ie POLLERR | POLLHUP), not
OUT, on fd 9, and then we end up in a tight loop of

2005.10.27 13:13:46 LOG7[1286:0]: Waiting 35 second(s) for 7 file
descriptor(s)
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->()
2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 2, FD=9, (OUT)->(ERRHUP)

... until eventually we get

[...]
2005.10.27 13:14:23 LOG7[1286:0]: Waiting 2 second(s) for 6 file
descriptor(s)
2005.10.27 13:14:23 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->()
2005.10.27 13:14:23 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->()
2005.10.27 13:14:23 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->()
2005.10.27 13:14:23 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->()
2005.10.27 13:14:23 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->()
2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 2, FD=9, (OUT)->(ERRHUP)
2005.10.27 13:14:25 LOG7[1286:0]: Waiting 0 second(s) for 6 file
descriptor(s)
2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->()
2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->()
2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->()
2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->()
2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->()
2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 2, FD=9, (OUT)->(ERRHUP)
2005.10.27 13:14:25 LOG6[1286:2]: s_poll_wait timeout: connection reset
2005.10.27 13:14:25 LOG5[1286:2]: Connection reset: 260 bytes sent to SSL,
24734 bytes sent to socket

The problem seems to be in network.c:s_poll_wait(), in the code

        while(!ready_head) /* no context ready */
            scan_waiting_queue();

because poll() returns > 0, but scan_waiting_queue() never sets ready_head
because no fd has yielded POLLIN or POLLOUT:-

            if(ufds[nfds].revents & (POLLIN|POLLOUT))
                ctx->ready++;

The end result is that we chew up CPU until TIMEOUTidle expires (which is
rather a long time if left at the default of 12 hours!!)

I have fudged a circumvention as follows (because I couldn't follow the
logic well enough to do it cleanly, and time is running out on me, I'm
afraid.)

$ diff -c network.c.loop network.c
*** network.c.loop      Thu Oct 27 13:07:10 2005
--- network.c   Thu Oct 27 13:05:21 2005
***************
*** 166,172 ****
                  ufds[nfds].revents & POLLHUP ? "HUP" : "",
                  ufds[nfds].revents & POLLNVAL ? "NVAL" : "");
  #endif
!             if(ufds[nfds].revents & (POLLIN|POLLOUT))
                  ctx->ready++;
              nfds++;
          }
--- 166,172 ----
                  ufds[nfds].revents & POLLHUP ? "HUP" : "",
                  ufds[nfds].revents & POLLNVAL ? "NVAL" : "");
  #endif
!             if(ufds[nfds].revents & (POLLIN|POLLOUT|POLLERR|POLLHUP|POLLNVAL))
                  ctx->ready++;
              nfds++;
          }

... which seems to solve the problem but produces a rather untidy result:-

2005.10.27 13:17:46 LOG7[1759:0]: Waiting 35 second(s) for 7 file
descriptor(s)
2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=3, (IN)->()
2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=5, (IN)->()
2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=6, (IN)->()
2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=7, (IN)->()
2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=8, (IN)->()
2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 2, FD=10, (IN)->(IN)
2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 2, FD=9, (OUT)->(OUT)
2005.10.27 13:17:46 LOG7[1759:0]: Waiting 35 second(s) for 6 file
descriptor(s)
2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=3, (IN)->()
2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=5, (IN)->()
2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=6, (IN)->()
2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=7, (IN)->()
2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=8, (IN)->()
2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 2, FD=9, (OUT)->(ERRHUP)
2005.10.27 13:17:46 LOG3[1759:2]: INTERNAL ERROR: s_poll_wait returned 1,
but no descriptor is ready
2005.10.27 13:17:46 LOG5[1759:2]: Connection reset: 260 bytes sent to SSL,
41132 bytes sent to socket

... i.e it sorts itself out quickly, but at the cost of logging an
INTERNAL ERROR, which doesn't look very good ;-) As far as I could tell
(but haven't shown), everything actually got close'd correctly.

As I say, I appear to have a circumvention, so I'm not hanging on a proper
fix (unless anyone can see any fatal flaw in what I've done?)

Regards,
 Richard




More information about the stunnel-users mailing list