[stunnel-users] transfer() loop executes not transferring any data and truncated data when using unix sockets

Wed Feb 6 01:07:03 CET 2013

I've found a workaround for the truncated response issue. I managed to
reproduce the problem while running stunnel inside strace and noticed
this behavior:

read(7, "HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\nContent-Length:
> 56391\r\n\r\nXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"...,
> 18432) = 18432
> poll([{fd=7, events=0}, {fd=3, events=POLLIN|POLLOUT}], 2, 43200000) = 2
> ([{fd=7, revents=POLLHUP}, {fd=3, revents=POLLOUT}])
> read(7, "", 0)                    = 0
> sendto(6, "<31>Feb  5 15:08:44 stunnel: LOG7[20119:0]: Socket closed on
> read", 65, MSG_NOSIGNAL, NULL, 0) = 65
> write(2, "2013.02.05 15:08:44 LOG7[20119:0]: Socket closed on read\n", 57)
> = 57

Here read() was being called with a size of zero, which according it the
man page should return 0 and have no other results. I added a and
additional guard to the read from socket portion of transfer() and have not
been able to reproduce the truncated response issue yet. Unfortunately the
transfer loop watchgaurd is still occurring.

Dustin Lundquist
Blue Box Group

On Mon, Feb 4, 2013 at 5:39 PM, Stephen Balukoff <sbalukoff at bluebox.net>wrote:

> Hello, y'all!
>
> I've recruited the help of my colleague Dustin Lundquist, and he's
> been able to create a reproducible test case for this bug, which will
> hopefully shed some more light on this.  We're specifically seeing two
> error behaviors which we haven't yet been able to determine whether
> they're related.  These are:
>
> 1. transfer() loop error, generating an error code telling us to
> contact Michal.Trojnara at mirt.net
> 2. truncated response from the web server (always at byte 46693, for
> whatever reason).
>
> These errors can occur independently or simultaneously, and
> occasionally not at all on different runs of the same script on the
> same machine. When we attempt to strace the stunnel process, the
> problem goes away entirely, which screams to me of a timing issue or
> race condition. In any case, it's a bit of a heisenbug, which is
> annoying in and of itself.
>
> And, again, these go away when we communicate to the back-end web
> server process using TCP sockets instead of unix sockets.
>
> Attached are output from the four permutations of the above bug, as
> well as the sample perl script used to generate them.
>
> If anyone has any ideas of how we might go about fixing this one, or
> knows what significant difference between TCP and unix sockets might
> be causing the error here, we'd love to hear it!
>
> Stephen
>
>
>
> On Thu, Jan 24, 2013 at 6:18 PM, Stephen Balukoff <sbalukoff at bluebox.net>
> wrote:
> > Hello!
> >
> > Recently we started seeing some erroneous behavior in one of our load
> > balancer configurations in which we're using stunnel for SSL tunnel
> > termination with web clients.  Specifically, the setup looks like the
> > following:
> >
> >
> > Web client  ---(SSL)--->  stunnel  ---(unix socket)--> haproxy
> > ---(tcp/http)---> web server
> >
> > After extensive troubleshooting we were able to determine that when
> > the following parameters are present:
> >
> > * Client forces non-keepalive connection
> > * Client generates a POST request, the response to which (generated by
> > the back-end web server) is greater than about 8k
> > * stunnel and haproxy communicate via unix socket
> >
> > ...then most responses end up being truncated by an extraneous TCP RST
> > packet from stunnel, and the following log lines are generated:
> >
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> > transfer() loop executes not transferring any data
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: please
> > report the problem to Michal.Trojnara at mirt.net
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: stunnel
> > 4.54 on x86_64-unknown-linux-gnu platform
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> > Compiled/running with OpenSSL 1.0.0-fips 29 Mar 2010
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> > Threading:PTHREAD SSL:+ENGINE+OCSP+FIPS Auth:none Sockets:POLL+IPv6
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> > protocol=TLSv1, SSL_pending=0
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> > sock_open_rd=n, sock_open_wr=Y
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> > SSL_RECEIVED_SHUTDOWN=n, SSL_SENT_SHUTDOWN=Y
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> > sock_can_rd=Y, sock_can_wr=n
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> > ssl_can_rd=n, ssl_can_wr=n
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> > read_wants_read=Y, read_wants_write=n
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> > write_wants_read=n, write_wants_write=n
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> > shutdown_wants_read=n, shutdown_wants_write=n
> > Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: socket
> > input buffer: 0 byte(s), ssl input buffer: 0 byte(s)
> >
> > The work arounds we have found are either:
> >
> > 1.  Make stunnel and haproxy communicate over a TCP socket bound to
> > the loopback interface (less ideal, as then we're limited to 65k
> > connections)
> > 2.  Add the "reset=no" parameter to our stunnel.conf (which seems to
> > fix the problem, but makes me wonder if we're not opening ourselves up
> > to other problems--  like really getting stuck in an infinite loop
> > that that watchdog is intended to avoid.)
> >
> > In any case based on what "reset=no" is supposed to do as well as
> > where that first line of the error output above is being generated in
> > the code, it's clear that the main client data transfer loop in
> > client.c (starting line 519) is being called too often with no data to
> > transfer, which triggers the infinite loop detection in the transfer()
> > function (client.c, starting line 832).
> >
> > My best hypothesis at this point is that the transfer() code was
> > written only with TCP sockets in mind, and that something about the
> > unix socket in use here is triggering that loop too often.  Maybe
> > s_poll_wait on line 541 is being called with parameters which are
> > right for a TCP socket but inappropriate for a unix socket?  At this
> > point though, we're at the limit of my socket programming ability, so
> > I'm a little stuck.  (Would like to have simply fixed this and
> > submitted a patch...  but cest la vie, eh.)
> >
> > We'd like to avoid using a TCP socket to communicate between stunnel
> > and haproxy simply because it's a little less efficient, and because
> > there's an artificial 65k limit on the number of back-end connections
> > we can support per loopback IP if we do that.  Alternatively, if
> > someone could verify that "reset=no" is an acceptable / safe work
> > around in this case, I'd appreciate that (though, of course, I'd
> > prefer my logs not to get spammed on every connection. :P )
> >
> > Is there any other data I can provide to help troubleshoot and fix this?
> >
> > Thanks,
> > Stephen
> >
> >
> >
> > --
> > Stephen Balukoff
> > Blue Box Group, LLC
> > (800)613-4305 x807
>
>
>
> --
> Stephen Balukoff
> Blue Box Group, LLC
> (800)613-4305 x807
>
> _______________________________________________
> stunnel-users mailing list
> stunnel-users at stunnel.org
> https://www.stunnel.org/cgi-bin/mailman/listinfo/stunnel-users
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.stunnel.org/pipermail/stunnel-users/attachments/20130205/c840f3a1/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Fix-truncated-response-issue.patch
Type: application/octet-stream
Size: 1031 bytes
Desc: not available
URL: <http://www.stunnel.org/pipermail/stunnel-users/attachments/20130205/c840f3a1/attachment.obj>