[stunnel-users] transfer() loop executes not transferring any data and truncated data when using unix sockets

Fri Jan 25 03:18:53 CET 2013

Hello!

Recently we started seeing some erroneous behavior in one of our load
balancer configurations in which we're using stunnel for SSL tunnel
termination with web clients.  Specifically, the setup looks like the
following:

Web client ---(SSL)---> stunnel ---(unix socket)---> haproxy ---(tcp/http)---> web server

After extensive troubleshooting we determined that when all of the
following conditions hold:

* The client forces a non-keepalive connection
* The client generates a POST request, the response to which (generated
by the back-end web server) is larger than about 8k
* stunnel and haproxy communicate via unix socket

...then most responses are truncated by an extraneous TCP RST packet
from stunnel, and the following log lines are generated:

Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: transfer() loop executes not transferring any data
Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: please report the problem to Michal.Trojnara at mirt.net
Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: stunnel 4.54 on x86_64-unknown-linux-gnu platform
Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: Compiled/running with OpenSSL 1.0.0-fips 29 Mar 2010
Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: Threading:PTHREAD SSL:+ENGINE+OCSP+FIPS Auth:none Sockets:POLL+IPv6
Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: protocol=TLSv1, SSL_pending=0
Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: sock_open_rd=n, sock_open_wr=Y
Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: SSL_RECEIVED_SHUTDOWN=n, SSL_SENT_SHUTDOWN=Y
Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: sock_can_rd=Y, sock_can_wr=n
Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: ssl_can_rd=n, ssl_can_wr=n
Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: read_wants_read=Y, read_wants_write=n
Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: write_wants_read=n, write_wants_write=n
Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: shutdown_wants_read=n, shutdown_wants_write=n
Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: socket input buffer: 0 byte(s), ssl input buffer: 0 byte(s)
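
For anyone skimming the state dump above, here is a minimal sketch that
parses the Y/n flags and points out the one combination that looks odd
to me: sock_can_rd=Y while sock_open_rd=n. The flag names are copied
from the log; the helper names (parse_flags, readiness_without_open)
and my reading of what the flags mean are assumptions, not something
taken from the stunnel source.

```python
import re

# Flag lines copied from the log output above (timestamps stripped).
STATE_DUMP = """
sock_open_rd=n, sock_open_wr=Y
SSL_RECEIVED_SHUTDOWN=n, SSL_SENT_SHUTDOWN=Y
sock_can_rd=Y, sock_can_wr=n
ssl_can_rd=n, ssl_can_wr=n
read_wants_read=Y, read_wants_write=n
write_wants_read=n, write_wants_write=n
shutdown_wants_read=n, shutdown_wants_write=n
"""


def parse_flags(text):
    """Return {name: bool} for every name=Y / name=n pair found in text."""
    return {name: value == "Y"
            for name, value in re.findall(r"(\w+)=([Yn])\b", text)}


def readiness_without_open(flags):
    """List socket directions reported ready although not marked open.

    Assumption: *_can_* means "poll reported ready" and *_open_* means
    "this direction is still open". That is inferred from the names only.
    """
    pairs = [("sock_can_rd", "sock_open_rd"), ("sock_can_wr", "sock_open_wr")]
    return [can for can, is_open in pairs
            if flags.get(can) and not flags.get(is_open)]


if __name__ == "__main__":
    print(readiness_without_open(parse_flags(STATE_DUMP)))
```

If that reading of the names is right, the socket is being reported as
readable after its read side was already closed, and none of the other
flags allow any data to move.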

We have found two workarounds:

1.  Make stunnel and haproxy communicate over a TCP socket bound to
the loopback interface (less ideal, as we are then limited to about
65k connections).
2.  Add the "reset=no" parameter to our stunnel.conf. This seems to
fix the problem, but makes me wonder whether we are opening ourselves
up to other problems, like really getting stuck in the infinite loop
that the watchdog is intended to avoid.
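
The 65k figure in workaround 1 comes from TCP addressing: a connection
is identified by (source IP, source port, destination IP, destination
port), so with both addresses and the destination port fixed, only the
16-bit source port can vary. A rough sketch of the arithmetic follows.
The function name and the 32768-61000 ephemeral range are assumptions
for illustration; the real range on a given box is whatever
/proc/sys/net/ipv4/ip_local_port_range says.

```python
def max_loopback_connections(port_low, port_high, address_pairs=1):
    """Upper bound on concurrent TCP connections to one fixed listener port.

    address_pairs is the number of distinct (source IP, destination IP)
    combinations in use; each pair gets its own source-port space.
    """
    ports_per_pair = port_high - port_low + 1
    return ports_per_pair * address_pairs


# Theoretical ceiling: every non-zero 16-bit port usable as a source port.
print(max_loopback_connections(1, 65535))       # 65535

# Assumed ephemeral range (illustrative, check the sysctl on your system).
print(max_loopback_connections(32768, 61000))   # 28233
```

So the practical limit may be well under 65k unless the ephemeral range
is widened or more loopback addresses are used.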

In any case, based on what "reset=no" is supposed to do, and on where
the first line of the error output above is generated in the code,
it's clear that the main client data transfer loop in client.c
(starting at line 519) is iterating too often with no data to
transfer, which triggers the infinite loop detection in the transfer()
function (client.c, starting at line 832).
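
To make the failure mode concrete, here is a simplified sketch of the
general watchdog technique as I understand it: count consecutive loop
iterations that move no data, and bail out once the count passes a
threshold. This is not stunnel's actual code; run_transfer, step,
TransferStalled, and the threshold of 100 are all placeholders invented
for illustration.

```python
class TransferStalled(RuntimeError):
    """Raised when the loop keeps spinning without moving any data."""


def run_transfer(step, max_idle=100):
    """Call step() until it returns None; return the total bytes moved.

    step() stands in for one poll-and-copy iteration and returns the
    number of bytes it moved (0 if the poll woke up with nothing to do).
    """
    idle = 0
    total = 0
    while True:
        moved = step()
        if moved is None:          # both directions finished cleanly
            return total
        if moved > 0:
            total += moved
            idle = 0               # progress resets the watchdog
        else:
            idle += 1
            if idle > max_idle:
                raise TransferStalled(
                    "loop executes not transferring any data")


if __name__ == "__main__":
    steps = iter([4096, 0, 4096, None])
    print(run_transfer(lambda: next(steps)))   # 8192
```

A poll call that keeps returning immediately with an event the loop
cannot act on would make step() return 0 on every pass, which is the
situation the watchdog exists to catch.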

My best hypothesis at this point is that the transfer() code was
written with only TCP sockets in mind, and that something about the
unix socket in use here is triggering that loop too often.  Maybe
s_poll_wait on line 541 is being called with parameters which are
right for a TCP socket but inappropriate for a unix socket?  At this
point, though, we're at the limit of my socket programming ability, so
I'm a little stuck.  (I would have liked to simply fix this and
submit a patch...  but c'est la vie, eh.)
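
One way to probe this hypothesis outside of stunnel is a small script
that closes one end of a unix stream socket pair and polls the other
end once. This is only a sketch: the function names are made up, it
uses Python's select.poll rather than anything from stunnel, and
whether the result is what actually trips the watchdog is not
established here.

```python
import select
import socket


def poll_after_peer_close(events=0, timeout_ms=100):
    """Close one end of an AF_UNIX stream pair and poll the other once.

    Returns the revents bitmask that poll() reported, or 0 on timeout.
    events is the requested event mask; 0 means "not interested in
    reading or writing", which still allows POLLHUP/POLLERR to be
    reported because poll() always reports those.
    """
    ours, peer = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        peer.close()
        poller = select.poll()
        poller.register(ours.fileno(), events)
        ready = poller.poll(timeout_ms)
        return ready[0][1] if ready else 0
    finally:
        ours.close()


def describe(revents):
    """Translate a revents bitmask into a list of flag names."""
    names = ["POLLIN", "POLLOUT", "POLLERR", "POLLHUP", "POLLNVAL"]
    return [name for name in names if revents & getattr(select, name)]


if __name__ == "__main__":
    print(describe(poll_after_peer_close(0)))
    print(describe(poll_after_peer_close(select.POLLIN)))
```

On Linux I would expect POLLHUP to be reported even with an empty event
mask. Running the same probe against a TCP loopback pair would show
whether the two socket types differ after the peer closes.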

We'd like to avoid using a TCP socket to communicate between stunnel
and haproxy simply because it's a little less efficient, and because
there's an artificial 65k limit on the number of back-end connections
we can support per loopback IP if we do that.  Alternatively, if
someone could verify that "reset=no" is an acceptable / safe
workaround in this case, I'd appreciate that (though, of course, I'd
prefer my logs not to get spammed on every connection. :P )

Is there any other data I can provide to help troubleshoot and fix this?
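
In case it helps with reproducing this, here is a minimal sketch of a
client matching the conditions described earlier: a POST over SSL with
"Connection: close", followed by a check of how many body bytes arrived
against the declared Content-Length. The host, port, and path in the
comment are hypothetical placeholders, and the helper names are made up
for this example.

```python
import socket
import ssl


def build_request(host, path, body):
    """Build an HTTP/1.1 POST that asks the server to close afterwards."""
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


def split_response(raw):
    """Return (declared Content-Length or None, body bytes received)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    declared = None
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip())
    return declared, len(body)


def post_once(host, port, path, body, verify=True):
    """Send one POST over TLS and report (declared, received) lengths."""
    context = ssl.create_default_context()
    if not verify:  # only for test rigs with self-signed certificates
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    chunks = []
    with socket.create_connection((host, port), timeout=10) as tcp:
        with context.wrap_socket(tcp, server_hostname=host) as tls:
            tls.sendall(build_request(host, path, body))
            try:
                while True:
                    chunk = tls.recv(65536)
                    if not chunk:
                        break
                    chunks.append(chunk)
            except OSError:
                pass  # a reset mid-response lands here; keep what arrived
    return split_response(b"".join(chunks))


# Hypothetical usage (placeholder host and path):
#   declared, received = post_once("lb.example.test", 443, "/endpoint", b"x=1")
#   truncated = declared is not None and received < declared
```

A response sent with chunked transfer encoding has no Content-Length,
so this check only works when the back end declares one.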

Thanks,
Stephen

--
Stephen Balukoff
Blue Box Group, LLC
(800)613-4305 x807