[stunnel-users] Premature socket closure - race condition bug?

Graham Nayler (work) graham.nayler at hallmarq.net
Fri Sep 19 14:33:35 CEST 2014


Hi Michal (and if there is anyone else remotely interested :-) )

After a lot more digging, I've now confirmed it to be a race condition in 
the handling of the read socket in stunnel, although not the one I 
originally suspected. The problem occurs when the remote program feeds some 
data to stunnel, then immediately closes the socket, i.e.:

write(sock, buff, length);   /* returns the full byte count */
shutdown(sock, SHUT_WR);     /* done sending */
close(sock);

The write returns a valid count, and this looks like legal code to me - the
client program should not have to wait around to confirm that the data has been
received at the other end before starting the shutdown, OK?

Looking at the way stunnel checks for incoming data, it polls the file 
descriptors and then takes action on the return status - particularly POLLIN 
and/or POLLHUP. When the transfer succeeded, the poll status was just 
POLLIN, and Stunnel then reads the socket correctly. In the failure scenario 
though, the returned status was (POLLIN|POLLHUP). Stunnel processed the 
POLLHUP case first, and marked the socket as closed without checking for any 
data still within it. Switching the order so that any remaining data is first
unloaded from the socket, and ONLY THEN marking the socket as shut, fixes the
problem. The read on the fd still works even with the HUP state flagged, as
confirmed by the returned byte count, and the shutdown then proceeds as normal
once that data has been processed.
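
To illustrate what I mean, here is a minimal sketch of the ordering that works
for me. It is my own illustration, not the stunnel source - the function name,
buffer size and printf message are mine:

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch only: on Linux a peer that writes and then immediately closes can
 * make poll() report POLLIN and POLLHUP in the same wakeup, so any readable
 * data has to be drained before the hang-up is acted on. */
static void drain_then_close(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char buf[4096];
    ssize_t n;

    if (poll(&pfd, 1, -1) < 0)
        return;

    if (pfd.revents & POLLIN)                  /* 1. unload pending data first */
        while ((n = read(fd, buf, sizeof buf)) > 0)
            printf("drained %zd bytes\n", n);

    if (pfd.revents & (POLLHUP | POLLERR))     /* 2. only now treat it as closed */
        close(fd);
}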

I attach a log showing the successful and failing scenarios for both the
original code ordering and the fixed version. I also attach a log of the
diffs between the working code and the corresponding sources in Version
5.02. The extra debugging output I added to track down the problem is
included, although it is not strictly necessary. The version of the
utility s_poll_value() for the non-USE_POLL case is just a best guess, as my
configuration appears to use the poll mechanism rather than select.

I've only tried to fix this particular problem. I don't know whether a similar
scenario is possible on the SSL side of the transfer as well - my
knowledge of SSL is even scantier than my knowledge of sockets. I can't 
think of a similar problem on the write side...but again my knowledge of 
comms is purely on a need-to-know basis!

Graham



-----Original Message----- 
From: Graham Nayler (work)
Sent: Monday, September 15, 2014 3:04 PM
To: stunnel-users at stunnel.org
Subject: [stunnel-users] Premature socket closure - race condition bug?

Dear All,

After a recent upgrade I'm currently experiencing intermittent problems with
securing bidirectional comms traffic for a monitoring program with stunnel.

The system is:
70+ Client machines running BBWin on Windows (mostly 7) -> stunnel 5.01
.....internet......stunnel 5.02 -> Xymon running on 64-bit Linux Mint 17
(Virtual machine inside 2012 R2 Essentials server)

Prior to the recent upgrade, the server was an approximately 3-year-old 32-bit
Ubuntu server running stunnel 4.56. Comms then worked (mostly) fine for our
client machines.

Since the upgrade, client requests for information from the server have been
largely failing. Running the comms over direct unsecured socket connections
works fine.

I've spent a bit of time over the last couple of days looking at the source
for both Stunnel and BBWin and it looks to me as if there is a disconnect in
understanding between BBWin and Stunnel as to how read and write connections
work.

The BBWin Client makes the connection, then issues (in essence) the
following sequence:
    send(connection, msg)
    shutdown(connection, SHUT_WR)
    do recv(connection....) until it returns zero or SOCKET_ERROR
    shutdown(connection, SHUT_RD)
    shutdown(connection, SHUT_BOTH)
    closesocket(connection)
i.e. the client shuts down the transmission side as soon as it's done, then
shuts down the receive side only once it's finished receiving any returned
data.
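
For clarity, here is roughly what that half-close pattern looks like in plain
C. This is my own illustration in POSIX terms - BBWin itself uses the Winsock
equivalents (closesocket(), SOCKET_ERROR), and the function and buffer names
are mine:

#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch of the client's half-close: SHUT_WR only ends the outgoing half,
 * so the peer sees EOF on its read, while replies can still be received
 * here until recv() returns 0. */
static void send_then_read_reply(int sock, const char *msg)
{
    char reply[4096];
    ssize_t n;

    send(sock, msg, strlen(msg), 0);
    shutdown(sock, SHUT_WR);            /* finished sending, still reading */

    while ((n = recv(sock, reply, sizeof reply, 0)) > 0)
        ;                               /* consume the returned data */

    shutdown(sock, SHUT_RD);
    close(sock);
}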

I attach Stunnel logs for both client and server for both failed and
successful transfers. I've added a little more debugging output to the
server Stunnel instance to display the data being read from and written to
both the socket and the SSL side of the comms. This shows that the only
difference between the two is that in the successful transfer the server
receives and passes on data from the socket to SSL before starting the
shutdown. So that looks fine - when it works. But when it doesn't, it looks
just as if the return path is shut down before the server app has had time to
retrieve the data to be returned.

Looking at the stunnel code though, I'm confused - and my suspicion is that
stunnel (on the client machine) is closing the SSL connection prematurely.
It looks as if it issues the SSL_shutdown command (client.c line 855) if:
    it has not already sent the shutdown,
    the read fd on the socket is closed,
    there's nothing left in the outbound queue (sock_ptr is 0),
    and SSL wants a retry (? I don't yet understand the write_wants_write
usage).
That's all very well for closing the outbound side, but what about the
inbound? Surely it should keep the SSL connection open until either it's
notified by the other side that everything is closed down there, or BOTH read
and write on the socket side have been shut down. A further point of confusion
is that the stunnel code handles read and write fds for each of SSL and socket
independently, but in most cases they are both set to the same value. Is
there some confusion about handling s_poll_hup()? I freely admit I don't
fully understand how this works, as I've only had a day's experience of this
comms stuff, and it looks pretty well thought out, but there's logic here
for handling the inbound and outbound sides independently, so SSL should
remain open while either of those two channels remains active?
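
In other words, I'd have expected the close_notify to go out only once both
plain-socket directions are finished and the forwarding buffer is empty. The
sketch below is just my own illustration of that ordering, not the actual
client.c logic - the helper and variable names are mine:

#include <stdbool.h>
#include <stddef.h>
#include <openssl/ssl.h>

/* Hypothetical helper: defer SSL_shutdown() (which sends the TLS close_notify)
 * until BOTH halves of the plain socket are done and nothing is left to
 * forward.  sock_read_closed / sock_write_closed / pending_bytes are my own
 * names, not stunnel's. */
static void maybe_send_close_notify(SSL *ssl, bool *close_notify_sent,
                                    bool sock_read_closed,
                                    bool sock_write_closed,
                                    size_t pending_bytes)
{
    if (*close_notify_sent)
        return;
    if (sock_read_closed && sock_write_closed && pending_bytes == 0) {
        SSL_shutdown(ssl);              /* start the TLS close handshake */
        *close_notify_sent = true;
    }
}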

The bottom line is that the comms:
  a) works reliably when not routed through stunnel
  b) works reliably to transmit from client to server
  c) now works (in reception) less than 10% of the time when using stunnel -
but does work occasionally
It worked fairly reliably with 5.01 on the clients and 4.56 on a slowish
server, and now doesn't with 5.02 on a higher-spec server, with the client
software/hardware unchanged. My suspicion is that improving the spec of the
server has exposed a race condition in the client installation.

Any thoughts?

Graham Nayler






Attachments:
  diffs.log (3116 bytes):
  <http://www.stunnel.org/pipermail/stunnel-users/attachments/20140919/fdc4b910/attachment.obj>
  stunnel_fixed.log (14782 bytes):
  <http://www.stunnel.org/pipermail/stunnel-users/attachments/20140919/fdc4b910/attachment-0001.obj>

