Tuesday, January 10, 2017

Windows Overlapped Socket IO with Completion Routines

Up till now, cSuneido has used WSAAsyncSelect to do non-blocking socket IO. But WSAAsyncSelect is deprecated and it's not the nicest approach anyway. cSuneido needs non-blocking socket IO for background fibers, the main fiber uses synchronous blocking IO. (Although that means the main fiber will block background fibers.) Note: Windows fibers are single threaded, cooperative multi-tasking, coroutines. The advantage of fibers is that because they are single threaded and you control the context switches, you don't have the concurrency issues you would with "real" preemptive threads.

I thought that the WSAAsyncSelect code was the source of some failures we were seeing so I decided to rewrite it. My first rewrite used a polling approach. I know that's not scalable, but cSuneido doesn't do a lot of background processing so I figured it would be ok. Basically, I put the sockets in non-blocking mode, and whenever an operation returned WSAWOULDBLOCK the fiber would give up the rest of its time slice (e.g. 50ms) This was quite simple to implement and seemed to work fine.

However, I soon found it was too slow for more than a few requests. For example, one background task was doing roughly 400 requests. 400 * 50 ms is 20 seconds - ouch!

Back to the drawing board. One option was to use WSAEventSelect, but it looked complicated and I wasn't quite sure how to fit it in with the GUI event message loop.

Then I saw that WSARecv and WSASend allowed completion routines, a bit like XMLHttpRequest or Node's non-blocking IO. This seemed like a simpler approach. The fiber could block (yielding to other fibers) and the completion routine could unblock it.

At first I thought I had to use WSASocket and specify overlapped, but it turned out that the regular socket function sets overlapped mode by default. That's ok because it has no effect unless you use WSARecv or WSASend in overlapped mode.

Sending was the easy part since there was no need to block the sending fiber. It could just "fire and forget". One question was whether it would always do the full transfer or whether it might just do a partial transfer and require calling WSASend again (from the completion routine) to finish the transfer. I couldn't find a definitive answer for this. I found several people saying that in practice, unless there is a major issue (like running out of memory), it will always do the full transfer. Currently I just have an assert to confirm this.

Receiving is trickier. You may need to block until the data is available. And the completion routine may get called for partial data in which case you need to call WSARecv again for the remainder. (Although this complicates the code, it's actually a good thing since it allows you to request larger amounts of data and receive it as it arrives.)

WSASend and WSARecv can succeed immediately. However, the completion routine will still be called later. And for WSARecv at least, "success" may only be a partial transfer, in which case you still need to block waiting for the rest.

One complication to this style of overlapped IO is that completion routines are only called when you're in an "alertable" state. There are only a handful of functions that are alertable. I used MsgWaitForMultipleObjectsEx in the message loop, and SleepEx with a zero delay in a few other places. Note: although the MSDN documentation is unclear, you must specify MWMO_AWAITABLE for MsgWaitForMultipleObjectsEx to be alertable. (And it has to be the Ex version.)

Each overlapped WSASend or WSARecv is given an WSAOVERLAPPED structure and this structure must stay valid until the completion routine is called. I ran into problems because in some cases the completion routine wasn't getting called until after the socket had been closed, at which point I'd free'd the WSAOVERLAPPED structure. I got around this be calling SleepEx with a zero delay so the completion routines would run.

When I looked at some debugging tracing I noticed that it seldom blocked for very long. So I added a 1ms SleepEx before blocking to see if the data would arrive, in which case it wouldn't need to block and incur a context switch. This eliminated some blocking, but sometimes it didn't seem to work. I realized it was probably because the sleep was getting ended by an unrelated completion routine (e.g. from the preceding write). So I added a loop to ensure it was waiting at least a millisecond and that fixed it. Of course, I'm testing with the client and server on the same machine so the latency is very low. Across the network it will be slower and will still need to block sometimes.

Although the code wasn't that complicated, it took me a while to get it working properly (i.e. fast). As always, the devil is in the details. But the end result looks good. Background socket IO now runs about 30% faster than the old WSAAsyncSelect version, and almost as fast as the foreground synchronous blocking IO.

No comments: