Building a high-performance TCP server requires a deep understanding of how sockets work and how to manage connections efficiently. In this blog post, we’ll take a close look at the .NET Socket class, uncovering how it's implemented. By understanding these details, we can leverage its full potential to build a TCP server capable of handling high-throughput, low-latency communication, especially for scenarios like receiving heartbeat messages from numerous IoT devices.
We’ll also compare the performance of our custom server with Kestrel, the highly optimised ASP.NET web server, to demonstrate that a well-crafted .NET socket server can match its efficiency.
Join me as we dive into the architecture of .NET sockets and see how to create a high-performance, low-latency TCP server that delivers results on par with Kestrel.
The code is accessible in the accompanying repository.
Our TCP server will handle heartbeat messages from thousands of IoT devices, each periodically sending a small HTTP/1.1 request that contains a header with its unique Device-Id. Upon receiving a heartbeat, the server will update the last known timestamp for that device in an in-memory data structure, respond with an HTTP 204 No Content status code to acknowledge receipt, and then close the TCP connection.
curl -X PUT -H "Device-Id: 1234" -v 127.0.0.1:9096
* Trying 127.0.0.1:9096...
* Connected to 127.0.0.1 (127.0.0.1) port 9096
> PUT / HTTP/1.1
> Host: 127.0.0.1:9096
> User-Agent: curl/8.8.0
> Accept: */*
> Device-Id: 1234
>
< HTTP/1.1 204 No Content
<
* Connection #0 to host 127.0.0.1 left intact
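The in-memory side of the handler — updating a device’s last-seen timestamp — isn’t shown in the snippets below, but a minimal sketch could look like this. The HeartbeatStore type and its method names are illustrative assumptions, not part of the original server code; the sketch assumes a ConcurrentDictionary keyed by Device-Id.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper (not the post's actual implementation): tracks the
// last heartbeat timestamp per device using a lock-free dictionary.
class HeartbeatStore
{
    private readonly ConcurrentDictionary<string, DateTimeOffset> _lastSeen = new();

    // Inserts or overwrites the device's last-seen timestamp.
    public void RecordHeartbeat(string deviceId) =>
        _lastSeen[deviceId] = DateTimeOffset.UtcNow;

    public bool TryGetLastSeen(string deviceId, out DateTimeOffset timestamp) =>
        _lastSeen.TryGetValue(deviceId, out timestamp);
}
```

ConcurrentDictionary avoids a global lock, so concurrent heartbeats from different devices don’t contend with each other on the hot path.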
The goal is to build a high-performance TCP server, with two key metrics in mind: latency and throughput. Latency refers to the time taken to receive and process a new heartbeat message — updating the last known timestamp in memory. Reducing latency improves the server’s responsiveness. Throughput, on the other hand, measures how many heartbeat messages the server can process in a given time. Increasing throughput allows the server to handle more IoT devices, improving overall efficiency.
Balancing these two metrics is critical, as high throughput doesn’t necessarily mean low latency. Optimising both ensures the server performs well under load.
To implement our server, we’ll use a socket, which is an abstraction provided by the operating system to facilitate network communication. Both the client and server must create and manage sockets to exchange data. In .NET, the Socket class from the System.Net.Sockets namespace acts as a managed wrapper around the OS's native socket functionality, providing a convenient interface for establishing and managing network connections.
The following code initialises a TCP socket, binding it to a specific IP address and port. By calling the Listen method, the server signals the operating system to begin queuing incoming connections. The backlog parameter defines how many pending connections the OS will hold in the queue. However, while the OS can queue connections, the server still needs to actively accept them.
To do this, the server calls the Accept method, which dequeues a connection and returns a new socket dedicated to that connection. This new socket can then be used for sending and receiving data, while the original socket continues listening for additional connections.
var listenerSocket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
listenerSocket.Bind(new IPEndPoint(IPAddress.Any, _port));
listenerSocket.Listen(backlog: 4096);

while (true)
{
    var acceptSocket = listenerSocket.Accept();
    HandleNewConnection(acceptSocket); // sequential model
}
The simplest approach to handling new connections is the sequential model shown above, where the server processes one request at a time. It accepts a connection, handles it completely, and then moves on to the next. While easy to implement, this model is inefficient when multiple clients try to connect simultaneously, as only one client can interact with the server at any given time. This forces other clients to wait, leading to delays and poor performance, especially under heavy load.
To improve throughput and reduce latency, concurrency is key. Concurrency allows the server to handle multiple requests at once, treating each request as an independent task that can be processed in parallel.
There are two main concurrency models: processes and threads.
A common strategy is to create a new thread for each incoming connection, allowing the server to handle each request independently in its own thread, as shown in the following code.
while (true)
{
    var acceptSocket = listenerSocket.Accept();
    // concurrent model: one thread per connection
    var thread = new Thread(() => HandleNewConnection(acceptSocket));
    thread.Start();
}
To boost server performance, a thread pool reuses long-running worker threads, reducing the overhead of creating new threads for each task. Instead of managing individual threads, a thread pool assigns tasks to available threads, making it more efficient, especially for short-lived tasks.
In .NET, you can use ThreadPool.QueueUserWorkItem or ThreadPool.UnsafeQueueUserWorkItem to queue tasks. The key difference is that QueueUserWorkItem captures and flows the execution context (including the security context) to the worker thread, ensuring safer execution, while UnsafeQueueUserWorkItem skips this for faster performance but with potential security risks.
Setting preferLocal to false ensures tasks are distributed evenly across all worker threads' queues, improving load balancing in high-concurrency scenarios.
while (true)
{
    var acceptSocket = listenerSocket.Accept();
    // concurrent model: hand the connection to the thread pool
    var handler = new ConnectionHandler(acceptSocket);
    ThreadPool.UnsafeQueueUserWorkItem(handler, preferLocal: false);
}
class ConnectionHandler : IThreadPoolWorkItem
{
    private readonly Socket _acceptSocket;

    public ConnectionHandler(Socket acceptSocket)
    {
        _acceptSocket = acceptSocket;
    }

    public void Execute()
    {
        HandleNewConnection(_acceptSocket);
    }
}
After accepting a connection, a server uses the socket to read from and write to the client. In blocking I/O, each read operation suspends the thread until data arrives, causing the thread to wait and preventing it from handling other tasks. This can lead to high context-switching overhead as the number of threads increases, reducing overall throughput and scalability.
Non-blocking I/O allows the server to request data without waiting for it to arrive. For example, a non-blocking read initiates the request and continues executing other code, such as handling additional connections, while waiting for data. This method enhances efficiency by not tying up threads during I/O operations.
However, non-blocking I/O can involve busy-waiting, where the server continuously checks the status of multiple sockets, which can be inefficient. For instance, with thousands of sockets, constantly polling each one — even if only the last one has data — can lead to excessive CPU usage and diminished performance.
while (true)
{
    foreach (var socket in sockets)
    {
        // pseudocode: returns null when no data is ready yet
        data = socket.NonBlockingRead();
        if (data is not null)
        {
            ProcessData(data);
        }
    }
}
To overcome the inefficiencies of busy-waiting in non-blocking I/O, modern systems use I/O multiplexing with an event-driven model. This approach leverages an event loop and callbacks to efficiently manage multiple I/O operations. In this model, the application registers interest in specific events (e.g., data availability or readiness to write) for each socket. The event loop waits for notifications from the operating system about these events instead of continuously polling each socket.
When an event occurs, the event loop triggers a callback function to handle the I/O operation, allowing the application to respond as needed. This pattern, known as the reactor pattern, reduces CPU usage and overhead by eliminating constant polling. It efficiently handles thousands of connections by focusing only on active, ready-to-process connections, making it ideal for high-performance and responsive network servers.
// pseudocode: register interest in read/write events for each socket
foreach (var socket in sockets)
{
    OS.EventRegistration(socket, callback, event_type: read | write);
}

// event loop
while (true)
{
    events = OS.WaitForSocketEvents();
    foreach (var ev in events)
    {
        ev.callback_function(ev.socket, ev.event_type);
    }
}
In .NET, socket operations on Unix-based systems (like Linux) are managed using an event-driven model, facilitated by the SocketAsyncEngine class. This class implements an event loop that listens for notifications from the operating system’s native mechanisms, such as epoll on Linux or kqueue on macOS. When these notifications are received, the event loop schedules the corresponding socket operations (e.g., reads and writes) as work items in the ThreadPool. The following code provides a simplified example of how the event loop is implemented by SocketAsyncEngine.
class SocketAsyncEngine : IThreadPoolWorkItem
{
    private readonly ConcurrentQueue<SocketIOEvent> _eventQueue = new();

    private SocketAsyncEngine()
    {
        var thread = new Thread(static s => ((SocketAsyncEngine)s!).EventLoop())
        {
            IsBackground = true,
            Name = ".NET Sockets"
        };
        thread.UnsafeStart(this);
    }

    private void EventLoop()
    {
        var handler = new SocketEventHandler(this);
        while (true)
        {
            // Block until the OS (epoll/kqueue) reports socket events.
            Interop.Sys.WaitForSocketEvents(_port, handler.Buffer, &numEvents);
            if (handler.HandleSocketEvents(numEvents))
            {
                // Dispatch the queued events to the ThreadPool.
                ThreadPool.UnsafeQueueUserWorkItem(this, preferLocal: false);
            }
        }
    }

    public void Execute()
    {
        // Drain the event queue on a ThreadPool thread.
        while (true)
        {
            if (!_eventQueue.TryDequeue(out SocketIOEvent ev))
            {
                break;
            }
            ev.Context.HandleEvents(ev.Events);
        }
    }
}
To register a callback for non-blocking socket operations such as Receive or Send in .NET, we use the SocketAsyncEventArgs class. This class allows us to associate a callback with the socket operation, which is triggered when the event loop detects a relevant event. The code snippet below demonstrates how to use SocketAsyncEventArgs for this purpose:
var receiveEventArgs = new SocketAsyncEventArgs(unsafeSuppressExecutionContextFlow: true);
receiveEventArgs.SetBuffer(buffer);
receiveEventArgs.Completed += RecvEventArg_Completed;
acceptSocket.ReceiveAsync(receiveEventArgs);

void RecvEventArg_Completed(object? sender, SocketAsyncEventArgs e)
{
    // consume the received bytes (e.BytesTransferred bytes in the buffer)
}
When a read event occurs for a specific socket, the event loop triggers the registered callback via the SocketAsyncEventArgs.Completed event. Additionally, the Receive operation captures the execution context to ensure it can be restored before executing the callback. For improved performance, you can disable execution context flow if it’s not needed by configuring SocketAsyncEventArgs accordingly in its constructor, as shown above.
For asynchronous operations like ReceiveAsync overloads that do not take a SocketAsyncEventArgs parameter, .NET internally creates a SocketAsyncEventArgs object. These methods return a ValueTask, which signals the runtime to proceed once the operation has completed and the underlying SocketAsyncEventArgs object is released.
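As a sketch of that ValueTask-based path (the method and variable names here are illustrative, not from the post's server code):

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;

// Illustrative sketch: the Memory<byte>-based ReceiveAsync overload returns a
// ValueTask<int>; internally .NET backs it with a reusable awaitable object,
// so the common path avoids allocating a Task per call.
static async Task<int> ReceiveOnceAsync(Socket socket)
{
    var buffer = new byte[512];
    // Completes when data arrives; 0 means the peer closed the connection.
    int bytesRead = await socket.ReceiveAsync(buffer.AsMemory(), SocketFlags.None);
    return bytesRead;
}
```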
Reusing SocketAsyncEventArgs objects from a pool can improve performance by avoiding the overhead of creating new instances for each socket event. However, pooling should be used carefully. If a pooled object is far from the CPU core’s cache, accessing it may cause delays; in some cases, creating a new object might be faster than using a cached one.
The key is to measure and balance the trade-offs between pooling and performance, ensuring the pool enhances rather than hinders efficiency.
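A minimal pooling sketch, assuming a simple ConcurrentQueue-backed pool (the class name is hypothetical; a production pool would also bound its size and reset per-operation state):

```csharp
using System.Collections.Concurrent;
using System.Net.Sockets;

// Hypothetical pool (not the runtime's internal one): rents a
// SocketAsyncEventArgs if one is available, otherwise allocates a new one.
class SocketAsyncEventArgsPool
{
    private readonly ConcurrentQueue<SocketAsyncEventArgs> _pool = new();

    public SocketAsyncEventArgs Rent() =>
        _pool.TryDequeue(out var args) ? args : new SocketAsyncEventArgs();

    // Callers must detach any buffers and Completed handlers they attached
    // before returning the instance, so stale state never leaks across uses.
    public void Return(SocketAsyncEventArgs args) => _pool.Enqueue(args);
}
```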
In .NET, socket continuations are usually dispatched to the ThreadPool from the event thread to prevent blocking the event-handling loop. However, by setting PreferInlineCompletions to true, continuations can be executed directly on the event thread, reducing the overhead of dispatching to the ThreadPool.
By default, PreferInlineCompletions is set to false. You can enable inline completions by setting the DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS environment variable to 1.
bool HandleEvents(Event[] events)
{
    foreach (var ev in events)
    {
        if (PreferInlineCompletions)
        {
            ev.Callback(); // run the continuation on the event thread
        }
        else
        {
            ThreadPool.UnsafeQueueUserWorkItem(ev.Callback, preferLocal: false);
        }
    }
    return events.Length > 0;
}
Each time an application creates a new socket on Linux, the operating system generates a file descriptor, an integer that represents the created socket. In .NET, you can access this file descriptor to make direct system calls, but this approach requires caution. The file descriptor becomes invalid once the socket is closed, and using it after that can lead to errors or undefined behaviour.
In .NET, the SafeHandle property of a Socket returns a SafeSocketHandle that wraps this file descriptor. To access its raw value, you can use the DangerousGetHandle() method of SafeSocketHandle.
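For illustration, reading the descriptor could look like this (remember the value is only meaningful while the socket remains open):

```csharp
using System;
using System.Net.Sockets;

var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);

// SafeHandle wraps the OS-level file descriptor (on Linux) behind a
// SafeSocketHandle; DangerousGetHandle exposes the raw value.
IntPtr fd = socket.SafeHandle.DangerousGetHandle();
Console.WriteLine($"File descriptor: {fd}");

socket.Dispose(); // after this, fd must no longer be used
```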
Now that we’ve covered the key concepts and building blocks, you can see everything in action in the event-based server I’ve implemented. This server handles heartbeat requests and is fully configurable — you can enable or disable features like inline completions or socket pooling using the corresponding option values.
To get started, clone the repository and run:
cd src
dotnet run -c Release --InlineCompletions false --SocketPolling false
Once you run the server, you should see the following output in your terminal, indicating that it’s ready for action:
Server started
ServerOptions: Port=9096, Address=0.0.0.0, MaxRequestSizeInByte=512, InlineCompletions=False, SocketPolling=False
At this point, the server is live, and you’re ready to run your benchmarks to test its performance!
To ensure our TCP server performs well, we need to measure its throughput and latency. We’ll compare its performance against a baseline, the same Heartbeat server built using .NET Kestrel. Kestrel is highly optimised in the .NET ecosystem, and while our server is purpose-built, this comparison helps us verify that we’re using sockets efficiently.
For benchmarking, we used the bombardier HTTP benchmarking tool.
To establish the baseline, I created a simple ASP.NET application exposing a minimal endpoint for handling heartbeat messages.
Our tests will make HTTP requests to both the heartbeat server and the baseline Kestrel server. Since IoT devices establish new TCP connections for each heartbeat, we need to simulate this behaviour by ensuring each test request creates a new TCP client connection. Unlike Kestrel, which supports connection reuse (keep-alive), our server closes the connection after every request.
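That close-after-response behaviour can be sketched as follows. This is a simplified stand-in for the actual handler, and the response bytes are an assumption based on the curl output shown earlier:

```csharp
using System.Net.Sockets;
using System.Text;

// Simplified sketch: write the 204 acknowledgement, then shut the
// connection down so each heartbeat uses a fresh TCP connection.
static void SendAckAndClose(Socket acceptSocket)
{
    byte[] response = Encoding.ASCII.GetBytes(
        "HTTP/1.1 204 No Content\r\nConnection: close\r\n\r\n");
    acceptSocket.Send(response);
    acceptSocket.Shutdown(SocketShutdown.Both); // flush pending data, send FIN
    acceptSocket.Close();
}
```

Shutting down before closing ensures the acknowledgement is fully transmitted before the OS tears the connection down.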
Here’s the command we used to run the benchmark:
~$ bombardier -c 32 -m PUT -H "Device-Id:1234" --http1 -a 127.0.0.1:9096
Bombarding http://127.0.0.1:9096 for 10s using 32 connection(s)
[==================================================================================================================] 10s
Done!
Statistics Avg Stdev Max
Reqs/sec 10114.69 1085.95 12151.48
Latency 3.13ms 0.96ms 20.28ms
HTTP codes:
1xx - 0, 2xx - 101093, 3xx - 0, 4xx - 0, 5xx - 0
others - 0
Throughput: 1.84MB/s
While the default output is shown above, in our actual benchmarking, we used JSON output for easier parsing and charting.
The benchmark was run 35 times, with the first 5 iterations discarded as warm-up. We tested with varying numbers of connections: 1, 16, 32, 128, and 256 concurrent connections.
In all scenarios, the custom heartbeat server outperformed the baseline Kestrel server, both in terms of latency and throughput (requests per minute). This demonstrates that our event-based server, despite being purpose-built, handles high connection loads more efficiently than the baseline, proving that our socket implementation is well-optimised.
The following chart provides a comprehensive overview of the results:
In this post, we demonstrate the process of building a high-performance TCP server using .NET’s Socket class to handle heartbeat messages from IoT devices. We take a deep dive into socket handling, connection management, concurrency models, and efficient I/O techniques like event-driven programming. We explore the custom TCP server with both blocking and non-blocking I/O, along with advanced techniques such as thread pooling and socket multiplexing. Finally, we benchmark the server against the highly-optimised Kestrel server and show how a well-crafted, event-driven architecture can outperform Kestrel in handling high connection loads while maintaining low latency.
Next Step: Moving forward, we can further explore the data handling aspect by examining efficient ways to manage and consume buffers. This will allow for more optimised processing of incoming data, enhancing the server’s overall performance.