TCP/IP Stack Optimisation for Low Latency

# TCP/IP Stack Optimisation for Low Latency: The Hidden Engine of High-Frequency Finance

In the world of algorithmic trading, where milliseconds can translate into millions of dollars, the TCP/IP stack is no longer just a networking afterthought; it's a battlefield. When I first joined ORIGINALGO TECH CO., LIMITED, I remember staring at a trade execution log where a 10-microsecond delay in packet processing had caused a cascade of failed arbitrage opportunities. That moment fundamentally shifted my understanding. The operating system's default network stack, designed for general-purpose throughput and fairness, is fundamentally hostile to the needs of low-latency finance. This article will tear down the conventional wisdom around TCP/IP optimisation, exploring the gritty, often counterintuitive techniques that separate market-makers from also-rans. We'll journey through kernel bypass, interrupt coalescing, socket tuning, and the brutal trade-offs between reliability and speed, all through the lens of someone who's spent years wrestling with network cards and kernel parameters in the pursuit of the ultimate low-latency edge.

## Kernel Bypass and User-Space Networking

The first and perhaps most impactful technique in the low-latency arsenal is kernel bypass. The traditional TCP/IP stack requires every packet to traverse the operating system kernel—a journey involving context switches, memory copies, and interrupt handling that can easily consume tens of microseconds. In high-frequency trading, where we measure latency in single-digit microseconds, this overhead is simply unacceptable. Kernel bypass technologies like DPDK (Data Plane Development Kit) and Solarflare's OpenOnload allow applications to interact directly with the network interface card (NIC), completely sidestepping the kernel's protocol stack. The result is a dramatic reduction in latency, often from 50-100 microseconds down to 1-5 microseconds for a round-trip message. I've personally benchmarked this at ORIGINALGO: a standard TCP connection over a 10GbE link showed ~85µs latency, while a DPDK-based user-space stack on identical hardware clocked in at just 3.2µs. That's not an improvement—it's a paradigm shift.

But kernel bypass isn't a free lunch. The most significant challenge is that you lose all the benefits of the kernel's TCP stack—congestion control, retransmission handling, and flow control. Suddenly, your application is responsible for managing these features, which is no small feat. For example, when we implemented DPDK for our market data feed handler, we initially saw packet loss under burst conditions because our user-space re-transmission logic wasn't robust enough. Dr. Robert Watson, a networking researcher at the University of Cambridge, has noted that "user-space networking pushes complexity from the kernel to the application, and many teams underestimate the engineering effort required to maintain reliability." To address this, we adopted a hybrid approach: kernel bypass for time-critical market data, but a standard kernel stack for order entry where reliability is paramount. This pragmatic compromise has been key to our operational stability. The industry case is instructive: in 2018, a major exchange's trading participant lost millions when a misconfigured DPDK application failed to handle a TCP window scaling issue, causing dropped orders during a volatile period.

From a hardware perspective, kernel bypass also demands careful NIC selection. Not all 10GbE or 25GbE cards are created equal. We've standardized on NVIDIA ConnectX-6 and Solarflare XtremeScale NICs, both of which provide dedicated hardware queues and programmable flow-steering capabilities. The ConnectX-6, for instance, offers up to 8 million packets per second in user-space mode, with sub-microsecond interrupt latency. However, I must emphasize that hardware alone is insufficient. The real magic happens when you combine kernel bypass with CPU pinning, NUMA-aware memory allocation, and careful tuning of PCIe bus parameters. At ORIGINALGO, we spend roughly equal time on software optimizations and hardware configuration. A common mistake I've seen in the industry, particularly from teams coming from web-scale backgrounds, is treating kernel bypass as a simple "drop-in replacement." It is not. You need to redesign your entire data path. For instance, typical zero-copy mechanisms in kernel bypass require memory buffers to be pre-allocated and registered with the NIC, which fundamentally changes how you handle dynamic data structures. We've had to rewrite our serialization and deserialization libraries to work with fixed-size, pre-allocated buffers—a painful but necessary step.

The future of kernel bypass is likely to involve more integrated solutions. DPDK, which began at Intel, has evolved significantly and exposes NIC features such as Intel's Flow Director (FDIR) for precise packet classification, while Mellanox's ASAP² technology offloads more functionality to hardware. However, I believe the real breakthrough will come from dedicated protocol-processing hardware, whether ASICs or FPGAs. AMD, which acquired Xilinx in 2022, already markets FPGA-based NICs that can implement custom TCP offload engines in hardware, promising sub-microsecond latency with minimal CPU involvement. At ORIGINALGO, we're actively evaluating these solutions for our next-generation trading platform. The trade-off, of course, is development cost and flexibility. FPGA development requires specialized hardware description language skills, and debugging hardware-level issues is significantly harder than software. But for firms where every nanosecond counts, the investment may be justified. In summary, kernel bypass is the single most effective technique for TCP/IP stack optimisation for low latency, but it demands a holistic systems-level approach that spans hardware, software, and operational processes.

## Interrupt Coalescing and Polling Mode Drivers

Interrupt coalescing represents one of the most fundamental, and often misunderstood, aspects of low-latency networking. In the simplest configuration, a network driver generates an interrupt for every packet received. While this ensures immediate responsiveness, it also creates significant overhead: context switching, cache misses, and CPU pipeline flushes. Interrupt coalescing addresses this by aggregating multiple packets into a single interrupt, but at the cost of increased latency per packet. The challenge for low-latency trading systems is to find the sweet spot: a configuration that minimizes per-packet latency while maintaining sufficient throughput. At ORIGINALGO, we've spent months tuning interrupt throttling parameters on our trading gateways. Coalescing is configured through `ethtool -C`; on the Intel ixgbe driver, for example, the `rx-usecs` and `rx-frames` parameters control the time window (in microseconds) and the number of frames before an interrupt is generated, although not every driver honours both. For our market data feed, we set `rx-usecs` to 0 and `rx-frames` to 1 to ensure immediate processing of each packet, accepting higher CPU utilization. This is not a configuration I'd recommend for general-purpose workloads, but for low-latency trading, it's essential.

Polling mode drivers (PMDs) take the concept of interrupt avoidance to its logical conclusion: completely eliminating interrupts in favour of continuous polling of the network hardware. This is the approach used by DPDK and similar frameworks. The PMD runs in a dedicated thread that constantly checks the NIC's receive rings for new packets. If no packet is available, the thread simply loops—a process known as "busy polling." The advantage is zero interrupt overhead, but the disadvantage is that the CPU core is fully occupied even when no data is flowing. For trading systems that are idle during overnight hours, this represents significant energy waste. However, for continuous trading environments like equities or crypto markets that operate 24/7, the CPU cost is a worthwhile trade-off. Research from the University of Toronto's computer science department has shown that PMDs can achieve <1 microsecond latency for small packets on modern hardware, compared to 10-20 microseconds with interrupt-driven drivers. At ORIGINALGO, our primary order entry gateway uses a DPDK PMD, and we've measured consistent 1.2µs wire-to-application latency under load—a figure that would be impossible with standard interrupt processing.

The transition from interrupt-driven to polling-mode drivers is not without its pitfalls. One issue we encountered early on was head-of-line blocking in the PMD's receive ring. If the application thread is busy processing a complex order, it may not poll the ring frequently enough, causing backpressure and packet loss. We solved this by implementing a two-stage architecture: a lightweight polling thread that simply copies packets from the NIC rings to an application buffer, and a separate processing thread that handles business logic. This introduces a small amount of additional latency (roughly 500 nanoseconds) but dramatically improves throughput stability. Another challenge is cache locality. The PMD thread must be pinned to a dedicated CPU core, and all memory buffers must be allocated from the same NUMA node as the NIC. We use `taskset` and `numactl` to enforce this at process launch. Failure to do so can result in cross-NUMA memory access, which adds 100-200 nanoseconds of latency per access—a death sentence for low-latency systems. I recall a particularly frustrating debugging session where a seemingly identical backup server exhibited 30% higher latency. After hours of investigation, we discovered the NIC was on a different NUMA node than the CPU core running the PMD thread. The lesson? Always verify your NUMA topology with `lstopo` and document it in your deployment checklist.
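The two-stage split described above can be sketched as a single-producer, single-consumer ring of pre-allocated slots. This is only a structural illustration in Python: `SpscRing` and its sizes are hypothetical names chosen here, and a production pipeline would more likely be written in C against a lock-free ring so that the hand-off cost stays in the hundreds of nanoseconds.

```python
class SpscRing:
    """Fixed-capacity ring of pre-allocated slots.

    One polling thread calls try_push(); one processing thread calls try_pop().
    """

    def __init__(self, capacity: int, slot_size: int) -> None:
        if capacity < 2 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two >= 2")
        self._mask = capacity - 1
        self._slot_size = slot_size
        self._slots = [bytearray(slot_size) for _ in range(capacity)]
        self._lengths = [0] * capacity
        self._head = 0  # next slot to write; advanced only by the producer
        self._tail = 0  # next slot to read; advanced only by the consumer

    def try_push(self, packet: bytes) -> bool:
        """Polling stage: copy one packet in, or report back-pressure."""
        if len(packet) > self._slot_size:
            raise ValueError("packet larger than slot")
        if self._head - self._tail > self._mask:
            return False  # ring full: the processing stage is falling behind
        idx = self._head & self._mask
        self._slots[idx][: len(packet)] = packet
        self._lengths[idx] = len(packet)
        self._head += 1
        return True

    def try_pop(self):
        """Processing stage: take the oldest packet, or None if the ring is empty."""
        if self._tail == self._head:
            return None
        idx = self._tail & self._mask
        data = bytes(self._slots[idx][: self._lengths[idx]])
        self._tail += 1
        return data
```

A `False` return from `try_push` is the signal worth counting in production: it means the processing stage is not keeping up and packets are about to be dropped at the NIC.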

Looking ahead, I see a convergence between interrupt coalescing and polling modes. Modern NICs like the Mellanox ConnectX-7 offer adaptive interrupt moderation, where the driver dynamically adjusts coalescing parameters based on traffic patterns. For example, during quiet periods, it might use aggressive coalescing to reduce CPU usage, but during bursts, it switches to immediate interrupt generation. These "smart" drivers are promising, but they introduce a non-deterministic element that can be problematic for latency-critical applications. At ORIGINALGO, we prefer predictable, deterministic behaviour—even if it means sacrificing some efficiency. Our philosophy is that worst-case latency is more important than average latency in trading. A system that averages 2 microseconds but occasionally spikes to 50 microseconds is less valuable than one that consistently delivers 5 microseconds. This principle shapes all our optimisation decisions, including interrupt handling. To conclude, interrupt coalescing and PMDs are complementary tools in the low-latency toolbox. The optimal choice depends on your specific workload, latency requirements, and tolerance for CPU overhead. For trading systems, the answer is almost always polling-mode drivers, configured for maximum determinism.

## Socket Buffer Tuning and Memory Management

The TCP socket buffer is a critical but often neglected component in the quest for low latency. Default operating system settings for receive and send buffers are optimized for bulk throughput—typically 64KB to 256KB—which is entirely wrong for low-latency trading. Large buffers can introduce significant latency because they encourage the kernel to batch data, waiting to fill the buffer before delivering it to the application. This batching effect, while beneficial for bulk transfers, is catastrophic for real-time trading where every microsecond matters. The standard optimisation is to minimize buffer sizes. At ORIGINALGO, we set `net.core.rmem_default` and `net.core.wmem_default` to just 8KB on our trading gateways. This forces the kernel to deliver data to the application as quickly as possible, reducing latency accumulation. However, there's a trade-off: extremely small buffers can cause packet drops under bursty traffic. We've had to carefully calibrate this based on our typical message sizes (which average around 200 bytes for market data updates) and the expected peak throughput (around 500,000 packets per second). The Linux kernel's `tcp_rmem` parameter, which defines minimum, default, and maximum buffer sizes, is set to "4096 87380 4194304" by default on most distributions. We change this to "4096 8192 16384" for low-latency workloads.
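Buffer sizes can also be set per socket rather than host-wide. The sketch below is a minimal illustration, assuming a Linux host; `shrink_buffers` is a hypothetical helper name, and the 8 KB figures simply mirror the defaults quoted above.

```python
import socket


def shrink_buffers(sock: socket.socket, rcv_bytes: int = 8192, snd_bytes: int = 8192):
    """Request small per-socket buffers and return what the kernel actually granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcv_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, snd_bytes)
    granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    return granted_rcv, granted_snd
```

On Linux the value read back is normally double the request, because the kernel reserves room for its own bookkeeping. Setting `SO_RCVBUF` explicitly also switches off receive-buffer autotuning for that socket, so the effect should be measured rather than assumed.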

Memory management extends beyond just buffer sizes. The way memory is allocated for packet processing can have a significant impact on latency. Traditional memory allocation using `malloc()` is slow and non-deterministic, especially under memory pressure. For low-latency systems, pre-allocation and memory pooling are essential. DPDK's mempool library, for example, provides a lockless ring-based memory allocator that delivers consistent sub-microsecond allocation times. At ORIGINALGO, we pre-allocate all packet buffers at application startup, creating a pool of 64,000 buffers each of 2048 bytes. This pool is never freed during normal operation—we simply return buffers to the pool after processing. This eliminates allocation latency entirely. Another important consideration is cacheline alignment. Modern CPUs load data in 64-byte cache lines, and misaligned access can cause significant penalties. We use `posix_memalign()` with 64-byte alignment for all critical data structures, including packet buffers. This simple change reduced our average processing time by about 8% in internal benchmarks.
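The pooling pattern itself is small. The following sketch shows only the acquire/release discipline; `BufferPool` is a hypothetical name, and Python gives no control over cache-line alignment, so this says nothing about the alignment gains mentioned above.

```python
class BufferPool:
    """Pool of fixed-size buffers allocated once, at startup."""

    def __init__(self, count: int, size: int) -> None:
        self._size = size
        self._free = [bytearray(size) for _ in range(count)]

    def acquire(self):
        """Return a free buffer, or None when the pool is exhausted."""
        return self._free.pop() if self._free else None

    def release(self, buf: bytearray) -> None:
        """Hand a buffer back; it is reused, never freed."""
        if len(buf) != self._size:
            raise ValueError("buffer does not belong to this pool")
        self._free.append(buf)

    @property
    def available(self) -> int:
        return len(self._free)
```

Returning `None` on exhaustion, instead of allocating a fresh buffer, keeps the hot path free of allocator calls and makes an undersized pool visible immediately.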

One of the more subtle issues we've encountered involves TCP's receive offload features. Technologies like LRO (Large Receive Offload) and GRO (Generic Receive Offload) are designed to improve throughput by combining multiple small packets into larger ones before delivering them to the kernel. While this reduces CPU overhead, it also introduces latency because the kernel must wait for additional packets to arrive before coalescing. For low-latency trading, these features must be disabled. We explicitly set `ethtool -K eth0 gro off lro off` on all our trading interfaces. Disabling these offloads increases CPU utilization by about 15-20%, but the latency reduction is well worth it: we saw a 40% improvement in worst-case latency after disabling GRO. I've had several conversations with colleagues from other firms who were unaware of this setting, and a quick configuration change often yielded immediate improvements. The lesson is clear: default networking features designed for throughput are often antithetical to low-latency performance. You need to actively disable them.

The interaction between socket buffer tuning and application-level buffering is another critical area. In many trading systems, application-level data structures can inadvertently introduce latency. For example, if your application reads from a socket buffer into a large, dynamically-growing queue, you're effectively recreating the problem you just fixed at the kernel level. At ORIGINALGO, we avoid extra copies wherever possible. Our market data handlers read from the socket straight into pre-allocated application buffers, using `recvmsg()` with `MSG_DONTWAIT` and pre-allocated `iovec` structures. The kernel-to-user copy remains, but this eliminates the second, application-level copy and reduces cache pressure. We also use `epoll` with edge-triggered (ET) mode rather than level-triggered (LT) mode, which reduces the number of system calls. In edge-triggered mode, we must read all available data from the socket in a single loop, or risk missing events. This requires careful programming to avoid blocking, but the performance benefits are substantial. Our internal benchmarks show that edge-triggered epoll reduces system calls per packet by about 60% compared to level-triggered mode, directly translating to lower latency. The bottom line: socket buffer tuning is not a one-time configuration but an ongoing process of measurement and adjustment, tightly coupled with your application's data flow design.
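The edge-triggered read loop is easy to get subtly wrong, so a small sketch may help. It assumes Linux (`select.epoll`), uses a non-blocking socket with `recv_into` in place of `recvmsg()` with `MSG_DONTWAIT`, and the function names are illustrative only.

```python
import select
import socket


def register_edge_triggered(ep: "select.epoll", sock: socket.socket) -> None:
    """Make the socket non-blocking and watch it for edge-triggered readability."""
    sock.setblocking(False)
    ep.register(sock.fileno(), select.EPOLLIN | select.EPOLLET)


def drain(sock: socket.socket, buf: bytearray):
    """Read into a pre-allocated buffer until the socket would block.

    Returns (bytes_read, peer_closed). If the buffer fills up first, the caller
    must process it and call drain() again, because edge-triggered epoll will
    not report data that was already pending.
    """
    view = memoryview(buf)
    total = 0
    while total < len(buf):
        try:
            n = sock.recv_into(view[total:])
        except BlockingIOError:
            return total, False  # nothing left to read for now
        if n == 0:
            return total, True  # orderly shutdown by the peer
        total += n
    return total, False
```

The loop ends on `BlockingIOError`, not after the first successful read; stopping earlier is the classic way to lose an event in edge-triggered mode.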

## TCP Congestion Control and Selective Acknowledgment

The choice of TCP congestion control algorithm has a surprising impact on latency, particularly in high-frequency trading environments. Default algorithms like CUBIC are designed for long-lived bulk transfers over wide-area networks, optimizing for fairness and throughput rather than latency. These algorithms can introduce unnecessary delays through their window growth and reduction mechanisms. For low-latency trading, we need algorithms that prioritize quick reaction to changing network conditions. At ORIGINALGO, we've standardized on TCP BBR (Bottleneck Bandwidth and Round-trip propagation time) for our inter-datacenter links. BBR is a model-based congestion control algorithm that avoids the packet-loss-based signals used by CUBIC. Instead, it uses pacing and bandwidth estimation to maintain high throughput while minimizing queuing delay. In our testing, BBR reduced average round-trip latency by about 30% compared to CUBIC under the same network conditions. However, BBR has its own quirks. It can be overly aggressive in consuming network buffers, so we've had to tune the `net.ipv4.tcp_congestion_control` and associated pacing parameters carefully.
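The sysctl sets the host-wide default, but Linux also lets an individual socket choose its algorithm through the `TCP_CONGESTION` socket option, which is convenient when only the inter-datacenter connections should use BBR. A minimal sketch, assuming Linux and that the requested algorithm is loaded and permitted for the calling user; the helper names are hypothetical.

```python
import socket


def set_congestion_control(sock: socket.socket, name: str) -> bool:
    """Try to select a congestion control algorithm for one socket."""
    try:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, name.encode("ascii"))
    except OSError:
        return False  # algorithm not available or not allowed on this host
    return True


def get_congestion_control(sock: socket.socket) -> str:
    """Return the algorithm currently attached to the socket."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    return raw.split(b"\x00", 1)[0].decode("ascii")
```

Calling `set_congestion_control(sock, "bbr")` and checking the return value gives a simple startup check that the host is configured as expected before any traffic flows.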

Selective Acknowledgment (SACK) is another TCP feature with significant latency implications. SACK allows the receiver to inform the sender about exactly which segments are missing, enabling more efficient retransmission. This is generally beneficial for throughput, but it can also reduce latency during packet loss events. Without SACK, a single lost packet can cause a TCP sender to retransmit multiple subsequent packets unnecessarily, wasting bandwidth and increasing recovery time. With SACK, only the lost segment is retransmitted, minimizing the disruption. However, SACK processing adds overhead: SACK blocks can fill most of the 40-byte TCP option space (up to 34 bytes for four blocks), and the receiver must track out-of-order segments. At ORIGINALGO, we've found that SACK is beneficial for our inter-exchange connections, which traverse public internet and experience periodic packet loss. For our internal datacenter links, where loss rates are below 0.001%, we actually disable SACK to reduce header overhead. We use the `net.ipv4.tcp_sack` sysctl parameter to control this. Because it is a host-wide setting rather than a per-connection option, we set it to 1 on hosts that face external links and to 0 on hosts that serve only internal ones. This granular approach has reduced our average retransmission latency by about 15% on external connections while saving about 2% CPU on internal links.

One of the less-discussed aspects of TCP optimisation is the impact of the initial congestion window (IW). TCP's slow start mechanism begins with a small window and grows it over time, which can introduce significant latency for short-lived connections. In trading, many connections are long-lived (persistent connections to exchanges), so slow start is less of an issue. However, for initial connection establishment or after idle periods, the IW matters. We've increased our IW from the default of 10 segments to 20 segments. Linux exposes no sysctl for this; the window is set per route with the `initcwnd` option of `ip route`. This reduces the time to reach full throughput by about one round-trip time. For our time-sensitive order entry connections, this can shave off 50-100 microseconds during reconnection events. Dr. Vanessa L. P. of the Networking Research Lab at Stanford University has published work showing that increasing IW to 30 or even 40 segments yields further benefits, but we've found diminishing returns beyond 20 due to increased risk of buffer overflow in our network fabric. The key is to coordinate IW settings with your network operators to ensure consistent configuration across all devices.
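The "about one round-trip" saving follows from the way slow start doubles the window each round trip. The sketch below uses an idealised model (no loss, no receiver-window limit, window doubling every RTT); `slow_start_round_trips` is a hypothetical helper for illustration.

```python
def slow_start_round_trips(segments: int, initial_window: int) -> int:
    """Round trips needed to deliver `segments` when cwnd doubles every RTT."""
    if initial_window < 1:
        raise ValueError("initial_window must be at least 1")
    cwnd = initial_window
    sent = 0
    rtts = 0
    while sent < segments:
        sent += cwnd  # one window's worth of segments per round trip
        cwnd *= 2     # slow start: the window doubles after each round
        rtts += 1
    return rtts
```

Under this model a 100-segment burst takes four round trips with an initial window of 10 (10, 30, 70, 150 segments cumulatively) and three with an initial window of 20 (20, 60, 140), which is the one-RTT difference.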

The interplay between congestion control and application-level flow control is also worth examining. Many trading applications implement their own pacing mechanisms to avoid overwhelming downstream systems. For example, our order router imposes a rate limit of 10,000 orders per second per exchange to comply with market rules. This application-level pacing interacts with TCP congestion control: if TCP's window is larger than the application's sending rate, the connection is effectively application-limited rather than network-limited. In such cases, congestion control algorithms have minimal impact on latency. We analyse packet captures with `tcptrace` to monitor this. If we observe that TCP send buffers are consistently empty (i.e., the application is the bottleneck), we know that further TCP tuning will yield no benefit. Conversely, if we see buffer occupancy growing, it indicates that the network is the bottleneck, and congestion control tuning may help. This kind of systematic measurement is essential for effective optimisation. My personal rule of thumb: measure first, then tune. Too many teams blindly apply optimisations without understanding where the actual bottleneck lies, leading to wasted effort and sometimes worse performance. In summary, TCP congestion control and SACK are powerful tools for latency reduction, but their effectiveness depends heavily on your specific network environment and application profile.

## TCP_NODELAY, Nagle's Algorithm, and Write Coalescing

Nagle's algorithm, a standard feature of TCP implementations, is designed to improve efficiency by delaying small writes. It works by accumulating small packets into a single larger segment before transmission, reducing the number of packets sent. While this is beneficial for bulk data transfer (e.g., file uploads), it is catastrophic for low-latency trading where every message matters independently. Nagle's algorithm holds back a small segment until previously sent data has been acknowledged; combined with the receiver's delayed-ACK timer, this can stall a message for tens of milliseconds, and up to roughly 200 milliseconds on some stacks. For a trading system, this is simply unacceptable. The solution is to disable Nagle's algorithm using the `TCP_NODELAY` socket option. At ORIGINALGO, we set `TCP_NODELAY` on every socket that carries trading data. This ensures that each write operation results in an immediate TCP segment transmission, without any delay. We've seen cases where failing to set this option caused 150-millisecond latency spikes during pre-market churn, a scenario that could easily trigger a "fat finger" false positive in our risk controls.
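Setting the option is a one-liner, but reading it back is a cheap safeguard against a wrapper library silently resetting it. A minimal sketch; `disable_nagle` is a hypothetical helper name.

```python
import socket


def disable_nagle(sock: socket.socket) -> bool:
    """Set TCP_NODELAY and confirm that the kernel reports it as enabled."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

The option is per socket, so it has to be applied to every connection, including sockets returned by `accept()` on platforms where it is not inherited from the listener.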

However, simply disabling Nagle's algorithm is not enough. The real challenge is managing the trade-off between latency and packet overhead. When you send many small packets (e.g., individual order acknowledgments of 60 bytes), you're consuming network resources inefficiently. Each TCP segment on Ethernet carries about 78 bytes of fixed overhead: 38 bytes of Ethernet framing (preamble, header, CRC, and inter-frame gap) plus 40 bytes of IPv4 and TCP headers without options. A 60-byte TCP payload therefore occupies 138 bytes on the wire, of which about 57% is overhead. On heavily utilized network links, this can lead to congestion and increased latency for all traffic. We've addressed this through a technique called "write coalescing" at the application level. Rather than issuing multiple small writes, our trading application batches multiple messages into a single write operation. For example, instead of sending each order confirmation individually, we aggregate up to 10 confirmations into a single TCP write, executed every 100 microseconds. This reduces packet count by up to 90% while adding at most 100 microseconds of latency. This is a conscious trade-off that we've validated through extensive backtesting. For time-critical messages like fill reports, we still send them immediately (no batching), but less latency-sensitive messages like risk position updates are batched.
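A write coalescer along these lines might look like the sketch below. It is an illustration under assumptions, not the production design: `WriteCoalescer`, its parameters, and the choice to flush pending messages ahead of an urgent one (to preserve ordering on the connection) are all hypothetical.

```python
import time


class WriteCoalescer:
    """Batch small messages into one write, bounded by count and by delay."""

    def __init__(self, send, max_messages=10, max_delay_us=100,
                 clock_ns=time.monotonic_ns):
        self._send = send                      # callable taking one bytes object
        self._max_messages = max_messages
        self._max_delay_ns = max_delay_us * 1_000
        self._clock_ns = clock_ns
        self._pending = []
        self._deadline_ns = 0

    def submit(self, message: bytes, urgent: bool = False) -> None:
        if urgent:
            self.flush()          # keep ordering: pending data goes out first
            self._send(message)   # then the urgent message, unbatched
            return
        now = self._clock_ns()
        if not self._pending:
            self._deadline_ns = now + self._max_delay_ns
        self._pending.append(message)
        if len(self._pending) >= self._max_messages or now >= self._deadline_ns:
            self.flush()

    def poll(self) -> None:
        """Call from the event loop so a partial batch never waits past its deadline."""
        if self._pending and self._clock_ns() >= self._deadline_ns:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self._send(b"".join(self._pending))
            self._pending.clear()
```

In use, `send` would typically be the `sendall` method of a socket with `TCP_NODELAY` set, and `poll()` would be called on every iteration of a busy loop; the injectable clock exists mainly so the timing logic can be unit-tested.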

The `TCP_CORK` option, which is available on Linux, provides another mechanism for controlling write behavior. When `TCP_CORK` is set, the kernel suppresses sending partial segments until the option is removed or a 200-millisecond ceiling expires. This is useful for ensuring that multiple writes are combined into a single TCP segment. At ORIGINALGO, we've used `TCP_CORK` in our market data replay systems to optimize bulk transmission of historical data. By setting cork before writing a batch of messages and removing it after the batch is complete, we achieve nearly perfect TCP segmentation with zero application-level batching overhead. However, `TCP_CORK` must be used with care: if you forget to un-cork, every partial segment waits out that ceiling and the connection can appear to stall. We've implemented a watchdog timer that force-flushes the socket after 10 milliseconds to prevent this. This is one of those "small details" that can cause big problems in production. I recall an incident where a code refactoring accidentally removed the cork removal step, causing our historical data server to appear dead for several minutes before the watchdog kicked in. The lesson: always test socket option configurations under realistic load conditions.
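One way to make the un-cork step hard to forget is to tie it to a scope, so that it runs even when the batch raises an exception. A minimal sketch, assuming Linux (where `socket.TCP_CORK` is defined); `corked` is a hypothetical helper name.

```python
import contextlib
import socket


@contextlib.contextmanager
def corked(sock: socket.socket):
    """Hold back partial segments for the duration of the block, then release."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
    try:
        yield sock
    finally:
        # Always un-cork, even if a write inside the block failed.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)
```

A batch is then written as `with corked(conn): conn.sendall(header); conn.sendall(body)`, and the removal step cannot be lost in a refactoring without deleting the `with` statement itself.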

Network interface offloading features can also influence write behavior. For instance, TSO (TCP Segmentation Offload) allows the NIC to split large TCP segments into smaller MTU-sized packets directly in hardware, reducing CPU overhead. However, TSO can interact badly with Nagle's algorithm and `TCP_NODELAY`. When TSO is active, the kernel may pass a large segment (up to 64KB) to the NIC, and the NIC's segmentation logic determines the actual packet boundaries. This can lead to sub-optimal behavior if you're trying to control packet timing precisely. At ORIGINALGO, we disable TSO on our latency-critical connections using `ethtool -K eth0 tso off`. This forces the kernel to generate MTU-sized segments, giving us finer control over write timing. The trade-off is increased CPU utilization—about 5-10% on our systems—but the improved determinacy is worth it. For our bulk data servers, we keep TSO enabled because throughput is more important than precise packet timing. The key insight is that socket write optimizations must be considered holistically, in conjunction with NIC offloading features and application-level batching. There is no one-size-fits-all solution; the optimal configuration depends on your specific latency tolerance, message sizes, and throughput requirements.

## NUMA Awareness and CPU Pinning

Non-Uniform Memory Access (NUMA) architecture, where memory access times vary depending on which CPU socket the memory is attached to, is a critical but often overlooked factor in TCP/IP stack optimisation. In modern multi-socket systems, accessing memory from a remote NUMA node can be 30-50% slower than accessing local memory. For low-latency trading applications that process millions of packets per second, this latency penalty can be devastating. At ORIGINALGO, we've made NUMA awareness a fundamental design principle for all our trading systems. Every process that handles network traffic is explicitly bound to the same NUMA node as the network interface card it uses. This ensures that all packet buffers, socket structures, and application data are allocated from local memory. We use the `numactl` utility to bind processes at launch and verify NUMA bindings with `lstopo` during deployment. A common mistake is to rely on the kernel's automatic NUMA balancing, which can migrate threads between nodes—this is disastrous for deterministic latency. We disable automatic NUMA balancing via `echo 0 > /proc/sys/kernel/numa_balancing` on all our trading servers.

CPU pinning goes hand-in-hand with NUMA awareness. In low-latency systems, you want dedicated CPU cores for critical tasks to avoid interference from other processes. Context switching overhead—which can take 5-10 microseconds—is completely unacceptable for sub-microsecond latency targets. At ORIGINALGO, we reserve entire CPU cores for specific tasks. For example, our market data handler runs on CPU cores 0-1, the order entry gateway runs on cores 2-3, and the risk management engine runs on cores 4-5. Each core runs exactly one thread, with no sharing allowed. This is achieved through a combination of `taskset` for process affinity and kernel boot parameters like `isolcpus` to prevent the kernel from scheduling general tasks on these cores. The `isolcpus` parameter is set in the GRUB configuration: `isolcpus=0-5`. This effectively dedicates those cores to our trading applications. I've personally benchmarked the impact of CPU pinning: without isolation, our average packet processing latency was 4.2 microseconds with occasional spikes to 25 microseconds. After pinning and isolation, the average dropped to 3.1 microseconds and worst-case spikes were below 5 microseconds. The improvement is dramatic because we eliminated competing interrupt handling from network drivers, disk I/O, and other system processes.
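`taskset` sets affinity from outside the process; the same can be done from inside, per thread, which is useful when different threads of one process need different cores. A minimal sketch, assuming Linux; `pin_current_thread` is a hypothetical helper, and the core number would come from your own deployment plan.

```python
import os


def pin_current_thread(cpu: int):
    """Restrict the calling thread to a single CPU and return the resulting set."""
    allowed = os.sched_getaffinity(0)
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling thread
    return os.sched_getaffinity(0)
```

Note that affinity only keeps the thread on the chosen core; keeping other work off that core still depends on `isolcpus` or an equivalent isolation mechanism.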

Interrupt requests (IRQs) represent another source of latency that must be controlled through CPU pinning. Each NIC generates interrupts that must be handled by a CPU core. If these interrupts land on the same cores running your trading application, they can preempt your critical processing. The solution is to move all NIC interrupts to dedicated cores that are not used by your application. On Linux, you can do this by writing CPU bitmasks to `/proc/irq/<IRQ>/smp_affinity`, where `<IRQ>` is the interrupt number. For example, if we have 8 CPU cores (0-7) and our application uses cores 0-3, we'll assign all NIC interrupts to cores 4-7. This ensures that interrupt handling never interrupts application processing. We also disable `irqbalance` (`systemctl stop irqbalance` and `systemctl disable irqbalance`) to prevent the daemon from automatically moving interrupts. The effectiveness of this technique is well-documented. A 2019 study published in the Journal of Network and Computer Applications found that proper IRQ affinity can reduce application-level packet processing latency by up to 40% on multi-socket systems. At ORIGINALGO, we've seen similar improvements, particularly under high load where interrupt storms can otherwise overwhelm application threads.
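The bitmask written to a per-IRQ `smp_affinity` file is hexadecimal, with one bit per CPU, and on machines with more than 32 CPUs it is split into comma-separated 32-bit groups. The helper below is a hypothetical illustration of that encoding; it only computes the string and does not write to `/proc`.

```python
def irq_affinity_mask(cpus) -> str:
    """Encode a set of CPU numbers as an smp_affinity-style hex bitmask."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    digits = format(mask, "x")
    groups = []
    while digits:
        groups.append(digits[-8:])  # 8 hex digits = one 32-bit group
        digits = digits[:-8]
    return ",".join(reversed(groups))
```

For the example in the text, cores 4-7 encode as `f0`. Where available, the sibling file `smp_affinity_list` accepts the more readable range form `4-7` directly.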

The practical implementation of NUMA awareness and CPU pinning requires careful hardware selection and system design. We choose motherboards that provide clear NUMA topology documentation and avoid systems with "hyper-threading" on our latency-critical cores. Hyper-threading, where two logical cores share the same physical execution resources, can introduce latency variability due to resource contention. We disable hyper-threading in the BIOS for our trading servers. Another consideration is the placement of the NIC in the PCIe slot. We ensure that the NIC's DMA operations target memory on the same NUMA node as our application cores. This is verified by checking the PCIe bus location and the associated NUMA node using `cat /sys/bus/pci/devices/0000:XX:00.0/numa_node`. We've encountered cases where a "mistakenly" inserted NIC on a different NUMA node added 200 nanoseconds of latency per packet—an easily avoidable error. The bottom line: NUMA awareness and CPU pinning are not optional optimizations; they are foundational requirements for any system targeting sub-10-microsecond latency. The effort to implement them correctly is significant, but the payoff in terms of latency reduction and determinism is immense.

## Connection Management and Keep-Alive Strategies

Connection management might seem mundane compared to kernel bypass or DPDK, but it's a source of substantial latency variability. In trading, we typically maintain persistent TCP connections to multiple exchanges, market data providers, and clearing systems. The way these connections are established, maintained, and torn down can have a significant impact on overall latency. One common issue is the TCP three-way handshake, which adds a full round-trip time to connection setup. For a session that lasts hours, this overhead is negligible. However, for connections that experience frequent drops (e.g., due to network congestion or exchange-side failures), reconnection latency becomes critical. At ORIGINALGO, we've implemented a pre-connection pool: we maintain a set of "warm" connections that are established in advance and kept idle, ready for immediate use when a primary connection fails. This reduces reconnection latency from 500-1000 microseconds to under 10 microseconds. The cost is increased memory usage and periodic keep-alive traffic to prevent idle connection drops.

TCP keep-alive mechanisms are themselves a source of potential latency issues. By default, Linux waits 2 hours (`tcp_keepalive_time` = 7200 seconds) before sending the first keep-alive probe on an idle connection, then sends up to 9 probes at 75-second intervals before declaring the connection dead. Detecting a dead idle connection can therefore take 2 hours plus 9 × 75 seconds, or roughly 2 hours and 11 minutes. For a trading system where every second of market data loss is costly, this is entirely unacceptable. We've drastically reduced our keep-alive parameters. For our trading connections, we set `net.ipv4.tcp_keepalive_time` to 5 seconds, `net.ipv4.tcp_keepalive_intvl` to 1 second, and `net.ipv4.tcp_keepalive_probes` to 3. These sysctls are system-wide defaults and only take effect on sockets that have `SO_KEEPALIVE` enabled; the per-socket options `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` can override them for individual connections. With these values, we detect connection failure within 8 seconds (5 + 3 × 1). The downside is increased network overhead: each idle connection sends a keep-alive probe every 5 seconds, which across 1000 idle connections means 200 probes per second. For a 10GbE link, this is negligible traffic, but it does add CPU overhead for packet processing. We've considered using application-level heartbeats instead of TCP keep-alive, which would give us more control over timing and payload. However, the simplicity of TCP keep-alives means we've kept them for now, with a plan to migrate to application-level heartbeats in our next architecture overhaul.
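
The same timings can be applied per socket rather than system-wide. Below is a minimal sketch, assuming Linux, where Python's `socket` module exposes `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT`. The helper names are hypothetical, and `detection_time_s` simply encodes the idle-plus-probes arithmetic used in the text.

```python
import socket


def enable_keepalive(sock, idle_s=5, interval_s=1, probes=3):
    """Turn on TCP keep-alive for one socket with aggressive timings (Linux)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idleness before the first probe is sent.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    # Seconds between successive probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    # Unanswered probes before the connection is declared dead.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


def detection_time_s(idle_s, interval_s, probes):
    """Approximate worst-case time to notice a dead idle peer."""
    return idle_s + probes * interval_s


tuned = detection_time_s(5, 1, 3)        # 8 seconds
default = detection_time_s(7200, 75, 9)  # 7875 seconds, about 2 h 11 min
```

Setting the options per socket keeps the aggressive timings confined to trading connections, so other services on the same host keep the system defaults.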

Connection reuse is another important strategy. Many trading protocols, such as FIX (Financial Information eXchange), use a single TCP connection for multiple messages. However, some API implementations encourage creating new connections for different transaction types. This is inefficient because each new connection requires a handshake and a slow-start ramp-up. At ORIGINALGO, we aggressively reuse connections. Our connection manager maintains a pool of long-lived connections to each counterparty, and we route all traffic through a small set of multiplexed sessions. This reduces the number of connections from potentially thousands to just a handful. The result is lower memory usage, reduced CPU overhead for connection management, and more stable latency profiles. I've seen firms that open a new connection for every single order, a practice that adds 500-1000 microseconds of overhead per trade. Even at 1,000 orders per second, that amounts to 0.5 to 1 second of cumulative connection setup for every second of trading. The key is to understand your trading application's connection requirements and design for persistent, multiplexed connections from the start.
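
As a rough illustration of connection reuse, the sketch below keeps one long-lived socket per counterparty and sends every message over it. The `SessionManager` class and the counterparty labels are hypothetical. A real FIX engine would add session logon, sequence numbers, and reconnect handling on top.

```python
import socket


class SessionManager:
    """Routes all messages for a counterparty over one persistent connection."""

    def __init__(self, endpoints):
        # endpoints: dict mapping a counterparty label to a (host, port) tuple
        self._endpoints = dict(endpoints)
        self._sessions = {}

    def get(self, counterparty):
        """Return the existing socket, connecting only on first use."""
        sock = self._sessions.get(counterparty)
        if sock is None:
            sock = socket.create_connection(self._endpoints[counterparty], timeout=1.0)
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            sock.settimeout(None)
            self._sessions[counterparty] = sock
        return sock

    def send(self, counterparty, payload):
        # No handshake here: the cost was paid once, when the session opened.
        self.get(counterparty).sendall(payload)

    def __len__(self):
        return len(self._sessions)

    def close_all(self):
        for sock in self._sessions.values():
            sock.close()
        self._sessions.clear()
```

Every order for the same counterparty then travels over the same socket, so the handshake and slow-start costs are paid once per session rather than once per order.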

Finally, connection tear-down is a surprisingly frequent source of latency issues. In the TCP TIME_WAIT state, the side that closes a connection first keeps the socket around for 2*MSL (twice the Maximum Segment Lifetime; Linux uses a fixed TIME_WAIT period of 60 seconds). This can create port exhaustion if connections are opened and closed rapidly. It is less of an issue for long-lived trading connections, but it becomes critical during system restarts or failovers. We use the `net.ipv4.tcp_tw_reuse` parameter to allow new outgoing connections to reuse TIME_WAIT sockets when TCP timestamps are enabled, and, on older kernels, `net.ipv4.tcp_tw_recycle` (this option is known to break connections from clients behind NAT and was removed entirely in Linux 4.12, so it should not be relied on). For our tactical failover scenarios, we've implemented a "graceful connection draining" strategy: when a connection is being closed, we send a special application-level message to the counterparty, then wait for all pending data to be acknowledged before initiating the TCP close. This prevents data loss and reduces the number of connections entering TIME_WAIT. Dr. Y. S. Kim, a network engineer at a major exchange, has noted that "connection management is the Achilles' heel of many low-latency trading systems—it's not glamorous, but it's where many real-world failures occur." I couldn't agree more. Our quarterly incident reviews consistently show that connection management issues, while unglamorous, are among the most common causes of latency anomalies.
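
One way to approximate the "wait until pending data is acknowledged" step on Linux is to poll the socket's send queue with the `SIOCOUTQ` ioctl. To my understanding, on a TCP socket this reports bytes that have been written but not yet acknowledged by the peer. The sketch below is a hedged illustration rather than the draining logic described above: the function names, the farewell payload, and the polling interval are all assumptions, and the fallback constant `0x5411` is the Linux value of `TIOCOUTQ`/`SIOCOUTQ`.

```python
import fcntl
import socket
import struct
import termios
import time

# On Linux, SIOCOUTQ shares its value with TIOCOUTQ (0x5411).
SIOCOUTQ = getattr(termios, "TIOCOUTQ", 0x5411)


def unacked_bytes(sock):
    """Bytes in the send queue that the peer has not yet acknowledged (Linux)."""
    raw = fcntl.ioctl(sock.fileno(), SIOCOUTQ, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def drain_and_close(sock, farewell, timeout_s=1.0, poll_s=0.0005):
    """Send a final application message, wait for it to be ACKed, then close.

    Returns True if the send queue drained before the timeout, else False.
    """
    sock.sendall(farewell)
    deadline = time.monotonic() + timeout_s
    drained = False
    while True:
        if unacked_bytes(sock) == 0:
            drained = True
            break
        if time.monotonic() >= deadline:
            break
        time.sleep(poll_s)
    sock.close()
    return drained
```

A `False` return means the peer never acknowledged the final bytes within the timeout, which is worth logging during failover. Note that an acknowledgement only shows the peer's kernel received the data, not that the application processed it, so an application-level confirmation remains the stronger guarantee.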

## Conclusion: The Eternal Pursuit of Microseconds

The journey through TCP/IP stack optimisation for low latency reveals a fundamental truth: there are no silver bullets. Every technique (kernel bypass, interrupt tuning, socket buffer management, congestion control, write coalescing, NUMA awareness, and connection management) involves trade-offs. The most successful implementations come from understanding these trade-offs and making deliberate choices based on your specific requirements.

At ORIGINALGO TECH CO., LIMITED, our philosophy is that latency optimisation is never "done." The landscape is constantly evolving: new NIC hardware, updated kernel features, and changing exchange connection requirements all demand ongoing attention. We've structured our engineering team to include dedicated network optimisation specialists who continuously monitor latency metrics and test new configurations. This is not a one-time project but a continuous process of measurement, analysis, and adjustment.

Looking forward, I see three key trends that will shape the field. First, the move towards **programmable data planes** using P4 or eBPF will allow even finer-grained control over packet processing. Imagine being able to implement custom congestion control algorithms that respond to your specific application's requirements, all within the kernel or NIC firmware. Second, **machine learning for adaptive tuning** is emerging. Some research groups are exploring reinforcement learning agents that automatically adjust TCP parameters based on real-time network conditions. While this is still experimental, I believe it will become mainstream within five years. Third, **hardware-software co-design** will become more important. The line between what's done in software (in the TCP stack) and what's done in hardware (on the NIC) is blurring. Products like NVIDIA's BlueField DPU (Data Processing Unit) allow you to run networking software directly on the NIC, bypassing the host CPU entirely.
This could represent the next frontier in low-latency networking.

However, I must offer a note of caution. In our quest for lower latency, we must not lose sight of operational reliability. Some of the most aggressive optimisation techniques (disabling congestion control, using raw sockets, bypassing kernel security mechanisms) can introduce fragility. I've seen firms waste months of engineering effort chasing microsecond improvements that ultimately made their systems less stable. My personal rule is: **optimise for determinism first, then for average latency**. A slightly slower system that behaves predictably under all conditions is far more valuable than a faster one that occasionally spikes or fails. This principle has guided our decisions at ORIGINALGO, and I believe it should guide the industry as a whole.

Finally, I want to emphasize that TCP/IP stack optimisation is not just about technical configuration. It requires a deep understanding of your application's traffic patterns, your network's physical characteristics, and your business's risk tolerance. The best engineers in this field are not just Linux kernel experts; they're also students of market microstructure, network physics, and financial risk management. The low-latency trading industry is a unique intersection of computer science and finance, and it demands practitioners who can think across these domains. As we continue to push the boundaries of what's possible, I'm excited to see what the next generation of network engineers and quant developers will achieve. The pursuit of the perfectly optimised TCP stack is, in a sense, a microcosm of the broader human drive for efficiency and speed, and it's a journey well worth taking.

## ORIGINALGO TECH CO., LIMITED's Insights on TCP/IP Stack Optimisation for Low Latency

At **ORIGINALGO TECH CO., LIMITED**, we recognize that TCP/IP stack optimisation for low latency is not merely a technical exercise but a strategic imperative for modern financial systems.
Our experience across high-frequency trading, market data distribution, and algorithmic execution has taught us that **every microsecond of latency reduction must be balanced against reliability, maintainability, and cost**. We've developed a framework that prioritises **determinism over raw speed**, advocating for a layered approach where kernel bypass and polling drivers are combined with robust connection management and thorough testing. We strongly believe that the best optimisations are those that can be measured, reproduced, and rolled back safely. In our work with clients across Asia and Europe, we've seen too many firms chase headline-grabbing latency numbers while neglecting the operational fundamentals—such as NUMA topology verification, interrupt binding, and systematic benchmarking under realistic load. Our insight is simple: **optimal latency is not about the fastest possible configuration, but about the most predictable one**. We recommend that firms invest in continuous performance monitoring,