Solving VPN-Related Network Timeouts on OpenWrt

This guide documents the diagnosis and resolution of a common network issue: intermittent connection timeouts for specific services when traffic is routed through a VPN tunnel (e.g., WireGuard, ZeroTier) on an OpenWrt router. The root cause is a Path MTU Discovery (PMTUD) black hole, and the solution is to enable TCP MSS Clamping in the firewall.

1. The Problem: Connection Timeouts and Stalls

The primary symptom is that certain TCP connections hang and eventually time out, while others work perfectly.

  • Failing Services: Connections that must carry full-size packets early on, such as HTTPS sites or APIs whose TLS handshake includes a large server certificate chain.
  • Working Services: Connections to simple websites that transfer very little data (e.g., `ifconfig.me`) and standard ICMP echo (ping) requests, which use small packets by default.

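This issue arises because VPN encapsulation adds overhead to every packet. If a link on the internet path has a smaller MTU (Maximum Transmission Unit) than the encapsulated packet, the packet cannot be forwarded whole. Normally the router that drops it sends back an ICMP "Fragmentation Needed" message (ICMPv6 "Packet Too Big") so the sender can reduce its packet size. If that router silently drops oversized packets instead, or a firewall along the way filters the ICMP reply, a PMTUD black hole is created. The connection stalls because the sender (here, the remote server the client is talking to) never learns that it needs to send smaller packets.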
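To make the overhead concrete, here is a small worked example in Python for WireGuard. It assumes the usual per-packet cost of an outer IPv4 (20 bytes) or IPv6 (40 bytes) header, an 8-byte UDP header and 32 bytes of WireGuard framing (a 16-byte data-message header plus a 16-byte authentication tag). It ignores WireGuard's padding. The function names are illustrative only.

```python
# Worked example of WireGuard encapsulation overhead (padding ignored).
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
WIREGUARD_FRAMING = 32  # 16-byte data-message header + 16-byte Poly1305 tag


def wireguard_overhead(outer_ipv6):
    """Bytes WireGuard adds to each inner packet."""
    ip_header = IPV6_HEADER if outer_ipv6 else IPV4_HEADER
    return ip_header + UDP_HEADER + WIREGUARD_FRAMING


def max_inner_mtu(path_mtu, outer_ipv6):
    """Largest inner packet whose encapsulated form still fits in `path_mtu`."""
    return path_mtu - wireguard_overhead(outer_ipv6)


def outer_size(inner_len, outer_ipv6):
    """Size on the wire after encapsulating an inner packet of `inner_len` bytes."""
    return inner_len + wireguard_overhead(outer_ipv6)


if __name__ == "__main__":
    print(max_inner_mtu(1500, outer_ipv6=True))   # 1420
    print(max_inner_mtu(1500, outer_ipv6=False))  # 1440
    # A full 1500-byte inner packet grows to 1560 bytes over IPv4 and no longer
    # fits a 1500-byte hop: it must be fragmented, or it is dropped.
    print(outer_size(1500, outer_ipv6=False))     # 1560
```

The 80-byte IPv6 case is why 1420 is a common WireGuard MTU. It leaves room for encapsulation on a standard 1500-byte path, but not on a path whose MTU is smaller still.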

2. The Diagnostic Process

A systematic approach using standard network tools can definitively identify a PMTUD black hole.

Step 1: Confirm the Scope

The issue was reproduced by running `curl` from a client whose traffic was routed through the VPN tunnel. Simple sites that return little data worked, while sites that send larger responses hung and timed out. This is a classic indicator of an MTU-related problem.
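The curl comparison can also be scripted. The following Python sketch uses only the standard library. It fetches a URL with a timeout and reports whether the request completed, timed out, or failed in some other way. `probe` is a hypothetical helper name. The sketch deliberately ignores proxy environment variables so the request takes the same route as the traffic under test. It is meant to run on a client behind the tunnel.

```python
import socket
import urllib.error
import urllib.request

# Opener with an empty ProxyHandler, so http(s)_proxy variables are ignored.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=10.0):
    """Fetch `url` once and classify the result as 'ok', 'timeout' or 'error'."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            resp.read()
        return "ok"
    except urllib.error.HTTPError as exc:
        # A complete HTTP error response still shows the exchange got through.
        exc.close()
        return "ok"
    except urllib.error.URLError as exc:
        # Timeouts while connecting or sending the request are wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "error"
    except (socket.timeout, TimeoutError):
        # Timeouts while waiting for or reading the response arrive unwrapped.
        return "timeout"
    except OSError:
        return "error"
```

For example, `probe("https://ifconfig.me")` would be expected to return `'ok'` in the scenario described here, while an affected site returns `'timeout'`. A result of `'error'` (DNS failure, connection refused) points to a different problem than the stall described in this guide.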

Step 2: Packet Capture with `tcpdump`

The definitive proof came from capturing the raw packet flow with `tcpdump` on the router. The capture showed a consistent pattern for failing connections:

  1. Successful Handshake: The initial TCP three-way handshake (`SYN`, `SYN/ACK`, `ACK`) completed successfully.
  2. TLS Negotiation Stall: The connection stalled immediately after the handshake when larger packets (like a TLS certificate) were expected.
  3. Selective Acknowledgment (SACK): The client's kernel sent ACKs carrying SACK blocks. This was the "smoking gun": the SACK blocks showed that the client had received some later segments, while the cumulative ACK showed that earlier segments were still missing, which is consistent with larger packets being dropped.
  4. Timeout: The missing segments never arrived, and the client eventually closed the connection.
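The SACK evidence can also be read mechanically. The following Python sketch extracts the cumulative ACK and the SACK blocks from one line of tcpdump text output and lists the byte ranges the receiver still reports as missing. It assumes tcpdump's usual `ack N` and `sack K {left:right}` notation. Capturing with `-S` (absolute sequence numbers) keeps the values directly comparable. The sample line in the usage note is illustrative and does not come from the original capture. Sequence-number wraparound is ignored.

```python
import re

# Assumed tcpdump text format: "ack N" in the header fields and
# "sack K {left:right}{left:right}..." inside the TCP options list.
ACK_RE = re.compile(r"\back (\d+)")  # \b keeps "sack" from matching
SACK_RE = re.compile(r"\bsack \d+ ((?:\{\d+:\d+\})+)")
BLOCK_RE = re.compile(r"\{(\d+):(\d+)\}")


def sack_holes(line):
    """Return (start, end) byte ranges not yet received, per the ACK and SACK blocks."""
    ack = ACK_RE.search(line)
    sack = SACK_RE.search(line)
    if not ack or not sack:
        return []
    blocks = sorted((int(a), int(b)) for a, b in BLOCK_RE.findall(sack.group(1)))
    holes = []
    next_missing = int(ack.group(1))  # everything below the cumulative ACK arrived
    for left, right in blocks:
        if left > next_missing:
            holes.append((next_missing, left))
        next_missing = max(next_missing, right)
    return holes
```

For an illustrative line such as `IP 10.0.0.2.51234 > 203.0.113.5.443: Flags [.], ack 1449, win 501, options [nop,nop,sack 1 {2897:4345}], length 0`, `sack_holes` returns `[(1449, 2897)]`. That is a gap the size of one full segment right after the acknowledged data, the pattern described in item 3.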

3. The Solution: Enable TCP MSS Clamping in OpenWrt

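Manually setting the MTU on the VPN interface (e.g., to `1420`, WireGuard's default) is a necessary first step, but it does not always solve the problem if the internet path has a non-standard MTU. The more robust solution is to enable TCP MSS Clamping. With clamping, the router rewrites the Maximum Segment Size (MSS) option in the SYN and SYN/ACK packets of TCP connections it forwards. Both endpoints then agree on segments small enough to cross the tunnel without fragmentation, even when PMTUD is broken. Clamping only affects TCP; UDP-based traffic still depends on a correct interface MTU.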
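The arithmetic behind clamping is short. The MSS counts only TCP payload, so it equals the MTU minus the IP and TCP headers. A clamping rule lowers the MSS advertised in a SYN to that value and never raises it. The Python sketch below assumes 20-byte IPv4, 40-byte IPv6 and 20-byte TCP headers without options. It also assumes the clamp value is derived from the MTU of the outgoing interface. The function names are hypothetical.

```python
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20  # base TCP header, no options


def mss_for_mtu(mtu, ipv6=False):
    """Largest TCP payload that fits in a single packet of `mtu` bytes."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


def clamp_mss(advertised_mss, mtu, ipv6=False):
    """Model of a clamping rule: lower the SYN's MSS to fit `mtu`, never raise it."""
    return min(advertised_mss, mss_for_mtu(mtu, ipv6))


if __name__ == "__main__":
    print(mss_for_mtu(1500))                 # 1460: typical Ethernet MSS
    print(clamp_mss(1460, 1420))             # 1380: IPv4 over a 1420-byte tunnel
    print(clamp_mss(1460, 1420, ipv6=True))  # 1360
    print(clamp_mss(1200, 1420))             # 1200: already small enough, unchanged
```

With the tunnel at 1420, an endpoint on a 1500-byte LAN that advertises 1460 is clamped to 1380 over IPv4 (1360 over IPv6), so its full-size segments fit inside the tunnel.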

On OpenWrt, this is accomplished easily through the LuCI web interface or by editing `/etc/config/firewall`. The key is to add the `mtu_fix` option to the firewall zone handling VPN traffic.

Corrected Firewall Zone Configuration

# In /etc/config/firewall

config zone
        # Your VPN zone name
        option name 'vpn_zone'
        # ... other options ...
        # Your VPN interface, e.g. 'wg_jgy_internal'
        list network 'your_vpn_interface'
        # This line is the fix: enable MSS clamping for this zone
        option mtu_fix '1'

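This setting is a best practice and should be enabled on every firewall zone that carries VPN traffic. It keeps TCP connections working across paths with a reduced MTU. It does not help UDP-based protocols, which still rely on a correctly sized interface MTU.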
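The following Python sketch audits a copy of `/etc/config/firewall` for zones that contain a VPN interface but do not enable `mtu_fix`. It is a simplified reading of UCI syntax that assumes one statement per line. It relies on `shlex` to handle quoting and `#` comments. Treating `yes`/`on`/`true`/`enabled` like `1` is an assumption about how the firewall interprets booleans. OpenWrt images do not ship Python by default, so the sketch can also be run on another machine against a copied file. `zones_missing_mtu_fix` and `parse_zones` are hypothetical helper names.

```python
import os
import shlex

# Assumption: these boolean spellings are all treated as "enabled".
TRUTHY = {"1", "yes", "on", "true", "enabled"}


def parse_zones(text):
    """Collect `config zone` sections from UCI text (simplified: one statement per line)."""
    zones = []
    current = None
    for line in text.splitlines():
        tokens = shlex.split(line, comments=True)  # handles quotes and '#' comments
        if not tokens:
            continue
        if tokens[0] == "config":
            current = {"networks": []} if tokens[1:2] == ["zone"] else None
            if current is not None:
                zones.append(current)
        elif current is not None and tokens[0] in ("option", "list") and len(tokens) >= 3:
            key, value = tokens[1], tokens[2]
            if key == "network":
                # Accept both `list network 'x'` and the older `option network 'x y'`.
                current["networks"].extend(value.split())
            else:
                current[key] = value
    return zones


def zones_missing_mtu_fix(text, vpn_interfaces):
    """Names of zones that contain a VPN interface but do not enable mtu_fix."""
    wanted = set(vpn_interfaces)
    return [
        zone.get("name", "<unnamed>")
        for zone in parse_zones(text)
        if wanted & set(zone["networks"])
        and zone.get("mtu_fix", "0").lower() not in TRUTHY
    ]


if __name__ == "__main__":
    path = "/etc/config/firewall"
    if os.path.exists(path):
        with open(path) as f:
            # 'your_vpn_interface' is a placeholder for the real interface name(s).
            print(zones_missing_mtu_fix(f.read(), ["your_vpn_interface"]))
```

Replace `your_vpn_interface` with the actual interface names (for example `wg_jgy_internal`). An empty list means every zone holding one of those interfaces already has clamping enabled.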