Packet drops on AlwaysOnVPN server behind NSX Edge

Upwork

Remote

23 hours ago

About

Problem Summary: When AlwaysOnVPN clients connect from home, applications that rely on direct SQL queries consistently fail with "semaphore timeout expired" errors. The VPN tunnel itself establishes correctly and lighter workflows (drive mapping, file copy) succeed, but SQL and Kerberos traffic experience systematic packet drops.

Impact: Remote users are unable to use business-critical SQL-based applications, even though the VPN works for basic services.

Cutover to new MSP: The issue appeared after our servers were moved (lift and shift) from our on-prem infrastructure, which sat behind a managed Fortinet firewall, to our current MSP, where they sit behind a dual NSX Edge. In other words, this exact AlwaysOnVPN setup worked before the change of MSP, and no configuration changes were made to the AlwaysOnVPN setup itself.

Why Dual NSX Edges: The original intent was to run a single NSX Edge appliance in front of the AlwaysOnVPN server, but with that design the firewall blocked traffic flowing to and from the AlwaysOnVPN server, likely because of asymmetric routing. As a workaround, the MSP introduced a second NSX Edge, creating a dual-edge topology: the inner Edge (closest to the AlwaysOnVPN server) has its firewall disabled so that it does not block traffic, and the outer Edge handles NAT and firewall functions, but with reduced inspection.

Initial Findings: TCP out-of-order packets, retransmissions, and resets are visible in packet traces, but only across the VPN path. I also saw frequent ICMP "fragmentation needed" messages (Type 3, Code 4), which I eliminated by temporarily lowering the MTU to 1100 on the VPN client, VPN server, and SQL server. Even with the fragmentation gone, I was still seeing TCP retransmissions and resets.

Hypotheses:
* MSS mismatch: ruled out as root cause (temporarily lowered the MTU on all servers in the path to test).
* Offload settings: ruled out (temporarily disabled NIC offloading on the SQL server to test).
* Potential culprit: the NSX Edge mishandling this traffic in some way. The Fortinet previously handled it gracefully.

RFC1918 note: The subnet 200.30.120.0/24 is used as an internal private range, even though it falls outside the RFC1918 address blocks.

Help Needed:
* I need assistance troubleshooting the cause of the packet drops.
* I can spin up a full failover copy of Production, which allows us to make significant changes and run tests without impacting live Production.
* I can and will provide network packet traces as evidence.
* I only have limited read-only SSH (enable) access to the NSX Edges because of how they are deployed, but given our good standing with the MSP I may be able to obtain additional access if required.
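
For reference, two figures quoted above can be sanity-checked with a few lines of Python. This is only a minimal sketch and not part of the environment described here: the function names are made up for illustration, and the MSS arithmetic assumes plain IPv4 and TCP headers with no options (20 bytes each), ignoring any VPN encapsulation overhead on the tunnel itself.

```python
import ipaddress

# Assumption: minimum header sizes, no IP or TCP options.
IPV4_HEADER_BYTES = 20
TCP_HEADER_BYTES = 20

# The three private address blocks defined by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def tcp_mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


def is_rfc1918(cidr: str) -> bool:
    """True if the whole subnet lies inside one of the RFC 1918 blocks."""
    network = ipaddress.ip_network(cidr)
    return any(network.subnet_of(block) for block in RFC1918_BLOCKS)


if __name__ == "__main__":
    # MSS an endpoint would advertise with the temporary MTU of 1100.
    print(tcp_mss_for_mtu(1100))          # 1060
    # The internal subnet mentioned in the RFC1918 note.
    print(is_rfc1918("200.30.120.0/24"))  # False
```

With the test MTU of 1100, endpoints would advertise an MSS of 1060 under these assumptions, and the check confirms that 200.30.120.0/24 is not inside any RFC1918 block, so devices in the path will not recognise it as private address space unless configured to.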