UDP and ICMP connections do not recover from an IPsec VPN tunnel flap

[Topology diagram: PC behind FortiGate-Branch connected to a Server behind FortiGate-HQ through a site-to-site IPsec VPN tunnel]

Introduction

A common issue when using a FortiGate as an IPsec VPN gateway is that connections using connectionless protocols such as UDP or ICMP remain down after the VPN tunnel they traverse comes back up. This post explains why this issue occurs and how to solve it.

We will focus on the FortiGate-Branch device. You can review the configuration used on FortiGate-HQ by downloading the configuration file in the Lab Files section.

Firmware and Configuration Details

  • FortiGate-Branch
    • FortiOS 6.2.3
    • VDOMs: disabled
    • Site-to-Site IPsec VPN tunnel to FortiGate-HQ

FortiGate-Branch Configuration

### Interfaces
config system interface
    edit "port1"
        set vdom "root"
        set ip 10.1.1.20 255.255.255.0
        set allowaccess ping https ssh http fgfm
        set type physical
        set snmp-index 1
    next
    edit "port5"
        set vdom "root"
        set ip 192.168.2.254 255.255.255.0
        set allowaccess ping
        set type physical
        set snmp-index 5
    next
end

### IPsec VPN
config vpn ipsec phase1-interface
    edit "ToHQ"
        set interface "port1"
        set peertype any
        set net-device disable
        set proposal aes128-sha256 aes256-sha256 aes128-sha1 aes256-sha1
        set remote-gw 10.1.1.10
        set psksecret checkthefirewall
    next
end
config vpn ipsec phase2-interface
    edit "ToHQ-P2"
        set phase1name "ToHQ"
        set proposal aes128-sha1 aes256-sha1 aes128-sha256 aes256-sha256 aes128gcm aes256gcm chacha20poly1305
        set auto-negotiate enable
    next
end

### Firewall policies
config firewall policy
    edit 1
        set name "FromLANtoInternet"
        set uuid 32d6dc00-ae84-51ea-f586-71c4da441eac
        set srcintf "port5"
        set dstintf "port1"
        set srcaddr "all"
        set dstaddr "all"
        set action accept
        set schedule "always"
        set service "ALL"
        set nat enable
    next
    edit 2
        set name "FromLANtoHQ"
        set uuid 6393559e-ae84-51ea-e2db-6e428f4f0f79
        set srcintf "port5"
        set dstintf "ToHQ"
        set srcaddr "all"
        set dstaddr "all"
        set action accept
        set schedule "always"
        set service "ALL"
    next
end

### Routing
config router static
    edit 1
        set gateway 10.1.1.1
        set device "port1"
    next
    edit 2
        set dst 172.16.1.0 255.255.255.0
        set device "ToHQ"
    next
end

### Routing table
FGT-Branch # get router info routing-table all

Routing table for VRF=0
Codes: K - kernel, C - connected, S - static, R - RIP, B - BGP
       O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, ia - IS-IS inter area
       * - candidate default

S*      0.0.0.0/0 [10/0] via 10.1.1.1, port1
C       10.1.1.0/24 is directly connected, port1
S       172.16.1.0/24 [10/0] is directly connected, ToHQ
C       172.16.2.0/24 is directly connected, port7
C       192.168.2.0/24 is directly connected, port5

These are the key parts of the configuration needed to reproduce the issue:

  • There is a VPN tunnel named ToHQ that is used to route traffic to the 172.16.1.0/24 subnet.
  • port1 is used for Internet access, which is why a default route has been configured on that port.
  • There is an open firewall policy (id 1) for Internet access (all destination addresses are being allowed).

Issue Description

Initially, the tunnel is up and the continuous ping from PC to Server is successful. After the tunnel comes down, the ping begins to fail, which is expected. However, after the tunnel comes back up, the ping keeps failing for some reason.

Let's reproduce the issue and collect useful troubleshooting information at the same time. 

Tunnel is up, and ping works

  • From PC, let's run a continuous ping to Server while collecting some information:
### Ping works
C:\Users\User>ping 172.16.1.10 -t

Pinging 172.16.1.10 with 32 bytes of data:
Reply from 172.16.1.10: bytes=32 time=2ms TTL=62
Reply from 172.16.1.10: bytes=32 time=1ms TTL=62

### Route lookup for 172.16.1.10. Best route is through VPN tunnel
FGT-Branch # get router info routing-table details 172.16.1.10

Routing table for VRF=0
Routing entry for 172.16.1.0/24
  Known via "static", distance 10, metric 0, best
  * directly connected, ToHQ

### FortiGate-Branch VPN summary. Tunnel is up
FGT-Branch # get vpn ipsec tunnel summary
'ToHQ' 10.1.1.10:0  selectors(total,up): 1/1  rx(pkt,err): 10/0  tx(pkt,err): 10/0

### FortiGate-Branch sniffer. Packets are routed back and forth through VPN tunnel
FGT-Branch # diagnose sniffer packet any "host 172.16.1.10 and icmp" 4 0 l
interfaces=[any]
filters=[host 172.16.1.10 and icmp]
2020-06-14 16:18:23.886484 port5 in 192.168.2.10 -> 172.16.1.10: icmp: echo request
2020-06-14 16:18:23.886501 ToHQ out 192.168.2.10 -> 172.16.1.10: icmp: echo request
2020-06-14 16:18:23.887161 ToHQ in 172.16.1.10 -> 192.168.2.10: icmp: echo reply
2020-06-14 16:18:23.887174 port5 out 172.16.1.10 -> 192.168.2.10: icmp: echo reply

### ICMP session details. Session is established through ToHQ tunnel (index=18) and matches policy id 2
FGT-Branch # diagnose sys session list

session info: proto=1 proto_state=00 duration=12 expire=59 timeout=0 flags=00000000 sockflag=00000000 sockport=0 av_idx=0 use=4
origin-shaper=
reply-shaper=
per_ip_shaper=
class_id=0 ha_id=0 policy_dir=0 tunnel=ToHQ/ vlan_cos=0/255
state=may_dirty
statistic(bytes/packets/allow_err): org=780/13/1 reply=780/13/1 tuples=2
tx speed(Bps/kbps): 62/0 rx speed(Bps/kbps): 62/0
orgin->sink: org pre->post, reply pre->post dev=7->18/18->7 gwy=172.16.1.10/192.168.2.10
hook=pre dir=org act=noop 192.168.2.10:1->172.16.1.10:8(0.0.0.0:0)
hook=post dir=reply act=noop 172.16.1.10:1->192.168.2.10:0(0.0.0.0:0)
misc=0 policy_id=2 auth_info=0 chk_client_info=0 vd=0
serial=00000eba tos=ff/ff app_list=0 app=0 url_cat=0
rpdb_link_id = 00000000 ngfwid=n/a
dd_type=0 dd_mode=0
npu_state=0x3040000
total session 1

FGT-Branch # diagnose netlink interface list | grep index=18
if=ToHQ family=00 type=768 index=18 mtu=1438 link=0 master=0

Tunnel is brought down, and ping stops working (expected)

  • Let's collect the same information again:
### Ping stops working (expected)
Request timed out.
Request timed out.
Request timed out.

### Route lookup for 172.16.1.10. Route through VPN tunnel becomes inactive
### The next best route is the default one
FGT-Branch # get router info routing-table details 172.16.1.10

Routing table for VRF=0
Routing entry for 172.16.1.0/24
  Known via "static", distance 10, metric 0
    directly connected, ToHQ inactive

### FortiGate-Branch VPN summary. Tunnel is no longer up
FGT-Branch # get vpn ipsec tunnel summary
'ToHQ' 10.1.1.10:0  selectors(total,up): 1/0  rx(pkt,err): 0/0  tx(pkt,err): 0/0

### FortiGate-Branch sniffer. Packets are now routed through port1 (Internet)
### PC is now SNATed to port1 address (10.1.1.20)
FGT-Branch # diagnose sniffer packet any "host 172.16.1.10 and icmp" 4 0 l
interfaces=[any]
filters=[host 172.16.1.10 and icmp]
2020-06-14 16:19:08.401223 port5 in 192.168.2.10 -> 172.16.1.10: icmp: echo request
2020-06-14 16:19:08.401242 port1 out 10.1.1.20 -> 172.16.1.10: icmp: echo request

### ICMP session details. Session is now established through port1 (index=3) and matches policy id 1
### Connection is being SNATed
FGT-Branch # diagnose sys session list

session info: proto=1 proto_state=00 duration=9 expire=59 timeout=0 flags=00000000 sockflag=00000000 sockport=0 av_idx=0 use=4
origin-shaper=
reply-shaper=
per_ip_shaper=
class_id=0 ha_id=0 policy_dir=0 tunnel=/ vlan_cos=0/255
state=may_dirty
statistic(bytes/packets/allow_err): org=600/10/0 reply=960/10/1 tuples=2
tx speed(Bps/kbps): 62/0 rx speed(Bps/kbps): 100/0
orgin->sink: org pre->post, reply pre->post dev=7->3/3->7 gwy=10.1.1.1/192.168.2.10
hook=post dir=org act=snat 192.168.2.10:1->172.16.1.10:8(10.1.1.20:60417)
hook=pre dir=reply act=dnat 172.16.1.10:60417->10.1.1.20:0(192.168.2.10:1)
misc=0 policy_id=1 auth_info=0 chk_client_info=0 vd=0
serial=00000ecf tos=ff/ff app_list=0 app=0 url_cat=0
rpdb_link_id = 00000000 ngfwid=n/a
dd_type=0 dd_mode=0
npu_state=0x040000
total session 1

FGT-Branch # diagnose netlink interface list | grep index=3
if=port1 family=00 type=1 index=3 mtu=1500 link=0 master=0

Tunnel is brought back up, but ping continues to fail. Why?

  • Before answering, let's collect the same information one more time and look for any changes when compared to the previous output:
### Ping continues to fail
Request timed out.
Request timed out.
Request timed out.

### Route lookup for 172.16.1.10. VPN tunnel route is active again and the best match
FGT-Branch # get router info routing-table details 172.16.1.10

Routing table for VRF=0
Routing entry for 172.16.1.0/24
  Known via "static", distance 10, metric 0, best
  * directly connected, ToHQ

### FortiGate-Branch VPN summary. Tunnel is back up
FGT-Branch # get vpn ipsec tunnel summary
'ToHQ' 10.1.1.10:0  selectors(total,up): 1/1  rx(pkt,err): 0/0  tx(pkt,err): 0/1

### FortiGate-Branch sniffer. Packets continue to be routed through port1 (Internet) and SNATed
FGT-Branch # diagnose sniffer packet any "host 172.16.1.10 and icmp" 4 0 l
interfaces=[any]
filters=[host 172.16.1.10 and icmp]
2020-06-14 16:19:53.105006 port5 in 192.168.2.10 -> 172.16.1.10: icmp: echo request
2020-06-14 16:19:53.105023 port1 out 10.1.1.20 -> 172.16.1.10: icmp: echo request

### ICMP session details. Session is still established through port1 (index=3), matching policy id 1 and SNATed
FGT-Branch # diagnose sys session list

session info: proto=1 proto_state=00 duration=59 expire=59 timeout=0 flags=00000000 sockflag=00000000 sockport=0 av_idx=0 use=4
origin-shaper=
reply-shaper=
per_ip_shaper=
class_id=0 ha_id=0 policy_dir=0 tunnel=/ vlan_cos=0/255
state=may_dirty
statistic(bytes/packets/allow_err): org=3540/59/0 reply=5664/59/1 tuples=2
tx speed(Bps/kbps): 58/0 rx speed(Bps/kbps): 93/0
orgin->sink: org pre->post, reply pre->post dev=7->3/3->7 gwy=10.1.1.1/192.168.2.10
hook=post dir=org act=snat 192.168.2.10:1->172.16.1.10:8(10.1.1.20:60417)
hook=pre dir=reply act=dnat 172.16.1.10:60417->10.1.1.20:0(192.168.2.10:1)
misc=0 policy_id=1 auth_info=0 chk_client_info=0 vd=0
serial=00000ecf tos=ff/ff app_list=0 app=0 url_cat=0
rpdb_link_id = 00000000 ngfwid=n/a
dd_type=0 dd_mode=0
npu_state=0x040000
total session 1

FGT-Branch # diagnose netlink interface list | grep index=3
if=port1 family=00 type=1 index=3 mtu=1500 link=0 master=0

As we can see, even though the tunnel is back up and the VPN route is once again the best route for the destination, FortiGate continues to route the ICMP packets through the Internet connection instead of through the tunnel.


Root Cause

  • By default, FortiGate updates the routing information of non-SNATed sessions whenever there is a routing change affecting that session. When the VPN tunnel went down, the new best route for 172.16.1.10 became the default one (port1). For this reason, FortiGate replaced ToHQ (index 18) with port1 (index 3) as the outgoing interface for the session.
  • After the routing information for a session changes, FortiGate also performs a policy lookup. In our case, the interface pair changed from port5-to-ToHQ to port5-to-port1. As a result, policy 1 became the new matching policy. Policy 1 is an open policy and has NAT (SNAT) enabled.
  • After the tunnel came back up, the route lookup showed the VPN route as the best route. However, neither the outgoing interface nor the matching policy of the ICMP session was updated. As mentioned above, FortiGate only updates the routing information of non-SNATed sessions by default. When the tunnel went down, the original connection, which was not SNATed at that point, began to be SNATed because the new matching policy (policy 1) has NAT enabled. From that moment on, further routing updates have no effect on that session, which is why packets are not routed through the tunnel again after it comes back up.
  • Finally, there are two key contributing factors to this issue:
    • Protocol is ICMP (connectionless). Unlike TCP, an ICMP or UDP connection has no connection state and no handshake to fail, so the application (or the user) never detects a broken connection and does not need to restart it to continue sending traffic.
    • Ping is continuous. FortiGate assigns an expiration timer to a session. Every time a packet matches an existing session, the expiration timer is reset. Because the ping is continuous, the session is never cleared, and thus a route and policy lookup is never performed for the SNATed connection. 
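The root cause above can be condensed into a small simulation. The following Python sketch is purely illustrative (the Session class and function names are my own, not FortiOS internals), but it captures the default re-evaluation logic:

```python
# Illustrative model of FortiOS default session re-evaluation.
# The Session class and on_route_change() are assumptions for clarity,
# not actual FortiOS internals.

class Session:
    def __init__(self, outdev, policy_id, snat=False):
        self.outdev = outdev        # current outgoing interface
        self.policy_id = policy_id  # current matching firewall policy
        self.snat = snat            # is the session source-NATed?

def on_route_change(session, new_outdev, new_policy_id, policy_has_nat):
    """Default behavior: only non-SNATed sessions are re-evaluated."""
    if session.snat:
        return  # a SNATed session keeps its old route and policy
    session.outdev = new_outdev
    session.policy_id = new_policy_id
    if policy_has_nat:
        session.snat = True  # the session is now pinned to this path

# Tunnel up: ICMP session established through ToHQ via policy 2 (no NAT)
s = Session(outdev="ToHQ", policy_id=2)

# Tunnel down: best route becomes port1, policy 1 matches (NAT enabled)
on_route_change(s, "port1", 1, policy_has_nat=True)

# Tunnel back up: the route change is ignored because the session is SNATed
on_route_change(s, "ToHQ", 2, policy_has_nat=False)
print(s.outdev)  # still port1: traffic stays on the Internet path
```

Because the continuous ping keeps refreshing the session's expiration timer, the "pinned" session never ages out on its own.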

Solution

I can think of three different solutions for this issue, and I will sort them based on my personal preference.

1. Configure blackhole routes to drop traffic to the remote network when the tunnel goes down.

To prevent FortiGate from using the default route for traffic destined to the remote server after the VPN tunnel goes down, we can configure a blackhole route that FortiGate matches after the VPN route becomes invalid. When packets match a blackhole route (aka null route), FortiGate drops them and does not update the firewall policy information on the session. In addition, FortiGate will send Destination Unreachable messages back to the sender (PC in this case).

A blackhole route like the one below will do the trick:

config router static
    edit 3
        set dst 172.16.1.0 255.255.255.0
        set distance 254
        set blackhole enable
    next
end

However, instead of creating a blackhole route for the remote network only, which is a private network, why not take the opportunity to create blackhole routes for the entire private address space? After all, it makes no sense to send traffic destined to private networks over the Internet: not only do we waste bandwidth, but the packets will not reach their destination because our ISP will most likely drop them right away. That said, we can extend the configuration to the following:

config router static
    edit 3
        set dst 10.0.0.0 255.0.0.0
        set distance 254
        set comment "RFC1918 - Class A"
        set blackhole enable
    next
    edit 4
        set dst 172.16.0.0 255.240.0.0
        set distance 254
        set comment "RFC1918 - Class B"
        set blackhole enable
    next
    edit 5
        set dst 192.168.0.0 255.255.0.0
        set distance 254
        set comment "RFC1918 - Class C"
        set blackhole enable
    next
end

With the configuration above, traffic destined to a private address that does not match a more specific route in the routing table will be dropped by FortiGate. Note that the administrative distance is set to 254 in case you have another route covering part of a private range. Because your route most likely has a much lower distance, the blackhole route will kick in only if your "primary" route becomes inactive.
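To see why the distances work out, here is an illustrative longest-prefix-match sketch in Python (route selection is simplified to prefix length, then administrative distance; this is not FortiOS internals):

```python
# Simplified route selection: longest prefix wins; lower administrative
# distance breaks ties. Illustrative only, not FortiOS internals.
import ipaddress

def make_route(net, device, distance=10, active=True):
    return {"net": ipaddress.ip_network(net), "device": device,
            "distance": distance, "active": active}

def best_route(routes, dst):
    dst = ipaddress.ip_address(dst)
    candidates = [r for r in routes if r["active"] and dst in r["net"]]
    return max(candidates, key=lambda r: (r["net"].prefixlen, -r["distance"]))

routes = [
    make_route("0.0.0.0/0", "port1"),                        # default route
    make_route("172.16.1.0/24", "ToHQ", active=False),       # VPN route, tunnel down
    make_route("172.16.0.0/12", "blackhole", distance=254),  # RFC1918 fallback
]

# Tunnel down: traffic to the server hits the blackhole, not the Internet
print(best_route(routes, "172.16.1.10")["device"])  # blackhole

# Tunnel back up: the more specific /24 VPN route wins again
routes[1]["active"] = True
print(best_route(routes, "172.16.1.10")["device"])  # ToHQ
```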

2. Configure policies to drop traffic destined to the remote network routed through the Internet.

I prefer to use blackhole routes because I believe it is a more scalable solution. However, another option is to configure firewall policies to deny traffic destined to the remote network through the Internet. As we did with the blackhole route, we can configure a policy to match all private network subnets, not just the remote subnet.

### Firewall address objects
config firewall address
    edit "10/8"
        set subnet 10.0.0.0 255.0.0.0
    next
    edit "172.16/12"
        set subnet 172.16.0.0 255.240.0.0
    next
    edit "192.168/16"
        set subnet 192.168.0.0 255.255.0.0
    next
end

### Firewall address group
config firewall addrgrp
    edit "RFC1918"
        set member "10/8" "172.16/12" "192.168/16"
    next
end

### Firewall policy
config firewall policy
    edit 3
        set name "BlockRFC1918toInternet"
        set srcintf "port5"
        set dstintf "port1"
        set srcaddr "all"
        set dstaddr "RFC1918"
        set schedule "always"
        set service "ALL"
    next
end

### Move policy 3 on top of policy 1
config firewall policy
    move 3 before 1
end

Note that we must move the new policy (3) on top of policy 1 so it takes precedence. In addition, policy 3 has been configured with action deny (default, hidden), which means that traffic will be dropped and SNAT will not be performed.
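The ordering requirement can be illustrated with a minimal first-match policy lookup in Python (heavily simplified: the real policy engine also matches interfaces, source address, service, and schedule):

```python
# Minimal first-match policy lookup. Illustrative only; FortiOS also
# matches on interfaces, source address, service, and schedule.
import ipaddress

RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
ANY = [ipaddress.ip_network("0.0.0.0/0")]

policies = [  # ordered top to bottom: policy 3 must sit above policy 1
    {"id": 3, "dst": RFC1918, "action": "deny"},
    {"id": 1, "dst": ANY, "action": "accept", "nat": True},
]

def match(policies, dst):
    dst = ipaddress.ip_address(dst)
    for policy in policies:
        if any(dst in net for net in policy["dst"]):
            return policy  # first match wins

# Traffic to the HQ subnet leaking toward port1 is dropped, never SNATed
print(match(policies, "172.16.1.10")["action"])  # deny

# Genuine Internet destinations still match the open NAT policy
print(match(policies, "8.8.8.8")["id"])  # 1
```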


3. Enable route updates for SNATed sessions

FortiGate does not update the routing information of SNATed sessions by default, but you can change this behavior by enabling the snat-route-change option under config system global:

config system global
    set snat-route-change enable
end

Note that the above change is global, and therefore, it will affect all SNATed sessions in all VDOMs. I will discuss the possible impact in another post.

FAQ

Below are some of the most common questions I get about this issue. If yours is not listed here, feel free to post a comment with your question.

Q: What kind of applications can be affected by this behavior?

Any application that uses connectionless protocols such as UDP, ICMP, or GRE. In addition, the application must continuously send traffic, which prevents the session from expiring on the FortiGate side. Common cases are SIP endpoints that send frequent REGISTER and OPTIONS messages, or monitoring servers sending periodic ICMP and SNMP probes.

Q: How can I recover from this issue quickly?

You can clear the problematic session on FortiGate by carefully using the diagnose sys session clear command. Before running it, make sure to set up a session filter with the diagnose sys session filter xxx command. If you don't set up the filter first, you will clear ALL the sessions in the system.
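As an example, clearing only the ICMP session from this lab would look like the following (check the filter keys available on your FortiOS version with diagnose sys session filter ?):

```
### Filter first, then clear. Without a filter, ALL sessions are cleared!
FGT-Branch # diagnose sys session filter proto 1
FGT-Branch # diagnose sys session filter dst 172.16.1.10
FGT-Branch # diagnose sys session clear
```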

Q: Does this only happen when VPN tunnels are used?

No. A VPN tunnel is simply the most common trigger in my experience. If you replace the VPN tunnel with any other connection type, such as Ethernet, 4G/LTE, or PPPoE, the same thing will happen as long as the rest of the setup (routing, policies) and the application behavior remain the same.

Lab Files

Feel free to download the configuration files I used for this lab.

  • failback-snated-connections-FGT-branch-623.conf: FortiGate-Branch configuration file. Includes the configuration for solution 1 (06/14/2020)
  • failback-snated-connections-FGT-hq-623.conf: FortiGate-HQ configuration file (06/14/2020)

Bottom Line

I reproduced, analysed, and provided a few solutions for a common issue seen on FortiGate devices with IPsec VPN tunnels when running applications that use connectionless protocols and continuously send traffic over the network.

Out of the three solutions provided, using blackhole routes is the one I prefer due to its scalability.


Paul Marin
A Network Security Engineer based in Canada.
