Recently, we encountered significant CPU and memory utilization spikes on one of our FortiGate firewalls. It had been handling around 1.5 million sessions consistently for several months without any problems, but then the situation took a turn for the worse: the firewall became unresponsive on the command line interface (CLI), and at that time we had not configured a dedicated management CPU core. On top of that, its Link Aggregation Control Protocol (LACP) handling started to deteriorate, leading to interface flapping.
- Unmasking the Network Mystery: When Firewalls Go Haywire
- Root Causes and Solutions
- DNS Session Timeout
- FortiGate – DoS Policy
- Commands used
Unmasking the Network Mystery: When Firewalls Go Haywire
We recently encountered a series of network issues with our FortiGate firewall. The problems included frequent port channel flapping, memory conservation mode activation, and high CPU utilization.
In the world of network security, this FortiGate firewall had been the unsung hero for months. Capably handling a staggering 1.5 million sessions without a hiccup, it had been the reliable guardian of the digital realm. But then, the serenity shattered.
The Flapping Port Channels
It all began with the enigmatic dance of the port channels. Network ports, once stable, started a peculiar tango—up, down, and up again. This erratic behavior sent us diving into the logs:
diagnose sys ha history read
<2023-09-05 08:29:32> port aggr1 link status changed: 0->1
<2023-09-05 08:29:30> port aggr2 link status changed: 1->0
...
<2023-09-05 08:34:40> port aggr2 link status changed: 1->0
<2023-09-05 08:34:40> port aggr1 link status changed: 1->0
<2023-09-05 08:33:07> port aggr2 link status changed: 0->1
A Memory in Conservation Mode
To add to the mystery, our firewall seemed to be experiencing a memory blackout. It reluctantly entered a conservation mode, as if guarding a secret:
diagnose debug crashlog read
...
10: 2023-09-05 08:24:03 logdesc="Memory conserve mode entered" service=kernel conserve=on total="16063
11: 2023-09-05 08:24:03 MB" used="14142 MB" red="14135 MB" green="13171 MB" msg="Kernel enters memory
12: 2023-09-05 08:24:03 conserve mode"
13: 2023-09-05 08:24:12 logdesc="Memory conserve mode exited" service=kernel conserve=exit
14: 2023-09-05 08:24:12 total="16063 MB" used="12983 MB" red="14135 MB" green="13171 MB" msg="Kernel
15: 2023-09-05 08:24:12 exits memory conserve mode"
...
The CPU Conundrum
As we delved deeper, another puzzling revelation awaited—the CPU was behaving oddly. It wasn’t a mere high load; it was a CPU party, with 99% of the action happening in the “softirq” realm:
get system performance status | grep softirq
CPU states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU0 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU1 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU2 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU3 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU4 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU5 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU6 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU7 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU8 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU9 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU10 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU11 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
get system performance status | grep Memory
Memory: 16448784k total, 11415136k used (69.4%), 3599992k free (21.9%), 1433656k freeable (8.7%)
Sessions in Disguise
get system performance status | grep session
Average sessions: 894187 sessions in 1 minute, 996582 sessions in 10 minutes, 1056480 sessions in 30 minutes
Average session setup rate: 4739 sessions per second in last 1 minute, 5091 sessions per second in last 10 minutes, 7053 sessions per second in last 30 minutes
Average NPU sessions: 27341 sessions in last 1 minute, 26152 sessions in last 10 minutes, 45914 sessions in last 30 minutes
Average nTurbo sessions: 0 sessions in last 1 minute, 0 sessions in last 10 minutes, 0 sessions in last 30 minutes
Root Causes and Solutions
The issues had multiple contributing factors:
Some session types are never offloaded to the NPU; the ones of interest for us were ICMP and DNS sessions. This lines up with the numbers above, where only about 27,000 of the roughly 900,000 average sessions were NPU sessions.
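The session table can be filtered down to DNS traffic before listing individual entries. The commands below are a minimal sketch of that approach using standard FortiGate CLI (not necessarily the exact filter we used); the output that follows shows one such DNS session:
diagnose sys session filter proto 17
diagnose sys session filter dport 53
diagnose sys session list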
session info: proto=17 proto_state=01 duration=97 expire=82 timeout=0 flags=00000000 socktype=0 sockport=0 av_idx=0 use=3
origin-shaper=
reply-shaper=
per_ip_shaper=
class_id=0 ha_id=0 policy_dir=0 tunnel=/ helper=dns-udp vlan_cos=0/255
state=may_dirty npu synced netflow-origin netflow-reply
statistic(bytes/packets/allow_err): org=84/1/1 reply=100/1/1 tuples=2
tx speed(Bps/kbps): 0/0 rx speed(Bps/kbps): 0/0
orgin->sink: org pre->post, reply pre->post dev=64->65/65->64 gwy=192.0.2.97/192.0.2.97
hook=pre dir=org act=noop 198.51.100.99:31947->203.0.113.254:53(0.0.0.0:0)
hook=post dir=reply act=noop 203.0.113.254:53->198.51.100.99:31947(0.0.0.0:0)
misc=0 policy_id=702 auth_info=0 chk_client_info=0 vd=1
serial=4b6397dc tos=ff/ff app_list=0 app=0 url_cat=0
sdwan_mbr_seq=0 sdwan_service_id=0
rpdb_link_id=00000000 rpdb_svc_id=0 ngfwid=n/a
npu_state=00000000
npu info: flag=0x00/0x00, offload=0/0, ips_offload=0/0, epid=0/0, ipid=0/0, vlan=0x0000/0x0000
vlifid=0/0, vtag_in=0x0000/0x0000 in_npu=0/0, out_npu=0/0, fwd_en=0/0, qid=0/0
no_ofld_reason:
ofld_fail_reason(kernel, drv): none/not-established, none(0)/none(0)
npu_state_err=00/04
So DNS queries would sit in the session table for 180 seconds, which is quite a long time.
DNS Session Timeout
To address this, we reduced the session timeout for DNS traffic (UDP/53) to 30 seconds:
config vdom
edit <vdom_name>
config system session-ttl
config port
edit <ID>
set protocol 17
set timeout 30
set start-port 53
set end-port 53
next
end
end
next
end
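To confirm the change, the configuration can be reviewed from within the VDOM and a fresh DNS session inspected; its expire value should now start at roughly 30 seconds. These are standard FortiGate CLI commands, shown here as a quick sanity check rather than a full verification procedure:
show system session-ttl
diagnose sys session filter dport 53
diagnose sys session list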
But what was the root cause of this?
To be frank, there is no single root cause. But we observed a few things.
Client DNS Behavior
With the help of DNS query logs, we could identify that some clients were querying basically just two names at a very high frequency: a single client made more than 20,000 queries in 30 minutes to resolve a database server and a Git repository server.
These clients were making an unusually high number of DNS queries because they had no local DNS caching configured. On top of that, the Docker daemon on these systems added to the load, as described in the next section.
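One way to give such clients a local cache is systemd-resolved. The snippet below is only an illustrative sketch: it assumes the clients run systemd-resolved with /etc/resolv.conf pointing at its stub listener (127.0.0.53), and it reuses the example DNS server address from the session output above. It is not the exact change we rolled out.
# /etc/systemd/resolved.conf (illustrative example)
[Resolve]
# Upstream resolvers (example address, replace with the real company DNS servers)
DNS=203.0.113.254
# Caching is on by default in systemd-resolved; stated explicitly here for clarity
Cache=yes
After editing the file, restarting the service with systemctl restart systemd-resolved applies the change; repeated lookups for the same names are then answered from the local cache instead of hitting the upstream servers every time.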
Docker and Job Execution
Some of the systems generating the excessive DNS queries were running Docker/Kubernetes-based job scheduling environments, and the Docker daemon configuration exacerbated the problem.
The Docker daemon was configured not to use the local caching name server on the host, but to query the company's primary DNS servers directly for each job execution.
Over time the environment grew, and so did the number of DNS queries.
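For the Docker hosts, pointing the daemon at a caching resolver reachable from the containers is one option. The snippet below is a sketch under the assumption that a local cache (for example dnsmasq bound to 172.17.0.1, the default docker0 bridge address) is running on the host; container network namespaces cannot reach a resolver on 127.0.0.1, which is why the bridge address is used. This is not the configuration we deployed verbatim.
/etc/docker/daemon.json:
{
  "dns": ["172.17.0.1"]
}
A restart of the Docker daemon (systemctl restart docker) is needed before newly started containers pick up the resolver; existing containers keep their old /etc/resolv.conf.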
Conclusion
While there is no single root cause for these issues, addressing the excessive DNS session timeout and improving DNS caching on the client side have already alleviated some of the resource usage concerns. As a next step, we will configure an appropriate DoS (Denial of Service) policy on the firewalls to mitigate future issues and protect the network infrastructure.
FortiGate – DoS Policy
As a first step, we will introduce a simple policy to mitigate the situation.
config firewall DoS-policy
edit 1
set status enable
set name "DNS DoS Mitigatigation"
set interface "wan1"
set srcaddr "all"
set dstaddr "all"
set service "DNS"
config anomaly
edit "tcp_syn_flood"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 2000
next
edit "tcp_port_scan"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 1000
next
edit "tcp_src_session"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 5000
next
edit "tcp_dst_session"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 5000
next
edit "udp_flood"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 2000
next
edit "udp_scan"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 2000
next
edit "udp_src_session"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 5000
next
edit "udp_dst_session"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 5000
next
edit "icmp_flood"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 250
next
edit "icmp_sweep"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 100
next
edit "icmp_src_session"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 300
next
edit "icmp_dst_session"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 1000
next
edit "ip_src_session"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 5000
next
edit "ip_dst_session"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 5000
next
edit "sctp_flood"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 2000
next
edit "sctp_scan"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 1000
next
edit "sctp_src_session"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 5000
next
edit "sctp_dst_session"
set status enable
set log enable
set action block
set quarantine attacker
set quarantine-expiry 30m
set threshold 5000
next
end
next
end
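Once the policy is in place, it is worth checking that it was saved as intended and keeping an eye on which sources end up quarantined. The commands below are standard FortiOS CLI; depending on the firmware version, the quarantine listing may be diagnose user banned-ip list instead:
show firewall DoS-policy
diagnose user quarantine list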
Commands used
get system performance status
get system interface transceiver
diagnose sys top 2 30
execute log display
get hardware status
diagnose sys ha history read
diagnose debug crashlog read
diagnose sys session stat
diagnose hardware deviceinfo disk

