FortiGate – High CPU and Memory Load

Recently, we encountered significant CPU and memory utilization spikes on one of our FortiGate firewalls. The unit had consistently handled around 1.5 million sessions for several months without any problems, but then the situation took a turn for the worse. The firewall became unresponsive on the command line interface (CLI), and at that time we had not yet configured a dedicated management CPU core. Moreover, the firewall's Link Aggregation Control Protocol (LACP) handling began to deteriorate, leading to interface flapping.

  1. Unmasking the Network Mystery: When Firewalls Go Haywire
    1. The Flapping Port Channels
    2. A Memory in Conservation Mode
    3. The CPU Conundrum
    4. Sessions in Disguise
  2. Root Causes and Solutions
  3. DNS Session Timeout
    1. But what was the root cause of this?
    2. Client DNS Behavior
    3. Docker and Job Execution
  4. Conclusion
  5. FortiGate – DoS Policy
  6. Commands used


Unmasking the Network Mystery: When Firewalls Go Haywire

We recently encountered a series of network issues with our FortiGate firewall: frequent port-channel flapping, memory conserve mode activation, and high CPU utilization.

In the world of network security, this FortiGate had been the unsung hero for months, capably handling a staggering 1.5 million sessions without a hiccup, the reliable guardian of our digital realm. But then the serenity shattered.

The Flapping Port Channels

It all began with the enigmatic dance of the port channels. Network ports, once stable, started a peculiar tango—up, down, and up again. This erratic behavior sent us diving into the logs:

diagnose sys ha history read
<2023-09-05 08:29:32> port aggr1 link status changed: 0->1
<2023-09-05 08:29:30> port aggr2 link status changed: 1->0
...
<2023-09-05 08:34:40> port aggr2 link status changed: 1->0
<2023-09-05 08:34:40> port aggr1 link status changed: 1->0
<2023-09-05 08:33:07> port aggr2 link status changed: 0->1
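Beyond the HA history, the state of the aggregate interfaces themselves can be checked while the flapping is going on. A quick sketch (aggr1/aggr2 are the interface names from our setup; adjust as needed):

diagnose netlink aggregate list
diagnose netlink aggregate name aggr1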

A Memory in Conservation Mode

To add to the mystery, our firewall seemed to be experiencing a memory blackout. It reluctantly entered a conservation mode, as if guarding a secret:

diagnose debug crashlog read
...
10: 2023-09-05 08:24:03 logdesc="Memory conserve mode entered" service=kernel conserve=on total="16063
11: 2023-09-05 08:24:03 MB" used="14142 MB" red="14135 MB" green="13171 MB" msg="Kernel enters memory
12: 2023-09-05 08:24:03 conserve mode"
13: 2023-09-05 08:24:12 logdesc="Memory conserve mode exited" service=kernel conserve=exit
14: 2023-09-05 08:24:12 total="16063 MB" used="12983 MB" red="14135 MB" green="13171 MB" msg="Kernel
15: 2023-09-05 08:24:12 exits memory conserve mode"
...
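The current conserve-mode state and the red/green thresholds referenced in the crash log can also be queried directly, which is useful while the condition is still ongoing:

diagnose hardware sysinfo conserve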

The CPU Conundrum

As we delved deeper, another puzzling revelation awaited: the CPU was behaving oddly. It was not mere high load; 99% of the action was happening in the "softirq" realm, i.e. in the kernel's software packet processing path rather than in user-space processes:

get system performance status | grep softirq
CPU states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU0 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU1 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU2 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU3 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU4 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU5 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU6 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU7 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU8 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU9 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU10 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU11 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
get system performance status | grep Memory
Memory: 16448784k total, 11415136k used (69.4%), 3599992k free (21.9%), 1433656k freeable (8.7%)

Sessions in Disguise

The session statistics explained the softirq load: of roughly a million active sessions, only the comparatively small number of NPU sessions were hardware offloaded; the bulk was being processed in software by the CPUs.

get system performance status | grep session
Average sessions: 894187 sessions in 1 minute, 996582 sessions in 10 minutes, 1056480 sessions in 30 minutes
Average session setup rate: 4739 sessions per second in last 1 minute, 5091 sessions per second in last 10 minutes, 7053 sessions per second in last 30 minutes
Average NPU sessions: 27341 sessions in last 1 minute, 26152 sessions in last 10 minutes, 45914 sessions in last 30 minutes
Average nTurbo sessions: 0 sessions in last 1 minute, 0 sessions in last 10 minutes, 0 sessions in last 30 minutes

Root Causes and Solutions

The issues had multiple contributing factors:

Some session types are never offloaded at all; for us, the ones of interest were ICMP and DNS sessions.
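One way to pull such an entry out of the session table for inspection is the standard session filter (a minimal sketch; adjust the filter to the traffic you are chasing):

diagnose sys session filter clear
diagnose sys session filter proto 17
diagnose sys session filter dport 53
diagnose sys session list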

session info: proto=17 proto_state=01 duration=97 expire=82 timeout=0 flags=00000000 socktype=0 sockport=0 av_idx=0 use=3
origin-shaper=
reply-shaper=
per_ip_shaper=
class_id=0 ha_id=0 policy_dir=0 tunnel=/ helper=dns-udp vlan_cos=0/255
state=may_dirty npu synced netflow-origin netflow-reply
statistic(bytes/packets/allow_err): org=84/1/1 reply=100/1/1 tuples=2
tx speed(Bps/kbps): 0/0 rx speed(Bps/kbps): 0/0
orgin->sink: org pre->post, reply pre->post dev=64->65/65->64 gwy=192.0.2.97/192.0.2.97
hook=pre dir=org act=noop 198.51.100.99:31947->203.0.113.254:53(0.0.0.0:0)
hook=post dir=reply act=noop 203.0.113.254:53->198.51.100.99:31947(0.0.0.0:0)
misc=0 policy_id=702 auth_info=0 chk_client_info=0 vd=1
serial=4b6397dc tos=ff/ff app_list=0 app=0 url_cat=0
sdwan_mbr_seq=0 sdwan_service_id=0
rpdb_link_id=00000000 rpdb_svc_id=0 ngfwid=n/a
npu_state=00000000
npu info: flag=0x00/0x00, offload=0/0, ips_offload=0/0, epid=0/0, ipid=0/0, vlan=0x0000/0x0000
vlifid=0/0, vtag_in=0x0000/0x0000 in_npu=0/0, out_npu=0/0, fwd_en=0/0, qid=0/0
no_ofld_reason:
ofld_fail_reason(kernel, drv): none/not-established, none(0)/none(0)
npu_state_err=00/04

So DNS sessions would stay in the session table for 180 seconds, which is quite a long time.

DNS Session Timeout

DNS sessions were being kept in the session table for 180 seconds, which was excessive. To address this, we reduced the session timeout for DNS traffic (UDP/53) to 30 seconds:

config vdom
 edit <vdom_name>
  config system session-ttl
   config port
    edit <ID>
     set protocol 17
     set timeout 30
     set start-port 53
     set end-port 53
    next
   end
  end
 next
end
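To verify the change, the configured TTL entries can be listed again from within the same VDOM, and new DNS sessions should then show expire values counting down from 30 instead of 180:

config vdom
 edit <vdom_name>
  show system session-ttl
 next
end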

But what was the root cause of this?

To be frank, there is no single root cause. But we observed a few things.

Client DNS Behavior

With the help of DNS query logs we could identify that some clients were querying essentially just two names at a very high frequency: a single client issued more than 20,000 queries in 30 minutes, asking for a database server and a Git repository server.

In short, some clients were generating an unusually high number of DNS queries because no local DNS caching was configured on them, so every lookup went through the firewall to the primary DNS servers.

Docker and Job Execution

Some of the systems behind the excessive DNS queries were running Docker/Kubernetes-based job scheduling environments, and the Docker daemon configuration exacerbated the problem: instead of relying on the host's local caching name server, the daemon was set up to query the company's primary DNS servers directly for every job execution.
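By way of illustration only (this is not our exact setup), Docker's daemon-wide DNS servers are configured in /etc/docker/daemon.json; pointing them at an internal caching resolver instead of the primary DNS servers keeps repeated lookups local. The address 10.0.0.53 is a placeholder for such a resolver:

{
  "dns": ["10.0.0.53"]
}

Note that Docker ignores loopback name servers from the host's resolv.conf for container DNS resolution, so the caching resolver must be reachable on a routable address.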

Over time the environment kept growing, and with it the number of DNS queries.

Conclusion

While there is no single root cause for these issues, addressing the excessive DNS session timeout and improving DNS caching on the client side have already alleviated some of the resource usage concerns. Additionally, implementing an appropriate DoS (Denial of Service) policy on the firewalls will be a necessary step to mitigate future issues and protect the network infrastructure.

The next step will be to configure an appropriate DoS policy on the firewalls.

FortiGate – DoS Policy

As a first step, we will introduce a simple policy to mitigate the situation.

config firewall DoS-policy
  edit 1
    set status enable
    set name "DNS DoS Mitigation"
    set interface "wan1"
    set srcaddr "all"
    set dstaddr "all"
    set service "DNS"
    config anomaly
      edit "tcp_syn_flood"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 2000
      next
      edit "tcp_port_scan"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 1000
      next
      edit "tcp_src_session"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 5000
      next
      edit "tcp_dst_session"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 5000
      next
      edit "udp_flood"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 2000
      next
      edit "udp_scan"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 2000
      next
      edit "udp_src_session"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 5000
      next
      edit "udp_dst_session"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 5000
      next
      edit "icmp_flood"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 250
      next
      edit "icmp_sweep"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 100
      next
      edit "icmp_src_session"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 300
      next
      edit "icmp_dst_session"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 1000
      next
      edit "ip_src_session"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 5000
      next
      edit "ip_dst_session"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 5000
      next
      edit "sctp_flood"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 2000
      next
      edit "sctp_scan"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 1000
      next
      edit "sctp_src_session"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 5000
      next
      edit "sctp_dst_session"
        set status enable
        set log enable
        set action block
        set quarantine attacker
        set quarantine-expiry 30m
        set threshold 5000
      next
    end
  next
end
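Once committed, the policy can be reviewed from the CLI, and anomaly hits will appear in the anomaly logs. The thresholds above are starting points based on our observed baseline and should be tuned against normal traffic levels before relying on the block and quarantine actions in production:

show firewall DoS-policy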

Commands used

get system performance status
get system interface transceiver
diagnose sys top 2 30
execute log display
get hardware status
diagnose sys ha history read
diagnose debug crashlog read
diagnose sys session stat
diagnose hardware deviceinfo disk


Photo by Andrew Gaines on Unsplash