Network Performance Tuning

To achieve the highest bandwidth from high-speed network devices (40Gb and faster), you will very likely need to perform some tuning. The following steps were used to tune the test system. They were performed on a system with 100Gb network devices and are provided only as an example.

Please Note: The following actions were all performed as the "root" user. Most or all of these steps require root access, which can be achieved either by running each command via "sudo" or by first switching to the root account completely with "sudo su -".

Additionally, not all tuning steps are supported on all systems or devices. For example, the NUMA steps only work on systems that support NUMA. Setting "MaxReadReq" comes directly from tuning suggestions for Mellanox 100Gb adapters and thus may not be supported on adapters from other vendors.

Tuning Steps

  1. Find the device's NUMA node
    root@sys-6029u-trt:~# cat /sys/class/net/enp94s0f0/device/numa_node 
    0
    root@sys-6029u-trt:~# cat /sys/class/net/enp94s0f1/device/numa_node 
    0
    
  2. Find which CPU(s) the nodes are associated with:
    root@sys-6029u-trt:~# lscpu |grep NUMA
    NUMA node(s):          2
    NUMA node0 CPU(s):     0-17,36-53
    NUMA node1 CPU(s):     18-35,54-71
    
    * this lets us set affinity to keep the iperf3 processes on the same NUMA node as the NIC
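
    The two lookups above can be combined into a small helper. This is a sketch; the interface names are the ones from the examples, and the `numa_info` function name is invented here:

```shell
# Print each NIC's NUMA node and the CPUs on that node.
numa_info() {
    for dev in "$@"; do
        node_file=/sys/class/net/$dev/device/numa_node
        [ -r "$node_file" ] || continue
        node=$(cat "$node_file")
        echo "$dev: NUMA node $node"
        lscpu | grep "NUMA node${node} CPU" || true
    done
}

numa_info enp94s0f0 enp94s0f1
```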
  3. Check current CPU frequencies:
    root@sys-6029u-trt:~# grep -E '^cpu MHz' /proc/cpuinfo
    cpu MHz         : 1000.000
    
  4. Set the governor on all CPUs to "performance" (requires the cpufrequtils package)
    for x in `seq 0 71`;do
    	cpufreq-set -r -g performance -c $x
    done
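
    If cpufrequtils is not available, the same effect can usually be achieved by writing to sysfs directly. A sketch (the `set_perf_governor` helper name is invented here; the paths exist only on systems with cpufreq support, and the loop skips anything missing or read-only):

```shell
# Write "performance" into each CPU's scaling_governor, skipping
# any path that doesn't exist or isn't writable.
set_perf_governor() {
    for gov in "$@"; do
        [ -w "$gov" ] || continue
        echo performance > "$gov"
    done
    return 0
}

set_perf_governor /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor
```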
    
  5. Check frequencies again
    grep -E '^cpu MHz' /proc/cpuinfo
    cpu MHz         : 2301.000
    
    Note that they should now be at or above the CPU's base frequency.
  6. Make sure the card is in the right slot. A 100Gb card should show a Speed of 8GT/s and a Width of x16; otherwise the PCIe slot it's in can't handle the full throughput.
    lspci -s 04:00.0 -vvv | grep Speed
                 LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
                 LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
    
  7. Check the current MaxReadReq value using the PCI address of each port (this must be set per port)
    lspci -s 04:00.0 -vvv | grep MaxReadReq
                            MaxPayload 256 bytes, MaxReadReq 512 bytes
    setpci -s 04:00.0 68.w
    2936
    
  8. Set Max Read Request to the upper limit
    setpci -s 04:00.0 68.w=5936
    lspci -s 04:00.0 -vvv | grep MaxReadReq
                            MaxPayload 256 bytes, MaxReadReq 4096 bytes
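
    The values read and written at offset 0x68 make more sense once decoded: on these adapters that word holds the PCIe Device Control register, whose bits 14:12 encode Max Read Request as a power-of-two multiple of 128 bytes, which is why only the leading hex digit changes. A sketch decoder (the function name is invented here):

```shell
# Decode MaxReadReq (bits 14:12) from the 16-bit word at PCI config
# offset 0x68, as printed by "setpci -s <addr> 68.w".
mrrs_bytes() {
    word=$1                               # hex string, e.g. 2936
    field=$(( (0x$word >> 12) & 0x7 ))    # 0..5
    echo $(( 128 << field ))
}

mrrs_bytes 2936    # 512  (the value read above)
mrrs_bytes 5936    # 4096 (after setting the upper limit)
```

So writing 5936 keeps the low bits intact and raises only the read-request field.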
    
  9. Set the maximum socket buffer sizes to 512MB
    root@sys-6029u-trt:~# sysctl net.core.rmem_max=536870912
    net.core.rmem_max = 536870912
    root@sys-6029u-trt:~# sysctl net.core.wmem_max=536870912
    net.core.wmem_max = 536870912
    
  10. Increase the Linux autotuning TCP buffer limits to 256MB
    root@sys-6029u-trt:~# sudo sysctl net.ipv4.tcp_rmem="4096 87380 268435456"
    net.ipv4.tcp_rmem = 4096 87380 268435456
    root@sys-6029u-trt:~# sudo sysctl net.ipv4.tcp_wmem="4096 87380 268435456"
    net.ipv4.tcp_wmem = 4096 87380 268435456
    
  11. Set max_backlog to 300K
    root@sys-6029u-trt:~# sysctl net.core.netdev_max_backlog=300000
    net.core.netdev_max_backlog = 300000
    
  12. Don't cache ssthresh from previous connection
    # sysctl net.ipv4.tcp_no_metrics_save=1
    net.ipv4.tcp_no_metrics_save = 1
    
  13. Explicitly set htcp as the congestion control. You could also set this to 'bbr'.
    # sysctl net.ipv4.tcp_congestion_control=htcp
    net.ipv4.tcp_congestion_control = htcp
    
  14. If you are using Jumbo Frames, also set this
    # sysctl net.ipv4.tcp_mtu_probing=1
    net.ipv4.tcp_mtu_probing = 1
    
  15. Set default qdisc to fq
    # sysctl net.core.default_qdisc=fq
    net.core.default_qdisc = fq
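
    Note that sysctl settings made this way are lost on reboot. To persist steps 9-15, the same keys can go into a drop-in file (the filename below is an assumption; anything under /etc/sysctl.d/ works), loaded at boot or with "sysctl -p <file>":

```
# /etc/sysctl.d/90-netperf.conf -- example only; values from the steps above
net.core.rmem_max = 536870912
net.core.wmem_max = 536870912
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 87380 268435456
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_mtu_probing = 1
net.core.default_qdisc = fq
```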
    
  16. NIC tweaks:
    1. Turn on Large Receive Offload:
      # ethtool -K enp216s0f0 lro on
      # ethtool -K enp216s0f1 lro on
      
    2. Set txqueuelen buffer higher:
      # ifconfig enp216s0f0 txqueuelen 20000
      # ifconfig enp216s0f1 txqueuelen 20000
      
    3. Enable jumbo frames:
      # ip link set enp216s0f0 mtu 9000
      # ip link set enp216s0f1 mtu 9000
      
    4. On some Mellanox ConnectX-4/ConnectX-5 cards, you may still only see around 60Gb/s on what should be a 100Gb/s device. This may be due to an issue with Adaptive RX being in use alongside hardware LRO.
      # ethtool -C enp216s0f0 adaptive-rx off
      # ethtool -C enp216s0f1 adaptive-rx off
      # ethtool -C enp216s0f0 rx-usecs 8 rx-frames 128
      # ethtool -C enp216s0f1 rx-usecs 8 rx-frames 128
      
      Note, this only applies to Mellanox cards. See Issue #1241056 in the driver release notes.
  17. Turn off irqbalance:
    # systemctl stop irqbalance
    # systemctl status irqbalance |grep Active
       Active: inactive (dead) since Tue 2018-04-17 20:58:16 UTC; 23s ago
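
    With irqbalance stopped, interrupt affinities stay wherever they were. If you also want to pin the NIC's IRQs to the cores local to it (node 0 in our case), something like the following works. This is a sketch, and the `pin_irqs` helper name is invented here; Mellanox driver packages also ship a set_irq_affinity.sh script that does this more thoroughly:

```shell
# Pin every IRQ whose /proc/interrupts line mentions $1 to the CPU list $2.
pin_irqs() {
    iface=$1 cpus=$2
    for irq in $(awk -v d="$iface" '$0 ~ d { sub(/:/, "", $1); print $1 }' /proc/interrupts); do
        echo "$cpus" > "/proc/irq/$irq/smp_affinity_list"
    done
}

# CPU list for NUMA node 0, from step 2 above:
pin_irqs enp94s0f0 0-17,36-53
```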
    

For testing with iperf3

Note, the example below shows tests run over 60 seconds. Actual certification testing requires a test run of 1 hour per port. Thus for certification testing you would need to use "-t 3600" rather than "-t 60".

On the iperf target server, start 4 iperf3 daemons on different ports, pinned to CPU cores on NUMA node 0 (see step 2 above):

# iperf3 -sD -B 172.16.21.2 -p5101 -A0
# iperf3 -sD -B 172.16.21.2 -p5102 -A14
# iperf3 -sD -B 172.16.21.2 -p5103 -A36
# iperf3 -sD -B 172.16.21.2 -p5104 -A52
Note we're using -A to ensure each process runs on a CPU core on the same NUMA node that our 100Gb NIC is attached to.

On the System Under Test, kick off four iperf3 processes, one per remote port. Note that this is for example only; it is easier and neater to do this with the "parallel" tool as described in the next section.

$ iperf3 -c 172.16.21.1 -O 15 -t 60 -p 5101 -R -i 60 -T s1 & iperf3 -c 172.16.21.1 -O 15 -t 60 -p 5102 -R -i 60 -T s2 & iperf3 -c 172.16.21.1 -O 15 -t 60 -p 5103 -R -i 60 -T s3 & iperf3 -c 172.16.21.1 -O 15 -t 60 -p 5104 -R -i 60 -T s4 &
Abbreviated output:
s4:  [ ID] Interval           Transfer     Bandwidth       Retr
s4:  [  4]   0.00-60.00  sec   161 GBytes  23.1 Gbits/sec  18726             sender
s4:  [  4]   0.00-60.00  sec   161 GBytes  23.1 Gbits/sec                  receiver
s4:  
s4:  iperf Done.
s3:  [ ID] Interval           Transfer     Bandwidth
s3:  [  4]   0.00-60.00  sec   160 GBytes  22.9 Gbits/sec                  
s3:  - - - - - - - - - - - - - - - - - - - - - - - - -
s3:  [ ID] Interval           Transfer     Bandwidth       Retr
s3:  [  4]   0.00-60.00  sec   160 GBytes  22.9 Gbits/sec  16953             sender
s3:  [  4]   0.00-60.00  sec   160 GBytes  22.9 Gbits/sec                  receiver
s3:  
s3:  iperf Done.
s2:  [ ID] Interval           Transfer     Bandwidth
s2:  [  4]   0.00-60.00  sec   163 GBytes  23.3 Gbits/sec                  
s2:  - - - - - - - - - - - - - - - - - - - - - - - - -
s2:  [ ID] Interval           Transfer     Bandwidth       Retr
s2:  [  4]   0.00-60.00  sec   163 GBytes  23.3 Gbits/sec  17582             sender
s2:  [  4]   0.00-60.00  sec   163 GBytes  23.3 Gbits/sec                  receiver
s1:  [ ID] Interval           Transfer     Bandwidth
s2:  
s2:  iperf Done.
s1:  [  4]   0.00-60.00  sec   159 GBytes  22.7 Gbits/sec                  
s1:  - - - - - - - - - - - - - - - - - - - - - - - - -
s1:  [ ID] Interval           Transfer     Bandwidth       Retr
s1:  [  4]   0.00-60.00  sec   159 GBytes  22.7 Gbits/sec  17869             sender
s1:  [  4]   0.00-60.00  sec   159 GBytes  22.7 Gbits/sec                  receiver
The average bandwidth over 60 seconds for all four threads adds up to 92Gb/s.

Running multiple iperf3 instances with parallel

GNU Parallel is a tool that allows you to run multiple commands at the same time. The following outlines how we used parallel for testing with iperf3.
  1. Ensure that parallel is installed.
    # sudo apt-get -y install parallel
    
  2. Create a file that looks like this:
    # cat commands.txt 
    iperf3 -c 172.16.21.1 -O 15 -t 30 -p 5101 -R -i 60 -T s1
    iperf3 -c 172.16.21.1 -O 15 -t 30 -p 5102 -R -i 60 -T s2
    iperf3 -c 172.16.21.1 -O 15 -t 30 -p 5103 -R -i 60 -T s3
    iperf3 -c 172.16.21.1 -O 15 -t 30 -p 5104 -R -i 60 -T s4
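    
    The file above can also be generated rather than typed by hand; a sketch using the same IP and ports:

```shell
# Write one iperf3 client command per server port into commands.txt.
for port in 5101 5102 5103 5104; do
    echo "iperf3 -c 172.16.21.1 -O 15 -t 30 -p $port -R -i 60 -T s$((port - 5100))"
done > commands.txt
```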
    
  3. Execute the commands like this:
    # parallel -a commands.txt |tee -a 100Gb-Port0.log
    
    When using programs that use GNU Parallel to process data for publication please cite:
    
      O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
      ;login: The USENIX Magazine, February 2011:42-47.
    
    This helps funding further development; and it won't cost you a cent.
    Or you can get GNU Parallel without this requirement by paying 10000 EUR.
    
    To silence this citation notice run 'parallel --bibtex' once or use '--no-notice'.
    
    s1:  Connecting to host 172.16.21.1, port 5101
    s1:  Reverse mode, remote host 172.16.21.1 is sending
    s1:  [  4] local 172.16.21.11 port 47762 connected to 172.16.21.1 port 5101
    s1:  [ ID] Interval           Transfer     Bandwidth
    s1:  [  4]   0.00-30.00  sec  74.4 GBytes  21.3 Gbits/sec                  
    s1:  - - - - - - - - - - - - - - - - - - - - - - - - -
    s1:  [ ID] Interval           Transfer     Bandwidth       Retr
    s1:  [  4]   0.00-30.00  sec  74.5 GBytes  21.3 Gbits/sec  39793             sender
    s1:  [  4]   0.00-30.00  sec  74.4 GBytes  21.3 Gbits/sec                  receiver
    s1:  
    s1:  iperf Done.
    s2:  Connecting to host 172.16.21.1, port 5102
    s2:  Reverse mode, remote host 172.16.21.1 is sending
    s2:  [  4] local 172.16.21.11 port 33354 connected to 172.16.21.1 port 5102
    s2:  [ ID] Interval           Transfer     Bandwidth
    s2:  [  4]   0.00-30.00  sec  79.6 GBytes  22.8 Gbits/sec                  
    s2:  - - - - - - - - - - - - - - - - - - - - - - - - -
    s2:  [ ID] Interval           Transfer     Bandwidth       Retr
    s2:  [  4]   0.00-30.00  sec  79.7 GBytes  22.8 Gbits/sec  43638             sender
    s2:  [  4]   0.00-30.00  sec  79.6 GBytes  22.8 Gbits/sec                  receiver
    s2:  
    s2:  iperf Done.
    s3:  Connecting to host 172.16.21.1, port 5103
    s3:  Reverse mode, remote host 172.16.21.1 is sending
    s3:  [  4] local 172.16.21.11 port 57094 connected to 172.16.21.1 port 5103
    s3:  [ ID] Interval           Transfer     Bandwidth
    s3:  [  4]   0.00-30.00  sec  75.3 GBytes  21.6 Gbits/sec                  
    s3:  - - - - - - - - - - - - - - - - - - - - - - - - -
    s3:  [ ID] Interval           Transfer     Bandwidth       Retr
    s3:  [  4]   0.00-30.00  sec  75.4 GBytes  21.6 Gbits/sec  41230             sender
    s3:  [  4]   0.00-30.00  sec  75.3 GBytes  21.6 Gbits/sec                  receiver
    s3:  
    s3:  iperf Done.
    s4:  Connecting to host 172.16.21.1, port 5104
    s4:  Reverse mode, remote host 172.16.21.1 is sending
    s4:  [  4] local 172.16.21.11 port 59674 connected to 172.16.21.1 port 5104
    s4:  [ ID] Interval           Transfer     Bandwidth
    s4:  [  4]   0.00-30.00  sec  75.7 GBytes  21.7 Gbits/sec                  
    s4:  - - - - - - - - - - - - - - - - - - - - - - - - -
    s4:  [ ID] Interval           Transfer     Bandwidth       Retr
    s4:  [  4]   0.00-30.00  sec  75.8 GBytes  21.7 Gbits/sec  41177             sender
    s4:  [  4]   0.00-30.00  sec  75.7 GBytes  21.7 Gbits/sec                  receiver
    s4:  
    s4:  iperf Done.
    
    At the end, you can add up the bandwidth numbers for the "sender" or "receiver" lines and get an aggregate total that should be pretty close to the maximum bandwidth for the device. In the example above, the sender and receiver both show an aggregate of 87.4Gb/s.
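
    The adding-up can be scripted; a sketch (the helper name is invented here) that sums the "receiver" summary lines from a captured log:

```shell
# Sum the bandwidth column (Gbits/sec) from iperf3 "receiver" summary
# lines like the ones in the log above.
sum_receivers() {
    awk '/receiver/ { sum += $(NF-2) } END { printf "%.1f Gbits/sec aggregate\n", sum }' "$@"
}

# e.g.: sum_receivers 100Gb-Port0.log
#   ->  87.4 Gbits/sec aggregate   (for the run shown above)
```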