When your blockchain node has lost network connection — Ethernet Unit Hang on the Intel i218/219 controller

When you run a blockchain node, you need to keep it as stable and reliable as possible. You cannot afford to disconnect from the network, so your network controller must be perfectly stable.

The Intel i218/219 Ethernet controller can be problematic with Linux kernel version 4.15 and above. The e1000e driver can report a "Detected Hardware Unit Hang" that is fully recoverable in most cases, but not all. In my case the server lost its network connection permanently, and the only solution was a hard reboot.

Use the following command to check which NIC you have:

lspci | grep -i eth

You can check whether your NIC has this problem with the following command:
journalctl --since "3 days ago" | grep "Detected Hardware Unit Hang"

Below is a full log entry:

Mar 22 07:49:33 Proxmox-VE kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
TDH <37>
TDT <9f>
next_to_use <9f>
next_to_clean <36>
buffer_info[next_to_clean]:
time_stamp <10da4ddfe>
next_to_watch <37>
jiffies <10da4e268>
next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>
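To see how often the hang occurs, the grep above can be extended to count events per day. The following is a minimal sketch, not part of the original setup: hangs_per_day is a hypothetical helper name, and it assumes journalctl's default output format, where each line starts with the month and the day.

```shell
# hangs_per_day (hypothetical helper): read journal lines on stdin and print
# "<month> <day> <count>" for every day with at least one hang.
# Assumes the default journalctl format, where fields 1 and 2 are month and day.
hangs_per_day() {
  awk '/Detected Hardware Unit Hang/ { count[$1 " " $2]++ }
       END { for (day in count) print day, count[day] }' | sort
}
```

It can be fed from the journal, for example: journalctl --since "3 days ago" | hangs_per_day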

My setup that needed a solution

I have a dedicated EX62-NVMe server from Hetzner running Proxmox 6.3-1 (running kernel: 5.4.101-1-pve), and unfortunately it has an Intel i218/219 NIC.

The solution

To solve this problem, turn off the TCP segmentation offloading feature of your Ethernet controller. If you run a virtualized environment like Proxmox, apply the solution on the Proxmox host and leave the VMs unchanged.

  1. Install ethtool using apt install ethtool
  2. Check whether segmentation offloading is turned on by typing ethtool -k <your_network_interface_name>
    and look at the tcp-segmentation-offload parameters, which are set to on when segmentation offloading is active.
  3. Turn off segmentation offloading using: ethtool -K <your_network_interface_name> tso off gso off
    The setting does not persist across reboots, so the above command must be executed on every system boot.

<your_network_interface_name> is the name of your hardware network interface, such as eno1 or ens18.
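Step 2 can also be scripted instead of read by eye. The sketch below is only an illustration: offload_state is a hypothetical helper name, and it assumes the usual "feature-name: on" or "feature-name: off" layout of ethtool -k output, read from stdin.

```shell
# offload_state (hypothetical helper): print the state ("on" or "off") of one
# feature from `ethtool -k` output supplied on stdin.
# $1 is the feature name, e.g. tcp-segmentation-offload.
offload_state() {
  # awk's default field splitting ignores the indentation of sub-features,
  # so $1 is always "<feature>:" and $2 is the state.
  awk -v feature="$1" '$1 == (feature ":") { print $2; exit }'
}
```

Example: ethtool -k eno1 | offload_state tcp-segmentation-offload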

On a Proxmox system you can add the following line to your /etc/network/interfaces as the last entry of the vmbr0 (or your main network interface) configuration. This turns off offloading right after the network interface is brought up on every boot.

post-up /usr/bin/logger -p debug -t ifup "Disabling offload for <your_network_interface_name>" && /sbin/ethtool -K <your_network_interface_name> tso off gso off && /usr/bin/logger -p debug -t ifup "Disabled offload for <your_network_interface_name>"
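Because the interface name appears three times in that line, it is easy to miss one placeholder when editing by hand. As a small convenience sketch (make_post_up_line is a hypothetical helper, not something Proxmox provides), the line can be generated for a given interface name:

```shell
# make_post_up_line (hypothetical helper): print the post-up line shown above
# with the interface name given as $1 filled into all three places.
make_post_up_line() {
  iface="$1"
  printf 'post-up /usr/bin/logger -p debug -t ifup "Disabling offload for %s" && /sbin/ethtool -K %s tso off gso off && /usr/bin/logger -p debug -t ifup "Disabled offload for %s"\n' \
    "$iface" "$iface" "$iface"
}
```

Running make_post_up_line eno1 prints the line ready to be pasted into /etc/network/interfaces.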

Here is my current network interface configuration:

source /etc/network/interfaces.d/* 

auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
address <your_ip_address>
netmask <your_netmask>
gateway <your_gateway>
bridge-ports eno1
bridge-stp off
bridge-fd 0
post-up /usr/bin/logger -p debug -t ifup "Disabling offload for eno1" && /sbin/ethtool -K eno1 tso off gso off && /usr/bin/logger -p debug -t ifup "Disabled offload for eno1"

Once you have added the post-up entry, reboot your Proxmox from the Web GUI and run the following command to verify that offloading is really disabled:

ethtool -k <your_network_interface_name>

In the output of this command, the following lines should look like this:

tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: off
generic-segmentation-offload: off
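If you want to automate this comparison, for example in a monitoring script, a sketch along the following lines may help. check_offload_off is a hypothetical helper name; it reads ethtool -k output from stdin, looks only at the six features listed above, prints any that are not off, and returns a non-zero exit status in that case.

```shell
# check_offload_off (hypothetical helper): succeed only if none of the six
# segmentation features listed above is reported as anything other than "off".
# Reads `ethtool -k` output on stdin and prints each offending line.
check_offload_off() {
  awk '
    BEGIN {
      n = split("tcp-segmentation-offload tx-tcp-segmentation " \
                "tx-tcp-ecn-segmentation tx-tcp-mangleid-segmentation " \
                "tx-tcp6-segmentation generic-segmentation-offload", names, " ")
      for (i = 1; i <= n; i++) want[names[i] ":"] = 1
    }
    # $1 is "<feature>:" and $2 is its state ("off [fixed]" still counts as off)
    ($1 in want) && $2 != "off" { print $1, $2; bad = 1 }
    END { exit bad + 0 }
  '
}
```

Example: ethtool -k eno1 | check_offload_off && echo "segmentation offloading is disabled"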

For final confirmation, wait a few hours and check that no Hardware Unit Hang appears in your system logs:

journalctl --since "yesterday" | grep "Detected Hardware Unit Hang"

References

  1. https://forum.proxmox.com/threads/e1000-driver-hang.58284/post-375919
  2. https://docs.hetzner.com/robot/dedicated-server/troubleshooting/performance-intel-i218-nic/