ISC DHCP, FreeBSD and VMWare ESXi

recently, during casual browsing of the WLAN controller, i spotted that users sometimes have problems receiving responses from the DHCP server. i was surprised, as the family doesn’t complain - and they’d do that immediately. well, so i went troubleshooting element by element.

obviously, switches were the primary suspect. why? everything was working, and those DHCP problems were very, very rare - which may mean drops on switch interfaces. Cisco QoS configuration on Catalyst and Nexus switches is far from easy. comparing it to other vendors… there’s really nothing to compare. on one side you can do whatever you want; on the other side you can shoot yourself in both feet, the stomach and then the head pretty quickly. just assume that if you haven’t spent a couple of weeks labbing QoS on real hardware, it’s an area you shouldn’t wander into alone and unsupervised ;) in very simple terms, either use a dedicated GUI for managing campus networks - Cisco DNA Center - or stop at just enabling QoS globally (mls qos) or disabling it (no mls qos).

i’m not a double CCIE for nothing, though, right? i had already done a comprehensive (and manual) configuration of QoS across my switches, router and access points - with traffic classes properly mapped to queues and buffers. that configuration was working without any problem - no drops anywhere.

so, the next step was to see what’s going on with the servers themselves. and here i hit a nasty surprise - one i honestly wasn’t looking for. in the DHCP server logs, at debug level (which on FreeBSD is by default written to /var/log/debug.log), i noticed the following entries:

Aug 20 06:32:37 ns1 dhcpd[6775]: 3 bad udp checksums in 5 packets
Aug 20 07:01:23 ns1 dhcpd[6775]: reuse_lease: lease age 1802 (secs) under 25% threshold, [...]
Aug 20 07:31:25 ns1 dhcpd[6775]: 3 bad udp checksums in 5 packets
Aug 20 07:31:31 ns1 dhcpd[6775]: reuse_lease: lease age 3610 (secs) under 25% threshold, [...]
Aug 20 07:32:42 ns1 dhcpd[6775]: reuse_lease: lease age 3605 (secs) under 25% threshold, [...]
Aug 20 07:33:55 ns1 dhcpd[6775]: reuse_lease: lease age 4888 (secs) under 25% threshold, [...]
Aug 20 07:38:22 ns1 dhcpd[6775]: reuse_lease: lease age 3610 (secs) under 25% threshold, [...]
Aug 20 07:39:11 ns1 dhcpd[6775]: 3 bad udp checksums in 5 packets
Aug 20 08:01:32 ns1 dhcpd[6775]: reuse_lease: lease age 5411 (secs) under 25% threshold, [...]
Aug 20 08:02:45 ns1 dhcpd[6775]: reuse_lease: lease age 5408 (secs) under 25% threshold, [...]
Aug 20 08:02:45 ns1 dhcpd[6775]: 3 bad udp checksums in 5 packets
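dhcpd only keeps a counter; to see which packets actually fail validation, a capture on the server helps. a quick sketch (the interface name vmx0 matches my setup - use whatever ifconfig shows on yours):

```shell
# capture DHCP traffic and let tcpdump validate checksums in software;
# packets that fail validation are flagged as "bad udp cksum" in the output
tcpdump -i vmx0 -vvv -n 'udp port 67 or udp port 68'
```

one caveat: with transmit checksum offload enabled, locally *sent* packets will also show up as "bad udp cksum", simply because the NIC fills in the checksum after the capture point - so only received packets are meaningful evidence here.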

that doesn’t look good. where does that come from? i had changed the virtual NIC in VMware ESXi from Intel (em) to the native vmx long ago, precisely to steer clear of any problems (and to increase performance). searching the internet didn’t turn up anything usable, apart from some posts dating back to 2011 and 2012, where developers of some antique version of Ubuntu were fixing ISC problems that popped up when using virtio under KVM.

and here it dawned on me. sometimes the combination of the operating system network stack, the hypervisor emulation layer and the physical NIC can cause problems.

so, i checked what i’m actually using on FreeBSD under the hypervisor:

[ns1 ~]$ ifconfig vmx0
vmx0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 00:0c:29:77:12:47
	[...]

bingo - RXCSUM,TXCSUM,TSO4,TSO6,RXCSUM_IPV6,TXCSUM_IPV6. so, let’s check without them:

# ifconfig vmx0 -tso4 -tso6 -rxcsum -txcsum -rxcsum6 -txcsum6
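ifconfig changes like this don’t survive a reboot. to make them persistent on FreeBSD, the same flags can be appended to the interface line in /etc/rc.conf - a sketch, where the address is a hypothetical placeholder for whatever the server actually uses:

```shell
# /etc/rc.conf - disable TSO and checksum offload on vmx0 at boot
# (192.0.2.53/24 is a placeholder address, not from my setup)
ifconfig_vmx0="inet 192.0.2.53/24 -tso4 -tso6 -rxcsum -txcsum -rxcsum6 -txcsum6"
```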

…and that’s way better. no more messages about bad UDP checksums in packets served by the DHCP server. the WLAN controller logs also went quiet.

just for testing, i then tried various other combinations of enabling/disabling hardware offload support in the hypervisor and in the VM running under VMware ESXi (6.5/6.7). unfortunately, the only working combination was offload disabled (off) both in the VM and on the hypervisor at the same time.
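for the hypervisor side, TSO can be toggled through ESXi advanced settings - a sketch based on VMware’s documented /Net options (verify the exact option names against the KB for your ESXi version before relying on this):

```shell
# on the ESXi host: disable hardware TSO for IPv4 and IPv6
esxcli system settings advanced set -o /Net/UseHwTSO -i 0
esxcli system settings advanced set -o /Net/UseHwTSO6 -i 0

# verify the current value
esxcli system settings advanced list -o /Net/UseHwTSO
```

a host reboot may be needed for the change to fully take effect.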

sometimes the easiest things hide really nasty surprises. a good network engineer should be able to troubleshoot the whole stack and find the problem.