Nexus and ECMP for DNS

if you have read my previous pieces about my home network, you know that my core switch is a Nexus 93180YC-EX. you know… home, core switch.

anycasted services

at any point in time I have several DNS (and DHCP) servers available, all reachable at the same addresses: 192.168.168.168 for IPv4 and 2001:470:xx:a6::168 for IPv6. no matter what is going on, at least one of them should be able to respond.

currently, the “cluster” consists of two VMs and two physical Raspberry Pi 4B+ boards. all of them run FreeBSD 14.0-STABLE with the nsd, unbound and bird packages, the last one handling the BGP advertisement of the IPv4 and IPv6 addresses.

from the BGP perspective of the core Nexus, those advertisements look like this for IPv4:

sw-core# sh bgp ipv4 unicast
BGP routing table information for VRF default, address family IPv4 Unicast
BGP table version is 36, Local Router ID is 192.168.33.1
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup, 2 - best2

   Network            Next Hop            Metric     LocPrf     Weight Path
*|i192.168.168.168/32 192.168.66.22                     100          0 i
*|i                   192.168.66.44                     100          0 i
*>i                   192.168.44.180                    100          0 i
*|i                   192.168.66.33                     100          0 i

for IPv6:

sw-core# sh bgp ipv6 unicast
[...]
   Network            Next Hop            Metric     LocPrf     Weight Path
*|i2001:470:xx:a6::168/128
                      2001:470:xx:66::22
                                                        100          0 i
*|i                   2001:470:xx:66::44
                                                        100          0 i
*>i                   2001:470:xx:444::180
                                                        100          0 i
*|i                   2001:470:xx:66::33
                                                        100          0 i
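the status column is the interesting part: '*' marks a valid path, '>' the single best path and '|' each additional multipath entry, all of them internal ('i'). if you want to check this from a script, here is a minimal Python sketch that parses the single-line IPv4 layout shown above. it assumes exactly that layout and does not handle the wrapped IPv6 rows; the function name is made up for illustration.

```python
# sample copied from the "sh bgp ipv4 unicast" output above
SAMPLE = """\
   Network            Next Hop            Metric     LocPrf     Weight Path
*|i192.168.168.168/32 192.168.66.22                     100          0 i
*|i                   192.168.66.44                     100          0 i
*>i                   192.168.44.180                    100          0 i
*|i                   192.168.66.33                     100          0 i
"""

# second status character: '>' = best path, '|' = multipath
ROLES = {">": "best", "|": "multipath"}


def parse_bgp_ipv4(output: str) -> dict[str, list[tuple[str, str]]]:
    """Map each prefix to (next_hop, role) pairs from single-line IPv4 rows."""
    routes: dict[str, list[tuple[str, str]]] = {}
    prefix = None
    for line in output.splitlines():
        if not line.startswith("*"):  # keep valid paths only, skip headers
            continue
        role = ROLES.get(line[1], "other")
        fields = line[3:].split()  # columns after status flags + path type
        if "/" in fields[0]:  # the first row of a prefix carries the prefix
            prefix = fields.pop(0)
        routes.setdefault(prefix, []).append((fields[0], role))
    return routes


if __name__ == "__main__":
    for prefix, paths in parse_bgp_ipv4(SAMPLE).items():
        print(prefix)
        for next_hop, role in paths:
            print(f"  {next_hop:<16} {role}")
```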

and finally, what ends up in the actual routing table?

sw-core# sh ip route 192.168.168.168
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

192.168.168.168/32, ubest/mbest: 4/0
    *via 192.168.44.180, [200/0], 1w5d, bgp-65055, internal, tag 65055
    *via 192.168.66.22, [200/0], 4d09h, bgp-65055, internal, tag 65055
    *via 192.168.66.33, [200/0], 3w3d, bgp-65055, internal, tag 65055
    *via 192.168.66.44, [200/0], 1w1d, bgp-65055, internal, tag 65055

sw-core# sh ipv6 route 2001:470:xx:a6::168
IPv6 Routing Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]

2001:470:xx:a6::168/128, ubest/mbest: 4/0
    *via 2001:470:xx:66::22/128, [200/0], 4d09h, bgp-65055, internal, tag 65055
    *via 2001:470:xx:66::33/128, [200/0], 3w3d, bgp-65055, internal, tag 65055
    *via 2001:470:xx:66::44/128, [200/0], 1w1d, bgp-65055, internal, tag 65055
    *via 2001:470:xx:444::180/128, [200/0], 1w5d, bgp-65055, internal, tag 65055
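all four paths made it into the table: "ubest/mbest: 4/0" means four best unicast next hops, each '*via' line is one of them, and [200/0] is [preference/metric], 200 being the default administrative distance for iBGP routes. a minimal Python sketch for checking that the announced count matches the listed next hops; it assumes the output layout shown above and strips the /128 that NX-OS appends to IPv6 next hops.

```python
import re

# sample copied from the "sh ip route 192.168.168.168" output above
SAMPLE = """\
192.168.168.168/32, ubest/mbest: 4/0
    *via 192.168.44.180, [200/0], 1w5d, bgp-65055, internal, tag 65055
    *via 192.168.66.22, [200/0], 4d09h, bgp-65055, internal, tag 65055
    *via 192.168.66.33, [200/0], 3w3d, bgp-65055, internal, tag 65055
    *via 192.168.66.44, [200/0], 1w1d, bgp-65055, internal, tag 65055
"""

HEADER = re.compile(r"^(\S+), ubest/mbest: (\d+)/(\d+)")
# '*via' = best unicast next hop, followed by [preference/metric]
VIA = re.compile(r"^\s*\*via ([^,]+), \[(\d+)/(\d+)\]")


def parse_route(output: str) -> tuple[str, int, list[str]]:
    """Return (prefix, ubest, next_hops) from NX-OS route output."""
    prefix, ubest, next_hops = "", 0, []
    for line in output.splitlines():
        if m := HEADER.match(line):
            prefix, ubest = m.group(1), int(m.group(2))
        elif m := VIA.match(line):
            # IPv6 next hops are printed with a /128 suffix; drop it
            next_hops.append(m.group(1).removesuffix("/128"))
    return prefix, ubest, next_hops


if __name__ == "__main__":
    prefix, ubest, hops = parse_route(SAMPLE)
    status = "OK" if ubest == len(hops) else "MISMATCH"
    print(f"{prefix}: ubest={ubest}, listed={len(hops)} -> {status}")
```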

with default settings, the end result is that traffic gets distributed unequally across the available servers. that’s because the default load-sharing algorithm uses only the source IP address and the source TCP/UDP port for the hash, which for a couple of my 192.168/16 networks results in traffic being split roughly 1/2, 1/4, 1/8 and 1/8 across the four servers.
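to put numbers on "unequal": with four servers the fair share is 1/4 each, so a 1/2, 1/4, 1/8, 1/8 split sends twice the fair share to the busiest server and half of it to each of the two quietest ones. a small Python sketch of that arithmetic; the server names and counts below are made up in that proportion, and real per-server counts could come, for example, from each server's unbound statistics (total.num.queries in unbound-control stats_noreset).

```python
from fractions import Fraction


def load_shares(counts: dict[str, int]) -> dict[str, Fraction]:
    """Fraction of all queries handled by each server."""
    total = sum(counts.values())
    return {name: Fraction(n, total) for name, n in counts.items()}


def imbalance(counts: dict[str, int]) -> Fraction:
    """Busiest server's share divided by the ideal 1/N share (1 = perfect)."""
    ideal = Fraction(1, len(counts))
    return max(load_shares(counts).values()) / ideal


if __name__ == "__main__":
    # hypothetical per-server query counts in the 1/2 : 1/4 : 1/8 : 1/8 proportion
    counts = {"dns-a": 4000, "dns-b": 2000, "dns-c": 1000, "dns-d": 1000}
    for name, share in load_shares(counts).items():
        print(name, share)  # 1/2, 1/4, 1/8, 1/8
    print("imbalance:", imbalance(counts))  # imbalance: 2
```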

to get an even spread, you have to customize the ECMP hashing:

sw-core(config)# ip load-sharing address source-destination port source-destination rotate 32 universal-id 9239194
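what this command changes is the input to a per-flow hash: the switch hashes selected header fields and uses the result to pick one of the four next hops, so all packets of a given flow go to the same server. below is a toy Python model of that idea, NOT the Nexus hardware hash: CRC32 stands in for the real function, the seed plays the role of universal-id, rotate 32 is not modeled, and the client addresses in the usage part are hypothetical. with random ephemeral ports it will likely spread flows evenly in both modes, so it shows the mechanism rather than reproducing the skew described above.

```python
import random
import zlib
from collections import Counter
from dataclasses import dataclass

# next hops taken from the routing table above
NEXT_HOPS = ["192.168.44.180", "192.168.66.22", "192.168.66.33", "192.168.66.44"]


@dataclass(frozen=True)
class Flow:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def pick_next_hop(flow: Flow, next_hops: list[str], mode: str, seed: int = 0) -> str:
    """Toy per-flow ECMP: hash the selected fields, take it modulo the path count.

    mode "source" hashes source IP + source port (the default described above);
    mode "source-destination" also adds destination IP and port.
    seed (0..2**32-1) stands in for universal-id. CRC32 is NOT the Nexus hash.
    """
    if mode == "source":
        fields = (flow.src_ip, flow.src_port)
    elif mode == "source-destination":
        fields = (flow.src_ip, flow.dst_ip, flow.src_port, flow.dst_port)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # prepend the seed so that different seeds can map the same flow differently
    key = seed.to_bytes(4, "big") + "|".join(map(str, fields)).encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


def distribution(flows, next_hops, mode, seed=0) -> Counter:
    """Count how many flows land on each next hop."""
    return Counter(pick_next_hop(f, next_hops, mode, seed) for f in flows)


if __name__ == "__main__":
    rng = random.Random(1)
    # hypothetical DNS clients: a handful of source IPs, random ephemeral ports
    flows = [
        Flow(f"192.168.10.{rng.randint(2, 20)}", rng.randint(1024, 65535), "192.168.168.168", 53)
        for _ in range(10_000)
    ]
    for mode, seed in (("source", 0), ("source-destination", 9239194)):
        print(mode, dict(distribution(flows, NEXT_HOPS, mode, seed)))
```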

…and after about a day the load distribution ended up almost ideal, with roughly 1/4 of the traffic going to each server.

of course, you can verify the ECMP load-sharing settings on the Nexus directly:

sw-core# sh ip load-sharing
IPv4/IPv6 ECMP load sharing:
Universal-id (Random Seed): 94812191
Load-share mode : address source-destination port source-destination
Rotate: 32

summary

beyond ECMP (Equal-Cost MultiPathing), you can try more complicated setups, for example UCMP (Unequal-Cost MultiPathing). however, at least for starters, this approach should work well ;)
