1 .. SPDX-License-Identifier: GPL-2.0
3 =========================
4 Resilient Next-hop Groups
5 =========================
7 Resilient groups are a type of next-hop group that is aimed at minimizing
8 disruption in flow routing across changes to the group composition and
9 weights of constituent next hops.
11 The idea behind resilient hashing groups is best explained in contrast to
12 the legacy multipath next-hop group, which uses the hash-threshold
13 algorithm, described in RFC 2992.
To select a next hop, the hash-threshold algorithm first assigns a range of
16 hashes to each next hop in the group, and then selects the next hop by
17 comparing the SKB hash with the individual ranges. When a next hop is
18 removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus::

        +-------+-------+-------+-------+-------+
        |   1   |   2   |   3   |   4   |   5   |
        +-------+-+-----+---+---+-----+-+-------+
        |    1    |    2    |    4    |    5    |
        +---------+---------+---------+---------+

              Before and after deletion of next hop 3
              under the hash-threshold algorithm.

31 Note how next hop 2 gave up part of the hash space in favor of next hop 1,
32 and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they use.
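
To make the contrast concrete, here is a minimal sketch of hash-threshold
selection in plain C. The structure and function names are made up for the
example and the kernel's implementation differs in detail; the point is that
the thresholds are derived from the whole group, so removing a next hop
shifts the boundaries of the remaining ranges::

  #include <stdint.h>

  struct nh {
          uint32_t id;
          uint32_t weight;        /* assumed to be non-zero */
  };

  /*
   * Split the 32-bit hash space into contiguous regions proportional to
   * weight, and return the next hop whose region contains @hash.
   */
  static uint32_t hash_threshold_select(const struct nh *nhs, int num_nh,
                                        uint32_t hash)
  {
          uint64_t total = 0, upper = 0;
          int i;

          for (i = 0; i < num_nh; i++)
                  total += nhs[i].weight;

          for (i = 0; i < num_nh; i++) {
                  upper += (uint64_t)nhs[i].weight << 32;
                  if (hash < upper / total)
                          return nhs[i].id;
          }
          return nhs[num_nh - 1].id;
  }
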
36 If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment means that packets from a single flow
suddenly start arriving at a server that does not expect them. This
39 can result in TCP connections being reset.
41 If a multipath group is used for load-balancing among available paths to
42 the same server, the issue is that different latencies and reordering along
the way cause the packets to arrive in the wrong order, resulting in
44 degraded application performance.
46 To mitigate the above-mentioned flow redirection, resilient next-hop groups
47 insert another layer of indirection between the hash space and its
48 constituent next hops: a hash table. The selection algorithm uses SKB hash
49 to choose a hash table bucket, then reads the next hop that this bucket
50 contains, and forwards traffic there.
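
A sketch of that selection step might look as follows. The types and names
are illustrative, not the kernel's; note that the bucket also records when
it was last hit, which the idle-timer logic described later relies on::

  #include <stdint.h>

  struct res_bucket {
          uint32_t nh_id;         /* next hop currently held by this bucket */
          unsigned long used;     /* timestamp of the last packet hit */
  };

  struct res_table {
          uint16_t num_buckets;
          struct res_bucket *buckets;
  };

  /* SKB hash -> bucket -> next hop. The bucket-to-next-hop mapping is
   * arbitrary, which is what allows migrating individual buckets later. */
  static uint32_t resilient_select(struct res_table *tbl, uint32_t skb_hash,
                                   unsigned long now)
  {
          struct res_bucket *b = &tbl->buckets[skb_hash % tbl->num_buckets];

          b->used = now;
          return b->nh_id;
  }
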
52 This indirection brings an important feature. In the hash-threshold
53 algorithm, the range of hashes associated with a next hop must be
54 continuous. With a hash table, mapping between the hash table buckets and
55 the individual next hops is arbitrary. Therefore when a next hop is deleted
56 the buckets that held it are simply reassigned to other next hops::
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                         v v v v
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Before and after deletion of next hop 3
              under the resilient hashing algorithm.

69 When weights of next hops in a group are altered, it may be possible to
70 choose a subset of buckets that are currently not used for forwarding
71 traffic, and use those to satisfy the new next-hop distribution demands,
72 keeping the "busy" buckets intact. This way, established flows are ideally
73 kept being forwarded to the same endpoints through the same paths as before
74 the next-hop group change.

Algorithm
---------

In a nutshell, the algorithm works as follows. Each next hop deserves a
80 certain number of buckets, according to its weight and the number of
81 buckets in the hash table. In accordance with the source code, we will call
82 this number a "wants count" of a next hop. In case of an event that might
cause bucket allocation change, the wants counts for individual next hops
are updated.

Next hops that have fewer buckets than their wants count are called
87 "underweight". Those that have more are "overweight". If there are no
88 overweight (and therefore no underweight) next hops in the group, it is
89 said to be "balanced".
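
As a rough sketch of this bookkeeping (the kernel's own computation also
hands out the rounding remainder so that the wants counts add up to the
table size; the names below are illustrative, not the kernel's)::

  #include <stdbool.h>
  #include <stdint.h>

  struct res_nh {
          uint32_t id;
          uint32_t weight;
          uint16_t wants;         /* buckets this next hop deserves */
          uint16_t have;          /* buckets it currently holds */
  };

  static void update_wants_counts(struct res_nh *nhs, int num_nh,
                                  uint16_t num_buckets)
  {
          uint64_t total_weight = 0;
          int i;

          for (i = 0; i < num_nh; i++)
                  total_weight += nhs[i].weight;

          for (i = 0; i < num_nh; i++)
                  nhs[i].wants = (uint64_t)num_buckets * nhs[i].weight /
                                 total_weight;
  }

  static bool nh_is_underweight(const struct res_nh *nh)
  {
          return nh->have < nh->wants;
  }

  static bool nh_is_overweight(const struct res_nh *nh)
  {
          return nh->have > nh->wants;
  }
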
91 Each bucket maintains a last-used timer. Every time a packet is forwarded
92 through a bucket, this timer is updated to current jiffies value. One
93 attribute of a resilient group is then the "idle timer", which is the
94 amount of time that a bucket must not be hit by traffic in order for it to
95 be considered "idle". Buckets that are not idle are busy.
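
In terms of the ``struct res_bucket`` sketched earlier, the test is simply
the following (the kernel uses jiffies and its time-comparison helpers; this
is only an illustration)::

  static bool bucket_is_idle(const struct res_bucket *b, unsigned long now,
                             unsigned long idle_timer)
  {
          /* Idle when no traffic hit the bucket for at least idle_timer. */
          return now - b->used >= idle_timer;
  }
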
After assigning wants counts to next hops, an "upkeep" algorithm runs. For
buckets:

100 1) that have no assigned next hop, or
101 2) whose next hop has been removed, or
102 3) that are idle and their next hop is overweight,
104 upkeep changes the next hop that the bucket references to one of the
105 underweight next hops. If, after considering all buckets in this manner,
there are still underweight next hops, another upkeep run is scheduled to a
future time.

109 There may not be enough "idle" buckets to satisfy the updated wants counts
110 of all next hops. Another attribute of a resilient group is the "unbalanced
111 timer". This timer can be set to 0, in which case the table will stay out
112 of balance until idle buckets do appear, possibly never. If set to a
113 non-zero value, the value represents the period of time that the table is
114 permitted to stay out of balance.
With this in mind, we update the above list of conditions with one more
item. Thus buckets:

119 4) whose next hop is overweight, and the amount of time that the table has
120 been out of balance exceeds the unbalanced timer, if that is non-zero,
122 \... are migrated as well.
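
Pulling conditions 1) through 4) together, one upkeep pass might be sketched
as follows, reusing the illustrative types and helpers from the earlier
examples. The kernel's real upkeep, in ``net/ipv4/nexthop.c``, additionally
notifies drivers and reschedules itself::

  static struct res_nh *nh_lookup(struct res_nh *nhs, int num_nh, uint32_t id)
  {
          int i;

          for (i = 0; i < num_nh; i++)
                  if (nhs[i].id == id)
                          return &nhs[i];
          return NULL;    /* the next hop was removed from the group */
  }

  static struct res_nh *first_underweight(struct res_nh *nhs, int num_nh)
  {
          int i;

          for (i = 0; i < num_nh; i++)
                  if (nh_is_underweight(&nhs[i]))
                          return &nhs[i];
          return NULL;
  }

  static bool bucket_migratable(const struct res_bucket *b,
                                const struct res_nh *nh, unsigned long now,
                                unsigned long idle_timer, bool force_rebalance)
  {
          if (!nh)
                  return true;            /* conditions 1) and 2) */
          if (!nh_is_overweight(nh))
                  return false;
          if (bucket_is_idle(b, now, idle_timer))
                  return true;            /* condition 3) */
          return force_rebalance;         /* condition 4) */
  }

  static void upkeep(struct res_table *tbl, struct res_nh *nhs, int num_nh,
                     unsigned long now, unsigned long idle_timer,
                     bool force_rebalance)
  {
          uint16_t i;

          for (i = 0; i < tbl->num_buckets; i++) {
                  struct res_bucket *b = &tbl->buckets[i];
                  struct res_nh *cur = nh_lookup(nhs, num_nh, b->nh_id);
                  struct res_nh *under = first_underweight(nhs, num_nh);

                  if (!under)
                          return;         /* balanced, nothing left to do */
                  if (!bucket_migratable(b, cur, now, idle_timer,
                                         force_rebalance))
                          continue;

                  /* Migrate the bucket to an underweight next hop. */
                  if (cur)
                          cur->have--;
                  b->nh_id = under->id;
                  under->have++;
          }
          /* If some next hop is still underweight at this point, another
           * upkeep run would be scheduled for a later time. */
  }
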
124 Offloading & Driver Feedback
125 ----------------------------
127 When offloading resilient groups, the algorithm that distributes buckets
128 among next hops is still the one in SW. Drivers are notified of updates to
129 next hop groups in the following three ways:
131 - Full group notification with the type
132 ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
133 created and buckets populated for the first time.
135 - Single-bucket notifications of the type
136 ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which is used for notifications of
137 individual migrations within an already-established group.
139 - Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This
140 is sent before the group is replaced, and is a way for the driver to veto
141 the group before committing anything to the HW.
143 Some single-bucket notifications are forced, as indicated by the "force"
144 flag in the notification. Those are used for the cases where e.g. the next
hop associated with the bucket was removed, and the bucket really must be
migrated.

148 Non-forced notifications can be overridden by the driver by returning an
149 error code. The use case for this is that the driver notifies the HW that a
bucket should be migrated, but the HW discovers that the bucket has in fact
been hit by traffic.
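
Schematically, a driver might dispatch these notifications as in the sketch
below. The event and structure names follow ``include/net/nexthop.h``; the
``my_hw_*()`` helpers are hypothetical placeholders for the actual hardware
programming::

  #include <linux/notifier.h>
  #include <net/nexthop.h>

  /* Hypothetical hardware-programming helpers. */
  int my_hw_res_table_fits(u32 group_id, u16 num_buckets);
  int my_hw_res_table_write(u32 group_id,
                            const struct nh_notifier_res_table_info *tbl);
  int my_hw_bucket_write(u32 group_id, u16 bucket_index,
                         const struct nh_notifier_single_info *nh);
  bool my_hw_bucket_is_busy(u32 group_id, u16 bucket_index);

  static int my_drv_nexthop_event(struct notifier_block *nb,
                                  unsigned long event, void *ptr)
  {
          struct nh_notifier_info *info = ptr;
          int err = 0;

          switch (event) {
          case NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE:
                  /* Chance to veto the group before anything is committed. */
                  err = my_hw_res_table_fits(info->id,
                                             info->nh_res_table->num_nh_buckets);
                  break;
          case NEXTHOP_EVENT_REPLACE:
                  if (info->type == NH_NOTIFIER_INFO_TYPE_RES_TABLE)
                          /* Initial population of the whole bucket table. */
                          err = my_hw_res_table_write(info->id,
                                                      info->nh_res_table);
                  break;
          case NEXTHOP_EVENT_BUCKET_REPLACE:
                  /* Single-bucket migration. A non-forced migration may be
                   * refused if the HW has seen traffic on this bucket. */
                  if (!info->nh_res_bucket->force &&
                      my_hw_bucket_is_busy(info->id,
                                           info->nh_res_bucket->bucket_index))
                          err = -EBUSY;
                  else
                          err = my_hw_bucket_write(info->id,
                                                   info->nh_res_bucket->bucket_index,
                                                   &info->nh_res_bucket->new_nh);
                  break;
          }

          return notifier_from_errno(err);
  }
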
153 A second way for the HW to report that a bucket is busy is through the
154 ``nexthop_res_grp_activity_update()`` API. The buckets identified this way
155 as busy are treated as if traffic hit them.
157 Offloaded buckets should be flagged as either "offload" or "trap". This is
158 done through the ``nexthop_bucket_set_hw_flags()`` API.
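
For illustration, both feedback calls might be driven from a periodic sync
work item in the driver, along these lines. The two ``nexthop_*()``
signatures are as declared in ``include/net/nexthop.h``; the ``my_hw_*()``
helper and the constant offload/trap flags are placeholders; a real driver
would report the actual per-bucket state::

  #include <linux/bitmap.h>
  #include <linux/gfp.h>
  #include <net/nexthop.h>

  /* Hypothetical helper: did the HW see traffic on this bucket? */
  bool my_hw_bucket_saw_traffic(u32 group_id, u16 bucket_index);

  static void my_drv_res_group_sync(struct net *net, u32 group_id,
                                    u16 num_buckets)
  {
          unsigned long *activity;
          u16 i;

          activity = bitmap_zalloc(num_buckets, GFP_KERNEL);
          if (!activity)
                  return;

          for (i = 0; i < num_buckets; i++) {
                  /* Buckets set in the bitmap are treated as if traffic
                   * hit them since the last report. */
                  if (my_hw_bucket_saw_traffic(group_id, i))
                          __set_bit(i, activity);

                  /* Reflect the offload state of each bucket. */
                  nexthop_bucket_set_hw_flags(net, group_id, i,
                                              true /* offload */,
                                              false /* trap */);
          }

          nexthop_res_grp_activity_update(net, group_id, num_buckets, activity);
          bitmap_free(activity);
  }
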

Netlink UAPI
------------

Resilient Group Replacement
164 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
166 Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
167 same manner as other multipath groups. The following changes apply to the
168 attributes passed in the netlink message:
  =================== =========================================================
  ``NHA_GROUP_TYPE``  Should be ``NEXTHOP_GRP_TYPE_RES`` for resilient group.
  ``NHA_RES_GROUP``   A nest that contains attributes specific to resilient
                      groups.
  =================== =========================================================
176 ``NHA_RES_GROUP`` payload:
  =================================== =========================================
  ``NHA_RES_GROUP_BUCKETS``           Number of buckets in the hash table.
  ``NHA_RES_GROUP_IDLE_TIMER``        Idle timer in units of clock_t.
  ``NHA_RES_GROUP_UNBALANCED_TIMER``  Unbalanced timer in units of clock_t.
  =================================== =========================================

Next Hop Get
^^^^^^^^^^^^

Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
188 message in exactly the same way as other next hop get requests. The
response attributes match the replacement attributes cited above, except
that the ``NHA_RES_GROUP`` payload will include the following attribute:
  =================================== =========================================
  ``NHA_RES_GROUP_UNBALANCED_TIME``   How long has the resilient group been out
                                      of balance, in units of clock_t.
  =================================== =========================================

Bucket Get
^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
used to request a single bucket. The attributes recognized at get requests
are:

  =================== =========================================================
  ``NHA_ID``          ID of the next-hop group that the bucket belongs to.
  ``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
  =================== =========================================================
209 ``NHA_RES_BUCKET`` payload:
211 ======================== ====================================================
212 ``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table.
213 ======================== ====================================================

Bucket Dumps
^^^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
to request a dump of matching buckets. The attributes recognized at dump
requests are:

  =================== =========================================================
  ``NHA_ID``          If specified, limits the dump to just the next-hop group
                      with this ID.
  ``NHA_OIF``         If specified, limits the dump to buckets that contain
                      next hops that use the device with this ifindex.
  ``NHA_MASTER``      If specified, limits the dump to buckets that contain
                      next hops that use a device in the VRF with this ifindex.
  ``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
  =================== =========================================================
232 ``NHA_RES_BUCKET`` payload:
  ======================== ====================================================
  ``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
                           that contain the next hop with this ID.
  ======================== ====================================================

Usage
-----

To illustrate the usage, consider the following commands::
244 # ip nexthop add id 1 via 192.0.2.2 dev eth0
245 # ip nexthop add id 2 via 192.0.2.3 dev eth0
246 # ip nexthop add id 10 group 1/2 type resilient \
247 buckets 8 idle_timer 60 unbalanced_timer 300
249 The last command creates a resilient next-hop group. It will have 8 buckets
(which is an unusually low number, used here for demonstration purposes
251 only), each bucket will be considered idle when no traffic hits it for at
252 least 60 seconds, and if the table remains out of balance for 300 seconds,
253 it will be forcefully brought into balance.
255 Changing next-hop weights leads to change in bucket allocation::
257 # ip nexthop replace id 10 group 1,3/2 type resilient
259 This can be confirmed by looking at individual buckets::
261 # ip nexthop bucket show id 10
262 id 10 index 0 idle_time 5.59 nhid 1
263 id 10 index 1 idle_time 5.59 nhid 1
264 id 10 index 2 idle_time 8.74 nhid 2
265 id 10 index 3 idle_time 8.74 nhid 2
266 id 10 index 4 idle_time 8.74 nhid 1
267 id 10 index 5 idle_time 8.74 nhid 1
268 id 10 index 6 idle_time 8.74 nhid 1
269 id 10 index 7 idle_time 8.74 nhid 1
271 Note the two buckets that have a shorter idle time. Those are the ones that
272 were migrated after the next-hop replace command to satisfy the new demand
273 that next hop 1 be given 6 buckets instead of 4.

Netdevsim
---------

The netdevsim driver implements a mock offload of resilient groups, and
exposes a debugfs interface that allows marking individual buckets as busy.
For example, the following will mark bucket 23 in next-hop group 10 as
busy::

283 # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity
285 In addition, another debugfs interface can be used to configure that the
286 next attempt to migrate a bucket should fail::
288 # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
290 Besides serving as an example, the interfaces that netdevsim exposes are
291 useful in automated testing, and
292 ``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
293 them to test the algorithm.