Building an Edge Traffic Controller - Part 2

Practical demo of using the controller to detour traffic from overloaded interfaces

Posted by mayuresh82 on March 22, 2021 - 14 minute read

In Part 1 of this series, we ran through the technical details of what it would take to build an edge traffic controller to steer traffic away from overloaded edge links. In this post, I will demonstrate the controller in action by simulating traffic flows using our virtualized topology. We will also look at other real-world considerations such as operational monitoring and metrics.

Initial Setup

To recap, this is what our topology looks like:

Test bed topology
Network Topology

BGP setup

Recall that the Juniper vQFX is set up as a peer router in our topology and is configured to announce several random prefixes such as 1.0.0.0/24, 1.0.1.0/24 etc. to the downstream Cumulus VX. These prefixes will be installed as sink routes with a discard (or null0) next hop, which means the UDP traffic to these destinations is simply going to get dropped. The vQFX will be configured to announce the same routes across both BGP sessions to the Cumulus VX, which will in turn be configured to prefer the routes learned over the “peering” link as opposed to the “transit” link. This device will also establish an iBGP peering session with the controller and will announce all available paths to it (while retaining the external next hops). Below is a BGP configuration snippet from the Cumulus VX:

router bgp 12121
  bgp router-id 50.50.50.50
  neighbor 10.1.4.100 remote-as 12121
  neighbor 10.1.4.100 description CONTROLLER
  neighbor 172.16.0.0 remote-as 33010
  neighbor 172.16.0.0 description PEER
  neighbor 172.16.0.2 remote-as 42428
  neighbor 172.16.0.2 description TRANSIT
  neighbor 172.16.0.0 timers 0 0
  neighbor 172.16.0.2 timers 0 0

  address-family ipv4 unicast
    network 10.1.3.0/24
    network 10.1.4.0/24
    neighbor 10.1.4.100 addpath-tx-all-paths
    neighbor 10.1.4.100 route-map ALL out
    neighbor 172.16.0.0 soft-reconfiguration inbound
    neighbor 172.16.0.0 route-map PEER_IN in
    neighbor 172.16.0.2 soft-reconfiguration inbound

route-map PEER_IN permit 10
  set local-preference 200

route-map ALL permit 10
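
For completeness, the sink routes on the vQFX side are just static discard routes exported into BGP. A rough sketch of what that might look like in Junos (the group and policy names here are illustrative, not taken from the actual lab config):

set routing-options static route 1.0.0.0/24 discard
set routing-options static route 1.0.1.0/24 discard
set policy-options policy-statement EXPORT-SINKS term 1 from protocol static
set policy-options policy-statement EXPORT-SINKS term 1 then accept
set protocols bgp group DOWNSTREAM export EXPORT-SINKS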

Below is a snippet from the route table of the Cumulus VX, which shows the same route being announced from both neighbors, with the route via the peer being preferred:

vagrant@rtr1:mgmt:~$ net show bgp
show bgp ipv4 unicast
=====================
BGP table version is 239, local router ID is 50.50.50.50, vrf id 0
Default local pref 100, local AS 12121
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*  1.0.0.0/24       172.16.0.2                             0 42428 i
*>                  172.16.0.0                    200      0 33010 i
*  1.0.1.0/24       172.16.0.2                             0 42428 i
*>                  172.16.0.0                    200      0 33010 i
*  1.0.2.0/24       172.16.0.2                             0 42428 i
*>                  172.16.0.0                    200      0 33010 i
*  1.0.3.0/24       172.16.0.2                             0 42428 i
*>                  172.16.0.0                    200      0 33010 i
*  1.0.4.0/24       172.16.0.2                             0 42428 i
*>                  172.16.0.0                    200      0 33010 i

Host sflow setup

In order to correctly sample egress packets on the Cumulus VX, we need to enable nflog sampling at a rate of 1 in 1000 packets using iptables rules in the mangle table as follows:

MOD_STATISTIC="-m statistic --mode random --probability 0.001"
NFLOG_CONFIG="--nflog-group 5 --nflog-prefix SFLOW"

sudo iptables -t mangle -I POSTROUTING -j NFLOG $MOD_STATISTIC $NFLOG_CONFIG
sudo iptables -t mangle -I INPUT -j NFLOG $MOD_STATISTIC $NFLOG_CONFIG
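
To verify that the rules are installed and matching packets, the packet counters in the mangle table should tick up slowly (at roughly the sampling probability):

sudo iptables -t mangle -L POSTROUTING -v -n
sudo iptables -t mangle -L INPUT -v -n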

The host sflow daemon on the Cumulus VX can then be configured to export flows to the controller. Below is the relevant host-sflow daemon config:

sflow {
  agent = swp4
  polling = 10
  sampling = 1000
  collector { ip=10.1.4.100 udpport=6343 }
  nflog { group = 5  probability = 0.001 }
}

This hsflowd config specifies flow export to the collector at 10.1.4.100 (UDP port 6343) using a sampling rate of 1 in 1000 packets. It is set to poll and export interface counters every 10 seconds, using the address of interface swp4 as the agent address. We also configure the daemon to use nflog group 5 (defined above) with a probability that matches the sampling rate.

Finally, we can start the hsflow daemon using:

sudo service hsflowd start

Controller setup

The controller needs to be supplied with a config file that specifies the device and interfaces to be monitored, as well as the potential detour candidate links. For the demo, we will also specify interface parameters such as link speed, SNMP index etc using the same config file. The following is the YAML config we will use:

sflow:
  listen_port: 6343
  counter_poll_interval: 10s

bgp:
  local_as: 12121
  local_ip: 10.1.4.100

devices:
  - name: rtr1
    ip: 10.1.4.1
    asn: 12121 
    bgp_community: 100:100
    interfaces:
      - name: swp1
        ifindex: 3
        address: 172.16.0.1/31
        speed: 100000000
        high_watermark: 80
        low_watermark: 65
        per_prefix_threshold: 30000000
        #dry_run: true
        monitor: true
      - name: swp2
        ifindex: 4
        address: 172.16.0.3/31
        speed: 100000000
        high_watermark: 80
        low_watermark: 60

First, we specify the sflow port and the expected counter poll interval. Next comes the BGP config for the embedded GoBGP server inside the controller. At the device level, we specify the BGP parameters to use. Each device is configured with a special BGP community, which is used by GoBGP to perform selective route announcement (via peer policies) to that specific neighbor. Also note that the device ASN is the same as our controller ASN, which means we are using iBGP. This causes the router to announce prefixes with the original external next hops retained (i.e. no next-hop-self). This is a crucial part of controller operation, since the controller needs visibility into external next hops in order to make detour routing decisions.
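
To give an idea of how the controller might consume this file, here is a minimal sketch of Go structs that the YAML above could unmarshal into (the field and type names are my guesses for illustration, not the controller's actual source):

package config

import (
    "os"

    "gopkg.in/yaml.v2"
)

// Config mirrors the controller YAML shown above (names are illustrative).
type Config struct {
    Sflow struct {
        ListenPort          int    `yaml:"listen_port"`
        CounterPollInterval string `yaml:"counter_poll_interval"` // e.g. "10s", parse with time.ParseDuration
    } `yaml:"sflow"`
    BGP struct {
        LocalAS uint32 `yaml:"local_as"`
        LocalIP string `yaml:"local_ip"`
    } `yaml:"bgp"`
    Devices []Device `yaml:"devices"`
}

// Device is one monitored router and its BGP parameters.
type Device struct {
    Name         string      `yaml:"name"`
    IP           string      `yaml:"ip"`
    ASN          uint32      `yaml:"asn"`
    BGPCommunity string      `yaml:"bgp_community"`
    Interfaces   []Interface `yaml:"interfaces"`
}

// Interface carries the per-interface thresholds discussed below.
type Interface struct {
    Name               string `yaml:"name"`
    IfIndex            int    `yaml:"ifindex"`
    Address            string `yaml:"address"`
    Speed              uint64 `yaml:"speed"`
    HighWatermark      int    `yaml:"high_watermark"`
    LowWatermark       int    `yaml:"low_watermark"`
    PerPrefixThreshold uint64 `yaml:"per_prefix_threshold"`
    DryRun             bool   `yaml:"dry_run"`
    Monitor            bool   `yaml:"monitor"`
}

// Load reads and parses the YAML config file.
func Load(path string) (*Config, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var c Config
    if err := yaml.Unmarshal(data, &c); err != nil {
        return nil, err
    }
    return &c, nil
}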

Interface swp1 is the interface connecting to our peering partner, which is AS 33010. Interface swp2, on the other hand, connects to our transit provider, which is AS 42428. The monitored interface also specifies three noteworthy parameters:

  • high_watermark: the utilization percentage above which detours should be triggered
  • low_watermark: the utilization percentage below which detours can be removed
  • per_prefix_threshold: the per-prefix utilization in bps above which prefixes with high traffic rates are split into smaller prefixes. This is set to 30Mbps here.

The controller is thus configured to keep the utilization of swp1 between 65% and 80%. Also note the dry_run option, which is commented out and therefore defaults to false. If this is set to true, the controller will simply log its decisions instead of actually applying the detours. Finally, we cap the interface speeds at 100Mbps in order to facilitate testing with reasonable traffic rates.
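
In code terms, the per-interval decision for a monitored interface boils down to a simple hysteresis check against these two watermarks; a minimal sketch (function names are mine, not the controller's):

package main

import "fmt"

// action implements the watermark hysteresis: add detours above the high
// watermark, remove them again only once utilization drops to the low
// watermark, and do nothing in between.
func action(utilPct, highWM, lowWM float64) string {
    switch {
    case utilPct >= highWM:
        return "add-detours"
    case utilPct <= lowWM:
        return "remove-detours"
    default:
        return "no-op" // inside the hysteresis band, leave things alone
    }
}

func main() {
    for _, u := range []float64{55, 70, 82, 75, 65} {
        fmt.Printf("util=%.0f%% -> %s\n", u, action(u, 80, 65))
    }
}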

We can fire up the controller on srv2 and watch BGP come up with the router (rtr1):

vagrant@server2:~$ sudo ./controller -v=4 -logtostderr -config config.yaml
I0321 06:09:40.712404   15107 controller.go:90] Starting GoBGP Server
I0321 06:09:40.713107   15107 controller.go:100] Starting Sflow server on :6343
INFO[0000] Add a peer configuration for:10.1.4.1         Topic=Peer
I0321 06:09:40.714226   15107 controller.go:187] Starting http server on :8080
I0321 06:09:40.714315   15107 controller.go:223] Starting monitoring for Interface swp1 on rtr1
INFO[0006] Peer Up                                       Key=10.1.4.1 State=BGP_FSM_OPENCONFIRM Topic=Peer
I0321 06:09:47.048734   15107 controller.go:107] Peer rtr1 is now up !

The controller is now ready to receive sflow samples and start monitoring the interface swp1.

Traffic generation

We can use iperf installed on one of the servers attached to our Cumulus VX as a traffic generator. Iperf allows us to generate simple UDP streams with a given throughput and for a given duration to specific destination IPs. This is called client mode. Because our main goal here is simply to create some traffic load over our edge links without worrying about what happens to these streams, we don't really care about setting up iperf in server mode on the receiving end. In order to facilitate testing, I'll be using a short Python script that can spin up subprocesses running iperf to a given destination at a given rate of traffic, in parallel.
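
Each stream is just an iperf UDP client run; for example, a single 20Mbps stream to 1.0.0.1 looks roughly like this (flags shown are for classic iperf2 and the duration is arbitrary; adjust for iperf3 if that is what you have installed):

iperf -c 1.0.0.1 -u -b 20M -t 600 &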

Monitoring

In order to better monitor the effects of the controller on the edge interfaces, the controller offers a REST API that exposes useful operational information such as:

  • utilization of interfaces being monitored
  • top-N prefixes in terms of interface utilization
  • current overrides that are in effect
  • BGP paths learnt from the routers

Providing the above APIs is a great way to ensure visibility into the operational state of the controller. Taking this a step further, the controller also exposes a Prometheus scrape endpoint that serves interface traffic statistics as well as per-prefix statistics obtained from Sflow samples. For the purposes of this demo, this can help visualize the effects of controller actions. We also stand up a local Grafana instance to create some lovely dashboards to go with this.
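
To sketch what the Prometheus side could look like, here is a minimal example using client_golang with a couple of gauge vectors for interface and per-prefix rates (the metric and label names are made up for illustration; the controller's actual names may differ):

package metrics

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Illustrative metrics only; the controller's real metric names may differ.
var (
    intfUtil = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "interface_utilization_bps",
            Help: "Measured egress utilization per monitored interface.",
        },
        []string{"device", "interface"},
    )
    prefixRate = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "prefix_rate_bps",
            Help: "Per-prefix traffic rate derived from sFlow samples.",
        },
        []string{"device", "interface", "prefix"},
    )
)

// Serve registers the gauges and exposes them on the standard /metrics path.
func Serve(addr string) error {
    prometheus.MustRegister(intfUtil, prefixRate)
    http.Handle("/metrics", promhttp.Handler())
    return http.ListenAndServe(addr, nil)
}

// Elsewhere in the controller, updates would look like:
//   intfUtil.WithLabelValues("rtr1", "swp1").Set(82609935)
//   prefixRate.WithLabelValues("rtr1", "swp1", "4.0.0.0/24").Set(29107200)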

To start generating some traffic, let's go ahead and start up three UDP streams from srv1 to destinations 1.0.0.1, 2.0.0.1 and 3.0.0.1, each sending 20Mbps. Note that for interface swp1, the HWM is set to 80% (80Mbps on this 100Mbps link) and the LWM to 65% utilization. Watching the controller logs, we can see it monitoring the traffic utilization every minute:

I0321 06:10:40.714529   15107 controller.go:240] rtr1/swp1 (Util: 45446361, speed: 100000000) is at 45% utilization
I0321 06:10:40.714785   15107 controller.go:246] Low WM reached for rtr1/swp1
I0321 06:10:40.714941   15107 controller.go:223] Starting monitoring for Interface swp1 on rtr1
I0321 06:11:40.715295   15107 controller.go:240] rtr1/swp1 (Util: 55628467, speed: 100000000) is at 55% utilization
I0321 06:11:40.715491   15107 controller.go:246] Low WM reached for rtr1/swp1
I0321 06:11:40.715641   15107 controller.go:223] Starting monitoring for Interface swp1 on rtr1
I0321 06:12:40.755906   15107 controller.go:240] rtr1/swp1 (Util: 65814520, speed: 100000000) is at 65% utilization
I0321 06:12:40.755917   15107 controller.go:246] Low WM reached for rtr1/swp1
I0321 06:12:40.755923   15107 controller.go:223] Starting monitoring for Interface swp1 on rtr1
I0321 06:13:40.756138   15107 controller.go:240] rtr1/swp1 (Util: 65806368, speed: 100000000) is at 65% utilization
I0321 06:13:40.756362   15107 controller.go:246] Low WM reached for rtr1/swp1
I0321 06:13:40.756674   15107 controller.go:223] Starting monitoring for Interface swp1 on rtr1

Even though we ask iperf to send 20Mbps streams, the actual measured traffic rate includes header overhead and as a result, the interface utilization settles at 65%.

After waiting a few minutes for counters to catch up and flow samples to populate, we start a fourth stream, this time sending 25Mbps to 4.0.0.1. A few minutes later, the utilization on interface swp1 hits the HWM and we can see the controller override in action:

I0321 06:16:40.917430   15107 controller.go:240] rtr1/swp1 (Util: 82609935, speed: 100000000) is at 82% utilization
I0321 06:16:40.923359   15107 controller.go:294] Final Prefixes: [[{4.0.0.0/24 29107200}] [{2.0.0.0/24 21830400}] [{1.0.0.0/24 21628266}] [{3.0.0.0/24 19404800}]]
I0321 06:16:40.923414   15107 controller.go:302] Will detour Prefix: {4.0.0.0/24 29107200}
I0321 06:16:40.923524   15107 controller.go:400] Adding overrides for device: rtr1, intf: swp1
I0321 06:16:40.925620   15107 controller.go:415] Found 3 total paths for prefix 4.0.0.0/24
I0321 06:16:40.925681   15107 controller.go:439] Detouring prefix 4.0.0.0/24 to interface swp2 (nh 172.16.0.2)
I0321 06:16:40.926163   15107 controller.go:315] Detouring total of 29107200 bps off interface rtr1:swp1 (util 82609935, hwm 80)

From the logs, we can see the controller logic in action. When it measures the interface utilization to be above the HWM, it first queries the Sflow collector and gets a list of the top 5 prefixes sending traffic across the interface. It then detours those prefixes one by one until the interface utilization falls below the HWM. In this case, the topmost prefix was 4.0.0.0/24, which had a configured rate of 25Mbps (measured ~29Mbps) and gets detoured to the alternate path. This happens via injection of a BGP override to rtr1 with the same route but pointing to interface swp2 (next-hop 172.16.0.2) and with a higher local preference of 500. Checking the BGP table on rtr1 shows us this route:

*> 4.0.0.0/16       172.16.0.2                             0 42428 i
*>i4.0.0.0/24       172.16.0.2                    500      0 e
*                   172.16.0.0                    200      0 33010 i
*                   172.16.0.2                             0 42428 i
*> 4.0.1.0/24       172.16.0.0                    200      0 33010 i
*                   172.16.0.2                             0 42428 i
*> 4.0.2.0/24       172.16.0.0                    200      0 33010 i
*                   172.16.0.2                             0 42428 i
*> 4.0.3.0/24       172.16.0.0                    200      0 33010 i
*                   172.16.0.2                             0 42428 i
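
The selection logic described above is simple enough to sketch in a few lines of Go. The following is a self-contained illustration of the idea (not the controller's actual code), fed with the prefix rates from the logs above:

package main

import (
    "fmt"
    "sort"
)

// PrefixRate is a prefix and its measured egress rate in bps.
type PrefixRate struct {
    Prefix  string
    RateBps uint64
}

// pickDetours returns the prefixes to detour, highest rate first, until the
// projected utilization drops back below the high watermark.
func pickDetours(topPrefixes []PrefixRate, utilBps, speedBps uint64, highWM float64) []PrefixRate {
    sort.Slice(topPrefixes, func(i, j int) bool {
        return topPrefixes[i].RateBps > topPrefixes[j].RateBps
    })
    hwmBps := uint64(highWM / 100 * float64(speedBps))
    var detours []PrefixRate
    for _, p := range topPrefixes {
        if utilBps < hwmBps {
            break // already back under the high watermark
        }
        detours = append(detours, p)
        utilBps -= p.RateBps
    }
    return detours
}

func main() {
    top := []PrefixRate{
        {"4.0.0.0/24", 29107200},
        {"2.0.0.0/24", 21830400},
        {"1.0.0.0/24", 21628266},
        {"3.0.0.0/24", 19404800},
    }
    // With swp1 at 82609935 bps on a 100Mbps link and an 80% HWM, only the
    // top prefix needs to move, matching the controller's decision above.
    for _, d := range pickDetours(top, 82609935, 100000000, 80) {
        fmt.Printf("detour %s (%d bps)\n", d.Prefix, d.RateBps)
    }
}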

If we query the REST API on the controller, we can get all the information relevant to the current state of affairs. For one, we can see the current overrides in the controller. This includes information about the prefix being overridden along with its traffic rate, as well as the injected BGP route.

$ curl "http://localhost:8080/api/overrides/?router=rtr1&interface=swp1"
{"4.0.0.0/24":{"Prefix":"4.0.0.0/24","rate":24458133,"parent_prefix":"4.0.0.0/24","Route":{"Prefix":"4.0.0.0/24","Nh":"172.16.0.2","Lp":500,"as_path":[42428],"Origin":1,"Med":0,"bgp_type":0,"Communities":["100:100"],"mp_candidate":false},"out_ifindex":4}}

We can also query the topK prefix rates across interfaces as shown below:

$ curl "http://localhost:8080/api/topK/?router=rtr1&interface=swp1"
[{"Prefix":"3.0.0.0/24","RateBps":24053866},{"Prefix":"1.0.0.0/24","RateBps":20415466},{"Prefix":"2.0.0.0/24","RateBps":19404800}]

$ curl "http://localhost:8080/api/topK/?router=rtr1&interface=swp2"
[{"Prefix":"4.0.0.0/24","RateBps":25873066}]

After waiting a few minutes, we shut off the 25Mbps traffic stream destined to 4.0.0.1 which naturally causes the interface util to fall to the LWM. The controller detects this and removes the override:

I0321 06:22:40.958804   15107 controller.go:240] rtr1/swp1 (Util: 65814866, speed: 100000000) is at 65% utilization
I0321 06:22:40.958839   15107 controller.go:246] Low WM reached for rtr1/swp1
I0321 06:22:40.958860   15107 controller.go:261] Removing previous override: {4.0.0.0/24 29843056}
I0321 06:22:40.959048   15107 controller.go:266] Adding back total 29843056 bps
I0321 06:22:40.959066   15107 controller.go:460] Removing override: 4.0.0.0/24 -> 172.16.0.2
I0321 06:22:40.959232   15107 controller.go:223] Starting monitoring for Interface swp1 on rtr1

The entire sequence of events above can be nicely visualized using our Grafana dashboards:

Grafana charts
Effects of Controller

The first panel shows interface utilizations and speeds. You can see the utilization of swp1 climbing at 23:16 and hitting the HWM of 80Mbps. Right after, at 23:17, the extra traffic spills over to the detour interface swp2 while utilization on swp1 gets normalized back to ~65Mbps. At 23:23, the extra traffic is removed. The second panel shows per-prefix utilizations. Initially, we start the three 20Mbps streams, which continue on for the duration of the test. At 23:16, 25Mbps of additional traffic to 4.0.0.0/24 (over swp1) is introduced (the orange timeseries). A minute later, this traffic moves over to swp2 in the form of the red timeseries. As you can imagine, visualization is a great way to capture the effects of network controllers.

Prefix Splitting

For the second part, we will test the prefix-splitting functionality of the controller. In order to facilitate this, we will generate the following traffic streams from srv1:

  • 20Mbps each to hosts 1.0.0.1, 1.0.0.129 and 1.0.0.193
  • 20Mbps to host 2.0.0.1
  • 10Mbps to host 2.0.0.129

This gives us the following traffic distribution broken down by aggregate prefix:

  • 60Mbps to prefix 1.0.0.0/24
  • 30Mbps to prefix 2.0.0.0/24

These streams cumulatively cause the traffic to exceed the HWM. Let's check the controller logs and the override decisions being made:

I0321 20:29:55.215899   17691 controller.go:240] rtr1/swp1 (Util: 99181451, speed: 100000000) is at 99% utilization
I0321 20:29:55.218543   17691 controller.go:294] Final Prefixes: [[{1.0.0.0/25 19606933} {1.0.0.128/26 23649600} {1.0.0.192/26 19000533}] [{2.0.0.0/25 23245333} {2.0.0.128/25 21723733}]]
I0321 20:29:55.218585   17691 controller.go:302] Will detour Prefix: {1.0.0.0/25 19606933}
I0321 20:29:55.218786   17691 controller.go:400] Adding overrides for device: rtr1, intf: swp1
I0321 20:29:55.219900   17691 controller.go:415] Found 3 total paths for prefix 1.0.0.0/25
I0321 20:29:55.219951   17691 controller.go:439] Detouring prefix 1.0.0.0/25 to interface swp2 (nh 172.16.0.2)
I0321 20:29:55.220442   17691 controller.go:315] Detouring total of 19606933 bps off interface rtr1:swp1 (util 99181451, hwm 80)
I0321 20:29:55.220479   17691 controller.go:223] Starting monitoring for Interface swp1 on rtr1
$ curl "http://localhost:8080/api/overrides/?router=rtr1&interface=swp1"
{"1.0.0.0/25":{"Prefix":"1.0.0.0/25","rate":19606933,"parent_prefix":"1.0.0.0/24","Route":{"Prefix":"1.0.0.0/25","Nh":"172.16.0.2","Lp":500,"as_path":[42428],"Origin":1,"Med":0,"bgp_type":0,"Communities":["100:100"],"mp_candidate":false},"out_ifindex":4}}

Since the top prefix in terms of traffic is 1.0.0.0/24 with a cumulative rate of 60Mbps, and our per-prefix threshold is set to 30Mbps, the controller splits this prefix in two passes. The first splits the /24 into two /25s (1.0.0.0/25, 1.0.0.128/25) with traffic rates of (20Mbps, 40Mbps). Since the second /25 prefix is still above the threshold of 30 Mbps, it is further split into two /26s (1.0.0.128/26, 1.0.0.192/26) with traffic rates of (20Mbps, 20Mbps). At this point, everything is within the threshold so the splitting logic returns the final set of prefixes. The controller then detours 20Mbps of traffic for prefix 1.0.0.0/25 as can be seen in the above logs as well as in the BGP table of rtr1:

vagrant@rtr1:mgmt:~$ net show bgp
show bgp ipv4 unicast
=====================
BGP table version is 426, local router ID is 50.50.50.50, vrf id 0
Default local pref 100, local AS 12121
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 1.0.0.0/16       172.16.0.2                             0 42428 i
*  1.0.0.0/24       172.16.0.2                             0 42428 i
*>                  172.16.0.0                    200      0 33010 i
*>i1.0.0.0/25       172.16.0.2                    500      0 e
*  1.0.1.0/24       172.16.0.2                             0 42428 i
*>                  172.16.0.0                    200      0 33010 i
*  1.0.2.0/24       172.16.0.2                             0 42428 i
*>                  172.16.0.0                    200      0 33010 i
*  1.0.3.0/24       172.16.0.2                             0 42428 i
*>                  172.16.0.0                    200      0 33010 i
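
The splitting logic described above can also be sketched as a small recursive function. The example below is self-contained and uses the configured stream rates as stand-in sample data (the real controller derives these rates from sFlow samples, so the numbers and implementation details differ):

package main

import (
    "fmt"
    "net/netip"
)

// flowRates maps sampled destination hosts to their rates in bps, mirroring
// the demo traffic: 20Mbps to three hosts in 1.0.0.0/24, 20+10Mbps to two
// hosts in 2.0.0.0/24.
var flowRates = map[netip.Addr]uint64{
    netip.MustParseAddr("1.0.0.1"):   20_000_000,
    netip.MustParseAddr("1.0.0.129"): 20_000_000,
    netip.MustParseAddr("1.0.0.193"): 20_000_000,
    netip.MustParseAddr("2.0.0.1"):   20_000_000,
    netip.MustParseAddr("2.0.0.129"): 10_000_000,
}

// rate sums the sampled rates of all hosts covered by the prefix.
func rate(p netip.Prefix) uint64 {
    var total uint64
    for addr, bps := range flowRates {
        if p.Contains(addr) {
            total += bps
        }
    }
    return total
}

// split recursively halves a prefix until every piece carrying traffic is at
// or below the per-prefix threshold (or we hit a maximum prefix length).
func split(p netip.Prefix, threshold uint64, maxLen int) []netip.Prefix {
    r := rate(p)
    if r == 0 {
        return nil // no traffic, nothing to track
    }
    if r <= threshold || p.Bits() >= maxLen {
        return []netip.Prefix{p}
    }
    // Halve the prefix: the lower half keeps the network address, the upper
    // half sets the next bit after the current prefix length.
    lower := netip.PrefixFrom(p.Addr(), p.Bits()+1)
    upperAddr := p.Addr().As4()
    upperAddr[p.Bits()/8] |= 1 << (7 - uint(p.Bits()%8))
    upper := netip.PrefixFrom(netip.AddrFrom4(upperAddr), p.Bits()+1)
    return append(split(lower, threshold, maxLen), split(upper, threshold, maxLen)...)
}

func main() {
    for _, agg := range []string{"1.0.0.0/24", "2.0.0.0/24"} {
        for _, p := range split(netip.MustParsePrefix(agg), 30_000_000, 26) {
            fmt.Printf("%s -> %d bps\n", p, rate(p))
        }
    }
}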

Our Grafana dashboards also reflect all the above decisions. Note that the per-prefix dashboards still only show the aggregate /24 prefixes due to a limitation with the host sflow daemon on the Cumulus VX, which does not export route table extensions. We can also see that one of the prefixes in the 2.0.0.0/24 range was detoured for a short amount of time. In general, detours will be added and removed as counters get updated and the controller gets enough flow data to reach a steady state.

Grafana charts

Final thoughts

While the controller we designed works pretty well and as intended in a heavily simulated and fabricated environment, real world deployments are a whole different ball game :) Oftentimes, traffic patterns and application behaviors will end up governing and influencing controller development over time. Even though you may see the traffic being correctly detoured, your applications may still experience issues that lead you to question the validity of controller decisions. This is true not just for simple BGP-injection-based detours but also for more complicated controller designs involving MPLS and the like. This simple yet effective proof-of-concept design does prove, however, that centralized, software-based control is a great way to ensure that the network ends up as nothing more than a commodity that your applications can leverage to help drive your business goals and decisions.

Read Part 1 here : https://mayuresh82.github.io/2021/02/23/edge-controller-part1/

