Implementing BGP for the Underlay
This section provides implementation specifics for building an eBGP underlay for an IP fabric or a VXLAN fabric using network devices running Junos. A unique-ASN-per-node design is used to demonstrate how the spines can end up with suboptimal paths that lead to path hunting; the alternative design, in which all spines in a 3-stage Clos network share the same ASN, is straightforward and needs no demonstration. Routing policies are then implemented to prevent path hunting. The implementation is based on the topology shown earlier in Figure 3-1.
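For quick reference, the ASNs and loopback addresses used throughout this section (collected from the configurations and outputs that follow, not from any new source) are:

leaf1: AS 65421, loopback 192.0.2.11/32
leaf2: AS 65422, loopback 192.0.2.12/32
leaf3: AS 65423, loopback 192.0.2.13/32
spine1: AS 65500, router ID 192.0.2.101
spine2: AS 65501, router ID 192.0.2.102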
In this network, for the underlay, each fabric-facing interface is configured as a point-to-point Layer 3 interface, as shown in Example 3-2 from the perspective of leaf1.
Example 3-2 Point-to-point Layer 3 interface configuration on leaf1 for fabric-facing interfaces
admin@leaf1# show interfaces ge-0/0/0
description "To spine1";
mtu 9100;
unit 0 {
    family inet {
        address 198.51.100.0/31;
    }
}

admin@leaf1# show interfaces ge-0/0/1
description "To spine2";
mtu 9100;
unit 0 {
    family inet {
        address 198.51.100.2/31;
    }
}
The goal of the underlay is to advertise the loopbacks of the VXLAN Tunnel Endpoints (VTEPs), since these loopbacks are used to build end-to-end VXLAN tunnels. Thus, on each VTEP, which are the fabric leafs in this case, a loopback interface is configured, as shown on leaf1 in Example 3-3.
Example 3-3 Loopback interface on leaf1
admin@leaf1# show interfaces lo0
unit 0 {
    family inet {
        address 192.0.2.11/32;
    }
}
The underlay eBGP peering is between these point-to-point interfaces. Since a leaf’s loopback address is advertised toward other leafs via multiple spines, each leaf is expected to install multiple, equal cost paths to every other leaf’s loopback address. In Junos, ECMP routing must be explicitly enabled both in the protocol (software) and in the forwarding hardware. For BGP, this is enabled with the multipath knob (adding the multiple-as option when the received routes have the same AS_PATH length but different ASNs in the path, as is the case here). A subset of the eBGP configuration for the underlay is shown from the perspective of both spines and leaf1 in Example 3-4.
Example 3-4 BGP configuration on spine1, spine2, and leaf1
admin@spine1# show protocols bgp
group underlay {
    type external;
    family inet {
        unicast;
    }
    neighbor 198.51.100.0 {
        peer-as 65421;
    }
    neighbor 198.51.100.4 {
        peer-as 65422;
    }
    neighbor 198.51.100.8 {
        peer-as 65423;
    }
}

admin@spine2# show protocols bgp
group underlay {
    type external;
    family inet {
        unicast;
    }
    neighbor 198.51.100.2 {
        peer-as 65421;
    }
    neighbor 198.51.100.6 {
        peer-as 65422;
    }
    neighbor 198.51.100.10 {
        peer-as 65423;
    }
}

admin@leaf1# show protocols bgp
group underlay {
    type external;
    family inet {
        unicast;
    }
    export allow-loopback;
    multipath {
        multiple-as;
    }
    neighbor 198.51.100.1 {
        peer-as 65500;
    }
    neighbor 198.51.100.3 {
        peer-as 65501;
    }
}
Every leaf is advertising its loopback address via an export policy attached to the BGP group for the underlay, as shown in Example 3-4. The configuration of this policy is shown in Example 3-5, which enables the advertisement of direct routes in the 192.0.2.0/24 range to its eBGP peers.
Example 3-5 Policy to advertise loopbacks shown on leaf1
admin@leaf1# show policy-options policy-statement allow-loopback
term loopback {
    from {
        protocol direct;
        route-filter 192.0.2.0/24 orlonger;
    }
    then accept;
}
term discard {
    then reject;
}
With the other leafs configured in the same way, the spines can successfully form an eBGP peering with each leaf, as shown in Example 3-6.
Example 3-6 eBGP peering on spine1 and spine2 with all leafs
admin@spine1> show bgp summary
Threading mode: BGP I/O
Default eBGP mode: advertise - accept, receive - accept
Groups: 1 Peers: 3 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0
                       3          3          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
198.51.100.0          65421        191        189       0       0     1:24:41 Establ
  inet.0: 1/1/1/0
198.51.100.4          65422        184        182       0       0     1:21:12 Establ
  inet.0: 1/1/1/0
198.51.100.8          65423        180        179       0       0     1:19:35 Establ
  inet.0: 1/1/1/0

admin@spine2> show bgp summary
Threading mode: BGP I/O
Default eBGP mode: advertise - accept, receive - accept
Groups: 1 Peers: 3 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0
                       3          3          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
198.51.100.2          65421        194        191       0       0     1:25:52 Establ
  inet.0: 1/1/1/0
198.51.100.6          65422        183        181       0       0     1:20:57 Establ
  inet.0: 1/1/1/0
198.51.100.10         65423        180        179       0       0     1:19:21 Establ
  inet.0: 1/1/1/0
With the policy configured as shown in Example 3-5, and the BGP peering between the leafs and the spines in an Established state, the loopback address of each leaf should be learned on every other leaf in the fabric.
Consider leaf1 now to understand how equal cost paths to another leaf’s loopback address are installed. Leaf2’s loopback address is advertised to leaf1 by both spine1 and spine2, so two routes are received on leaf1. Since BGP is configured with multipath, both routes are installed as equal cost routes in software, as shown in Example 3-7.
Example 3-7 Equal cost routes to leaf2’s loopback on leaf1
admin@leaf1> show route table inet.0 192.0.2.12

inet.0: 7 destinations, 9 routes (7 active, 0 holddown, 0 hidden)
Limit/Threshold: 1048576/1048576 destinations
+ = Active Route, - = Last Active, * = Both

192.0.2.12/32      *[BGP/170] 02:10:44, localpref 100, from 198.51.100.1
                      AS path: 65500 65422 I, validation-state: unverified
                       to 198.51.100.1 via ge-0/0/0.0
                    >  to 198.51.100.3 via ge-0/0/1.0
                    [BGP/170] 02:10:44, localpref 100
                      AS path: 65501 65422 I, validation-state: unverified
                    >  to 198.51.100.3 via ge-0/0/1.0
A validation-state of unverified, as shown in Example 3-7, indicates that BGP route validation has not been configured (this feature validates the origin of a BGP route to ensure it is legitimate); the route has been accepted but not validated.
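Route validation plays no role in this underlay design, but for context, a minimal sketch of how a validation (RPKI) session could be defined in Junos is shown below; the group name and validator address are hypothetical, and the available options vary by platform and release.

[edit routing-options]
validation {
    /* hypothetical group name and RPKI validator address, for illustration only */
    group rpki-validators {
        session 192.0.2.201;
    }
}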
These equal cost routes must also be installed in hardware. This is achieved by applying an export policy under the routing-options forwarding-table hierarchy, which instructs the Packet Forwarding Engine (PFE) to install the equal cost routes and, in turn, program the hardware, as shown in Example 3-8. The policy itself simply enables per-flow load balancing. The example also demonstrates how the forwarding table on the Routing Engine can be viewed for a specific destination prefix, using the show route forwarding-table destination [ip-address] table [table-name] operational mode command.
Example 3-8 Equal cost routes in PFE of leaf1 with a policy for load-balancing per flow
admin@leaf1# show routing-options forwarding-table
export ecmp;

admin@leaf1# show policy-options policy-statement ecmp
then {
    load-balance per-flow;
}

admin@leaf1> show route forwarding-table destination 192.0.2.12/32 table default
Routing table: default.inet
Internet:
Destination        Type RtRef Next hop           Type Index    NhRef Netif
192.0.2.12/32      user     0                    ulst  1048574     3
                             198.51.100.1        ucst      583     4 ge-0/0/0.0
                             198.51.100.3        ucst      582     4 ge-0/0/1.0
While the control plane and the route installation in both software and hardware are as expected on the leafs, the spines paint a different picture. If the other leafs select the path advertised by spine1 as their best route to a leaf’s loopback address, they readvertise it to spine2, which receives and stores these suboptimal paths in its routing table. Again, taking leaf1’s loopback address as the example, spine2 has three paths for this route, as shown in Example 3-9.
Example 3-9 Multiple paths for leaf1’s loopback address on spine2
admin@spine2> show route table inet.0 192.0.2.11/32

inet.0: 10 destinations, 16 routes (10 active, 0 holddown, 0 hidden)
Limit/Threshold: 1048576/1048576 destinations
+ = Active Route, - = Last Active, * = Both

192.0.2.11/32      *[BGP/170] 15:05:38, localpref 100
                      AS path: 65421 I, validation-state: unverified
                    >  to 198.51.100.2 via ge-0/0/0.0
                    [BGP/170] 00:02:39, localpref 100
                      AS path: 65422 65500 65421 I, validation-state: unverified
                    >  to 198.51.100.6 via ge-0/0/1.0
                    [BGP/170] 00:01:02, localpref 100
                      AS path: 65423 65500 65421 I, validation-state: unverified
                    >  to 198.51.100.10 via ge-0/0/2.0
This includes the direct path via leaf1, an indirect path via leaf2, and another indirect path via leaf3. Thus, if spine2 loses the direct path via leaf1, it starts path hunting through the remaining suboptimal paths until the network fully converges and all withdrawals have been processed on all fabric nodes. This problem can be addressed by applying an export policy on the spines that adds a BGP community to all advertised routes, and then matching this community on the leafs to reject such routes from being readvertised back to the spines.
In Junos, a routing policy controls the import of routes into the routing table and the export of routes from the routing table, to be advertised to neighbors. In general, a routing policy consists of terms, which include match conditions and associated actions. The routing policy on the spines is shown in Example 3-10 and includes the following two policy terms:
all-bgp: Matches all BGP-learned routes, adds the community defined under the name spine-to-leaf, and accepts them.
loopback: Matches all direct routes in the IPv4 subnet 192.0.2.0/24, adds the same community, and accepts them. The orlonger configuration option matches any prefix that is equal to or longer than the defined prefix length.
Example 3-10 Policy to add a BGP community on the spines as they advertise routes to leafs
admin@spine2# show policy-options policy-statement spine-to-leaf
term all-bgp {
    from protocol bgp;
    then {
        community add spine-to-leaf;
        accept;
    }
}
term loopback {
    from {
        protocol direct;
        route-filter 192.0.2.0/24 orlonger;
    }
    then {
        community add spine-to-leaf;
        accept;
    }
}

admin@spine2# show policy-options community spine-to-leaf
members 0:15;
Once the policy in Example 3-10 is applied as an export policy on the spines for the underlay BGP group, the leafs receive all BGP routes tagged with a BGP community of 0:15. This can be confirmed on leaf2, taking leaf1’s loopback address as the example, as shown in Example 3-11.
Example 3-11 Leaf1’s loopback address received with a BGP community of 0:15 on leaf2
admin@leaf2> show route table inet.0 192.0.2.11/32 extensive

inet.0: 9 destinations, 12 routes (9 active, 0 holddown, 0 hidden)
Limit/Threshold: 1048576/1048576 destinations
192.0.2.11/32 (2 entries, 1 announced)
TSI:
KRT in-kernel 192.0.2.11/32 -> {list:198.51.100.5, 198.51.100.7}
Page 0 idx 0, (group underlay type External) Type 1 val 0x85194a0 (adv_entry)
   Advertised metrics:
     Nexthop: 198.51.100.5
     AS path: [65422] 65500 65421 I
     Communities: 0:15
Advertise: 00000002
Path 192.0.2.11 from 198.51.100.5 Vector len 4.  Val: 0
        *BGP    Preference: 170/-101
                Next hop type: Router, Next hop index: 0
                Address: 0x7a46fac
                Next-hop reference count: 3, Next-hop session id: 0
                Kernel Table Id: 0
                Source: 198.51.100.5
                Next hop: 198.51.100.5 via ge-0/0/0.0
                Session Id: 0
                Next hop: 198.51.100.7 via ge-0/0/1.0, selected
                Session Id: 0
                State: <Active Ext>
                Local AS: 65422 Peer AS: 65500
                Age: 3:35
                Validation State: unverified
                Task: BGP_65500.198.51.100.5
                Announcement bits (3): 0-KRT 1-BGP_Multi_Path 2-BGP_RT_Background
                AS path: 65500 65421 I
                Communities: 0:15
                Accepted Multipath
                Localpref: 100
                Router ID: 192.0.2.101
                Thread: junos-main
         BGP    Preference: 170/-101
                Next hop type: Router, Next hop index: 577
                Address: 0x77c63f4
                Next-hop reference count: 5, Next-hop session id: 321
                Kernel Table Id: 0
                Source: 198.51.100.7
                Next hop: 198.51.100.7 via ge-0/0/1.0, selected
                Session Id: 321
                State: <Ext>
                Inactive reason: Active preferred
                Local AS: 65422 Peer AS: 65501
                Age: 5:30
                Validation State: unverified
                Task: BGP_65501.198.51.100.7
                AS path: 65501 65421 I
                Communities: 0:15
                Accepted MultipathContrib
                Localpref: 100
                Router ID: 192.0.2.102
                Thread: junos-main
On the leafs, it is now a simple matter of rejecting any route carrying this community so that it is not readvertised back to the spines. A new policy is created for this and is combined with the existing policy that advertises the loopback address, using a policy expression with the logical AND (&&) operator, as shown in Example 3-12 from the perspective of leaf1.
Example 3-12 Policy on leaf1 to reject BGP routes with a community of 0:15
admin@leaf1# show policy-options policy-statement leaf-to-spine
term reject-to-spine {
    from {
        protocol bgp;
        community spine-to-leaf;
    }
    then reject;
}
term accept-all {
    then accept;
}

admin@leaf1# show policy-options community spine-to-leaf
members 0:15;

admin@leaf1# show protocols bgp
group underlay {
    type external;
    family inet {
        unicast;
    }
    export ( leaf-to-spine && allow-loopback );
    multipath {
        multiple-as;
    }
    neighbor 198.51.100.1 {
        peer-as 65500;
    }
    neighbor 198.51.100.3 {
        peer-as 65501;
    }
}
With this policy applied on all the leafs, the spines will not learn any suboptimal paths to each of the leaf loopbacks. This is confirmed in Example 3-13, with each spine learning every leaf’s loopback address via the direct path to the respective leaf.
Example 3-13 Route to each leaf’s loopback address on spine1 and spine2
admin@spine1> show route table inet.0 192.0.2.11/32

inet.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
Limit/Threshold: 1048576/1048576 destinations
+ = Active Route, - = Last Active, * = Both

192.0.2.11/32      *[BGP/170] 15:45:36, localpref 100
                      AS path: 65421 I, validation-state: unverified
                    >  to 198.51.100.0 via ge-0/0/0.0

admin@spine1> show route table inet.0 192.0.2.12/32

inet.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
Limit/Threshold: 1048576/1048576 destinations
+ = Active Route, - = Last Active, * = Both

192.0.2.12/32      *[BGP/170] 15:42:09, localpref 100
                      AS path: 65422 I, validation-state: unverified
                    >  to 198.51.100.4 via ge-0/0/1.0

admin@spine1> show route table inet.0 192.0.2.13/32

inet.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
Limit/Threshold: 1048576/1048576 destinations
+ = Active Route, - = Last Active, * = Both

192.0.2.13/32      *[BGP/170] 15:40:35, localpref 100
                      AS path: 65423 I, validation-state: unverified
                    >  to 198.51.100.8 via ge-0/0/2.0

admin@spine2> show route table inet.0 192.0.2.11/32

inet.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
Limit/Threshold: 1048576/1048576 destinations
+ = Active Route, - = Last Active, * = Both

192.0.2.11/32      *[BGP/170] 15:47:10, localpref 100
                      AS path: 65421 I, validation-state: unverified
                    >  to 198.51.100.2 via ge-0/0/0.0

admin@spine2> show route table inet.0 192.0.2.12/32

inet.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
Limit/Threshold: 1048576/1048576 destinations
+ = Active Route, - = Last Active, * = Both

192.0.2.12/32      *[BGP/170] 15:42:18, localpref 100
                      AS path: 65422 I, validation-state: unverified
                    >  to 198.51.100.6 via ge-0/0/1.0

admin@spine2> show route table inet.0 192.0.2.13/32

inet.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
Limit/Threshold: 1048576/1048576 destinations
+ = Active Route, - = Last Active, * = Both

192.0.2.13/32      *[BGP/170] 15:40:45, localpref 100
                      AS path: 65423 I, validation-state: unverified
                    >  to 198.51.100.10 via ge-0/0/2.0
Junos also offers the operator a direct way to test a policy, which can be used to confirm that a leaf’s locally owned loopback address is advertised to the spines while loopback addresses learned via BGP are rejected. This uses the test policy operational mode command, as shown in Example 3-14, where only leaf1’s loopback address (192.0.2.11/32) is accepted by the policy, while leaf2’s and leaf3’s loopback addresses (192.0.2.12/32 and 192.0.2.13/32, respectively) are rejected.
Example 3-14 Policy rejecting leaf2’s and leaf3’s loopback addresses from being advertised to the spines on leaf1
admin@leaf1> test policy leaf-to-spine 192.0.2.11/32

inet.0: 9 destinations, 11 routes (9 active, 0 holddown, 0 hidden)
Limit/Threshold: 1048576/1048576 destinations
+ = Active Route, - = Last Active, * = Both

192.0.2.11/32      *[Direct/0] 1d 04:38:27
                    >  via lo0.0

Policy leaf-to-spine: 1 prefix accepted, 0 prefix rejected

admin@leaf1> test policy leaf-to-spine 192.0.2.12/32

Policy leaf-to-spine: 0 prefix accepted, 1 prefix rejected

admin@leaf1> test policy leaf-to-spine 192.0.2.13/32

Policy leaf-to-spine: 0 prefix accepted, 1 prefix rejected
With this configuration in place, the fabric underlay is successfully built, with each leaf’s loopback address reachable from every other leaf, as shown in Example 3-15, while also preventing any path-hunting issues on the spines by using appropriate routing policies.
Example 3-15 Loopback reachability from leaf1
admin@leaf1> ping 192.0.2.12 source 192.0.2.11
PING 192.0.2.12 (192.0.2.12): 56 data bytes
64 bytes from 192.0.2.12: icmp_seq=0 ttl=63 time=3.018 ms
64 bytes from 192.0.2.12: icmp_seq=1 ttl=63 time=2.697 ms
64 bytes from 192.0.2.12: icmp_seq=2 ttl=63 time=4.773 ms
64 bytes from 192.0.2.12: icmp_seq=3 ttl=63 time=3.470 ms
^C
--- 192.0.2.12 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/stddev = 2.697/3.490/4.773/0.790 ms

admin@leaf1> ping 192.0.2.13 source 192.0.2.11
PING 192.0.2.13 (192.0.2.13): 56 data bytes
64 bytes from 192.0.2.13: icmp_seq=0 ttl=63 time=2.979 ms
64 bytes from 192.0.2.13: icmp_seq=1 ttl=63 time=2.814 ms
64 bytes from 192.0.2.13: icmp_seq=2 ttl=63 time=2.672 ms
64 bytes from 192.0.2.13: icmp_seq=3 ttl=63 time=2.379 ms
^C
--- 192.0.2.13 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/stddev = 2.379/2.711/2.979/0.220 ms