Lecture 4

Recap

TRILL (TRansparent Interconnect of Lots of Links)
- Defines Rbridges
- "Routing" on layer 2
- Smart bridges, interoperable with 802.1D (STP) and 802.1Q (VLAN)
SPB (Shortest Path Bridging)
- Standardised in 802.1aq
- Allows shortest-path forwarding in a mesh-connected Ethernet network, utilising multiple equal-cost paths
Ethernet VPN (EVPN)
- Uses MP-BGP to distribute L2 MAC and L3 IP information
- Separates control plane from data plane
- Every endpoint knows where a MAC lives instead of having to flood
- Works over multiple data plane encapsulations

The protocols above work by borrowing routing concepts from L3 and applying them at L2
The loops still exist physically, but because there is no need for MAC flooding, they are not a problem

Link Aggregation Control Protocol (LACP)

LACP creates Link Aggregation Groups (LAGs)
A LAG combines several physical ports into one logical connection
Pasted image 20250912102354.png|300
Defined in IEEE 802.3ad (later moved to IEEE 802.1AX)
Made to achieve the following:

Automatic addition and deletion of individual links to LAGs
Link monitoring: check whether both ends are connected to the correct group
Balance traffic across links by hashing header fields of each packet and mapping the hash to a link number
- To ensure packets of a connection stay on the same link, and thus arrive in the same order
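
The hash-to-link mapping can be sketched as follows. This is a minimal illustration, not what any particular switch does: the choice of CRC32, the 4-tuple used as key, and the name `pick_link` are all assumptions.

```python
import zlib

def pick_link(src_ip, dst_ip, src_port, dst_port, n_links):
    """Map a flow's 4-tuple to a link index in the LAG (hypothetical hash policy)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_links

flow = ("10.0.0.1", "10.0.0.2", 40000, 443)

# Every packet of the same flow hashes to the same link,
# so packets of one connection cannot overtake each other.
links = {pick_link(*flow, n_links=4) for _ in range(1000)}
print(links)  # a set with exactly one link index
```

Different flows may still end up on the same link; the hash only guarantees that one flow never spreads over several links.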

In Research

SLICES
Create a fabric connecting research facilities through Europe
Based on L2

L2 Architectures

How do you design an efficient, reliable and flexible network?

Network Size

You need to consider the size:

Small: < 200 devices
Medium: 200 - 2000 devices
Large: > 2000 devices

The larger the network, the bigger the need for load balancing and redundancy

Data Centers

Warning

From here on out we consider L3 capable switches

Data centers traditionally rely mostly on L2, but nowadays also use L3, with L3 links between racks
10.000 - 100.000 hosts

Three-layer Hierarchy (Mostly Legacy)

Pasted image 20250912103051.png|400
Core:

Provide fast transport between distribution switches
L3 capable switches
Provide access to the wider internet
Distribution:
Provide policy-based connectivity and control the boundary between the access and core layers
E.g. one per floor of a building
When configuring STP, a distribution switch is usually a good choice for the root bridge
Access: provides hosts access to the network

Collapsed Core Design (Also Legacy)

Pasted image 20250912103500.png|400
Simplify, but still keep benefits of 3-layer design
Lower cost due to fewer devices
More load on core devices: access control AND throughput
Harder to scale

Problems with three-tier and collapsed core

Oversubscription: not all nodes can transmit at full rate at the same time
The links to the internet have less bandwidth than the devices connected to them need
Uses techniques such as Equal Cost Multi Path routing (ECMP)
Performs static load splitting, does not account for flow size
Routing tables become very large due to multiple paths
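
Why static splitting ignores flow size can be seen in a small sketch. The hash function (CRC32), the flow names, and the flow sizes below are made up for illustration; real ECMP implementations hash packet header fields.

```python
import zlib

def ecmp_path(flow_id: str, n_paths: int) -> int:
    """Pick a path from a hash of the flow identifier only; size plays no role."""
    return zlib.crc32(flow_id.encode()) % n_paths

def load_per_path(flows: dict, n_paths: int) -> list:
    """Sum the bytes that end up on each path. flows maps flow id -> size in bytes."""
    load = [0] * n_paths
    for flow_id, size in flows.items():
        load[ecmp_path(flow_id, n_paths)] += size
    return load

# 100 small flows and 2 very large ones (hypothetical sizes)
flows = {f"mouse-{i}": 10_000 for i in range(100)}
flows["elephant-a"] = 5_000_000_000
flows["elephant-b"] = 5_000_000_000

print(load_per_path(flows, n_paths=4))
```

The paths that receive the two large flows carry almost all bytes. If both happen to hash to the same path, that link is overloaded while the others stay nearly idle.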

Fat Trees

Pasted image 20250912103910.png|400
More modern design
Use many small switches
The higher you go in the tree, the more interconnected the devices are
Network divided into pods (the dotted boxes)
Each pod has edge and aggregation switches
Not necessarily a single broadcast domain, can have multiple within it
Can also contain multiple VLANs

Less oversubscription
Not just because of hierarchy, but also rules on how many devices and interconnects a pod should have
Routing and addressing scheme
Fewer core bottlenecks
Not just a single link between core and aggregation
Can easily scale horizontally by adding more pods
Built on commodity hardware, making it cheaper

K-Tree

Given k ports on a switch

Use $(k / 2)^{2}$ core switches
Use k pods
Use $k / 2$ edge and $k / 2$ aggregation switches to form each pod
Connect each agg. switch with $k / 2$ core and $k / 2$ edge switches
Connect each edge switch with $k / 2$ agg. switches and $k / 2$ end nodes

48 ports -> $48^{3} / 4 = 27648$ hosts (roughly 28.000) w/o oversubscription
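
The counts follow directly from the rules above. A small sketch (the function name `fat_tree_size` is just for illustration):

```python
def fat_tree_size(k: int) -> dict:
    """Element counts of a fat tree built from k-port switches (k must be even)."""
    if k % 2 != 0:
        raise ValueError("k must be even")
    half = k // 2
    return {
        "pods": k,
        "core_switches": half * half,       # (k/2)^2
        "edge_switches": k * half,          # k/2 per pod
        "aggregation_switches": k * half,   # k/2 per pod
        "hosts": k * half * half,           # k pods * k/2 edge switches * k/2 hosts = k^3/4
    }

print(fat_tree_size(4)["hosts"])   # 16
print(fat_tree_size(48)["hosts"])  # 27648
```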

Addressing in Fat-Tree

Use 10.0.0.0/8 private IP address space
Pod switches have address 10.pod.switch.1
Core switches have address 10.k.j.i
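i and j denote the position of the switch in the core grid
k in the second octet is constant and marks the switch as core: pods are numbered below k, so it cannot be a pod number
Hosts have address 10.pod.switch.ID
ID is host-ID in switch subnet
Each field must fit in one octet, so $k < 256$: this scheme does not scale indefinitely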

This works because the IP shows topology, making it easier to route
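
A sketch that generates the addresses for a small tree. The exact number ranges are assumptions not stated above: pods and pod switches are numbered from 0, core positions i and j run from 1 to k/2, and host IDs start at 2 (because .1 is the switch itself).

```python
def fat_tree_addresses(k: int):
    """Return (pod switch, core switch, host) address lists for a k-ary fat tree."""
    half = k // 2
    # k switches per pod: assumed 0..k/2-1 are edge, k/2..k-1 are aggregation
    pod_switches = [f"10.{pod}.{sw}.1" for pod in range(k) for sw in range(k)]
    # second octet k marks a core switch; j and i give its grid position
    core_switches = [f"10.{k}.{j}.{i}"
                     for j in range(1, half + 1) for i in range(1, half + 1)]
    # hosts hang off edge switches only; IDs 2..k/2+1
    hosts = [f"10.{pod}.{sw}.{host_id}"
             for pod in range(k) for sw in range(half)
             for host_id in range(2, half + 2)]
    return pod_switches, core_switches, hosts

pod_switches, core_switches, hosts = fat_tree_addresses(4)
print(core_switches)  # ['10.4.1.1', '10.4.1.2', '10.4.2.1', '10.4.2.2']
print(hosts[:4])      # ['10.0.0.2', '10.0.0.3', '10.0.1.2', '10.0.1.3']
```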

Two-level Lookup Table

Needed because there are many equal paths
Storing all these paths in a normal routing table would be very inefficient

First level:
Prefix lookup
Used to route down the topology to servers (inside a pod)
If no prefix matches, the destination is not in this pod, so fall through to the suffix lookup to decide which uplink (and thus which core switch) to use
Second level:
Suffix lookup
Used to route up towards core (between pods)
Spreads traffic over the uplinks in a pseudo-random but deterministic way (same destination host ID, same uplink)
Pasted image 20250912105352.png|400
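
A minimal sketch of the two-level lookup for one aggregation switch in pod 2 of a k = 4 tree. The table contents and port numbers are hypothetical, and a real switch would do this in hardware rather than with a loop.

```python
import ipaddress

# First level: prefixes of the subnets inside this pod -> downlink port
PREFIXES = [
    (ipaddress.ip_network("10.2.0.0/24"), 0),
    (ipaddress.ip_network("10.2.1.0/24"), 1),
]

# Second level: last octet of the destination (host ID) -> uplink port
SUFFIXES = {2: 2, 3: 3}

def lookup(dst: str) -> int:
    """Return the output port for a destination address."""
    addr = ipaddress.ip_address(dst)
    for network, port in PREFIXES:
        if addr in network:          # destination is inside this pod: route down
            return port
    host_id = int(addr) & 0xFF       # not in this pod: route up, chosen by suffix
    return SUFFIXES[host_id]

print(lookup("10.2.0.3"))  # 0  (same pod, down)
print(lookup("10.3.0.2"))  # 2  (other pod, up via host ID 2)
print(lookup("10.3.1.3"))  # 3  (other pod, up via host ID 3)
```

Because the uplink depends only on the destination host ID, traffic to different hosts in the same remote subnet is spread over different core switches, while packets to one host always take the same path.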

Spine-Leaf Topology

We don't always need a full fat tree
Nowadays there is more east-west traffic than north-south (more traffic inside the datacenter than in/out of it)
Pasted image 20250912105510.png|400
Spine switches to route between leaf switches
Leaf switch to route between racks
Each rack has its own switch to connect hosts
Rich interconnection among switches
Any server to any other is only 4 hops
Even more scalability by adding more leaf/spine switches
Every leaf connects to every spine
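
The full mesh between the two layers can be sketched as below. Switch names and counts are made up; the sketch only covers the leaf and spine layers, not the hosts.

```python
from itertools import product

def spine_leaf_links(n_spines: int, n_leaves: int) -> list:
    """Every leaf connects to every spine: n_spines * n_leaves links in total."""
    return [(f"spine{s}", f"leaf{l}")
            for s, l in product(range(n_spines), range(n_leaves))]

def paths_between(leaf_a: str, leaf_b: str, n_spines: int) -> list:
    """All shortest paths between two leaves: one per spine."""
    return [[leaf_a, f"spine{s}", leaf_b] for s in range(n_spines)]

links = spine_leaf_links(n_spines=4, n_leaves=8)
print(len(links))                                   # 32
print(paths_between("leaf0", "leaf5", n_spines=4))  # 4 equal-cost paths
```

Adding a spine adds one more equal-cost path between every pair of leaves; adding a leaf adds more ports for hosts. Counting the host-to-leaf link on both ends, a leaf-spine-leaf path would be consistent with the 4 hops mentioned above, assuming hops are counted as links.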

Protocol Innovation

Datalink layer:
- RoCE: remote DMA (RDMA) over Converged Ethernet
Transport layer:
- ECN (Explicit Congestion Notification) used in transport-layer congestion control (DCTCP, DCQCN)
- experimentation with hop-by-hop (backpressure) congestion control
Routing, management:
- SDN widely used within/among organisations’ datacenters
- place related services, data as close as possible (e.g., in same rack or nearby rack) to minimise tier-2, tier-1 communication

IPv6

Shoulda given it more address space
~ Vint Cerf

$4 \times$ the bits, but $2^{96}$ times the address space
Auto configuration
StateLess Address AutoConfiguration (SLAAC)
Stateful configuration (DHCPv6)

Security: IPsec is part of the IPv6 design (support was originally mandatory, now only recommended)

Optimised headers:
Fixed 40 byte length
Extension header mechanism to $N \times 40$ bytes

Mobility support:
Support for end-to-end route optimisation
Even if one host changes networks

No NAT needed, we have enough addresses
End-to-end principle could be used again
But "smart" hosts in the middle might interfere

IPv6 adoption is quite slow
Started in 2011, now at about 50%

Hierarchical addressing
Needed because of the huge address space
Standard is to follow the /16-/32-/48-/64 prefix lengths for RIR-ISP-ORG-NET (regional registry, provider, organisation, network)

Notation

8 colon (:) separated blocks of 4 hex-digits, totalling 128 bits
Leading zeroes may be skipped
One run of consecutive all-zero blocks may be replaced by :: (only once per address)
No broadcasts, only multicast
No subnet masks, only prefixes

IPv4

IPv4 address: 131.211.140.89
Subnet mask: 255.255.255.192
Wildcard mask: 0.0.0.63
Network: 131.211.140.64/26
Broadcast: 131.211.140.127
Mixed notation: 131.211.140.89/26 (Host and network combined)
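
The values above can be checked with Python's standard `ipaddress` module:

```python
import ipaddress

iface = ipaddress.ip_interface("131.211.140.89/26")  # mixed notation

print(iface.ip)                         # 131.211.140.89
print(iface.network.netmask)            # 255.255.255.192
print(iface.network.hostmask)           # 0.0.0.63 (wildcard mask)
print(iface.network)                    # 131.211.140.64/26
print(iface.network.broadcast_address)  # 131.211.140.127
```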

IPv6

IPv6 address: 2001:0610:0158:bad0:0000:0000:0000:0001
Short form: 2001:610:158:bad0::1
Network: 2001:610:158:bad0::/64
Mixed notation: 2001:610:158:bad0::1/64
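
The shortening rules can be checked with Python's standard `ipaddress` module:

```python
import ipaddress

addr = ipaddress.IPv6Address("2001:0610:0158:bad0:0000:0000:0000:0001")
print(addr.compressed)  # 2001:610:158:bad0::1
print(addr.exploded)    # 2001:0610:0158:bad0:0000:0000:0000:0001

iface6 = ipaddress.ip_interface("2001:610:158:bad0::1/64")  # mixed notation
print(iface6.network)   # 2001:610:158:bad0::/64
```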

IPv6 is globally routable
Pasted image 20250912113447.png|400

Type of Addresses

Unicast: used to identify one interface
Anycast: multiple devices share the same IP, packets get delivered to the "closest" one

An IPv6 interface typically has at least 2 addresses:

Globally routable IPv6
Locally routable IPv6: Link-local unicast address

Link-local unicast is used for communicating within the same LAN
Part of the fe80::/10 address block
Each host can compute its own link-local address by combining the link-local prefix (fe80::/10, in practice fe80::/64) with the 64-bit identifier of its interface
Needed for neighbour discovery and routing
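
One common way to derive the 64-bit interface identifier is the modified EUI-64 format, built from the MAC address. The sketch below assumes that method (hosts may also use a random identifier); the MAC address is made up.

```python
import ipaddress

def link_local_from_mac(mac: str) -> ipaddress.IPv6Address:
    """Build fe80::/64 + modified EUI-64 interface identifier from a MAC address."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02                                        # flip the universal/local bit
    iid = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])  # insert ff:fe in the middle
    prefix = int(ipaddress.IPv6Address("fe80::"))
    return ipaddress.IPv6Address(prefix | int.from_bytes(iid, "big"))

print(link_local_from_mac("00:11:22:33:44:55"))  # fe80::211:22ff:fe33:4455
```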

StateLess Address AutoConfiguration (SLAAC)

First compute link-local address (device can do that on its own)
Check it with Duplicate Address Detection (DAD), done through multicast
If it's already in use, generate a new one (with MAC or random value)

Then listen for Router Advertisement (RA) messages
Contain the prefix advertised by the router, plus flags
Use the prefix from the router + interface identifier (based on MAC or random value)
Again, check usage with DAD (multicast in LAN, since prefix is unique and only used behind your router)
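
The second half of the procedure (prefix + interface identifier, then DAD) can be sketched as below. No packets are sent: the RA prefix is passed in as a string and DAD is replaced by a lookup in a set of addresses assumed to be in use. The function name and retry limit are hypothetical.

```python
import ipaddress
import secrets

def slaac_address(prefix: str, in_use: set, max_tries: int = 5) -> ipaddress.IPv6Address:
    """Combine an advertised /64 prefix with a random 64-bit interface identifier."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("this sketch expects a /64 prefix")
    for _ in range(max_tries):
        iid = secrets.randbits(64)                       # random interface identifier
        candidate = ipaddress.IPv6Address(int(network.network_address) | iid)
        if candidate not in in_use:                      # stand-in for DAD
            return candidate
    raise RuntimeError("no free address found")

address = slaac_address("2001:610:158:bad0::/64", in_use=set())
print(address)  # random address inside 2001:610:158:bad0::/64
```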

Multicast Addresses

All end systems automatically belong to the ff02::1 multicast group
All routers automatically belong to the ff02::2 multicast group

Neighbour Discovery Protocol (NDP)

IPv6 does not use ARP
IPv6 does use ICMPv6 to:

Discover addresses and status of neighbours
Discover properties of networks and prefixes
Execute Duplicate Address Detection (DAD)
Execute Neighbour Unreachability Detection (NUD)

ICMPv6 Errors

Error messages:

Destination Unreachable
- No route to destination
- Communication prohibited
- Beyond scope of source address
- Address unreachable
- Port unreachable
Packet too big
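- Routers in between do not fragment; they send this error back to the source, which must then send smaller packets. If the source fragments, it adds a Fragment extension header to indicate this
Time exceeded (hop limit reached zero; IPv6 calls the TTL field Hop Limit)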
Parameter problem

ICMPv6 Information

Information messages:

Echo request/reply
Used for NDP:
Router solicitation
Router advertisement
Neighbour solicitation
Neighbour advertisement
Redirect message

IPv6 header

Fixed header makes it easier to deal with
Pasted image 20250912115031.png|400
Next header field identifies the type of the header that follows: an extension header or the upper-layer protocol (e.g. TCP, UDP, ICMPv6)
No checksum means error checking is left to a higher layer
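
The fixed 40-byte layout can be sketched with `struct`. The field order follows the IPv6 specification (version, traffic class, flow label, payload length, next header, hop limit, source, destination); the function name, default values, and example addresses are assumptions for illustration.

```python
import ipaddress
import struct

def build_ipv6_header(src: str, dst: str, payload_len: int, next_header: int,
                      hop_limit: int = 64, traffic_class: int = 0,
                      flow_label: int = 0) -> bytes:
    """Pack the fixed IPv6 header; note there is no checksum field."""
    # first 32 bits: version (4 bits) | traffic class (8 bits) | flow label (20 bits)
    first_word = (6 << 28) | (traffic_class << 20) | flow_label
    return struct.pack(
        "!IHBB16s16s",
        first_word,
        payload_len,      # length of everything after the fixed header
        next_header,      # type of the following header
        hop_limit,
        ipaddress.IPv6Address(src).packed,
        ipaddress.IPv6Address(dst).packed,
    )

# next_header=59 means "no next header"
header = build_ipv6_header("2001:610:158:bad0::1", "ff02::1",
                           payload_len=0, next_header=59)
print(len(header))     # 40
print(header[0] >> 4)  # 6 (version)
```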

Header fields:
Pasted image 20250912115125.png|400
Header types:
Pasted image 20250912115201.png|400
IPv6-Frag (Fragment header) indicates that the sender has fragmented the packet
Pasted image 20250912115248.png|400