Lecture 4
Recap
- TRILL (TRansparent Interconnect of Lots of Links)
- Defines RBridges (routing bridges)
- "Routing" on layer 2
- Smart bridges, interoperable with 802.1D (STP) and 802.1Q (VLAN)
- SPB (Shortest Path Bridging)
- Standardised in 802.1aq
- Allows shortest-path forwarding in a mesh-connected Ethernet network, utilising multiple equal-cost paths
- Ethernet VPN (EVPN)
- Uses MP-BGP to distribute L2 MAC and L3 IP information
- Separates control plane from data plane
- Every endpoint knows where a MAC lives instead of having to flood
- Works over multiple data plane encapsulations
The above work by borrowing routing from L3 into L2
The loops still exist physically, but because there is no need for MAC flooding, this is not a problem
Link Aggregation Control Protocol (LACP)
LACP creates Link Aggregation Groups (LAGs)
Combined connection of physical ports
Defined in 802.3ad
Made to achieve the following:
- Automatic addition and deletion of individual links to LAGs
- Link monitoring: check whether both ends are connected to the correct group
- Balance traffic across links based on some mapping between link number and hashes
- To ensure packets of a connection stay on the same link, and thus arrive in the same order
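A minimal sketch of the hash-based mapping described above. The choice of hash inputs (MAC addresses and ports) and the use of CRC32 are assumptions for illustration; real switches use vendor-specific distribution policies, and `pick_link` is a hypothetical name.

```python
import zlib

def pick_link(src_mac: str, dst_mac: str, src_port: int, dst_port: int,
              n_links: int) -> int:
    """Map a flow to one member link of a LAG (hypothetical hash policy)."""
    key = f"{src_mac}|{dst_mac}|{src_port}|{dst_port}".encode()
    # Same flow -> same key -> same hash -> same link index.
    return zlib.crc32(key) % n_links

flow = ("aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", 49152, 443)
first = pick_link(*flow, n_links=4)

# Every frame of this flow lands on the same link, so frames stay in order.
same_link = all(pick_link(*flow, n_links=4) == first for _ in range(1000))
print(first, same_link)
```

Because the mapping depends only on the flow identifiers, it is static: a large flow and a small flow each occupy exactly one link, regardless of how much traffic they carry.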
In Research
SLICES
Create a fabric connecting research facilities through Europe
Based on L2
L2 Architectures
How do you design an efficient, reliable and flexible network?
Network Size
You need to consider the size:
- Small: < 200 devices
- Medium: 200 - 2000 devices
- Large: > 2000 devices
The larger the network, the bigger the need for load balancing and redundancy
Data Centers
From here on out we consider L3 capable switches
Mostly rely on L2, but nowadays also L3, with L3 links between racks
10.000 - 100.000 hosts
Three-layer Hierarchy (Mostly Legacy)
Core:
- Provide fast transport between distribution switches
- L3 capable switches
- Provide access to the wider internet
Distribution:
- Provide policy-based connectivity and control the boundary between the access and core layers
- E.g. one per floor of a building
- When creating STP, distribution switch is usually a good choice as a root
Access: provides hosts access to the network
Collapsed Core Design (Also Legacy)
Simplify, but still keep benefits of 3-layer design
Lower cost due to fewer devices
More load on core devices: access control AND throughput
Harder to scale
Problems with three-tier and collapsed core
Oversubscription: not all nodes can transmit at the same time
The links to the internet have less bandwidth than the devices connected to them need
Uses techniques such as Equal Cost Multi Path routing (ECMP)
Performs static load splitting, does not account for flow size
Routing tables become very large due to multiple paths
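As a rough illustration of oversubscription, the ratio can be computed as the total bandwidth hosts could demand divided by the uplink bandwidth actually available. The port counts and speeds below are made-up example values, and `oversubscription_ratio` is a hypothetical helper.

```python
def oversubscription_ratio(n_hosts: int, host_gbps: float,
                           n_uplinks: int, uplink_gbps: float) -> float:
    """Worst-case host demand divided by available uplink capacity."""
    return (n_hosts * host_gbps) / (n_uplinks * uplink_gbps)

# Assumed example: 48 hosts at 10 Gbit/s behind 4 uplinks of 40 Gbit/s.
ratio = oversubscription_ratio(48, 10, 4, 40)
print(ratio)  # 480 / 160 = 3.0, i.e. 3:1 oversubscribed
```

A ratio of 1.0 means every host can send at full rate simultaneously; anything above that means some hosts must wait or slow down under full load.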
Fat Trees
More modern design
Use many small switches
The higher you go in the tree, the more interconnected the devices are
Network divided into pods
Each pod has edge and aggregation switches
Not necessarily a single broadcast domain, can have multiple within it
Can also contain multiple VLANs
Less oversubscription
Not just because of hierarchy, but also rules on how many devices and interconnects a pod should have
Routing and addressing scheme
Fewer core bottlenecks
Not just a single link between core and aggregation
Can easily scale horizontally by adding more pods
Built on commodity hardware, making it cheaper
K-Tree
Given switches with k ports each:
- Use (k/2)^2 core switches
- Use k pods
- Use k/2 edge and k/2 aggregation switches to form a pod
- Connect each agg. switch with k/2 core and k/2 edge switches
- Connect each edge switch with k/2 agg. switches and k/2 end nodes
48 ports -> k^3/4 = 27.648 hosts w/o oversubscription
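The sizes of a k-ary fat tree follow directly from the port count. The sketch below assumes the standard construction (k pods, k/2 edge and k/2 aggregation switches per pod, (k/2)^2 core switches, k/2 hosts per edge switch); `fat_tree_size` is a hypothetical helper name.

```python
def fat_tree_size(k: int) -> dict:
    """Component counts for a fat tree built from k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even number >= 2")
    half = k // 2
    return {
        "pods": k,
        "core_switches": half * half,
        "edge_switches": k * half,         # k/2 per pod
        "aggregation_switches": k * half,  # k/2 per pod
        "hosts": k * half * half,          # k pods * k/2 edges * k/2 hosts = k^3/4
    }

print(fat_tree_size(4)["hosts"])   # 16
print(fat_tree_size(48)["hosts"])  # 27648
```

With 48-port switches this gives 27,648 hosts, which is where the figure of roughly 28.000 comes from.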
Addressing in Fat-Tree
Use 10.0.0.0/8 private IP address space
Pod switches have address 10.pod.switch.1
Core switches have address 10.k.j.i
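i and j denote the position of the switch in the core grid
k is constant (the port count); since pods are numbered 0 to k-1, a second octet of k marks the address as a core switch rather than a pod switch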
Hosts have address 10.pod.switch.ID
ID is host-ID in switch subnet
This works because the IP shows topology, making it easier to route
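A small sketch that generates addresses under this scheme. The numbering ranges are assumptions not stated in these notes: pods and pod switches are numbered from 0, edge switches take the lower half of the switch numbers, core indices i and j run from 1 to k/2, and host IDs start at 2. `fat_tree_addresses` is a hypothetical name.

```python
def fat_tree_addresses(k: int):
    """Return (pod_switches, core_switches, hosts) address lists for a k-ary fat tree."""
    half = k // 2
    # 10.pod.switch.1 for every switch inside a pod (edge: 0..k/2-1, agg: k/2..k-1).
    pod_switches = [f"10.{pod}.{sw}.1" for pod in range(k) for sw in range(k)]
    # 10.k.j.i for core switches; the constant k in the second octet is never a pod number.
    core_switches = [f"10.{k}.{j}.{i}"
                     for j in range(1, half + 1) for i in range(1, half + 1)]
    # 10.pod.switch.ID for hosts attached to edge switches.
    hosts = [f"10.{pod}.{sw}.{host_id}"
             for pod in range(k) for sw in range(half)
             for host_id in range(2, half + 2)]
    return pod_switches, core_switches, hosts

pod_switches, core_switches, hosts = fat_tree_addresses(4)
print(core_switches)  # ['10.4.1.1', '10.4.1.2', '10.4.2.1', '10.4.2.2']
print(hosts[:2])      # ['10.0.0.2', '10.0.0.3']
```

Because pod and switch numbers are encoded in the address, a switch can tell from the second and third octets alone whether a destination is below it or in another pod.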
Two-level Lookup Table
Needed because there are many equal paths
Storing all these paths in a normal routing table would be very inefficient
First level:
Prefix lookup
Used to route down the topology to servers (inside a pod)
If the destination is not found, then the destination is not in this pod, so check suffix to see to which core switch it should go
Second level:
Suffix lookup
Used to route up towards core (between pods)
Used to load balance in a random but deterministic way
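The two levels can be sketched as a prefix table that falls through to a suffix table. The entries below are an assumed example for one aggregation switch in pod 2 of a k = 4 fat tree (ports 0 and 1 face down to edge switches, ports 2 and 3 face up to core switches); the table contents and the `lookup` function are illustrative, not taken from the lecture.

```python
import ipaddress

# First level: prefixes of the subnets below this switch -> downlink port.
PREFIX_TABLE = [
    (ipaddress.ip_network("10.2.0.0/24"), 0),
    (ipaddress.ip_network("10.2.1.0/24"), 1),
]
# Second level: host ID (last octet of the destination) -> uplink port.
SUFFIX_TABLE = {2: 2, 3: 3}

def lookup(dst: str) -> int:
    """Return the output port for a destination address."""
    addr = ipaddress.ip_address(dst)
    for net, port in PREFIX_TABLE:
        if addr in net:            # destination is inside this pod: route down
            return port
    host_id = int(addr) & 0xFF     # not in this pod: pick an uplink by suffix
    return SUFFIX_TABLE[host_id]

print(lookup("10.2.0.3"))  # 0 (down, same pod)
print(lookup("10.3.1.2"))  # 2 (up, chosen by host ID 2)
print(lookup("10.0.0.3"))  # 3 (up, chosen by host ID 3)
```

Traffic to different host IDs leaves on different uplinks, which spreads load across the core, while all packets to one destination take the same path and therefore stay in order.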
Spine-Leaf Topology
We don't always need a full fat tree
Nowadays there is more east-west traffic than north-south (more traffic inside the datacenter than in/out of the datacenter)
Spine switches to route between leaf switches
Leaf switch to route between racks
Each rack has its own switch to connect hosts
Rich interconnection among switches
Any server to any other is only 4 hops
Even more scalability by adding more leaf/spine switches
Every leaf connects to every spine
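Since every leaf connects to every spine, the fabric size follows from two numbers. A minimal sketch, assuming a single link per leaf-spine pair; `spine_leaf` is a hypothetical helper.

```python
def spine_leaf(n_spines: int, n_leaves: int) -> dict:
    """Link and path counts for a full-mesh spine-leaf fabric."""
    return {
        # One link from each leaf to each spine.
        "fabric_links": n_spines * n_leaves,
        # Two different leaves can reach each other via any spine.
        "paths_between_leaves": n_spines,
    }

fabric = spine_leaf(n_spines=4, n_leaves=16)
print(fabric)  # {'fabric_links': 64, 'paths_between_leaves': 4}
```

Adding a spine adds one more equal-cost path between every pair of leaves; adding a leaf adds rack capacity without changing the path length.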
Protocol Innovation
- Datalink layer:
- RoCE: remote DMA (RDMA) over Converged Ethernet
- Transport layer:
- ECN (Explicit Congestion Notification) used in transport-layer congestion control (DCTCP, DCQCN)
- experimentation with hop-by-hop (backpressure) congestion control
- Routing, management:
- SDN widely used within/among organisations’ datacenters
- place related services, data as close as possible (e.g., in same rack or nearby rack) to minimise tier-2, tier-1 communication
IPv6
Shoulda given it more address space
~ Vint Cerf
Auto configuration
StateLess Address AutoConfiguration (SLAAC)
Stateful configuration (DHCPv6)
Security: Built-in IPsec
Optimised headers:
Fixed 40 byte length
Extension header mechanism to
Mobility support:
Support for end-to-end route optimisation
Even if one host changes networks
No NAT needed, we have enough addresses
End-to-end principle could be used again
But "smart" hosts in the middle might interfere
IPv6 adoption is quite slow
Started in 2011, now at about 50%
Hierarchical addressing
Needed because of the huge address space
Standard practice is to follow prefix lengths 16-32-48-64 for RIR-ISP-ORG-NET (Regional Internet Registry, provider, organisation, network)
Notation
8 colon (:) separated blocks of 4 hex-digits, totalling 128 bits
Leading zeroes may be skipped
One run of consecutive all-zero blocks may be replaced by :: (only once per address, otherwise the expansion would be ambiguous)
No broadcasts, only multicast
No subnet masks, only prefixes
IPv4
IPv4 address: 131.211.140.89
Subnet mask: 255.255.255.192
Wildcard mask: 0.0.0.63
Network: 131.211.140.64/26
Broadcast: 131.211.140.127
Mixed notation: 131.211.140.89/26 (Host and network combined)
IPv6
IPv6 address: 2001:0610:0158:bad0:0000:0000:0000:0001
Short form: 2001:610:158:bad0::1
Network: 2001:610:158:bad0::/64
Mixed notation: 2001:610:158:bad0::1/64
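The values in both examples can be reproduced with Python's standard `ipaddress` module. This is only a check of the notation above, using the same addresses.

```python
import ipaddress

# IPv4: host and network combined in "mixed" notation.
v4 = ipaddress.ip_interface("131.211.140.89/26")
print(v4.network)                    # 131.211.140.64/26
print(v4.network.netmask)            # 255.255.255.192
print(v4.network.hostmask)           # 0.0.0.63 (wildcard mask)
print(v4.network.broadcast_address)  # 131.211.140.127

# IPv6: the full form compresses to the short form and back.
v6 = ipaddress.ip_interface("2001:0610:0158:bad0:0000:0000:0000:0001/64")
print(v6.ip.compressed)  # 2001:610:158:bad0::1
print(v6.ip.exploded)    # 2001:0610:0158:bad0:0000:0000:0000:0001
print(v6.network)        # 2001:610:158:bad0::/64
```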
IPv6 is globally routable
Type of Addresses
Unicast: used to identify one interface
Anycast: multiple devices share the same IP, packets get delivered to the "closest" one
An IPv6 interface typically has at least two addresses:
- Globally routable IPv6
- Locally routable IPv6: Link-local unicast address
Link-local unicast is used for communicating within the same LAN
Part of the fe80::/10 address block
Each host can compute its own link-local address by concatenating the fe80::/64 prefix (fe80::/10 followed by 54 zero bits) with the 64-bit identifier of its interface
Needed for neighbour discovery and routing
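One common way to obtain the 64-bit interface identifier is the modified EUI-64 derivation from the MAC address: insert ff:fe in the middle and flip the universal/local bit. The sketch below assumes that method (hosts may also use a random identifier); the function names and the example MAC are illustrative.

```python
import ipaddress

def eui64_interface_id(mac: str) -> int:
    """Derive a 64-bit interface identifier from a 48-bit MAC (modified EUI-64)."""
    b = bytes.fromhex(mac.replace(":", ""))
    if len(b) != 6:
        raise ValueError("expected a 48-bit MAC address")
    # Flip the universal/local bit of the first byte and insert ff:fe in the middle.
    eui = bytes([b[0] ^ 0x02]) + b[1:3] + b"\xff\xfe" + b[3:]
    return int.from_bytes(eui, "big")

def link_local(mac: str) -> ipaddress.IPv6Address:
    """Concatenate the fe80::/64 prefix with the interface identifier."""
    return ipaddress.IPv6Address((0xFE80 << 112) | eui64_interface_id(mac))

addr = link_local("00:11:22:33:44:55")
print(addr)                # fe80::211:22ff:fe33:4455
print(addr.is_link_local)  # True
```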
StateLess Address AutoConfiguration (SLAAC)
First compute link-local address (device can do that on its own)
Check it with Duplicate Address Detection (DAD), done through multicast
If it's already in use, generate a new one (with MAC or random value)
Then listen for Router Advertisement (RA) messages
Contains prefix of the router and flags
Use the prefix from the router + interface identifier (based on MAC or random value)
Again, check usage with DAD (multicast in LAN, since prefix is unique and only used behind your router)
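The second half of SLAAC (prefix from the RA plus an interface identifier, then a DAD check) can be sketched as follows. The `in_use` set stands in for real Duplicate Address Detection, which would send Neighbour Solicitations on the link; the random identifier, retry count, and function name are assumptions for illustration.

```python
import ipaddress
import secrets

def slaac_address(prefix: str, in_use: set, max_tries: int = 5) -> ipaddress.IPv6Address:
    """Combine an advertised /64 prefix with a random 64-bit interface identifier."""
    net = ipaddress.ip_network(prefix)
    if net.version != 6 or net.prefixlen != 64:
        raise ValueError("SLAAC sketch expects an IPv6 /64 prefix")
    for _ in range(max_tries):
        iid = secrets.randbits(64)
        candidate = ipaddress.IPv6Address(int(net.network_address) | iid)
        if candidate not in in_use:  # stand-in for DAD on the local link
            return candidate
    raise RuntimeError("could not find an unused address")

prefix = "2001:610:158:bad0::/64"
addr = slaac_address(prefix, in_use=set())
print(addr, addr in ipaddress.ip_network(prefix))
```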
Multicast Addresses
All end systems automatically belong to the ff02::1 multicast group
All routers automatically belong to the ff02::2 multicast group
Neighbour Discovery Protocol (NDP)
IPv6 does not use ARP
Instead, IPv6 uses ICMPv6 to:
- Discover addresses and status of neighbours
- Discover properties of networks and prefixes
- Execute Duplicate Address Detection (DAD)
- Execute Neighbour Unreachability Detection (NUD)
ICMPv6 Errors
Error messages:
- Destination Unreachable
- No route to destination
- Communication prohibited
- Beyond scope of source address
- Address unreachable
- Port unreachable
- Packet too big
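- Routers on the path do not fragment; they send this error back to the source, which must then send smaller packets. If the source fragments, it adds a Fragment extension header to indicate this
- Time exceeded (hop limit reached zero; Hop Limit is the IPv6 name for TTL)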
- Parameter problem
ICMPv6 Information
Information messages:
- Echo request/reply
Used for NDP:
- Router solicitation
- Router advertisement
- Neighbour solicitation
- Neighbour advertisement
- Redirect message
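The messages listed above map to ICMPv6 type numbers, where types below 128 are errors and types from 128 upward are informational. The sketch below lists the standard type values; the enum and helper names are my own.

```python
from enum import IntEnum

class ICMPv6Type(IntEnum):
    # Error messages (types 0-127)
    DESTINATION_UNREACHABLE = 1
    PACKET_TOO_BIG = 2
    TIME_EXCEEDED = 3
    PARAMETER_PROBLEM = 4
    # Informational messages (types 128-255)
    ECHO_REQUEST = 128
    ECHO_REPLY = 129
    # Neighbour Discovery messages
    ROUTER_SOLICITATION = 133
    ROUTER_ADVERTISEMENT = 134
    NEIGHBOUR_SOLICITATION = 135
    NEIGHBOUR_ADVERTISEMENT = 136
    REDIRECT = 137

def is_error(msg_type: int) -> bool:
    """Error messages have the high bit of the type field cleared."""
    return msg_type < 128

def is_ndp(msg_type: int) -> bool:
    """The five Neighbour Discovery messages use types 133-137."""
    return 133 <= msg_type <= 137

print(is_error(ICMPv6Type.PACKET_TOO_BIG))        # True
print(is_ndp(ICMPv6Type.NEIGHBOUR_SOLICITATION))  # True
print(is_ndp(ICMPv6Type.ECHO_REQUEST))            # False
```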
IPv6 header
Fixed header makes it easier to deal with
Next Header field identifies the type of the header that follows: either an extension header or the upper-layer protocol (e.g. TCP, UDP, ICMPv6)
No checksum means error checking is left to a higher layer
Header fields: Version, Traffic Class, Flow Label, Payload Length, Next Header, Hop Limit, Source Address, Destination Address
Header types:
IPv6-Frag used to indicate the sender is fragmenting
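Because the base header has a fixed 40-byte layout, parsing it needs no length calculation. A minimal sketch using the standard field layout (4-bit version, 8-bit traffic class, 20-bit flow label, then payload length, next header, hop limit, and two 128-bit addresses); the function name and the sample packet are illustrative.

```python
import ipaddress
import struct

def parse_ipv6_header(data: bytes) -> dict:
    """Parse the fixed 40-byte IPv6 header into its fields."""
    if len(data) < 40:
        raise ValueError("IPv6 header is always 40 bytes")
    first_word, payload_len, next_header, hop_limit = struct.unpack("!IHBB", data[:8])
    return {
        "version": first_word >> 28,
        "traffic_class": (first_word >> 20) & 0xFF,
        "flow_label": first_word & 0xFFFFF,
        "payload_length": payload_len,
        "next_header": next_header,  # e.g. 58 = ICMPv6, 44 = Fragment header
        "hop_limit": hop_limit,
        "src": ipaddress.IPv6Address(data[8:24]),
        "dst": ipaddress.IPv6Address(data[24:40]),
    }

# Hand-built sample: version 6, empty payload, next header ICMPv6, hop limit 64.
sample = (struct.pack("!IHBB", 6 << 28, 0, 58, 64)
          + ipaddress.IPv6Address("fe80::1").packed
          + ipaddress.IPv6Address("ff02::1").packed)
parsed = parse_ipv6_header(sample)
print(parsed["version"], parsed["next_header"], parsed["dst"])  # 6 58 ff02::1
```

A Next Header value of 44 would mean a Fragment extension header follows, which is how a receiver learns that the sender fragmented the packet.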