From 568f78407ddb790b9ce3ade743d86a2c385e9624 Mon Sep 17 00:00:00 2001 From: Simon Mayer Date: Wed, 12 Nov 2025 12:56:55 +0100 Subject: [PATCH 1/8] MEP-19: Zone Awareness in metal-stack.io --- .../contributing/01-Proposals/MEP19/README.md | 71 ++++++++++++++++ .../01-Proposals/MEP19/proposal-1.drawio | 51 ++++++++++++ .../01-Proposals/MEP19/proposal-1.svg | 1 + .../01-Proposals/MEP19/proposal-2.drawio | 47 +++++++++++ .../01-Proposals/MEP19/proposal-2.svg | 1 + .../01-Proposals/MEP19/storage-current.drawio | 61 ++++++++++++++ .../01-Proposals/MEP19/storage-current.svg | 1 + .../MEP19/storage-proposal.drawio | 83 +++++++++++++++++++ .../01-Proposals/MEP19/storage-proposal.svg | 1 + docs/contributing/01-Proposals/index.md | 1 + 10 files changed, 318 insertions(+) create mode 100644 docs/contributing/01-Proposals/MEP19/README.md create mode 100644 docs/contributing/01-Proposals/MEP19/proposal-1.drawio create mode 100644 docs/contributing/01-Proposals/MEP19/proposal-1.svg create mode 100644 docs/contributing/01-Proposals/MEP19/proposal-2.drawio create mode 100644 docs/contributing/01-Proposals/MEP19/proposal-2.svg create mode 100644 docs/contributing/01-Proposals/MEP19/storage-current.drawio create mode 100644 docs/contributing/01-Proposals/MEP19/storage-current.svg create mode 100644 docs/contributing/01-Proposals/MEP19/storage-proposal.drawio create mode 100644 docs/contributing/01-Proposals/MEP19/storage-proposal.svg diff --git a/docs/contributing/01-Proposals/MEP19/README.md b/docs/contributing/01-Proposals/MEP19/README.md new file mode 100644 index 0000000..41ae0fa --- /dev/null +++ b/docs/contributing/01-Proposals/MEP19/README.md @@ -0,0 +1,71 @@ +--- +slug: /MEP-19-zone-awareness +title: MEP-19 +sidebar_position: 19 +--- + +# Zone Awareness in metal-stack.io + +In metal-stack, the concepts of regions and zones are currently represented implicitly through partition names rather than as dedicated API entities. 
This design uses naming conventions to encode both region and zone information within a partition identifier. For example, the partition name `fra_eqx_01` translates to Frankfurt (region), Equinix (zone), and 01 (partition). + +From a networking perspective, `supernetworks` can be scoped to a partition, and traffic is not routed between partitions — except for external networks such as the Internet or MPLS connections. Currently, all networks are configured with disjunct IP prefixes. With the introduction of [MEP-4](../MEP4/README.md), this behavior will change: Network prefixes may overlap across partitions but must remain disjunct within a single project. + +## Motivation + +With [MEP-12](../MEP12/README.md) the rack spreading feature has been introduced. Limitations of this feature are: It can not be explicitly decided, in which racks nodes are placed. Moreover, this is performed with a best-effort strategy. If no machine is available in one rack, it might get placed in the one where already a machine is present. + +Already with current metal-stack installations, it is possible to spread partitions across data centers. However, this is still one failure domain, e.g. a single BGP failure could bring down the whole partition. As known from major cloud providers, zonal distribution of workload enhances availability and fault tolerance. + +## Requirements to Achieve this Goal + +To support explicit region and zone concepts in metal-stack, several functional and architectural requirements must be met. The following considerations focus primarily on the Kubernetes integration and cluster topology aspects. Proper spreading of worker nodes and control plane components across [multiple zones](https://kubernetes.io/docs/setup/best-practices/multiple-zones/) and regions should be possible. 
Nodes that belong to the same Kubernetes cluster must have the capability to communicate directly with each other, even if they are located in different partitions, provided that network configurations allow this communication using their respective Node CIDRs. It must be possible for nodes within a single Kubernetes cluster to use different Node CIDR ranges, depending on their partition or zone assignment. Major cloud providers use node groups to configure Node CIRDs differently. + +Storage resources must either be strictly located in a single partition or replicated across all partitions. This can be enforced using [`allowedTopologies`](https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies) within a StorageClass. + +An open design question remains regarding Pod and Service CIDRs. Should overlay networks be avoided and purely relied on routed IPv6? Or should an overlay network be introduced across partitions? Further evaluation is needed to determine the optimal approach. + +## Proposals + +**Proposal 1: Disjunct VNIs Across Partitions** + +![proposal 1](proposal-1.svg) + +In this approach, each partition uses a distinct set of VNIs. An additional controller, most likely running on the exit switch, would be required to build and manage the corresponding route maps. + +Each partition would maintain its own VRF. On the exit switch, routes from all VRFs associated with the same project would be imported to enable project-wide routing between partitions while maintaining isolation from other projects. + +The firewall would need to participate in all VRFs of the cluster, ensuring consistent traffic filtering and policy enforcement across partitions. Additionally, a default route must be present within each VRF. + +**Proposal 2: Multi-Site DCI** + +![proposal 2](proposal-2.svg) + +In the second approach, the same VNIs are used across multiple partitions. 
This capability can be realized by leveraging features provided by the Enterprise Switch OS. + +From a metal-stack perspective, each partition would still define separate node networks, but the same VRFs would be available in each partition. + +To support this, the `metal-api` would need to be extended to allow identical VNIs across different networks and partitions, as long as they belong to the same project. + +**Storage** + +Storage aspects will likely be addressed in a dedicated MEP. However, some initial considerations are outlined here. + +![current storage situation](storage-current.svg) + +In the current architecture as illustrated above, a node accesses storage through the firewall. + +![storage proposal](storage-proposal.svg) + +One possible improvement would be to remove the dependency on the firewall for storage access. This could be achieved by configuring a route map on the leaf switch to establish a direct mapping between the tenant VRF and the storage VRF on a per-project basis. + +## Operational Recommendations and Documentation Notes + +Include a recommendation on the maximum practical distance between partitions within a single zone, particularly with regard to latency-sensitive components such as `etcd`. + +## Roadmap + +The following tasks can be considered as next steps: + +- Verify proposals in containerlab +- Research: Can FRR do the Multi-Site DCI Feature out-of-the-box? 
+- Create sample for a Gardener shoot spec and the Cluster API manifests diff --git a/docs/contributing/01-Proposals/MEP19/proposal-1.drawio b/docs/contributing/01-Proposals/MEP19/proposal-1.drawio new file mode 100644 index 0000000..a5a0df6 --- /dev/null +++ b/docs/contributing/01-Proposals/MEP19/proposal-1.drawio @@ -0,0 +1,51 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/contributing/01-Proposals/MEP19/proposal-1.svg b/docs/contributing/01-Proposals/MEP19/proposal-1.svg new file mode 100644 index 0000000..f806707 --- /dev/null +++ b/docs/contributing/01-Proposals/MEP19/proposal-1.svg @@ -0,0 +1 @@ +
[SVG diagram text: Partition 1 (VRF1, 10.0.0.1/32); Partition 2 (VRF2, 10.0.1.1/32); Route Maps without NAT]
\ No newline at end of file diff --git a/docs/contributing/01-Proposals/MEP19/proposal-2.drawio b/docs/contributing/01-Proposals/MEP19/proposal-2.drawio new file mode 100644 index 0000000..4f4ec7f --- /dev/null +++ b/docs/contributing/01-Proposals/MEP19/proposal-2.drawio @@ -0,0 +1,47 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/contributing/01-Proposals/MEP19/proposal-2.svg b/docs/contributing/01-Proposals/MEP19/proposal-2.svg new file mode 100644 index 0000000..fe24aec --- /dev/null +++ b/docs/contributing/01-Proposals/MEP19/proposal-2.svg @@ -0,0 +1 @@ +
[SVG diagram text: Partition 1 (VRF1, 10.0.0.1/32); Partition 2 (VRF1, 10.0.1.1/32)]
\ No newline at end of file diff --git a/docs/contributing/01-Proposals/MEP19/storage-current.drawio b/docs/contributing/01-Proposals/MEP19/storage-current.drawio new file mode 100644 index 0000000..c9db890 --- /dev/null +++ b/docs/contributing/01-Proposals/MEP19/storage-current.drawio @@ -0,0 +1,61 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/contributing/01-Proposals/MEP19/storage-current.svg b/docs/contributing/01-Proposals/MEP19/storage-current.svg new file mode 100644 index 0000000..607010c --- /dev/null +++ b/docs/contributing/01-Proposals/MEP19/storage-current.svg @@ -0,0 +1 @@ +
[SVG diagram text: Tenant VRF; Storage VRF; Firewall; Worker; Storage access]
\ No newline at end of file diff --git a/docs/contributing/01-Proposals/MEP19/storage-proposal.drawio b/docs/contributing/01-Proposals/MEP19/storage-proposal.drawio new file mode 100644 index 0000000..4200cb6 --- /dev/null +++ b/docs/contributing/01-Proposals/MEP19/storage-proposal.drawio @@ -0,0 +1,83 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/contributing/01-Proposals/MEP19/storage-proposal.svg b/docs/contributing/01-Proposals/MEP19/storage-proposal.svg new file mode 100644 index 0000000..5462282 --- /dev/null +++ b/docs/contributing/01-Proposals/MEP19/storage-proposal.svg @@ -0,0 +1 @@ +
[SVG diagram text: Route Map; Tenant VRF; Internet VRF; Storage VRF; Firewall; Worker; Storage access]
\ No newline at end of file diff --git a/docs/contributing/01-Proposals/index.md b/docs/contributing/01-Proposals/index.md index 0f6eddc..72dc7e7 100644 --- a/docs/contributing/01-Proposals/index.md +++ b/docs/contributing/01-Proposals/index.md @@ -43,6 +43,7 @@ Once a proposal was accepted, an issue should be raised and the implementation s | [MEP-16](MEP16/README.md) | Firewall Support for Cluster API Provider | `Accepted` | [releases#237](https://github.com/metal-stack/releases/issues/237) | | [MEP-17](MEP17/README.md) | Global Network View | `In Discussion` | | | [MEP-18](MEP18/README.md) | Autonomous Control Plane | `In Discussion` | | +| [MEP-19](MEP19/README.md) | Zone Awareness in metal-stack.io | `In Discussion` | | ## Proposal Process From 3a9543f91cf8804cdadad1ecedb209028cc53ea0 Mon Sep 17 00:00:00 2001 From: Markus Fensterer Date: Wed, 12 Nov 2025 14:30:04 +0100 Subject: [PATCH 2/8] proposal 3 --- .../contributing/01-Proposals/MEP19/README.md | 23 ++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/docs/contributing/01-Proposals/MEP19/README.md b/docs/contributing/01-Proposals/MEP19/README.md index 41ae0fa..f85c982 100644 --- a/docs/contributing/01-Proposals/MEP19/README.md +++ b/docs/contributing/01-Proposals/MEP19/README.md @@ -18,7 +18,14 @@ Already with current metal-stack installations, it is possible to spread partiti ## Requirements to Achieve this Goal -To support explicit region and zone concepts in metal-stack, several functional and architectural requirements must be met. The following considerations focus primarily on the Kubernetes integration and cluster topology aspects. Proper spreading of worker nodes and control plane components across [multiple zones](https://kubernetes.io/docs/setup/best-practices/multiple-zones/) and regions should be possible. 
Nodes that belong to the same Kubernetes cluster must have the capability to communicate directly with each other, even if they are located in different partitions, provided that network configurations allow this communication using their respective Node CIDRs. It must be possible for nodes within a single Kubernetes cluster to use different Node CIDR ranges, depending on their partition or zone assignment. Major cloud providers use node groups to configure Node CIRDs differently. +To support explicit region and zone concepts in metal-stack, several functional and architectural requirements must be met. The following considerations focus primarily on the Kubernetes integration and cluster topology aspects: +- Proper spreading of worker nodes and control plane components across [multiple zones](https://kubernetes.io/docs/setup/best-practices/multiple-zones/) and regions must be possible. +- Nodes that belong to the same Kubernetes cluster must have the capability to communicate directly with each other, even if they are located in different partitions, provided that network configurations allow this communication using their respective Node CIDRs. +- It must be possible for nodes within a single Kubernetes cluster to use different Node CIDR ranges, depending on their partition or zone assignment. Major cloud providers use node groups to configure Node CIRDs differently. +- Zones stay seperate failure domains (e.g. a failure in the EVPN control-plane of one zone should not affect the other to avoid EVPN fate-sharing) + +## Criteria +- Number of hops: for communication btw. worker nodes, to the internet and to the storage. Storage resources must either be strictly located in a single partition or replicated across all partitions. This can be enforced using [`allowedTopologies`](https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies) within a StorageClass. 
@@ -58,6 +65,20 @@ In the current architecture as illustrated above, a node accesses storage throug One possible improvement would be to remove the dependency on the firewall for storage access. This could be achieved by configuring a route map on the leaf switch to establish a direct mapping between the tenant VRF and the storage VRF on a per-project basis. +**Proposal 3: Project-Wide Route-Leaking and Open DCI** +This is a mixture of proposal 1 and 2 with disjunct VNIs across partitions. + +In this approach, each partition uses a distinct set of VNIs. The `metal-core`, running on the leaf switches, would be required to build and manage route leaks: +- from certain private networks (e.g. all project networks, storage network) to the local VRF (only locally held at the leaf switches) +- from the local VRF to a DCI VRF (only propagated zone-wide) + +The open DCI is a ring of exit switches speaking plain BGP (no EVPN routes, no VXLAN) for exchanging the private supernetworks of zones (note: prefix length is longer). +They operate as VTEP for the DCI VRF and is not dependent on the Multi-Site DCI feature of Enterprise SONiC. + +Notes: +- cross-zone traffic is very efficiently transported, as the firewall is not in the path (fewer hops) +- this can also be used to provide worker nodes with an more efficient way to access storage systems (also not going through the firewall) + ## Operational Recommendations and Documentation Notes Include a recommendation on the maximum practical distance between partitions within a single zone, particularly with regard to latency-sensitive components such as `etcd`. 
From 872a3dd43fa6c8bf7ac2f71deac5adce2e79f8c5 Mon Sep 17 00:00:00 2001 From: Markus Fensterer Date: Wed, 12 Nov 2025 14:34:00 +0100 Subject: [PATCH 3/8] fix spelling --- docs/contributing/01-Proposals/MEP19/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/contributing/01-Proposals/MEP19/README.md b/docs/contributing/01-Proposals/MEP19/README.md index f85c982..f2b6118 100644 --- a/docs/contributing/01-Proposals/MEP19/README.md +++ b/docs/contributing/01-Proposals/MEP19/README.md @@ -22,7 +22,7 @@ To support explicit region and zone concepts in metal-stack, several functional - Proper spreading of worker nodes and control plane components across [multiple zones](https://kubernetes.io/docs/setup/best-practices/multiple-zones/) and regions must be possible. - Nodes that belong to the same Kubernetes cluster must have the capability to communicate directly with each other, even if they are located in different partitions, provided that network configurations allow this communication using their respective Node CIDRs. - It must be possible for nodes within a single Kubernetes cluster to use different Node CIDR ranges, depending on their partition or zone assignment. Major cloud providers use node groups to configure Node CIRDs differently. -- Zones stay seperate failure domains (e.g. a failure in the EVPN control-plane of one zone should not affect the other to avoid EVPN fate-sharing) +- Zones stay separate failure domains (e.g. a failure in the EVPN control-plane of one zone should not affect the other to avoid EVPN fate-sharing) ## Criteria - Number of hops: for communication btw. worker nodes, to the internet and to the storage. 
From 19a82f9e7a4fa4b1bcdbc7a689109277bc887492 Mon Sep 17 00:00:00 2001 From: Simon Mayer <49491825+simcod@users.noreply.github.com> Date: Wed, 10 Dec 2025 11:04:35 +0100 Subject: [PATCH 4/8] Update docs/contributing/01-Proposals/MEP19/README.md Co-authored-by: Gerrit --- docs/contributing/01-Proposals/MEP19/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/contributing/01-Proposals/MEP19/README.md b/docs/contributing/01-Proposals/MEP19/README.md index f2b6118..bbe7a59 100644 --- a/docs/contributing/01-Proposals/MEP19/README.md +++ b/docs/contributing/01-Proposals/MEP19/README.md @@ -8,7 +8,9 @@ sidebar_position: 19 In metal-stack, the concepts of regions and zones are currently represented implicitly through partition names rather than as dedicated API entities. This design uses naming conventions to encode both region and zone information within a partition identifier. For example, the partition name `fra_eqx_01` translates to Frankfurt (region), Equinix (zone), and 01 (partition). -From a networking perspective, `supernetworks` can be scoped to a partition, and traffic is not routed between partitions — except for external networks such as the Internet or MPLS connections. Currently, all networks are configured with disjunct IP prefixes. With the introduction of [MEP-4](../MEP4/README.md), this behavior will change: Network prefixes may overlap across partitions but must remain disjunct within a single project. +From a networking perspective, traffic between private node networks is not routed between partitions. To prevent misconfiguration, private networks are derived from partition-scoped `supernetworks`, preventing private node networks to be used across different partitions. Only external networks such as the Internet or MPLS connections can be used to route traffic between partitions. + +Additionally, all networks have disjunct IP prefixes. 
With the introduction of [MEP-4](../MEP4/README.md), this behavior will change: Network prefixes may overlap across partitions but must remain disjunct within a single project. This is possible since go-ipam release `v1.12.0`, which introduced the concept of network namespaces. ## Motivation From 008f2bbadb52e8f623aa8fddce5c4bad3c2ab94e Mon Sep 17 00:00:00 2001 From: Simon Mayer <49491825+simcod@users.noreply.github.com> Date: Wed, 10 Dec 2025 11:08:00 +0100 Subject: [PATCH 5/8] Update docs/contributing/01-Proposals/MEP19/README.md Co-authored-by: Gerrit --- docs/contributing/01-Proposals/MEP19/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/contributing/01-Proposals/MEP19/README.md b/docs/contributing/01-Proposals/MEP19/README.md index bbe7a59..445d92a 100644 --- a/docs/contributing/01-Proposals/MEP19/README.md +++ b/docs/contributing/01-Proposals/MEP19/README.md @@ -16,7 +16,7 @@ Additionally, all networks have disjunct IP prefixes. With the introduction of [ With [MEP-12](../MEP12/README.md) the rack spreading feature has been introduced. Limitations of this feature are: It can not be explicitly decided, in which racks nodes are placed. Moreover, this is performed with a best-effort strategy. If no machine is available in one rack, it might get placed in the one where already a machine is present. -Already with current metal-stack installations, it is possible to spread partitions across data centers. However, this is still one failure domain, e.g. a single BGP failure could bring down the whole partition. As known from major cloud providers, zonal distribution of workload enhances availability and fault tolerance. +Another issue with this approach is that the single partition is still one failure domain, e.g. a single BGP failure could bring down the whole partition. As known from major cloud providers, zonal distribution of workload enhances availability and fault tolerance. 
## Requirements to Achieve this Goal From d7c2f8342b752121b06ba02294c4709fec7d2e85 Mon Sep 17 00:00:00 2001 From: Simon Mayer <49491825+simcod@users.noreply.github.com> Date: Wed, 10 Dec 2025 11:09:08 +0100 Subject: [PATCH 6/8] Update docs/contributing/01-Proposals/MEP19/README.md Co-authored-by: Gerrit --- docs/contributing/01-Proposals/MEP19/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/contributing/01-Proposals/MEP19/README.md b/docs/contributing/01-Proposals/MEP19/README.md index 445d92a..17a34d7 100644 --- a/docs/contributing/01-Proposals/MEP19/README.md +++ b/docs/contributing/01-Proposals/MEP19/README.md @@ -29,7 +29,7 @@ To support explicit region and zone concepts in metal-stack, several functional ## Criteria - Number of hops: for communication btw. worker nodes, to the internet and to the storage. -Storage resources must either be strictly located in a single partition or replicated across all partitions. This can be enforced using [`allowedTopologies`](https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies) within a StorageClass. +Storage resources must either be strictly located in a single partition or replicated across all partitions. This can be enforced using [`allowedTopologies`](https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies) within a `StorageClass`. An open design question remains regarding Pod and Service CIDRs. Should overlay networks be avoided and purely relied on routed IPv6? Or should an overlay network be introduced across partitions? Further evaluation is needed to determine the optimal approach. 
From 4edb049cc6d7efdbafc4fc595fe24a0196a1399d Mon Sep 17 00:00:00 2001 From: Simon Mayer <49491825+simcod@users.noreply.github.com> Date: Wed, 10 Dec 2025 11:10:25 +0100 Subject: [PATCH 7/8] Update docs/contributing/01-Proposals/MEP19/README.md Co-authored-by: Gerrit --- docs/contributing/01-Proposals/MEP19/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/contributing/01-Proposals/MEP19/README.md b/docs/contributing/01-Proposals/MEP19/README.md index 17a34d7..687c86f 100644 --- a/docs/contributing/01-Proposals/MEP19/README.md +++ b/docs/contributing/01-Proposals/MEP19/README.md @@ -31,7 +31,7 @@ To support explicit region and zone concepts in metal-stack, several functional Storage resources must either be strictly located in a single partition or replicated across all partitions. This can be enforced using [`allowedTopologies`](https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies) within a `StorageClass`. -An open design question remains regarding Pod and Service CIDRs. Should overlay networks be avoided and purely relied on routed IPv6? Or should an overlay network be introduced across partitions? Further evaluation is needed to determine the optimal approach. +An open design question remains regarding Pod and Service CIDRs, which we usually configure for native routing (using FRR peering with CNI and with MetalLB for service exposal). In case of zonal routing, this would imply that traffic inside the FRR peering range also needs to be routable across zonal partitions. Should overlay networks be allowed or is it possible to depend on IPv6 in order to solve this issue? Further evaluation is needed to determine the optimal approach. 
## Proposals From 3750c0c922d4fc32cc9422fb310e34d377d746db Mon Sep 17 00:00:00 2001 From: Simon Mayer Date: Wed, 10 Dec 2025 11:22:33 +0100 Subject: [PATCH 8/8] Add review comments --- docs/contributing/01-Proposals/MEP19/README.md | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/docs/contributing/01-Proposals/MEP19/README.md b/docs/contributing/01-Proposals/MEP19/README.md index 687c86f..02f4088 100644 --- a/docs/contributing/01-Proposals/MEP19/README.md +++ b/docs/contributing/01-Proposals/MEP19/README.md @@ -4,29 +4,33 @@ title: MEP-19 sidebar_position: 19 --- -# Zone Awareness in metal-stack.io +# Zone Awareness In metal-stack, the concepts of regions and zones are currently represented implicitly through partition names rather than as dedicated API entities. This design uses naming conventions to encode both region and zone information within a partition identifier. For example, the partition name `fra_eqx_01` translates to Frankfurt (region), Equinix (zone), and 01 (partition). -From a networking perspective, traffic between private node networks is not routed between partitions. To prevent misconfiguration, private networks are derived from partition-scoped `supernetworks`, preventing private node networks to be used across different partitions. Only external networks such as the Internet or MPLS connections can be used to route traffic between partitions. +From a networking perspective, traffic between private node networks is not routed between partitions. To prevent misconfiguration, private networks are derived from partition-scoped `supernetworks`, preventing private node networks to be used across different partitions. Only external networks such as the Internet or Datacenter Interconnect (DCI) connections can be used to route traffic between partitions. Additionally, all networks have disjunct IP prefixes. 
With the introduction of [MEP-4](../MEP4/README.md), this behavior will change: Network prefixes may overlap across partitions but must remain disjunct within a single project. This is possible since go-ipam release `v1.12.0`, which introduced the concept of network namespaces. ## Motivation -With [MEP-12](../MEP12/README.md) the rack spreading feature has been introduced. Limitations of this feature are: It can not be explicitly decided, in which racks nodes are placed. Moreover, this is performed with a best-effort strategy. If no machine is available in one rack, it might get placed in the one where already a machine is present. +Already, with current metal-stack installations, it is possible to spread a single partition across data centers. This can be achieved through the rack spreading feature (introduced by [MEP-12](../MEP12/README.md)). + +Limitations of this feature are: It can not be explicitly decided, in which racks nodes are placed. Moreover, this is performed with a best-effort strategy. If no machine is available in one rack, it might get placed in the one where already a machine is present. Another issue with this approach is that the single partition is still one failure domain, e.g. a single BGP failure could bring down the whole partition. As known from major cloud providers, zonal distribution of workload enhances availability and fault tolerance. ## Requirements to Achieve this Goal To support explicit region and zone concepts in metal-stack, several functional and architectural requirements must be met. The following considerations focus primarily on the Kubernetes integration and cluster topology aspects: + - Proper spreading of worker nodes and control plane components across [multiple zones](https://kubernetes.io/docs/setup/best-practices/multiple-zones/) and regions must be possible. 
- Nodes that belong to the same Kubernetes cluster must have the capability to communicate directly with each other, even if they are located in different partitions, provided that network configurations allow this communication using their respective Node CIDRs. - It must be possible for nodes within a single Kubernetes cluster to use different Node CIDR ranges, depending on their partition or zone assignment. Major cloud providers use node groups to configure Node CIRDs differently. - Zones stay separate failure domains (e.g. a failure in the EVPN control-plane of one zone should not affect the other to avoid EVPN fate-sharing) ## Criteria + - Number of hops: for communication btw. worker nodes, to the internet and to the storage. Storage resources must either be strictly located in a single partition or replicated across all partitions. This can be enforced using [`allowedTopologies`](https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies) within a `StorageClass`. @@ -68,16 +72,18 @@ In the current architecture as illustrated above, a node accesses storage throug One possible improvement would be to remove the dependency on the firewall for storage access. This could be achieved by configuring a route map on the leaf switch to establish a direct mapping between the tenant VRF and the storage VRF on a per-project basis. **Proposal 3: Project-Wide Route-Leaking and Open DCI** + This is a mixture of proposal 1 and 2 with disjunct VNIs across partitions. In this approach, each partition uses a distinct set of VNIs. The `metal-core`, running on the leaf switches, would be required to build and manage route leaks: + - from certain private networks (e.g. 
all project networks, storage network) to the local VRF (only locally held at the leaf switches) - from the local VRF to a DCI VRF (only propagated zone-wide) -The open DCI is a ring of exit switches speaking plain BGP (no EVPN routes, no VXLAN) for exchanging the private supernetworks of zones (note: prefix length is longer). -They operate as VTEP for the DCI VRF and is not dependent on the Multi-Site DCI feature of Enterprise SONiC. +The open DCI is a ring of exit switches speaking plain BGP (no EVPN routes, no VXLAN) for exchanging the private supernetworks of zones (note: prefix length is longer). They operate as VTEP for the DCI VRF and is not dependent on the Multi-Site DCI feature of Enterprise SONiC. Notes: + - cross-zone traffic is very efficiently transported, as the firewall is not in the path (fewer hops) - this can also be used to provide worker nodes with an more efficient way to access storage systems (also not going through the firewall)
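
The README introduced by this series states that region and zone are currently encoded in the partition name, with `fra_eqx_01` standing for region, zone and partition. As a minimal sketch of that convention only, the following Python splits such a name and maps it onto the well-known Kubernetes topology labels. The names `PartitionTopology`, `parse_partition_name` and `topology_labels` are hypothetical and not part of metal-stack, and using the raw name segments as label values is an assumption made for illustration.

```python
from typing import NamedTuple


class PartitionTopology(NamedTuple):
    """Region, zone and partition as encoded in a partition name (hypothetical type)."""

    region: str
    zone: str
    partition: str


def parse_partition_name(name: str) -> PartitionTopology:
    """Split a name of the assumed form <region>_<zone>_<partition>."""
    parts = name.split("_")
    # Reject names with the wrong number of segments or with empty segments.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <region>_<zone>_<partition>, got {name!r}")
    return PartitionTopology(*parts)


def topology_labels(name: str) -> dict[str, str]:
    """Map a partition name onto the well-known Kubernetes topology labels.

    Assumption: the raw name segments are used directly as label values.
    """
    topo = parse_partition_name(name)
    return {
        "topology.kubernetes.io/region": topo.region,
        "topology.kubernetes.io/zone": topo.zone,
    }


if __name__ == "__main__":
    print(parse_partition_name("fra_eqx_01"))
    print(topology_labels("fra_eqx_01"))
```

If regions and zones became dedicated API entities, as the proposal motivates, this kind of string parsing would no longer be needed, because the values could be read from the API directly.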