In this blog we’ll discuss the NSX-V routing design options for an Active/Active multi-site deployment with a single vCenter. This is “Part 1”, which covers the background, prerequisites, and routing design option 1. For option 2, please visit here.
There are various routing designs one can set up for two sites that have individual vCenters and associated NSX Managers on their respective sites. This provides the flexibility of a Cross-VC NSX implementation using a universal distributed logical router, and the freedom to vMotion or recover a VM from one site to the other.
What if you have only a single vCenter with a consolidated cluster, where half the hosts reside in Site A and the other half in Site B, and the underlying storage is stretched between the two sites by means of vSAN or another storage virtualization product? The design options become very limited, and it is a challenge when you are also looking for an Active/Active datacenter solution for metro/stretched clusters, i.e. a VM running on Site A should use the Site A ESGs for north/south connectivity, and when moved (vMotioned) or recovered to Site B, it should use the Site B ESGs for north/south connectivity.
This was a requirement of one of my customers and a challenge for me to provide them a solution, as the underlying virtualization design limited the NSX-V routing options. I thought it would be good to write a blog about it to help the wider community.
As always, please check the requirements of NSX-V for the version you are using or planning to deploy, and its interoperability with other VMware product versions. I would also encourage you to visit the VMware Product Interoperability Matrices; VMware does a great job of keeping them updated.
This blog assumes that all pre-requisites are met, and that you have some experience with NSX-V and know why and where its components are used.
- One vCenter and one vSphere cluster with multiple ESXi Hosts from both sites
- One NSX-V Manager
- Network latency of 150 ms or less between sites
- If you are running vSAN on the same vCenter/cluster, then the network latency requirement will already have been met.
- Appropriate NSX-V Licensing to utilize the Universal Objects.
Please visit this VMware KB for NSX-V licensing – I cannot appreciate this KB enough; it is by far the best, clearest, and most detailed documented matrix by feature.
I would also encourage you to rate the above KB article to show appreciation for the efforts of the individual who wrote it.
Let’s get started…
Option 1 – Summary of configuration:
- Universal DLR with Local Egress enabled and no Control VM deployed
- Static routing between ESGs and UDLR
- 2 x ESGs per site with ECMP enabled
- 2 x Uplinks from ESGs to ToR switches
- vSphere DRS configured with “must run” rules for only the ESG VMs:
- 2 x ESG VMs for Site A “must run” on Site A ESXi Hosts
- 2 x ESG VMs for Site B “must run” on Site B ESXi Hosts
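As a side note on the ECMP point in the list above: with two ESGs per site, flows are balanced across both site-local next hops on a per-flow basis, so a given flow sticks to one ESG. The sketch below illustrates that idea with a toy hash; the addresses and the hash function are my own stand-ins, not how NSX-V actually computes its ECMP hash:

```python
import hashlib

# Hypothetical site-local ESG uplink addresses (illustrative only).
SITE_A_ESGS = ["10.1.0.1", "10.1.0.2"]
SITE_B_ESGS = ["10.2.0.1", "10.2.0.2"]

def pick_ecmp_next_hop(src_ip: str, dst_ip: str, esgs: list[str]) -> str:
    """Choose one of the site's ESGs by hashing the flow's endpoints,
    so a given flow always lands on the same next hop (per-flow ECMP)."""
    digest = hashlib.md5(f"{src_ip}->{dst_ip}".encode()).digest()
    return esgs[digest[0] % len(esgs)]
```

The point of hashing rather than round-robin is that all packets of one flow take the same path, which avoids out-of-order delivery while still spreading different flows across both ESGs.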
Yes, you can utilize universal objects even with a single NSX-V Manager instance; there is no requirement to have another vCenter and NSX-V Manager. All it requires is the appropriate NSX-V license and changing the NSX-V Manager role to “Primary” after deployment, as the default role is “Standalone”.
I have tried to summarize the configuration of this option in the image above and be as clear and detailed as I can, so that it explains itself. If you have any queries, please feel free to leave a comment and I will try to answer as best I can.
The challenges in the above routing design option:
- If you look at the image above, ECMP is enabled on all four ESGs and the UDLR, which means traffic could go out and come back in via any of the four ESGs even though “Local Egress” is enabled. For more information on how “Local Egress” works, please see the VMware document here. The question arises: how do you dictate that north/south traffic for all VMs running on a specific site uses the site-local ESGs? Here is how:
- Change the “Locale IDs” of the ESXi hosts in Site B (by default, every host inherits the NSX Manager’s UUID as its Locale ID)
- Bind the site-specific “Locale IDs” of the ESXi hosts to static routes on the UDLR so traffic goes out via the specific ESGs, i.e. 2 x static routes tied to the Locale IDs of the Site A ESXi hosts going out via the Site A ESGs, and another 2 x static routes tied to the Locale IDs of the Site B ESXi hosts going out via the Site B ESGs.
For more information on how to change the “Locale IDs” on clusters or hosts, please visit the VMware documentation here.
- We addressed the egress traffic in the above point, but what about ingress? This is a challenge that needs to be addressed by the physical network configuration, as NSX-V cannot influence ingress traffic.
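The Locale-ID mechanism above boils down to a lookup: each host carries a Locale ID, and the UDLR instance on a host only installs the static routes bound to that host’s Locale ID. Here is a minimal conceptual sketch of that behaviour; the Locale ID strings, host names, and next-hop addresses are invented for illustration, and this is not the NSX-V API:

```python
# Hypothetical Locale IDs (in real NSX-V these are UUIDs; the default
# Locale ID of every host is the NSX Manager's UUID).
SITE_A_LOCALE = "aaaaaaaa-0000-0000-0000-000000000001"
SITE_B_LOCALE = "bbbbbbbb-0000-0000-0000-000000000002"

# Static default routes on the UDLR, each bound to a Locale ID: a host
# only receives the routes whose Locale ID matches its own.
STATIC_ROUTES = {
    SITE_A_LOCALE: ["10.1.0.1", "10.1.0.2"],  # Site A ESG next hops
    SITE_B_LOCALE: ["10.2.0.1", "10.2.0.2"],  # Site B ESG next hops
}

# Locale ID assigned to each ESXi host (the Site B hosts were changed
# from the default, per the first bullet above).
HOST_LOCALE = {
    "esxi-a1": SITE_A_LOCALE, "esxi-a2": SITE_A_LOCALE,
    "esxi-b1": SITE_B_LOCALE, "esxi-b2": SITE_B_LOCALE,
}

def egress_next_hops(host: str) -> list[str]:
    """North/south next hops used by VMs running on the given host."""
    return STATIC_ROUTES[HOST_LOCALE[host]]
```

A VM on esxi-a1 egresses via the Site A ESGs; vMotion it to esxi-b1 and the same lookup now returns the Site B ESGs, which is exactly the site-local egress behaviour this design is after.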
Now let’s talk about the “Site A Failure” scenario:
All workload VMs, including the management VMs (vCenter, NSX-V Manager, etc.), would be recovered on the Site B ESXi hosts via vSphere HA (provided there are enough compute resources available), except the Site A ESG VMs, which were configured with a DRS “must run” rule to run on the Site A ESXi hosts.
The workload VMs recovered via vSphere HA will now use the Site B ESGs for all north/south communication, as they are now running on Site B ESXi hosts, which have a different “Locale ID” tied to the static routes whose next hops are the Site B ESGs.
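The recovery behaviour above can be sketched as a simple filter over the failed site’s VMs: vSphere HA restarts everything on the surviving site except VMs whose “must run” rule pins them to the failed site’s hosts. The VM names and host-group labels below are hypothetical:

```python
# DRS "must run" rules from the design: each ESG VM is pinned to its
# site's host group (VM and group names are invented for illustration).
MUST_RUN = {
    "esg-a1": "site-a-hosts", "esg-a2": "site-a-hosts",
    "esg-b1": "site-b-hosts", "esg-b2": "site-b-hosts",
}

def ha_restart_candidates(failed_site_vms: list[str],
                          surviving_group: str) -> list[str]:
    """VMs that vSphere HA can restart on the surviving site: everything
    except VMs whose "must run" rule pins them to the failed site."""
    return [vm for vm in failed_site_vms
            if MUST_RUN.get(vm, surviving_group) == surviving_group]
```

With Site A down, VMs such as vCenter and the NSX-V Manager restart on Site B, while the Site A ESG VMs stay down because their rule pins them to the failed Site A hosts.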
OK, so far we have covered the following advantages of Option 1:
- It helps decrease licensing costs.
- It optimizes north/south communication by keeping traffic local to the site-specific top-of-rack switches.
- It helps sustain a single-site failure.
Note: This is a disaster-avoidance solution rather than disaster recovery, and using universal objects does bring some limitations.
There is one disadvantage in this design regarding the ESG VMs. It is not major, but it may raise concerns for some people: what happens when you have a single ESG failure at one site?
Let’s talk about the single ESG failure scenario in the next blog here.