Networking Considerations for Distributed Deployment of Libvirt/KVM-Based OCP Clusters

Introduction

In previous blog posts, we learned about the OCP on Libvirt project and the benefits it brings, in particular the flexible deployment of OCP clusters whose nodes are virtual machines running on KVM.

If you remember, we started by explaining how to set up a virtual development environment for the OCP agent, and we continued by describing how to use DCI to easily install a Libvirt-based OCP cluster on a single bare-metal server.

However, in both cases we only covered the case of a single physical server. Can we use multiple physical servers to set up a lab with multiple, distributed Libvirt-based OCP clusters? The answer is yes! In this blog post, we will explain the main requirements and challenges in terms of networking.

What will you learn after reading this blog post?

Why Libvirt-based OCP clusters?

The rationale behind this idea is no accident. It is simply the natural good practice that arises when you have multiple resources that can be split up in order to maximize their usage.

So, imagine you have several powerful physical servers combined into a single cluster to deploy OpenShift. You may think that having such a powerful environment to test OCP deployments and/or workloads would be great, but sometimes (mostly when testing and troubleshooting, rather than running in production) it's like using a sledgehammer to crack a nut! Then you need to ask yourself: do I really need this powerful cluster just to test OCP? Am I really using the resources I am taking up properly and efficiently?

If we split the cluster and dedicate each physical server to a virtualized OCP cluster made of Libvirt-based virtual machines, we benefit from the advantages of cloud computing: we maximize the usage of the underlying physical resources, and we can parallelize development work by dedicating a different OCP cluster to each task.

We can summarize this process with the following picture, where the Latin expression "divide et vinces" appears in its full splendor.

From single to multiple clusters

Of course, this is not easy work, but it is not impossible either. The case we analyze in this blog post, which is already in place in the Telco Partner CI team's testing environment, might inspire you to adopt this kind of environment.

In our particular case, we recently received 10 brand-new, powerful servers, and we decided to integrate them into our CI jobs. Previously, we only had one dedicated server for this purpose, so we sometimes lacked resources when multiple changes were being tested at the same time, as this single server was a bottleneck.

The transition towards a multi-cluster environment, using Libvirt to deploy an OCP cluster with virtual nodes on each server, allowed us to run parallel deployments, dedicate servers to troubleshooting and testing, and more.

To summarize, our work is based on the following:

And how can DCI help us here?

You have to consider that deploying an OpenShift cluster with a particular version, custom configuration, workloads on top, etc., is not as easy as it may seem at first. Moreover, different users and partners may decide to use their own tools and processes to address this, as mentioned before, so in these cases the probability of hitting failures or issues, without proper documentation or troubleshooting guidelines to overcome them, increases.

Here's where DCI comes in to make your life easier. Highlighting some concepts from its workflow, we have to clarify that the hardware involved allows you to use:

So, in the end, it's a platform prepared to do this kind of integration on a CI basis, following a pipeline- and component-based logic to install OCP and deploy workloads on top of it, regardless of the hardware used to create the cluster.

Requirements

You would mainly need to meet the following requirements, which are estimates based on the work we have done in our own labs:

Network setup

Starting with a single server

If we check the scenarios deployed in the blog posts referenced above, we will see that we only have one server, which holds the OCP clusters and acts as a jumphost to access the nodes.

Single server scenario

In that case, the VMs are connected to an internal network for provisioning, and a NATted network is used as the baremetal network, allowing the virtual machines to reach the Internet (in the case of a connected cluster like this one). This is managed by a virtual router created by Libvirt, which runs its own dnsmasq process to provide DHCP and DNS configuration to the VMs connected to that network.
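
To make this more concrete, here is a minimal sketch of how two such networks could be defined with Libvirt. The network names, bridge names, and address ranges are illustrative, not the ones used in our lab:

# Sketch: NATted "baremetal" network; Libvirt starts a dnsmasq process
# for it that serves DHCP and DNS to the attached VMs.
cat > baremetal-net.xml <<'EOF'
<network>
  <name>baremetal</name>
  <forward mode='nat'/>
  <bridge name='virbr-bm' stp='on' delay='0'/>
  <ip address='192.168.123.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.123.10' end='192.168.123.100'/>
    </dhcp>
  </ip>
</network>
EOF

# Sketch: isolated provisioning network; with no <forward> element,
# VMs attached to it have no connectivity outside the host.
cat > provisioning-net.xml <<'EOF'
<network>
  <name>provisioning</name>
  <bridge name='virbr-prov' stp='on' delay='0'/>
</network>
EOF

virsh net-define baremetal-net.xml && virsh net-start baremetal && virsh net-autostart baremetal
virsh net-define provisioning-net.xml && virsh net-start provisioning && virsh net-autostart provisioning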

Also, as you can see in the figure above, we can combine different OCP clusters on the same physical server, such as IPI and SNO, as long as we define different virtual networks to isolate them.

Moving to multiple servers

Then, what do we need to do to move to a distributed deployment by using multiple servers? Some of the challenges are:

The solution proposed is summarized in the next picture.

Multiple servers scenario

Essentially, we need to bear in mind the following distribution of servers:

Consequently, the key points of this solution are:
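
Although the exact steps depend on your distribution and tooling, the following sketch shows one way to implement the bridged-network side of this solution on a physical server: a Linux bridge enslaving the physical NIC, exposed to Libvirt as a bridged network. All device and network names here are illustrative:

# Sketch: create a Linux bridge over the physical NIC with NetworkManager
# (replace eno1 with the actual interface connected to the lab LAN).
nmcli connection add type bridge ifname br-bm con-name br-bm ipv4.method auto
nmcli connection add type bridge-slave ifname eno1 master br-bm
nmcli connection up br-bm

# Expose the bridge to Libvirt, so VMs attach directly to the physical
# LAN instead of a NATted virtual network.
cat > bridged-bm.xml <<'EOF'
<network>
  <name>bridged-bm</name>
  <forward mode='bridge'/>
  <bridge name='br-bm'/>
</network>
EOF
virsh net-define bridged-bm.xml && virsh net-start bridged-bm && virsh net-autostart bridged-bm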

Deployment and automation

You know that, with DCI, it is feasible to automate this kind of deployment. For example, with dci-openshift-agent, you can do the following on each server:
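
Whatever the exact steps, on a single server a run of the agent typically boils down to a one-liner like the following (a sketch; the settings path is illustrative, and the arguments after -- are passed through to ansible-playbook):

# Sketch: start dci-openshift-agent with this server's settings file.
dci-openshift-agent-ctl -s -c /etc/dci-openshift-agent/settings.yml -- -v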

However, this is true when acting on each physical server in isolation. But what about achieving real automation of the whole lab? Our recommended option is to apply the automation on the jumphost, so that you can rely on utilities packaged with DCI that we have already covered in several blog posts, such as prefixes, dci-queue, or dci-pipeline.
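
For instance, dci-queue lets you model the physical servers as a pool of schedulable resources, so that queued deployments simply consume servers as they become free. A minimal sketch, with illustrative pool and resource names:

# Sketch: create a pool and register each physical server as a resource.
dci-queue add-pool servers
dci-queue add-resource servers server1
dci-queue add-resource servers server2

# Inspect the pool: available resources and queued/running commands.
dci-queue list servers

Tools such as dci-pipeline-schedule can then pick a free resource from such a pool when launching a pipeline, as we will see below.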

How to deploy at a high level

Let's briefly review, from a high-level perspective, how you can deploy this kind of environment, so that you understand the final picture you will end up with.

Of course, the first step on your side would be to install and configure the DCI-related tools on your system, among the other requirements we already discussed.
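
As a reference, on a RHEL-based jumphost this typically means enabling the DCI package repository and installing the agents; a sketch (check the DCI documentation for the exact release RPM matching your distribution):

# Sketch: enable the DCI repository, then install the tooling.
dnf -y install https://packages.distributed-ci.io/dci-release.el8.noarch.rpm
dnf -y install dci-openshift-agent dci-pipeline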

Then, adjust the variables used by dci-openshift-agent for your deployment. We recommend having a different inventory file for each physical server used to deploy OCP, and also a different file for each deployment type (IPI, SNO...) you may want to use.

Let's check an example for an IPI deployment. You can take the default inventory file from OCP on Libvirt as a base, but since this example is related to a single-cluster environment with internal IP addresses, you need to take the following changes into account:
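
For illustration only, such a per-server inventory could start like the following sketch; the path, cluster name, domain, and CIDR are placeholders, and the real variable set comes from the OCP on Libvirt sample inventory:

# Sketch: per-server inventory fragment for a Libvirt IPI deployment
# (all values are illustrative placeholders).
mkdir -p ~/inventories/server1-ipi
cat > ~/inventories/server1-ipi/hosts <<'EOF'
[all:vars]
cluster=cluster1
domain=mylab.local
extcidrnet=192.168.126.0/24
EOF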

With this in mind, you would be ready, from the jumphost, to:
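
As an illustration, once the inventories and pipeline files are in place, launching a deployment from the jumphost can be as simple as the following sketch (the pipeline file and name are illustrative):

# Sketch: run a specific pipeline definition directly...
dci-pipeline ~/pipelines/server1-ocp-libvirt-ipi.yml

# ...or let dci-pipeline-schedule grab a free server from the dci-queue
# pool and run the named pipeline on it.
dci-pipeline-schedule ocp-libvirt-ipi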

Challenges faced and lessons learned

This solution may sound really great, but as we said, it was not a trivial or simple integration. When moving from a single server to multiple servers, we mainly had to deal with the following issues, already discussed:

Also, note that, depending on the OCP deployment mode, extra configuration may be needed for the scenario to work properly. For example, Libvirt OCP clusters running ACM and deploying bare-metal SNO clusters also need to use bridged networks for the machineNetwork and the provisioning network.
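
In practice, this means that the machineNetwork declared at install time must match the bridged subnet. A hypothetical install-config.yaml fragment, with an illustrative CIDR:

# Sketch: point machineNetwork at the bridged (physical) subnet
# rather than at a NATted Libvirt network (the CIDR is illustrative).
cat >> install-config.yaml <<'EOF'
networking:
  machineNetwork:
  - cidr: 10.0.100.0/24
EOF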

Finally, remember that, in labs with multiple servers, you need to plan your IP address assignment in advance to know how many clusters you will really be able to run. You may have plenty of computing resources to create virtual machines, but depending on the subnet size available in your scenario, you may be limited in the number of clusters you can deploy. For example, if each cluster's baremetal network requires a /26, a single available /24 can accommodate at most four clusters, no matter how much CPU and memory remain unused.

Wrap up

Just to sum up, let's briefly review the proposed solution from different perspectives:

In general terms, the benefits of following this integration would be the following:

However, this solution is not without challenges, which we have already reviewed in this blog post:

We hope you find this blog post useful if you are considering a move to this kind of solution!

Note that this blog post is based on the following Red Hat Networking Summit presentation, which was also recorded (only for Red Hatters), in case you want to take a look.

If you need further support, don't hesitate to reach out to the Telco CI team to discuss your solution and exchange ideas about the setup. We will be glad to hear from you and to help!