Is Multi Cluster / Multi Cloud Strategy Really a Nightmare?
Cloud computing has become the everyday reality for most of us. With the help of large public cloud providers such as Amazon Web Services, Google Cloud and Microsoft Azure, our workloads, products and applications can become available globally in minutes. Each provider has strong features and services, and each has its own benefits. With these vendors, transforming our environments and applications to Cloud Native architectures is becoming much easier.
We all love and use Kubernetes because the platform gives us flexibility across environments. We can migrate our workloads between cloud providers or between different Kubernetes engines with far less effort than before. I believe Kubernetes changed “worked on my machine” to “works on any Kubernetes”, and that is a game changer for all of us. When you have multiple Kubernetes clusters, you can distribute traffic between them using Cloud Native tools that integrate natively, and increase your availability, portability, performance and reliability.
Sometimes organisations choose one public cloud provider and run all their workloads there, but many companies have started going Multi Cloud. But what is Multi Cloud exactly? Multi Cloud is a company’s use of different services from different vendors in a single heterogeneous architecture to improve cloud infrastructure capabilities and cost. Basically, organisations use one service from one provider and another service from another provider. The term can also refer to distributing services across different public cloud providers to increase availability or to spread the cost. The Multi Cloud approach also reduces the dependency on a single cloud provider. With a Multi Cloud architecture, you can gain more flexibility and more performance.
Alongside all of these benefits, the Multi Cluster / Multi Cloud approach comes with problems as well, and these problems can be a nightmare for any organisation. This approach might not be the perfect solution for everyone.
But, what can be done?
Things to look out for in Multi Cloud
When we talk about multi cluster, multi cloud architectures, we need to consider several aspects and look at them from different perspectives. This approach comes with a number of crucially important challenges. So, what are these challenges and problems?
- Operational Excellence / Complexity
- Configuration Challenges / Connectivity
- Organisational Challenges
- Cost
- Security Challenges
- Authentication and Authorisation a.k.a. IAM
- Observability
Even looking at this list can give some people goosebumps. But let me share the details and the things to look out for in each of them.
Operational Excellence / Complexity
Every service or cluster added to the stack adds more complexity to the whole architecture. When we add another public cloud provider into the picture, the complexity increases even faster. Operational tasks can take up days of several engineers’ time. It’s not easy to deploy workloads to a new cloud or to adapt the whole environment to that cloud’s specific configurations. There are of course easier ways to manage cloud environments, such as using an Infrastructure as Code tool like Terraform and modularising the provisioning part. But this still adds repetitive tasks for every cloud provider you add to the equation, which takes a lot of time and effort. You need to take different operational actions for every cloud provider you decide to use.
Of course, you can adopt a GitOps methodology or another automated way to provision all the required resources and infrastructure components. Designing this architecture to be as automated as possible would decrease both human error and the time it takes to go live.
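As a rough illustration of the GitOps idea, the sketch below compares a desired state (what would live in Git) with the actual state and derives the actions needed to reconcile them. The `Cluster` type, the `plan` function and the example cluster names are hypothetical and only serve the illustration; real GitOps tools do considerably more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical, provider-neutral description of a cluster."""
    name: str
    provider: str
    node_count: int


def plan(desired, actual):
    """Return (action, cluster_name) pairs that move `actual` towards `desired`."""
    desired_by_name = {c.name: c for c in desired}
    actual_by_name = {c.name: c for c in actual}
    actions = []
    # Create what is missing and update what differs from the desired state.
    for name, cluster in sorted(desired_by_name.items()):
        if name not in actual_by_name:
            actions.append(("create", name))
        elif actual_by_name[name] != cluster:
            actions.append(("update", name))
    # Delete anything that exists but is no longer declared.
    for name in sorted(actual_by_name):
        if name not in desired_by_name:
            actions.append(("delete", name))
    return actions


desired = [Cluster("a", "aws", 3), Cluster("b", "gcp", 2)]
actual = [Cluster("a", "aws", 2), Cluster("c", "azure", 1)]
print(plan(desired, actual))
```

Because the plan is derived from declared state rather than typed by hand, the same logic applies no matter how many providers are involved.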
Configuration Challenges / Connectivity
There are general best practices for cloud computing, but every cloud provider has its own approach and its own configuration options for applying them. That means you need to apply different configurations, or take a different approach, for every cloud provider in your environment. This can include setting up the network infrastructure or configuring authentication and authorisation. Maintaining a different configuration for each provider can lead to serious mistakes.
Because of this, implementing a centralised solution for connectivity between clusters and clouds, and also for authentication and authorisation, would be an ideal way to overcome these configuration challenges.
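One simple way to catch such mistakes early is to compare every cluster against a shared baseline. The sketch below is a minimal example of that idea; the setting names, the cluster names and the `find_drift` helper are hypothetical.

```python
def find_drift(baseline, cluster_configs):
    """Return {cluster: {setting: (expected, actual)}} for settings that differ."""
    drift = {}
    for cluster, config in cluster_configs.items():
        diffs = {
            key: (expected, config.get(key))
            for key, expected in baseline.items()
            if config.get(key) != expected
        }
        if diffs:
            drift[cluster] = diffs
    return drift


# Hypothetical baseline that every cluster is expected to follow.
baseline = {"network_policy": "enabled", "audit_logging": "enabled"}
cluster_configs = {
    "cluster-on-cloud-a": {"network_policy": "enabled", "audit_logging": "enabled"},
    "cluster-on-cloud-b": {"network_policy": "disabled"},  # audit_logging is missing
}
print(find_drift(baseline, cluster_configs))
```

A missing setting shows up with an actual value of `None`, so forgotten options are reported as well as wrong ones.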
Organisational Challenges
Managing every cloud provider requires different expertise and different skills. Organisations need multiple engineers, maybe multiple teams, to handle this situation, and this could lead to silos. Not every organisation has the capacity to manage this scenario. It also reduces the time spent on innovation, because engineers start spending their time on daily operational, manual and repetitive tasks. The people side of every organisation should also adapt to the Cloud Native transformation. Changing only the tech stack is not the ultimate answer to every problem. Training teams for every cloud provider you add to the stack, or hiring new engineers for each new provider, could become the main problem for organisations in the long term.
That’s why planning an onboarding process for a multi-cluster and multi cloud approach, and taking one step at a time, would be a more feasible solution. I truly believe that the mental health of team members and the sense of community within the organisation help companies thrive.
Cost
Cloud pricing relies mostly on a pay-as-you-go model. That means you only pay for what you use, and every service has a different pricing model depending on capacity, how long you use the service, inbound and outbound traffic, and so on. When you have a multi-cluster, multi cloud architecture, you may need to run the same services in every cloud provider for redundancy and to reduce configuration drift between providers. That multiplies the cost. Beyond the increased cost, managing the finances for multiple cloud providers is a huge responsibility. Small mistakes can quickly become very expensive.
That’s why many organisations have started using ephemeral clusters and ephemeral environments. These ephemeral clusters can serve as a development environment or a test environment. Once the work is finished, destroying the environment can dramatically reduce the cost.
Also, having a well-defined autoscaling methodology will help reduce the cost and increase the availability of the workloads across clusters and cloud providers. You can choose between multiple autoscaling tools and solutions for both the workloads and the underlying infrastructure resources. For example, you can implement HPA (Horizontal Pod Autoscaler), KEDA, Cluster Autoscaler or the autoscaling services offered by the cloud providers.
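To make the saving concrete, here is a small back-of-the-envelope calculation. The price of 10 cents per node per hour, the 3-node cluster and the usage pattern of 8 hours on 22 working days are all made-up assumptions, not real provider prices.

```python
HOURS_PER_MONTH = 730  # common approximation: 365 days * 24 hours / 12 months


def monthly_cost_cents(rate_cents_per_node_hour, nodes, hours):
    """Cost in cents, kept as integers to avoid floating point rounding."""
    return rate_cents_per_node_hour * nodes * hours


# Hypothetical dev environment: 3 nodes at 10 cents per node per hour.
always_on = monthly_cost_cents(10, 3, HOURS_PER_MONTH)
ephemeral = monthly_cost_cents(10, 3, 8 * 22)  # only up during working hours
saving_percent = round(100 * (always_on - ephemeral) / always_on)

print(f"always on: {always_on / 100:.2f}")
print(f"ephemeral: {ephemeral / 100:.2f}")
print(f"saving:    {saving_percent}%")
```

Under these assumptions the ephemeral environment costs roughly a quarter of the always-on one, and the same saving is repeated for every cloud provider that hosts such an environment.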
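To show what an autoscaler actually decides, the sketch below applies the scaling rule described in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (10% by default). The function name, the replica limits and the example numbers are illustrative assumptions, and the real controller also handles details such as unready pods and stabilisation windows that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified version of the HPA scaling rule."""
    ratio = current_metric / target_metric
    # Close enough to the target: do not scale at all.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Keep the result inside the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4, current_metric=90, target_metric=60))   # scale up
print(desired_replicas(4, current_metric=63, target_metric=60))   # within tolerance
print(desired_replicas(4, current_metric=30, target_metric=60))   # scale down
```

Scaling down when load drops is where the cost saving comes from, while the upper bound protects the budget against runaway scale-ups.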
Security Challenges
Creating a secure environment is really the ultimate goal. Every cloud provider offers different services for different aspects of security. This includes network security, data security, compliance, audit, vulnerability management and supply chain security. Even configuring the connections between cloud providers for every required service can easily become a nightmare. A multi cloud approach can easily increase the attack surface if it is not handled carefully.
That’s why centralising most parts of security is the ideal solution. Separating some layers from the underlying infrastructure could ease management and also give more control over the environments.
Authentication and Authorisation a.k.a. IAM
User management in a single environment is difficult enough. The principle of least privilege is crucially important for managing every environment. Creating roles, policies and identity management rules within the organisation should be the first priority, because implementing them afterwards can lead to bigger problems. It gets more difficult when new cloud providers are added to the equation. You also need an audit mechanism for monitoring user actions, so that you keep control over your whole environment.
The simplest solution is to restrict access wherever it is not needed.
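A first step towards that can be as simple as flagging policies that grant more than they need to. The sketch below uses a made-up, provider-neutral policy format (a name plus lists of actions and resources); it is not the policy syntax of any real cloud provider, and `overly_broad` is a hypothetical helper.

```python
def overly_broad(policies):
    """Return the names of policies that use a wildcard in actions or resources."""
    flagged = []
    for policy in policies:
        wildcard_action = any("*" in action for action in policy["actions"])
        wildcard_resource = any("*" in res for res in policy["resources"])
        if wildcard_action or wildcard_resource:
            flagged.append(policy["name"])
    return flagged


# Hypothetical policies collected from several cloud providers.
policies = [
    {"name": "read-logs", "actions": ["logs:read"], "resources": ["logs/app"]},
    {"name": "admin-everything", "actions": ["*"], "resources": ["*"]},
    {"name": "storage-writer", "actions": ["storage:write"], "resources": ["buckets/*"]},
]
print(overly_broad(policies))
```

A wildcard is not always wrong, but every flagged policy is a good candidate for a review.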
Observability
For all metric, log and trace data, you need a centralised observability approach to be able to monitor the environment. There should also be a centralised alert management mechanism so you can take action when there is a problem in the system. For some issues there can even be a preventive or automated response. But configuring such a system for multiple cloud providers can be tricky. That’s why, when choosing an observability solution, you should carefully investigate its integrations before implementation. If you decide to use the cloud providers’ own services, you may not be able to collect all the required data in a central location. That’s why a centralised solution would be the best option.
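Part of what a centralised solution has to do is translate each provider’s alerts into one common shape. The sketch below shows that idea in miniature; the provider names, the severity labels and the `normalise` helper are hypothetical and do not reflect any real provider’s alert format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Common alert shape used by the central alert manager."""
    source: str
    severity: str
    message: str


# Hypothetical mapping from provider-specific labels to a common scale.
SEVERITY_MAP = {
    "provider_a": {"CRITICAL": "critical", "WARN": "warning"},
    "provider_b": {"sev1": "critical", "sev3": "warning"},
}


def normalise(source, raw_severity, message):
    """Translate a provider-specific alert into the common Alert shape."""
    severity = SEVERITY_MAP.get(source, {}).get(raw_severity, "unknown")
    return Alert(source, severity, message)


alerts = [
    normalise("provider_a", "CRITICAL", "node pool unreachable"),
    normalise("provider_b", "sev3", "disk usage above 80%"),
    normalise("provider_c", "high", "unmapped provider"),
]
critical = [a for a in alerts if a.severity == "critical"]
print(critical)
```

Mapping unrecognised values to `unknown` instead of dropping them keeps gaps in the integration visible.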
Final Thoughts
When designing an architecture, we all know there are many aspects we need to keep in mind. That’s why the design phase should be treated as the most important phase of a project: dealing with problems after implementing a huge architecture can be extremely difficult, or even impossible. Looking at the design from different perspectives and approaching the problems with pre-defined principles can really help every organisation.