Pain hurts, but then it is meant to. It signals that something is wrong or things haven’t been planned correctly, leading to consequences.
Gather a group of platform engineers, K8s operators, and team leaders together in a room (assuming they have the time, which they don’t), and you will find they have a host of Kubernetes and DevOps pain points they would love to solve.
Continuing our nautical theme (because why not), operating and maintaining K8s can feel like you are drowning in complexity, with Kubernetes pain points circling you, waiting to take a bite. We’ve picked out three pain points we regularly encounter and offer a few suggestions and solutions.
1. Going Overboard: Cluster Sprawl
Many large organizations have taken great strides forward in adopting cloud-native and Kubernetes, but as they continue to scale and grow their Kubernetes ecosystem, they find they have created problems that need to be unraveled before their platform can mature. Cluster sprawl is a good example of this.
Organizations that adopt K8s for a handful of business units face a number of Kubernetes pain points when looking to expand and scale their use of the platform to include more business units.
As you might expect, most start on their journey with proof of concepts (PoCs), running clusters in dev, testing, and pilot phases. When teams are ready, workloads are moved into production clusters.
There isn’t a limit on the number of Kubernetes clusters you can run, but just because you can do something doesn’t make it a good idea. If you are an organization that has reached a point where you want to reduce the number of clusters you manage and operate, it may be time to consider multi-tenancy.
If your platform team is struggling to keep up with the demands of provisioning, scaling, handling system configuration, and K8s patching, it is likely time to provision managed multi-tenant clusters to enable better scaling and growth. Using multi-tenancy will enable you to decommission individual clusters to reduce technical debt and the general effort involved in maintaining K8s.
When implemented correctly, multi-tenancy should save money on infrastructure costs and improve resource efficiency, provided you have the security considerations in hand, such as isolation, and you manage the performance risks, for example by establishing resource quotas for each tenant.
The benefits will be significant. Your platform team gains increased visibility and governance while being laser-focused on platform management and lifecycle.
Meanwhile, your product teams and their developers can focus on business requirements and development concerns, and reduce the operational complexity they experience using the platform.
To handle multi-tenancy's added complexity and security considerations, you may also want to consider an enterprise vendor or managed-service Kubernetes, such as VMware Tanzu or Red Hat OpenShift, which offer improved security, stability, and automation.
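To make the per-tenant resource quotas mentioned above a little more concrete, here is a minimal Python sketch that builds a Kubernetes ResourceQuota manifest for one tenant namespace. The helper name, the "team-a" namespace, and the CPU, memory, and pod figures are all hypothetical; treat it as a starting point under those assumptions, not a recommended sizing.

```python
import json


def tenant_resource_quota(namespace, cpu, memory, max_pods):
    """Build a ResourceQuota manifest (as a dict) for one tenant namespace.

    The helper and its parameters are illustrative; the keys under
    spec.hard are standard ResourceQuota resource names.
    """
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                # Cap the total CPU and memory the tenant can request or be limited to.
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
                # Cap the number of pods in the namespace.
                "pods": str(max_pods),
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical tenant and limits, chosen only for illustration.
    print(json.dumps(tenant_resource_quota("team-a", "8", "16Gi", 50), indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be saved to a file and applied with kubectl apply -f, assuming the tenant's namespace already exists.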
2. Check Your Knots: Single Points of Failure
No one wants to see service disruptions, but simple misconfigurations and a lack of central governance, along with the defined policies and procedures that governance brings, can easily lead to app outages.
In our experience, Kubernetes clusters on hosted services will be highly available and fault-tolerant, but they can’t fix user error; for example, if you don’t specify how many replicas are required in a YAML deployment manifest, Kubernetes will default to creating one replica of a Pod. Obvious? Maybe so, but it is surprisingly common to see K8s deployments run in this way.
This type of oversight stems from a lack of centralization when K8s is increasingly used by more and more product teams. The results of this and similar misconfigurations only come to light when an ill-advised change to an environment impacts an app’s availability or when a cloud provider experiences a regional outage.
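One way to catch the single-replica oversight described above before it reaches production is a small check in your pipeline. The Python sketch below assumes the manifests have already been parsed into dictionaries (for example, from JSON); the function name and the two-replica threshold are our own assumptions, and it does not account for Deployments whose replica count is managed by an autoscaler.

```python
def single_replica_deployments(manifests, minimum=2):
    """Return the names of Deployments that would run fewer than `minimum` replicas.

    `manifests` is an iterable of already-parsed manifest dicts.
    The function name and the default threshold are illustrative.
    """
    flagged = []
    for manifest in manifests:
        if manifest.get("kind") != "Deployment":
            continue
        replicas = manifest.get("spec", {}).get("replicas")
        if replicas is None:
            # Kubernetes defaults spec.replicas to 1 when it is omitted.
            replicas = 1
        if replicas < minimum:
            flagged.append(manifest.get("metadata", {}).get("name", "<unnamed>"))
    return flagged


if __name__ == "__main__":
    # Hypothetical manifests, for illustration only.
    sample = [
        {"kind": "Deployment", "metadata": {"name": "checkout"}, "spec": {}},
        {"kind": "Deployment", "metadata": {"name": "catalog"}, "spec": {"replicas": 3}},
    ]
    for name in single_replica_deployments(sample):
        print(f"Deployment {name} would run as a single replica")
```

A check like this is only a safety net; the longer-term fix is still the central governance and standardized configuration discussed in this section.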
If you eliminate all single points of failure through configs, it is possible to deploy new releases and make configuration changes with zero downtime. Some eye-opening results from Humanitec’s recent 2023 DevOps Benchmarking Study reflect this: 82% of the top-performing teams manage application configs in a standardized way and separate environment-specific from environment-agnostic configurations.
It goes without saying (although we still have to say it) that in the event of a cluster failure, a disaster recovery plan should be in place for restoring workloads.
3. Sailing Blind: Unusual Pod States
A common issue we see is pods running in an error, evicted, or unknown state. This can be caused by a host of issues. As mentioned previously, misconfiguration and operator errors are common causes of pain, but it could also be a lack of resources, hardware failure, or something external such as a network spike.
Don't just tread water with this Kubernetes pain point; start logging and monitoring all K8s objects, nodes, and control components.
At this point, we should stress that if metrics are not being collected from all Kubernetes objects, nodes, and control components, then you haven’t planned correctly, and a pain point is about to bite. By all objects, we mean deployments, pods, services, etc. When monitoring K8s nodes, this needs to include CPU, disk, MemInfo, LoadAvg, Network, and more. In terms of control components, you need to consider, for instance, Kubelet, etcd, your scheduler, and DNS.
Any pods that remain in an error, evicted, or unknown state for an extended time must be reviewed and analyzed. You could do this manually using kubectl to investigate a pod, but, assuming you aren’t a character from Hellraiser who enjoys unnecessary pain, it’s generally much easier to check logs and use a monitoring platform.
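If you do go the manual route, a rough Python sketch like the one below can at least shorten the search. It assumes you have the pod list as parsed JSON, in the shape produced by kubectl get pods --all-namespaces -o json, and it flags pods whose phase is Failed or Unknown or whose status reason is Evicted. The function name and the sample data are hypothetical, and this simplified check will not catch every unhealthy pod (a crash-looping container can still report a Running phase, for instance).

```python
UNHEALTHY_PHASES = {"Failed", "Unknown"}


def unusual_pods(pod_list):
    """Return (namespace, name, state) for pods that look unhealthy.

    `pod_list` is the parsed JSON of a Kubernetes pod list.
    The function name and the selection rule are illustrative.
    """
    findings = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        reason = status.get("reason")
        if phase in UNHEALTHY_PHASES or reason == "Evicted":
            meta = pod.get("metadata", {})
            findings.append(
                (
                    meta.get("namespace", "default"),
                    meta.get("name", "<unnamed>"),
                    # Prefer the more specific reason when one is reported.
                    reason or phase,
                )
            )
    return findings


if __name__ == "__main__":
    # Hypothetical pod list, trimmed to the fields this sketch reads.
    sample = {
        "items": [
            {"metadata": {"namespace": "shop", "name": "cart-1"},
             "status": {"phase": "Running"}},
            {"metadata": {"namespace": "shop", "name": "cart-2"},
             "status": {"phase": "Failed", "reason": "Evicted"}},
        ]
    }
    for namespace, name, state in unusual_pods(sample):
        print(f"{namespace}/{name}: {state}")
```

To run it against a real cluster, you could load the kubectl output with json.load instead of the sample dictionary. Even then, it only tells you which pods to look at; the logs and dashboards are what explain why.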
Your dashboards should clearly show the overall state of your K8s clusters and applications. Additionally, you should be capable of drilling down into the state of microservices and individual resource usage.
What you use for monitoring and logging depends on the outcome of your own proof of concept research, but open-source Prometheus is practically synonymous with Kubernetes monitoring and is a well-integrated, free solution. Ultimately, you want clearly identified issues that assist root cause analysis and help you resolve incidents faster.
We know upgrades are a common Kubernetes pain point, but we covered that in our last blog on Kubernetes Challenges. If there’s a specific pain you want help with, you can either email us at email@example.com or reach out today, and let's start the conversation.