
I recently presented a Tech Talk to the engineering team at Kudi. It was based on the work I did in the previous quarter to reduce our cloud costs and ensure optimal use of resources. In this blog post, I have extracted the bits of the presentation that I think will be beneficial to a wider audience. I hope you learn a thing or two from our experience. Enjoy!

Introduction

In the fast-paced world of tech startups, the mantra “move fast and break things” often reigns supreme. But what happens when moving fast leads to skyrocketing cloud costs and inefficient resource utilization? This was the challenge we faced at Kudi. We were aware of the possible optimizations but decided to focus on other pressing issues until the mounting cloud costs became impossible to ignore. Our journey of optimization taught us valuable lessons about the delicate balance between speed and efficiency in cloud infrastructure management.

The Challenge: Runaway Cloud Costs

Like many startups, Kudi’s initial focus was on rapid development and deployment. In the very early days, we started out on Heroku, then moved to a self-managed Kubernetes cluster on AWS (this was before the days of EKS). We eventually settled on Google Cloud Platform (GCP) for our infrastructure needs, leveraging a combination of Compute Engine instances and Kubernetes clusters to power our services. On the data side, we rely heavily on Cloud SQL and BigQuery.

As Kudi’s user base and product offerings exploded, so did our cloud infrastructure. We were deploying new features and services at a fast pace, and our cloud costs grew steadily but stayed within budget. Then came the monthly cloud bill that stopped us in our tracks. A quick look at the cost breakdown revealed the primary culprits (in order) as:

  1. Compute
  2. Cloud SQL
  3. BigQuery
  4. Stackdriver Logging
  5. Cloud Memorystore for Redis

Focusing on the biggest expense, Compute, further analysis revealed two problems:

  1. Compute Engine Overprovisioning: Our VM instances, both standalone servers and Kubernetes nodes, were dramatically overprovisioned and, as a result, underutilized.
  2. Inefficient Kubernetes Resource Allocation: Our container orchestration was powerful but wasteful, with many resources sitting idle during off-peak hours.

Facing this crisis, we set an ambitious goal: achieve a minimum 60% CPU utilization during peak hours (7 AM to 7 PM) across our infrastructure, without compromising performance or reliability.

Unmasking the Root Cause of Overprovisioning: The Template Trap

In our rush to deploy quickly, we’d fallen into a common DevOps pitfall: the overuse of templated deployments. Every new service, regardless of its actual resource needs, was provisioned with the same resource requirements from the template. It was a classic case of “one size fits none.” To understand how we got here, we need to take a step back to our migration from Heroku to Kubernetes. In a bid to give developers the same flexibility to easily spin up and deploy new services as on Heroku, the Operations (DevOps) team developed a custom Kubernetes resource:

   apiVersion: "stable.kudi.ai/v1"
   kind: "HTTPService"
   metadata:
       name: __SERVICE_NAME__
   spec:
       image: __IMAGE__
       replicas: 1
       containerPort: 9000
       domain: __DOMAIN__
       requests:
          memory: 300M
          cpu: 200m
       privacy: private

The Kudi HTTPService resource and its complementary Kubernetes Operator offer a layer of abstraction on top of the k8s resources required to deploy a microservice. They allow a developer to describe a backend (HTTP) service using just a few lines of YAML. Once the definition is applied, the operator picks it up and creates all the underlying k8s resources (Deployment, Service, HPA, ServiceMonitor, NetworkPolicy, etc.) required to make it a full-fledged deployment.
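
To make this concrete, below is roughly the kind of Deployment the operator would generate from the HTTPService above. It is a trimmed-down sketch: the operator also creates the accompanying Service, HPA and other objects, and the exact labels and fields here are illustrative rather than the operator’s actual output.

    # Illustrative Deployment derived from the HTTPService fields above (sketch only).
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: __SERVICE_NAME__
      labels:
        app: __SERVICE_NAME__
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: __SERVICE_NAME__
      template:
        metadata:
          labels:
            app: __SERVICE_NAME__
        spec:
          containers:
            - name: __SERVICE_NAME__
              image: __IMAGE__
              ports:
                - containerPort: 9000
              resources:
                requests:
                  memory: 300M
                  cpu: 200m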

The HTTPService definition shown above is extracted from the template repository used to bootstrap every new microservice at the start of a project. The templating approach, while speeding up initial deployment, gave us a false picture of our actual resource needs. The pre-filled resource requirements in the template were only ever meant as a guide, but for most services they were never reviewed or adjusted at deployment time to match the service’s actual needs. The result was massive waste through overprovisioning.
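
To put rough numbers on the waste, compare the template defaults with what a typical lightweight internal service actually needed. The right-sized figures below are illustrative, not measurements from a specific service.

    # Requests every service inherited from the template:
    requests:
        memory: 300M
        cpu: 200m

    # Roughly what a typical lightweight internal service needed (illustrative):
    requests:
        memory: 128M
        cpu: 50m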

Inefficient Kubernetes Resource Allocation: The Scheduling Problem

Here is an extract from the Kubernetes Descheduler project page, describing how Kubernetes handles scheduling, how this can lead to underutilized resources, and how the descheduler addresses the problem.

Scheduling in Kubernetes is the process of binding pending pods to nodes, and is performed by a component of Kubernetes called kube-scheduler. The scheduler’s decisions on whether or where a pod can or can not be scheduled are influenced by its view of a Kubernetes cluster at that point of time when a new pod appears for scheduling. As Kubernetes clusters are very dynamic and their state changes over time, there may be desire to move already running pods to some other nodes for various reasons:

  • Some nodes are under or over utilized.
  • The original scheduling decision does not hold true any more …
  • Some nodes failed and their pods moved to other nodes.
  • New nodes are added to clusters.

Consequently, there might be several pods scheduled on less desired nodes in a cluster. Descheduler, based on its policy, finds pods that can be moved and evicts them.

You can visit the project page at github.com/kubernetes-sigs/descheduler to learn more.

The Optimization Playbook: Improving our Kubernetes Utilization

To tackle the Kubernetes-specific part of our resource utilization problem, we implemented two key solutions:

  1. Kubernetes Descheduler: This tool became our resource redistribution powerhouse. Here’s a peek at our configuration:

    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "LowNodeUtilization":
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              "cpu" : 30
            targetThresholds:
              "cpu" : 65
    

    This policy marks nodes below 30% CPU utilization as underutilized and nodes above 65% as over-utilized. The descheduler evicts pods from the over-utilized nodes so that the scheduler can place them on the underutilized ones, continuously rebalancing the cluster and ensuring efficient resource use. A sketch of how the descheduler can be run on a schedule is shown after this list.

  2. Vertical Pod Autoscaling (VPA): VPA became our automated resource right-sizer. Here’s a basic configuration we used:

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: my-app-vpa
    spec:
      targetRef:
        apiVersion: "apps/v1"
        kind: Deployment
        name: my-app
      updatePolicy:
        updateMode: "Auto"
    

    With updateMode: "Auto", VPA automatically adjusts the resource requests of the my-app Deployment based on observed usage patterns, which resolves the problem of hardcoded resource requests inherited from the template. An illustrative example of the recommendations VPA produces is shown after this list.
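
Before trusting VPA to resize a workload (for example, while it is still in updateMode: "Off"), you can inspect what it has learned: the recommendations are exposed in the VPA object’s status, readable with kubectl describe vpa my-app-vpa. The snippet below shows the shape of that output with made-up numbers; actual values depend on observed usage.

    # Illustrative VPA status for my-app; the numbers are examples, not real measurements.
    status:
      recommendation:
        containerRecommendations:
        - containerName: my-app
          lowerBound:
            cpu: 25m
            memory: 128Mi
          target:              # the requests VPA would apply in "Auto" mode
            cpu: 80m
            memory: 192Mi
          upperBound:
            cpu: 300m
            memory: 512Mi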
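
As for the descheduler, it runs to completion each time it is invoked rather than acting as a long-lived controller, so a common pattern is to run it on a schedule. The sketch below, adapted from the upstream examples, runs it as a CronJob every 30 minutes; the image tag and schedule are illustrative, and the ServiceAccount’s RBAC (permissions to list nodes and pods and to evict pods) is omitted for brevity.

    # Sketch of running the descheduler as a CronJob; adjust image tag and schedule as needed.
    apiVersion: batch/v1beta1      # batch/v1 on Kubernetes 1.21+
    kind: CronJob
    metadata:
      name: descheduler
      namespace: kube-system
    spec:
      schedule: "*/30 * * * *"
      jobTemplate:
        spec:
          template:
            spec:
              serviceAccountName: descheduler-sa   # needs RBAC to list nodes/pods and evict pods (not shown)
              restartPolicy: Never
              containers:
                - name: descheduler
                  image: k8s.gcr.io/descheduler/descheduler:v0.18.0   # illustrative version
                  command:
                    - /bin/descheduler
                    - --policy-config-file=/policy-dir/policy.yaml
                  volumeMounts:
                    - name: policy-volume
                      mountPath: /policy-dir
              volumes:
                - name: policy-volume
                  configMap:
                    name: descheduler-policy-configmap   # holds the DeschedulerPolicy shown above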

The Results: Efficiency Meets Reliability

The impact of these changes was both immediate and significant:

  • Resource Utilization: Most of our clusters now run at 50-60% CPU utilization during peak hours, up from an average of 30%.
  • Cost Reduction: Our cloud bill decreased by 35% month-over-month.
  • Scalability: Our optimized infrastructure could now handle traffic spikes more efficiently, improving our ability to scale during high-demand periods. VPA helps here too: it raises requests for services that are under-provisioned, not just trims them for services that are over-provisioned.

Beyond Compute: Optimizing the Entire Stack

While our Compute Engine and Kubernetes optimizations gave us the biggest wins, we didn’t stop there. We applied the same optimization mindset to other critical services:

  1. BigQuery: Smarter Queries, Lower Costs

     The Data team implemented table partitioning and clustering, reducing our query costs significantly. Here’s an example of how tables can be structured to take advantage of partitioning and clustering:

    CREATE TABLE mydataset.transactions
    (
      timestamp TIMESTAMP,
      user_id STRING,
      transaction_amount FLOAT64,
      transaction_type STRING
    )
    PARTITION BY DATE(timestamp)
    CLUSTER BY user_id, transaction_type
    

    This structure allows BigQuery to scan less data for time-based and user-specific queries, significantly reducing costs and improving query performance.

  2. Cloud Logging: From Firehose to Focused Stream

    We tightened up our logging practices:

    • Eliminated debug-level logs in production, reducing log volume by 60%.
    • Implemented log exclusion rules for high-volume, low-value logs.
    # Example log exclusion rule
    exclusions:
    - name: exclude-health-checks
      filter: resource.type="gce_instance" AND log_name="projects/my-project/logs/heartbeat"
    

    This approach not only reduced costs but also improved the signal-to-noise ratio in our logs, making troubleshooting more efficient.

  3. Cloud Memorystore (Redis): Caching Smarter, Not Harder

    We optimized our caching layer:

     • The minimum size for a Redis instance on Cloud Memorystore is 1 GB.
     • This led to gross underutilization, as most of our applications (microservices) did not use even 10% of that.
     • We moved from one Redis instance per application to a shared setup, with several applications per instance.
     • We grouped applications together based on the eviction policies that suited their use cases.
     • Each group of applications was then assigned to a Redis instance with a fixed eviction policy, e.g. fin1-lru (see the illustrative grouping after this list).
     • Through these optimizations, we reduced our Redis instances from 48 to 28, a 42% reduction.
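
To make the grouping concrete, here is a hypothetical mapping of services to shared Memorystore instances. The instance names, sizes, service names and policies below only illustrate the shape of the setup; they are not our actual inventory.

    # Hypothetical grouping of services onto shared Redis instances by eviction policy.
    fin1-lru:
      maxmemory-policy: allkeys-lru   # Redis eviction policy configured on the instance
      sizeGb: 1
      services: [payments-api, wallet-cache, pricing-service]
    core1-ttl:
      maxmemory-policy: volatile-ttl
      sizeGb: 1
      services: [session-store, otp-service, notification-worker]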

Key Takeaways: The New DevOps Mantra

Our optimization journey taught us several crucial lessons that have become our new DevOps mantras:

  1. “Move fast, but keep an eye on the speedometer”: Rapid deployment is crucial, but so is regular performance monitoring and optimization.

  2. “Tailoring beats templating”: While templates can speed up initial deployment, custom-fit resources for each project ultimately lead to better efficiency and cost-effectiveness.

  3. “Leverage built-in intelligence”: Many cloud platforms offer sophisticated optimization tools. Use them to your advantage before considering custom solutions.

  4. “Optimization is a continuous process”: Cloud infrastructure isn’t “set it and forget it.” Regular reviews and adjustments are key to maintaining efficiency as your system evolves.

  5. “Small optimizations compound”: Don’t underestimate the power of minor tweaks. When applied across a large infrastructure, they can lead to significant savings.

Conclusion: Balancing Speed and Efficiency in the Cloud

Our experience at Kudi has shown that in the world of cloud infrastructure, success isn’t just about how quickly you can deploy—it’s about how efficiently you can operate at scale. The “move fast and break things” ethos of early-stage startups needs to evolve into “move smartly and optimize things” as organizations grow and scale.

As you manage your own cloud resources, remember that every optimization, no matter how small, can have a significant impact when applied across a large infrastructure. Don’t be afraid to slow down and take stock. Analyze your usage, leverage the tools at your disposal, and optimize relentlessly.

The cloud offers incredible flexibility and power, but it’s up to us as DevOps / Cloud professionals to use it wisely. By finding the right balance between speed and efficiency, we can build systems that are not only robust and scalable but also cost-effective.

Your future self, your users, and yes, even your CFO, will thank you for it.