How to Escape the 3 AM Page as a Kubernetes Site Reliability Engineer
2024-4-3 03:36:12

It’s Saturday night. You’re out to dinner with friends. Suddenly, a familiar tune sounds from your pocket. Dread fills you as you fish out your phone and unlock it. You tap the alert. Maybe it’s a lucky night and this is one alert you can just snooze or resolve. Maybe it’s a bad night, and the next step is pulling your laptop from your bag (because you bring your laptop everywhere when you’re on-call) and trying to troubleshoot a problem in a crowded, noisy restaurant.

Or maybe it’s just a Tuesday night and you’re fast asleep when that alert interrupts your rest. You have to pull yourself out of bed and grab your laptop, eyes barely open, brain still covered in sleep dust. Your brain has to pivot from that awesome dream you were having to thinking about your Kubernetes infrastructure and troubleshooting steps.

Being on-call is a drag. It requires planning your life (and by extension your family’s lives) around your on-call schedule. You can’t go for a weekend in the woods if you don’t have adequate wireless service. A Saturday trip to the amusement park is (a lot) less fun when you’re dragging your laptop around. Everything you want to do requires you to consider “Will this be a problem if I get paged?”

Managed Kubernetes Services

Fairwinds’ Managed Kubernetes enables you to leave your backpack at home and enjoy your free time without worrying about your infrastructure — because we keep an eye on it for you. We use our internal tooling to configure monitors according to observability best practices. We track metrics that provide insight into the performance of your Kubernetes infrastructure, and use those metrics to alert when things are not going as expected. Some of the more common issues our monitoring captures include:

  • Pods not scheduling
  • Pods scheduling but not staying up, i.e., “crashlooping”
  • Kubelets in an unhealthy state
  • Nodes not coming into service
  • Daemonsets and deployments missing the specified number of replicas
  • Certs expiring (managed by cert-manager)
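
As a rough illustration of what the first two items and the replica check look like at the API level, here is a minimal Python sketch that inspects pod and Deployment objects as plain dictionaries, for example as parsed from `kubectl get pods -o json`. It is not Fairwinds’ internal tooling; the function names `pod_problems` and `missing_replicas` are hypothetical, and only the field names (`conditions`, `containerStatuses`, `availableReplicas`) come from the Kubernetes API.

```python
def pod_problems(pod):
    """Return problem labels for one pod object (a dict parsed from JSON)."""
    problems = []
    status = pod.get("status", {})

    # A pod the scheduler could not place reports PodScheduled=False.
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            problems.append("not scheduling")

    # A container that keeps restarting waits with reason CrashLoopBackOff.
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            problems.append("crashlooping")
            break  # one label per pod is enough

    return problems


def missing_replicas(deployment):
    """Return how many replicas a Deployment is short of its desired count."""
    # spec.replicas defaults to 1; availableReplicas is omitted when it is 0.
    desired = deployment.get("spec", {}).get("replicas", 1)
    available = deployment.get("status", {}).get("availableReplicas", 0)
    return max(desired - available, 0)
```

In practice a monitoring system would evaluate conditions like these continuously from cluster metrics rather than from one-off JSON snapshots; the sketch only shows which signals distinguish each failure.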

Shift Left with Insights

While Fairwinds manages your infrastructure and gets all pages related to infrastructure issues, your organization creates and deploys apps and services on that infrastructure, so you’ll still need to respond to pages that are not infrastructure related. We still want you to get your sleep, though! Our people-led Managed Services offering also includes a subscription to our Fairwinds Insights platform. Insights integrates into your CI/CD pipelines, allowing you to create checks for common misconfigurations that can result in unwanted pages on the application side. For example, you can configure checks that ensure workloads aren’t deployed with missing resource requests and limits, which can cause instability in the underlying infrastructure and lead to resource contention.
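
To make the idea of a requests-and-limits check concrete, here is a minimal Python sketch of the kind of rule a CI step could apply to a workload manifest that has already been parsed into a dictionary. It is an illustration only, not how Insights implements its checks; the function name `missing_resources` is hypothetical, and the choice to require both `cpu` and `memory` in both sections is an assumption.

```python
SECTIONS = ("requests", "limits")
RESOURCES = ("cpu", "memory")  # assumption: both are required in each section


def missing_resources(workload):
    """List the resource settings missing from each container of a workload."""
    pod_spec = workload.get("spec", {}).get("template", {}).get("spec", {})
    findings = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources") or {}
        for section in SECTIONS:
            values = resources.get(section) or {}
            for resource in RESOURCES:
                if resource not in values:
                    findings.append(f"{name}: missing resources.{section}.{resource}")
    return findings
```

A pipeline step could fail the build whenever the returned list is non-empty, so the gap is fixed before the workload ever reaches the cluster.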

Perhaps you’re familiar with seeing pods fail to come up because they can’t pull an image, only to find out there was a typo in the image specification. Validating the image in your manifest, so that workloads are not deployed with typos, is another check you can enable in Insights to help you avoid unnecessary product-related pages.
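
One simple way to catch an image typo before deployment is to compare the repository in the manifest against a list of repositories you expect to deploy. The Python sketch below does that with the standard library’s `difflib`. It is a hypothetical example rather than the check Insights performs, and the `KNOWN_IMAGES` entries and the `registry.example.com` host are placeholders.

```python
import difflib

# Hypothetical allowlist of repositories this team deploys.
KNOWN_IMAGES = {
    "registry.example.com/shop/api",
    "registry.example.com/shop/web",
}


def split_image(ref):
    """Split 'repo[:tag][@digest]' into (repo, tag); tag may be empty."""
    name = ref.split("@", 1)[0]  # drop any digest
    # A colon only marks a tag if it comes after the last slash;
    # otherwise it belongs to a registry port such as localhost:5000.
    if name.rfind(":") > name.rfind("/"):
        repo, tag = name.rsplit(":", 1)
        return repo, tag
    return name, ""


def check_image(ref, known=KNOWN_IMAGES):
    """Return None if the repository is known, else a warning message."""
    repo, _tag = split_image(ref)
    if repo in known:
        return None
    close = difflib.get_close_matches(repo, sorted(known), n=1)
    if close:
        return f"unknown image {repo!r}; did you mean {close[0]!r}?"
    return f"unknown image {repo!r}"
```

An allowlist only catches typos in the repository name. Confirming that a given tag actually exists would require querying the registry, which this sketch does not attempt.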

Managed Kubernetes + Fairwinds Insights

The combination of our Managed Services and the Insights platform helps lessen on-call fatigue. Your engineers can have a night out, or just sleep soundly, knowing that our Managed Services teams are monitoring your infrastructure and responding to alerts as needed. And for those application-level alerts, Insights sets you up for success by enabling you to catch misconfigurations that can compromise infrastructure stability before they get to the cluster and wake you up in the middle of the night or interrupt your weekend.

And that song or funny effect that you thought would be cute to use as your pager alert but has since become a Pavlovian trauma sound? You can love it again.

*** This is a Security Bloggers Network syndicated blog from Fairwinds | Blog authored by Stevie Caldwell. Read the original post at: