Creating Scalable Applications with Kubernetes


📒 Chapter 4: Monitoring, Logging & Observability at Scale in Kubernetes

🌐 Introduction

Modern Kubernetes applications are complex, distributed, and dynamic. At scale, traditional monitoring tools struggle to capture the real-time health, performance metrics, and debugging data you need.

That’s where observability comes in — a deeper, structured approach to understanding what’s happening inside your applications and infrastructure.

In this chapter, we’ll explore:

  • Core principles of observability in Kubernetes
  • Monitoring tools like Prometheus, Grafana, kube-state-metrics
  • Centralized logging with EFK/Loki stacks
  • Tracing with Jaeger and OpenTelemetry
  • Alerting, dashboarding, and scaling observability for production

🔍 Section 1: Observability vs. Monitoring

| Concept | Monitoring | Observability |
|---|---|---|
| Focus | Predefined metrics and alerts | Debugging unknowns through data correlation |
| Data Types | Metrics | Metrics, logs, traces |
| Use Case | "Is it working?" | "Why is it not working?" |
| Examples | CPU usage, memory usage | Distributed tracing, request latency, anomalies |

Kubernetes observability means going beyond metrics — it means integrating logs, events, traces, and alerts to achieve full operational visibility.


📊 Section 2: Monitoring with Prometheus & kube-state-metrics

What is Prometheus?

Prometheus is a pull-based metrics collection system: it periodically scrapes HTTP endpoints that expose metrics in a plain-text exposition format.
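To make the pull model concrete, here is a stdlib-only sketch of what a scraped `/metrics` payload looks like. The metric names and values are made up for illustration; real applications use a client library such as `prometheus_client` rather than formatting this by hand:

```python
# Illustrative metric names and values; a real app would maintain these counters
metric_values = {
    "http_requests_total": 1027,
    "process_cpu_seconds_total": 12.5,
}

def render_exposition(values):
    """Render metrics in the Prometheus plain-text exposition style."""
    lines = []
    for name, value in values.items():
        lines.append(f"# TYPE {name} counter")  # metadata line for the scraper
        lines.append(f"{name} {value}")         # sample: metric name, then value
    return "\n".join(lines) + "\n"

print(render_exposition(metric_values))
```

Prometheus simply fetches this text over HTTP on each scrape interval and stores the samples as time series.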

| Component | Purpose |
|---|---|
| Prometheus Server | Collects and stores metrics |
| Exporters | Expose system/app metrics in Prometheus format (e.g., node exporter, cAdvisor) |
| Alertmanager | Routes and sends alerts (email, Slack, PagerDuty, etc.) |
| kube-state-metrics | Exposes Kubernetes object state as metrics (Deployments, Pods, nodes) |


🛠️ Prometheus Operator (recommended method)

Install with:

```bash
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml
```

(Note the raw URL: applying the GitHub `blob` page URL fails, because it serves HTML rather than YAML.)

Or use Helm (add the chart repository first):

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack
```

This installs:

  • Prometheus server
  • Node exporter
  • kube-state-metrics
  • Grafana (optional)
  • Alertmanager

🔧 Sample Prometheus scrape config (for a custom app)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: web
      interval: 30s
```
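A ServiceMonitor only selects Services that carry the matching label and expose a named port. A minimal Service it would pick up might look like this (names and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  labels:
    app: my-app          # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app
  ports:
    - name: web          # must match the ServiceMonitor endpoint's port name
      port: 8080
      targetPort: 8080
```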


📊 Common Metrics Tracked

| Metric | Meaning |
|---|---|
| `container_cpu_usage_seconds_total` | Cumulative CPU time consumed, per container |
| `container_memory_usage_bytes` | Memory usage by container |
| `kube_pod_status_phase` | Pod status (Running/Pending/etc.) |
| `http_requests_total` | Total HTTP requests (custom apps) |
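Raw counters are rarely useful on their own; in PromQL you typically query their rate over a time window. Two illustrative queries over the metrics above (the `code` label is an assumption about how your app labels its requests):

```promql
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

sum(rate(http_requests_total[5m])) by (code)
```

The first gives per-pod CPU usage in cores averaged over five minutes; the second gives requests per second broken down by HTTP status code.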


📈 Section 3: Grafana Dashboards

Grafana is the visualization layer in your monitoring stack.

Built-in dashboards for:

  • Cluster health
  • Node CPU/memory
  • Pod health and restart counts
  • Network/ingress traffic

Access Grafana:

```bash
kubectl port-forward svc/prometheus-grafana 3000:80
```

Then open http://localhost:3000. Default credentials for the kube-prometheus-stack chart: `admin` / `prom-operator`


📝 Section 4: Logging with EFK Stack or Loki

🔹 EFK: Elasticsearch, Fluent Bit/Fluentd, Kibana

| Component | Description |
|---|---|
| Fluentd / Fluent Bit | Aggregates and forwards logs |
| Elasticsearch | Stores logs (searchable, indexable) |
| Kibana | UI to query and visualize logs |

🔹 Loki: Lightweight Alternative to Elasticsearch

  • Developed by Grafana Labs
  • Integrates natively with Grafana
  • Uses labels for log indexing (not full-text search)
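In Grafana, Loki is queried with LogQL, which filters by labels first and log text second. Two illustrative queries (the label names are assumptions about how your logs are tagged):

```logql
{namespace="production", app="my-app"} |= "error"

rate({app="my-app"} |= "error" [5m])
```

The first returns matching log lines; the second turns them into a per-second error rate that can be graphed or alerted on alongside your metrics.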

🛠️ Fluent Bit ConfigMap (mounted by the DaemonSet, example)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Daemon       Off

    [INPUT]
        Name         tail
        Path         /var/log/containers/*.log
        Tag          kube.*
        Parser       docker

    [OUTPUT]
        Name         stdout
        Match        *
```

Apply:

```bash
kubectl apply -f fluent-bit-config.yaml
```
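The `stdout` output above is only useful for verifying the pipeline; in practice you point Fluent Bit at your log store. A sketch of an Elasticsearch output section (host and port are assumptions for a typical in-cluster deployment):

```conf
[OUTPUT]
    Name            es
    Match           kube.*
    Host            elasticsearch.logging.svc
    Port            9200
    Logstash_Format On
```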


🧵 Section 5: Tracing with Jaeger & OpenTelemetry

Traces allow you to follow a single request across multiple services and pods.

Tools:

  • Jaeger (open source distributed tracing system)
  • OpenTelemetry SDK (instrument your code)

🛠️ Installing Jaeger via the Operator

```bash
kubectl create namespace observability
kubectl apply -n observability -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.44.0/jaeger-operator.yaml
```
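The command above installs the operator, not Jaeger itself. An all-in-one Jaeger instance is then requested with a minimal custom resource (this is the operator's standard "simplest" example, which defaults to the all-in-one deployment strategy):

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
```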

Instrument code (Python example):

```python
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
```
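Conceptually, what ties spans from different services into one trace is a propagated context: each hop keeps the trace ID and mints a new span ID. A stdlib-only sketch of the W3C `traceparent` header that OpenTelemetry propagates by default (version-traceid-spanid-flags):

```python
import secrets

def new_traceparent():
    """Root service: mint a fresh trace ID and span ID (W3C Trace Context)."""
    trace_id = secrets.token_hex(16)   # 128-bit trace ID, hex-encoded
    span_id = secrets.token_hex(8)     # 64-bit span ID
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(parent):
    """Downstream service: keep the trace ID, mint a new span ID."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

root = new_traceparent()
child = child_traceparent(root)
print(root)
print(child)
```

Because both headers share the same trace ID, Jaeger can join the spans from each service into a single end-to-end trace.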


🚨 Section 6: Alerting & Notifications

Prometheus + Alertmanager supports routing alerts to:

  • Email
  • Slack
  • PagerDuty
  • Opsgenie
  • Webhooks

🔔 Sample Alert Rule (CPU Usage)

```yaml
groups:
  - name: cpu-alerts
    rules:
      - alert: HighPodCPU
        expr: rate(container_cpu_usage_seconds_total[1m]) > 0.9
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Pod CPU usage high"
```
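For Alertmanager to deliver this alert anywhere, a receiver must be configured. A sketch of a Slack route (the webhook URL and channel are placeholders for your own values):

```yaml
route:
  receiver: slack-warnings
  group_by: [alertname]
receivers:
  - name: slack-warnings
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook
        channel: "#alerts"
```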


🧠 Section 7: Observability Best Practices

| Practice | Why It Matters |
|---|---|
| Use labels and annotations | Tag logs and metrics for better filtering |
| Correlate logs, metrics, and traces | Enables fast root cause analysis |
| Set retention limits | Controls cost and resource use |
| Dashboards per service or team | Improves focus and ownership |
| Use SLOs and error budgets | Drive alerting decisions based on user impact |
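As an example of the last practice, an availability SLO is often tracked as an error ratio over a rolling window. An illustrative PromQL expression (the `code` label is an assumption about your request metrics):

```promql
sum(rate(http_requests_total{code=~"5.."}[30d]))
  /
sum(rate(http_requests_total[30d]))
```

Alerting on how fast this ratio burns through the error budget ties paging directly to user impact rather than to raw resource thresholds.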


Summary

Scalable Kubernetes systems demand more than uptime monitoring. You need rich telemetry, actionable alerts, and real-time visibility.

Key takeaways:


  • Use Prometheus + Grafana for full-stack metrics
  • Centralize logs with Fluent Bit and Elasticsearch or Loki
  • Trace requests with Jaeger or OpenTelemetry
  • Route alerts through Alertmanager
  • Build observability into your CI/CD pipeline


FAQs


❓1. What makes Kubernetes ideal for building scalable applications?

Answer:
Kubernetes automates deployment, scaling, and management of containerized applications. It offers built-in features like horizontal pod autoscaling, load balancing, and self-healing, allowing applications to handle traffic spikes and system failures efficiently.

❓2. What is the difference between horizontal and vertical scaling in Kubernetes?

Answer:

  • Horizontal scaling increases or decreases the number of pod replicas.
  • Vertical scaling adjusts the resources (CPU, memory) allocated to a pod.

Kubernetes primarily automates horizontal scaling, through the Horizontal Pod Autoscaler (HPA).

❓3. How does the Horizontal Pod Autoscaler (HPA) work?

Answer:
HPA monitors metrics like CPU or memory usage and automatically adjusts the number of pods in a deployment to meet demand. It uses the Kubernetes Metrics Server or custom metrics APIs.
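As a concrete illustration, here is a minimal `autoscaling/v2` HPA that keeps average CPU utilization around 70% (names and numbers are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```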

❓4. Can Kubernetes scale the number of nodes in a cluster?

Answer:
Yes. The Cluster Autoscaler automatically adjusts the number of nodes in a cluster based on resource needs, ensuring pods always have enough room to run.

❓5. What’s the role of Ingress in scalable applications?

Answer:
Ingress manages external access to services within the cluster. It provides SSL termination, routing rules, and load balancing, enabling scalable and secure traffic management.

❓6. How do I manage application rollouts during scaling?

Answer:
Use Kubernetes Deployments to perform rolling updates with zero downtime. You can also perform canary or blue/green deployments using tools like Argo Rollouts or Flagger.
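A Deployment's rollout behavior is controlled by its update strategy. A typical zero-downtime configuration looks like this (values are illustrative):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra pod during the rollout
      maxUnavailable: 0    # never drop below the desired replica count
```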

❓7. Is Kubernetes suitable for both stateless and stateful applications?

Answer:
Yes. Stateless apps are easier to scale and deploy. For stateful apps, Kubernetes provides StatefulSets, persistent volumes, and storage classes to ensure data consistency across pod restarts or migrations.

❓8. How can I monitor the scalability of my Kubernetes applications?

Answer:
Use Prometheus for metrics, Grafana for dashboards, the EFK stack or Loki for logs, and Kubernetes liveness/readiness probes to track application health and scalability trends.

❓9. Can I run scalable Kubernetes apps on multiple clouds?

Answer:
Yes. Kubernetes is cloud-agnostic. You can deploy apps on any provider (AWS, Azure, GCP) or use multi-cloud/hybrid tools like Rancher, Anthos, or KubeFed for federated scaling across environments.

❓10. What are some common mistakes when trying to scale apps with Kubernetes?

Answer:

  • Not setting proper resource limits and requests
  • Overlooking pod disruption budgets during scaling
  • Misconfiguring autoscalers or probes
  • Ignoring log/metrics aggregation for troubleshooting
  • Running all workloads in a single namespace without isolation