Distributed Systems Design

Insights on Site Reliability Engineering and distributed systems from a professional SRE with over 10 years of experience designing and implementing Kubernetes-native applications in multi-cloud environments.

About

Senior Site Reliability Engineer with a software engineering background and 11+ years of experience architecting, deploying, and scaling business-critical systems in hybrid-cloud environments spanning on-premises data centers and public clouds (GCP, AWS). I've spent 8+ years at Vimeo improving the reliability and performance of large-scale cloud-native applications. I was instrumental in migrating on-premises data center services to the cloud and adopting Kubernetes as the core orchestration platform, supporting business growth through high-availability system design, robust observability frameworks, and performance optimization at scale.

My professional work with Linux systems started 20 years ago, when I cross-compiled kernel modules for ARM-based embedded Linux, and the learning continues as I study how Istio would complement Cilium in Kubernetes Control Plane V2.

Expertise

Site Reliability Engineering

  • Kubernetes clusters at scale
  • Observability (metrics, logs, traces)
  • SLOs, SLIs, and error budgets
  • Incident response and retrospectives
  • Capacity planning and load testing
  • GitOps and CI/CD (continuous integration and continuous delivery)

Distributed Systems

  • Distributed databases and consistency models
  • Service mesh and microservices patterns
  • Global load balancing
  • Distributed caching
  • Consensus protocols (Raft, Paxos)
  • Event-driven architectures
  • CAP theorem trade-offs in practice

Technical Skills

  • Languages: Python, Go, Bash
  • Cloud: AWS, GCP, Kubernetes
  • Databases and streaming: PostgreSQL, MySQL, Redis, Kafka
  • Tools: Prometheus, Grafana, Terraform, ArgoCD, Varnish

Connect