Principal Site Reliability Engineer (AI Platform Architecture)

Salary not disclosed
Senior · Full-time
#332409 · Posted 10 days ago · 15
Source: LinkGroup
Tech Stack / Keywords

AI, Architecture, Kubernetes, Python, Go

Requirements

  • Extensive experience in SRE or platform engineering, with a proven track record of impact at a principal or staff level.
  • Deep expertise in Kubernetes, specifically in managing autoscaling, resource scheduling, and orchestration for compute-intensive workloads.
  • Advanced programming expertise in Python or Go, with experience building production-grade automation and platform services.
  • Proven ability to influence cross-team technical decisions and elevate technical standards across engineering departments.
  • Experience or strong technical interest in AI/ML infrastructure, model deployment, and GPU workload optimization.
  • A system-level approach to designing reliability into innovative platforms while building strong partnerships with product engineering teams.

Responsibilities

  • Defining the reliability architecture for AI compute services, including SLO frameworks, fault tolerance patterns, and advanced capacity planning models.
  • Driving hands-on development of automation and tooling that scales the SRE team's impact and eliminates operational toil.
  • Designing a comprehensive observability strategy, leveraging existing platforms to build specialized telemetry and GPU-specific monitoring for AI workloads.
  • Architecting deployment safety standards, including progressive rollouts, canary analysis, and automated rollback processes.
  • Embedding reliability into the development lifecycle by influencing product engineering architecture and high-level design decisions.
  • Mentoring and elevating the SRE team through design reviews, code reviews, and hands-on problem-solving.