Principal Site Reliability Engineer (AI Platform Architecture)

Salary not disclosed

Senior · Full-time

#332262 · Posted 11 days ago · 14

Source: nofluffjobs.com

Tech Stack / Keywords

SRE · Kubernetes · Python · Go · AI · ML

Requirements

Extensive experience in SRE or platform engineering, with a proven track record of impact at a principal or staff level.
Deep expertise in Kubernetes, specifically in managing autoscaling, resource scheduling, and orchestration for compute-intensive workloads.
Advanced programming expertise in Python or Go, with experience building production-grade automation and platform services.
Proven ability to influence cross-team technical decisions and elevate technical standards across engineering departments.
Experience or strong technical interest in AI/ML infrastructure, model deployment, and GPU workload optimization.
A system-level approach to designing reliability into innovative platforms while building strong partnerships with product engineering teams.

Responsibilities

Defining the reliability architecture for AI compute services, including SLO frameworks, fault tolerance patterns, and advanced capacity planning models.
Driving hands-on development of automation and tooling that scales the SRE team's impact and eliminates operational toil.
Designing a comprehensive observability strategy, leveraging existing platforms to build specialized telemetry and GPU-specific monitoring for AI workloads.
Architecting deployment safety standards, including progressive rollouts, canary analysis, and automated rollback processes.
Embedding reliability into the development lifecycle by influencing product engineering architecture and high-level design decisions.
Mentoring and elevating the SRE team through design reviews, code reviews, and hands-on problem-solving.

Link Group
