Principal Site Reliability Engineer (AI Platform Architecture)
No salary information
Senior · Full-time
#332262 · Added 11 days ago · 14
Source: nofluffjobs.com
Tech Stack / Keywords
SRE, Kubernetes, Python, Go, AI, ML
Requirements
- Extensive experience in SRE or platform engineering, with a proven track record of impact at a principal or staff level.
- Deep expertise in Kubernetes, specifically in managing autoscaling, resource scheduling, and orchestration for compute-intensive workloads.
- Advanced programming expertise in Python or Go, with experience building production-grade automation and platform services.
- Proven ability to influence cross-team technical decisions and elevate technical standards across engineering departments.
- Experience or strong technical interest in AI/ML infrastructure, model deployment, and GPU workload optimization.
- A system-level approach to designing reliability into innovative platforms while building strong partnerships with product engineering teams.
Responsibilities
- Defining the reliability architecture for AI compute services, including SLO frameworks, fault tolerance patterns, and advanced capacity planning models.
- Driving hands-on development of automation and tooling that scales the SRE team's impact and eliminates operational toil.
- Designing a comprehensive observability strategy, leveraging existing platforms to build specialized telemetry and GPU-specific monitoring for AI workloads.
- Architecting deployment safety standards, including progressive rollouts, canary analysis, and automated rollback processes.
- Embedding reliability into the development lifecycle by influencing product engineering architecture and high-level design decisions.
- Mentoring and elevating the SRE team through design reviews, code reviews, and hands-on problem-solving.
Link Group
160 active offers