Site Reliability Engineer (AI Infrastructure)
Salary not disclosed
Senior · Full-time
#332263 · Posted 11 days ago · 20
Source: nofluffjobs.com
Tech Stack / Keywords
SRE, Kubernetes, SLOs, Prometheus, Grafana, Python, Go, CI/CD Pipelines, Terraform, AI, ML
Requirements
- Expertise in SRE, infrastructure, or platform engineering, with extensive operational experience managing large-scale distributed systems.
- Expertise in Kubernetes and large-scale containerization systems.
- Experience defining SLOs and working with observability tools like Prometheus, Grafana, and distributed tracing to enhance system monitoring.
- Proficiency in Python or Go for automation, plus experience with CI/CD pipelines, deployment safety, and infrastructure-as-code tools such as Terraform.
- Interest in or experience with AI/ML infrastructure, model serving, or GPU workloads.
- Ability to resolve issues independently while maintaining accountability throughout the process.
- Ownership of reliability: developing automation and monitoring, and collaborating effectively with engineering teams unfamiliar with SRE practices.
Responsibilities
- Building and maintaining observability for AI workloads, including telemetry, dashboards, alerts, SLO/SLI tracking, and driving improvements when targets are missed.
- Writing automation and tooling to reduce operational toil, improve deployment safety, and accelerate incident response.
- Integrating AI workloads into existing incident management processes, building runbooks, participating in on-call rotations, and conducting blameless post-mortems.
- Building and maintaining CI/CD integrations, deployment safety checks, and rollback automation.
- Collaborating with product engineering teams to improve reliability, contribute to architecture decisions, and ensure operational readiness for product releases.
- Contributing to capacity planning, autoscaling configuration, and workload scheduling for AI compute infrastructure.
Link Group