Site Reliability Engineer (AI Infrastructure)

Salary not disclosed
Senior · Full-time
#332263 · Added 11 days ago · 20
Source: nofluffjobs.com

Tech Stack / Keywords

SRE, Kubernetes, SLOs, Prometheus, Grafana, Python, Go, CI/CD Pipelines, Terraform, AI, ML

Requirements

  • Expertise in SRE, infrastructure, or platform engineering, with extensive operational experience managing large-scale distributed systems.
  • Expertise in Kubernetes and large-scale containerization systems.
  • Experience defining SLOs and working with observability tools like Prometheus, Grafana, and distributed tracing to enhance system monitoring.
  • Proficiency in Python or Go for automation, CI/CD pipelines, deployment safety, and infrastructure-as-code tooling such as Terraform.
  • Interest in or experience with AI/ML infrastructure, model serving, or GPU workloads.
  • Ability to resolve issues independently while maintaining accountability throughout the process.
  • Ownership of reliability: developing automation and monitoring, and collaborating effectively with engineering teams unfamiliar with SRE practices.

Responsibilities

  • Building and maintaining observability for AI workloads, including telemetry, dashboards, alerts, SLO/SLI tracking, and driving improvements when targets are missed.
  • Writing automation and tooling to reduce operational toil, improve deployment safety, and accelerate incident response.
  • Integrating AI workloads into existing incident management processes, building runbooks, participating in on-call rotations, and conducting blameless post-mortems.
  • Building and maintaining CI/CD integrations, deployment safety checks, and rollback automation.
  • Collaborating with product engineering teams to improve reliability, contribute to architecture decisions, and ensure operational readiness for product releases.
  • Contributing to capacity planning, autoscaling configuration, and workload scheduling for AI compute infrastructure.
Link Group
