Senior Site Reliability Engineer

No salary information
Senior · Full-time
#326674 · Posted 19 days ago · 22
Source: nofluffjobs.com

Tech Stack / Keywords

Datadog, AI, Kubernetes, Grafana, Copilot

Requirements

  • Proactive and self-driven; identifies problems, risks, and opportunities for improvement independently
  • Engaged owner mindset; treats system stability as end-to-end responsibility
  • Hands-on engineer; regularly works with clusters, pipelines, monitoring, and code
  • AI-native; uses AI tools extensively every day (Copilot, LLMs, automation, analytics, debugging, documentation) and understands how AI affects system design and maintenance
  • Comfortable working in a dynamic environment with immature processes
  • Experience with Azure DevOps (Boards, Repos, Pipelines)
  • Strong knowledge of Kubernetes, including troubleshooting, scaling, and production operations
  • Proficiency in Datadog (metrics, logs, dashboards, alerting)
  • Experience with Azure Portal for environment operations and configuration
  • Strong knowledge of CI/CD practices, including pipeline optimization, testing, and quality gates
  • 5+ years of experience as an SRE / Production / Platform Engineer
  • Proven experience in production environments
  • Strong knowledge of incident management and root cause analysis (RCA)
  • Ability to build practical monitoring systems
  • Very good command of English, both spoken and written

Nice to have:

  • Experience with Grafana
  • Experience with AI/LLM pipelines and their observability
  • Building multi-app monitoring platforms
  • Working in scaled Kubernetes environments (AKS or similar)

Responsibilities

  • Building and maintaining a central operational "control tower" for AI applications and pipelines
  • Designing and implementing monitoring, alerts, and dashboards (signals, thresholds, routing, runbooks)
  • Incident response: triage, coordination, root cause analysis, post-mortems, and preventive measures
  • Standardization of pipeline telemetry (success/failure, latency, throughput, bottlenecks)
  • CI/CD optimization: release quality, automated testing, reliability gates
  • Collaboration with engineering teams to reduce recurring incidents

Offer

  • Private medical care
  • Co-financed sports card
  • Training and learning opportunities
  • Ongoing support from a dedicated consultant
  • Employee referral program
DCG
