Senior Site Reliability Engineer
No salary information
Senior · Full-time
#326674 · Posted 19 days ago · 22
Source: nofluffjobs.com
Tech Stack / Keywords
Datadog, AI, Kubernetes, Grafana, Copilot
Requirements
- Proactive and self-driven; identifies problems, risks, and opportunities for improvement independently
- Engaged owner mindset; treats system stability as end-to-end responsibility
- Hands-on engineer; regularly works with clusters, pipelines, monitoring, and code
- AI-native; uses AI tools extensively every day (Copilot, LLMs, automation, analytics, debugging, documentation) and understands the impact of AI on system design and maintenance
- Comfortable working in a dynamic environment with immature processes
- Experience with Azure DevOps (Boards, Repos, Pipelines)
- Strong knowledge of Kubernetes, including troubleshooting, scaling, and production operations
- Proficiency in Datadog (metrics, logs, dashboards, alerting)
- Experience with Azure Portal for environment operations and configuration
- Strong knowledge of CI/CD practices, including pipeline optimization, testing, and quality gates
- 5+ years of experience as an SRE / Production / Platform Engineer
- Proven experience in production environments
- Strong knowledge of incident management and root cause analysis (RCA)
- Ability to build practical monitoring systems
- Very good command of English, both spoken and written
Nice to have:
- Experience with Grafana
- Experience with AI/LLM pipelines and their observability
- Building multi-app monitoring platforms
- Working in scaled Kubernetes environments (AKS or similar)
Responsibilities
- Building and maintaining a central operational "control tower" for AI applications and pipelines
- Designing and implementing monitoring, alerts, and dashboards (signals, thresholds, routing, runbooks)
- Incident response: triage, coordination, root cause analysis, post-mortems, and preventive measures
- Standardization of pipeline telemetry (success/failure, latency, throughput, bottlenecks)
- CI/CD optimization – release quality, automated testing, reliability gates
- Collaboration with engineering teams to reduce the number of recurring incidents
Offer
- Private medical care
- Co-financed sports card
- Training & learning opportunities
- Ongoing support from a dedicated consultant
- Employee referral program
DCG