Senior ML Platform Reliability & Infrastructure Engineer
23 000 - 25 000 PLN/month · Employment contract (gross)
170 - 190 PLN/hour · B2B (net)
Senior · Full-time · Employment contract · B2B
#331226 · Posted 11 days ago
Source: Holisticon Connect
Tech Stack / Keywords
Software Development, Machine Learning, AI, Grafana, Prometheus, Kubernetes, Redis, Networking
Company and role
Holisticon Connect is a division within NEXER GROUP, a custom software development company founded in Poland in 2017 and employing over 140 people. The team works with world-renowned brands from Scandinavia, the UK, and Western Europe, with a focus on competence growth. The role is part of a highly advanced drug discovery platform team working at the intersection of machine learning, large-scale data systems, and computational science, building core infrastructure for AI-driven drug design.
Requirements
- 5+ years of experience designing, operating, or scaling multi-service distributed systems in production with strong intuition for failure modes, cascading faults, and capacity planning.
- Deep, hands-on experience with Kubernetes, specifically GKE or equivalent managed Kubernetes.
- Comfortable with networking, RBAC, resource management, custom controllers, and debugging pod-level issues such as OOMKilled, CrashLoopBackOff, and scheduling failures.
- Proven track record building monitoring stacks using Prometheus, Grafana, Loki, and OpenTelemetry; able to define meaningful SLIs, configure alerting to reduce noise, and build dashboards to accelerate incident response.
- Strong proficiency in Python, including FastAPI services, async GraphQL APIs, ML serving runtimes, and data pipelines.
- Experience with Infrastructure as Code using Terraform managing cloud resources at scale on GCP or AWS.
- Working knowledge of GCP services including GKE, Cloud SQL, Secret Manager, and IAM.
Nice to have:
- Exposure to ML infrastructure including model serving frameworks (MLServer/Seldon, Ray/Anyscale), training pipelines, and model registry patterns (W&B or similar).
- Experience with message-oriented architectures such as Dapr, Redis Streams, or comparable pub/sub and event-driven patterns.
- Hands-on experience with workflow orchestration tools like Argo Workflows, Prefect, Airflow, or similar DAG-based pipeline engines.
- Familiarity with GraphQL APIs using Apollo Server or Strawberry GraphQL in production.
- Comfortable leading incident response and debugging sessions across distributed components, correlating logs, traces, and metrics under time pressure.
Responsibilities
- Profile and optimise inference latency and throughput for model-serving runtimes handling high-volume prediction requests behind a routing/gateway layer.
- Design and implement comprehensive observability across the platform by adding distributed tracing, effective logging, Grafana dashboards, alerting policies, and SLO/SLI frameworks using Prometheus, Loki, and OpenTelemetry.
- Harden Kubernetes workloads running on GKE by tuning GPU/CPU resource allocation and improving resource scaling.
- Improve the resilience of asynchronous job pipelines built on Argo Workflows, Dapr pub/sub, and Redis, including retry strategies, dead-letter handling, and backpressure mechanisms.
- Collaborate with ML engineers and scientists to reduce friction in the model lifecycle from training and registration through to production serving.
What we offer
- Opportunity to work on international projects in cutting-edge industries like Automotive, Biotech, and IoT.
- Possibility to develop skills in cloud technologies.
- Respect for private life with no overtime or weekend work.
- Team Events budget for socializing outside work.
- Company Events including Summer Party, Programmer's Day, and trips abroad.
- Fully remote work option or office work in Wrocław.
- Benefits including Luxmed private healthcare, Multisport sport subscription, and life insurance with Nationale Nederlanden.
- Attractive referral system (9,500 PLN for senior referrals).
- Personal Training Budget with additional paid hours.
- Passion Day: an extra day off for personal hobbies.
- Flexible working hours with core hours 9-15 and no micro-management.
- High-quality work equipment plus 2 additional monitors and accessories.
Flexible hours
Team events
Healthcare
Insurance
Sports card
Paid leave
Holisticon Connect Sp. z o.o.
27 active job offers