Senior ML Platform Reliability & Infrastructure Engineer

23 000 - 25 000 PLN/month · Employment contract (gross)
170 - 190 PLN/hour · B2B (net)
Senior · Full-time · Employment contract · B2B
#331226 · Posted 11 days ago · 17
Source: Holisticon Connect

Tech Stack / Keywords

Software Development, Machine Learning, AI, Grafana, Prometheus, Kubernetes, Redis, Networking

Company and role

Holisticon Connect is a division within NEXER GROUP, a custom software development company founded in Poland in 2017 with a team of over 140 people. The team works with world-renowned brands from Scandinavia, the UK, and Western Europe, with a focus on competence growth. The role is part of a highly advanced drug discovery platform team working at the intersection of machine learning, large-scale data systems, and computational science, building core infrastructure for AI-driven drug design.


Requirements

  • 5+ years of experience designing, operating, or scaling multi-service distributed systems in production with strong intuition for failure modes, cascading faults, and capacity planning.
  • Deep, hands-on experience with Kubernetes, specifically GKE or equivalent managed Kubernetes.
  • Comfortable with networking, RBAC, resource management, custom controllers, and debugging pod-level issues such as OOMKilled, CrashLoopBackOff, and scheduling failures.
  • Proven track record building monitoring stacks using Prometheus, Grafana, Loki, and OpenTelemetry; able to define meaningful SLIs, configure alerting to reduce noise, and build dashboards to accelerate incident response.
  • Strong proficiency in Python, including FastAPI services, async GraphQL APIs, ML serving runtimes, and data pipelines.
  • Experience with Infrastructure as Code using Terraform managing cloud resources at scale on GCP or AWS.
  • Working knowledge of GCP services including GKE, Cloud SQL, Secret Manager, and IAM.

Nice to have:

  • Exposure to ML infrastructure including model serving frameworks (MLServer/Seldon, Ray/Anyscale), training pipelines, and model registry patterns (W&B or similar).
  • Experience with message-oriented architectures such as Dapr, Redis Streams, or comparable pub/sub and event-driven patterns.
  • Hands-on experience with workflow orchestration tools like Argo Workflows, Prefect, Airflow, or similar DAG-based pipeline engines.
  • Familiarity with GraphQL APIs using Apollo Server or Strawberry GraphQL in production.
  • Comfortable leading incident response and debugging sessions across distributed components, correlating logs, traces, and metrics under time pressure.

Responsibilities

  • Profile and optimise inference latency and throughput for model-serving runtimes handling high-volume prediction requests behind a routing/gateway layer.
  • Design and implement comprehensive observability across the platform by adding distributed tracing, effective logging, Grafana dashboards, alerting policies, and SLO/SLI frameworks using Prometheus, Loki, and OpenTelemetry.
  • Harden Kubernetes workloads running on GKE by tuning GPU/CPU resource allocation and improving resource scaling.
  • Improve the resilience of asynchronous job pipelines built on Argo Workflows, Dapr pub/sub, and Redis, including retry strategies, dead-letter handling, and backpressure mechanisms.
  • Collaborate with ML engineers and scientists to reduce friction in the model lifecycle from training and registration through to production serving.

Offer

  • Opportunity to work on international projects in cutting-edge industries like Automotive, Biotech, and IoT.
  • Possibility to develop skills in cloud technologies.
  • Respect for private life with no overtime or weekend work.
  • Team Events budget for socializing outside work.
  • Company Events including Summer Party, Programmer's Day, and trips abroad.
  • Fully remote work option or office work in Wrocław.
  • Benefits including Luxmed private healthcare, a Multisport sports subscription, and life insurance with Nationale Nederlanden.
  • Attractive referral system (9,500 PLN for senior referrals).
  • Personal Training Budget with additional paid hours.
  • Passion Day: an extra day off for personal hobbies.
  • Flexible working hours with core hours 9-15 and no micro-management.
  • High-quality work equipment plus 2 additional monitors and accessories.
Flexible hours
Team events
Healthcare
Insurance
Sports card
Paid leave
Holisticon Connect Sp. z o.o.
