Senior ML Platform Reliability & Infrastructure Engineer

170 - 190 PLN/hour, B2B (net)
23 000 - 25 000 PLN/month, employment contract (gross)
Senior · Full-time · B2B · Employment contract
#331102 · Added 12 days ago · 21
Source: nofluffjobs.com

Tech Stack / Keywords

Python, Kubernetes, Grafana, Prometheus, AWS, OpenTelemetry, Terraform, Loki, GKE, IAM, GraphQL, Argo, Airflow, Dapr, Redis

Company and position

Holisticon Connect is a division within NEXER GROUP, a custom software development company started in Poland in 2017 with over 140 people. The team works on a highly advanced drug discovery platform at the intersection of machine learning, large-scale data systems, and computational science, building core infrastructure for AI-driven drug design to support new medicine discovery.


Requirements

  • 5+ years of experience designing, operating, or scaling distributed multi-service architectures in production.
  • Deep, hands-on experience with Kubernetes, especially GKE or equivalent managed Kubernetes.
  • Comfortable with networking, RBAC, resource management, custom controllers, and debugging pod-level issues (OOMKilled, CrashLoopBackOff, scheduling failures).
  • Proven track record building monitoring stacks using Prometheus, Grafana, Loki, and OpenTelemetry.
  • Strong proficiency in Python, including FastAPI services, async GraphQL APIs, ML serving runtimes, and data pipelines.
  • Experience with Infrastructure as Code, using Terraform to manage cloud resources at scale on GCP or AWS.
  • Working knowledge of GCP services: GKE, Cloud SQL, Secret Manager, IAM.

Nice to have:

  • Exposure to ML infrastructure including model serving frameworks (MLServer/Seldon, Ray/Anyscale), training pipelines, and model registry patterns (W&B or similar).
  • Experience with message-oriented architectures such as Dapr, Redis Streams, or comparable pub/sub and event-driven patterns.
  • Hands-on experience with workflow orchestration tools like Argo Workflows, Prefect, Airflow, or similar DAG-based pipeline engines.
  • Familiarity with GraphQL APIs using Apollo Server or Strawberry GraphQL in production.
  • Comfortable leading incident response and debugging sessions under time pressure.

Responsibilities

  • Profile and optimise inference latency and throughput for model-serving runtimes handling high-volume prediction requests behind a routing/gateway layer.
  • Design and implement comprehensive observability across the platform by adding distributed tracing, effective logging, Grafana dashboards, alerting policies, and SLO/SLI frameworks using Prometheus, Loki, and OpenTelemetry.
  • Harden Kubernetes workloads running on GKE by tuning GPU/CPU resource allocation and improving resource scaling.
  • Improve the resilience of asynchronous job pipelines built on Argo Workflows, Dapr pub/sub, and Redis, including retry strategies, dead-letter handling, and backpressure mechanisms.
  • Collaborate with ML engineers and scientists to reduce friction in the model lifecycle from training and registration through to production serving.
  • Lead debugging sessions across distributed components, correlating logs, traces, and metrics to identify root cause under time pressure.

Offer

  • Fully remote work or office option in Wrocław.
  • Free benefits including Luxmed private healthcare, Multisport sport subscription, and life insurance with Nationale Nederlanden.
  • Attractive referral system (9,500 PLN for senior, 6,000 PLN for mid, 2,500 PLN for junior).
  • Personal training budget with additional paid hours.
  • Passion Day: an extra day off for personal hobbies.
  • Flexible working hours with core hours 9-15 and no micro-management.
  • High-quality work equipment plus 2 additional monitors and accessories.
  • Team events budget and company events including Summer Party, Programmer's Day, and trips abroad.
Holisticon Connect Sp. z o.o.
