Senior ML Platform Reliability & Infrastructure Engineer

170-190 PLN/hour, B2B (net)

23,000-25,000 PLN/month, employment contract (gross)

Senior · Full-time · B2B · Employment contract

#331102 · Posted 12 days ago

Source: nofluffjobs.com

Tech Stack / Keywords

Python, Kubernetes, Grafana, Prometheus, AWS, OpenTelemetry, Terraform, Loki, GKE, IAM, GraphQL, Argo, Airflow, Dapr, Redis

Company and position

Holisticon Connect is a division within NEXER GROUP, a custom software development company founded in Poland in 2017, with a team of over 140 people. The team works on a highly advanced drug discovery platform at the intersection of machine learning, large-scale data systems, and computational science, building core infrastructure for AI-driven drug design to support new medicine discovery.

Requirements

5+ years of experience designing, operating, or scaling distributed multi-service architectures in production.
Deep, hands-on experience with Kubernetes, especially GKE or equivalent managed Kubernetes.
Comfortable with networking, RBAC, resource management, custom controllers, and debugging pod-level issues (OOMKilled, CrashLoopBackOff, scheduling failures).
Proven track record building monitoring stacks using Prometheus, Grafana, Loki, and OpenTelemetry.
Strong proficiency in Python, including FastAPI services, async GraphQL APIs, ML serving runtimes, and data pipelines.
Experience with Infrastructure as Code, using Terraform to manage cloud resources at scale on GCP or AWS.
Working knowledge of GCP services: GKE, Cloud SQL, Secret Manager, IAM.

Nice to have:

Exposure to ML infrastructure including model serving frameworks (MLServer/Seldon, Ray/Anyscale), training pipelines, and model registry patterns (W&B or similar).
Experience with message-oriented architectures such as Dapr, Redis Streams, or comparable pub/sub and event-driven patterns.
Hands-on experience with workflow orchestration tools like Argo Workflows, Prefect, Airflow, or similar DAG-based pipeline engines.
Familiarity with GraphQL APIs using Apollo Server or Strawberry GraphQL in production.
Comfortable leading incident response and debugging sessions under time pressure.

Responsibilities

Profile and optimise inference latency and throughput for model-serving runtimes handling high-volume prediction requests behind a routing/gateway layer.
Design and implement comprehensive observability across the platform by adding distributed tracing, effective logging, Grafana dashboards, alerting policies, and SLO/SLI frameworks using Prometheus, Loki, and OpenTelemetry.
Harden Kubernetes workloads running on GKE by tuning GPU/CPU resource allocation and improving resource scaling.
Improve the resilience of asynchronous job pipelines built on Argo Workflows, Dapr pub/sub, and Redis, including retry strategies, dead-letter handling, and backpressure mechanisms.
Collaborate with ML engineers and scientists to reduce friction in the model lifecycle from training and registration through to production serving.
Lead debugging sessions across distributed components, correlating logs, traces, and metrics to identify root cause under time pressure.

Offer

Fully remote work or office option in Wrocław.
Free benefits including Luxmed private healthcare, Multisport sport subscription, and life insurance with Nationale Nederlanden.
Attractive referral system (9,500 PLN for senior, 6,000 PLN for mid, 2,500 PLN for junior).
Personal training budget with additional paid hours.
Passion Day: an extra day off for personal hobbies.
Flexible working hours with core hours 9-15 and no micro-management.
High-quality work equipment plus 2 additional monitors and accessories.
Team events budget and company events including Summer Party, Programmer's Day, and trips abroad.

Healthcare

Sports card

Insurance

Training subsidy

Flexible hours

Team events

Holisticon Connect Sp. z o.o.

27 active offers
