Architect of Platform Engineering – AI Supercompute Infrastructure | Cloud & Engineering
Company and Position
We are a technology consulting firm that builds and operates next-generation AI supercompute infrastructure for the world's most ambitious organizations. As Architect of Platform Engineering, you will own the full stack, from bare metal and the operating system up through cluster orchestration, job scheduling, and observability, across engagements with leading enterprise and public-sector clients at the frontier of AI adoption. We are a repeat winner of NVIDIA Consulting Partner of the Year in EMEA and hold one of the deepest, most recognized NVIDIA partnerships in the region. Our Cloud Engineering teams design and deliver cloud projects for clients in Poland and abroad, spanning cloud development, DevOps, integration, migration, data management, and infrastructure.
Requirements
- 8+ years of hands-on infrastructure and platform engineering experience, including full ownership of production systems
- Experience with cluster architecture, control plane operations, custom controllers/operators, multi-tenancy, and large-scale fleet management
- Experience with Slurm or other HPC/AI workload scheduling: job queuing, fair-share scheduling, MPI integration
- Strong Linux internals knowledge: kernel tuning, cgroups, namespaces, NUMA topology, hugepages, and storage subsystems
- Familiarity with high-speed networking: InfiniBand, RoCE, RDMA; tuning for distributed training workloads
- Infrastructure as Code fluency: Terraform, Ansible, Helm or equivalent
- Ability to lead technical engagements with enterprise clients, translating ambiguous requirements into clear deliverables and managing stakeholders
- Entrepreneurial mindset, comfortable operating with autonomy and moving fast without sacrificing rigor
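To give a flavor of the fair-share scheduling named in the requirements: Slurm's multifactor plugin combines weighted, normalized factors into a single job priority. A minimal sketch of that idea in Python, where the weights and factor values are illustrative assumptions, not a real cluster configuration:

```python
# Illustrative sketch of Slurm-style multifactor job priority.
# Weights and factor values below are made-up examples.

def job_priority(age: float, fairshare: float, job_size: float = 0.0,
                 w_age: int = 1000, w_fairshare: int = 10000,
                 w_job_size: int = 100) -> int:
    """Each factor is normalized to [0.0, 1.0]; weights set their relative pull."""
    for f in (age, fairshare, job_size):
        assert 0.0 <= f <= 1.0
    return int(w_age * age + w_fairshare * fairshare + w_job_size * job_size)

# A starved user (high fair-share factor) outranks a long-queued job from a
# heavy user, because fair-share carries the largest weight in this config.
starved = job_priority(age=0.1, fairshare=0.9)      # 100 + 9000 = 9100
heavy_user = job_priority(age=0.8, fairshare=0.1)   # 800 + 1000 = 1800
```

Tuning the weight ratio is exactly the kind of scheduling-policy decision this role owns per client.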
Bonus Experience:
- Proven experience managing NVIDIA GPU infrastructure: driver lifecycle, CUDA toolchain, MIG/MPS partitioning, NVLink/NVSwitch topologies, and GPUDirect RDMA
- Familiarity with NVIDIA Base Command Platform, DGX SuperPOD, or CSP GPU cloud deployments
- Experience with DCGM or other GPU profiling and telemetry tooling
- Prior consulting, professional services, or client delivery experience in infrastructure or cloud practice
- Contributions to open-source platform tooling or CNCF ecosystem projects
Responsibilities
Cluster Orchestration:
- Design, deploy, and operate Kubernetes and Slurm clusters at scale across client environments
- Own the full lifecycle from provisioning to decommission, including upgrades, rollbacks, and capacity planning
Operating System Layer:
- Own OS hardening, kernel tuning, driver management (NVIDIA CUDA, OFED, MIG), and node lifecycle automation across heterogeneous GPU fleets
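As a concrete slice of the kernel-tuning work above, node automation often verifies hugepage provisioning by parsing `/proc/meminfo`. A small sketch, using hypothetical sample text rather than a live node:

```python
# Illustrative sketch: checking hugepage provisioning from
# /proc/meminfo-style output. SAMPLE is hypothetical; on a real node
# you would read /proc/meminfo itself.

SAMPLE = """\
MemTotal:       2097152000 kB
HugePages_Total:    4096
HugePages_Free:     1024
Hugepagesize:       2048 kB
"""

def meminfo_fields(text: str) -> dict[str, int]:
    """Map each meminfo key to its integer value (units dropped)."""
    out = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        out[key.strip()] = int(rest.split()[0])
    return out

fields = meminfo_fields(SAMPLE)
# Reserved hugepage memory in GiB: pages * page size (kB) / 2**20
hugepage_gib = fields["HugePages_Total"] * fields["Hugepagesize"] / 2**20
```

A check like this typically gates node readiness before a GPU node is admitted to the scheduler.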
Monitoring & Observability:
- Build and evolve monitoring, alerting, and telemetry stacks (Prometheus, Grafana, DCGM Exporter, OpenTelemetry) to deliver deep visibility into cluster health, GPU utilization, and job performance
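The GPU-utilization visibility described above usually flows through the DCGM exporter's Prometheus metrics. A sketch of summarizing such telemetry, where the sample lines are invented (the metric name `DCGM_FI_DEV_GPU_UTIL` follows the exporter's convention, but labels on a real fleet will differ):

```python
# Illustrative sketch: summarizing per-GPU utilization from
# Prometheus exposition-format text. SAMPLE is hypothetical.

SAMPLE = """\
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 12
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-b"} 88
"""

def gauge_values(text: str, metric: str) -> list[float]:
    """Collect the sample values of one gauge metric."""
    values = []
    for line in text.splitlines():
        if line.startswith(metric + "{"):
            values.append(float(line.rsplit(" ", 1)[1]))
    return values

utils = gauge_values(SAMPLE, "DCGM_FI_DEV_GPU_UTIL")
idle = [u for u in utils if u < 50]   # under-utilized GPUs to investigate
avg = sum(utils) / len(utils)
```

In practice the same question is asked of Prometheus directly via a query, but the parsing above shows what the raw exporter scrape contains.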
Reliability Engineering:
- Define SLOs, drive postmortem culture, and lead incident response for production AI compute infrastructure
- Treat reliability as a systemic, architectural property
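The SLO work above rests on simple error-budget arithmetic: an availability target over a window yields an allowed-downtime budget that incidents draw down. A minimal sketch (the 30-day window and incident length are illustrative):

```python
# Error-budget arithmetic behind availability SLOs.
# A 99.9% target over 30 days allows ~43.2 minutes of downtime.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

budget = error_budget_minutes(0.999)   # ~43.2 minutes per 30 days
remaining = budget - 30.0              # after one 30-minute incident
```

When `remaining` approaches zero, the usual response is to trade feature velocity for reliability work until the budget recovers.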
Platform Strategy & Advisory:
- Translate complex technical requirements into platform roadmaps and architectural recommendations tailored to each client's business context and maturity level
- Define how demanding AI compute environments are built and operated across a portfolio of clients
- Build a platform engineering practice with broad impact and technical depth
What We Offer
- Flexible working hours
- Permanent employment or contract
- Medical and health insurance
- Multisport and other lifestyle benefits
- Language courses
- Friendly coworkers and team spirit
- Multiple geographies and clients
- Work for well-known brands
- Exposure to trailblazing business and technology projects
- Opportunities to influence business operations
- Development path tailored to individual needs
Deloitte