Architect of Platform Engineering – AI Supercompute Infrastructure | Cloud & Engineering
Company and Position
We are a technology consulting firm that builds and operates next-generation AI supercompute infrastructure for the world's most ambitious organizations. As Architect of Platform Engineering, you will own the full stack, from bare metal and the operating system up through cluster orchestration, job scheduling, and observability, across engagements with leading enterprise and public-sector clients at the frontier of AI adoption. We are a repeat winner of NVIDIA Consulting Partner of the Year in EMEA and hold one of the deepest, most recognized NVIDIA partnerships in the region. Our Cloud Engineering teams design and deliver cloud projects for clients in Poland and abroad, spanning cloud development, DevOps, integration, migration, data management, and infrastructure.
Requirements
- 8+ years of hands-on infrastructure and platform engineering experience, including full ownership of production systems
- Experience with cluster architecture, control plane operations, custom controllers/operators, multi-tenancy, and large-scale fleet management
- Experience with Slurm or other HPC/AI workload scheduling: job queuing, fair-share scheduling, MPI integration
- Strong Linux internals knowledge: kernel tuning, cgroups, namespaces, NUMA topology, hugepages, and storage subsystems
- Familiarity with high-speed networking: InfiniBand, RoCE, RDMA; tuning for distributed training workloads
- Infrastructure as Code fluency: Terraform, Ansible, Helm or equivalent
- Ability to lead technical engagements with enterprise clients, translating ambiguous requirements into clear deliverables and managing stakeholders
- Entrepreneurial mindset, comfortable operating with autonomy and moving fast without sacrificing rigor
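To give a flavor of the fair-share scheduling named in the requirements: Slurm's multifactor plugin combines weighted, normalized factors into a single job priority. A minimal sketch of that idea in Python, where the weights and factor values are illustrative assumptions, not a real cluster configuration:

```python
# Illustrative sketch of Slurm-style multifactor job priority.
# Weights and factor values below are made-up examples.

def job_priority(age: float, fairshare: float, job_size: float = 0.0,
                 w_age: int = 1000, w_fairshare: int = 10000,
                 w_job_size: int = 100) -> int:
    """Each factor is normalized to [0.0, 1.0]; weights set their relative pull."""
    for f in (age, fairshare, job_size):
        assert 0.0 <= f <= 1.0
    return int(w_age * age + w_fairshare * fairshare + w_job_size * job_size)

# A starved user (high fair-share factor) outranks a long-queued job from a
# heavy user, because fair-share carries the largest weight in this config.
starved = job_priority(age=0.1, fairshare=0.9)      # 100 + 9000 = 9100
heavy_user = job_priority(age=0.8, fairshare=0.1)   # 800 + 1000 = 1800
```

Tuning the weight ratio is exactly the kind of scheduling-policy decision this role owns per client.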
Bonus Experience:
- Proven experience managing NVIDIA GPU infrastructure: driver lifecycle, CUDA toolchain, MIG/MPS partitioning, NVLink/NVSwitch topologies, and GPUDirect RDMA
- Familiarity with NVIDIA Base Command Platform, DGX SuperPOD, or CSP GPU cloud deployments
- Experience with DCGM or other GPU profiling and telemetry tooling
- Prior consulting, professional services, or client delivery experience in infrastructure or cloud practice
- Contributions to open-source platform tooling or CNCF ecosystem projects
Responsibilities
Cluster Orchestration:
- Design, deploy, and operate Kubernetes and Slurm clusters at scale across client environments
- Own the full lifecycle from provisioning to decommission, including upgrades, rollbacks, and capacity planning
Operating System Layer:
- Own OS hardening, kernel tuning, driver management (NVIDIA CUDA, OFED, MIG), and node lifecycle automation across heterogeneous GPU fleets
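As a concrete slice of the kernel-tuning work above, node automation often verifies hugepage provisioning by parsing `/proc/meminfo`. A small sketch, using hypothetical sample text rather than a live node:

```python
# Illustrative sketch: checking hugepage provisioning from
# /proc/meminfo-style output. SAMPLE is hypothetical; on a real node
# you would read /proc/meminfo itself.

SAMPLE = """\
MemTotal:       2097152000 kB
HugePages_Total:    4096
HugePages_Free:     1024
Hugepagesize:       2048 kB
"""

def meminfo_fields(text: str) -> dict[str, int]:
    """Map each meminfo key to its integer value (units dropped)."""
    out = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        out[key.strip()] = int(rest.split()[0])
    return out

fields = meminfo_fields(SAMPLE)
# Reserved hugepage memory in GiB: pages * page size (kB) / 2**20
hugepage_gib = fields["HugePages_Total"] * fields["Hugepagesize"] / 2**20
```

A check like this typically gates node readiness before a GPU node is admitted to the scheduler.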
Monitoring & Observability:
- Build and evolve monitoring, alerting, and telemetry stacks (Prometheus, Grafana, DCGM Exporter, OpenTelemetry) to deliver deep visibility into cluster health, GPU utilization, and job performance
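The GPU-utilization visibility described above usually flows through the DCGM exporter's Prometheus metrics. A sketch of summarizing such telemetry, where the sample lines are invented (the metric name `DCGM_FI_DEV_GPU_UTIL` follows the exporter's convention, but labels on a real fleet will differ):

```python
# Illustrative sketch: summarizing per-GPU utilization from
# Prometheus exposition-format text. SAMPLE is hypothetical.

SAMPLE = """\
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 12
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-b"} 88
"""

def gauge_values(text: str, metric: str) -> list[float]:
    """Collect the sample values of one gauge metric."""
    values = []
    for line in text.splitlines():
        if line.startswith(metric + "{"):
            values.append(float(line.rsplit(" ", 1)[1]))
    return values

utils = gauge_values(SAMPLE, "DCGM_FI_DEV_GPU_UTIL")
idle = [u for u in utils if u < 50]   # under-utilized GPUs to investigate
avg = sum(utils) / len(utils)
```

In practice the same question is asked of Prometheus directly via a query, but the parsing above shows what the raw exporter scrape contains.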
Reliability Engineering:
- Define SLOs, drive postmortem culture, and lead incident response for production AI compute infrastructure
- Treat reliability as a systemic, architectural property
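The SLO work above rests on simple error-budget arithmetic: an availability target over a window yields an allowed-downtime budget that incidents draw down. A minimal sketch (the 30-day window and incident length are illustrative):

```python
# Error-budget arithmetic behind availability SLOs.
# A 99.9% target over 30 days allows ~43.2 minutes of downtime.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

budget = error_budget_minutes(0.999)   # ~43.2 minutes per 30 days
remaining = budget - 30.0              # after one 30-minute incident
```

When `remaining` approaches zero, the usual response is to trade feature velocity for reliability work until the budget recovers.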
Platform Strategy & Advisory:
- Translate complex technical requirements into platform roadmaps and architectural recommendations tailored to each client's business context and maturity level
- Define how demanding AI compute environments are built and operated across a portfolio of clients
- Build a platform engineering practice with broad impact and technical depth
What We Offer
- Flexible working hours
- Permanent employment or contract
- Medical and health insurance
- Multisport and other lifestyle benefits
- Language courses
- Friendly coworkers and team spirit
- Multiple geographies and clients
- Work for well-known brands
- Exposure to trailblazing business and technology projects
- Opportunities to influence business operations
- Development path tailored to individual needs
Deloitte