Senior ML Engineer - LLM Inference Optimization

120 000 - 160 000 EUR / year · B2B (net)
Senior · Full-time · B2B
#347770 · Added yesterday
Source: nofluffjobs.com
Apply now

Tech Stack / Keywords

AWS · Python · ML systems

Company and role

Cast AI is building Kimchi, a system that automatically matches each workload to the most cost-efficient, best-performing large language model (LLM) and serving configuration on the customer's infrastructure.


Requirements

  • 5+ years building real ML systems with depth in inference or training infrastructure.
  • Strong Python skills for production services.
  • Hands-on experience with at least one of vLLM, SGLang, or TensorRT-LLM.
  • Understanding of inference engine performance on GPUs.
  • Fluency with quantization tradeoffs and measuring quality regressions.
  • Comfort with distributed systems including collective communication, sharding strategies, and multi-GPU/multi-node failure modes.
  • Bias toward measurement and instrumentation before optimization.
  • Self-direction and ability to lead technical direction with wide autonomy.

Responsibilities

  • Push throughput through continuous batching, speculative decoding, chunked prefill, and kernel-level tuning across vLLM, SGLang, and TensorRT-LLM.
  • Cut latency by profiling and fixing actual bottlenecks such as compute, memory bandwidth, scheduling, and networking.
  • Optimize KV cache utilization via paged attention, prefix caching, eviction policies, cache reuse across requests, and quantized KV.
  • Quantize models without regressing quality using INT8, INT4, FP8 across weights, activations, and KV, measuring quality on real workloads.
  • Reduce cold starts and memory footprint through faster initialization, smarter weight loading, and tighter memory accounting.
  • Scale inference across nodes with distributed inference topologies, network-aware placement, and checkpointing strategies.
  • Set the technical direction by deciding benchmarks, technology adoption, and internal development, supported by strong writeups and reproducible experiments.
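The quantization responsibility above (quantize without regressing quality, and measure it) can be illustrated with a minimal, framework-free sketch. This is purely illustrative and not Cast AI's tooling: real work would use per-channel or per-group scales, FP8/INT4 kernels, and task-level quality evals on production workloads rather than a round-trip error check.

```python
# Hypothetical sketch: per-tensor symmetric INT8 weight quantization
# plus a simple quality-regression check (round-trip error bound).

def quantize_int8(weights):
    """Map floats to INT8 range [-127, 127] using one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from INT8 codes."""
    return [x * scale for x in q]

weights = [0.02, -1.5, 0.73, 3.2, -0.001]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Regression check: per-weight error must stay within half a
# quantization step, i.e. scale / 2 (plus float tolerance).
err = max(abs(a - b) for a, b in zip(weights, restored))
assert err <= scale / 2 + 1e-9
```

In practice the same pattern scales up: quantize, run the real workload, and gate the rollout on a measured quality delta instead of a synthetic error bound.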

Offer

  • Competitive salary depending on experience.
  • Flexible, remote-first global work environment.
  • Collaboration with a global team of cloud experts.
  • Equity options.
  • Fast-paced workflow with quick feedback.
  • 10% work time for personal projects or self-improvement.
  • Learning budget including access to international conferences and courses.
  • Annual hackathon.
  • Team-building budget and company events.
  • Equipment budget.
  • Extra days off for work-life balance.
Flexible hours
Employee equity
Training subsidies
Conference budget
Team-building events
CAST AI

Employer
