Senior ML Engineer - LLM Inference Optimization

120 000 - 160 000 EUR / year · B2B (net)
Senior · Full-time · B2B
#347770 · Added yesterday
Source: nofluffjobs.com
Apply now

Tech Stack / Keywords

AWS · Python · ML systems

Company and role

Cast AI is building Kimchi, a system that automatically matches each workload to the most cost-efficient, best-performing large language model (LLM) and serving configuration on the customer's infrastructure.


Requirements

  • 5+ years building real ML systems with depth in inference or training infrastructure.
  • Strong Python skills for production services.
  • Hands-on experience with at least one of vLLM, SGLang, or TensorRT-LLM.
  • Understanding of inference engine performance on GPUs.
  • Fluency with quantization tradeoffs and measuring quality regressions.
  • Comfort with distributed systems including collective communication, sharding strategies, and multi-GPU/multi-node failure modes.
  • Bias toward measurement and instrumentation before optimization.
  • Self-direction and ability to lead technical direction with wide autonomy.

Responsibilities

  • Push throughput through continuous batching, speculative decoding, chunked prefill, and kernel-level tuning across vLLM, SGLang, and TensorRT-LLM.
  • Cut latency by profiling and fixing actual bottlenecks such as compute, memory bandwidth, scheduling, and networking.
  • Optimize KV cache utilization via paged attention, prefix caching, eviction policies, cache reuse across requests, and quantized KV.
  • Quantize models without regressing quality using INT8, INT4, FP8 across weights, activations, and KV, measuring quality on real workloads.
  • Reduce cold starts and memory footprint through faster initialization, smarter weight loading, and tighter memory accounting.
  • Scale inference across nodes with distributed inference topologies, network-aware placement, and checkpointing strategies.
  • Set the technical direction by deciding benchmarks, technology adoption, and internal development, supported by strong writeups and reproducible experiments.
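The quantization responsibility above (quantize without regressing quality, and measure it) can be illustrated with a minimal, framework-free sketch. This is purely illustrative and not Cast AI's tooling: real work would use per-channel or per-group scales, FP8/INT4 kernels, and task-level quality evals on production workloads rather than a round-trip error check.

```python
# Hypothetical sketch: per-tensor symmetric INT8 weight quantization
# plus a simple quality-regression check (round-trip error bound).

def quantize_int8(weights):
    """Map floats to INT8 range [-127, 127] using one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from INT8 codes."""
    return [x * scale for x in q]

weights = [0.02, -1.5, 0.73, 3.2, -0.001]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Regression check: per-weight error must stay within half a
# quantization step, i.e. scale / 2 (plus float tolerance).
err = max(abs(a - b) for a, b in zip(weights, restored))
assert err <= scale / 2 + 1e-9
```

In practice the same pattern scales up: quantize, run the real workload, and gate the rollout on a measured quality delta instead of a synthetic error bound.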

Offer

  • Competitive salary depending on experience.
  • Flexible, remote-first global work environment.
  • Collaboration with a global team of cloud experts.
  • Equity options.
  • Fast-paced workflow with quick feedback.
  • 10% work time for personal projects or self-improvement.
  • Learning budget including access to international conferences and courses.
  • Annual hackathon.
  • Team-building budget and company events.
  • Equipment budget.
  • Extra days off for work-life balance.
Flexible hours
Employee equity
Training subsidies
Conference budget
Team-building events
CAST AI

Employer
