Senior ML Engineer - LLM Inference Optimization
120 000 - 160 000 EUR/year, B2B (net)
Senior · Full-time · B2B
#347770 · Added yesterday
Source: nofluffjobs.com
Tech Stack / Keywords
AWS, Python, ML systems
Company and position
Cast AI is building Kimchi, a system that automatically matches workloads to the most cost-efficient, best-performing large language model (LLM) and serving configuration on customer infrastructure, optimizing inference performance and cost.
Requirements
- 5+ years building real ML systems with depth in inference or training infrastructure.
- Strong Python skills for production services.
- Hands-on experience with at least one of vLLM, SGLang, or TensorRT-LLM.
- Understanding of inference engine performance on GPUs.
- Fluency with quantization tradeoffs and measuring quality regressions.
- Comfort with distributed systems including collective communication, sharding strategies, and multi-GPU/multi-node failure modes.
- Bias toward measurement and instrumentation before optimization.
- Self-direction and ability to lead technical direction with wide autonomy.
Responsibilities
- Push throughput through continuous batching, speculative decoding, chunked prefill, and kernel-level tuning across vLLM, SGLang, and TensorRT-LLM.
- Cut latency by profiling and fixing actual bottlenecks such as compute, memory bandwidth, scheduling, and networking.
- Optimize KV cache utilization via paged attention, prefix caching, eviction policies, cache reuse across requests, and quantized KV.
- Quantize models without regressing quality using INT8, INT4, FP8 across weights, activations, and KV, measuring quality on real workloads.
- Reduce cold starts and memory footprint through faster initialization, smarter weight loading, and tighter memory accounting.
- Scale inference across nodes with distributed inference topologies, network-aware placement, and checkpointing strategies.
- Set the technical direction by deciding benchmarks, technology adoption, and internal development, supported by strong writeups and reproducible experiments.
Offer
- Competitive salary depending on experience.
- Flexible, remote-first global work environment.
- Collaboration with a global team of cloud experts.
- Equity options.
- Fast-paced workflow with quick feedback.
- 10% work time for personal projects or self-improvement.
- Learning budget including access to international conferences and courses.
- Annual hackathon.
- Team-building budget and company events.
- Equipment budget.
- Extra days off for work-life balance.
Flexible hours
Employee equity
Training subsidies
Conference budget
Team-building events
CAST AI
Employer