Hacker News

Comparing GenAI Inference Engines: TensorRT-LLM, vLLM, HF TGI, and LMDeploy

Hacker News - Tue, 04/08/2025 - 7:32am

Hey everyone, I’ve been diving into the world of generative AI inference engines for quite some time at NLP Cloud, and I wanted to share some insights from a comparison I put together. I looked at four popular options—NVIDIA’s TensorRT-LLM, vLLM, Hugging Face’s Text Generation Inference (TGI), and LMDeploy—and ran some benchmarks to see how they stack up for real-world use cases. Thought this might spark some discussion here since I know a lot of you are working with LLMs or optimizing inference pipelines:

TensorRT-LLM

------------

NVIDIA’s beast for GPU-accelerated inference. Built on TensorRT, it optimizes models with layer fusion, precision tuning (FP16, INT8, even FP8), and custom CUDA kernels.

Pros: Blazing fast on NVIDIA GPUs—think sub-50ms latency for single requests on an A100 and ~700 tokens/sec at 100 concurrent users for LLaMA-3 70B Q4 (per BentoML benchmarks). Dynamic batching and tight integration with Triton Inference Server make it a throughput monster.

Cons: Setup can be complex if you’re not already in the NVIDIA ecosystem. You need to deal with model compilation, and it’s not super flexible for quick prototyping.
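For context, here's roughly what that looks like with the high-level Python LLM API that recent tensorrt_llm releases ship (a minimal sketch, not a recipe; the model id is a placeholder, and the engine compilation mentioned above happens under the hood on first load):

    # Sketch of TensorRT-LLM's high-level LLM API (recent releases).
    # The TensorRT engine is compiled from the checkpoint on first load.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct")  # placeholder model id
    params = SamplingParams(temperature=0.8, top_p=0.95)

    outputs = llm.generate(["Explain layer fusion in one paragraph."], params)
    print(outputs[0].outputs[0].text)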

vLLM

----

Open-source champion for high-throughput inference. Uses PagedAttention to manage KV caches in chunks, cutting memory waste and boosting speed.

Pros: Easy to spin up (pip install, Python-friendly), and it’s flexible—runs on NVIDIA, AMD, even CPU. Throughput is solid (~600-650 tokens/sec at 100 users for LLaMA-3 70B Q4), and dynamic batching keeps it humming. Latency’s decent at 60-80ms solo.

Cons: It’s less optimized for single-request latency, so if you’re building a chatbot with one user at a time, it might not shine as much. Also, it’s still maturing—some edge cases (like exotic model architectures) might not be supported.
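If you want to kick the tires, the offline batch API is about as minimal as it gets (sketch; swap in whatever model you're actually testing):

    # vLLM offline inference: PagedAttention and batching happen internally.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct")  # placeholder model id
    params = SamplingParams(temperature=0.7, max_tokens=128)

    # Prompts passed together are batched automatically for throughput.
    outputs = llm.generate(["What is continuous batching?"], params)
    print(outputs[0].outputs[0].text)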

Hugging Face TGI

----------------

Hugging Face’s production-ready inference tool. It ties into the Hugging Face Hub’s text-generation models (Llama, Mistral, Falcon, etc.) and uses Rust for speed, with continuous batching to keep GPUs busy.

Pros: Docker setup is quick, and it scales well. Latency’s 50-70ms, throughput matches vLLM (~600-650 tokens/sec at 100 users). Bonus: built-in output filtering for safety. Perfect if you’re already in the HF ecosystem.

Cons: Less raw speed than TensorRT-LLM, and memory can bloat with big batches. Feels a bit restrictive outside HF’s world.
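The usual pattern is to start the server with Docker and hit its HTTP API; a rough sketch (image tag, port, and model id are all placeholders):

    # Query a running TGI server over HTTP. Assumes it was started with
    # something like:
    #   docker run --gpus all -p 8080:80 \
    #     ghcr.io/huggingface/text-generation-inference \
    #     --model-id meta-llama/Meta-Llama-3-70B-Instruct
    import requests

    resp = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": "What is continuous batching?",
            "parameters": {"max_new_tokens": 128, "temperature": 0.7},
        },
    )
    print(resp.json()["generated_text"])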

LMDeploy

--------

A toolkit from the MMRazor/MMDeploy crew, focused on fast, efficient LLM deployment. It features TurboMind (a high-performance engine) and a PyTorch fallback, with persistent batching and blocked KV caching for speed.

Pros: Decoding speed is nuts—up to 1.8x more requests/sec than vLLM on an A100. TurboMind pushes 4-bit inference 2.4x faster than FP16, hitting ~700 tokens/sec at 100 users (LLaMA-3 70B Q4). Low latency (40-60ms), easy one-command server setup, and it even handles multi-round chats efficiently by caching history.

Cons: TurboMind’s picky—doesn’t support sliding window attention (e.g., Mistral) yet. Non-NVIDIA users get stuck with the slower PyTorch engine. Still, on NVIDIA GPUs, it’s a performance beast.
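The pipeline API is the quickest way to see TurboMind in action (sketch; the model id is a placeholder, and the 4-bit path needs a pre-quantized checkpoint):

    # LMDeploy pipeline: uses the TurboMind engine when the model is
    # supported, otherwise falls back to the PyTorch engine.
    # One-command server alternative: lmdeploy serve api_server <model>
    from lmdeploy import pipeline

    pipe = pipeline("internlm/internlm2_5-7b-chat")  # placeholder model id
    responses = pipe(["What is blocked KV caching?"])
    print(responses[0].text)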

What’s your experience with these tools? Any hidden issues I missed? Or are there other inference engines that should be mentioned? Would love to hear your thoughts!

Julien

Comments URL: https://news.ycombinator.com/item?id=43620472

Points: 1

# Comments: 1

Categories: Hacker News

Show HN: Badgeify – Add Any App to Your Mac Menu Bar

Hacker News - Tue, 04/08/2025 - 7:32am

Article URL: https://badgeify.app/

Comments URL: https://news.ycombinator.com/item?id=43620471

Points: 1

# Comments: 0

Categories: Hacker News

Bug crowd for small startups and vibe coders?

Hacker News - Tue, 04/08/2025 - 7:27am

Article URL: https://picklock.47labs.io/

Comments URL: https://news.ycombinator.com/item?id=43620434

Points: 1

# Comments: 1

Categories: Hacker News

FreeDOS 1.4 Released

Hacker News - Tue, 04/08/2025 - 7:24am
Categories: Hacker News

Tailscale has raised $160M

Hacker News - Tue, 04/08/2025 - 6:36am

Article URL: https://tailscale.com/blog/series-c

Comments URL: https://news.ycombinator.com/item?id=43620141

Points: 1

# Comments: 0

Categories: Hacker News

Go library for generating Anki decks

Hacker News - Tue, 04/08/2025 - 6:34am
Categories: Hacker News
