Hacker News
First medical X-ray taken in space
Article URL: https://news.mit.edu/2025/3-questions-lonnie-petersen-first-medical-x-ray-taken-in-space-0407
Comments URL: https://news.ycombinator.com/item?id=43620479
Points: 1
# Comments: 0
Comparing GenAI Inference Engines: TensorRT-LLM, vLLM, HF TGI, and LMDeploy
Hey everyone, I’ve been diving into the world of generative AI inference engines for quite some time at NLP Cloud, and I wanted to share some insights from a comparison I put together. I looked at four popular options—NVIDIA’s TensorRT-LLM, vLLM, Hugging Face’s Text Generation Inference (TGI), and LMDeploy—and ran some benchmarks to see how they stack up for real-world use cases. Thought this might spark some discussion here since I know a lot of you are working with LLMs or optimizing inference pipelines:
TensorRT-LLM
------------
NVIDIA’s beast for GPU-accelerated inference. Built on TensorRT, it optimizes models with layer fusion, precision tuning (FP16, INT8, even FP8), and custom CUDA kernels.
Pros: Blazing fast on NVIDIA GPUs—think sub-50ms latency for single requests on an A100 and ~700 tokens/sec at 100 concurrent users for LLaMA-3 70B Q4 (per BentoML benchmarks). Dynamic batching and tight integration with Triton Inference Server make it a throughput monster.
Cons: Setup can be complex if you’re not already in the NVIDIA ecosystem. You need to deal with model compilation, and it’s not super flexible for quick prototyping.
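To make the "precision tuning" point a bit more concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. This is a generic textbook scheme, not TensorRT-LLM's actual implementation; the function names and the sample weights are made up for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the INT8 codes and their shared scale."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only to exercise the functions.
weights = [0.02, -1.27, 0.635, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, which is the accuracy you trade for storing one byte per weight instead of two.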
vLLM
----
Open-source champion for high-throughput inference. Uses PagedAttention to manage KV caches in chunks, cutting memory waste and boosting speed.
Pros: Easy to spin up (pip install, Python-friendly), and it’s flexible—runs on NVIDIA, AMD, even CPU. Throughput is solid (~600-650 tokens/sec at 100 users for LLaMA-3 70B Q4), and dynamic batching keeps it humming. Latency’s decent at 60-80ms solo.
Cons: It’s less optimized for single-request latency, so if you’re building a chatbot with one user at a time, it might not shine as much. Also, it’s still maturing—some edge cases (like exotic model architectures) might not be supported.
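For anyone unfamiliar with the idea behind PagedAttention, here is a toy sketch of block-based KV cache allocation. It only models the bookkeeping (which sequence owns which fixed-size block) and is not vLLM's code; the class name, method names, and block sizes are assumptions for illustration.

```python
class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size KV blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lengths = {}   # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lengths.setdefault(seq_id, 0)
        if length == len(table) * self.block_size:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Allocated but unused token slots across all sequences."""
        return sum(
            len(table) * self.block_size - self.seq_lengths[seq_id]
            for seq_id, table in self.block_tables.items()
        )


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("request-a")
for _ in range(5):
    cache.append_token("request-b")
```

Because memory is handed out one block at a time, the waste per sequence is always less than one block, instead of the gap between the actual length and a preallocated maximum length.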
Hugging Face TGI
----------------
Hugging Face’s production-ready inference tool. Ties into their model hub (BERT, GPT, etc.) and uses Rust for speed, with continuous batching to keep GPUs busy.
Pros: Docker setup is quick, and it scales well. Latency’s 50-70ms, throughput matches vLLM (~600-650 tokens/sec at 100 users). Bonus: built-in output filtering for safety. Perfect if you’re already in the HF ecosystem.
Cons: Less raw speed than TensorRT-LLM, and memory can bloat with big batches. Feels a bit restrictive outside HF’s world.
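To show why continuous batching keeps GPUs busy, here is a small simulation that counts decode steps under static batching versus continuous batching. It is a simplified model (one token per request per step, positive output lengths, no prefill cost), not TGI's scheduler; the function names and the workload are assumptions.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes; nothing joins mid-batch."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """A finished request frees its slot at once, and a waiting request takes it."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token per active request;
        # requests that just emitted their last token leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
        steps += 1
    return steps


# Hypothetical workload: two long generations mixed with short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
```

On this toy workload the static scheduler needs 200 decode steps and the continuous one needs 110, because short requests no longer wait for the long request in their batch to finish.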
LMDeploy
--------
A toolkit from the MMRazor/MMDeploy crew, focused on fast, efficient LLM deployment. It features TurboMind (a high-performance engine) and a PyTorch fallback, with persistent batching and blocked KV caching for speed.
Pros: Decoding speed is nuts—up to 1.8x more requests/sec than vLLM on an A100. TurboMind pushes 4-bit inference 2.4x faster than FP16, hitting ~700 tokens/sec at 100 users (LLaMA-3 70B Q4). Low latency (40-60ms), easy one-command server setup, and it even handles multi-round chats efficiently by caching history.
Cons: TurboMind’s picky—doesn’t support sliding window attention (e.g., Mistral) yet. Non-NVIDIA users get stuck with the slower PyTorch engine. Still, on NVIDIA GPUs, it’s a performance beast.
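As a rough back-of-the-envelope check on why 4-bit weights matter for a 70B model, here is a sketch that estimates weight storage at different precisions. It assumes exactly 70 billion parameters and ignores quantization metadata (scales, zero-points), the KV cache, and activations, so real footprints will be somewhat larger. It also does not explain the 2.4x speedup figure by itself.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e9


params_70b = 70e9  # assumed parameter count for a "70B" model
fp16_gb = weight_memory_gb(params_70b, 16)
int4_gb = weight_memory_gb(params_70b, 4)
```

Under these assumptions the weights take about 140 GB in FP16 and about 35 GB at 4 bits, so each generated token has to read roughly a quarter as many weight bytes from GPU memory.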
What’s your experience with these tools? Any hidden issues I missed? Or are there other inference engines that should be mentioned? Would love to hear your thoughts!
Julien
Comments URL: https://news.ycombinator.com/item?id=43620472
Points: 1
# Comments: 1
Show HN: Badgeify – Add Any App to Your Mac Menu Bar
Article URL: https://badgeify.app/
Comments URL: https://news.ycombinator.com/item?id=43620471
Points: 1
# Comments: 0
Apple Plans to Source More iPhones from India as Potential Tariff Fix
Article URL: https://www.wsj.com/tech/apple-iphone-production-china-tariffs-6cc37f40
Comments URL: https://news.ycombinator.com/item?id=43620458
Points: 1
# Comments: 0
Tuesday Telescope: Does this Milky Way image remind you of Powers of 10?
Article URL: https://arstechnica.com/space/2025/04/tuesday-telescope-the-heart-of-the-galaxy-revealed-in-two-kinds-of-light/
Comments URL: https://news.ycombinator.com/item?id=43620453
Points: 1
# Comments: 0
Meta got caught gaming AI benchmarks
Article URL: https://www.theverge.com/meta/645012/meta-llama-4-maverick-benchmarks-gaming
Comments URL: https://news.ycombinator.com/item?id=43620452
Points: 2
# Comments: 0
Navy SEAL. Harvard Doctor. NASA Astronaut. Don't Tell Mom About This Overachiever
Article URL: https://www.wsj.com/lifestyle/jonny-kim-nasa-astronaut-navy-seal-harvard-doctor-nasa-astronaut-7ad0e523
Comments URL: https://news.ycombinator.com/item?id=43620444
Points: 1
# Comments: 1
Plebiscitary Override in Venezuela: Eroding Democracy Deepening Authoritarianism
Article URL: https://journals.sagepub.com/doi/10.1177/00027162241309709
Comments URL: https://news.ycombinator.com/item?id=43620441
Points: 1
# Comments: 0
Attack of the Quack-Industrial Complex – Paul Krugman
Article URL: https://paulkrugman.substack.com/p/attack-of-the-quack-industrial-complex
Comments URL: https://news.ycombinator.com/item?id=43620437
Points: 1
# Comments: 0
Bug crowd for small startups and vibe coders?
Article URL: https://picklock.47labs.io/
Comments URL: https://news.ycombinator.com/item?id=43620434
Points: 1
# Comments: 1
Why the Ultrarich Are Unplugging from "Smart Homes"
Article URL: https://www.hollywoodreporter.com/lifestyle/real-estate/tech-free-homes-luxury-trend-1236177909/
Comments URL: https://news.ycombinator.com/item?id=43620421
Points: 1
# Comments: 1
FreeDOS 1.4 Released
Article URL: https://freedos.org/download/announce.html
Comments URL: https://news.ycombinator.com/item?id=43620415
Points: 1
# Comments: 0
What if we taxed advertising?
Article URL: https://matthewsinclair.com/blog/0177-what-if-we-taxed-advertising
Comments URL: https://news.ycombinator.com/item?id=43620407
Points: 1
# Comments: 1
UK Home Office loses attempt to keep legal battle with Apple secret
Article URL: https://www.theguardian.com/politics/2025/apr/07/uk-home-office-loses-attempt-to-keep-legal-battle-with-apple-secret
Comments URL: https://news.ycombinator.com/item?id=43620154
Points: 1
# Comments: 0
Show HN: Perry Lage is dead. An AI short story
Article URL: https://show.franzai.com/a/tiny-queen-zebu
Comments URL: https://news.ycombinator.com/item?id=43620144
Points: 1
# Comments: 0
Tailscale has raised $160M
Article URL: https://tailscale.com/blog/series-c
Comments URL: https://news.ycombinator.com/item?id=43620141
Points: 1
# Comments: 0
Go library for generating Anki decks
Article URL: https://github.com/npcnixel/genanki-go
Comments URL: https://news.ycombinator.com/item?id=43620137
Points: 2
# Comments: 0
One-Time Programs (2022)
Article URL: https://blog.cryptographyengineering.com/2022/10/27/one-time-programs/
Comments URL: https://news.ycombinator.com/item?id=43620132
Points: 1
# Comments: 0
LLM-hacker-news: LLM plugin for pulling content from Hacker News
Article URL: https://github.com/simonw/llm-hacker-news
Comments URL: https://news.ycombinator.com/item?id=43620125
Points: 2
# Comments: 0
Where have all the good bloggers gone?
Article URL: https://old.reddit.com/r/slatestarcodex/comments/1js9nfv/where_have_all_the_good_bloggers_gone/
Comments URL: https://news.ycombinator.com/item?id=43620123
Points: 1
# Comments: 1