vLLM
18.9k 2.5kWhat is vLLM ?
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM Features
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM
- Optimized CUDA kernels
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support NVIDIA GPUs and AMD GPUs
vLLM seamlessly supports many Hugging Face models, including the following architectures:
- Aquila & Aquila2 (
BAAI/AquilaChat2-7B
,BAAI/AquilaChat2-34B
,BAAI/Aquila-7B
,BAAI/AquilaChat-7B
, etc.) - Baichuan & Baichuan2 (
baichuan-inc/Baichuan2-13B-Chat
,baichuan-inc/Baichuan-7B
, etc.) - BLOOM (
bigscience/bloom
,bigscience/bloomz
, etc.) - ChatGLM (
THUDM/chatglm2-6b
,THUDM/chatglm3-6b
, etc.) - DeciLM (
Deci/DeciLM-7B
,Deci/DeciLM-7B-instruct
, etc.) - Falcon (
tiiuae/falcon-7b
,tiiuae/falcon-40b
,tiiuae/falcon-rw-7b
, etc.) - GPT-2 (
gpt2
,gpt2-xl
, etc.) - GPT BigCode (
bigcode/starcoder
,bigcode/gpt_bigcode-santacoder
, etc.) - GPT-J (
EleutherAI/gpt-j-6b
,nomic-ai/gpt4all-j
, etc.) - GPT-NeoX (
EleutherAI/gpt-neox-20b
,databricks/dolly-v2-12b
,stabilityai/stablelm-tuned-alpha-7b
, etc.) - InternLM (
internlm/internlm-7b
,internlm/internlm-chat-7b
, etc.) - LLaMA & LLaMA-2 (
meta-llama/Llama-2-70b-hf
,lmsys/vicuna-13b-v1.3
,young-geng/koala
,openlm-research/open_llama_13b
, etc.) - Mistral (
mistralai/Mistral-7B-v0.1
,mistralai/Mistral-7B-Instruct-v0.1
, etc.) - Mixtral (
mistralai/Mixtral-8x7B-v0.1
,mistralai/Mixtral-8x7B-Instruct-v0.1
, etc.) - MPT (
mosaicml/mpt-7b
,mosaicml/mpt-30b
, etc.) - OPT (
facebook/opt-66b
,facebook/opt-iml-max-30b
, etc.) - Phi (
microsoft/phi-1_5
,microsoft/phi-2
, etc.) - Qwen (
Qwen/Qwen-7B
,Qwen/Qwen-7B-Chat
, etc.) - Qwen2 (
Qwen/Qwen2-7B-beta
,Qwen/Qwen-7B-Chat-beta
, etc.) - StableLM(
stabilityai/stablelm-3b-4e1t
,stabilityai/stablelm-base-alpha-7b-v2
, etc.) - Yi (
01-ai/Yi-6B
,01-ai/Yi-34B
, etc.)
Install vLLM
Install vLLM with pip or from source:
Getting Started
Visit the documentation to get started.