Deploy from the SaladCloud Portal.
Overview
This guide covers deploying vLLM on SaladCloud Secure (Datacenter GPUs) to serve large language models using multi-GPU sharding. vLLM is a high-throughput, open-source inference engine for LLMs, widely adopted in production environments because it provides:
- Continuous batching for maximizing GPU utilization.
- Tensor and pipeline parallelism for multi-GPU scaling.
- An OpenAI-compatible API.
- Support for quantization formats such as FP8, AWQ, and GPTQ.
Example Models for SaladCloud Secure
You can deploy any Hugging Face model that vLLM supports. Popular examples:
- DeepSeek R1 Distill Llama 70B — Distilled 70B model for high performance at lower cost (default in the recipe).
- Llama 3.1 70B Instruct — 70B parameter instruction-tuned model.
- Mixtral 8×7B Instruct v0.1 — Mixture-of-Experts model for efficiency and reasoning.
- Qwen2.5 72B Instruct — General-purpose 72B parameter instruction-tuned model.
Multi-GPU Configuration
This recipe automatically configures tensor parallelism (TP_SIZE) across all 8 GPUs in the node. You may also set:
- Model — Hugging Face model ID to load. Defaults to DeepSeek R1 Distill Llama 70B.
- Hugging Face Token — Optional. Required for private or gated models.
- GPU Memory Utilization — Fraction of GPU VRAM vLLM may use (default: 0.92). Lower it if you need more headroom.
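To illustrate what these settings control inside vLLM, the sketch below maps them onto vLLM's offline Python API. The recipe itself runs vLLM's OpenAI-compatible server for you, so this is purely illustrative; the model ID and values shown are the recipe defaults.

```python
# Illustrative only: roughly how the recipe settings map onto vLLM's Python API.
# The recipe runs vLLM's OpenAI-compatible server; you do not run this yourself.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # "Model" setting (recipe default)
    tensor_parallel_size=8,        # TP_SIZE: shard each layer across all 8 GPUs
    gpu_memory_utilization=0.92,   # "GPU Memory Utilization" setting
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```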
Example Request
Submit chat completion requests to the /v1/chat/completions endpoint and receive generated text in response.
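For example, a minimal sketch using Python's requests library; the access domain below is a placeholder for the URL shown on your container group's page in the SaladCloud Portal.

```python
# Minimal sketch: send a chat completion request to the deployed vLLM endpoint.
# Replace ACCESS_DOMAIN with your container group's access domain (placeholder below).
import requests

ACCESS_DOMAIN = "https://example-recipe.salad.cloud"  # placeholder

payload = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # recipe default model
    "messages": [{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    "max_tokens": 128,
}

resp = requests.post(f"{ACCESS_DOMAIN}/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```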
How To Use This Recipe
Authentication
If authentication is enabled, requests must include your SaladCloud API key in the Salad-Api-Key header. See Sending Requests.
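For example, a quick check that the key is accepted (the URL and key are placeholders; /v1/models is part of vLLM's OpenAI-compatible API):

```python
# Verify the API key is accepted by listing the served model.
import requests

headers = {"Salad-Api-Key": "YOUR_SALAD_API_KEY"}  # placeholder; use your own key
resp = requests.get(
    "https://example-recipe.salad.cloud/v1/models",  # placeholder access domain
    headers=headers,
    timeout=30,
)
print(resp.status_code, resp.json())
```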
Replica Count
We recommend at least 2 replicas for development and 3–5 replicas for production. Datacenter GPU nodes are still interruptible.
Logging
Logs are available in the SaladCloud Portal. You can also connect an external logging provider such as Axiom.
Deploy & Wait
When you deploy, SaladCloud will provision nodes, pull the container image, and download the model weights. Large models may take 5–10 minutes or more to fully load. Once replicas show a green checkmark in the Ready column, the service is live.
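If you also want to check readiness programmatically, a small polling sketch is shown below. It assumes the image's vLLM OpenAI-compatible server exposes a /health route; if yours does not, polling /v1/models works the same way. The access domain is a placeholder.

```python
# Poll the deployment until vLLM responds as healthy (endpoint path is an assumption).
import time
import requests

ACCESS_DOMAIN = "https://example-recipe.salad.cloud"  # placeholder access domain

def wait_until_ready(timeout_s: int = 1800, interval_s: int = 15) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{ACCESS_DOMAIN}/health", timeout=10).status_code == 200:
                return True
        except requests.RequestException:
            pass  # nodes may still be pulling the image or loading weights
        time.sleep(interval_s)
    return False

print("ready" if wait_until_ready() else "timed out")
```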
Advanced Settings
By default, this recipe configures tensor parallelism across all GPUs. For extremely large models (e.g., 405B parameters), you may need to combine tensor and pipeline parallelism. You can also adjust additional parameters for performance tuning. The supported environment variables are listed below, followed by an example configuration.
- DTYPE — Controls the compute precision. Options include auto, float16, float, bfloat16, float32, and half. Lower precision reduces memory usage and speeds up inference, at a potential cost in accuracy.
- TP_SIZE — Number of tensor parallel groups (how many GPUs share each layer's computation). Commonly set to the total GPU count (e.g., 8 for 8 GPUs).
- PP_SIZE — Number of pipeline parallel groups.
- MAX_MODEL_LEN — Model context length (prompt and output).
- MAX_NUM_BATCH_TOKENS — Maximum number of tokens to be processed in a single iteration.
- MAX_NUM_SEQS — Maximum number of sequences to be processed in a single iteration.
- GPU_MEM_UTIL — Fraction of GPU VRAM that vLLM is allowed to use (e.g., 0.92). Lower this to leave headroom for monitoring or sidecar processes.
- QUANTIZATION — Enables model quantization to save memory and speed up inference. Supported values depend on the model (e.g., awq, gptq, marlin, fp8).
- KV_CACHE_DTYPE — Data type for the key/value cache. Options include auto, fp8, fp16, and bf16. Lower precision reduces the memory footprint for long contexts.
- DOWNLOAD_DIR — Directory path where model weights are cached inside the container. Useful if you want to control the cache location.
- TOKENIZER — Path or Hugging Face repo ID for a custom tokenizer. By default, vLLM uses the tokenizer bundled with the model.
- TRUST_REMOTE_CODE — Enable this flag (set to any non-empty value) if the model requires custom code from Hugging Face (e.g., non-standard architectures).
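As a concrete reference, here is one plausible combination of these variables for the default 70B model on an 8-GPU node, expressed as a Python dict for illustration. The values are examples rather than recommendations, and the MODEL key is an assumption standing in for the recipe's "Model" setting; in practice these are set as environment variables on the container group in the SaladCloud Portal.

```python
# Example environment-variable configuration for this recipe (illustrative values).
env = {
    "MODEL": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # assumption: the recipe's "Model" field
    "TP_SIZE": "8",            # shard each layer across all 8 GPUs
    "PP_SIZE": "1",            # no pipeline parallelism needed for a 70B model
    "DTYPE": "auto",           # let vLLM pick the model's native precision
    "MAX_MODEL_LEN": "8192",   # context window (prompt + output tokens)
    "GPU_MEM_UTIL": "0.92",    # leave ~8% VRAM headroom
    "KV_CACHE_DTYPE": "auto",
}
```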