Overview
This guide covers deploying Text Generation Inference (TGI) on SaladCloud Secure (datacenter GPUs) to serve large language models using multi‑GPU sharding. TGI provides optimized inference (continuous batching, tensor parallelism, efficient memory management) and an OpenAI‑compatible API. On SaladCloud Secure, nodes are provisioned in blocks of 8 GPUs (default: 8×L40S, 48 GB each), so you can load and serve models of 70B parameters and beyond.
Example Models for SaladCloud Secure
You can deploy any TGI-compatible model that fits within the combined VRAM of your available GPUs (a rough sizing check follows this list). Examples:
- Llama 3 70B Instruct — 70B parameter instruction-tuned model for conversational AI.
- Llama 2 70B Chat — Instruction-tuned conversational model.
- Mixtral 8×7B Instruct v0.1 — Mixture-of-Experts model for efficiency and reasoning.
- DeepSeek R1 Distill Llama 70B — Distilled model for high performance at lower cost (the default in this recipe).
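As that sizing check, compare the model's weight footprint against the combined VRAM of the block. Below is a minimal sketch using FP16 weights (2 bytes per parameter); it ignores KV cache, activations, and per-GPU overhead, so leave comfortable headroom in practice.

```python
# Back-of-the-envelope check: do the FP16 weights of a 70B-parameter model fit
# in an 8x L40S block? This ignores KV cache, activations, and per-GPU overhead,
# so leave comfortable headroom in practice.
params_billions = 70
bytes_per_param = 2                              # FP16 / BF16 weights
weights_gb = params_billions * bytes_per_param   # ~140 GB of weights

gpus = 8
vram_per_gpu_gb = 48                             # L40S
total_vram_gb = gpus * vram_per_gpu_gb           # 384 GB across the block

print(f"Weights ~{weights_gb} GB vs {total_vram_gb} GB total VRAM")
# -> Weights ~140 GB vs 384 GB total VRAM: fits, with room left for the KV cache.
```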
Multi‑GPU Configuration
To use multiple GPUs, you must explicitly set the following (an example configuration follows this list):
- Model — Hugging Face model ID to load.
- Hugging Face Token — Optional. Required only if using a private or gated Hugging Face model. You can create a token at huggingface.co/settings/tokens.
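As an example configuration, here is a minimal sketch of how these fields could map onto the TGI container's environment, assuming TGI's standard MODEL_ID, NUM_SHARD, and HUGGING_FACE_HUB_TOKEN launcher variables; the exact names wired up by the recipe may differ, and the values shown are examples only.

```python
# Illustrative only: example values for a TGI container serving the recipe's
# default model. MODEL_ID, NUM_SHARD, and HUGGING_FACE_HUB_TOKEN are TGI's
# standard launcher environment variables (assumed here, not taken from the recipe).
import os

tgi_environment = {
    "MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",   # the "Model" field
    "NUM_SHARD": "8",                                          # shard across all 8 GPUs
    # The "Hugging Face Token" field; only needed for private or gated models.
    "HUGGING_FACE_HUB_TOKEN": os.environ.get("HF_TOKEN", ""),
}
```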
Example Request
Submit chat completion requests to the /v1/chat/completions endpoint and receive generated text in response. A sample request body is provided in request.json.
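The body is a standard OpenAI-style chat completion payload. Below is a minimal sketch in Python; the access domain name is a placeholder for the URL assigned to your container group in the portal, and the exact contents of request.json in the recipe may differ.

```python
# Minimal sketch: POST a chat completion to the TGI container gateway.
# The access domain name below is a placeholder; use the one shown in the
# SaladCloud Portal for your container group.
import requests

ACCESS_DOMAIN = "https://your-access-domain-name.example"  # placeholder

payload = {
    "model": "tgi",  # TGI serves the single loaded model; "tgi" is the conventional value
    "messages": [
        {"role": "user", "content": "Explain tensor parallelism in one paragraph."}
    ],
    "max_tokens": 256,
}

response = requests.post(f"{ACCESS_DOMAIN}/v1/chat/completions", json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```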
How To Use This Recipe
Authentication
If authentication is enabled in the container gateway, all requests must include your SaladCloud API key in the Salad-Api-Key header. See Sending Requests for more information.
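For example, here is a minimal sketch of the same kind of request with the header added; the access domain name is a placeholder and the API key is read from your environment rather than hard-coded.

```python
# Minimal sketch: include the Salad-Api-Key header when gateway authentication
# is enabled. The access domain name is a placeholder; the API key is read from
# the SALAD_API_KEY environment variable.
import os
import requests

response = requests.post(
    "https://your-access-domain-name.example/v1/chat/completions",  # placeholder
    json={"model": "tgi", "messages": [{"role": "user", "content": "Hello"}]},
    headers={"Salad-Api-Key": os.environ["SALAD_API_KEY"]},
    timeout=120,
)
print(response.status_code)
```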
Replica Count
We recommend at least 2 replicas for development and 3–5 replicas for production. Datacenter GPU nodes are still subject to occasional interruptions for maintenance or hardware reallocation.
Logging
Logs are available in the SaladCloud Portal. For production workloads, connect an external logging source such as Axiom during deployment.
Deploy & Wait
When you deploy, SaladCloud will locate qualified datacenter GPU nodes and start downloading the container image and model weights. Large models may take 10–20 minutes or more to fully load. Once running, a green checkmark in the “Ready” column indicates the instance is serving requests. For production, wait until your target number of replicas is ready before sending traffic.
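If you also want a client-side readiness gate, one option is to poll the deployment before sending traffic. A minimal sketch, assuming TGI's standard GET /health route is reachable through the container gateway (add the Salad-Api-Key header if authentication is enabled; the access domain name is a placeholder):

```python
# Minimal sketch: poll the deployment until it responds, assuming TGI's
# standard /health route is exposed through the container gateway.
# The access domain name is a placeholder.
import time
import requests

ACCESS_DOMAIN = "https://your-access-domain-name.example"  # placeholder

def wait_until_ready(timeout_s: int = 1800, interval_s: int = 15) -> bool:
    """Return True once GET /health answers 200, or False after timeout_s seconds."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{ACCESS_DOMAIN}/health", timeout=10).status_code == 200:
                return True
        except requests.RequestException:
            pass  # nodes may still be pulling the image or loading model weights
        time.sleep(interval_s)
    return False

if wait_until_ready():
    print("Deployment is responding; safe to start sending requests.")
```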
Advanced: Running Multiple LLMs on the Same Node
It is possible to deploy multiple LLMs on a single node by configuring different CUDA_VISIBLE_DEVICES values for each container. Since SaladCloud currently does not support exposing multiple ports per container group, each LLM service must be configured to listen on a different path while sharing the same port.
This approach is not part of this recipe, but you can use this recipe as a base and adapt it for multi-LLM deployments. For example, you could deploy two instances of the TGI container on the same node, one serving Llama 3 70B Instruct on /llama3 and another serving Mixtral 8×7B Instruct v0.1 on /mixtral, both listening on port 80.
To do this, pass the following environment variables when creating or editing your container group (a configuration sketch follows this list):
- CUDA_VISIBLE_DEVICES — Comma‑separated GPU IDs visible to the container (e.g. 0,1,2,3,4,5,6,7).
- NUM_SHARD — Number of shards to split the model across. Must be ≤ the number of visible GPUs.
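The sketch below makes the split concrete for two TGI instances sharing one 8-GPU node. MODEL_ID is assumed to be TGI's standard model-selection variable, and routing the two services onto /llama3 and /mixtral on a shared port 80 still has to be handled by your own adaptation, as described above.

```python
# Illustrative only: environment values for two TGI containers splitting one
# 8-GPU node. CUDA_VISIBLE_DEVICES and NUM_SHARD are described above; MODEL_ID
# is assumed to be TGI's standard launcher variable. Path routing on a shared
# port is not shown here.
llama3_env = {
    "CUDA_VISIBLE_DEVICES": "0,1,2,3",  # first four GPUs of the node
    "NUM_SHARD": "4",                   # must not exceed the visible GPU count
    "MODEL_ID": "meta-llama/Meta-Llama-3-70B-Instruct",
}

mixtral_env = {
    "CUDA_VISIBLE_DEVICES": "4,5,6,7",  # remaining four GPUs
    "NUM_SHARD": "4",
    "MODEL_ID": "mistralai/Mixtral-8x7B-Instruct-v0.1",
}
```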