Deploy from the SaladCloud Portal.
Overview
This recipe provides an example implementation for using AI Toolkit to train LoRA models for the Flux1-Dev image generation model. Note that this model is not licensed for commercial use, so please ensure you have the appropriate rights to use it for your intended purpose. This recipe uses Kelpie and the Kelpie API to manage the training jobs. Kelpie is a job queueing system that allows you to run long-running jobs on Salad’s infrastructure, as well as manage data transfer between servers and S3-compatible storage. Kelpie has an optional autoscaling feature that adjusts the replica count based on the number of queued jobs, including scale-to-zero when the queue is empty.
Preparing Your Dataset
To train a LoRA model using the AI Toolkit, you need to prepare your dataset in a specific structure and upload that data to S3-compatible storage. The following steps outline how to set up your dataset for training:
- Create a directory named data in your project root.
- Create a folder inside the data directory named for a job id, like job-00001.
- Inside that folder, create a dataset folder and place your training images inside, along with a .txt file for each image that contains a caption for the image of the same name. For example, if you have an image named image1.jpg, you should have a file named image1.txt with the caption for that image.
- Copy this config file to data/job-00001/train.yaml and make the following updates:
  - Update .config.name to your job id (e.g., job-00001).
  - Update .config.datasets[0].folder_path to ./dataset
  - Update .model.name_or_path to /model, as the model weights are included in the docker image we will be using.
  - Update .sample.prompts to include the prompts you want to use for sampling. This is optional, but it can help you evaluate the quality of your model during training.
  - Update .config.save.max_step_saves_to_keep to 2 to save only the most recent progress checkpoint, plus one backup in case of file corruption.
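Before uploading, it can help to check that a job folder matches the layout described above. The following is a minimal sketch, not part of the recipe's scripts: the function names and the set of image extensions are assumptions chosen for illustration.

```python
from pathlib import Path

# Extensions treated as training images. This set is an assumption; adjust it
# to match the image types in your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def find_missing_captions(job_dir):
    """Return names of images in <job_dir>/dataset with no same-named .txt file."""
    dataset_dir = Path(job_dir) / "dataset"
    missing = []
    for image in sorted(dataset_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # skip captions and any other non-image files
        if not image.with_suffix(".txt").is_file():
            missing.append(image.name)
    return missing


def check_job_layout(job_dir):
    """Return a list of problems with a job folder; an empty list means it looks ready."""
    job_dir = Path(job_dir)
    problems = []
    if not (job_dir / "train.yaml").is_file():
        problems.append("missing train.yaml")
    if not (job_dir / "dataset").is_dir():
        problems.append("missing dataset folder")
        return problems  # nothing more to check without a dataset folder
    for name in find_missing_captions(job_dir):
        problems.append(f"no caption for {name}")
    return problems
```

For example, check_job_layout("data/job-00001") returns an empty list once train.yaml is in place and every image has a caption.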
Captioning Your Dataset
If you don’t have captions for your dataset, you can use any vision-language model to generate them. In this example we use OpenAI’s GPT-4.1-Nano model (their least expensive model). The generate-captions.py script takes a folder of images and generates a caption for each image using GPT-4.1-Nano. Each generated caption is saved in the same folder as its image, with the same name as the image but with a .txt extension. You will need an OpenAI API key to use this script; set the OPENAI_API_KEY environment variable to your key.
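The file-naming convention described above can be sketched independently of any particular model. This is not the generate-captions.py script itself: caption_fn is a hypothetical callable that you would back with a vision-language model of your choice, and skipping images that already have a caption is a design choice of this sketch rather than documented behavior of the script.

```python
import os
from pathlib import Path

# Assumed set of image types; adjust to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def require_api_key(name="OPENAI_API_KEY"):
    """Fail early with a clear message if the API key is not in the environment."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before captioning.")
    return key


def write_missing_captions(dataset_dir, caption_fn):
    """Write <image name>.txt next to each image that has no caption yet.

    caption_fn is a hypothetical callable that takes an image Path and returns
    the caption text, for example a wrapper around a vision-language model API.
    Returns the names of the caption files that were written.
    """
    written = []
    for image in sorted(Path(dataset_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        caption_path = image.with_suffix(".txt")
        if caption_path.exists():
            continue  # keep hand-written or previously generated captions
        caption_path.write_text(caption_fn(image).strip() + "\n", encoding="utf-8")
        written.append(caption_path.name)
    return written
```

Passing the captioning function in as an argument keeps the file handling testable without network access, and lets you swap models without touching the loop.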
Getting Your Container Group ID
.id
field.
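One way to read the id field programmatically is to fetch the container group from the Salad API. The sketch below is an illustration only: the base URL, the path layout, and the Salad-Api-Key header are written from memory of the SaladCloud public API and should be checked against the API reference, and the function names are hypothetical.

```python
import json
import urllib.request

# Assumed base URL of the SaladCloud public API.
SALAD_API_BASE = "https://api.salad.com/api/public"


def build_container_group_request(org, project, container_group, api_key):
    """Build (but do not send) a GET request for a single container group."""
    url = (
        f"{SALAD_API_BASE}/organizations/{org}"
        f"/projects/{project}/containers/{container_group}"
    )
    return urllib.request.Request(url, headers={"Salad-Api-Key": api_key}, method="GET")


def get_container_group_id(org, project, container_group, api_key):
    """Send the request and return the id field of the JSON response."""
    request = build_container_group_request(org, project, container_group, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["id"]
```

Here org, project, and container_group are the names you chose when deploying the recipe, and api_key is your Salad API key.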
Uploading Data And Queuing Jobs
You must upload your dataset to S3-compatible storage and queue jobs to the Kelpie API. The prepare-and-queue-jobs.py script will do this for you. Modify the script to include your Salad organization and project name. You will need appropriate AWS credentials to access your S3-compatible storage, as well as your Salad API key set in the environment variable SALAD_API_KEY. See the Kelpie Docs for more information about the Kelpie API.
The script saves a kelpie_job.json file in each job folder that contains the kelpie job id (different from your local job id, job-00001) and other information about the job.
You are encouraged to read the script to understand how it works, and how a kelpie job is structured.
Briefly, a kelpie job consists of a command, some arguments, optional environment variables, and a sync configuration. The sync configuration tells kelpie what to download before starting a job, what to upload while the job executes, and what to upload when the job completes. In our case, this looks like this:
- Before starting the job, kelpie will download the following files:
  - The training config file
  - The dataset folder
  - Any previously uploaded progress checkpoints
- While the job is running, kelpie will upload the following files:
  - The training progress checkpoints
  - Any images sampled during training
- When the job completes, kelpie will upload the following files:
  - The final model weights
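The job structure described above can be sketched as a plain dictionary. This is an illustration, not the payload built by prepare-and-queue-jobs.py: the field names (container_group_id, sync.before, sync.during, sync.after, bucket, prefix, local_path, direction) are assumptions to verify against the Kelpie Docs, and the command, arguments, and every path are hypothetical placeholders.

```python
def build_kelpie_job(container_group_id, bucket, job_name):
    """Assemble a job description with a command, arguments, environment, and sync rules."""

    def sync(remote, local, direction):
        # One sync rule: a storage prefix paired with a local directory.
        return {
            "bucket": bucket,
            "prefix": f"{job_name}/{remote}/",
            "local_path": local,
            "direction": direction,
        }

    return {
        "container_group_id": container_group_id,
        "command": "python",  # hypothetical command
        "arguments": ["run.py", "/workspace/input/train.yaml"],  # hypothetical paths
        "environment": {},  # optional environment variables
        "sync": {
            # Downloaded before the job starts: config, dataset, earlier checkpoints.
            "before": [
                sync("input", "/workspace/input", "download"),
                sync("checkpoints", "/workspace/checkpoints", "download"),
            ],
            # Uploaded while the job runs: progress checkpoints and sample images.
            "during": [
                sync("checkpoints", "/workspace/checkpoints", "upload"),
                sync("samples", "/workspace/samples", "upload"),
            ],
            # Uploaded when the job completes: the final model weights.
            "after": [
                sync("output", "/workspace/output", "upload"),
            ],
        },
    }
```

Downloading the checkpoint prefix before the job starts is what lets a job resume from its last uploaded checkpoint if it is moved to another machine.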
Monitoring the Job
You can monitor the job using the Kelpie API by requesting the status of the job. In the request, replace {job_id} with the job id of the kelpie job, and <salad-api-key> with your Salad API key.
This will return a JSON object with the details of the job, including the status, which machine had the job most
recently, and how many heartbeats have been received.
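A status check can be sketched in Python as follows. Everything specific here is an assumption to confirm in the Kelpie Docs: the /jobs/{job_id} path, the Salad-Api-Key, Salad-Organization, and Salad-Project header names, and the status key in the response. The base URL is passed in rather than hard-coded for the same reason.

```python
import json
import urllib.request


def build_job_status_request(base_url, job_id, api_key, org, project):
    """Build (but do not send) a GET request for one kelpie job."""
    url = f"{base_url.rstrip('/')}/jobs/{job_id}"
    headers = {
        "Salad-Api-Key": api_key,      # assumed header name
        "Salad-Organization": org,     # assumed header name
        "Salad-Project": project,      # assumed header name
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def get_job(base_url, job_id, api_key, org, project):
    """Send the request and return the decoded JSON job object."""
    request = build_job_status_request(base_url, job_id, api_key, org, project)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the kelpie job id taken from kelpie_job.json, get_job(...)["status"] would then give the current status, assuming the response uses that key.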
You can also customize the prepare-and-queue-jobs.py
script to include a webhook to be notified when the job
completes.
Autoscaling
Kelpie has an optional autoscaling feature that automates adjusting replica count based on the number of queued jobs, including scale-to-zero when the queue is empty. This feature works through the Salad API, and requires adding the Kelpie user to your Salad Organization to grant the required API access. Currently that is me (shawn.rushefsky@salad.com). To enable autoscaling, you can submit a request to the Kelpie API to establish scaling rules for your container group.
- Every 5 minutes, all scaling rules are evaluated.
- The number of replicas in a container group is set to equal the number of queued or running jobs, up to the maximum number of replicas, and down to the minimum number of replicas.
- If the desired number of replicas is 0, the container group will be stopped.
- If the desired number of replicas is greater than 0 and the container group is not currently running, the container group will be started.
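The scaling rules above amount to clamping the job count between the minimum and maximum replica counts. A minimal sketch of that logic, with hypothetical function names and action labels (this is not Kelpie's actual implementation):

```python
def desired_replicas(queued, running, min_replicas, max_replicas):
    """Number of queued or running jobs, clamped to [min_replicas, max_replicas]."""
    return max(min_replicas, min(max_replicas, queued + running))


def scaling_action(desired, group_is_running):
    """Decide what a single evaluation of a scaling rule should do."""
    if desired == 0:
        return "stop"  # scale to zero: stop the container group
    if not group_is_running:
        return "start"  # work is waiting and the group is stopped
    return "set_replicas"  # group is running: adjust its replica count


# Example: 3 queued and 2 running jobs, limited to at most 4 replicas.
replicas = desired_replicas(queued=3, running=2, min_replicas=0, max_replicas=4)
action = scaling_action(replicas, group_is_running=True)
```

With a minimum of 0 and an empty queue, desired_replicas returns 0 and the action is "stop", which is the scale-to-zero case.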