Introduction
Please begin by reviewing the GROMACS-SRCG solution and Salad Kelpie. For production workloads involving a large number of simulation jobs, using a job queue is critical for both efficiency and scalability. If a node fails during job execution, the queue's built-in retry mechanism can automatically reassign the job to another available node, minimizing downtime. Autoscaling can also be implemented to dynamically adjust the GPU resource pool on SaladCloud in response to changing system loads. Several job queue systems are available, including Salad Kelpie, AWS SQS, GCP Pub/Sub, and custom solutions such as RabbitMQ or Redis. Once the GROMACS-SRCG solution with chunked simulation support is implemented, integrating it with a job queue is typically straightforward and can usually be completed within a few hours to a few days. Based on customer implementations on SaladCloud, the following are common patterns for solution integration:

| | Option 1: Salad Kelpie with Self-Managed Data Synchronization | Option 2: Salad Kelpie with Managed Data Synchronization | Option 3: Other Queue Systems |
|---|---|---|---|
| Solution Description | Use Kelpie solely as the job queue, while implementing your own data management workflow. | Use Kelpie as both the job queue and the data management solution, leveraging its built-in upload and download capabilities to simplify application development. | Use third-party or custom queue systems alongside your own data management solution. |
| Job Queue | Salad Kelpie | Salad Kelpie | AWS SQS, GCP Pub/Sub, or a custom queue |
| Cloud Storage | Any storage | S3-compatible storage | Any storage |
| Auto Scaling | Managed by Kelpie, based on the number of pending jobs in the queue as well as the active jobs and their execution times on Salad nodes. | Managed by Kelpie, based on the number of pending jobs in the queue as well as the active jobs and their execution times on Salad nodes. | Custom implementation using the Instance Deletion Cost feature on SaladCloud to efficiently manage scale-in behavior. |
| Use Cases | Applications that require full visibility and direct control over the data transfer process to optimize performance and reliability. | Typical applications, supporting chunked simulations and processing one job per node at a time. | Complex applications that run multiple jobs simultaneously on a single node, or migrate and consolidate jobs between nodes for greater efficiency. |
Salad Kelpie with Self-Managed Data Synchronization
Job Definition
To illustrate how the solution works, consider the following job definition example: the job runs `python /app/main.py` (within the same image) using the provided environment variables. The `container_group_id` ensures that only Kelpie workers within the designated container group can process the job; this ID must first be retrieved from the container group on SaladCloud.
The `sync` section is left empty, meaning that `main.py` is responsible for handling data synchronization on its own. When executed, it downloads the required files from the cloud storage at `BUCKET/PREFIX/job1/`, processes the job, and uploads the state and results back. If a node fails and the job restarts on a new node, the application should be designed to resume from the last saved state and continue processing until the task either completes successfully or fails.
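To make the shape of such a job concrete, here is a minimal sketch of an Option 1 job definition built in Python. The field names follow the schema described above (`container_group_id`, an empty `sync` section, a command plus arguments, and environment variables), but treat them as illustrative assumptions and consult the Kelpie documentation for the exact schema; `JOB_ID` is a hypothetical variable name.

```python
# Illustrative Option 1 job definition (self-managed data sync).
# Field names are assumptions based on the description in this guide,
# not a verified Kelpie schema.

def build_self_managed_job(container_group_id: str, job_id: str) -> dict:
    """Build a job whose application handles all data transfer itself."""
    return {
        # Restricts the job to Kelpie workers in this container group.
        "container_group_id": container_group_id,
        "command": "python",
        "arguments": ["/app/main.py"],
        "environment": {
            # main.py reads this to locate its inputs/outputs,
            # e.g. under BUCKET/PREFIX/job1/ in the example above.
            "JOB_ID": job_id,
        },
        # Empty: main.py downloads inputs and uploads results itself.
        "sync": {},
    }

job = build_self_managed_job("your-container-group-id", "job1")
```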
The Kelpie platform marks a job as `COMPLETED` when the application exits with a status code of 0. If the application exits with a non-zero status code on a node, the platform automatically retries the job on another node. After three failed attempts (configurable, and excluding infrastructure-related failures such as node reallocation), the Kelpie platform marks the job as `FAILED`.
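The checkpoint-and-resume contract that `main.py` must honor can be sketched as follows. The state file name and the chunk loop are illustrative stand-ins for the real simulation logic; the real application would also sync this state to and from cloud storage.

```python
# Sketch of the resume-and-exit-code contract for main.py.
# "state.json" and the chunk loop are illustrative; the real
# application syncs its state to/from cloud storage.
import json
import os

STATE_FILE = "state.json"

def load_state() -> dict:
    # Resume from the last saved state if a previous node made progress.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"next_chunk": 0}

def run(total_chunks: int) -> int:
    state = load_state()
    for chunk in range(state["next_chunk"], total_chunks):
        # ... run one simulation chunk here ...
        state["next_chunk"] = chunk + 1
        with open(STATE_FILE, "w") as f:
            json.dump(state, f)  # checkpoint after every chunk
    return 0  # exit code 0 -> Kelpie marks the job COMPLETED

if os.path.exists(STATE_FILE):
    os.remove(STATE_FILE)  # start fresh for this demo
rc = run(total_chunks=3)
# In the real main.py: sys.exit(rc); any non-zero exit triggers a retry.
```

If the node fails mid-run, the next invocation of `run` simply continues from `next_chunk` instead of starting over.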
Please review the detailed job information for the various statuses.
Submitting Jobs
Starting with Kelpie 0.6.0, all requests to the Kelpie platform require your Salad API key to be included in the `Salad-Api-Key` header. Please refer to the example code for submitting jobs, where you will also need to set the necessary environment variables to run the code.
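As a minimal sketch of what a submission request looks like with this header, the snippet below builds an HTTP POST against an assumed Kelpie jobs endpoint. The endpoint URL and `SALAD_API_KEY` variable name are assumptions; check the Kelpie documentation and the example code for the exact values.

```python
# Minimal sketch of submitting a job to the Kelpie platform with the
# Salad-Api-Key header required since Kelpie 0.6.0. The endpoint URL
# below is an assumption, not a verified address.
import json
import os
import urllib.request

KELPIE_JOBS_URL = "https://kelpie-api.salad.com/jobs"  # assumed endpoint

def build_request(job: dict, api_key: str) -> urllib.request.Request:
    """Build the POST request carrying the job definition and API key."""
    return urllib.request.Request(
        KELPIE_JOBS_URL,
        data=json.dumps(job).encode(),
        headers={
            "Salad-Api-Key": api_key,  # required since Kelpie 0.6.0
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request({"command": "python"}, os.environ.get("SALAD_API_KEY", ""))
# urllib.request.urlopen(req)  # submit (requires network and a valid key)
```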
Processing Jobs
The application needs only minor modifications from the one used in the GROMACS-SRCG solution when integrated with Kelpie. Previously, task-related environment variables were provided during container group creation via the SaladCloud APIs; now, these variables are delivered by the Kelpie platform. In the Dockerfile, configure the Kelpie worker as the entry point, replacing the previous direct execution of the application.
Local Run
If you have access to a local GPU environment, test the image before running it on SaladCloud. Use `docker compose` to start the container defined in the docker-compose.yaml; the command automatically loads environment variables from the .env file in the same directory.
Deployment on SaladCloud
Run salad_deploy.py to deploy a container group on SaladCloud, which returns the `SALAD_CONTAINER_GROUP_ID` used for job submission.
The following environment variables are required for the deployment:
Salad Kelpie with Managed Data Synchronization
Job Definition
To demonstrate how the managed data synchronization works, consider the following job definition example: files written to the `/app/kelpie_managed_upload_folder/` directory will be automatically and concurrently uploaded by the Kelpie worker to the S3-compatible storage at `BUCKET/PREFIX/job2/`. However, the upload order may differ from the order in which files are written, due to factors such as modification time, file size, and upload duration. The Kelpie worker ensures that all ongoing uploads complete even after the application exits.
If the application requires full visibility and direct control over the data transfer process, such as node filtering for network performance optimization, upload backlog and error monitoring, or strict ordering guarantees, it is recommended to implement self-managed data synchronization.
Please review the detailed job information for the various statuses.
Submitting Jobs
Please refer to the example code for submitting jobs. The environment variables required to run the code are the same as in the previous section. The job input files must be uploaded to the cloud storage before submitting the job.
Processing Jobs
The application is significantly simplified compared to the one used in the GROMACS-SRCG solution when integrated with Kelpie, as the upload thread and local queue are no longer required. Ensure that output and checkpoint files are moved into the `/app/kelpie_managed_upload_folder/` directory only once they are finalized and ready for upload.
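One common way to guarantee this is to write each file to a scratch location first and then move it into the upload folder in a single atomic step, so the Kelpie worker never sees a partially written file. The scratch path below is illustrative; only the upload folder path comes from this guide.

```python
# Write files completely in a scratch directory, then atomically move
# them into the Kelpie-managed upload folder. The scratch path is
# illustrative; os.replace is atomic only on the same filesystem.
import os

UPLOAD_DIR = "/app/kelpie_managed_upload_folder"  # watched by the Kelpie worker
SCRATCH_DIR = "/app/scratch"                      # illustrative scratch location

def publish(filename: str, data: bytes,
            upload_dir: str = UPLOAD_DIR, scratch_dir: str = SCRATCH_DIR) -> str:
    """Finalize a file in scratch, then move it into the upload folder."""
    os.makedirs(scratch_dir, exist_ok=True)
    os.makedirs(upload_dir, exist_ok=True)
    tmp_path = os.path.join(scratch_dir, filename)
    with open(tmp_path, "wb") as f:
        f.write(data)  # the file is fully written before it becomes visible
    final_path = os.path.join(upload_dir, filename)
    os.replace(tmp_path, final_path)  # atomic rename on the same filesystem
    return final_path
```

Keeping the scratch directory on the same volume as the upload folder preserves the atomicity of the final rename.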
In the Dockerfile, configure the Kelpie worker as the entry point, replacing the previous direct execution of the application.
Local Run
As described in the previous section, use `docker compose` to start the container defined in the docker-compose.yaml; the required environment variables are loaded from the .env file.
Deployment on SaladCloud
Run salad_deploy.py to deploy a container group on SaladCloud, which returns the `SALAD_CONTAINER_GROUP_ID` used for job submission.
The following environment variables are required for the deployment: