Last Updated: August 13, 2025
Overview
Migrating from AWS Batch to SaladCloud enables you to reduce batch processing costs by up to 90% while
maintaining robust job orchestration and scaling capabilities. If you’re currently running batch jobs on AWS Batch
compute environments, you’ll find that SaladCloud offers similar patterns for job queuing, automatic scaling, and
distributed processing — but at a fraction of the cost.
What Stays Exactly the Same:
- Your application code and processing logic remain unchanged
- Same containerized workloads (Docker/ECS task definitions convert easily)
- Job submission and monitoring patterns
- Automatic retry logic for failed jobs
- Queue-based job distribution
What Gets Simpler:
- No complex compute environment configuration
- Simplified job definitions (just containers and resources)
- Straightforward pricing without EC2/Fargate complexity
- Built-in global distribution without multi-region setup
Key Differences to Consider:
- SaladCloud uses distributed consumer GPUs instead of EC2/Fargate
- Job processing through HTTP endpoints rather than AWS Batch agents
- Cloud storage patterns instead of EBS volumes
- Slower cold starts but dramatically lower costs
The migration primarily involves adapting your AWS Batch job definitions to SaladCloud’s container-based job processing
model while preserving your existing batch processing workflows.
💡 New to SaladCloud? Check out our getting started guide for an
introduction to deploying on SaladCloud, or explore our
job queue documentation to understand how SaladCloud
handles batch processing.
Why Migrate from AWS Batch to SaladCloud?
AWS Batch has served as a reliable batch processing solution, but its costs can quickly escalate, especially for
GPU-intensive workloads. SaladCloud offers a compelling alternative that addresses common AWS Batch pain points:
Cost Advantages:
- 90% Lower Compute Costs: GPU + CPU + RAM combined cost a fraction of EC2 instances
- RTX 4090 setup: $0.36/hr vs P3.2xlarge: $3.06/hr
- Transparent Component Pricing: Simple rates - $0.004/vCPU/hour + $0.001/GB RAM/hour + GPU rate
- Per-Second Billing: Hourly rates are metered per second, so you only pay while containers are running
- No Hidden Costs: No charges for VPC endpoints, NAT gateways, or data transfer between AZs
Operational Benefits:
- Simplified Management: No compute environment configuration or AMI management
- Automatic Global Distribution: Access to 11,000+ GPUs worldwide without multi-region complexity
- Built-in Resilience: Automatic failover and retry logic included
- No Infrastructure Overhead: Focus on your batch jobs, not EC2 fleet management
When SaladCloud Excels:
- Long-running batch jobs where startup time is less critical
- GPU-intensive workloads (ML training, rendering, simulations)
- Cost-sensitive batch processing
- Non-time-critical workloads (batch priority adds 40-50% savings on top of base 90% savings)
- Globally distributed data processing
- Development and testing environments
Trade-offs to Consider:
- Cold Start Times: Container startup takes minutes vs. seconds on pre-warmed EC2 instances
- Storage Model: No EBS volumes; use cloud storage APIs instead
- Service Integration: Fewer native AWS service integrations
- Job Complexity: Better suited for containerized workloads than complex multi-step pipelines
Product Comparison: AWS Batch vs. SaladCloud
Core Component Mapping
AWS Batch Component | SaladCloud Equivalent | Key Differences |
---|---|---|
Compute Environment | Container Groups | No EC2 configuration needed; automatic GPU provisioning |
Job Queues | Salad Job Queues | HTTP-based job distribution instead of agent-based |
Job Definitions | Container Configuration | Simpler format; no need for vCPU/memory registration |
Array Jobs | Multiple Job Submissions | Submit individual jobs; same parallelization benefits |
Job Dependencies | Application-Level Logic | Handle dependencies in your code or orchestration layer |
CloudWatch Logs | Portal Logs/External Logging | Built-in logs or integrate with Datadog, Axiom, etc. |
Step Functions | External Orchestrators | Use Airflow, Temporal, or similar for complex workflows |
Feature Comparison
Feature | AWS Batch | SaladCloud |
---|---|---|
Job Scheduling | Priority-based with fair share | FIFO queue processing |
Auto Scaling | Based on queue depth | Queue-based or custom metrics |
Spot/On-Demand Mix | Configurable compute environments | 4 priority tiers (batch adds 40-50% to base savings) |
GPU Support | Accelerated Computing instances | Consumer GPUs (RTX 4090, 5090, etc.) |
Container Runtime | ECS or EKS | Docker containers |
Job Retries | Configurable retry attempts | Automatic 3 retries (4 total attempts) |
Job Timeouts | Configurable per job | Container-level configuration |
Long-Running Jobs | Supported with spot instance risks | Use Kelpie for checkpointing/resumption |
Multi-Step Jobs | Via Step Functions | Single container jobs (orchestrate externally) |
Storage | EBS volumes, EFS | S3-compatible cloud storage (e.g., R2) |
Networking | VPC, Security Groups | No networking config needed with Job Queues |
Monitoring | CloudWatch Metrics/Logs | Portal metrics, external monitoring tools |
Cost Model | EC2/Fargate pricing + Batch overhead | Simple hourly rates (billed per second) |
Migration Requirements
Technical Requirements
- Containerization: Jobs must run in Docker containers (you likely already have this with ECS task definitions)
- HTTP Interface: Jobs receive work via HTTP endpoints instead of AWS Batch job parameters
- Cloud Storage: Replace EBS/EFS with S3-compatible storage (Cloudflare R2 recommended for no egress fees)
- Queue Worker: Add the Salad Job Queue Worker binary to your container (handles job distribution)
Architectural Shifts
- From Agent-Based to Queue Worker: AWS Batch agents pull jobs; SaladCloud Queue Worker receives and forwards jobs
locally
- From EC2 Fleets to Distributed Nodes: No direct control over compute instances
- From VPC Networking to No Networking: Job Queues eliminate networking configuration entirely
- From IAM Roles to API Keys: Different authentication model (see the sketch below)
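For example, where an AWS Batch job assumes an IAM role, SaladCloud API calls authenticate with the Salad-Api-Key header. A minimal sketch of calling the REST API directly (the queue-list endpoint mirrors the queue-creation call shown later in this guide; the SALAD_API_KEY environment variable name is just a convention):
import os
import requests
ORG, PROJECT = "your-org", "your-project"
# Every SaladCloud API call is authenticated with an API key header instead of an IAM role
resp = requests.get(
    f"https://api.salad.com/api/public/organizations/{ORG}/projects/{PROJECT}/queues",
    headers={"Salad-Api-Key": os.environ["SALAD_API_KEY"]},
)
resp.raise_for_status()
print(resp.json())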
Before You Begin: Key Concepts
Understanding the Job Processing Model
AWS Batch Model:
Job Queue → Compute Environment → EC2 Instance → Batch Agent → Container
SaladCloud Model:
Job Queue → Container Group → Distributed Nodes → Queue Worker → Your App
The key difference is that SaladCloud uses an HTTP-based job distribution model where the Salad Job Queue Worker (a
lightweight binary you add to your container) receives jobs from the queue and forwards them to your application via
localhost HTTP calls. This means your application doesn’t need IPv6 binding or external network access.
Container Startup Behavior
AWS Batch: Containers start when jobs are assigned, run the job, then terminate.
SaladCloud: Containers run continuously and process multiple jobs. You can use
Job Queue Autoscaling to automatically scale to zero
when you have no jobs left to process. Your application should:
- Start an HTTP server to receive jobs
- Process jobs when received
- Return results via HTTP response
- Stay running to process more jobs
Storage Patterns
Since SaladCloud doesn’t support mounted volumes, you’ll need to adapt your storage strategy.
Important: Use Egress-Free Storage We strongly recommend using egress-free storage providers like Cloudflare R2
instead of AWS S3. SaladCloud’s distributed nodes are not in datacenters, so egress fees from traditional cloud storage
can add up quickly.
# AWS Batch pattern with a mounted EFS/EBS volume
def process_job(job_params):
input_file = f"/mnt/efs/inputs/{job_params['file_id']}"
output_file = f"/mnt/efs/outputs/{job_params['file_id']}.result"
data = load_from_disk(input_file)
result = process_data(data)
save_to_disk(result, output_file)
# SaladCloud pattern with Cloudflare R2 (recommended) or S3
import boto3
# For Cloudflare R2 (no egress fees)
s3 = boto3.client('s3',
endpoint_url='https://your-account.r2.cloudflarestorage.com',
aws_access_key_id='your-r2-access-key',
aws_secret_access_key='your-r2-secret'
)
# Or for AWS S3 (will incur egress charges)
# s3 = boto3.client('s3')
def process_job(job_params):
# Download from S3
input_data = s3.get_object(
Bucket=job_params['input_bucket'],
Key=job_params['input_key']
)['Body'].read()
# Process in memory or temp storage
result = process_data(input_data)
# Upload to S3
s3.put_object(
Bucket=job_params['output_bucket'],
Key=job_params['output_key'],
Body=result
)
Step-by-Step Migration Process
Step 1: Prepare Your SaladCloud Environment
Account Setup
- Create account at portal.salad.com
- Set up organization and project
- Generate API key for programmatic access
Install SaladCloud SDK (Optional)
# Python
pip install salad-cloud-sdk
# Node.js
npm install @saladtechnologies/salad-cloud-sdk
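Once the SDK is installed, initialize a client with your API key. A minimal sketch that reads the key from an environment variable instead of hardcoding it (the variable name SALAD_API_KEY is just the convention used in this guide):
import os
from salad_cloud_sdk import SaladCloudSdk
# Avoid hardcoding credentials; read the API key from the environment
sdk = SaladCloudSdk(api_key=os.environ["SALAD_API_KEY"])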
Step 2: Convert AWS Batch Job Definitions
AWS Batch Job Definition:
{
"jobDefinitionName": "image-processing",
"type": "container",
"containerProperties": {
"image": "my-ecr-repo/processor:latest",
"vcpus": 4,
"memory": 8192,
"jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
"environment": [{ "name": "PROCESSING_MODE", "value": "batch" }],
"resourceRequirements": [{ "type": "GPU", "value": "1" }]
}
}
SaladCloud Container Configuration:
# Dockerfile with Salad Job Queue Worker
FROM my-ecr-repo/processor:latest
# Download the Salad Job Queue Worker binary
ADD https://github.com/SaladTechnologies/salad-cloud-job-queue-worker/releases/latest/download/salad-job-queue-worker-linux-amd64 /usr/local/bin/salad-job-queue-worker
RUN chmod +x /usr/local/bin/salad-job-queue-worker
# Your existing application setup
WORKDIR /app
COPY . .
# You'll need to manage both processes - your app and the queue worker
# See /container-engine/how-to-guides/job-processing/queue-worker for s6-overlay or wrapper approaches
# The queue worker will forward jobs to your app on localhost:8080
# Your app does NOT need IPv6 binding when using job queues
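If you prefer a simple wrapper over s6-overlay, the sketch below starts your HTTP app and the queue worker from one entrypoint and exits if either process dies so the container can restart cleanly. The module name main:app and the binary path are assumptions; adjust them to your image.
# start.py - hypothetical entrypoint that supervises both processes
import subprocess
import sys
import time
def main():
    # Start your HTTP app first so the queue worker has something to forward jobs to
    app = subprocess.Popen(["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"])
    time.sleep(5)  # crude startup delay; a real entrypoint should poll a health endpoint
    worker = subprocess.Popen(["/usr/local/bin/salad-job-queue-worker"])
    # If either process exits, stop the other and propagate the exit code
    while True:
        for name, proc in (("app", app), ("queue-worker", worker)):
            code = proc.poll()
            if code is not None:
                print(f"{name} exited with code {code}", file=sys.stderr)
                app.terminate()
                worker.terminate()
                sys.exit(code or 1)
        time.sleep(1)
if __name__ == "__main__":
    main()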
AWS Batch Job Script:
import os
import json
def main():
# AWS Batch provides job parameters via environment variables
job_params = json.loads(os.environ.get('BATCH_JOB_PARAMETERS', '{}'))
input_file = job_params['inputFile']
output_location = job_params['outputLocation']
# Process the job
result = process_file(input_file)
# Save results
save_to_s3(result, output_location)
if __name__ == "__main__":
main()
SaladCloud HTTP Handler:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
app = FastAPI()
class JobRequest(BaseModel):
inputFile: str
outputLocation: str
@app.post("/process")
async def process_job(request: JobRequest):
try:
# Process the job (same logic as before)
result = process_file(request.inputFile)
save_to_s3(result, request.outputLocation)
return {
"status": "success",
"output": request.outputLocation
}
except Exception as e:
# Return 500 to trigger retry
raise HTTPException(status_code=500, detail=str(e))
# With Job Queues, the queue worker forwards jobs to your app over localhost, so no external networking or IPv6 binding is required
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8080) # No IPv6 needed with job queues!
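Before deploying, you can smoke-test the handler locally by playing the role of the queue worker and posting a job payload to the endpoint yourself (the payload shape matches the JobRequest model above):
# local_test.py - simulate the queue worker with a direct POST (run while your app is up)
import requests
payload = {
    "inputFile": "s3://bucket/input/image.jpg",
    "outputLocation": "s3://bucket/output/",
}
resp = requests.post("http://localhost:8080/process", json=payload, timeout=300)
print(resp.status_code, resp.json())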
Step 3: Choose Your Job Queue Solution
These patterns can be implemented with any job queue, including ones outside the Salad platform, but the two options below have native platform integration with SaladCloud.
Salad Job Queues vs. Kelpie
SaladCloud offers two job queue solutions, each optimized for different use cases:
Salad Job Queues (Recommended for most AWS Batch migrations):
- Best for jobs that complete in minutes to a few hours
- Built-in retry logic (3 retries, 4 total attempts)
- Simple HTTP-based job distribution
- Native autoscaling based on queue depth
- No additional setup required
Salad Kelpie (For long-running or interruptible workloads):
- Designed for jobs running many hours or days (ML training, simulations)
- Built-in checkpointing and resumption capabilities
- Automatic cloud storage integration for progress saves
- Handles node interruptions gracefully
- Ideal for workloads that need to survive node failures
When to use Kelpie instead of Job Queues:
- Jobs that run longer than 30 minutes
- ML model training or fine-tuning
- Molecular dynamics simulations
- Any workload where losing progress would be costly
- Jobs that need to save and resume from checkpoints
For this guide, we’ll use Salad Job Queues as they’re the closest match to AWS Batch for most use cases. If you have
long-running workloads, see our Kelpie documentation.
Create a Salad Job Queue
Job Queues can only be created via the API (not available in the portal):
curl -X POST "https://api.salad.com/api/public/organizations/$ORG/projects/$PROJECT/queues" \
-H "Salad-Api-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "batch-processing-queue",
"display_name": "Batch Processing Queue",
"description": "Queue for batch processing jobs migrated from AWS Batch"
}'
Or via Python SDK:
from salad_cloud_sdk import SaladCloudSdk
sdk = SaladCloudSdk(api_key="YOUR_API_KEY")
queue = sdk.queues.create_queue(
organization_name="your-org",
project_name="your-project",
request_body={
"name": "batch-processing-queue",
"display_name": "Batch Processing Queue"
}
)
Step 4: Deploy Container Group with Queue
Container Group Configuration
from salad_cloud_sdk import SaladCloudSdk
sdk = SaladCloudSdk(api_key="YOUR_API_KEY")
# Create container group connected to job queue
# Note: Container Gateway is NOT needed when using Job Queues
container_group = sdk.container_groups.create_container_group(
organization_name="your-org",
project_name="your-project",
request_body={
"name": "batch-processor",
"container": {
"image": "your-registry/batch-processor:latest",
"resources": {
"cpu": 4,
"memory": 8192,
"gpu_classes": ["ed563892-aacd-40f5-80b7-90c9be6c759b"] # RTX 4090 (24 GB)
},
"environment_variables": {
"PROCESSING_MODE": "batch",
"AWS_REGION": "us-east-1" # For S3 access
}
},
"queue_connection": {
"queue_name": "batch-processing-queue",
"port": 8080 # Port where your app listens locally
},
"replicas": 3, # Start with desired capacity
"autostart_policy": True,
"restart_policy": "always"
# No networking/gateway configuration needed!
}
)
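Because nodes are allocated on demand, the container group can take a few minutes to reach a running state. A small polling sketch you can run before submitting jobs (the get_container_group call and current_state.status field follow the SDK's general shape; check the SDK reference for exact names):
import time
def wait_until_running(sdk, name="batch-processor", timeout_seconds=1800):
    """Poll the container group until it reports a running state (sketch)."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        group = sdk.container_groups.get_container_group(
            organization_name="your-org",
            project_name="your-project",
            container_group_name=name,
        )
        status = getattr(group.current_state, "status", None)
        if status == "running":
            return group
        time.sleep(30)
    raise TimeoutError(f"Container group {name} did not start within {timeout_seconds}s")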
Step 5: Submit and Monitor Jobs
Job Submission
AWS Batch Pattern:
import boto3
batch = boto3.client('batch')
response = batch.submit_job(
jobName='process-image-001',
jobQueue='my-job-queue',
jobDefinition='image-processing',
parameters={
'inputFile': 's3://bucket/input/image.jpg',
'outputLocation': 's3://bucket/output/'
}
)
job_id = response['jobId']
SaladCloud Pattern:
from salad_cloud_sdk import SaladCloudSdk
sdk = SaladCloudSdk(api_key="YOUR_API_KEY")
# Submit job to queue
job = sdk.queues.create_job(
organization_name="your-org",
project_name="your-project",
queue_name="batch-processing-queue",
request_body={
"input": {
"inputFile": "s3://bucket/input/image.jpg",
"outputLocation": "s3://bucket/output/"
}
}
)
job_id = job.id
Job Monitoring
# Check job status
job_status = sdk.queues.get_job(
organization_name="your-org",
project_name="your-project",
queue_name="batch-processing-queue",
job_id=job_id
)
print(f"Job Status: {job_status.status}")
if job_status.status == "completed":
print(f"Results: {job_status.output}")
elif job_status.status == "failed":
print(f"Error: {job_status.error}")
Step 6: Implement Autoscaling
Queue-Based Autoscaling
# Configure autoscaling based on queue depth
sdk.container_groups.update_container_group(
organization_name="your-org",
project_name="your-project",
container_group_name="batch-processor",
request_body={
"queue_autoscaler": {
"min_replicas": 0, # Scale to zero when idle
"max_replicas": 50, # Maximum capacity
"desired_queue_length": 2, # Target 2 jobs per instance
"polling_period": 30 # Check every 30 seconds
}
}
)
Custom Metrics Autoscaling
import time
from datetime import datetime
def scale_based_on_time():
"""Scale up during business hours"""
sdk = SaladCloudSdk(api_key="YOUR_API_KEY")
while True:
hour = datetime.now().hour
# Scale up during business hours (9 AM - 6 PM)
if 9 <= hour < 18:
target_replicas = 10
else:
target_replicas = 2
sdk.container_groups.update_container_group(
organization_name="your-org",
project_name="your-project",
container_group_name="batch-processor",
request={"replicas": target_replicas}
)
time.sleep(300) # Check every 5 minutes
Migration Patterns for Common AWS Batch Scenarios
Pattern 1: Simple Batch Processing
AWS Batch Approach:
- Submit jobs with parameters
- Process in container
- Write results to S3
SaladCloud Migration:
# 1. Container with HTTP endpoint
@app.post("/process")
async def process_batch(job: dict):
# Same processing logic
result = your_existing_function(job['input'])
return {"output": result}
# 2. Submit jobs to queue
for item in batch_items:
sdk.queues.create_job(
organization_name="your-org",
project_name="your-project",
queue_name="batch-queue",
request_body={"input": item}
)
Pattern 2: Array Jobs
AWS Batch Array Jobs:
aws batch submit-job \
--array-properties size=100 \
--job-name array-job \
--job-queue my-queue
SaladCloud Equivalent:
# Submit multiple jobs to achieve same parallelization
jobs = []
for i in range(100):
job = sdk.queues.create_job(
organization_name="your-org",
project_name="your-project",
queue_name="batch-queue",
request_body={
"input": {
"index": i,
"total": 100,
"data": f"s3://bucket/data/chunk_{i}.json"
}
}
)
jobs.append(job.id)
# Monitor all jobs
for job_id in jobs:
status = sdk.queues.get_job(
organization_name="your-org",
project_name="your-project",
queue_name="batch-queue",
job_id=job_id
)
print(f"Job {job_id}: {status.status}")
Pattern 3: Long-Running Jobs with Kelpie
AWS Batch Long-Running Jobs:
- Multi-hour ML training jobs
- Risk of spot instance termination
- Manual checkpointing required
SaladCloud with Kelpie:
# Add Kelpie to your container
FROM pytorch/pytorch:2.7.1-cuda12.6-cudnn9-runtime
# Add the Kelpie binary
ARG KELPIE_VERSION=0.6.0
ADD https://github.com/SaladTechnologies/kelpie/releases/download/${KELPIE_VERSION}/kelpie /kelpie
RUN chmod +x /kelpie
# Your training code
COPY train.py /app/train.py
WORKDIR /app
# Kelpie handles job execution and checkpointing
CMD ["/kelpie"]
Benefits of Kelpie for long jobs:
- Automatic checkpoint upload to S3-compatible storage
- Resume from last checkpoint after interruption
- No data loss from node failures
- Built-in integration with SaladCloud
See the Kelpie guide for detailed setup.
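The key application-side change for Kelpie is making your job resumable: write checkpoints to a directory Kelpie syncs to your S3-compatible bucket, and load the latest one on startup. A minimal sketch (the /checkpoints path and run_one_epoch helper are illustrative; the actual sync directory comes from your Kelpie job configuration):
# train.py - checkpoint-aware training loop (sketch)
import os
import torch
CHECKPOINT_PATH = "/checkpoints/state.pt"  # directory Kelpie is configured to sync
def train(model, optimizer, epochs=100):
    start_epoch = 0
    # Resume from the last checkpoint if one was restored by Kelpie
    if os.path.exists(CHECKPOINT_PATH):
        state = torch.load(CHECKPOINT_PATH)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1
    for epoch in range(start_epoch, epochs):
        run_one_epoch(model, optimizer)  # your existing training step
        # Save progress every epoch so an interrupted node loses at most one epoch
        torch.save(
            {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": epoch},
            CHECKPOINT_PATH,
        )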
Pattern 4: GPU-Accelerated ML Training
AWS Batch with GPU:
{
"resourceRequirements": [
{ "type": "GPU", "value": "1" },
{ "type": "MEMORY", "value": "32768" },
{ "type": "VCPU", "value": "8" }
]
}
SaladCloud GPU Configuration:
container_group = {
"container": {
"image": "your-ml-training:latest",
"resources": {
"cpu": 8,
"memory": 32768,
"gpu_classes": [
"ed563892-aacd-40f5-80b7-90c9be6c759b", # RTX 4090 (24 GB)
"a5db5c50-cbcb-4596-ae80-6a0c8090d80f" # RTX 3090 (24 GB)
]
}
}
}
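GPU classes are referenced by UUID. If you need IDs beyond the two shown above, you can list the classes available to your organization; a sketch against the public REST API (response field names may differ slightly, so check the API reference):
import os
import requests
resp = requests.get(
    "https://api.salad.com/api/public/organizations/your-org/gpu-classes",
    headers={"Salad-Api-Key": os.environ["SALAD_API_KEY"]},
)
resp.raise_for_status()
for gpu in resp.json().get("items", []):
    print(gpu.get("id"), gpu.get("name"))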
Pattern 5: Dependent Jobs
AWS Batch with Dependencies:
job1 = batch.submit_job(jobName="preprocess")
job2 = batch.submit_job(
jobName="process",
dependsOn=[{"jobId": job1['jobId']}]
)
SaladCloud Pattern:
# Implement dependency logic in your application
@app.post("/process")
async def process_with_dependencies(job: dict):
# Check if prerequisites are complete
if job.get('depends_on'):
for dep_id in job['depends_on']:
dep_status = sdk.queues.get_job(
organization_name="your-org",
project_name="your-project",
queue_name="batch-queue",
job_id=dep_id
)
if dep_status.status != "completed":
# Return 503 to retry later
raise HTTPException(status_code=503, detail="Dependencies not ready")
# Process the job
result = process_data(job['input'])
# Trigger dependent jobs if needed
if job.get('triggers'):
for next_job in job['triggers']:
sdk.queues.create_job(
organization_name="your-org",
project_name="your-project",
queue_name="batch-queue",
request_body=next_job
)
return {"output": result}
Monitoring and Logging
Replace CloudWatch with External Logging
Configure Axiom Logging (Recommended):
# In your container group configuration
container_group = {
"container": {
"logging": {
"axiom": {
"dataset": "salad-batch-jobs",
"token": "YOUR_AXIOM_TOKEN",
"url": "https://cloud.axiom.co"
}
}
}
}
Application-Level Logging:
import logging
import json
# Configure structured logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
@app.post("/process")
async def process_job(job: dict):
job_id = job.get('id', 'unknown')
logger.info(json.dumps({
"event": "job_started",
"job_id": job_id,
"input": job['input']
}))
try:
result = process_data(job['input'])
logger.info(json.dumps({
"event": "job_completed",
"job_id": job_id,
"duration": processing_time
}))
return {"output": result}
except Exception as e:
logger.error(json.dumps({
"event": "job_failed",
"job_id": job_id,
"error": str(e)
}))
raise
Cost Optimization Strategies
1. Use Batch Priority for Non-Time-Sensitive Workloads
SaladCloud offers four priority tiers; the lower the priority, the deeper the discount on top of SaladCloud’s already competitive base pricing (typically 80-90% less than AWS). For batch processing that isn’t time-critical, the “batch” priority tier offers the deepest discounts:
Priority | Use Case | Additional Savings vs Salad High Priority | Availability |
---|---|---|---|
High | Production, time-critical | Baseline (already ~90% less than AWS) | Highest |
Medium | Standard workloads | ~15-20% additional savings | Very Good |
Low | Flexible deadlines | ~25-35% additional savings | Good |
Batch | Non-urgent processing | ~40-50% additional savings | Variable |
Example pricing for comparable GPU workload (24 GB VRAM, 8 vCPU, 32 GB RAM):
AWS P4d.24xlarge (8x A100 40GB):
- Total: ~$32.77/hour
- Per GPU: ~$4.10/hour
- Includes: 96 vCPUs, 1152 GB RAM (massive overkill for most batch jobs)
AWS P3.2xlarge (1x V100 16GB):
- Total: ~$3.06/hour
- Per GPU: $3.06/hour
- Includes: 8 vCPUs, 61 GB RAM
SaladCloud RTX 4090 (24 GB):
- High Priority: $0.30/hour GPU + $0.032/hour (8 vCPU) + $0.032/hour (32 GB RAM) = $0.364/hour total (88% less
than P3.2xlarge)
- Medium: $0.26 + $0.032 + $0.032 = $0.324/hour
- Low: $0.22 + $0.032 + $0.032 = $0.284/hour
- Batch: $0.18 + $0.032 + $0.032 = $0.244/hour (92% less than P3.2xlarge!)
# Configure container group with batch priority for maximum savings
# This gives you an additional 40% off SaladCloud's already low prices
container_group = sdk.container_groups.create_container_group(
organization_name="your-org",
project_name="your-project",
request_body={
"name": "batch-processor",
"container": {
# ... container config
},
"priority": "batch", # Additional 40-50% savings on top of base savings
"replicas": 10
}
)
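As a back-of-the-envelope check, here is what the rates above imply for 1,000 GPU-hours of batch work per month (purely illustrative arithmetic using the figures from this section):
hours = 1000
aws_p3_2xlarge = 3.06 * hours            # ~$3,060/month
salad_batch_priority = 0.244 * hours     # ~$244/month
print(f"AWS Batch (P3.2xlarge): ${aws_p3_2xlarge:,.0f}")
print(f"SaladCloud (batch priority): ${salad_batch_priority:,.0f}")
print(f"Savings: {1 - salad_batch_priority / aws_p3_2xlarge:.0%}")  # ~92%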
2. Scale to Zero During Off-Hours
# Configure minimum replicas to 0 for idle periods
queue_autoscaler = {
"min_replicas": 0, # Scale to zero when no jobs
"max_replicas": 100,
"desired_queue_length": 1
}
3. Optimize Container Size
# Use multi-stage builds to minimize image size
FROM python:3.9 AS builder
COPY requirements.txt .
RUN pip install --user -r requirements.txt
FROM python:3.9-slim
COPY --from=builder /root/.local /root/.local
# Make user-installed console scripts available on PATH
ENV PATH=/root/.local/bin:$PATH
COPY . /app
WORKDIR /app
4. Batch Small Jobs
@app.post("/process")
async def process_batch(request: dict):
# Process multiple items in one job
results = []
for item in request['batch']:
result = process_item(item)
results.append(result)
return {"outputs": results}
Migration Checklist
Pre-Migration
Container Preparation
SaladCloud Setup
Testing
Production Migration
Common Challenges and Solutions
Challenge: No Step Functions Equivalent
Solution: Use external workflow orchestrators
# Apache Airflow DAG example
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def submit_salad_job(**context):
    sdk = SaladCloudSdk(api_key="YOUR_API_KEY")
    job = sdk.queues.create_job(
        organization_name="your-org",
        project_name="your-project",
        queue_name="batch-queue",
        request_body=context['params']
    )
    return job.id
dag = DAG(
    'batch_workflow',
    start_date=datetime(2025, 1, 1),
    schedule_interval=None,
)
preprocess = PythonOperator(
    task_id='preprocess',
    python_callable=submit_salad_job,
    params={'input': 'preprocess_config'},
    dag=dag,
)
process = PythonOperator(
    task_id='process',
    python_callable=submit_salad_job,
    params={'input': 'process_config'},
    dag=dag,
)
preprocess >> process  # Define dependencies
Challenge: Job Scheduling
Solution: Implement cron-based job submission
from apscheduler.schedulers.blocking import BlockingScheduler
scheduler = BlockingScheduler()
@scheduler.scheduled_job('cron', hour=2) # Run at 2 AM daily
def submit_nightly_batch():
sdk = SaladCloudSdk(api_key="YOUR_API_KEY")
# Submit batch jobs
for job_config in nightly_jobs:
sdk.queues.create_job(
organization_name="your-org",
project_name="your-project",
queue_name="batch-queue",
request_body=job_config
)
scheduler.start()
Challenge: Large Data Transfer
Solution: Use pre-signed URLs and streaming
import boto3
from io import BytesIO
@app.post("/process")
async def process_large_file(job: dict):
s3 = boto3.client('s3')
# Stream large file from S3
response = s3.get_object(
Bucket=job['bucket'],
Key=job['key']
)
# Process in chunks to avoid memory issues
for chunk in response['Body'].iter_chunks(chunk_size=1024*1024):
process_chunk(chunk)
# Upload results with pre-signed URL
presigned_url = s3.generate_presigned_url(
'put_object',
Params={'Bucket': job['output_bucket'], 'Key': job['output_key']},
ExpiresIn=3600
)
return {"upload_url": presigned_url}
Minimize Cold Starts
# Keep containers warm with minimal replicas
container_group = {
"replicas": 2, # Always keep 2 instances running
"queue_autoscaler": {
"min_replicas": 2, # Never scale below 2
"max_replicas": 100
}
}
Optimize Job Distribution
# Process multiple small jobs per container invocation
@app.post("/process")
async def process_job_batch(request: dict):
# Check if this is a batch request
if 'batch_size' in request:
# Pull multiple jobs from queue
jobs = []
for _ in range(request['batch_size']):
job = await get_next_job() # Your queue logic
if job:
jobs.append(job)
# Process all jobs
results = [process_single_job(job) for job in jobs]
return {"results": results}
else:
# Single job processing
return process_single_job(request)
What You’ll Gain
Migrating from AWS Batch to SaladCloud provides:
- 90% Cost Reduction: Dramatically lower compute costs for batch processing
- Simplified Operations: No compute environment or AMI management
- Global Scale: Access to 11,000+ GPUs worldwide
- Transparent Pricing: Simple per-second billing without complex EC2 pricing tiers
Operational Improvements
- Automatic Failover: Built-in retry logic and node replacement
- Flexible Scaling: Scale to zero or thousands of instances
- No Infrastructure Management: Focus on your batch jobs, not EC2 fleets
- Unified Job Processing: Same patterns for CPU and GPU workloads
Trade-offs Accepted
- Longer cold start times (minutes vs. seconds)
- Different storage patterns (cloud APIs vs. mounted volumes)
- Fewer AWS service integrations
- HTTP-based job distribution instead of agent-based
Getting Help
SaladCloud Resources
Migration Support
Job Processing Patterns
Migration Guides
GPU Workloads
Ready to start saving on your batch processing costs? Create your SaladCloud account and begin migrating your AWS Batch
workloads today!