Last Updated: August 13, 2025

Overview

Migrating from AWS Batch to SaladCloud enables you to reduce batch processing costs by up to 90% while maintaining robust job orchestration and scaling capabilities. If you’re currently running batch jobs on AWS Batch compute environments, you’ll find that SaladCloud offers similar patterns for job queuing, automatic scaling, and distributed processing, but at a fraction of the cost.
What Stays Exactly the Same:
  • Your core processing logic remains unchanged (only the job entry point moves from a script to an HTTP handler)
  • Same containerized workloads (Docker/ECS task definitions convert easily)
  • Job submission and monitoring patterns
  • Automatic retry logic for failed jobs
  • Queue-based job distribution
What Gets Simpler:
  • No complex compute environment configuration
  • Simplified job definitions (just containers and resources)
  • Straightforward pricing without EC2/Fargate complexity
  • Built-in global distribution without multi-region setup
Key Differences to Consider:
  • SaladCloud uses distributed consumer GPUs instead of EC2/Fargate
  • Job processing through HTTP endpoints rather than AWS Batch agents
  • Cloud storage patterns instead of EBS volumes
  • Slower cold starts but dramatically lower costs
The migration primarily involves adapting your AWS Batch job definitions to SaladCloud’s container-based job processing model while preserving your existing batch processing workflows.
💡 New to SaladCloud? Check out our getting started guide for an introduction to deploying on SaladCloud, or explore our job queue documentation to understand how SaladCloud handles batch processing.

Why Migrate from AWS Batch to SaladCloud?

AWS Batch has served as a reliable batch processing solution, but its costs can quickly escalate, especially for GPU-intensive workloads. SaladCloud offers a compelling alternative that addresses common AWS Batch pain points:
Cost Advantages:
  • 90% Lower Compute Costs: GPU + CPU + RAM combined cost a fraction of EC2 instances
    • RTX 4090 setup: $0.36/hr vs P3.2xlarge: $3.06/hr
  • Transparent Component Pricing: Simple rates - $0.004/vCPU/hour + $0.001/GB RAM/hour + GPU rate
  • Per-Second Billing: Hourly rates tracked per second for running containers
  • No Hidden Costs: No charges for VPC endpoints, NAT gateways, or data transfer between AZs
Operational Benefits:
  • Simplified Management: No compute environment configuration or AMI management
  • Automatic Global Distribution: Access to 11,000+ GPUs worldwide without multi-region complexity
  • Built-in Resilience: Automatic failover and retry logic included
  • No Infrastructure Overhead: Focus on your batch jobs, not EC2 fleet management
When SaladCloud Excels:
  • Long-running batch jobs where startup time is less critical
  • GPU-intensive workloads (ML training, rendering, simulations)
  • Cost-sensitive batch processing
  • Non-time-critical workloads (batch priority adds 40-50% savings on top of base 90% savings)
  • Globally distributed data processing
  • Development and testing environments
Trade-offs to Consider:
  • Cold Start Times: Container startup takes minutes vs. seconds on pre-warmed EC2 instances
  • Storage Model: No EBS volumes; use cloud storage APIs instead
  • Service Integration: Fewer native AWS service integrations
  • Job Complexity: Better suited for containerized workloads than complex multi-step pipelines

Product Comparison: AWS Batch vs. SaladCloud

Core Component Mapping

| AWS Batch Component | SaladCloud Equivalent | Key Differences |
|---|---|---|
| Compute Environment | Container Groups | No EC2 configuration needed; automatic GPU provisioning |
| Job Queues | Salad Job Queues | HTTP-based job distribution instead of agent-based |
| Job Definitions | Container Configuration | Simpler format; no need for vCPU/memory registration |
| Array Jobs | Multiple Job Submissions | Submit individual jobs; same parallelization benefits |
| Job Dependencies | Application-Level Logic | Handle dependencies in your code or orchestration layer |
| CloudWatch Logs | Portal Logs/External Logging | Built-in logs or integrate with Datadog, Axiom, etc. |
| Step Functions | External Orchestrators | Use Airflow, Temporal, or similar for complex workflows |

Feature Comparison

| Feature | AWS Batch | SaladCloud |
|---|---|---|
| Job Scheduling | Priority-based with fair share | FIFO queue processing |
| Auto Scaling | Based on queue depth | Queue-based or custom metrics |
| Spot/On-Demand Mix | Configurable compute environments | 4 priority tiers (batch adds 40-50% to base savings) |
| GPU Support | Accelerated Computing instances | Consumer GPUs (RTX 4090, 5090, etc.) |
| Container Runtime | ECS or EKS | Docker containers |
| Job Retries | Configurable retry attempts | Automatic 3 retries (4 total attempts) |
| Job Timeouts | Configurable per job | Container-level configuration |
| Long-Running Jobs | Supported with spot instance risks | Use Kelpie for checkpointing/resumption |
| Multi-Step Jobs | Via Step Functions | Single container jobs (orchestrate externally) |
| Storage | EBS volumes, EFS | S3-compatible cloud storage (e.g., R2) |
| Networking | VPC, Security Groups | No networking config needed with Job Queues |
| Monitoring | CloudWatch Metrics/Logs | Portal metrics, external monitoring tools |
| Cost Model | EC2/Fargate pricing + Batch overhead | Simple hourly rates (billed per second) |

Migration Requirements

Technical Requirements

  • Containerization: Jobs must run in Docker containers (you likely already have this with ECS task definitions)
  • HTTP Interface: Jobs receive work via HTTP endpoints instead of AWS Batch job parameters
  • Cloud Storage: Replace EBS/EFS with S3-compatible storage (Cloudflare R2 recommended for no egress fees)
  • Queue Worker: Add the Salad Job Queue Worker binary to your container (handles job distribution)

Architectural Shifts

  • From Agent-Based to Queue Worker: AWS Batch agents pull jobs; SaladCloud Queue Worker receives and forwards jobs locally
  • From EC2 Fleets to Distributed Nodes: No direct control over compute instances
  • From VPC Networking to No Networking: Job Queues eliminate networking configuration entirely
  • From IAM Roles to API Keys: Different authentication model

Before You Begin: Key Concepts

Understanding the Job Processing Model

AWS Batch Model:
Job Queue → Compute Environment → EC2 Instance → Batch Agent → Container
SaladCloud Model:
Job Queue → Container Group → Distributed Nodes → Queue Worker → Your App
The key difference is that SaladCloud uses an HTTP-based job distribution model where the Salad Job Queue Worker (a lightweight binary you add to your container) receives jobs from the queue and forwards them to your application via localhost HTTP calls. This means your application doesn’t need IPv6 binding or external network access.

Container Startup Behavior

AWS Batch: Containers start when jobs are assigned, run the job, then terminate.
SaladCloud: Containers run continuously and process multiple jobs. You can use Job Queue Autoscaling to automatically scale to zero when you have no jobs left to process. Your application should:
  • Start an HTTP server to receive jobs
  • Process jobs when received
  • Return results via HTTP response
  • Stay running to process more jobs
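The four responsibilities above can be sketched with nothing but the Python standard library. This is a minimal illustration, not the Queue Worker's actual protocol: `process_data`, `start_server`, and `post_job` are hypothetical names, and `post_job` only simulates the worker forwarding a job over localhost.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def process_data(payload: dict) -> dict:
    """Hypothetical stand-in for your existing batch logic."""
    return {"echo": payload}


class JobHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON job body sent to the local endpoint.
        length = int(self.headers.get("Content-Length", 0))
        try:
            job = json.loads(self.rfile.read(length) or b"{}")
            status, body = 200, process_data(job)
        except Exception as exc:
            # A 500 response marks the job as failed (see the retry notes).
            status, body = 500, {"error": str(exc)}
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # Keep the example quiet.


def start_server(port: int = 0) -> HTTPServer:
    """Start the job server on localhost in a background thread."""
    server = HTTPServer(("127.0.0.1", port), JobHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_job(port: int, job: dict) -> tuple:
    """Simulate a job being forwarded to the app over localhost."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(
            "POST",
            "/process",
            body=json.dumps(job),
            headers={"Content-Type": "application/json"},
        )
        resp = conn.getresponse()
        return resp.status, json.loads(resp.read())
    finally:
        conn.close()
```

Because the server keeps running after each response, the same process can handle one job after another, which is the main behavioral change from the run-once AWS Batch container.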

Storage Patterns

Since SaladCloud doesn’t support mounted volumes, you’ll need to adapt your storage strategy.
Important: Use Egress-Free Storage
We strongly recommend using egress-free storage providers like Cloudflare R2 instead of AWS S3. SaladCloud’s distributed nodes are not in datacenters, so egress fees from traditional cloud storage can add up quickly.
# AWS Batch pattern with a mounted EFS volume
def process_job(job_params):
    input_file = f"/mnt/efs/inputs/{job_params['file_id']}"
    output_file = f"/mnt/efs/outputs/{job_params['file_id']}.result"

    data = load_from_disk(input_file)
    result = process_data(data)
    save_to_disk(result, output_file)

# SaladCloud pattern with Cloudflare R2 (recommended) or S3
import boto3

# For Cloudflare R2 (no egress fees)
s3 = boto3.client('s3',
    endpoint_url='https://your-account.r2.cloudflarestorage.com',
    aws_access_key_id='your-r2-access-key',
    aws_secret_access_key='your-r2-secret'
)

# Or for AWS S3 (will incur egress charges)
# s3 = boto3.client('s3')

def process_job(job_params):
    # Download from S3
    input_data = s3.get_object(
        Bucket=job_params['input_bucket'],
        Key=job_params['input_key']
    )['Body'].read()

    # Process in memory or temp storage
    result = process_data(input_data)

    # Upload to S3
    s3.put_object(
        Bucket=job_params['output_bucket'],
        Key=job_params['output_key'],
        Body=result
    )

Step-by-Step Migration Process

Step 1: Prepare Your SaladCloud Environment

Account Setup

  1. Create account at portal.salad.com
  2. Set up organization and project
  3. Generate API key for programmatic access

Install SaladCloud SDK (Optional)

# Python
pip install salad-cloud-sdk

# Node.js
npm install @saladtechnologies/salad-cloud-sdk

Step 2: Convert AWS Batch Job Definitions

Transform ECS Task Definitions

AWS Batch Job Definition:
{
  "jobDefinitionName": "image-processing",
  "type": "container",
  "containerProperties": {
    "image": "my-ecr-repo/processor:latest",
    "vcpus": 4,
    "memory": 8192,
    "jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
    "environment": [{ "name": "PROCESSING_MODE", "value": "batch" }],
    "resourceRequirements": [{ "type": "GPU", "value": "1" }]
  }
}
SaladCloud Container Configuration:
# Dockerfile with Salad Job Queue Worker
FROM my-ecr-repo/processor:latest

# Download the Salad Job Queue Worker binary
ADD https://github.com/SaladTechnologies/salad-cloud-job-queue-worker/releases/latest/download/salad-job-queue-worker-linux-amd64 /usr/local/bin/salad-job-queue-worker
RUN chmod +x /usr/local/bin/salad-job-queue-worker

# Your existing application setup
WORKDIR /app
COPY . .

# You'll need to manage both processes - your app and the queue worker
# See /container-engine/how-to-guides/job-processing/queue-worker for s6-overlay or wrapper approaches
# The queue worker will forward jobs to your app on localhost:8080
# Your app does NOT need IPv6 binding when using job queues

Adapt Job Input/Output Patterns

AWS Batch Job Script:
import os
import json

def main():
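    # Assumes the job definition passes parameters in as a JSON environment variable;
    # AWS Batch itself substitutes parameters into the command via Ref:: placeholders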
    job_params = json.loads(os.environ.get('BATCH_JOB_PARAMETERS', '{}'))
    input_file = job_params['inputFile']
    output_location = job_params['outputLocation']

    # Process the job
    result = process_file(input_file)

    # Save results
    save_to_s3(result, output_location)

if __name__ == "__main__":
    main()
SaladCloud HTTP Handler:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class JobRequest(BaseModel):
    inputFile: str
    outputLocation: str

@app.post("/process")
async def process_job(request: JobRequest):
    try:
        # Process the job (same logic as before)
        result = process_file(request.inputFile)
        save_to_s3(result, request.outputLocation)

        return {
            "status": "success",
            "output": request.outputLocation
        }
    except Exception as e:
        # Return 500 to trigger retry
        raise HTTPException(status_code=500, detail=str(e))

# When using Job Queues, the queue worker forwards jobs over localhost,
# so the app only needs to listen on a local IPv4 port
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)  # No IPv6 needed with job queues!
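The job payloads in this guide carry `s3://` style locations, while the storage client calls take a separate bucket and key. A small helper can bridge the two. This is a sketch under assumptions: `split_s3_uri` and `output_target` are hypothetical names, and the `.result` suffix simply mirrors the earlier EFS example.

```python
import posixpath


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in URI: {uri!r}")
    return bucket, key


def output_target(output_location: str, input_uri: str) -> tuple:
    """Derive a deterministic output bucket and key for one input file."""
    bucket, out_prefix = split_s3_uri(output_location)
    _, input_key = split_s3_uri(input_uri)
    name = posixpath.basename(input_key)
    # Same input always maps to the same key, so a retried job
    # overwrites its earlier output instead of creating a duplicate.
    return bucket, posixpath.join(out_prefix, name + ".result")
```

Deriving the output key from the input keeps the handler idempotent, which matters because failed jobs are retried automatically.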

Step 3: Choose Your Job Queue Solution

These patterns can be implemented with any job queue, including those not on the Salad platform, but these two have platform integration with SaladCloud.

Salad Job Queues vs. Kelpie

SaladCloud offers two job queue solutions, each optimized for different use cases:
Salad Job Queues (Recommended for most AWS Batch migrations):
  • Best for jobs that complete in minutes to a few hours
  • Built-in retry logic (3 retries, 4 total attempts)
  • Simple HTTP-based job distribution
  • Native autoscaling based on queue depth
  • No additional setup required
Salad Kelpie (For long-running or interruptible workloads):
  • Designed for jobs running many hours or days (ML training, simulations)
  • Built-in checkpointing and resumption capabilities
  • Automatic cloud storage integration for progress saves
  • Handles node interruptions gracefully
  • Ideal for workloads that need to survive node failures
When to use Kelpie instead of Job Queues:
  • Jobs that run longer than 30 minutes
  • ML model training or fine-tuning
  • Molecular dynamics simulations
  • Any workload where losing progress would be costly
  • Jobs that need to save and resume from checkpoints
For this guide, we’ll use Salad Job Queues as they’re the closest match to AWS Batch for most use cases. If you have long-running workloads, see our Kelpie documentation.

Create a Salad Job Queue

Job Queues can only be created via the API (not available in the portal):
curl -X POST "https://api.salad.com/api/public/organizations/$ORG/projects/$PROJECT/queues" \
  -H "Salad-Api-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "batch-processing-queue",
    "display_name": "Batch Processing Queue",
    "description": "Queue for batch processing jobs migrated from AWS Batch"
  }'
Or via Python SDK:
from salad_cloud_sdk import SaladCloudSdk

sdk = SaladCloudSdk(api_key="YOUR_API_KEY")

queue = sdk.queues.create_queue(
    organization_name="your-org",
    project_name="your-project",
    request_body={
        "name": "batch-processing-queue",
        "display_name": "Batch Processing Queue"
    }
)

Step 4: Deploy Container Group with Queue

Container Group Configuration

from salad_cloud_sdk import SaladCloudSdk

sdk = SaladCloudSdk(api_key="YOUR_API_KEY")

# Create container group connected to job queue
# Note: Container Gateway is NOT needed when using Job Queues
container_group = sdk.container_groups.create_container_group(
    organization_name="your-org",
    project_name="your-project",
    request_body={
        "name": "batch-processor",
        "container": {
            "image": "your-registry/batch-processor:latest",
            "resources": {
                "cpu": 4,
                "memory": 8192,
                "gpu_classes": ["ed563892-aacd-40f5-80b7-90c9be6c759b"]  # RTX 4090 (24 GB)
            },
            "environment_variables": {
                "PROCESSING_MODE": "batch",
                "AWS_REGION": "us-east-1"  # For S3 access
            }
        },
        "queue_connection": {
            "queue_name": "batch-processing-queue",
            "port": 8080  # Port where your app listens locally
        },
        "replicas": 3,  # Start with desired capacity
        "autostart_policy": True,
        "restart_policy": "always"
        # No networking/gateway configuration needed!
    }
)

Step 5: Submit and Monitor Jobs

Job Submission

AWS Batch Pattern:
import boto3

batch = boto3.client('batch')

response = batch.submit_job(
    jobName='process-image-001',
    jobQueue='my-job-queue',
    jobDefinition='image-processing',
    parameters={
        'inputFile': 's3://bucket/input/image.jpg',
        'outputLocation': 's3://bucket/output/'
    }
)
job_id = response['jobId']
SaladCloud Pattern:
from salad_cloud_sdk import SaladCloudSdk

sdk = SaladCloudSdk(api_key="YOUR_API_KEY")

# Submit job to queue
job = sdk.queues.create_job(
    organization_name="your-org",
    project_name="your-project",
    queue_name="batch-processing-queue",
    request_body={
        "input": {
            "inputFile": "s3://bucket/input/image.jpg",
            "outputLocation": "s3://bucket/output/"
        }
    }
)
job_id = job.id

Job Monitoring

# Check job status
job_status = sdk.queues.get_job(
    organization_name="your-org",
    project_name="your-project",
    queue_name="batch-processing-queue",
    job_id=job_id
)

print(f"Job Status: {job_status.status}")
if job_status.status == "completed":
    print(f"Results: {job_status.output}")
elif job_status.status == "failed":
    print(f"Error: {job_status.error}")
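A single status check is rarely enough in practice, so a polling loop with backoff is a common companion. The sketch below is independent of the SDK: `wait_for_job` is a hypothetical helper that takes any callable returning a status string, and it assumes `"completed"` and `"failed"` (the values used in the snippet above) are the only terminal statuses.

```python
import time
from typing import Callable

# Assumption: these are the terminal statuses; anything else means "keep waiting".
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    get_status: Callable[[], str],
    timeout_s: float = 3600.0,
    initial_delay_s: float = 2.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() with exponential backoff until a terminal status."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} seconds")
        sleep(delay)
        # Double the wait each round, capped at max_delay_s.
        delay = min(delay * 2, max_delay_s)
```

With the client from the snippet above, usage might look like `wait_for_job(lambda: sdk.queues.get_job(...).status)`, passing the same arguments as the status check.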

Step 6: Implement Autoscaling

Queue-Based Autoscaling

# Configure autoscaling based on queue depth
sdk.container_groups.update_container_group(
    organization_name="your-org",
    project_name="your-project",
    container_group_name="batch-processor",
    request_body={
        "queue_autoscaler": {
            "min_replicas": 0,  # Scale to zero when idle
            "max_replicas": 50,  # Maximum capacity
            "desired_queue_length": 2,  # Target 2 jobs per instance
            "polling_period": 30  # Check every 30 seconds
        }
    }
)
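To build intuition for how these settings interact, the arithmetic can be sketched as a pure function. This is only a mental model, assuming the autoscaler aims for roughly `desired_queue_length` waiting jobs per replica; it is not SaladCloud's documented algorithm, and `target_replicas` is a hypothetical name.

```python
import math


def target_replicas(
    queue_length: int,
    desired_queue_length: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Estimate the replica count for a given queue depth (illustrative only)."""
    if desired_queue_length <= 0:
        raise ValueError("desired_queue_length must be positive")
    # Enough replicas that each one has about desired_queue_length jobs waiting.
    wanted = math.ceil(queue_length / desired_queue_length)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

Under this model, an empty queue with `min_replicas` set to 0 maps to zero replicas, and a very deep queue is capped at `max_replicas`.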

Custom Metrics Autoscaling

import time
from datetime import datetime

from salad_cloud_sdk import SaladCloudSdk

def scale_based_on_time():
    """Scale up during business hours"""
    sdk = SaladCloudSdk(api_key="YOUR_API_KEY")

    while True:
        hour = datetime.now().hour

        # Scale up during business hours (9 AM - 6 PM)
        if 9 <= hour < 18:
            target_replicas = 10
        else:
            target_replicas = 2

        sdk.container_groups.update_container_group(
            organization_name="your-org",
            project_name="your-project",
            container_group_name="batch-processor",
            request_body={"replicas": target_replicas}
        )

        time.sleep(300)  # Check every 5 minutes

Migration Patterns for Common AWS Batch Scenarios

Pattern 1: Simple Batch Processing

AWS Batch Approach:
  • Submit jobs with parameters
  • Process in container
  • Write results to S3
SaladCloud Migration:
# 1. Container with HTTP endpoint
@app.post("/process")
async def process_batch(job: dict):
    # Same processing logic
    result = your_existing_function(job['input'])
    return {"output": result}

# 2. Submit jobs to queue
for item in batch_items:
    sdk.queues.create_job(
        organization_name="your-org",
        project_name="your-project",
        queue_name="batch-queue",
        request_body={"input": item}
    )

Pattern 2: Array Jobs

AWS Batch Array Jobs:
aws batch submit-job \
  --array-properties size=100 \
  --job-name array-job \
  --job-queue my-queue \
  --job-definition image-processing
SaladCloud Equivalent:
# Submit multiple jobs to achieve same parallelization
jobs = []
for i in range(100):
    job = sdk.queues.create_job(
        organization_name="your-org",
        project_name="your-project",
        queue_name="batch-queue",
        request_body={
            "input": {
                "index": i,
                "total": 100,
                "data": f"s3://bucket/data/chunk_{i}.json"
            }
        }
    )
    jobs.append(job.id)

# Monitor all jobs
for job_id in jobs:
    status = sdk.queues.get_job(
        organization_name="your-org",
        project_name="your-project",
        queue_name="batch-queue",
        job_id=job_id
    )
    print(f"Job {job_id}: {status.status}")
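In AWS Batch, each array child reads its index from the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable; here the index travels in the job payload instead. The sketch below separates payload construction and status summarizing from the SDK calls so both can be tested locally. `build_array_payloads` and `summarize_statuses` are hypothetical helper names.

```python
from collections import Counter


def build_array_payloads(size: int, data_prefix: str) -> list:
    """Build one job payload per array index, mirroring the loop above."""
    return [
        {
            "input": {
                "index": i,
                "total": size,
                "data": f"{data_prefix}/chunk_{i}.json",
            }
        }
        for i in range(size)
    ]


def summarize_statuses(statuses: list) -> dict:
    """Count jobs per status, similar to an array job's status summary."""
    return dict(Counter(statuses))
```

Each payload can then be passed as `request_body` to the job creation call, and the collected status strings fed to `summarize_statuses` to track overall progress.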

Pattern 3: Long-Running Jobs with Kelpie

AWS Batch Long-Running Jobs:
  • Multi-hour ML training jobs
  • Risk of spot instance termination
  • Manual checkpointing required
SaladCloud with Kelpie:
# Add Kelpie to your container
FROM pytorch/pytorch:2.7.1-cuda12.6-cudnn9-runtime

# Add the Kelpie binary
ARG KELPIE_VERSION=0.6.0
ADD https://github.com/SaladTechnologies/kelpie/releases/download/${KELPIE_VERSION}/kelpie /kelpie
RUN chmod +x /kelpie

# Your training code
COPY train.py /app/train.py
WORKDIR /app

# Kelpie handles job execution and checkpointing
CMD ["/kelpie"]
Benefits of Kelpie for long jobs:
  • Automatic checkpoint upload to S3-compatible storage
  • Resume from last checkpoint after interruption
  • No data loss from node failures
  • Built-in integration with SaladCloud
See the Kelpie guide for detailed setup.
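Resuming only works if the training code itself writes and reads checkpoints. The following is a minimal application-side sketch, assuming a single JSON checkpoint file on local disk; how Kelpie syncs that file to storage is configured separately and is not shown here. `save_checkpoint`, `load_checkpoint`, and `run_steps` are hypothetical names.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write the checkpoint atomically so an interruption cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # Atomic rename within the same directory.


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the saved state, or the default when no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run_steps(path: str, total_steps: int, checkpoint_every: int = 10,
              stop_after=None) -> int:
    """Run work steps, resuming from the last checkpoint if one exists."""
    step = load_checkpoint(path, {"step": 0})["step"]
    done_this_run = 0
    while step < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break  # Simulates a node interruption.
        step += 1  # One unit of work (for example, a training step).
        done_this_run += 1
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, {"step": step})
    return step
```

Work done after the last checkpoint is repeated on resume, so the checkpoint interval is a trade-off between upload overhead and lost progress.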

Pattern 4: GPU-Accelerated ML Training

AWS Batch with GPU:
{
  "resourceRequirements": [
    { "type": "GPU", "value": "1" },
    { "type": "MEMORY", "value": "32768" },
    { "type": "VCPU", "value": "8" }
  ]
}
SaladCloud GPU Configuration:
container_group = {
    "container": {
        "image": "your-ml-training:latest",
        "resources": {
            "cpu": 8,
            "memory": 32768,
            "gpu_classes": [
                "ed563892-aacd-40f5-80b7-90c9be6c759b",  # RTX 4090 (24 GB)
                "a5db5c50-cbcb-4596-ae80-6a0c8090d80f"   # RTX 3090 (24 GB)
            ]
        }
    }
}

Pattern 5: Dependent Jobs

AWS Batch with Dependencies:
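job1 = batch.submit_job(
    jobName="preprocess",
    jobQueue="my-job-queue",
    jobDefinition="image-processing"
)
job2 = batch.submit_job(
    jobName="process",
    jobQueue="my-job-queue",
    jobDefinition="image-processing",
    dependsOn=[{"jobId": job1['jobId']}]
)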
SaladCloud Pattern:
# Implement dependency logic in your application
@app.post("/process")
async def process_with_dependencies(job: dict):
    # Check if prerequisites are complete
    if job.get('depends_on'):
        for dep_id in job['depends_on']:
            dep_status = sdk.queues.get_job(
                organization_name="your-org",
                project_name="your-project",
                queue_name="batch-queue",
                job_id=dep_id
            )
            if dep_status.status != "completed":
                # Return 503 to retry later
                raise HTTPException(status_code=503, detail="Dependencies not ready")

    # Process the job
    result = process_data(job['input'])

    # Trigger dependent jobs if needed
    if job.get('triggers'):
        for next_job in job['triggers']:
            sdk.queues.create_job(
                organization_name="your-org",
                project_name="your-project",
                queue_name="batch-queue",
                request_body=next_job
            )

    return {"output": result}
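An alternative is to keep dependency handling on the submitting side, so a waiting job does not use up its retry attempts on 503 responses. The sketch below orders jobs with the standard library's `graphlib` (Python 3.9+). `run_pipeline` is a hypothetical helper, and `run_job` stands for any function you supply that submits a job, waits for it, and returns its final status.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_pipeline(dependencies: dict, run_job: Callable[[str], str]) -> list:
    """Run jobs so that every job starts only after its prerequisites.

    dependencies maps each job name to the set of job names it depends on.
    """
    completed = []
    # static_order() yields prerequisites before the jobs that need them
    # and raises graphlib.CycleError if the dependencies form a loop.
    for name in TopologicalSorter(dependencies).static_order():
        if run_job(name) != "completed":
            raise RuntimeError(f"job {name!r} did not complete; stopping pipeline")
        completed.append(name)
    return completed
```

This version runs jobs one at a time for clarity; independent branches could be submitted in parallel using `TopologicalSorter`'s `prepare()`, `get_ready()`, and `done()` methods.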

Monitoring and Logging

Replace CloudWatch with External Logging

Configure Axiom Logging (Recommended):
# In your container group configuration
container_group = {
    "container": {
        "logging": {
            "axiom": {
                "dataset": "salad-batch-jobs",
                "token": "YOUR_AXIOM_TOKEN",
                "url": "https://cloud.axiom.co"
            }
        }
    }
}
Application-Level Logging:
import json
import logging
import time

# Configure structured logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

@app.post("/process")
async def process_job(job: dict):
    job_id = job.get('id', 'unknown')
    started = time.monotonic()  # Used to report the job duration

    logger.info(json.dumps({
        "event": "job_started",
        "job_id": job_id,
        "input": job['input']
    }))

    try:
        result = process_data(job['input'])

        logger.info(json.dumps({
            "event": "job_completed",
            "job_id": job_id,
            "duration": time.monotonic() - started
        }))

        return {"output": result}

    except Exception as e:
        logger.error(json.dumps({
            "event": "job_failed",
            "job_id": job_id,
            "error": str(e)
        }))
        raise
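If you would rather have every log line be valid JSON (instead of a JSON string embedded in a text prefix), a custom formatter is one option. This is a sketch using only the standard library; `JsonFormatter`, `make_logger`, and the `fields` key passed through `extra` are hypothetical conventions, not something the logging integrations require.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields passed as extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def make_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream (stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `log.info("job_started", extra={"fields": {"job_id": job_id}})` then produces one JSON object per line, which most external log tools can index by field.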

Cost Optimization Strategies

1. Use Batch Priority for Non-Time-Sensitive Workloads

SaladCloud offers four priority tiers, with each tier offering additional savings on top of our already competitive base pricing (which is typically 80-90% less than AWS). For batch processing that isn’t time-critical, the “batch” priority tier offers the deepest discounts:
| Priority | Use Case | Additional Savings vs Salad High Priority | Availability |
|---|---|---|---|
| High | Production, time-critical | Baseline (already ~90% less than AWS) | Highest |
| Medium | Standard workloads | ~15-20% additional savings | Very Good |
| Low | Flexible deadlines | ~25-35% additional savings | Good |
| Batch | Non-urgent processing | ~40-50% additional savings | Variable |
Example pricing for comparable GPU workload (24 GB VRAM, 8 vCPU, 32 GB RAM):
AWS P4d.24xlarge (8x A100 40GB):
  • Total: ~$32.77/hour
  • Per GPU: ~$4.10/hour
  • Includes: 96 vCPUs, 1152 GB RAM (massive overkill for most batch jobs)
AWS P3.2xlarge (1x V100 16GB):
  • Total: ~$3.06/hour
  • Per GPU: $3.06/hour
  • Includes: 8 vCPUs, 61 GB RAM
SaladCloud RTX 4090 (24 GB):
  • High Priority: $0.30/hour GPU + $0.032/hour (8 vCPU) + $0.032/hour (32 GB RAM) = $0.364/hour total (88% less than P3.2xlarge)
  • Medium: $0.26 + $0.032 + $0.032 = $0.324/hour
  • Low: $0.22 + $0.032 + $0.032 = $0.284/hour
  • Batch: $0.18 + $0.032 + $0.032 = $0.244/hour (92% less than P3.2xlarge!)
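The arithmetic behind these figures can be reproduced with a few lines of Python, which is handy for estimating other configurations. The rates below are the ones quoted in this guide ($0.004 per vCPU-hour, $0.001 per GB-hour of RAM, plus the GPU rate); `hourly_cost` and `savings_pct` are hypothetical helper names, and current prices should be checked before budgeting.

```python
VCPU_RATE = 0.004    # USD per vCPU per hour, as quoted in this guide
RAM_GB_RATE = 0.001  # USD per GB of RAM per hour, as quoted in this guide


def hourly_cost(gpu_rate: float, vcpus: int, ram_gb: int) -> float:
    """Total hourly cost: GPU rate plus per-vCPU and per-GB RAM charges."""
    return gpu_rate + vcpus * VCPU_RATE + ram_gb * RAM_GB_RATE


def savings_pct(salad_hourly: float, aws_hourly: float) -> float:
    """Percentage saved relative to the AWS hourly price."""
    return (1 - salad_hourly / aws_hourly) * 100


high = hourly_cost(0.30, vcpus=8, ram_gb=32)   # High priority RTX 4090
batch = hourly_cost(0.18, vcpus=8, ram_gb=32)  # Batch priority RTX 4090
print(f"High:  ${high:.3f}/hour, {savings_pct(high, 3.06):.0f}% below P3.2xlarge")
print(f"Batch: ${batch:.3f}/hour, {savings_pct(batch, 3.06):.0f}% below P3.2xlarge")
```

Because billing is tracked per second, multiplying the hourly figure by actual runtime in hours gives a reasonable estimate for a single job.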
# Configure container group with batch priority for maximum savings
# This gives you an additional 40% off SaladCloud's already low prices
container_group = sdk.container_groups.create_container_group(
    organization_name="your-org",
    project_name="your-project",
    request_body={
        "name": "batch-processor",
        "container": {
            # ... container config
        },
        "priority": "batch",  # Additional 40-50% savings on top of base savings
        "replicas": 10
    }
)

2. Scale to Zero During Off-Hours

# Configure minimum replicas to 0 for idle periods
queue_autoscaler = {
    "min_replicas": 0,  # Scale to zero when no jobs
    "max_replicas": 100,
    "desired_queue_length": 1
}

3. Optimize Container Size

# Use multi-stage builds to minimize image size
FROM python:3.9 AS builder
COPY requirements.txt .
RUN pip install --user -r requirements.txt

FROM python:3.9-slim
COPY --from=builder /root/.local /root/.local
COPY . /app
WORKDIR /app

4. Batch Small Jobs

@app.post("/process")
async def process_batch(request: dict):
    # Process multiple items in one job
    results = []
    for item in request['batch']:
        result = process_item(item)
        results.append(result)

    return {"outputs": results}
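On the submitting side, small items need to be grouped before they are sent. The sketch below produces payloads shaped like the `request['batch']` the handler above reads; `chunked` and `build_batch_jobs` are hypothetical helpers, and how the payload is wrapped in the job submission body is left as in the earlier examples.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def build_batch_jobs(items: Iterable, size: int) -> list:
    """Group items into job payloads with a 'batch' list each."""
    return [{"batch": chunk} for chunk in chunked(items, size)]
```

Larger batches reduce per-job overhead, but a failure then retries the whole batch, so the batch size is worth tuning against your failure rate.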

Migration Checklist

Pre-Migration

  • Inventory AWS Batch job definitions and compute environments
  • Identify storage dependencies (EBS, EFS volumes)
  • Document job dependencies and workflows
  • Review IAM roles and permissions needed
  • Estimate monthly job volumes and compute requirements

Container Preparation

  • Convert job scripts to HTTP endpoints
  • Add Salad Queue Worker to containers
  • Update to use cloud storage instead of mounted volumes
  • Test containers locally (IPv6 binding is not required when using Job Queues)
  • Push containers to accessible registry

SaladCloud Setup

  • Create SaladCloud account and organization
  • Generate API keys
  • Create job queues
  • Deploy container groups with queue connections
  • Configure autoscaling policies

Testing

  • Submit test jobs to queues
  • Verify job processing and retries
  • Test autoscaling behavior
  • Validate logging and monitoring
  • Compare performance with AWS Batch baseline

Production Migration

  • Migrate batch jobs gradually (start with non-critical)
  • Monitor costs and performance
  • Adjust autoscaling based on actual usage
  • Update job submission scripts/applications
  • Decommission AWS Batch resources once stable

Common Challenges and Solutions

Challenge: No Step Functions Equivalent

Solution: Use external workflow orchestrators
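# Apache Airflow DAG example (Airflow 2.x)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from salad_cloud_sdk import SaladCloudSdk

def submit_salad_job(**context):
    # Note: this returns as soon as the job is submitted; add polling here
    # if the downstream task needs this job's results
    sdk = SaladCloudSdk(api_key="YOUR_API_KEY")
    job = sdk.queues.create_job(
        organization_name="your-org",
        project_name="your-project",
        queue_name="batch-queue",
        request_body=context['params']
    )
    return job.id

# Placeholder defaults; adjust the owner and start date for your deployment
default_args = {'owner': 'batch', 'start_date': datetime(2025, 1, 1)}

dag = DAG('batch_workflow', default_args=default_args)

preprocess = PythonOperator(
    task_id='preprocess',
    python_callable=submit_salad_job,
    params={'input': 'preprocess_config'},
    dag=dag  # Attach the task to the DAG
)

process = PythonOperator(
    task_id='process',
    python_callable=submit_salad_job,
    params={'input': 'process_config'},
    dag=dag
)

preprocess >> process  # Define dependencies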

Challenge: Job Scheduling

Solution: Implement cron-based job submission
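from apscheduler.schedulers.blocking import BlockingScheduler
from salad_cloud_sdk import SaladCloudSdk

# Placeholder: the job payloads to submit each night
nightly_jobs = []

scheduler = BlockingScheduler()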

@scheduler.scheduled_job('cron', hour=2)  # Run at 2 AM daily
def submit_nightly_batch():
    sdk = SaladCloudSdk(api_key="YOUR_API_KEY")

    # Submit batch jobs
    for job_config in nightly_jobs:
        sdk.queues.create_job(
            organization_name="your-org",
            project_name="your-project",
            queue_name="batch-queue",
            request_body=job_config
        )

scheduler.start()

Challenge: Large Data Transfer

Solution: Use pre-signed URLs and streaming
import boto3
from io import BytesIO

@app.post("/process")
async def process_large_file(job: dict):
    s3 = boto3.client('s3')

    # Stream large file from S3
    response = s3.get_object(
        Bucket=job['bucket'],
        Key=job['key']
    )

    # Process in chunks to avoid memory issues
    for chunk in response['Body'].iter_chunks(chunk_size=1024*1024):
        process_chunk(chunk)

    # Generate a pre-signed URL that the caller can use to upload results
    presigned_url = s3.generate_presigned_url(
        'put_object',
        Params={'Bucket': job['output_bucket'], 'Key': job['output_key']},
        ExpiresIn=3600
    )

    return {"upload_url": presigned_url}

Performance Optimization

Minimize Cold Starts

# Keep containers warm with minimal replicas
container_group = {
    "replicas": 2,  # Always keep 2 instances running
    "queue_autoscaler": {
        "min_replicas": 2,  # Never scale below 2
        "max_replicas": 100
    }
}

Optimize Job Distribution

# Process multiple small jobs per container invocation
@app.post("/process")
async def process_job_batch(request: dict):
    # Check if this is a batch request
    if 'batch_size' in request:
        # Pull multiple jobs from queue
        jobs = []
        for _ in range(request['batch_size']):
            job = await get_next_job()  # Your queue logic
            if job:
                jobs.append(job)

        # Process all jobs
        results = [process_single_job(job) for job in jobs]
        return {"results": results}
    else:
        # Single job processing
        return process_single_job(request)

What You’ll Gain

Migrating from AWS Batch to SaladCloud provides:

Immediate Benefits

  • 90% Cost Reduction: Dramatically lower compute costs for batch processing
  • Simplified Operations: No compute environment or AMI management
  • Global Scale: Access to 11,000+ GPUs worldwide
  • Transparent Pricing: Simple per-second billing without complex EC2 pricing tiers

Operational Improvements

  • Automatic Failover: Built-in retry logic and node replacement
  • Flexible Scaling: Scale to zero or thousands of instances
  • No Infrastructure Management: Focus on your batch jobs, not EC2 fleets
  • Unified Job Processing: Same patterns for CPU and GPU workloads

Trade-offs Accepted

  • Longer cold start times (minutes vs. seconds)
  • Different storage patterns (cloud APIs vs. mounted volumes)
  • Fewer AWS service integrations
  • HTTP-based job distribution instead of agent-based

Getting Help

Ready to start saving on your batch processing costs? Create your SaladCloud account and begin migrating your AWS Batch workloads today!