Last Updated: February 28, 2025

Real-Time AI Inference with a Redis-Based Queue

Several customers have successfully implemented a flexible, platform-independent, Redis-based queue for real-time applications on SaladCloud, which offers the following advantages:
  • The Redis cluster, client applications, and Salad nodes are all strategically deployed within the same region to ensure local access and minimize latency.
  • Supports multiple clients and servers, providing real-time request/response functionality in both streaming and non-streaming modes, with synchronous or asynchronous processing.
  • More resilient to burst traffic, node failures, and the variability in AI inference times, while allowing easy customization for specific applications, such as using different timeout settings per request and adjusting streaming granularity (tokens or chunks).
  • The input and output data of a task can be embedded within the request and response. For large datasets, data exchange can occur directly between client applications or SCE instances and cloud storage, with the request and response containing only a reference to the data (see the sketch after this list).
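As a concrete illustration, the following minimal sketch models a request and response with Pydantic, embedding small payloads directly and referencing large datasets in cloud storage. The model and field names (InferenceRequest, input_url, and so on) are hypothetical, not part of the reference design.

from typing import Optional
from pydantic import BaseModel

class InferenceRequest(BaseModel):
    request_id: str
    priority: int = 0                 # higher-priority requests can be processed first
    prompt: Optional[str] = None      # small inputs can be embedded directly in the request
    input_url: Optional[str] = None   # large inputs live in cloud storage; only a reference is passed

class InferenceResponse(BaseModel):
    request_id: str
    text: Optional[str] = None        # small outputs embedded directly in the response
    output_url: Optional[str] = None  # large outputs uploaded to cloud storage and referenced here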
However, implementing this solution requires effort and comes with certain limitations:
  • A self-hosted Redis cluster, or a managed service from a public cloud provider, is required (with cost factors to consider).
  • The Redis worker must be integrated into both client applications and inference servers.
  • IP whitelisting for access control is not applicable from Salad nodes to the Redis cluster; application-level authentication can be used instead (see the sketch after this list).
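As a sketch of the last point, application-level authentication with redis-py might look as follows, assuming a password-protected (and optionally TLS-enabled) Redis cluster; the host, port, and environment variable names are placeholders.

import os
import redis

r = redis.Redis(
    host=os.environ.get("REDIS_HOST", "redis.example.com"),
    port=int(os.environ.get("REDIS_PORT", "6379")),
    password=os.environ.get("REDIS_PASSWORD"),  # application-level auth instead of IP whitelisting
    ssl=True,                                   # encrypt traffic between Salad nodes and the Redis cluster
    socket_timeout=10,                          # fail fast if the cluster is unreachable
)
r.ping()  # verifies connectivity and credentials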
The Redis-based solution is generally used for low-latency, real-time applications where responses or partial responses must be returned as soon as they are ready, or for node-to-node communication within the same container group or across different groups. It may be applied to batch jobs or long-running tasks, but this requires additional logic and effort. For these scenarios, we typically rely on Salad Kelpie and AWS SQS.

Key Concepts in Redis

Redis is single-threaded but handles high levels of concurrency efficiently using asynchronous I/O and an event-driven architecture, and it supports asynchronous concurrency at the client level.
  • A list in Redis is an ordered collection of elements where items are added in the order they are inserted. It supports efficient insertion and removal of elements from both ends, and a list is automatically removed from Redis once it contains no elements. Lists also support blocking operations, such as blocking reads on a non-existent list, with a specified timeout.
  • A zset (sorted set) in Redis is a data structure that contains unique elements, each assigned a score. The elements are stored in order of their scores, allowing efficient retrieval, with the highest-scoring element being accessed and removed first.
  • A hash in Redis is a collection of key-value pairs, where each key is unique and maps to a specific value. It is ideal for representing objects with multiple fields.
  • The Python Redis client uses a connection pool to manage connections efficiently, reducing overhead. Instead of establishing a new connection for each request, it initializes the connection lazily on the first command and reuses it for subsequent requests.
  • The socket_timeout setting defines the maximum time (in seconds) the client will wait for a response from the Redis cluster before timing out. If the cluster does not respond within this duration, the client raises a timeout error.
  • pydantic_redis simplifies working with Redis and Pydantic models together by providing tools for serializing and deserializing Pydantic models, making it easier to manage data between a Redis store and your application.
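The short sketch below exercises these primitives using redis-py; the key names (‘demo:*’) and sample values are purely illustrative.

import redis

# A lazily initialized connection drawn from the client's connection pool.
r = redis.Redis(host="localhost", port=6379, socket_timeout=10, decode_responses=True)

# List: push on the left, blocking read on the right; brpop returns None after the timeout.
r.lpush("demo:list", "chunk-1", "chunk-2")
print(r.brpop("demo:list", timeout=5))        # ('demo:list', 'chunk-1')

# Zset: unique members ordered by score; pop the highest-scoring (highest-priority) member first.
r.zadd("demo:pending", {"request_1": 5, "request_2": 10})
print(r.bzpopmax("demo:pending", timeout=5))  # ('demo:pending', 'request_2', 10.0)
print(r.zcard("demo:pending"))                # number of pending requests left: 1

# Hash: unique fields mapped to values, well suited to objects with multiple fields.
r.hset("demo:request_1", mapping={"state": "completed", "node": "node-abc"})
print(r.hgetall("demo:request_1"))            # {'state': 'completed', 'node': 'node-abc'}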

Reference Design: Non-streaming

Please refer to the example code (client, server and common code) for this scenario. The solution functions as both a real-time queue and a storage system, keeping historical I/O data and information about client applications and servers. The interaction between client applications and servers can be further simplified; for example, servers can directly save the result to ‘Temporary:Request_1’, eliminating the need for one write operation (S7) from servers and one read operation (C9) by client applications from the Redis cluster.

The SCE instance can automatically retrieve and process new requests based on its current load and available resources. During a node failure, the SCE instance immediately stops fetching new requests, and any request IDs it has already claimed are lost, as they have already been removed from ‘REQUEST:PENDING’ in Redis. Client applications may therefore encounter timeout errors when performing a blocking read from ‘Temporary:Request_1’, for example because the serving node has failed or inference is taking longer than expected. When a timeout error occurs, applications can perform additional reads to extend the waiting time for specific requests (flexible and customizable). Applications can also resend the request using a new ID and a higher priority, while disregarding the previous request (which may still be in the queue or being processed), or simply return an error to users.

Rather than implementing complex retry logic, applications can query the number of pending requests in the Redis cluster before submitting a large quantity of requests and apply flow control as needed, such as rejecting new user requests during periods of congestion. The system monitor should regularly track the pending requests in the cluster and scale the GPU resource pool up or down accordingly. To enhance I/O throughput and AI inference efficiency, you may run multiple Redis worker instances (processes or coroutines) alongside the inference server (running as a separate process), which supports multiple threads or asynchronous concurrency with batched inference.
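A minimal sketch of this non-streaming flow is shown below, assuming the key names mentioned above (‘REQUEST:PENDING’ and ‘Temporary:<request_id>’); run_inference() is a placeholder for the actual call to the inference server, and the exact message layout differs from the example code.

import json
import uuid
import redis

# socket_timeout must exceed the blocking-read timeout used below.
r = redis.Redis(host="localhost", port=6379, socket_timeout=90, decode_responses=True)

def submit_and_wait(prompt: str, priority: int = 0, timeout: int = 60):
    """Client side: enqueue a request and block until the response arrives or the read times out."""
    request_id = f"Request_{uuid.uuid4().hex[:8]}"
    r.set(request_id, json.dumps({"prompt": prompt}))   # store the request body
    r.zadd("REQUEST:PENDING", {request_id: priority})   # enqueue the request ID by priority
    result = r.brpop(f"Temporary:{request_id}", timeout=timeout)
    if result is None:
        # Timeout: keep reading, resend with a new ID and higher priority, or return an error.
        return None
    return json.loads(result[1])

def worker_loop():
    """Server side: pop the highest-priority pending request, run inference, write back the result."""
    while True:
        popped = r.bzpopmax("REQUEST:PENDING", timeout=10)
        if popped is None:
            continue                                    # no pending requests at the moment
        _, request_id, _score = popped
        request = json.loads(r.get(request_id))
        response = {"request_id": request_id, "text": run_inference(request["prompt"])}  # placeholder
        r.lpush(f"Temporary:{request_id}", json.dumps(response))
        r.expire(f"Temporary:{request_id}", 600)        # avoid leaving orphaned results behind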

Reference Design: Streaming

Please refer to the example code (client, server and common code) for this scenario. The streaming solution is similar to the non-streaming solution, differing only in how results are generated by servers and delivered to client applications. In Redis, a list is used to achieve the streaming effect, with servers writing to the left and clients reading from the right. Both writes and reads are performed in chunks (tokens or sentences). On the server side, specifically for LLMs, the Redis worker retrieves partial responses from the inference server, such as Hugging Face’s TGI, and writes them into Redis. Instead of calling the TGI server directly, the Redis worker can also run custom inference code built on Hugging Face libraries, streaming results that are then returned to Redis.

The socket_timeout can be set to a smaller value in this case, as partial inference results are returned as soon as they are ready, enabling quicker error detection and response handling. Client applications may still encounter timeout errors when performing a blocking read from ‘Streaming:Request_1’. A similar logic can be applied: if some chunks have already been received for a request, it is highly likely that a server failure has occurred. The streaming granularity can be customized and changed based on application needs, such as token-by-token, chunk-by-chunk, sentence-by-sentence, or image-by-image for video, as long as the data can be converted to strings, providing flexibility in how data is processed and delivered.
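Below is a minimal sketch of this streaming flow, assuming the ‘Streaming:<request_id>’ key mentioned above; generate_stream() and the ‘<END>’ end-of-stream marker are illustrative placeholders rather than part of the example code.

import redis

r = redis.Redis(host="localhost", port=6379, socket_timeout=30, decode_responses=True)

END_MARKER = "<END>"

def stream_result(request_id: str, prompt: str):
    """Server side: write each chunk to the left of the list as soon as it is generated."""
    for chunk in generate_stream(prompt):           # placeholder: tokens from TGI or custom inference code
        r.lpush(f"Streaming:{request_id}", chunk)
    r.lpush(f"Streaming:{request_id}", END_MARKER)  # signal that the stream is complete
    r.expire(f"Streaming:{request_id}", 600)

def read_stream(request_id: str, chunk_timeout: int = 10):
    """Client side: blocking-read chunks from the right until the end marker or a timeout."""
    while True:
        item = r.brpop(f"Streaming:{request_id}", timeout=chunk_timeout)
        if item is None:
            # Timeout: if some chunks were already received, a server failure is likely; resend or fail.
            return
        chunk = item[1]
        if chunk == END_MARKER:
            return
        yield chunk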

Local Performance Test

From the test, we can see that the real-time queue does not impose an intensive read/write workload on Redis, and its throughput scales linearly with the number of clients and servers. However, a smaller chunk size may reduce transmission efficiency and overall system throughput.