High Level
Limited local storage (up to 50 GB) and the inevitability of node interruptions introduce complexity when managing services that serve inference from hundreds or even thousands of models. We can combine three techniques to manage that complexity in an arbitrarily scalable way:

- Preload the container image with the most popular models - This way your container can become productive immediately on start, while it downloads more models in the background.
- Local Least-Recently-Used (LRU) Caching - Only keep the most recently used 50 GB of models stored locally on any given node (see the cache sketch after this list).
- Smart Job Scheduling - Assign jobs to nodes in a way that minimizes model downloading and swapping (see the routing sketch below).
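
Here is a minimal sketch of the local LRU cache idea: keep a disk budget of roughly 50 GB, track when each model was last served, and evict the least-recently-used models when a new download would overflow the budget. The cache directory, budget constant, and `_download` helper are assumptions for illustration, not part of any particular framework.

```python
import os
import shutil
from collections import OrderedDict

CACHE_DIR = "/models"              # local disk location for model weights (assumption)
CACHE_BUDGET_BYTES = 50 * 1024**3  # roughly 50 GB of local storage

class ModelCache:
    """Keep at most CACHE_BUDGET_BYTES of model files on local disk,
    evicting the least-recently-used models first."""

    def __init__(self):
        self._models = OrderedDict()   # model_id -> size in bytes, oldest first
        self._total_bytes = 0

    def get(self, model_id: str) -> str:
        """Return the local path for a model, downloading it if it is not cached."""
        if model_id in self._models:
            self._models.move_to_end(model_id)   # mark as most recently used
            return os.path.join(CACHE_DIR, model_id)

        path = self._download(model_id)          # hypothetical fetch from remote storage
        size = sum(
            os.path.getsize(os.path.join(root, f))
            for root, _, files in os.walk(path) for f in files
        )
        self._models[model_id] = size
        self._total_bytes += size
        self._evict()
        return path

    def _evict(self):
        # Drop least-recently-used models until we fit in the budget,
        # but never evict the model we just fetched.
        while self._total_bytes > CACHE_BUDGET_BYTES and len(self._models) > 1:
            victim, size = self._models.popitem(last=False)
            shutil.rmtree(os.path.join(CACHE_DIR, victim), ignore_errors=True)
            self._total_bytes -= size

    def _download(self, model_id: str) -> str:
        # Placeholder: pull model weights from your object store or registry.
        raise NotImplementedError
```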
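For the scheduling piece, one way to keep a model's requests landing on the node that already holds its weights is rendezvous (highest-random-weight) hashing; this is just one possible approach under the stated goal of minimizing downloads and swapping, and the node names below are illustrative.

```python
import hashlib

def route_model(model_id: str, nodes: list[str]) -> str:
    """Pick a node for a model using rendezvous hashing.

    The same model consistently maps to the same node while that node is up,
    so its weights stay warm in that node's local cache; when a node is
    interrupted, only the models it owned get reassigned elsewhere."""
    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}:{model_id}".encode()).hexdigest()
        return int(digest, 16)
    return max(nodes, key=score)

# Example: requests for the same model keep landing on the same surviving node.
nodes = ["node-a", "node-b", "node-c"]
print(route_model("llama-7b", nodes))
```

The appeal of this scheme is that it needs no shared state: any router can compute the same assignment, and removing an interrupted node only reshuffles the models that node owned rather than the whole fleet.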