VRAM and Memory Management
When running multiple models in LocalAI, especially on systems with limited GPU memory (VRAM), loading a new model can fail because there isn’t enough available VRAM. LocalAI provides two mechanisms to automatically manage model memory allocation and prevent VRAM exhaustion:
- Max Active Backends (LRU Eviction): Limit the number of loaded models, evicting the least recently used when the limit is reached
- Watchdog Mechanisms: Automatically unload idle or stuck models based on configurable timeouts
The Problem
By default, LocalAI keeps models loaded in memory once they’re first used. This means:
- If you load a large model that uses most of your VRAM, subsequent requests for other models may fail
- Models remain in memory even when not actively being used
- There’s no automatic mechanism to unload models to make room for new ones; models must be stopped manually, for example via the web interface
This is a common issue when working with GPU-accelerated models, as VRAM is typically more limited than system RAM. For more context, see issues #6068, #7269, and #5352.
Solution 1: Max Active Backends (LRU Eviction)
LocalAI supports limiting the maximum number of active backends (loaded models) using LRU (Least Recently Used) eviction. When the limit is reached and a new model needs to be loaded, the least recently used model is automatically unloaded to make room.
Configuration
Set the maximum number of active backends using CLI flags or environment variables:
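For example, to keep at most two models loaded at a time. The --max-active-backends flag is referenced in this documentation; the LOCALAI_MAX_ACTIVE_BACKENDS environment variable name is assumed to mirror it, so verify it against your LocalAI version:

```bash
# CLI flag: allow at most two loaded models; the least recently used is evicted
local-ai run --max-active-backends=2

# Environment variable (name assumed to mirror the flag)
LOCALAI_MAX_ACTIVE_BACKENDS=2 local-ai run
```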
Setting the limit to 1 is equivalent to single active backend mode (see below); setting it to 0 disables the limit (unlimited backends).
Use cases
- Systems with limited VRAM that can handle a few models simultaneously
- Multi-model deployments where you want to keep frequently-used models loaded
- Balancing between memory usage and model reload times
- Production environments requiring predictable memory consumption
How it works
- When a model is requested, its “last used” timestamp is updated
- When a new model needs to be loaded and the limit is reached, LocalAI identifies the least recently used model(s)
- The LRU model(s) are automatically unloaded to make room for the new model
- Concurrent requests for loading different models are handled safely - the system accounts for models currently being loaded when calculating evictions
Example
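As an illustrative sketch, a GPU deployment via Docker with a limit of two active backends (the image tag and the environment variable name are assumptions to adapt to your setup):

```bash
# Keep at most two models resident in VRAM; older ones are evicted on demand
docker run -p 8080:8080 --gpus all \
  -e LOCALAI_MAX_ACTIVE_BACKENDS=2 \
  localai/localai:latest-gpu-nvidia-cuda-12
```

With this configuration, requesting a third model evicts whichever of the two loaded models was used least recently.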
Single Active Backend Mode
The simplest approach is to ensure only one model is loaded at a time. This is now implemented as --max-active-backends=1. When a new model is requested, LocalAI will automatically unload the currently active model before loading the new one.
Note: The --single-active-backend flag is deprecated but still supported for backward compatibility. It is recommended to use --max-active-backends=1 instead.
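A minimal sketch of single active backend mode:

```bash
# Only one model stays loaded; requesting another unloads the current one first
local-ai run --max-active-backends=1

# Deprecated, but still accepted for backward compatibility
local-ai run --single-active-backend
```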
Single backend use cases
- Single GPU systems with very limited VRAM
- When you only need one model active at a time
- Simple deployments where model switching is acceptable
Solution 2: Watchdog Mechanisms
For more flexible memory management, LocalAI provides watchdog mechanisms that automatically unload models based on their activity state. This allows multiple models to be loaded simultaneously, but automatically frees memory when models become inactive or stuck.
Note: Watchdog settings can be configured via the Runtime Settings web interface, which allows you to adjust settings without restarting the application.
Idle Watchdog
The idle watchdog monitors models that haven’t been used for a specified period and automatically unloads them to free VRAM.
Configuration
Via environment variables or CLI:
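For example, to unload models that have been idle for more than 15 minutes. The LOCALAI_WATCHDOG_IDLE / LOCALAI_WATCHDOG_IDLE_TIMEOUT variables and the --enable-watchdog-idle / --watchdog-idle-timeout flags are assumed here; verify the exact names against your LocalAI version:

```bash
# Environment variables (names assumed)
LOCALAI_WATCHDOG_IDLE=true LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m local-ai run

# Equivalent CLI flags (names assumed)
local-ai run --enable-watchdog-idle --watchdog-idle-timeout=15m
```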
Via web UI: Navigate to Settings → Watchdog Settings and enable “Watchdog Idle Enabled” with your desired timeout.
Busy Watchdog
The busy watchdog monitors models that have been processing requests for an unusually long time and terminates them if they exceed a threshold. This is useful for detecting and recovering from stuck or hung backends.
Configuration
Via environment variables or CLI:
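For example, to terminate a backend that has been busy on a single request for more than 5 minutes (variable and flag names assumed, as for the idle watchdog):

```bash
# Environment variables (names assumed)
LOCALAI_WATCHDOG_BUSY=true LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m local-ai run

# Equivalent CLI flags (names assumed)
local-ai run --enable-watchdog-busy --watchdog-busy-timeout=5m
```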
Via web UI: Navigate to Settings → Watchdog Settings and enable “Watchdog Busy Enabled” with your desired timeout.
Combined Configuration
You can enable both watchdogs simultaneously for comprehensive memory management:
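For example, using the environment variable names assumed above:

```bash
# Unload idle models after 15 minutes, kill backends stuck for more than 5
LOCALAI_WATCHDOG_IDLE=true \
LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m \
LOCALAI_WATCHDOG_BUSY=true \
LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \
local-ai run
```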
Or using command line flags:
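```bash
# Flag names as assumed above; adjust to your LocalAI version
local-ai run \
  --enable-watchdog-idle --watchdog-idle-timeout=15m \
  --enable-watchdog-busy --watchdog-busy-timeout=5m
```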
Use cases
- Multi-model deployments where different models may be used intermittently
- Systems where you want to keep frequently-used models loaded but free memory from unused ones
- Recovery from stuck or hung backend processes
- Production environments requiring automatic resource management
Example
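As an illustrative sketch, a Docker-based GPU deployment with both watchdogs enabled (image tag and environment variable names are the same assumptions as above):

```bash
docker run -p 8080:8080 --gpus all \
  -e LOCALAI_WATCHDOG_IDLE=true \
  -e LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m \
  -e LOCALAI_WATCHDOG_BUSY=true \
  -e LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \
  localai/localai:latest-gpu-nvidia-cuda-12
```

Here, models unused for 15 minutes are unloaded, and a backend stuck on a single request for more than 5 minutes is terminated.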
Timeout Format
Timeouts can be specified using Go’s duration format:
- 15m - 15 minutes
- 1h - 1 hour
- 30s - 30 seconds
- 2h30m - 2 hours and 30 minutes
Combining LRU and Watchdog
You can combine Max Active Backends (LRU eviction) with the watchdog mechanisms for comprehensive memory management:
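For example, matching the configuration described below (environment variable names as assumed above):

```bash
# At most 3 loaded models, and any model idle for 15 minutes is unloaded
LOCALAI_MAX_ACTIVE_BACKENDS=3 \
LOCALAI_WATCHDOG_IDLE=true \
LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m \
local-ai run
```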
Or using command line flags:
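```bash
# Watchdog flag names as assumed above; adjust to your LocalAI version
local-ai run \
  --max-active-backends=3 \
  --enable-watchdog-idle \
  --watchdog-idle-timeout=15m
```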
This configuration:
- Ensures no more than 3 models are loaded at once (LRU eviction kicks in when exceeded)
- Automatically unloads any model that hasn’t been used for 15 minutes
- Provides both hard limits and time-based cleanup
Limitations and Considerations
VRAM Usage Estimation
LocalAI cannot reliably estimate how much VRAM a new model will require before loading it, across the different backends (llama.cpp, vLLM, diffusers, etc.), because:
- Different backends report memory usage differently
- VRAM requirements vary based on model architecture, quantization, and configuration
- Some backends may not expose memory usage information before loading the model
Manual Management
If automatic management doesn’t meet your needs, you can manually stop models using the LocalAI management API:
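For example, assuming the management API exposes a backend shutdown endpoint that takes the model name in the request body (the model name below is a placeholder):

```bash
# Unload a single model by name (model name is a placeholder)
curl -X POST http://localhost:8080/backend/shutdown \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model-name"}'
```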
To stop all models, you’ll need to call the endpoint for each loaded model individually, or use the web UI to stop all models at once.
Best Practices
- Monitor VRAM usage: Use nvidia-smi (for NVIDIA GPUs) or similar tools to monitor actual VRAM usage
- Set an appropriate backend limit: For single-GPU systems, --max-active-backends=1 is often the simplest solution. For systems with more VRAM, you can increase the limit to keep more models loaded
- Combine LRU with watchdog: Use --max-active-backends to limit the number of loaded models, and enable the idle watchdog to unload models that haven’t been used recently
- Tune watchdog timeouts: Adjust timeouts based on your usage patterns; shorter timeouts free memory faster but may cause more frequent reloads
- Consider model size: Ensure your VRAM can accommodate at least one of your largest models
- Use quantization: Smaller quantized models use less VRAM and allow more flexibility
Related Documentation
- See Advanced Usage for other configuration options
- See GPU Acceleration for GPU setup and configuration
- See Backend Flags for all available backend configuration options