Multi-model caching allows each replica to serve more models than would fit into its memory by keeping only a specified number of models in memory (and on disk) at a time. When the in-memory model limit is reached, the least recently accessed model is evicted from the cache. This is useful when you have many models, of which some are accessed frequently while a larger portion are rarely used, or when you run on smaller instances to control costs.
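The eviction rule described above can be illustrated with a minimal sketch. This is a toy in-memory cache, not Cortex's implementation: the `LRUModelCache` class and the `loader` callable are hypothetical names introduced for illustration, and only the `cache_size` limit comes from the text.

```python
from collections import OrderedDict


class LRUModelCache:
    """Toy in-memory cache that keeps at most `cache_size` models (illustrative only)."""

    def __init__(self, cache_size, loader):
        self.cache_size = cache_size
        self.loader = loader  # hypothetical callable: model name -> model object
        self._models = OrderedDict()  # least recently accessed first

    def get(self, name):
        if name in self._models:
            # Cache hit: mark the model as most recently accessed.
            self._models.move_to_end(name)
            return self._models[name]
        # Cache miss: evict least recently accessed models until there is room.
        while len(self._models) >= self.cache_size:
            self._models.popitem(last=False)
        model = self.loader(name)
        self._models[name] = model
        return model

    def cached(self):
        # Model names ordered from least to most recently accessed.
        return list(self._models)


cache = LRUModelCache(cache_size=2, loader=lambda name: f"<model {name}>")
cache.get("a")
cache.get("b")
cache.get("a")  # "a" becomes the most recently accessed model
cache.get("c")  # limit reached, so "b" is evicted
print(cache.cached())  # ['a', 'c']
```

In this sketch, a model that keeps getting requests stays in memory, while rarely used models are dropped and reloaded only when they are requested again.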
The model cache is a two-layer cache, configured by the following parameters in the
cache_size sets the number of models to keep in memory
disk_cache_size sets the number of models to keep on disk (must be greater than or equal to
Both of these fields must be specified, in addition to either the
paths field (which specifies the model paths, see models for documentation). Multi-model caching is only supported if
predictor.processes_per_replica is set to 1 (the default value).
Cortex runs a background process every 10 seconds that counts the number of models in memory and on disk, and evicts the least recently used models if the count exceeds
disk_cache_size. If many new models are requested between executions of the process, there may be more models in memory and/or on disk than the configured
disk_cache_size limits, which could lead to out-of-memory errors.
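The effect of periodic eviction can be sketched with a small simulation. This is a simplified model under stated assumptions, not Cortex's code: the `TwoLayerTracker` class is hypothetical, each request is assumed to place the model on disk and in memory without evicting anything, and `sweep()` stands in for the background process.

```python
class TwoLayerTracker:
    """Tracks which models are in memory and on disk (illustrative only)."""

    def __init__(self, cache_size, disk_cache_size):
        self.cache_size = cache_size
        self.disk_cache_size = disk_cache_size
        self.in_memory = {}  # model name -> tick of last access
        self.on_disk = {}    # model name -> tick of last access
        self._tick = 0

    def request(self, name):
        # Assumption: a request loads the model right away and evicts nothing.
        self._tick += 1
        self.on_disk[name] = self._tick
        self.in_memory[name] = self._tick

    def sweep(self):
        # Stand-in for the periodic background process: trim each layer to its limit.
        self._trim(self.in_memory, self.cache_size)
        self._trim(self.on_disk, self.disk_cache_size)

    @staticmethod
    def _trim(layer, limit):
        excess = len(layer) - limit
        if excess > 0:
            # Remove the least recently used models first.
            for name in sorted(layer, key=layer.get)[:excess]:
                del layer[name]


tracker = TwoLayerTracker(cache_size=2, disk_cache_size=3)
for name in ["a", "b", "c", "d", "e"]:
    tracker.request(name)  # five new models arrive between two sweeps

print(len(tracker.in_memory), len(tracker.on_disk))  # 5 5, both limits exceeded
tracker.sweep()
print(sorted(tracker.in_memory))  # ['d', 'e']
print(sorted(tracker.on_disk))    # ['c', 'd', 'e']
```

Between sweeps, the number of loaded models depends only on the request pattern, so when sizing instances it may be worth leaving memory headroom beyond what the configured limits alone would suggest.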