To use Inferentia ASICs:
You may need to file an AWS support ticket to increase the limit for your desired instance type.
Set the instance type to an AWS Inferentia instance (e.g. inf1.xlarge) when creating your Cortex cluster.
Set the inf field in the compute configuration for your API. One unit of inf corresponds to one Inferentia ASIC with 4 NeuronCores (not the same thing as cpu) and 8GB of cache memory (not the same thing as mem). Fractional requests are not allowed.
The number of Inferentia ASICs depends on the instance type:
inf1.xlarge and inf1.2xlarge - each has 1 Inferentia ASIC
inf1.6xlarge - has 4 Inferentia ASICs
inf1.24xlarge - has 16 Inferentia ASICs
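As a quick reference, the list above can be written as a lookup table. This is only a sketch: `ASICS_PER_INSTANCE` and `total_neuron_cores` are hypothetical names, not part of Cortex, and the `inf1.xlarge` entry is taken from AWS's published instance specifications rather than from the list itself.

```python
# Hypothetical lookup table (not part of Cortex): Inferentia ASICs per instance type.
ASICS_PER_INSTANCE = {
    "inf1.xlarge": 1,
    "inf1.2xlarge": 1,
    "inf1.6xlarge": 4,
    "inf1.24xlarge": 16,
}

NEURON_CORES_PER_ASIC = 4  # each Inferentia ASIC has 4 NeuronCores


def total_neuron_cores(instance_type: str) -> int:
    """Return the number of NeuronCores on one instance of the given type."""
    return ASICS_PER_INSTANCE[instance_type] * NEURON_CORES_PER_ASIC


if __name__ == "__main__":
    for name in ASICS_PER_INSTANCE:
        print(f"{name}: {total_neuron_cores(name)} NeuronCores")
```

The maximum inf request that a single replica can be granted is bounded by the ASIC count of the chosen instance type.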
A NeuronCore Group (NCG) is a set of NeuronCores that is used to load and run a compiled model. NCGs exist to aggregate NeuronCores to improve hardware performance. Models can be shared within an NCG, but this would require the device driver to dynamically context switch between each model, which degrades performance. Therefore we've decided to only allow one model per NCG (unless you are using a multi-model endpoint, in which case there will be multiple models on a single NCG, and there will be context switching).
Each Cortex API process will have its own copy of the model and will run on its own NCG (for Realtime APIs, the number of API processes is configured by the processes_per_replica field in the API configuration). Each NCG will have an equal share of NeuronCores. Therefore, the size of each NCG will be 4 * inf / processes_per_replica (inf refers to your API's compute request, and it's multiplied by 4 because there are 4 NeuronCores per Inferentia chip).
For example, if your API requests 2 inf chips, there will be 8 NeuronCores available. If you set processes_per_replica to 1, there will be one copy of your model running on a single NCG of 8 NeuronCores. If processes_per_replica is 2, there will be two copies of your model, each running on a separate NCG of 4 NeuronCores. If processes_per_replica is 4, there will be 4 NCGs of 2 NeuronCores each, and if processes_per_replica is 8, there will be 8 NCGs of 1 NeuronCore each. In this scenario, these are the only valid values for processes_per_replica. In other words, the total number of requested NeuronCores (which equals 4 * the number of requested Inferentia chips) must be divisible by processes_per_replica.
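The sizing rule can be checked with a few lines of Python. This is a minimal sketch of the arithmetic described above, not Cortex's own validation code; the function names `ncg_size` and `valid_processes_per_replica` are hypothetical.

```python
from typing import List

NEURON_CORES_PER_CHIP = 4  # 4 NeuronCores per Inferentia chip


def ncg_size(inf: int, processes_per_replica: int) -> int:
    """Return the NeuronCores per NCG: 4 * inf / processes_per_replica."""
    total_cores = NEURON_CORES_PER_CHIP * inf
    if processes_per_replica < 1 or total_cores % processes_per_replica != 0:
        raise ValueError(
            f"{total_cores} NeuronCores cannot be split evenly "
            f"across {processes_per_replica} processes"
        )
    return total_cores // processes_per_replica


def valid_processes_per_replica(inf: int) -> List[int]:
    """List every processes_per_replica value that divides the core count."""
    total_cores = NEURON_CORES_PER_CHIP * inf
    return [p for p in range(1, total_cores + 1) if total_cores % p == 0]


if __name__ == "__main__":
    # Reproduces the example in the text: 2 inf chips, 8 NeuronCores.
    for p in valid_processes_per_replica(2):
        print(f"processes_per_replica={p}: NCG size {ncg_size(2, p)}")
```

The sketch applies only the divisibility rule stated in the text; any further hardware constraints on NCG sizes are outside its scope.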
The 8GB cache memory is shared between all 4 NeuronCores of an Inferentia chip. Therefore an NCG with 8 NeuronCores (i.e. 2 Inferentia chips) will have access to 16GB of cache memory. An NCG with 2 NeuronCores will have access to 8GB of cache memory, which will be shared with the other NCG of size 2 running on the same Inferentia chip.
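The cache arithmetic can be sketched the same way. The helper names below are hypothetical. The sketch assumes an NCG either fits inside one chip (size 1, 2, or 4) or spans whole chips (a multiple of 4), and it rejects other sizes because the text does not describe how they would map onto chips.

```python
import math

NEURON_CORES_PER_CHIP = 4
CACHE_GB_PER_CHIP = 8  # shared by all 4 NeuronCores of a chip


def _check_size(ncg_size: int) -> None:
    """Accept only sizes that divide a chip evenly or span whole chips."""
    fits_in_chip = ncg_size >= 1 and NEURON_CORES_PER_CHIP % ncg_size == 0
    spans_chips = ncg_size >= 1 and ncg_size % NEURON_CORES_PER_CHIP == 0
    if not (fits_in_chip or spans_chips):
        raise ValueError(f"unsupported NCG size for this sketch: {ncg_size}")


def ncg_cache_gb(ncg_size: int) -> int:
    """Cache memory (GB) that an NCG of this size can access."""
    _check_size(ncg_size)
    chips_spanned = math.ceil(ncg_size / NEURON_CORES_PER_CHIP)
    return chips_spanned * CACHE_GB_PER_CHIP


def ncgs_sharing_cache(ncg_size: int) -> int:
    """Number of NCGs of this size that share a single chip's cache."""
    _check_size(ncg_size)
    return max(1, NEURON_CORES_PER_CHIP // ncg_size)


if __name__ == "__main__":
    for size in (1, 2, 4, 8):
        print(size, ncg_cache_gb(size), ncgs_sharing_cache(size))
```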
Before a model can be deployed on Inferentia chips, it must be compiled for Inferentia. The Neuron compiler can be used to convert a regular TensorFlow SavedModel or PyTorch model into the hardware-specific instruction set for Inferentia. Inferentia currently supports compiled models from TensorFlow and PyTorch.
By default, the Neuron compiler compiles a model to use 1 NeuronCore, but it can be configured to target a different number of NeuronCores (1, 2, 4, etc.).
For optimal performance, your model should be compiled to run on the number of NeuronCores available to it. The number of NeuronCores will be 4 * inf / processes_per_replica (inf refers to your API's compute request, and it's multiplied by 4 because there are 4 NeuronCores per Inferentia chip). See NeuronCore Groups above for an example, and see Improving performance below for a discussion of choosing the appropriate number of NeuronCores.
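One way to keep the compiled model in step with the API configuration is to derive the compiler argument from the same two values. The sketch below uses a hypothetical helper, `neuron_compiler_args`, and only builds the `--num-neuroncores` argument list that appears in the compilation examples in this section; it does not call the compiler.

```python
from typing import List


def neuron_compiler_args(inf: int, processes_per_replica: int) -> List[str]:
    """Build compiler_args targeting 4 * inf / processes_per_replica NeuronCores."""
    total_cores = 4 * inf  # 4 NeuronCores per Inferentia chip
    if processes_per_replica < 1 or total_cores % processes_per_replica != 0:
        raise ValueError(
            "processes_per_replica must evenly divide the number of NeuronCores"
        )
    cores_per_model = total_cores // processes_per_replica
    return ["--num-neuroncores", str(cores_per_model)]


if __name__ == "__main__":
    # An API requesting inf: 2 with processes_per_replica: 2
    print(neuron_compiler_args(inf=2, processes_per_replica=2))
```

The returned list is intended to be passed as the compiler_args value when compiling a TensorFlow or PyTorch model.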
Here is an example of compiling a TensorFlow SavedModel for Inferentia:
import tensorflow.neuron as tfn

tfn.saved_model.compile(
    model_dir,
    compiled_model_dir,
    batch_size,
    compiler_args=["--num-neuroncores", "1"],
)
Here is an example of compiling a PyTorch model for Inferentia:
import torch
import torch_neuron

model.eval()
example_input = torch.zeros([batch_size] + input_shape, dtype=torch.float32)
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example_input],
    compiler_args=["--num-neuroncores", "1"],
)
model_neuron.save(compiled_model)
The versions of tensorflow-neuron and torch-neuron that are used by Cortex are found in the Realtime API pre-installed packages list and Batch API pre-installed packages list. When installing these packages with pip to compile models of your own, use the extra index URL
A few things can be done to improve performance using compiled models on Cortex:
There's a minimum number of NeuronCores for which a model can be compiled. That number depends on the model's architecture. Generally, compiling a model for more cores than its required minimum helps to distribute the model's operators across multiple cores, which in turn can lead to lower latency. However, compiling a model for more NeuronCores means that you'll have to set processes_per_replica to be lower so that the NeuronCore Group has access to the number of NeuronCores for which you compiled your model. This is acceptable if latency is your top priority, but if throughput is more important to you, this tradeoff is usually not worth it. To maximize throughput, compile your model for as few NeuronCores as possible and increase processes_per_replica to the maximum possible (see above for a sample calculation).
Try to achieve a near 100% placement of your model's graph onto the NeuronCores. During the compilation phase, any operators that can't execute on NeuronCores will be compiled to execute on the machine's CPU and memory instead. Even if just a few percent of the operations reside on the host's CPU/memory, the maximum throughput of the instance can be significantly limited.
Use the --static-weights compiler option when possible. This option tells the compiler to cache the entire model on the NeuronCores, which avoids a lot of back-and-forth between the machine's CPU/memory and the Inferentia ASICs.
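The throughput-oriented advice in the first point can be turned into a small planning calculation. This is a sketch under stated assumptions: `max_throughput_plan` is a hypothetical helper, and `min_cores_per_model` stands for the minimum number of NeuronCores your model can be compiled for, which you have to determine for your own model.

```python
from typing import Dict


def max_throughput_plan(inf: int, min_cores_per_model: int) -> Dict[str, int]:
    """Pick the smallest usable NCG size and the matching processes_per_replica.

    The NCG size must be at least min_cores_per_model and must evenly
    divide the total number of NeuronCores (4 * inf).
    """
    if inf < 1 or min_cores_per_model < 1:
        raise ValueError("inf and min_cores_per_model must be positive")
    total_cores = 4 * inf
    for size in range(min_cores_per_model, total_cores + 1):
        if total_cores % size == 0:
            return {
                "num_neuroncores": size,
                "processes_per_replica": total_cores // size,
            }
    raise ValueError("the model needs more NeuronCores than the API requests")


if __name__ == "__main__":
    print(max_throughput_plan(inf=2, min_cores_per_model=1))
    print(max_throughput_plan(inf=2, min_cores_per_model=3))
```

For a latency-oriented deployment the same loop could be run in the opposite direction, starting from the total core count, at the cost of fewer model copies per replica.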