As you take Cortex from development to production, here are a few pointers that might be useful.
Configure your cluster and APIs to use images from ECR in the same region as your cluster to accelerate scale-ups, reduce ingress costs, and remove the dependency on Cortex's public quay.io registry.
You can find instructions for mirroring Cortex images here.
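After mirroring, your cluster configuration can point Cortex at your ECR copies. A sketch, assuming a mirror in us-east-1 — the account ID, region, and the exact set of image override fields should be taken from the mirroring instructions:

```yaml
# illustrative: override Cortex's default quay.io images with your ECR mirrors
image_operator: 123456789012.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/operator:<cortex version>
image_manager: 123456789012.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/manager:<cortex version>
```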
Use a Route 53 hosted zone as a proxy in front of your Cortex cluster. Every new Cortex cluster provisions a new API load balancer with a unique endpoint. Using a Route 53 hosted zone configured with a subdomain will expose your Cortex cluster's API endpoint as a static endpoint (e.g. cortex.your-company.com). You will be able to upgrade Cortex versions without downtime, and you will avoid the need to update your client code every time you migrate to a new cluster. You can find instructions for setting up a custom domain with a Route 53 hosted zone here, and instructions for updating/upgrading your cluster here.
The following configuration will improve security by preventing your cluster's nodes from being publicly accessible:

```yaml
subnet_visibility: private
nat_gateway: single  # use "highly_available" for large clusters making requests to services outside of the cluster
```
You can make your API load balancer private (api_load_balancer_scheme: internal) to prevent your APIs from being publicly accessible. In order to access your APIs, you will need to set up VPC peering between the Cortex cluster's VPC and the VPC containing the consumers of the Cortex APIs. See the VPC peering guide for more details.
You can also restrict access to your load balancers by IP address using the api_load_balancer_cidr_white_list field.
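For example, a sketch of a private API load balancer restricted to specific networks — the CIDR ranges below are placeholders for your own networks:

```yaml
# illustrative: make the API load balancer private and whitelist specific networks
api_load_balancer_scheme: internal
api_load_balancer_cidr_white_list: [10.0.0.0/16, 203.0.113.0/24]
```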
These two fields are also available for the operator load balancer. Keep in mind that if you make the operator load balancer private, you'll need to configure VPC peering to use the cortex CLI or Python client:

```yaml
operator_load_balancer_scheme: internal
operator_load_balancer_cidr_white_list: [0.0.0.0/0]
```
See here for more information about the load balancers.
You can take advantage of the cost savings of spot instances and the reliability of on-demand instances by utilizing the priority field in node groups. Deploy two node groups: one spot and one on-demand. Set the priority of the spot node group higher than the priority of the on-demand node group. This encourages the cluster autoscaler to spin up instances from the spot node group first; if no more spot instances are available, the on-demand node group will be used instead.
```yaml
node_groups:
  - name: gpu-spot
    instance_type: g4dn.xlarge
    min_instances: 0
    max_instances: 5
    spot: true
    priority: 100
  - name: gpu-on-demand
    instance_type: g4dn.xlarge
    min_instances: 0
    max_instances: 5
    priority: 1
```
If you plan on scaling your Cortex cluster past 300 nodes or 300 pods, it is recommended to set prometheus_instance_type to an instance type with more memory (the default is t3.medium, which has 4GB).
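For a large cluster, the tweak might look like this; the instance type below is only an example, any instance type with more memory works:

```yaml
# illustrative: give Prometheus more headroom on large clusters
prometheus_instance_type: t3.xlarge
```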
Configure your health checks to be as accurate as possible to prevent requests from being routed to pods that aren't ready to handle traffic.
Make sure that max_concurrency is set to match the concurrency supported by your container.
Set max_queue_length to a lower value if you would like to redistribute requests to newer pods more aggressively as your API scales up, rather than allowing requests to linger in queues. In that case, the clients consuming your APIs should implement retry logic with a delay (such as exponential backoff).
Make sure to specify all of the relevant compute resources (especially cpu and memory) to ensure that your pods aren't starved for resources.
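The pointers above map onto the API spec roughly as follows. This is a sketch, not a complete spec — the API name, image, health check path and port, and resource values are all placeholders, and field availability depends on your Cortex version:

```yaml
# illustrative API spec fragment combining the settings discussed above
- name: my-api              # hypothetical API name
  kind: RealtimeAPI
  pod:
    max_concurrency: 8      # match the concurrency your container actually supports
    max_queue_length: 16    # keep queues short so requests are redistributed quickly
    containers:
      - name: api
        image: <your image>
        readiness_probe:
          http_get:
            path: /healthz  # placeholder; use your container's real health endpoint
            port: 8080
        compute:            # example values; size these to your workload
          cpu: 1
          mem: 2G
```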