Managing node groups
You should rarely need to set the minimum and maximum values for a node group. If you find you need to adjust these values often, it could be an indication of another issue that needs to be addressed. Please reach out to our support team if you have any questions or concerns.
You cannot change the instance type of an existing node group, only the number of instances. To change the instance type, you'll need to create a new node group and delete the old one during a maintenance window. Our support team can assist you with any questions or concerns.
Once you have created a cluster, you can view the cluster info. You can also view the associated node groups below the cluster details.
Click the Edit button to show the edit dialog. Here you can adjust the minimum, maximum, and desired node size for this node group.
Click Update Node Group to save your changes.
Keep in mind that the "Desired Nodes" value is only set initially and may change later, for example, as workloads are removed or deployed. The cluster will automatically adjust this number according to the workload present during a scheduler cycle. You may wish to increase the number ahead of a large new deployment, or whenever you know more capacity will be needed. In general, your desired value will be closer to the minimum number than the maximum number for most workloads.
To find out what value to use for "Desired Nodes", check the existing number of nodes you have running in your cluster and set it to the same value (or higher). To enforce the "Desired Nodes" value as the baseline, set "Minimum Size" to be the same as "Desired Nodes". To enforce that the "Desired Nodes" value is never exceeded, set "Maximum Size" to be the same as "Desired Nodes". Get in touch with our support team if you have any questions or concerns.
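If you have kubectl access to the cluster, the commands below are a quick way to check the current node count and see which instance types your nodes are running (the label shown is the standard Kubernetes instance-type label):

```bash
# Count the nodes currently registered in the cluster.
kubectl get nodes --no-headers | wc -l

# List nodes along with their instance type to see which node group they belong to.
kubectl get nodes -L node.kubernetes.io/instance-type
```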
Setting "Desired Nodes" and/or "Maximum Size" to a value less than the number of nodes you are currently running could cause an outage. For example, if you are currently running five nodes in your node group and you set either "Desired Nodes" or "Maximum Size" to four, you could cause some downtime in your application by forcing one node to be terminated. You should always schedule a maintenance window for this operation if you have workloads that end users will rely on.
Setting the minimum and maximum sizes to 0 will scale your node group down and could cause an outage.
To create a new node group for a cluster, click the Add Node Group button on the Cluster Info page.
This will open the Add Node Group dialog.
The node group you create here will be associated to the current cluster.
Although the minimum and maximum size values are useful for estimating how your workloads will be distributed, keep in mind that having two or more node groups with different instance types does not guarantee your workloads will scale onto the correct resources without additional configuration. The Release support team can provide more information about this.
Setting "Maximum Size" to 0 will not allow any pods to run on the new node group, effectively disabling it.
In very rare cases (for example, Windows workloads, machine learning instances, and GPU-enabled instances), you can create a node group with a "Minimum Size" value of 0. In such a case, no new nodes will be created until the appropriate pods are scheduled and the workloads are running. This is called "scale up and down from zero" and typically requires extra configuration to enable. The most common way to scale from zero is to use node selectors.
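As a rough illustration of node selectors, the sketch below schedules a pod onto a GPU node group by selecting on the standard instance-type label. The instance type, image tag, and GPU resource request are placeholders: scale-from-zero also depends on your autoscaler configuration, and GPU scheduling requires the NVIDIA device plugin on the node.

```bash
# Illustrative only: a pod pinned to a GPU node group via a nodeSelector.
# Adjust the instance type and image to match your own node group and workload.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    node.kubernetes.io/instance-type: p3.2xlarge
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # requires the NVIDIA device plugin on the node
EOF
```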
We recommend having at least three nodes of any type available, even for an empty cluster, since there are service-level pods that need to run for even small workloads in the cluster. In rare cases where you are running very small workloads, you can set the minimum to two. We do not recommend having fewer than two nodes available in the cluster.
Volume Size:
This setting determines the amount of disk space assigned to each node. Depending on your cloud provider, this volume is probably a block device attached across the network rather than a disk local to the instance, but for most intents and purposes it is indistinguishable from a local disk. The volume is used by the node itself for any local operations that require disk space to support Kubernetes; in general, it is not used by your workloads. The most likely reason to increase the volume size is to host a large Docker image cache. Beyond that, Kubernetes nodes rarely need much disk space, and you can tune the value above or below 100GB to suit your needs.
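If you are unsure whether the current volume size is sufficient, the commands below (assuming kubectl access) show the disk capacity each node reports to Kubernetes and whether any node is under disk pressure, which is often a sign the image cache has outgrown the volume:

```bash
# Show each node's reported disk capacity and allocatable ephemeral storage,
# plus whether the DiskPressure condition is currently set.
kubectl describe nodes | grep -E "^Name:|ephemeral-storage|DiskPressure"
```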
Additional (comma-separated) policy identifiers you want to add to the default node group IAM role. This feature is additive: the policies you add here grant additional default access to all workloads running on this node group. For example, you may want all of your workloads to be able to read from an SQS queue or a DynamoDB table (or another cloud service) without explicitly adding a service account and role.
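For instance, a minimal sketch of such an additive policy might look like the following; the policy name, queue, table, region, and account ID are all placeholders. The ARN returned by the create-policy call is what you would paste into the additional policies field.

```bash
# Hypothetical additive policy: every workload on the node group can read from one
# SQS queue and one DynamoDB table. Resource ARNs below are placeholders.
cat > nodegroup-extra-access.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["sqs:ReceiveMessage", "sqs:GetQueueAttributes"],
      "Resource": "arn:aws:sqs:us-east-1:123456789012:my-queue"
    },
    {
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:Query"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/my-table"
    }
  ]
}
EOF

# Create the policy and note the ARN that is returned.
aws iam create-policy \
  --policy-name nodegroup-extra-access \
  --policy-document file://nodegroup-extra-access.json
```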
Policies can also block access to services with an explicit deny, which takes effect "last". This is similar to, but not exactly the same as, a permissions boundary (see the next section). We recommend sticking to additive permissions in these policies and removing or filtering explicitly allowed permissions with the permissions boundary.
This feature is currently only implemented for AWS; GCP support is coming soon.
Permissions Boundary (AWS):
You can read more about permissions boundaries in our write-up on cloud integration permissions here.
This setting allows you to apply a permissions boundary to the nodes, which restricts access to certain cloud functions by default for workloads running on the nodes. A permissions boundary acts as a final seal of approval for which permissions are allowed to pass through from the worker node to your AWS IAM control plane. The following is a list of the default permissions allowed for worker nodes in AWS as of approximately March 2021:
| Purpose | Permissions | Notes |
| --- | --- | --- |
| Default EKS requirements | AmazonEKSWorkerNodePolicy, AmazonEC2ContainerRegistryReadOnly, AmazonEKS_CNI_Policy | Basic work that an EKS node performs (do not change) |
| Storage | s3:* | Push and pull files from S3; can be scoped to specific buckets/paths |
| KMS secrets | kms:key/* | Decrypt/encrypt secrets for the EKS cluster; can be scoped to specific keys |
| SSM | ssm:(various), ssmmessages:(various) | Allow the SSM agent to run (optional, but recommended) |
| Vault secrets | dynamodb:table/vault-* | DynamoDB backing for Vault secrets (optional; not all customers use this) |
| Various EC2 permissions | loadbalancer:*; ec2:tags, ec2:describe* | Permissions to tag resources and add ingress load balancers (this may be deprecated soon) |
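If you want to review what is actually attached in your AWS account, the commands below (assuming the AWS CLI is configured; the role name and policy ARN are placeholders) list the managed policies on the node group's IAM role and dump the contents of one of them:

```bash
# List the managed policies attached to the node group's IAM role.
aws iam list-attached-role-policies --role-name <nodegroup-role-name>

# Dump the default version of one of those policies to review its permissions.
aws iam get-policy-version \
  --policy-arn <policy-arn> \
  --version-id "$(aws iam get-policy --policy-arn <policy-arn> \
      --query 'Policy.DefaultVersionId' --output text)"
```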
The above settings can be scoped down to fit your security profile and needs. As one example, if you need to restrict which S3 buckets can be accessed, you could follow these steps to add a permissions boundary:

1. Locate and copy the existing permissions from the IAM policy named release-nodegroup-policy-<suffix> (the suffix will be different for each node group and cluster).
2. Create a new IAM policy based on all of the policy contents from step 1.
3. Restrict any access you are uncomfortable with (for example, substitute s3:my_safe_bucket* for the s3 permissions; see the sketch after these steps). Remember that cutting too many permissions too closely may cause things to break, so make sure you test these changes before they affect any live workloads.

   Also remember that the permissions boundary is positive: if you add a specific permission it will be allowed, but if a permission is missing from the permissions boundary, it will be filtered out and blocked. This means you cannot add a negative condition (like "all S3 buckets except..."); you must add a positive condition (like "these specific buckets and prefixes"). See the AWS documentation for more examples and resources.

4. Save your new policy and copy its ARN for use when creating the node group.
5. Make sure you test it!
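Continuing the S3 example from step 3, a minimal sketch of what a scoped-down statement in the boundary policy might look like is shown below. The bucket name is a placeholder, and in practice you would merge this statement with the rest of the permissions copied from release-nodegroup-policy-<suffix> rather than using it alone.

```bash
# Hypothetical boundary policy fragment: allow S3 actions only against one bucket.
cat > nodegroup-boundary.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ScopedS3",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my_safe_bucket",
        "arn:aws:s3:::my_safe_bucket/*"
      ]
    }
  ]
}
EOF

# Create the boundary policy and copy the ARN it returns for use when creating
# the node group.
aws iam create-policy \
  --policy-name release-nodegroup-boundary \
  --policy-document file://nodegroup-boundary.json
```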
DANGER: deleting a node group could cause downtime for running pods. You should always schedule node group deletions during off-hours or inside a maintenance window.
After adding a new node group, you may want to delete the old node group. This can be to save on resources by removing old nodes, or it can help freshen the lifecycle of your instances. Either way, you will want to schedule a maintenance window with your customers and then click the Delete button to delete the node group.
Release will perform a cordon-and-drain operation to smoothly move pods from the old node group onto the newer available node group(s). You should ensure you have adequate resources on the new node group(s) to accommodate the pods being moved over. If you have any questions or concerns, feel free to reach out to our support team. After approximately 10-30 minutes, your old node group should be deleted and the new node group(s) will take over.
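Release runs this cordon-and-drain for you; for reference, the commands below are roughly the manual equivalent, followed by a way to confirm where pods landed (node names are placeholders):

```bash
# Rough manual equivalent of a cordon-and-drain (Release performs this for you).
kubectl cordon <old-node-name>
kubectl drain <old-node-name> --ignore-daemonsets --delete-emptydir-data

# Confirm that pods have been rescheduled onto nodes in the remaining node group(s).
kubectl get pods --all-namespaces -o wide
```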
While deleting and draining nodes should not cause an outage on a scaled production cluster, many customers have no fault resilience or extra replicas set up on a pre-production cluster. This means that this operation could cause downtime and needs to be scheduled when it will impact your end users the least. Even some production environments have a single point of failure for non-essential services, so you will need to ensure your application can survive and repair itself if any non-resilient service goes down temporarily.
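If you want basic drain resilience for a service before scheduling this kind of operation, a minimal sketch is shown below, assuming a Deployment named my-service labeled app=my-service (both names are placeholders): run at least two replicas and add a PodDisruptionBudget so at least one pod stays available while nodes are drained.

```bash
# Keep at least two replicas of the (hypothetical) my-service Deployment running.
kubectl scale deployment my-service --replicas=2

# Add a PodDisruptionBudget so drains leave at least one my-service pod available.
cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-service
EOF
```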
Release always recommends you schedule a maintenance window with your customers when performing any action that could impact the performance of your application or stability of your site.