Tasks
- 1: Workload management
- 1.1: Deploy test workload
- 1.2: Add an external load balancer
- 1.3: Add an ingress controller
- 1.4: Secure connectivity with CNI and Network Policy
- 2: Cluster management
- 2.1: Cluster management overview
- 2.2: Scale cluster
- 2.2.1: Scale Bare Metal cluster
- 2.2.2: Scale CloudStack cluster
- 2.2.3: Scale Nutanix cluster
- 2.2.4: Scale vSphere cluster
- 2.3: Upgrade cluster
- 2.4: Etcd Backup and Restore
- 2.5: Verify cluster
- 2.6: Add cluster integrations
- 2.7: Reboot nodes
- 2.8: Connect cluster to console
- 2.9: License cluster
- 2.10: Multus CNI plugin configuration
- 2.11: Authenticate cluster with AWS IAM Authenticator
- 2.12: Manage cluster with GitOps
- 2.13: Manage cluster with Terraform
- 2.14: Delete cluster
- 3: Cluster troubleshooting
- 3.1: Troubleshooting
- 3.2: Generating a Support Bundle
- 4: EKS Anywhere curated package management
- 4.1: Package Prerequisites
- 4.2: Curated Packages Troubleshooting
- 4.3: Cert-Manager
- 4.4: Cluster Autoscaler
- 4.5: Metrics Server
- 4.6: AWS Distro for OpenTelemetry (ADOT)
- 4.7: Prometheus
- 4.8: Emissary Ingress
- 4.9: Harbor
- 4.10: MetalLB
1 - Workload management
1.1 - Deploy test workload
We’ve created a simple test application for you to verify your cluster is working properly. You can deploy it with the following command:
kubectl apply -f "https://anywhere.eks.amazonaws.com/manifests/hello-eks-a.yaml"
To see the new pod running in your cluster, type:
kubectl get pods -l app=hello-eks-a
Example output:
NAME READY STATUS RESTARTS AGE
hello-eks-a-745bfcd586-6zx6b 1/1 Running 0 22m
To check the logs of the container to make sure it started successfully, type:
kubectl logs -l app=hello-eks-a
There is also a default web page being served from the container. You can forward the deployment port to your local machine with:
kubectl port-forward deploy/hello-eks-a 8000:80
Now you should be able to open your browser or use curl to http://localhost:8000 to view the example application's page.
curl localhost:8000
Example output:
⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢
Thank you for using
███████╗██╗ ██╗███████╗
██╔════╝██║ ██╔╝██╔════╝
█████╗ █████╔╝ ███████╗
██╔══╝ ██╔═██╗ ╚════██║
███████╗██║ ██╗███████║
╚══════╝╚═╝ ╚═╝╚══════╝
█████╗ ███╗ ██╗██╗ ██╗██╗ ██╗██╗ ██╗███████╗██████╗ ███████╗
██╔══██╗████╗ ██║╚██╗ ██╔╝██║ ██║██║ ██║██╔════╝██╔══██╗██╔════╝
███████║██╔██╗ ██║ ╚████╔╝ ██║ █╗ ██║███████║█████╗ ██████╔╝█████╗
██╔══██║██║╚██╗██║ ╚██╔╝ ██║███╗██║██╔══██║██╔══╝ ██╔══██╗██╔══╝
██║ ██║██║ ╚████║ ██║ ╚███╔███╔╝██║ ██║███████╗██║ ██║███████╗
╚═╝ ╚═╝╚═╝ ╚═══╝ ╚═╝ ╚══╝╚══╝ ╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝╚══════╝
You have successfully deployed the hello-eks-a pod hello-eks-a-c5b9bc9d8-qp6bg
For more information check out
https://anywhere.eks.amazonaws.com
⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢
If you would like to expose your applications with an external load balancer or an ingress controller, you can follow the steps in Add an external load balancer or Add an ingress controller.
1.2 - Add an external load balancer
While you are free to use any load balancer you like with your EKS Anywhere cluster, AWS currently only supports the MetalLB load balancer. For information on how to configure a MetalLB curated package for EKS Anywhere, see the Add MetalLB page.
1.3 - Add an ingress controller
While you are free to use any Ingress Controller you like with your EKS Anywhere cluster, AWS currently only supports Emissary Ingress. For information on how to configure an Emissary Ingress curated package for EKS Anywhere, see the Add Emissary Ingress page.
Setting up Emissary-ingress for Ingress Controller
-
Deploy the Hello EKS Anywhere test application.
kubectl apply -f "https://anywhere.eks.amazonaws.com/manifests/hello-eks-a.yaml"
-
Set up a load balancer: Set up MetalLB Load Balancer by following the instructions here
-
Install Emissary Ingress: Follow the instructions here Add Emissary Ingress
-
Create Emissary Listeners on your cluster (This is a one time setup).
kubectl apply -f - <<EOF
---
apiVersion: getambassador.io/v3alpha1
kind: Listener
metadata:
  name: http-listener
  namespace: default
spec:
  port: 8080
  protocol: HTTPS
  securityModel: XFP
  hostBinding:
    namespace:
      from: ALL
---
apiVersion: getambassador.io/v3alpha1
kind: Listener
metadata:
  name: https-listener
  namespace: default
spec:
  port: 8443
  protocol: HTTPS
  securityModel: XFP
  hostBinding:
    namespace:
      from: ALL
EOF
-
Create a Mapping on your cluster. This Mapping tells Emissary-ingress to route all traffic inbound to the /backend/ path to the Hello EKS Anywhere Service. The hostname value is the IP address of the LoadBalancer resource that MetalLB deployed for you.
kubectl apply -f - <<EOF
---
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: hello-backend
spec:
  prefix: /backend/
  service: hello-eks-a
  hostname: "195.16.99.65"
EOF
-
Store the Emissary-ingress load balancer IP address to a local environment variable. You will use this variable to test accessing your service.
export EMISSARY_LB_ENDPOINT=$(kubectl get svc ambassador -o "go-template={{range .status.loadBalancer.ingress}}{{or .ip .hostname}}{{end}}")
-
Test the configuration by accessing the service through the Emissary-ingress load balancer.
curl -Lk http://$EMISSARY_LB_ENDPOINT/backend/
NOTE: URL base path will need to match what is specified in the prefix exactly, including the trailing ‘/’
You should see something like this in the output:
⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢
Thank you for using
███████╗██╗ ██╗███████╗
██╔════╝██║ ██╔╝██╔════╝
█████╗ █████╔╝ ███████╗
██╔══╝ ██╔═██╗ ╚════██║
███████╗██║ ██╗███████║
╚══════╝╚═╝ ╚═╝╚══════╝
█████╗ ███╗ ██╗██╗ ██╗██╗ ██╗██╗ ██╗███████╗██████╗ ███████╗
██╔══██╗████╗ ██║╚██╗ ██╔╝██║ ██║██║ ██║██╔════╝██╔══██╗██╔════╝
███████║██╔██╗ ██║ ╚████╔╝ ██║ █╗ ██║███████║█████╗ ██████╔╝█████╗
██╔══██║██║╚██╗██║ ╚██╔╝ ██║███╗██║██╔══██║██╔══╝ ██╔══██╗██╔══╝
██║ ██║██║ ╚████║ ██║ ╚███╔███╔╝██║ ██║███████╗██║ ██║███████╗
╚═╝ ╚═╝╚═╝ ╚═══╝ ╚═╝ ╚══╝╚══╝ ╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝╚══════╝
You have successfully deployed the hello-eks-a pod hello-eks-a-c5b9bc9d8-fx2fr
For more information check out
https://anywhere.eks.amazonaws.com
⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢
1.4 - Secure connectivity with CNI and Network Policy
EKS Anywhere uses Cilium for pod networking and security.
Cilium is installed by default as a Kubernetes CNI plugin and so is already running in your EKS Anywhere cluster.
This section provides information about:
-
Understanding Cilium components and requirements.
-
Validating your Cilium networking setup.
-
Using Cilium to secure workload connectivity with Kubernetes Network Policy.
Cilium Components
The primary Cilium Agent runs as a DaemonSet on each Kubernetes node. Each cluster also includes a Cilium Operator Deployment to handle certain cluster-wide operations. For EKS Anywhere, Cilium is configured to use the Kubernetes API server as the identity store, so no etcd cluster connectivity is required.
In a properly working environment, each Kubernetes node should have a Cilium Agent pod (cilium-WXYZ) in “Running” and ready (1/1) state.
By default there will be two Cilium Operator pods (cilium-operator-123456-WXYZ) in “Running” and ready (1/1) state on different Kubernetes nodes for high availability.
Run the following command to ensure all Cilium-related pods are in a healthy state.
kubectl get pods -n kube-system | grep cilium
Example output for this command in a 3 node environment is:
cilium-fsjmd 1/1 Running 0 4m
cilium-nqpkv 1/1 Running 0 4m
cilium-operator-58ff67b8cd-jd7rf 1/1 Running 0 4m
cilium-operator-58ff67b8cd-kn6ss 1/1 Running 0 4m
cilium-zz4mt 1/1 Running 0 4m
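The health check above can also be scripted. The following is a minimal sketch, not part of EKS Anywhere: the function name cilium_unhealthy is hypothetical, and it assumes the default kubectl get pods column layout, in which the pod name immediately precedes the READY column.

```shell
# Hypothetical helper (not part of EKS Anywhere).
# Reads `kubectl get pods` output on stdin, prints the name of every Cilium pod
# that is not Running or not fully ready, and returns non-zero if any is found.
cilium_unhealthy() {
  awk '
    /cilium/ {
      for (i = 2; i < NF; i++) {
        if ($i ~ /^[0-9]+\/[0-9]+$/) {          # READY column, e.g. 1/1
          split($i, ready, "/")
          if (ready[1] != ready[2] || $(i + 1) != "Running") {
            print $(i - 1)                       # pod name precedes READY
            bad = 1
          }
          break
        }
      }
    }
    END { exit bad }
  '
}
```

One possible use is `kubectl get pods -n kube-system | cilium_unhealthy && echo "all Cilium pods healthy"`.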
Network Connectivity Requirements
To provide pod connectivity within an on-premises environment, the Cilium agent implements an overlay network using the GENEVE tunneling protocol. As a result, UDP port 6081 connectivity MUST be allowed by any firewall running between Kubernetes nodes running the Cilium agent.
Allowing ICMP Ping (type = 8, code = 0) as well as TCP port 4240 is also recommended in order for Cilium Agents to validate node-to-node connectivity as part of internal status reporting.
Validating Connectivity
Cilium includes a connectivity check YAML that can be deployed into a test namespace in order to validate proper installation and connectivity within a Kubernetes cluster. If the connectivity check passes, all pods created by the YAML manifest will reach “Running” and ready (1/1) state. We recommend running this test only once you have multiple worker nodes in your environment to ensure you are validating cross-node connectivity.
It is important that this test is run in a dedicated namespace, with no existing network policy. For example:
kubectl create ns cilium-test
kubectl apply -n cilium-test -f https://docs.isovalent.com/v1.10/public/connectivity-check-eksa.yaml
Once all pods have started, simply checking the status of pods in this namespace will indicate whether the tests have passed:
kubectl get pods -n cilium-test
Successful test output will show all pods in a “Running” and ready (1/1) state:
NAME READY STATUS RESTARTS AGE
echo-a-d576c5f8b-zlfsk 1/1 Running 0 59s
echo-b-787dc99778-sxlcc 1/1 Running 0 59s
echo-b-host-675cd8cfff-qvvv8 1/1 Running 0 59s
host-to-b-multi-node-clusterip-6fd884bcf7-pvj5d 1/1 Running 0 58s
host-to-b-multi-node-headless-79f7df47b9-8mzbp 1/1 Running 0 58s
pod-to-a-57695cc7ff-6tqpv 1/1 Running 0 59s
pod-to-a-allowed-cnp-7b6d5ff99f-4rhrs 1/1 Running 0 59s
pod-to-a-denied-cnp-6887b57579-zbs2t 1/1 Running 0 59s
pod-to-b-intra-node-hostport-7d656d7bb9-6zjrl 1/1 Running 0 57s
pod-to-b-intra-node-nodeport-569d7c647-76gn5 1/1 Running 0 58s
pod-to-b-multi-node-clusterip-fdf45bbbc-8l4zz 1/1 Running 0 59s
pod-to-b-multi-node-headless-64b6cbdd49-9hcqg 1/1 Running 0 59s
pod-to-b-multi-node-hostport-57fc8854f5-9d8m8 1/1 Running 0 58s
pod-to-b-multi-node-nodeport-54446bdbb9-5xhfd 1/1 Running 0 58s
pod-to-external-1111-56548587dc-rmj9f 1/1 Running 0 59s
pod-to-external-fqdn-allow-google-cnp-5ff4986c89-z4h9j 1/1 Running 0 59s
Afterward, simply delete the namespace to clean up the connectivity test:
kubectl delete ns cilium-test
Kubernetes Network Policy
By default, all Kubernetes workloads within a cluster can talk to any other workloads in the cluster, as well as any workloads outside the cluster. To enable a stronger security posture, Cilium implements the Kubernetes Network Policy specification to provide identity-aware firewalling / segmentation of Kubernetes workloads.
Network policies are defined as Kubernetes YAML specifications that are applied to a particular namespace to describe which connections should be allowed to or from a given set of pods. These network policies are “identity-aware” in that they describe workloads within the cluster using Kubernetes metadata like namespace and labels, rather than by IP Address.
Basic network policies are validated as part of the above Cilium connectivity check test.
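As an illustration of such a label-based policy, the sketch below prints a standard Kubernetes NetworkPolicy that admits ingress to the hello-eks-a pods only from pods carrying a given label. The policy name, the role=client label, and the function name are assumptions made for this example; port 80 is taken from the port-forward command used for the test workload.

```shell
# Hypothetical helper: prints an example NetworkPolicy manifest.
# The policy name and the "role: client" label are illustrative assumptions.
hello_eks_a_policy() {
  cat <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: hello-eks-a-allow-clients
spec:
  podSelector:
    matchLabels:
      app: hello-eks-a        # pods the policy protects
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: client    # only pods with this label may connect
      ports:
        - protocol: TCP
          port: 80
EOF
}
```

To try it, pipe the manifest to kubectl in the namespace where the workload runs: `hello_eks_a_policy | kubectl apply -f -`. Once a policy selects a pod, ingress traffic that no policy allows is dropped, so apply it in a test namespace first.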
For next steps on leveraging Network Policy, we encourage you to explore:
-
A hands-on Network Policy Intro Tutorial.
-
The visual Network Policy Editor.
-
The #networkpolicy channel on Cilium Slack.
-
Other resources on networkpolicy.io.
Additional Cilium Features
Many advanced features of Cilium are not yet enabled as part of EKS Anywhere, including: Hubble observability, DNS-aware and HTTP-Aware Network Policy, Multi-cluster Routing, Transparent Encryption, and Advanced Load-balancing.
Please contact the EKS Anywhere team if you are interested in leveraging these advanced features along with EKS Anywhere.
2 - Cluster management
2.1 - Cluster management overview
This page describes the tools and interfaces available to an administrator after an EKS Anywhere cluster is up and running. It also describes which administrative actions are done:
- Directly in Kubernetes itself (such as adding nodes with kubectl).
- Through the EKS Anywhere API (such as deleting a cluster with eksctl).
- Through tools which interface with the Kubernetes API (such as managing a cluster with terraform).
Note that direct changes to OVAs before nodes are deployed are not yet supported. However, we are working on a solution for that issue.
2.2 - Scale cluster
2.2.1 - Scale Bare Metal cluster
Scaling nodes on Bare Metal clusters
When you are horizontally scaling your Bare Metal EKS Anywhere cluster, consider the number of nodes you need for your control plane and for your data plane.
See the Kubernetes Components documentation to learn the differences between the control plane and the data plane (worker nodes).
Horizontally scaling the cluster is done by increasing the number for the control plane or worker node groups under the Cluster specification.
NOTE: If etcd is running on your control plane (the default configuration) you should scale your control plane in odd numbers (3, 5, 7…).
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: test-cluster
spec:
  controlPlaneConfiguration:
    count: 1 # increase this number to horizontally scale your control plane
  ...
  workerNodeGroupConfigurations:
  - count: 1 # increase this number to horizontally scale your data plane
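The note above about odd control plane counts follows from etcd's majority quorum: a cluster of N members needs N/2 + 1 (integer division) members to agree, so an even count adds a node without adding failure tolerance. The helper names below are hypothetical, and the sketch only does the arithmetic.

```shell
# Hypothetical helpers illustrating majority-quorum arithmetic for etcd.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))          # members required for a majority
}

etcd_fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))        # members that can fail while keeping quorum
}

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
```

Three and four members both tolerate a single failure, which is why stepping from 3 to 5 is the meaningful scale-up.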
Next, you must ensure you have enough available hardware for the scale-up operation to function. Available hardware could have been fed to the cluster as extra hardware during a prior create command, or could be fed to the cluster during the scale-up process by providing the hardware CSV file to the upgrade cluster command (explained in detail below). For a scale-down operation, you can skip directly to the upgrade cluster command.
To check if you have enough available hardware for scale up, you can use the kubectl command below to check whether there are hardware objects with the selector labels corresponding to the controlplane/worker node group and without the ownerName label.
kubectl get hardware -n eksa-system --show-labels
For example, if you want to scale a worker node group with selector label type=worker-group-1, then you must have an additional hardware object in your cluster with the label type=worker-group-1 that doesn’t have the ownerName label.
In the command output shown below, eksa-worker2 matches the selector label and it doesn’t have the ownerName label. Thus, it can be used to scale up worker-group-1 by 1.
kubectl get hardware -n eksa-system --show-labels
NAME STATE LABELS
eksa-controlplane type=controlplane,v1alpha1.tinkerbell.org/ownerName=abhnvp-control-plane-template-1656427179688-9rm5f,v1alpha1.tinkerbell.org/ownerNamespace=eksa-system
eksa-worker1 type=worker-group-1,v1alpha1.tinkerbell.org/ownerName=abhnvp-md-0-1656427179689-9fqnx,v1alpha1.tinkerbell.org/ownerNamespace=eksa-system
eksa-worker2 type=worker-group-1
If you don’t have any available hardware that matches this requirement in the cluster, you can set up a new hardware CSV. You can feed this hardware inventory file during the upgrade cluster command.
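The label check described above can be automated. This is a rough sketch under stated assumptions: spare_hardware is a hypothetical function name, the input is the output of kubectl get hardware -n eksa-system --show-labels including its header row, and the labels are the last column on each line.

```shell
# Hypothetical helper (not part of EKS Anywhere).
# Reads `kubectl get hardware -n eksa-system --show-labels` output on stdin and
# prints hardware that carries the selector label but has no ownerName label.
spare_hardware() {
  awk -v sel="$1" '
    NR > 1 {                                   # skip the header row
      n = split($NF, labels, ",")              # labels are the last column
      match_sel = 0; owned = 0
      for (i = 1; i <= n; i++) {
        if (labels[i] == sel) match_sel = 1
        if (labels[i] ~ /\/ownerName=/) owned = 1
      }
      if (match_sel && !owned) print $1
    }
  '
}
```

For the example output above, `kubectl get hardware -n eksa-system --show-labels | spare_hardware type=worker-group-1` would print eksa-worker2.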
Upgrade Cluster Command for Scale Up/Down
With Hardware CSV File
eksctl anywhere upgrade cluster -f cluster.yaml --hardware-csv <hardware.csv>
Without Hardware CSV File
eksctl anywhere upgrade cluster -f cluster.yaml
Autoscaling
EKS Anywhere supports autoscaling of worker node groups using the Kubernetes Cluster Autoscaler, which is available as a curated package.
See here for details on how to configure your cluster spec to autoscale worker node groups.
2.2.2 - Scale CloudStack cluster
Autoscaling
EKS Anywhere supports autoscaling of worker node groups using the Kubernetes Cluster Autoscaler, which is available as a curated package.
See here for details on how to configure your cluster spec to autoscale worker node groups.
2.2.3 - Scale Nutanix cluster
When you are scaling your Nutanix EKS Anywhere cluster, consider the number of nodes you need for your control plane and for your data plane. Each plane can be scaled horizontally (add more nodes) or vertically (provide nodes with more resources). In each case you can scale the cluster manually, semi-automatically, or automatically.
See the Kubernetes Components documentation to learn the differences between the control plane and the data plane (worker nodes).
Manual cluster scaling
Horizontally scaling the cluster is done by increasing the number for the control plane or worker node groups under the Cluster specification.
NOTE: If etcd is running on your control plane (the default configuration) you should scale your control plane in odd numbers (3, 5, 7…).
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: test-cluster
spec:
  controlPlaneConfiguration:
    count: 1 # increase this number to horizontally scale your control plane
  ...
  workerNodeGroupConfigurations:
  - count: 1 # increase this number to horizontally scale your data plane
Vertically scaling your cluster is done by updating the machine config spec for your infrastructure provider.
NOTE: Not all providers can be vertically scaled (e.g. bare metal).
For a Nutanix cluster, an example is:
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: NutanixMachineConfig
metadata:
  name: test-machine
  namespace: default
spec:
  systemDiskSize: 50 # increase this number to make the VM disk larger
  vcpuSockets: 8 # increase this number to add vCPUs to your VM
  memorySize: 8192 # increase this number to add memory to your VM
Once you have made configuration updates you can apply the changes to your cluster. If you are adding or removing a node, only the terminated nodes will be affected. If you are vertically scaling your nodes, then all nodes will be replaced one at a time.
eksctl anywhere upgrade cluster -f cluster.yaml
Semi-automatic scaling
Scaling your cluster in a semi-automatic way still requires changing your cluster manifest configuration. In a semi-automatic mode you change your cluster spec and then have automation make the cluster changes.
You can do this by storing your cluster config manifest in Git and then having a CI/CD system deploy your changes. Or you can use a GitOps controller to apply the changes. To read more about making changes with the integrated Flux GitOps controller, see Manage a cluster with GitOps.
Autoscaling
EKS Anywhere supports autoscaling of worker node groups using the Kubernetes Cluster Autoscaler, which is available as a curated package.
See here for details on how to configure your cluster spec to autoscale worker node groups.
2.2.4 - Scale vSphere cluster
When you are scaling your vSphere EKS Anywhere cluster, consider the number of nodes you need for your control plane and for your data plane. Each plane can be scaled horizontally (add more nodes) or vertically (provide nodes with more resources). In each case you can scale the cluster manually, semi-automatically, or automatically.
See the Kubernetes Components documentation to learn the differences between the control plane and the data plane (worker nodes).
Manual cluster scaling
Horizontally scaling the cluster is done by increasing the number for the control plane or worker node groups under the Cluster specification.
NOTE: If etcd is running on your control plane (the default configuration) you should scale your control plane in odd numbers (3, 5, 7…).
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: test-cluster
spec:
  controlPlaneConfiguration:
    count: 1 # increase this number to horizontally scale your control plane
  ...
  workerNodeGroupConfigurations:
  - count: 1 # increase this number to horizontally scale your data plane
Vertically scaling your cluster is done by updating the machine config spec for your infrastructure provider.
NOTE: Not all providers can be vertically scaled (e.g. bare metal).
For a vSphere cluster, an example is:
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: test-machine
  namespace: default
spec:
  diskGiB: 25 # increase this number to make the VM disk larger
  numCPUs: 2 # increase this number to add vCPUs to your VM
  memoryMiB: 8192 # increase this number to add memory to your VM
Once you have made configuration updates you can apply the changes to your cluster. If you are adding or removing a node, only the terminated nodes will be affected. If you are vertically scaling your nodes, then all nodes will be replaced one at a time.
eksctl anywhere upgrade cluster -f cluster.yaml
Semi-automatic scaling
Scaling your cluster in a semi-automatic way still requires changing your cluster manifest configuration. In a semi-automatic mode you change your cluster spec and then have automation make the cluster changes.
You can do this by storing your cluster config manifest in Git and then having a CI/CD system deploy your changes. Or you can use a GitOps controller to apply the changes. To read more about making changes with the integrated Flux GitOps controller, see Manage a cluster with GitOps.
Autoscaling
EKS Anywhere supports autoscaling of worker node groups using the Kubernetes Cluster Autoscaler, which is available as a curated package.
See here for details on how to configure your cluster spec to autoscale worker node groups.
2.3 - Upgrade cluster
2.3.1 - Upgrade Bare Metal cluster
EKS Anywhere provides the command upgrade, which allows you to upgrade various aspects of your EKS Anywhere cluster.
When you run eksctl anywhere upgrade cluster -f ./cluster.yaml, EKS Anywhere runs a set of preflight checks to ensure your cluster is ready to be upgraded.
EKS Anywhere then performs the upgrade, modifying your cluster to match the updated specification.
The upgrade command also upgrades core components of EKS Anywhere and lets the user enjoy the latest features, bug fixes and security patches.
NOTE: Currently only Minor Version Upgrades are supported for Bare Metal clusters. No other aspects of cluster upgrades are currently supported.
Minor Version Upgrades
Kubernetes has minor releases three times per year and EKS Distro follows a similar cadence. EKS Anywhere will add support for new EKS Distro releases as they are released, and you are advised to upgrade your cluster when possible.
Cluster upgrades are not handled automatically and require administrator action to modify the cluster specification and perform an upgrade. You are advised to upgrade your clusters in development environments first and verify your workloads and controllers are compatible with the new version.
Cluster upgrades are performed using a rolling upgrade process (similar to Kubernetes Deployments).
Upgrades can only happen one minor version at a time (e.g. 1.24 -> 1.25).
Control plane components will be upgraded before worker nodes.
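Because upgrades move one minor version at a time, it can help to check the intended version change before editing the cluster spec. The function below is a hypothetical pre-check, not an EKS Anywhere command; it assumes versions are written as MAJOR.MINOR, as in the kubernetesVersion field.

```shell
# Hypothetical pre-check (not part of the EKS Anywhere CLI).
# Succeeds only if the target version is exactly one minor version above the
# current version, e.g. 1.24 -> 1.25.
is_single_minor_step() {
  cur_major=${1%%.*}; cur_minor=${1#*.}
  new_major=${2%%.*}; new_minor=${2#*.}
  [ "$cur_major" = "$new_major" ] && [ "$new_minor" -eq $(( cur_minor + 1 )) ]
}
```

For example, `is_single_minor_step 1.24 1.25` succeeds, while `is_single_minor_step 1.24 1.26` fails, which signals that the upgrade has to be done in two passes.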
Prerequisites
This type of upgrade requires you to have one spare hardware server for control plane upgrade and one for each worker node group upgrade. The spare hardware server is provisioned with the new version and then an old server is deprovisioned. The deprovisioned server is then reprovisioned with the new version while another old server is deprovisioned. This happens one at a time until all the control plane components have been upgraded, followed by worker node upgrades.
Core component upgrades
EKS Anywhere upgrade also supports upgrading the following core components:
- Core CAPI
- CAPI providers
- Cilium CNI plugin
- Cert-manager
- Etcdadm CAPI provider
- EKS Anywhere controllers and CRDs
- GitOps controllers (Flux) - this is an optional component, will be upgraded only if specified
The latest versions of these core EKS Anywhere components are embedded into a bundles manifest that the CLI uses to fetch the latest versions and image builds needed for each component upgrade. The command detects both component version changes and new builds of the same versioned component. If there is a new Kubernetes version that is going to get rolled out, the core components get upgraded before the Kubernetes version. Irrespective of a Kubernetes version change, the upgrade command will always upgrade the internal EKS Anywhere components mentioned above to their latest available versions. All upgrade changes are backwards compatible.
Check upgrade components
Before you perform an upgrade, check the current and new versions of components that are ready to upgrade by typing:
eksctl anywhere upgrade plan cluster -f cluster.yaml
The output should appear similar to the following:
Worker node group name not specified. Defaulting name to md-0.
Warning: The recommended number of control plane nodes is 3 or 5
Worker node group name not specified. Defaulting name to md-0.
Checking new release availability...
NAME CURRENT VERSION NEXT VERSION
EKS-A v0.0.0-dev+build.1000+9886ba8 v0.0.0-dev+build.1105+46598cb
cluster-api v1.0.2+e8c48f5 v1.0.2+1274316
kubeadm v1.0.2+92c6d7e v1.0.2+aa1a03a
vsphere v1.0.1+efb002c v1.0.1+ef26ac1
kubadm v1.0.2+f002eae v1.0.2+f443dcf
etcdadm-bootstrap v1.0.2-rc3+54dcc82 v1.0.0-rc3+df07114
etcdadm-controller v1.0.2-rc3+a817792 v1.0.0-rc3+a310516
To format the output in JSON, add -o json to the end of the command line.
Check hardware availability
Next, you must ensure you have enough available hardware for the rolling upgrade operation to function. This type of upgrade requires you to have one spare hardware server for control plane upgrade and one for each worker node group upgrade. Check prerequisites for more information. Available hardware could have been fed to the cluster as extra hardware during a prior create command, or could be fed to the cluster during the upgrade process by providing the hardware CSV file to the upgrade cluster command.
To check if you have enough available hardware for rolling upgrade, you can use the kubectl command below to check if there are hardware objects with the selector labels corresponding to the controlplane/worker node group and without the ownerName label.
kubectl get hardware -n eksa-system --show-labels
For example, if you want to perform an upgrade on a cluster with one worker node group with selector label type=worker-group-1, then you must have an additional hardware object in your cluster with the label type=controlplane (for control plane upgrade) and one with type=worker-group-1 (for worker node group upgrade), neither of which has the ownerName label.
In the command output shown below, eksa-worker2 matches the selector label and it doesn’t have the ownerName label. Thus, it can be used to perform a rolling upgrade of worker-group-1. Similarly, eksa-controlplane-spare will be used for the rolling upgrade of the control plane.
kubectl get hardware -n eksa-system --show-labels
NAME STATE LABELS
eksa-controlplane type=controlplane,v1alpha1.tinkerbell.org/ownerName=abhnvp-control-plane-template-1656427179688-9rm5f,v1alpha1.tinkerbell.org/ownerNamespace=eksa-system
eksa-controlplane-spare type=controlplane
eksa-worker1 type=worker-group-1,v1alpha1.tinkerbell.org/ownerName=abhnvp-md-0-1656427179689-9fqnx,v1alpha1.tinkerbell.org/ownerNamespace=eksa-system
eksa-worker2 type=worker-group-1
If you don’t have any available hardware that matches this requirement in the cluster, you can set up a new hardware CSV. You can feed this hardware inventory file during the upgrade cluster command.
Performing a cluster upgrade
To perform a cluster upgrade you can modify your cluster specification kubernetesVersion field to the desired version.
As an example, to upgrade a cluster with version 1.24 to 1.25 you would change your spec as follows:
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: dev
spec:
  controlPlaneConfiguration:
    count: 1
    endpoint:
      host: "198.18.99.49"
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: dev
      ...
  kubernetesVersion: "1.25"
  ...
NOTE: If you have a custom machine image for your nodes in your cluster config yaml you may also need to update your TinkerbellDatacenterConfig with a new osImageURL.
Then run the upgrade cluster command.
Upgrade cluster command
With hardware CSV
eksctl anywhere upgrade cluster -f cluster.yaml --hardware-csv <hardware.csv>
Without hardware CSV
eksctl anywhere upgrade cluster -f cluster.yaml
This will upgrade the cluster specification (if specified), upgrade the core components to the latest available versions and apply the changes using the provisioner controllers.
Output
Example output:
✅ control plane ready
✅ worker nodes ready
✅ nodes ready
✅ cluster CRDs ready
✅ cluster object present on workload cluster
✅ upgrade cluster kubernetes version increment
✅ validate immutable fields
🎉 all cluster upgrade preflight validations passed
Performing provider setup and validations
Pausing EKS-A cluster controller reconcile
Pausing Flux kustomization
GitOps field not specified, pause flux kustomization skipped
Creating bootstrap cluster
Installing cluster-api providers on bootstrap cluster
Moving cluster management from workload to bootstrap cluster
Upgrading workload cluster
Moving cluster management from bootstrap to workload cluster
Applying new EKS-A cluster resource; resuming reconcile
Resuming EKS-A controller reconciliation
Updating Git Repo with new EKS-A cluster spec
GitOps field not specified, update git repo skipped
Forcing reconcile Git repo with latest commit
GitOps not configured, force reconcile flux git repo skipped
Resuming Flux kustomization
GitOps field not specified, resume flux kustomization skipped
During the upgrade process, EKS Anywhere pauses the cluster controller reconciliation by adding the paused annotation anywhere.eks.amazonaws.com/paused: true to the EKS Anywhere cluster, provider datacenterconfig and machineconfig resources, before the components upgrade. After the upgrade completes, the annotations are removed so that the cluster controller resumes reconciling the cluster.
Though not recommended, you can manually pause the EKS Anywhere cluster controller reconciliation to perform extended maintenance work or interact with Cluster API objects directly. To do so, add the paused annotation to the cluster resource:
kubectl annotate clusters.anywhere.eks.amazonaws.com ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} anywhere.eks.amazonaws.com/paused=true
After finishing the task, make sure you resume the cluster reconciliation by removing the paused annotation, so that EKS Anywhere cluster controller can continue working as expected.
kubectl annotate clusters.anywhere.eks.amazonaws.com ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} anywhere.eks.amazonaws.com/paused-
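To confirm whether reconciliation is currently paused, you can read the annotation back. The function name below is hypothetical, and the sketch relies on kubectl's jsonpath output format, where dots inside an annotation key are escaped with a backslash.

```shell
# Hypothetical helper: succeeds if the paused annotation is set to "true"
# on the given EKS Anywhere cluster resource.
# Usage: reconcile_paused <cluster-name> <cluster-namespace>
reconcile_paused() {
  value=$(kubectl get clusters.anywhere.eks.amazonaws.com "$1" -n "$2" \
    -o jsonpath='{.metadata.annotations.anywhere\.eks\.amazonaws\.com/paused}')
  [ "$value" = "true" ]
}
```

For example: `reconcile_paused "${CLUSTER_NAME}" "${CLUSTER_NAMESPACE}" && echo "reconciliation is paused"`.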
Upgradeable cluster attributes
Cluster: kubernetesVersion
Advanced configuration for rolling upgrade
EKS Anywhere allows an optional configuration to customize the behavior of upgrades.
You can specify two parameters that control the desired behavior of rolling upgrades:
- maxSurge - The maximum number of machines that can be scheduled above the desired number of machines. When not specified, the current CAPI default of 1 is used.
- maxUnavailable - The maximum number of machines that can be unavailable during the upgrade. When not specified, the current CAPI default of 0 is used.
Example configuration:
upgradeRolloutStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0 # only configurable for worker nodes
‘upgradeRolloutStrategy’ configuration can be specified separately for control plane and for each worker node group. This template contains an example for control plane under the ‘controlPlaneConfiguration’ section and for worker node group under ‘workerNodeGroupConfigurations’:
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: my-cluster-name
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 1
    endpoint:
      host: "10.61.248.209"
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: my-cluster-name-cp
    upgradeRolloutStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 1
  datacenterRef:
    kind: TinkerbellDatacenterConfig
    name: my-cluster-name
  kubernetesVersion: "1.25"
  managementCluster:
    name: my-cluster-name
  workerNodeGroupConfigurations:
  - count: 2
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: my-cluster-name
    name: md-0
    upgradeRolloutStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
---
...
upgradeRolloutStrategy
Configuration parameters for upgrade strategy.
upgradeRolloutStrategy.type
Type of rollout strategy. Currently only RollingUpdate is supported.
upgradeRolloutStrategy.rollingUpdate
Configuration parameters for customizing rolling upgrade behavior.
upgradeRolloutStrategy.rollingUpdate.maxSurge
Default: 1
This cannot be 0 if maxUnavailable is 0.
The maximum number of machines that can be scheduled above the desired number of machines.
Example: When this is set to n, the new worker node group can be scaled up immediately by n when the rolling upgrade starts. Total number of machines in the cluster (old + new) never exceeds (desired number of machines + n). Once scale down happens and old machines are brought down, the new worker node group can be scaled up further ensuring that the total number of machines running at any time does not exceed the desired number of machines + n.
upgradeRolloutStrategy.rollingUpdate.maxUnavailable
Default: 0
This cannot be 0 if maxSurge is 0.
The maximum number of machines that can be unavailable during the upgrade.
Example: When this is set to n, the old worker node group can be scaled down by n machines immediately when the rolling upgrade starts. Once new machines are ready, the old worker node group can be scaled down further, followed by scaling up the new worker node group, ensuring that the total number of machines unavailable at any time during the upgrade never exceeds n.
Rolling upgrades with no additional hardware
When maxSurge is set to 0 and maxUnavailable is set to 1, it allows for a rolling upgrade without need for additional hardware. Use this configuration if your workloads can tolerate node unavailability.
NOTE: This can ONLY be used if unavailability of a maximum of 1 node is acceptable. For single node clusters, an additional temporary machine is a must. Alternatively, you may recreate the single node cluster for upgrading and handle data recovery manually.
With this kind of configuration, the rolling upgrade proceeds node by node: each node is fully deprovisioned and deleted before it is re-provisioned with the upgraded version and re-joined to the cluster. This means that at any point during the rolling upgrade, one node could be unavailable.
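The effect of the two parameters can be worked through with a small calculation. The following is a minimal sketch, not part of EKS Anywhere: the function name rollout_bounds is hypothetical, and it only applies the definitions of maxSurge and maxUnavailable given above to a desired machine count.

```shell
#!/usr/bin/env bash
# Illustrative helper (hypothetical, not an EKS Anywhere command): given a
# desired machine count and the rolling-update settings, print the bounds
# implied by the maxSurge / maxUnavailable definitions.
rollout_bounds() {
  local desired=$1 max_surge=$2 max_unavailable=$3

  # Both values being 0 would leave no room to replace any machine.
  if [ "$max_surge" -eq 0 ] && [ "$max_unavailable" -eq 0 ]; then
    echo "invalid: maxSurge and maxUnavailable cannot both be 0" >&2
    return 1
  fi

  local max_total=$((desired + max_surge))
  local min_available=$((desired - max_unavailable))
  if [ "$min_available" -lt 0 ]; then min_available=0; fi

  echo "max_total=${max_total} min_available=${min_available}"
}

rollout_bounds 3 1 0   # defaults: one extra machine, none unavailable
rollout_bounds 3 0 1   # no extra hardware: one node may be down at a time
```

For a group of three machines, the defaults need capacity for a fourth machine during the upgrade, while maxSurge 0 with maxUnavailable 1 stays within three machines at the cost of running on two for part of the upgrade.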
Troubleshooting
Attempting to upgrade a cluster by more than one minor release at a time results in the following error.
✅ validate immutable fields
❌ validation failed {"validation": "Upgrade preflight validations", "error": "validation failed with 1 errors: WARNING: version difference between upgrade version (1.21) and server version (1.19) do not meet the supported version increment of +1", "remediation": ""}
Error: failed to upgrade cluster: validations failed
For more errors, see the troubleshooting section.
2.3.2 - Upgrade vSphere, CloudStack, Nutanix, or Snow cluster
EKS Anywhere provides the command upgrade, which allows you to upgrade various aspects of your EKS Anywhere cluster.
When you run eksctl anywhere upgrade cluster -f ./cluster.yaml, EKS Anywhere runs a set of preflight checks to ensure your cluster is ready to be upgraded.
EKS Anywhere then performs the upgrade, modifying your cluster to match the updated specification.
The upgrade command also upgrades core components of EKS Anywhere and lets the user enjoy the latest features, bug fixes and security patches.
NOTE: If an upgrade fails, it is very important not to delete the Docker containers running the KinD bootstrap cluster. During an upgrade, the bootstrap cluster contains critical EKS Anywhere components. If it is deleted after a failed upgrade, they cannot be recovered.
Minor Version Upgrades
Kubernetes has minor releases three times per year and EKS Distro follows a similar cadence. EKS Anywhere will add support for new EKS Distro releases as they are released, and you are advised to upgrade your cluster when possible.
Cluster upgrades are not handled automatically and require administrator action to modify the cluster specification and perform an upgrade. You are advised to upgrade your clusters in development environments first and verify your workloads and controllers are compatible with the new version.
Cluster upgrades are performed in place using a rolling process (similar to Kubernetes Deployments).
Upgrades can only happen one minor version at a time (e.g. 1.24 -> 1.25).
Control plane components will be upgraded before worker nodes.
A new VM is created with the new version and then an old VM is removed. This happens one at a time until all the control plane components have been upgraded.
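Because only a single minor version step is supported, it can be convenient to check the planned change before editing the cluster specification. The sketch below is a hypothetical local pre-check, not the validation that EKS Anywhere itself runs; the function name is_supported_increment is an assumption, and it treats an unchanged version as acceptable since the upgrade command can also be run without a Kubernetes version change.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check: succeeds only when the target Kubernetes version is
# the same as, or exactly one minor release above, the current version.
is_supported_increment() {
  local current=$1 target=$2
  local cur_major=${current%%.*} cur_minor=${current#*.}
  local tgt_major=${target%%.*} tgt_minor=${target#*.}
  # Drop any patch component such as the ".7" in "1.20.7".
  cur_minor=${cur_minor%%.*}
  tgt_minor=${tgt_minor%%.*}

  [ "$cur_major" -eq "$tgt_major" ] || return 1
  local diff=$((tgt_minor - cur_minor))
  [ "$diff" -eq 0 ] || [ "$diff" -eq 1 ]
}

is_supported_increment 1.24 1.25 && echo "1.24 -> 1.25: supported"
is_supported_increment 1.19 1.21 || echo "1.19 -> 1.21: upgrade one minor version at a time"
```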
Core component upgrades
EKS Anywhere upgrade also supports upgrading the following core components:
- Core CAPI
- CAPI providers
- Cilium CNI plugin
- Cert-manager
- Etcdadm CAPI provider
- EKS Anywhere controllers and CRDs
- GitOps controllers (Flux) - this is an optional component, will be upgraded only if specified
The latest versions of these core EKS Anywhere components are embedded into a bundles manifest that the CLI uses to fetch the latest versions and image builds needed for each component upgrade. The command detects both component version changes and new builds of the same versioned component. If there is a new Kubernetes version that is going to get rolled out, the core components get upgraded before the Kubernetes version. Irrespective of a Kubernetes version change, the upgrade command will always upgrade the internal EKS Anywhere components mentioned above to their latest available versions. All upgrade changes are backwards compatible.
Specifically for the Snow provider, a new Admin instance is needed when upgrading to new versions of EKS Anywhere. See Upgrade EKS Anywhere AMIs in Snowball Edge devices to upgrade and use a new Admin instance in Snow devices. After that, upgrades of other components can be done as described in this document.
Check upgrade components
Before you perform an upgrade, check the current and new versions of components that are ready to upgrade by typing:
Management Cluster
eksctl anywhere upgrade plan cluster -f mgmt-cluster.yaml
Workload Cluster
eksctl anywhere upgrade plan cluster -f workload-cluster.yaml --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig
The output should appear similar to the following:
Worker node group name not specified. Defaulting name to md-0.
Warning: The recommended number of control plane nodes is 3 or 5
Worker node group name not specified. Defaulting name to md-0.
Checking new release availability...
NAME CURRENT VERSION NEXT VERSION
EKS-A v0.0.0-dev+build.1000+9886ba8 v0.0.0-dev+build.1105+46598cb
cluster-api v1.0.2+e8c48f5 v1.0.2+1274316
kubeadm v1.0.2+92c6d7e v1.0.2+aa1a03a
vsphere v1.0.1+efb002c v1.0.1+ef26ac1
kubadm v1.0.2+f002eae v1.0.2+f443dcf
etcdadm-bootstrap v1.0.2-rc3+54dcc82 v1.0.0-rc3+df07114
etcdadm-controller v1.0.2-rc3+a817792 v1.0.0-rc3+a310516
To format the output as JSON, add -o json to the end of the command line.
Performing a cluster upgrade
To perform a cluster upgrade you can modify your cluster specification kubernetesVersion field to the desired version.
As an example, to upgrade a cluster with version 1.24 to 1.25 you would change your spec:
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: dev
spec:
  controlPlaneConfiguration:
    count: 1
    endpoint:
      host: "198.18.99.49"
    machineGroupRef:
      kind: VSphereMachineConfig
      name: dev
      ...
  kubernetesVersion: "1.25"
  ...
NOTE: If you have a custom machine image for your nodes you may also need to update your vsphereMachineConfig with a new template.
Then run the upgrade command:
Management Cluster
eksctl anywhere upgrade cluster -f mgmt-cluster.yaml
Workload Cluster
eksctl anywhere upgrade cluster -f workload-cluster.yaml --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig
This will upgrade the cluster specification (if specified), upgrade the core components to the latest available versions and apply the changes using the provisioner controllers.
Example output:
✅ control plane ready
✅ worker nodes ready
✅ nodes ready
✅ cluster CRDs ready
✅ cluster object present on workload cluster
✅ upgrade cluster kubernetes version increment
✅ validate immutable fields
🎉 all cluster upgrade preflight validations passed
Performing provider setup and validations
Pausing EKS-A cluster controller reconcile
Pausing Flux kustomization
GitOps field not specified, pause flux kustomization skipped
Creating bootstrap cluster
Installing cluster-api providers on bootstrap cluster
Moving cluster management from workload to bootstrap cluster
Upgrading workload cluster
Moving cluster management from bootstrap to workload cluster
Applying new EKS-A cluster resource; resuming reconcile
Resuming EKS-A controller reconciliation
Updating Git Repo with new EKS-A cluster spec
GitOps field not specified, update git repo skipped
Forcing reconcile Git repo with latest commit
GitOps not configured, force reconcile flux git repo skipped
Resuming Flux kustomization
GitOps field not specified, resume flux kustomization skipped
During the upgrade process, EKS Anywhere pauses the cluster controller reconciliation by adding the paused annotation anywhere.eks.amazonaws.com/paused: true to the EKS Anywhere cluster, provider datacenterconfig and machineconfig resources, before the components upgrade. After the upgrade completes, the annotations are removed so that the cluster controller resumes reconciling the cluster.
Though not recommended, you can manually pause the EKS Anywhere cluster controller reconciliation to perform extended maintenance work or interact with Cluster API objects directly. To do it, you can add the paused annotation to the cluster resource:
kubectl annotate clusters.anywhere.eks.amazonaws.com ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} anywhere.eks.amazonaws.com/paused=true
After finishing the task, make sure you resume the cluster reconciliation by removing the paused annotation, so that EKS Anywhere cluster controller can continue working as expected.
kubectl annotate clusters.anywhere.eks.amazonaws.com ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} anywhere.eks.amazonaws.com/paused-
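Since forgetting to remove the annotation leaves the cluster unreconciled, one option is to wrap the maintenance work so that the resume step always runs. The following is a minimal sketch built from the two annotate commands above; the function name with_paused_reconcile and the placeholder my_maintenance_task are hypothetical, and CLUSTER_NAME and CLUSTER_NAMESPACE are assumed to be set.

```shell
#!/usr/bin/env bash
# Sketch: run a maintenance command with EKS Anywhere reconciliation paused,
# and remove the paused annotation afterwards even if the command fails.
# Assumes CLUSTER_NAME and CLUSTER_NAMESPACE are set in the environment.
with_paused_reconcile() {
  local resource="clusters.anywhere.eks.amazonaws.com"
  local annotation="anywhere.eks.amazonaws.com/paused"

  # Pause reconciliation; give up if the annotation cannot be added.
  kubectl annotate "$resource" "$CLUSTER_NAME" -n "$CLUSTER_NAMESPACE" \
    "${annotation}=true" || return 1

  # Run the maintenance command passed as arguments and remember its status.
  local status=0
  "$@" || status=$?

  # Resume reconciliation regardless of how the maintenance command ended.
  kubectl annotate "$resource" "$CLUSTER_NAME" -n "$CLUSTER_NAMESPACE" \
    "${annotation}-" || return 1
  return "$status"
}

# Example (my_maintenance_task is a placeholder for your own command):
# with_paused_reconcile my_maintenance_task
```

The function returns the exit status of the maintenance command, so a calling script can still detect a failed task after reconciliation has been resumed.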
Upgradeable Cluster Attributes
EKS Anywhere upgrade supports upgrading more than just the kubernetesVersion, allowing you to upgrade a number of fields simultaneously with the same procedure.
Upgradeable Attributes
Cluster:
- kubernetesVersion
- controlPlaneConfiguration.count
- controlPlaneConfiguration.machineGroupRef.name
- workerNodeGroupConfigurations.count
- workerNodeGroupConfigurations.machineGroupRef.name
- etcdConfiguration.externalConfiguration.machineGroupRef.name
- identityProviderRefs (only for kind: OIDCConfig; kind: AWSIamConfig is immutable)
- gitOpsRef (once set, you can’t change or delete the field’s content later)
VSphereMachineConfig:
- datastore
- diskGiB
- folder
- memoryMiB
- numCPUs
- resourcePool
- template
- users
NutanixMachineConfig:
- vcpusPerSocket
- vcpuSockets
- memorySize
- image
- cluster
- subnet
- systemDiskSize
SnowMachineConfig:
- amiID
- instanceType
- physicalNetworkConnector
- sshKeyName
- devices
- containersVolume
- osFamily
- network
OIDCConfig:
- clientID
- groupsClaim
- groupsPrefix
- issuerUrl
- requiredClaims.claim
- requiredClaims.value
- usernameClaim
- usernamePrefix
AWSIamConfig:
- mapRoles
- mapUsers
EKS Anywhere upgrade also supports adding more worker node groups post-creation.
To add more worker node groups, modify your cluster config file to define the additional group(s).
Example:
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: dev
spec:
  controlPlaneConfiguration:
    ...
  workerNodeGroupConfigurations:
  - count: 2
    machineGroupRef:
      kind: VSphereMachineConfig
      name: my-cluster-machines
    name: md-0
  - count: 2
    machineGroupRef:
      kind: VSphereMachineConfig
      name: my-cluster-machines
    name: md-1
  ...
Worker node groups can use the same machineGroupRef as previous groups, or you can define a new machine configuration for your new group.
Resume upgrade after failure
EKS Anywhere supports re-running the upgrade command post-failure as an experimental feature.
If the upgrade command fails, you can manually fix the issue (when applicable) and simply rerun the same command. At this point, the CLI will skip the completed tasks, restore the state of the operation, and resume the upgrade process.
The completed tasks are stored in the generated folder as a file named <clusterName>-checkpoint.yaml.
To enable this experimental feature, export the following environment variable:
export CHECKPOINT_ENABLED=true
Troubleshooting
Attempting to upgrade a cluster by more than one minor release at a time results in the following error.
✅ validate immutable fields
❌ validation failed {"validation": "Upgrade preflight validations", "error": "validation failed with 1 errors: WARNING: version difference between upgrade version (1.21) and server version (1.19) do not meet the supported version increment of +1", "remediation": ""}
Error: failed to upgrade cluster: validations failed
For more errors, see the troubleshooting section.
2.4 - Etcd Backup and Restore
NOTE: External etcd topology is supported for vSphere and CloudStack clusters, but not yet for Bare Metal or Nutanix clusters.
This page contains steps for backing up a cluster by taking an etcd snapshot, and restoring the cluster from a snapshot. These steps are for an EKS Anywhere cluster provisioned using the external etcd topology (selected by default) and Ubuntu OVAs.
Use case
EKS Anywhere clusters use etcd as the backing store. Taking a snapshot of etcd backs up the entire cluster data. This can later be used to restore a cluster back to an earlier state if required. Etcd backups can be taken prior to a cluster upgrade, so if the upgrade doesn’t go as planned you can restore from the backup.
Backup
Etcd offers a built-in snapshot mechanism. You can take a snapshot using the etcdctl snapshot save command by following the steps given below.
- Log in to any one of the etcd VMs
ssh -i $PRIV_KEY ec2-user@$ETCD_VM_IP
- Run the etcdctl command to take a snapshot with the following steps
sudo su
source /etc/etcd/etcdctl.env
etcdctl snapshot save snapshot.db
chown ec2-user snapshot.db
- Exit the VM. Copy the snapshot from the VM to your local/admin setup where you can save snapshots in a secure place. Before running scp, make sure you don’t already have a snapshot file saved by the same name locally.
scp -i $PRIV_KEY ec2-user@$ETCD_VM_IP:/home/ec2-user/snapshot.db .
NOTE: This snapshot file contains all information stored in the cluster, so make sure you save it securely (encrypt it).
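To avoid overwriting an earlier snapshot when copying, you could pick a fresh local file name before running scp. This is only a sketch: the function name next_snapshot_path, the snapshot-<stamp>.db naming scheme, and the $HOME/etcd-backups directory are illustrative assumptions, not something EKS Anywhere prescribes.

```shell
#!/usr/bin/env bash
# Sketch: choose a local file name for a copied etcd snapshot that does not
# collide with an existing file. The naming scheme is only an illustration.
next_snapshot_path() {
  local dir=$1 stamp=$2
  local candidate="${dir}/snapshot-${stamp}.db"
  local n=1
  # Append a counter until the name is unused.
  while [ -e "$candidate" ]; do
    candidate="${dir}/snapshot-${stamp}-${n}.db"
    n=$((n + 1))
  done
  echo "$candidate"
}

# Example: copy the snapshot under a fresh, timestamped name.
# dest=$(next_snapshot_path "$HOME/etcd-backups" "$(date +%Y%m%d-%H%M%S)")
# scp -i $PRIV_KEY ec2-user@$ETCD_VM_IP:/home/ec2-user/snapshot.db "$dest"
```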
Restore
Restoring etcd is a 2-part process. The first part is restoring etcd using the snapshot, creating a new data-dir for etcd. The second part is replacing the current etcd data-dir with the one generated after restore. During etcd data-dir replacement, we cannot have any kube-apiserver instances running in the cluster. So we will first stop all instances of kube-apiserver and other controlplane components using the following steps for every controlplane VM:
Pausing Etcdadm controller reconcile
During restore, it is required to pause the Etcdadm controller reconcile for the target cluster (whether it is a management or workload cluster). To do that, you need to add a cluster.x-k8s.io/paused annotation to the target cluster’s etcdadmclusters resource. For example,
kubectl annotate etcdadmclusters workload-cluster-1-etcd cluster.x-k8s.io/paused=true -n eksa-system --kubeconfig mgmt-cluster.kubeconfig
Stopping the controlplane components
- Log in to a controlplane VM
ssh -i $PRIV_KEY ec2-user@$CONTROLPLANE_VM_IP
- Stop controlplane components by moving the static pod manifests under a temp directory:
sudo su
mkdir temp-manifests
mv /etc/kubernetes/manifests/*.yaml temp-manifests
- Repeat these steps for all other controlplane VMs
After this you can restore etcd from a saved snapshot using the etcdctl snapshot restore command following the steps given below.
Restoring from the snapshot
- The snapshot file should be made available in every etcd VM of the cluster. You can copy it to each etcd VM using this command:
scp -i $PRIV_KEY snapshot.db ec2-user@$ETCD_VM_IP:/home/ec2-user
- To run the etcdctl snapshot restore command, you need to provide the following configuration parameters:
- name: This is the name of the etcd member. The value of this parameter should match the value used while starting the member. This can be obtained by running:
export ETCD_NAME=$(cat /etc/etcd/etcd.env | grep ETCD_NAME | awk -F'=' '{print $2}')
- initial-advertise-peer-urls: This is the advertise peer URL with which this etcd member was configured. It should be the exact value with which this etcd member was started. This can be obtained by running:
export ETCD_INITIAL_ADVERTISE_PEER_URLS=$(cat /etc/etcd/etcd.env | grep ETCD_INITIAL_ADVERTISE_PEER_URLS | awk -F'=' '{print $2}')
- initial-cluster: This should be a comma-separated mapping of each etcd member name to its peer URL. To build it, get the ETCD_NAME and ETCD_INITIAL_ADVERTISE_PEER_URLS values for each member and join them. Then use this exact value on all etcd VMs. For example, for a 3-member etcd cluster the value would look like this (the command below is only an example and cannot be run without substituting the required variables):
export ETCD_INITIAL_CLUSTER=${ETCD_NAME_1}=${ETCD_INITIAL_ADVERTISE_PEER_URLS_1},${ETCD_NAME_2}=${ETCD_INITIAL_ADVERTISE_PEER_URLS_2},${ETCD_NAME_3}=${ETCD_INITIAL_ADVERTISE_PEER_URLS_3}
- initial-cluster-token: Set this to a unique value and use the same value for all etcd members of the cluster. It can be any value such as etcd-cluster-1 as long as it hasn’t been used before.
- Gather the required env vars for the restore command
cat <<EOF >> restore.env
export ETCD_NAME=$(cat /etc/etcd/etcd.env | grep ETCD_NAME | awk -F'=' '{print $2}')
export ETCD_INITIAL_ADVERTISE_PEER_URLS=$(cat /etc/etcd/etcd.env | grep ETCD_INITIAL_ADVERTISE_PEER_URLS | awk -F'=' '{print $2}')
EOF
cat /etc/etcd/etcdctl.env >> restore.env
- Make sure you form the correct ETCD_INITIAL_CLUSTER value using all etcd members, and set it as an env var in the restore.env file created in the above step.
- Once you have obtained all the right values, run the following commands to restore etcd, replacing the required values:
sudo su
source restore.env
etcdctl snapshot restore snapshot.db --name=${ETCD_NAME} --initial-cluster=${ETCD_INITIAL_CLUSTER} --initial-cluster-token=etcd-cluster-1 --initial-advertise-peer-urls=${ETCD_INITIAL_ADVERTISE_PEER_URLS}
- This is going to create a new data-dir for the restored contents under a new directory {ETCD_NAME}.etcd. To start using this, restart etcd with the new data-dir with the following steps:
systemctl stop etcd.service
mv /var/lib/etcd/member /var/lib/etcd/member.bak
mv ${ETCD_NAME}.etcd/member /var/lib/etcd/
- Perform this directory swap on all etcd VMs, and then start etcd again on those VMs
systemctl start etcd.service
NOTE: Until the etcd process is started on all VMs, it might appear stuck on the VMs where it was started first, but this should be temporary.
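The ETCD_INITIAL_CLUSTER value used in the restore steps above is just the name=peer-URL pairs of all members joined with commas. A minimal sketch of assembling it is shown below; the function name build_initial_cluster is hypothetical, and the member names and URLs are made-up examples that must be replaced with the values read from each etcd VM.

```shell
#!/usr/bin/env bash
# Sketch: join "name=peer-url" pairs into the comma-separated value expected
# by --initial-cluster. Member names and URLs below are made-up examples.
build_initial_cluster() {
  local result="" pair
  for pair in "$@"; do
    # Each argument must look like NAME=URL.
    case "$pair" in
      ?*=?*) ;;
      *) echo "invalid member spec: $pair" >&2; return 1 ;;
    esac
    result="${result:+${result},}${pair}"
  done
  echo "$result"
}

ETCD_INITIAL_CLUSTER=$(build_initial_cluster \
  "etcd-1=https://10.0.0.11:2380" \
  "etcd-2=https://10.0.0.12:2380" \
  "etcd-3=https://10.0.0.13:2380")
echo "export ETCD_INITIAL_CLUSTER=${ETCD_INITIAL_CLUSTER}"
```

The printed export line can be appended to restore.env on every etcd VM so that each member is restored with the same value.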
Starting the controlplane components
- Log in to a controlplane VM
ssh -i $PRIV_KEY ec2-user@$CONTROLPLANE_VM_IP
- Start the controlplane components by moving back the static pod manifests from under the temp directory to the /etc/kubernetes/manifests directory:
mv temp-manifests/*.yaml /etc/kubernetes/manifests
- Repeat these steps for all other controlplane VMs
- It may take a few minutes for the kube-apiserver and the other components to get restarted. After this you should be able to access all objects present in the cluster at the time the backup was taken.
Resuming Etcdadm controller reconcile
Resume Etcdadm controller reconcile for the target cluster by removing the cluster.x-k8s.io/paused annotation in the target cluster’s etcdadmclusters resource. For example,
kubectl annotate etcdadmclusters workload-cluster-1-etcd cluster.x-k8s.io/paused- -n eksa-system
2.5 - Verify cluster
To verify that a cluster control plane is up and running, use the kubectl command to show that the control plane pods are all running.
kubectl get po -A -l control-plane=controller-manager
NAMESPACE NAME READY STATUS RESTARTS AGE
capi-kubeadm-bootstrap-system capi-kubeadm-bootstrap-controller-manager-57b99f579f-sd85g 2/2 Running 0 47m
capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-79cdf98fb8-ll498 2/2 Running 0 47m
capi-system capi-controller-manager-59f4547955-2ks8t 2/2 Running 0 47m
capi-webhook-system capi-controller-manager-bb4dc9878-2j8mg 2/2 Running 0 47m
capi-webhook-system capi-kubeadm-bootstrap-controller-manager-6b4cb6f656-qfppd 2/2 Running 0 47m
capi-webhook-system capi-kubeadm-control-plane-controller-manager-bf7878ffc-rgsm8 2/2 Running 0 47m
capi-webhook-system capv-controller-manager-5668dbcd5-v5szb 2/2 Running 0 47m
capv-system capv-controller-manager-584886b7bd-f66hs 2/2 Running 0 47m
You may also check the status of the cluster control plane resource directly. This can be especially useful to verify clusters with multiple control plane nodes after an upgrade.
kubectl get kubeadmcontrolplanes.controlplane.cluster.x-k8s.io
NAME INITIALIZED API SERVER AVAILABLE VERSION REPLICAS READY UPDATED UNAVAILABLE
supportbundletestcluster true true v1.20.7-eks-1-20-6 1 1 1
To verify that the expected number of cluster worker nodes are up and running, use the kubectl command to show that nodes are Ready.
This will confirm that the expected number of worker nodes are present.
Worker nodes are named using the cluster name followed by the worker node group name (example: my-cluster-md-0).
kubectl get nodes
NAME STATUS ROLES AGE VERSION
supportbundletestcluster-md-0-55bb5ccd-mrcf9 Ready <none> 4m v1.20.7-eks-1-20-6
supportbundletestcluster-md-0-55bb5ccd-zrh97 Ready <none> 4m v1.20.7-eks-1-20-6
supportbundletestcluster-mdrwf Ready control-plane,master 5m v1.20.7-eks-1-20-6
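If you want to script this check, the Ready nodes can be counted from the same output. This is a minimal sketch that assumes the default column layout shown above (STATUS in the second column); the function name count_ready_nodes is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count nodes reported as Ready in `kubectl get nodes` output read
# from standard input. Only an exact STATUS of "Ready" is counted, so values
# such as "NotReady" or "Ready,SchedulingDisabled" are excluded.
count_ready_nodes() {
  awk 'NR > 1 && $2 == "Ready" { n++ } END { print n + 0 }'
}

# Example: warn if fewer than the expected number of nodes are Ready.
# expected=3
# [ "$(kubectl get nodes | count_ready_nodes)" -ge "$expected" ] || echo "some nodes are not Ready"
```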
To test a workload in your cluster, you can try deploying the hello-eks-anywhere test application.
2.6 - Add cluster integrations
EKS Anywhere offers AWS support for certain third-party vendor components, namely Ubuntu LTS, Cilium, and Flux. It also provides flexibility for you to integrate with your choice of tools in other areas. Below is a list of example third-party tools for your consideration.
For a full list of partner integration options, please visit the Amazon EKS Anywhere Partner page.
Note
The solutions listed on this page have not been tested by AWS and are not covered by the EKS Anywhere Support Subscription.
Feature | Example third-party tools |
---|---|
Ingress controller | Gloo Edge, Emissary-ingress (previously Ambassador) |
Service type load balancer | MetalLB |
Local container repository | Harbor |
Monitoring | Prometheus, Grafana, Datadog, or NewRelic |
Logging | Splunk or Fluentbit |
Secret management | Hashi Vault |
Policy agent | Open Policy Agent |
Service mesh | Istio, Gloo Mesh, or Linkerd |
Cost management | KubeCost |
Etcd backup and restore | Velero |
Storage | Default storage class, any compatible CSI |
2.7 - Reboot nodes
If you need to reboot a node in your cluster for maintenance or any other reason, performing the following steps will help prevent possible disruption of services on those nodes:
Warning
Rebooting a cluster node as described here is good for all nodes, but is critically important when rebooting a Bottlerocket node running the boots service on a Bare Metal cluster. If the node goes down while running the boots service, the Bottlerocket node will not be able to boot again until the boots service is restored on another machine. This is because Bottlerocket must get its address from a DHCP service.
- Cordon the node so no further workloads are scheduled to run on it:
kubectl cordon <node-name>
- Drain the node of all current workloads:
kubectl drain <node-name>
- Shut down. Using the appropriate method for your provider, shut down the node.
- Perform system maintenance or any other task you need to do on the node and boot up the node.
- Uncordon the node so that it can begin receiving workloads again:
kubectl uncordon <node-name>
2.8 - Connect cluster to console
The AWS EKS Connector lets you connect your EKS Anywhere cluster to the AWS EKS console, where you can see your EKS Anywhere cluster, its configuration, workloads, and their status. EKS Connector is a software agent that can be deployed on your EKS Anywhere cluster, enabling the cluster to register with the EKS console.
Visit AWS EKS Connector for details.
2.9 - License cluster
If you are licensing an existing cluster, apply the following secret to your cluster (replacing my-license-here with your license):
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: eksa-license
  namespace: eksa-system
stringData:
  license: "my-license-here"
type: Opaque
EOF
2.10 - Multus CNI plugin configuration
NOTE: Currently, Multus support is only available with the EKS Anywhere Bare Metal provider. The vSphere and CloudStack providers do not have multi-network support for cluster machines. Once multiple network support is added to those clusters, Multus CNI can be supported.
Multus CNI is a container network interface plugin for Kubernetes that enables attaching multiple network interfaces to pods. In Kubernetes, each pod has only one network interface by default, other than local loopback. With Multus, you can create multi-homed pods that have multiple interfaces. Multus acts as a ‘meta’ plugin that can call other CNI plugins to configure additional interfaces.
Pre-Requisites
Given that Multus CNI is used to create pods with multiple network interfaces, the cluster machines that these pods run on need to have multiple network interfaces attached and configured. The interfaces on multi-homed pods need to map to these interfaces on the machines.
For Bare Metal clusters using the Tinkerbell provider, the cluster machines need to have multiple network interfaces cabled in and appropriate network configuration put in place during machine provisioning.
Overview of Multus setup
The following diagrams show the result of two applications (app1 and app2) running in pods that use the Multus plugin to communicate over two network interfaces (eth0 and net1) from within the pods. The Multus plugin uses two network interfaces on the worker node (eth0 and eth1) to provide communications outside of the node.
Follow the procedure below to set up Multus as illustrated in the previous diagrams.
Install and configure Multus
Deploying Multus using a Daemonset will spin up pods that install a Multus binary and configure Multus for usage in every node in the cluster. Here are the steps for doing that.
- Clone the Multus CNI repo:
git clone https://github.com/k8snetworkplumbingwg/multus-cni.git && cd multus-cni
- Apply Multus daemonset to your EKS Anywhere cluster:
kubectl apply -f ./deployments/multus-daemonset-thick-plugin.yml
- Verify that you have Multus pods running:
kubectl get pods --all-namespaces | grep -i multus
- Check that Multus is running:
kubectl get pods -A | grep multus
Output:
kube-system kube-multus-ds-bmfjs 1/1 Running 0 3d1h
kube-system kube-multus-ds-fk2sk 1/1 Running 0 3d1h
Create Network Attachment Definition
You need to create a Network Attachment Definition for the CNI you wish to use as the plugin for the additional interface.
You can verify that your intended CNI plugin is supported by ensuring that the binary corresponding to that CNI plugin is present in the node’s /opt/cni/bin directory.
Below is an example of a Network Attachment Definition yaml:
cat <<EOF | kubectl create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ipvlan-conf
spec:
  config: '{
      "cniVersion": "0.3.0",
      "type": "ipvlan",
      "master": "eth1",
      "mode": "l3",
      "ipam": {
        "type": "host-local",
        "subnet": "198.17.0.0/24",
        "rangeStart": "198.17.0.200",
        "rangeEnd": "198.17.0.216",
        "routes": [
          { "dst": "0.0.0.0/0" }
        ],
        "gateway": "198.17.0.1"
      }
    }'
EOF
Note that eth1 is used as the master parameter.
This master parameter should match the interface name on the hosts in your cluster.
Verify the configuration
Type the following to verify the configuration you created:
kubectl get network-attachment-definitions
kubectl describe network-attachment-definitions ipvlan-conf
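Before deploying pods, it can also help to sanity-check that the rangeStart and rangeEnd addresses in the host-local ipam section fall inside the configured subnet. The following is only an illustrative sketch: the function names ip_to_int and range_in_subnet are hypothetical, and this is not a check that Multus or EKS Anywhere performs for you.

```shell
#!/usr/bin/env bash
# Sketch: check that an IPv4 address range lies inside a CIDR subnet.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local IFS=.
  # Split the address into its four octets.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeeds when start <= end and both addresses belong to the subnet.
range_in_subnet() {
  local subnet=$1 start=$2 end=$3
  local base=${subnet%/*} prefix=${subnet#*/}
  local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  local net=$(( $(ip_to_int "$base") & mask ))
  local s e
  s=$(ip_to_int "$start")
  e=$(ip_to_int "$end")
  [ "$s" -le "$e" ] &&
    [ $(( s & mask )) -eq "$net" ] &&
    [ $(( e & mask )) -eq "$net" ]
}

# Values taken from the example Network Attachment Definition above.
if range_in_subnet 198.17.0.0/24 198.17.0.200 198.17.0.216; then
  echo "range is inside the subnet"
else
  echo "range is outside the subnet"
fi
```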
Deploy sample applications with network attachment
- Create a sample application 1 (app1) with the network annotation created in the previous steps:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: app1
  annotations:
    k8s.v1.cni.cncf.io/networks: ipvlan-conf
spec:
  containers:
  - name: app1
    command: ["/bin/sh", "-c", "trap : TERM INT; sleep infinity & wait"]
    image: alpine
EOF
- Create a sample application 2 (app2) with the network annotation created in the previous steps:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: app2
  annotations:
    k8s.v1.cni.cncf.io/networks: ipvlan-conf
spec:
  containers:
  - name: app2
    command: ["/bin/sh", "-c", "trap : TERM INT; sleep infinity & wait"]
    image: alpine
EOF
- Verify that the additional interfaces were created on these application pods using the defined network attachment:
kubectl exec -it app1 -- ip a
Output:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: net1@if3: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 00:50:56:9a:84:3b brd ff:ff:ff:ff:ff:ff
    inet 198.17.0.200/24 brd 198.17.0.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::50:5600:19a:843b/64 scope link
       valid_lft forever preferred_lft forever
31: eth0@if32: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 0a:9e:a0:b4:21:05 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.218/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::89e:a0ff:feb4:2105/64 scope link
       valid_lft forever preferred_lft forever
kubectl exec -it app2 -- ip a
Output:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: net1@if3: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 00:50:56:9a:84:3b brd ff:ff:ff:ff:ff:ff
    inet 198.17.0.201/24 brd 198.17.0.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::50:5600:29a:843b/64 scope link
       valid_lft forever preferred_lft forever
33: eth0@if34: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether b2:42:0a:67:c0:48 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.210/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::b042:aff:fe67:c048/64 scope link
       valid_lft forever preferred_lft forever
Note that both pods got the new interface net1. Also, the additional network interface on each pod got assigned an IP address out of the range specified by the Network Attachment Definition.
- Test the network connectivity across these pods for Multus interfaces:
kubectl exec -it app1 -- ping -I net1 198.17.0.201
Output:
PING 198.17.0.201 (198.17.0.201): 56 data bytes
64 bytes from 198.17.0.201: seq=0 ttl=64 time=0.074 ms
64 bytes from 198.17.0.201: seq=1 ttl=64 time=0.077 ms
64 bytes from 198.17.0.201: seq=2 ttl=64 time=0.078 ms
64 bytes from 198.17.0.201: seq=3 ttl=64 time=0.077 ms
kubectl exec -it app2 -- ping -I net1 198.17.0.200
Output:
PING 198.17.0.200 (198.17.0.200): 56 data bytes
64 bytes from 198.17.0.200: seq=0 ttl=64 time=0.074 ms
64 bytes from 198.17.0.200: seq=1 ttl=64 time=0.077 ms
64 bytes from 198.17.0.200: seq=2 ttl=64 time=0.078 ms
64 bytes from 198.17.0.200: seq=3 ttl=64 time=0.077 ms
2.11 - Authenticate cluster with AWS IAM Authenticator
AWS IAM Authenticator Support (optional)
EKS Anywhere supports configuring AWS IAM Authenticator as an authentication provider for clusters.
When you create a cluster with IAM Authenticator enabled, EKS Anywhere:
- Installs the aws-iam-authenticator server as a DaemonSet on the workload cluster.
- Configures the Kubernetes API Server to communicate with IAM Authenticator using a token authentication webhook.
- Creates the necessary ConfigMaps based on user options.
Note
Enabling IAM Authenticator needs to be done during cluster creation.
Create IAM Authenticator enabled cluster
Generate your cluster configuration and add the necessary IAM Authenticator configuration. For a full spec reference check AWSIamConfig.
Create an EKS Anywhere cluster as follows:
CLUSTER_NAME=my-cluster-name
eksctl anywhere create cluster -f ${CLUSTER_NAME}.yaml
Example AWSIamConfig configuration
This example uses a region in the default aws partition and EKSConfigMap as backendMode. Also, the IAM ARNs are mapped to the Kubernetes system:masters group.
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: my-cluster-name
spec:
  ...
  # IAM Authenticator
  identityProviderRefs:
    - kind: AWSIamConfig
      name: aws-iam-auth-config
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: AWSIamConfig
metadata:
  name: aws-iam-auth-config
spec:
  awsRegion: us-west-1
  backendMode:
    - EKSConfigMap
  mapRoles:
    - roleARN: arn:aws:iam::XXXXXXXXXXXX:role/myRole
      username: myKubernetesUsername
      groups:
        - system:masters
  mapUsers:
    - userARN: arn:aws:iam::XXXXXXXXXXXX:user/myUser
      username: myKubernetesUsername
      groups:
        - system:masters
  partition: aws
Note
When using backend mode CRD, the mapRoles and mapUsers are not required. For more details on configuring CRD mode, refer to CRD.
Authenticating with IAM Authenticator
After your cluster is created you may now use the mapped IAM ARNs to authenticate to the cluster.
EKS Anywhere generates a KUBECONFIG file in your local directory that uses the aws-iam-authenticator client to authenticate with the cluster. The file can be found at
${PWD}/${CLUSTER_NAME}/${CLUSTER_NAME}-aws.kubeconfig
Steps
-
Ensure the IAM role/user ARN mapped in the cluster is configured on the local machine from which you are trying to access the cluster.
-
Install the aws-iam-authenticator client binary on the local machine.
- We recommend installing the binary referenced in the latest release manifest of the Kubernetes version used when creating the cluster.
- The commands below can be used to fetch the installation URI for clusters created with Kubernetes version 1.21 and OS linux.
CLUSTER_NAME=my-cluster-name
KUBERNETES_VERSION=1.21
export KUBECONFIG=${PWD}/${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig
EKS_D_MANIFEST_URL=$(kubectl get bundles $CLUSTER_NAME -o jsonpath="{.spec.versionsBundles[?(@.kubeVersion==\"$KUBERNETES_VERSION\")].eksD.manifestUrl}")
OS=linux
curl -fsSL $EKS_D_MANIFEST_URL | yq e '.status.components[] | select(.name=="aws-iam-authenticator") | .assets[] | select(.os == '"\"$OS\""' and .type == "Archive") | .archive.uri' -
-
Export the generated IAM Authenticator based KUBECONFIG file.
export KUBECONFIG=${PWD}/${CLUSTER_NAME}/${CLUSTER_NAME}-aws.kubeconfig
-
Run kubectl commands to check cluster access. For example:
kubectl get pods -A
Modify IAM Authenticator mappings
EKS Anywhere supports modifying IAM ARNs that are mapped on the cluster. The mappings can be modified by either running the upgrade cluster command or using GitOps.
upgrade command
The mapRoles and mapUsers lists in AWSIamConfig can be modified when running the upgrade cluster command from EKS Anywhere.
As an example, let’s add another IAM user to the above example configuration.
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: AWSIamConfig
metadata:
  name: aws-iam-auth-config
spec:
  ...
  mapUsers:
    - userARN: arn:aws:iam::XXXXXXXXXXXX:user/myUser
      username: myKubernetesUsername
      groups:
        - system:masters
    - userARN: arn:aws:iam::XXXXXXXXXXXX:user/anotherUser
      username: anotherKubernetesUsername
  partition: aws
and then run the upgrade command
CLUSTER_NAME=my-cluster-name
eksctl anywhere upgrade cluster -f ${CLUSTER_NAME}.yaml
EKS Anywhere now updates the role mappings for IAM authenticator in the cluster and a new user gains access to the cluster.
GitOps
If the cluster created has GitOps configured, then the mapRoles and mapUsers lists in AWSIamConfig can be modified by the GitOps controller. For GitOps configuration details refer to Manage Cluster with GitOps.
Note
GitOps support for the AWSIamConfig is currently available only on management or self-managed clusters.
- Clone your git repo and modify the cluster specification.
The default path for the cluster file is:
clusters/$CLUSTER_NAME/eksa-system/eksa-cluster.yaml
- Modify the AWSIamConfig object and add to the mapRoles and mapUsers object lists.
- Commit the file to your git repository
git add eksa-cluster.yaml
git commit -m 'Adding IAM Authenticator access ARNs'
git push origin main
The EKS Anywhere GitOps Controller now updates the role mappings for IAM Authenticator in the cluster and the added users gain access to the cluster.
2.12 - Manage cluster with GitOps
NOTE: GitOps support is available for vSphere clusters, but is not yet available for Bare Metal clusters
GitOps Support (optional)
EKS Anywhere supports a GitOps workflow for the management of your cluster.
When you create a cluster with GitOps enabled, EKS Anywhere will automatically commit your cluster configuration to the provided GitHub repository and install a GitOps toolkit on your cluster which watches that committed configuration file. You can then manage the scale of the cluster by making changes to the version controlled cluster configuration file and committing the changes. Once a change has been detected by the GitOps controller running in your cluster, the scale of the cluster will be adjusted to match the committed configuration file.
If you’d like to learn more about GitOps, and the associated best practices, check out this introduction from Weaveworks.
NOTE: Installing a GitOps controller can be done during cluster creation or through upgrade. In the event that GitOps installation fails, EKS Anywhere cluster creation will continue.
Supported Cluster Properties
Currently, you can manage a subset of cluster properties with GitOps:
Management Cluster
Cluster:
- workerNodeGroupConfigurations.count
- workerNodeGroupConfigurations.machineGroupRef.name
WorkerNodes VSphereMachineConfig:
- datastore
- diskGiB
- folder
- memoryMiB
- numCPUs
- resourcePool
- template
- users
Workload Cluster
Cluster:
- kubernetesVersion
- controlPlaneConfiguration.count
- controlPlaneConfiguration.machineGroupRef.name
- workerNodeGroupConfigurations.count
- workerNodeGroupConfigurations.machineGroupRef.name
- identityProviderRefs (only for kind:OIDCConfig; kind:AWSIamConfig is immutable)
ControlPlane / Etcd / WorkerNodes VSphereMachineConfig:
- datastore
- diskGiB
- folder
- memoryMiB
- numCPUs
- resourcePool
- template
- users
OIDCConfig:
- clientID
- groupsClaim
- groupsPrefix
- issuerUrl
- requiredClaims.claim
- requiredClaims.value
- usernameClaim
- usernamePrefix
Any other changes to the cluster configuration in the git repository will be ignored. If an immutable field has been changed in a Git repository, there are two ways to find the error message:
- If a notification webhook is set up, check the error message in the notification channel.
- Check the Flux Kustomization Controller log:
kubectl logs -f -n flux-system kustomize-controller-******
for an error message containing text similar to Invalid value: 1: field is immutable
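When the controller log is long, filtering it for the immutable-field message can save time. The following is a minimal sketch, not part of EKS Anywhere: immutable_field_errors is a hypothetical helper name, and it assumes the rejection text contains the phrase "field is immutable" as in the example above.

```shell
# Print only the log lines that report an immutable-field rejection.
# Reads log text on stdin, so it can be fed from `kubectl logs` or a saved file.
immutable_field_errors() {
  grep -F 'field is immutable'
}

# Demo with two sample lines; only the second one is printed.
printf '%s\n' \
  'reconciliation finished' \
  'Invalid value: 1: field is immutable' \
  | immutable_field_errors
```

Against a live cluster, the same filter could be placed at the end of the kubectl logs command shown above.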
Getting Started with EKS Anywhere GitOps with Github
In order to use GitOps to manage cluster scaling, you need a couple of things:
- A GitHub account
- A cluster configuration file with a GitOpsConfig, referenced with a gitOpsRef in your Cluster spec
- A Personal Access Token (PAT) for the GitHub account, with permissions to create, clone, and push to a repo
Create a GitHub Personal Access Token
Create a Personal Access Token (PAT) to access your provided GitHub repository. It must be scoped for all repo permissions.
NOTE: GitOps configuration only works with hosted github.com and will not work on self-hosted GitHub Enterprise instances.
This PAT should have at least the following permissions:
NOTE: The PAT must belong to the owner of the repository or, if using an organization as the owner, the creator of the PAT must have repo permission in that organization.
You need to set your PAT as the environment variable $EKSA_GITHUB_TOKEN to use it during cluster creation:
export EKSA_GITHUB_TOKEN=ghp_MyValidPersonalAccessTokenWithRepoPermissions
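A small guard can catch a missing token before cluster creation starts. This is a minimal sketch under the assumption that you wrap the create command in your own script; check_github_token is a hypothetical helper name, not an EKS Anywhere command.

```shell
# Fail early, with a message on stderr, if the GitHub token is unset or empty.
check_github_token() {
  if [ -z "${EKSA_GITHUB_TOKEN:-}" ]; then
    echo "error: EKSA_GITHUB_TOKEN is not set" >&2
    return 1
  fi
}

# Intended use in a wrapper script (left commented so the sketch has no side effects):
# check_github_token && eksctl anywhere create cluster -f "${CLUSTER_NAME}.yaml"
```

The check only confirms that the variable is non-empty; it does not validate the token's scopes against GitHub.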
Create GitOps configuration repo
If you have an existing repo you can set that as your repository name in the configuration.
If you specify a repo in your FluxConfig which does not exist, EKS Anywhere will create it for you.
If you would like to create a new repo you can click here to create a new repo.
If your repository contains multiple cluster specification files, store them in sub-folders and specify the configuration path in your cluster specification.
In order to accommodate the management cluster feature, the CLI will now structure the repo directory following a new convention:
clusters
└── management-cluster
    ├── flux-system
    │   └── ...
    ├── management-cluster
    │   └── eksa-system
    │       ├── eksa-cluster.yaml
    │       └── kustomization.yaml
    ├── workload-cluster-1
    │   └── eksa-system
    │       └── eksa-cluster.yaml
    └── workload-cluster-2
        └── eksa-system
            └── eksa-cluster.yaml
By default, Flux kustomization reconciles at the management cluster’s root level (./clusters/management-cluster), so both the management cluster and all the workload clusters it manages are synced.
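To make the convention concrete, the path of any cluster specification can be derived from the management cluster name and the cluster name. The helper below is an illustrative sketch; cluster_spec_path is a hypothetical name, and it assumes the repository uses the default layout shown above with no custom configuration path.

```shell
# Build the repository path of a cluster spec under the layout shown above.
# $1: management cluster name
# $2: cluster name (the same as $1 for the management cluster itself)
cluster_spec_path() {
  printf 'clusters/%s/%s/eksa-system/eksa-cluster.yaml\n' "$1" "$2"
}

cluster_spec_path management-cluster workload-cluster-1
# prints: clusters/management-cluster/workload-cluster-1/eksa-system/eksa-cluster.yaml
```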
Example GitOps cluster configuration for Github
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: mynewgitopscluster
spec:
  ... # collapsed cluster spec fields
  # Below added for gitops support
  gitOpsRef:
    kind: FluxConfig
    name: my-cluster-name
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: FluxConfig
metadata:
  name: my-cluster-name
spec:
  github:
    personal: true
    repository: mygithubrepository
    owner: mygithubusername
Create a GitOps enabled cluster
Generate your cluster configuration and add the GitOps configuration. For a full spec reference see the Cluster Spec reference.
NOTE: After your cluster has been created the cluster configuration will automatically be committed to your git repo.
-
Create an EKS Anywhere cluster with GitOps enabled.
CLUSTER_NAME=gitops
eksctl anywhere create cluster -f ${CLUSTER_NAME}.yaml
Enable GitOps in an existing cluster
You can also install Flux and enable GitOps in an existing cluster by running the upgrade command with an updated cluster configuration. For a full spec reference see the Cluster Spec reference.
-
Upgrade an EKS Anywhere cluster with GitOps enabled.
CLUSTER_NAME=gitops
eksctl anywhere upgrade cluster -f ${CLUSTER_NAME}.yaml
Test GitOps controller
After your cluster has been created, you can test the GitOps controller by modifying the cluster specification.
-
Clone your git repo and modify the cluster specification. The default path for the cluster file is:
clusters/$CLUSTER_NAME/eksa-system/eksa-cluster.yaml
-
Modify the workerNodeGroupConfigurations[0].count field with your desired changes.
-
Commit the file to your git repository
git add eksa-cluster.yaml
git commit -m 'Scaling nodes for test'
git push origin main
-
The Flux controller will automatically make the required changes.
If you updated your node count, you can use this command to see the current node state.
kubectl get nodes
Getting Started with EKS Anywhere GitOps with any Git source
You can configure EKS Anywhere to use a generic git repository as the source of truth for GitOps by providing a FluxConfig with a git configuration.
EKS Anywhere requires a valid SSH Known Hosts file and SSH Private key in order to connect to your repository and bootstrap Flux.
Create a Git repository for use by EKS Anywhere and Flux
When using the git provider, EKS Anywhere requires that the configuration repository be pre-initialized.
You may re-use an existing repo or use the same repo for multiple management clusters.
Create the repository through your git provider and initialize it with a README.md documenting the purpose of the repository.
Create a Private Key for use by EKS Anywhere and Flux
EKS Anywhere requires a private key to authenticate to your git repository, push the cluster configuration, and configure Flux for ongoing management and monitoring of that configuration. The private key should have permissions to read and write from the repository in question.
It is recommended that you create a new private key for use exclusively by EKS Anywhere.
You can use ssh-keygen to generate a new key.
ssh-keygen -t ecdsa -C "my_email@example.com"
Please consult the documentation for your git provider to determine how to add your corresponding public key; for example, if using GitHub Enterprise, you can find the documentation for adding a public key to your GitHub account here.
Add your private key to your SSH agent on your management machine
When using a generic git provider, EKS Anywhere requires that your management machine has a running SSH agent and the private key be added to that SSH agent.
You can start an SSH agent and add your private key by executing the following in your current session:
eval "$(ssh-agent -s)" && ssh-add $EKSA_GIT_PRIVATE_KEY
Create an SSH Known Hosts file for use by EKS Anywhere and Flux
EKS Anywhere needs an SSH known hosts file to verify the identity of the remote git host.
A path to a valid known hosts file must be provided to the EKS Anywhere command line via the environment variable EKSA_GIT_KNOWN_HOSTS.
For example, if you have a known hosts file at /home/myUser/.ssh/known_hosts that you want EKS Anywhere to use, set the environment variable EKSA_GIT_KNOWN_HOSTS to the path to that file, /home/myUser/.ssh/known_hosts.
export EKSA_GIT_KNOWN_HOSTS=/home/myUser/.ssh/known_hosts
While you can use your pre-existing SSH known hosts file, it is recommended that you generate a new known hosts file for use by EKS Anywhere that contains only the known-hosts entries required for your git host and key type.
For example, if you wanted to generate a known hosts file for a git server located at example.com with key type ecdsa, you can use the OpenSSH utility ssh-keyscan:
ssh-keyscan -t ecdsa example.com >> my_eksa_known_hosts
This will generate a known hosts file which contains only the entry necessary to verify the identity of example.com when using an ecdsa based private key file.
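Before pointing EKSA_GIT_KNOWN_HOSTS at the generated file, it can be worth confirming that it really holds an entry for your host and key type. The check below is a rough sketch: has_known_host is a hypothetical helper, and it assumes plain, unhashed entries of the form "host keytype key" with a single host name per line, which is what ssh-keyscan writes when run without hashing.

```shell
# Succeed when FILE has an entry for HOST whose key type starts with TYPE
# (for example "ecdsa" matches "ecdsa-sha2-nistp256").
# $1: known hosts file, $2: host name, $3: key type prefix
has_known_host() {
  awk -v h="$2" -v t="$3" \
    '$1 == h && index($2, t) == 1 { found = 1 } END { exit (found ? 0 : 1) }' "$1"
}

# Demo with a throwaway file; the key material is a truncated placeholder.
demo_file=$(mktemp)
printf '%s\n' 'example.com ecdsa-sha2-nistp256 AAAAE2VjZHNh' > "$demo_file"
if has_known_host "$demo_file" example.com ecdsa; then
  echo "example.com ecdsa entry present"
fi
rm -f "$demo_file"
```

Hashed host names or comma-separated host lists would not be matched by this simple comparison.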
Example FluxConfig cluster configuration for a generic git provider
For a full spec reference see the Cluster Spec reference.
NOTE: The repositoryUrl value is of the format ssh://git@provider.com/$REPO_OWNER/$REPO_NAME.git. This may differ from the default SSH URL given by your provider. For example, the github.com user interface provides an SSH URL containing a : before the repository owner, rather than a /. Make sure to replace this : with a /, if present.
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: mynewgitopscluster
spec:
  ... # collapsed cluster spec fields
  # Below added for gitops support
  gitOpsRef:
    kind: FluxConfig
    name: my-cluster-name
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: FluxConfig
metadata:
  name: my-cluster-name
spec:
  git:
    repositoryUrl: ssh://git@provider.com/myAccount/myClusterGitopsRepo.git
    sshKeyAlgorithm: ecdsa
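The URL rewrite described in the note on repositoryUrl can be scripted. The function below is a minimal sketch; to_ssh_url is a hypothetical helper name, and it assumes the input is either already in ssh:// form or an scp-style URL such as git@github.com:owner/repo.git with no port number.

```shell
# Convert an scp-style git URL (git@host:owner/repo.git) to the ssh:// form.
# URLs that already start with ssh:// are printed unchanged.
to_ssh_url() {
  case "$1" in
    ssh://*)
      printf '%s\n' "$1"
      ;;
    *)
      host="${1%%:*}"   # everything before the first colon, e.g. git@github.com
      path="${1#*:}"    # everything after the first colon, e.g. owner/repo.git
      printf 'ssh://%s/%s\n' "$host" "$path"
      ;;
  esac
}

to_ssh_url git@github.com:myAccount/myClusterGitopsRepo.git
# prints: ssh://git@github.com/myAccount/myClusterGitopsRepo.git
```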
Manage separate workload clusters using Gitops
Follow these steps if you want to use your initial cluster to create and manage separate workload clusters via Gitops.
Prerequisites
-
An existing EKS Anywhere cluster with GitOps enabled. If your existing cluster does not have GitOps installed, see Enable GitOps in an existing cluster.
-
A cluster configuration file for your new workload cluster.
Create cluster using Gitops
-
Clone your git repo and add the new cluster specification. Be sure to follow the directory structure defined here:
clusters/<management-cluster-name>/$CLUSTER_NAME/eksa-system/eksa-cluster.yaml
NOTE: Specify the namespace for all EKS Anywhere objects when you are using GitOps to create new workload clusters (even for the default namespace, use namespace: default on those objects).
Ensure workload cluster object names are distinct from management cluster object names. Be sure to set the managementCluster field to identify the name of the management cluster.
Make sure there is a kustomization.yaml file under the namespace directory for the management cluster. Creating a GitOps enabled management cluster with eksctl should create the kustomization.yaml file automatically.
-
Commit the file to your git repository.
git add clusters/<management-cluster-name>/$CLUSTER_NAME/eksa-system/eksa-cluster.yaml
git commit -m 'Creating new workload cluster'
git push origin main
-
The Flux controller will automatically make the required changes. You can list the workload clusters managed by the management cluster.
export KUBECONFIG=${PWD}/${MGMT_CLUSTER_NAME}/${MGMT_CLUSTER_NAME}-eks-a-cluster.kubeconfig
kubectl get clusters
-
The kubeconfig for your new cluster is stored as a secret on the management cluster. You can get credentials and run the test application on your new workload cluster as follows:
kubectl get secret -n eksa-system w01-kubeconfig -o jsonpath='{.data.value}' | base64 --decode > w01.kubeconfig
export KUBECONFIG=w01.kubeconfig
kubectl apply -f "https://anywhere.eks.amazonaws.com/manifests/hello-eks-a.yaml"
Upgrade cluster using Gitops
-
To upgrade the cluster using Gitops, modify the workload cluster yaml file with the desired changes.
-
Commit the file to your git repository.
git add eksa-cluster.yaml
git commit -m 'Scaling nodes on new workload cluster'
git push origin main
Delete cluster using Gitops
- To delete the cluster using Gitops, delete the workload cluster yaml file from your repository and commit those changes.
git rm eksa-cluster.yaml
git commit -m 'Deleting workload cluster'
git push origin main
2.13 - Manage cluster with Terraform
NOTE: Support for using Terraform to manage and modify an EKS Anywhere cluster is available for vSphere, Snow and Nutanix clusters, but not yet for Bare Metal or CloudStack clusters.
Using Terraform to manage an EKS Anywhere Cluster (Optional)
This guide explains how you can use Terraform to manage and modify an EKS Anywhere cluster. The guide is meant for illustrative purposes and is not a definitive approach to building production systems with Terraform and EKS Anywhere.
At its heart, EKS Anywhere is a set of Kubernetes CRDs, which define an EKS Anywhere cluster,
and a controller, which moves the cluster state to match these definitions.
These CRDs, and the EKS-A controller, live on the management cluster or
on a self-managed cluster.
We can manage a subset of the fields in the EKS Anywhere CRDs with any tool that can interact with the Kubernetes API, like kubectl or, in this case, the Terraform Kubernetes provider.
In this guide, we’ll show you how to import your EKS Anywhere cluster into Terraform state and how to scale your EKS Anywhere worker nodes using the Terraform Kubernetes provider.
Prerequisites
-
An existing EKS Anywhere cluster
-
The latest version of Terraform
-
The latest version of tfk8s, a tool for converting Kubernetes manifest files to Terraform HCL
Guide
- Create an EKS-A management cluster, or a self-managed stand-alone cluster.
- if you already have an existing EKS-A cluster, skip this step.
- if you don’t already have an existing EKS-A cluster, follow the official instructions to create one
-
Set up the Terraform Kubernetes provider
Make sure your KUBECONFIG environment variable is set:
export KUBECONFIG=/path/to/my/kubeconfig.kubeconfig
Set an environment variable with your cluster name:
export MY_EKSA_CLUSTER="myClusterName"
cat << EOF > ./provider.tf
provider "kubernetes" {
  config_path = "${KUBECONFIG}"
}
EOF
-
Get tfk8s and use it to convert your EKS Anywhere cluster Kubernetes manifest into Terraform HCL:
- Install tfk8s
- Convert the manifest into Terraform HCL:
kubectl get cluster ${MY_EKSA_CLUSTER} -o yaml | tfk8s --strip -o ${MY_EKSA_CLUSTER}.tf
-
Configure the Terraform cluster resource definition generated in the previous step
- Set metadata.generation as a computed field. Add the following to your cluster resource configuration:
computed_fields = ["metadata.generation"]
- Configure the field manager to force reconcile managed resources. Add the following configuration block to your cluster resource:
field_manager {
  force_conflicts = true
}
- Add the namespace to the metadata of the cluster
- Remove the generation field from the metadata of the cluster
- Your Terraform cluster resource should look similar to this:
computed_fields = ["metadata.generation"]
field_manager {
  force_conflicts = true
}
manifest = {
  "apiVersion" = "anywhere.eks.amazonaws.com/v1alpha1"
  "kind" = "Cluster"
  "metadata" = {
    "name" = "MyClusterName"
    "namespace" = "default"
  }
-
Import your EKS Anywhere cluster into terraform state:
terraform init
terraform import kubernetes_manifest.cluster_${MY_EKSA_CLUSTER} "apiVersion=anywhere.eks.amazonaws.com/v1alpha1,kind=Cluster,namespace=default,name=${MY_EKSA_CLUSTER}"
After you import your cluster, you will need to run terraform apply one time to ensure that the manifest field of your cluster resource is in-sync. This will not change the state of your cluster, but is a required step after the initial import. The manifest field stores the contents of the associated kubernetes manifest, while the object field stores the actual state of the resource.
-
Modify Your Cluster using Terraform
- Modify the count value of one of your workerNodeGroupConfigurations, or another mutable field, in the configuration stored in the ${MY_EKSA_CLUSTER}.tf file.
- Check the expected diff between your cluster state and the modified local state via
terraform plan
You should see in the output that the worker node group configuration count field (or whichever field you chose to modify) will be modified by Terraform.
-
Now, actually change your cluster to match the local configuration:
terraform apply
-
Observe the change to your cluster. For example:
kubectl get nodes
Manage separate workload clusters using Terraform
Follow these steps if you want to use your initial cluster to create and manage separate workload clusters via Terraform.
NOTE: If you choose to manage your cluster using Terraform, do not use kubectl to edit your cluster objects as this can lead to field manager conflicts.
Prerequisites
- An existing EKS Anywhere cluster imported into Terraform state. If your existing cluster is not yet imported, see this guide.
- A cluster configuration file for your new workload cluster.
Create cluster using Terraform
-
Create the new cluster configuration Terraform file.
tfk8s -f new-workload-cluster.yaml -o new-workload-cluster.tf
NOTE: Specify the namespace for all EKS Anywhere objects when you are using Terraform to manage your clusters (even for the default namespace, use "namespace" = "default" on those objects).
Ensure workload cluster object names are distinct from management cluster object names. Be sure to set the managementCluster field to identify the name of the management cluster.
-
Ensure that this new Terraform workload cluster configuration exists in the same directory as the management cluster Terraform files.
my/terraform/config/path
├── management-cluster.tf
├── new-workload-cluster.tf
├── provider.tf
└── ...
-
Verify the changes to be applied:
terraform plan
-
If the plan looks as expected, apply those changes to create the new cluster resources:
terraform apply
-
You can list the workload clusters managed by the management cluster.
export KUBECONFIG=${PWD}/${MGMT_CLUSTER_NAME}/${MGMT_CLUSTER_NAME}-eks-a-cluster.kubeconfig
kubectl get clusters
-
The kubeconfig for your new cluster is stored as a secret on the management cluster. You can get the workload cluster credentials and run the test application on your new workload cluster as follows:
kubectl get secret -n eksa-system w01-kubeconfig -o jsonpath='{.data.value}' | base64 --decode > w01.kubeconfig
export KUBECONFIG=w01.kubeconfig
kubectl apply -f "https://anywhere.eks.amazonaws.com/manifests/hello-eks-a.yaml"
Upgrade cluster using Terraform
- To upgrade a workload cluster using Terraform, modify the desired fields in the Terraform resource file and apply the changes.
terraform apply
Delete cluster using Terraform
- To delete a workload cluster using Terraform, you will need the name of the Terraform cluster resource.
This can be found on the first line of your cluster resource definition.
terraform destroy --target kubernetes_manifest.cluster_w01
Appendix
Terraform K8s Provider https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs
2.14 - Delete cluster
NOTE: EKS Anywhere Bare Metal clusters do not yet support separate workload and management clusters. Use the instructions for Deleting a management cluster to delete a Bare Metal cluster.
Deleting a workload cluster
Follow these steps to delete your EKS Anywhere cluster that is managed by a separate management cluster.
To delete a workload cluster, you will need:
- name of your workload cluster
- kubeconfig of your workload cluster
- kubeconfig of your management cluster
Run the following commands to delete the cluster:
-
Set up CLUSTER_NAME and KUBECONFIG environment variables:
export CLUSTER_NAME=eksa-w01-cluster
export KUBECONFIG=${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig
export MANAGEMENT_KUBECONFIG=<path-to-management-cluster-kubeconfig>
-
Run the delete command:
-
If you are running the delete command from the directory which has the cluster folder with ${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.yaml:
eksctl anywhere delete cluster ${CLUSTER_NAME} --kubeconfig ${MANAGEMENT_KUBECONFIG}
Deleting a management cluster
Follow these steps to delete your management cluster.
To delete a cluster you will need:
- cluster name or cluster configuration
- kubeconfig of your cluster
Run the following commands to delete the cluster:
-
Set up CLUSTER_NAME and KUBECONFIG environment variables:
export CLUSTER_NAME=mgmt
export KUBECONFIG=${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig
-
Run the delete command:
-
If you are running the delete command from the directory which has the cluster folder with ${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.yaml:
eksctl anywhere delete cluster ${CLUSTER_NAME}
-
Otherwise, use this command to manually specify the cluster config file path:
export CONFIG_FILE=<path-to-config-file>
eksctl anywhere delete cluster -f ${CONFIG_FILE}
Example output:
Performing provider setup and validations
Creating management cluster
Installing cluster-api providers on management cluster
Moving cluster management from workload cluster
Deleting workload cluster
Clean up Git Repo
GitOps field not specified, clean up git repo skipped
🎉 Cluster deleted!
For vSphere, CloudStack, and Nutanix, this will delete all of the VMs that were created in your provider. For Bare Metal, the servers will be powered off if BMC information has been provided. If your workloads created external resources such as external DNS entries or load balancer endpoints you may need to delete those resources manually.
3 - Cluster troubleshooting
3.1 - Troubleshooting
This guide covers EKS Anywhere troubleshooting. It is divided into the following sections:
- General troubleshooting
- Bare Metal Troubleshooting
- vSphere Troubleshooting
- Snow Troubleshooting
- Nutanix Troubleshooting
You may want to search this document for a fragment of the error you are seeing.
General troubleshooting
Increase eksctl anywhere output
If you’re having trouble running eksctl anywhere you may get more verbose output with the -v 6 option. The highest level of verbosity is -v 9 and the default level of logging is equivalent to -v 0.
Cannot run docker commands
The EKS Anywhere binary requires access to run docker commands without using sudo.
If you’re using a Linux distribution you will need to be using Docker 20.x.x and your user needs to be part of the docker group.
To add your user to the docker group you can use:
sudo usermod -a -G docker $USER
Now you need to log out and back in to get the new group permissions.
Minimum requirements for docker version have not been met
Error: failed to validate docker: minimum requirements for docker version have not been met. Install Docker version 20.x.x or above
Ensure you are running Docker 20.x.x or above, for example:
% docker --version
Docker version 20.10.6, build 370c289
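The version requirement can also be checked in a script. The following is a minimal sketch rather than the validation EKS Anywhere itself performs: docker_major_ok is a hypothetical helper that reads the output of docker --version on stdin and assumes it begins with "Docker version MAJOR.MINOR", as in the example above.

```shell
# Extract the major version from `docker --version` output on stdin and
# succeed only when it is at least 20.
docker_major_ok() {
  major=$(sed -n 's/^Docker version \([0-9][0-9]*\)\..*/\1/p')
  [ -n "$major" ] && [ "$major" -ge 20 ]
}

# Demo with the sample output from this section.
echo 'Docker version 20.10.6, build 370c289' | docker_major_ok && echo "docker version ok"
```

On a real machine the function would be fed with docker --version piped into docker_major_ok.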
Minimum requirements for docker version have not been met on Mac OS
Error: EKS Anywhere does not support Docker desktop versions between 4.3.0 and 4.4.1 on macOS
Error: EKS Anywhere requires Docker desktop to be configured to use CGroups v1. Please set `deprecatedCgroupv1:true` in your `~/Library/Group\\ Containers/group.com.docker/settings.json` file
Ensure you are running Docker Desktop 4.4.2 or newer and have set "deprecatedCgroupv1": true in your settings.json file:
% defaults read /Applications/Docker.app/Contents/Info.plist CFBundleShortVersionString
4.4.2
% docker info --format '{{json .CgroupVersion}}'
"1"
cgroups v2 is not supported in Ubuntu 21.10+ and 22.04
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"
It is recommended to use Ubuntu 20.04 for the Administrative Machine. This is because the EKS Anywhere Bootstrap cluster requires cgroups v1. Since Ubuntu 21.10 cgroups v2 is enabled by default. You can use Ubuntu 21.10 and 22.04 for the Administrative machine if you configure Ubuntu to use cgroups v1 instead.
To verify cgroups version
% docker info | grep Cgroup
Cgroup Driver: cgroupfs
Cgroup Version: 2
To use cgroups v1 you need to use sudo to edit /etc/default/grub, set GRUB_CMDLINE_LINUX to "systemd.unified_cgroup_hierarchy=0", and reboot.
% sudo <editor> /etc/default/grub
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"
sudo update-grub
sudo reboot now
Then verify you are using cgroups v1.
% docker info | grep Cgroup
Cgroup Driver: cgroupfs
Cgroup Version: 1
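To use the same check in a script, the version number can be pulled out of the docker info output. This is an illustrative sketch; cgroup_version is a hypothetical helper that reads docker info output on stdin and assumes the "Cgroup Version:" label shown above.

```shell
# Read `docker info` output on stdin and print the value of "Cgroup Version".
# Leading spaces are tolerated because docker info indents its fields.
cgroup_version() {
  sed -n 's/^ *Cgroup Version: *//p'
}

# Demo with the sample output from this section.
printf '%s\n' 'Cgroup Driver: cgroupfs' 'Cgroup Version: 2' | cgroup_version
# prints: 2
```

A wrapper could then compare the printed value with 1 before attempting cluster creation.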
ECR access denied
Error: failed to create cluster: unable to initialize executables: failed to setup eks-a dependencies: Error response from daemon: pull access denied for public.ecr.aws/***/cli-tools, repository does not exist or may require 'docker login': denied: Your authorization token has expired. Reauthenticate and try again.
All images needed for EKS Anywhere are public and do not need authentication. Old cached credentials could trigger this error. Remove cached credentials by running:
docker logout public.ecr.aws
error unmarshaling JSON: while decoding JSON: json: unknown field “spec”
Error: loading config file "cluster.yaml": error unmarshaling JSON: while decoding JSON: json: unknown field "spec"
Use eksctl anywhere create cluster -f cluster.yaml instead of eksctl create cluster -f cluster.yaml to create an EKS Anywhere cluster.
Error: old cluster config file exists under my-cluster, please use a different clusterName to proceed
Error: old cluster config file exists under my-cluster, please use a different clusterName to proceed
The my-cluster directory already exists in the current directory. Either use a different cluster name or move the directory.
failed to create cluster: node(s) already exist for a cluster with the name
Performing provider setup and validations
Creating new bootstrap cluster
Error create bootstrapcluster {"error": "error creating bootstrap cluster: error executing create cluster: ERROR: failed to create cluster: node(s) already exist for a cluster with the name \"cluster-name\"\n, try rerunning with --force-cleanup to force delete previously created bootstrap cluster"}
Failed to create cluster {"error": "error creating bootstrap cluster: error executing create cluster: ERROR: failed to create cluster: node(s) already exist for a cluster with the name \"cluster-name\"\n, try rerunning with --force-cleanup to force delete previously created bootstrap cluster"}
A bootstrap cluster already exists with the same name. If you are sure the cluster is not being used, you may use the --force-cleanup option to eksctl anywhere to delete the cluster, or you may delete the cluster with kind delete cluster --name <cluster-name>. If you do not have kind installed, you may use docker stop to stop the Docker container running the KinD cluster.
Memory or disk resource problem
Various disk and memory issues can cause problems. Make sure the system-wide Docker memory configuration provides enough RAM for the bootstrap cluster.
Make sure you do not have unneeded KinD clusters running. List them with kind get clusters.
You may want to delete unneeded clusters with kind delete cluster --name <cluster-name>.
If you do not have kind installed, you may install it from https://kind.sigs.k8s.io/ or use docker ps to see the KinD clusters and docker stop to stop the cluster.
Make sure you do not have any unneeded Docker containers running with docker ps.
Terminate any unneeded Docker containers.
Make sure Docker isn’t out of disk resources.
If you don’t have any other Docker containers running, you may want to run docker system prune to clean up disk space.
You may want to restart Docker.
To restart Docker on Ubuntu, run sudo systemctl restart docker.
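The memory check can be scripted. The following is a minimal sketch, not part of EKS Anywhere: has_enough_memory is a hypothetical helper name, and the threshold is whatever you decide the bootstrap cluster needs (the source does not state a number). It only compares a byte count against a minimum expressed in GiB.

```shell
# has_enough_memory BYTES MIN_GIB
# Succeeds when BYTES is at least MIN_GIB gibibytes.
# Hypothetical helper: the threshold is your choice, not an EKS Anywhere value.
has_enough_memory() {
  bytes=$1
  min_gib=$2
  # 1 GiB = 1073741824 bytes; integer arithmetic only.
  [ "$bytes" -ge $((min_gib * 1073741824)) ]
}
```

For example, has_enough_memory "$(docker info --format '{{.MemTotal}}')" 8 || echo "Docker has less than 8 GiB" compares the total memory Docker reports (in bytes) against an 8 GiB threshold that is chosen here purely for illustration.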
Waiting for cert-manager to be available… Error: timed out waiting for the condition
Failed to create cluster {"error": "error initializing capi resources in cluster: error executing init: Fetching providers\nInstalling cert-manager Version=\"v1.1.0\"\nWaiting for cert-manager to be available...\nError: timed out waiting for the condition\n"}
This is likely a Memory or disk resource problem. You can also try using techniques from Generic cluster unavailable.
NTP Time sync issues
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/watchers/endpoint_slice.go:91: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: Unauthorized" subsys=k8s
You might notice authorization errors if the timestamps on your EKS Anywhere control plane nodes and worker nodes are out of sync. Ensure that all nodes are configured with the same healthy NTP servers to avoid out-of-sync issues.
Error running bootstrapper cmd: error joining as worker: Error waiting for worker join files: Kubeadm join kubelet-start killed after timeout
Nodes can also fail to join if the clock on your admin machine differs from the clocks on your nodes. Check that the time matches between the admin machine and the nodes as well.
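One way to compare clocks is to collect a Unix timestamp on each machine and look at the difference. The sketch below is an illustration only: clock_skew_seconds is a hypothetical helper, and how much skew is tolerable is not stated in this guide.

```shell
# clock_skew_seconds EPOCH_A EPOCH_B
# Prints the absolute difference between two Unix timestamps, in seconds.
clock_skew_seconds() {
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))
  fi
  echo "$diff"
}
```

As a usage example, assuming you can reach a node over ssh, clock_skew_seconds "$(date +%s)" "$(ssh -i <ssh-private-key> <ssh-username>@<node-IP> date +%s)" prints the approximate skew between the admin machine and that node. The ssh round trip adds a small delay, so treat the result as a rough indicator.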
The connection to the server localhost:8080 was refused
Performing provider setup and validations
Creating new bootstrap cluster
Installing cluster-api providers on bootstrap cluster
Error initializing capi in bootstrap cluster {"error": "error waiting for capi-kubeadm-control-plane-controller-manager in namespace capi-kubeadm-control-plane-system: error executing wait: The connection to the server localhost:8080 was refused - did you specify the right host or port?\n"}
Failed to create cluster {"error": "error waiting for capi-kubeadm-control-plane-controller-manager in namespace capi-kubeadm-control-plane-system: error executing wait: The connection to the server localhost:8080 was refused - did you specify the right host or port?\n"}
This is likely a Memory or disk resource problem.
Generic cluster unavailable
Troubleshoot further by inspecting the bootstrap cluster or the workload cluster (depending on the stage of failure) using kubectl commands.
kubectl get pods -A --kubeconfig=<kubeconfig>
kubectl get nodes -A --kubeconfig=<kubeconfig>
kubectl logs <podname> -n <namespace> --kubeconfig=<kubeconfig>
....
CAPV troubleshooting guide: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/master/docs/troubleshooting.md#debugging-issues
Bootstrap cluster fails to come up
If your bootstrap cluster has problems, you may get detailed logs by looking at the files created under the ${CLUSTER_NAME}/logs folder. The capv-controller-manager log file will surface issues with vSphere-specific configuration, while the capi-controller-manager log file might surface other generic issues with the cluster configuration passed in.
You may also access the logs from your bootstrap cluster directly as below:
export KUBECONFIG=${PWD}/${CLUSTER_NAME}/generated/${CLUSTER_NAME}.kind.kubeconfig
kubectl logs -f -n capv-system -l control-plane="controller-manager" -c manager
It also might be useful to start a shell session on the Docker container running the bootstrap cluster. Run docker ps to find the kind container, and then run docker exec -it <container-id> bash.
Bootstrap cluster fails to come up
Error: creating bootstrap cluster: executing create cluster: ERROR: failed to create cluster: node(s) already exist for a cluster with the name \"cluster-name\", try rerunning with --force-cleanup to force delete previously created bootstrap cluster
Cluster creation fails because a cluster of the same name already exists.
Try running eksctl anywhere create cluster again, adding the --force-cleanup option.
If that doesn’t work, you can manually delete the old cluster:
kind delete cluster --name cluster-name
Cluster upgrade fails with management cluster on bootstrap cluster
If a cluster upgrade of a management (or self managed) cluster fails or is halted in the middle, you may be left in a state where the management resources (CAPI) are still on the KinD bootstrap cluster on the Admin machine. Right now, you will have to manually move the management resources from the KinD cluster back to the management cluster.
First create a backup:
CLUSTER_NAME=squid
KINDKUBE=${CLUSTER_NAME}/generated/${CLUSTER_NAME}.kind.kubeconfig
MGMTKUBE=${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig
DIRECTORY=backup
# Substitute the version with whatever version you are using
CONTAINER=public.ecr.aws/eks-anywhere/cli-tools:v0.12.0-eks-a-19
rm -rf ${DIRECTORY}
mkdir ${DIRECTORY}
docker run -i --network host -w $(pwd) -v /var/run/docker.sock:/var/run/docker.sock -v $(pwd):/$(pwd) --entrypoint clusterctl ${CONTAINER} move \
--namespace eksa-system \
--kubeconfig $KINDKUBE \
--to-directory ${DIRECTORY}
# After the backup, move the management resources back to the management cluster
docker run -i --network host -w $(pwd) -v /var/run/docker.sock:/var/run/docker.sock -v $(pwd):/$(pwd) --entrypoint clusterctl ${CONTAINER} move \
--to-kubeconfig $MGMTKUBE \
--namespace eksa-system \
--kubeconfig $KINDKUBE
Before you delete your bootstrap KinD cluster, verify there are no important custom resources left on it:
kubectl get crds | grep eks | while read crd rol
do
echo $crd
kubectl get $crd -A
done
Bare Metal troubleshooting
Creating new workload cluster hangs or fails
Cluster creation appears to be hung waiting for the Control Plane to be ready. If the CLI is hung on this message for over 30 mins, something likely failed during the OS provisioning:
Waiting for Control Plane to be ready
Or if cluster creation times out on this step and fails with the following messages:
Support bundle archive created {"path": "support-bundle-2022-06-28T00_41_24.tar.gz"}
Analyzing support bundle {"bundle": "CLUSTER_NAME/generated/bootstrap-cluster-2022-06-28T00:41:24Z-bundle.yaml", "archive": "support-bundle-2022-06-28T00_41_24.tar.gz"}
Analysis output generated {"path": "CLUSTER_NAME/generated/bootstrap-cluster-2022-06-28T00:43:40Z-analysis.yaml"}
collecting workload cluster diagnostics
Error: waiting for workload cluster control plane to be ready: executing wait: error: timed out waiting for the condition on clusters/CLUSTER_NAME
In either of those cases, the following steps can help you determine the problem:
- Export the kind cluster’s kubeconfig file:
export KUBECONFIG=${PWD}/${CLUSTER_NAME}/generated/${CLUSTER_NAME}.kind.kubeconfig
- If you have provided BMC information:
  - Check all of the machines that the EKS Anywhere CLI has picked up from the pool of hardware in the CSV file:
kubectl get machines.bmc -A
  - Check if those nodes are powered on. If any of those nodes are not powered on after a while, the BMC credentials may be invalid. You can verify this by checking the logs:
kubectl get tasks.bmc -n eksa-system
kubectl get tasks.bmc <bmc-name> -n eksa-system -o yaml
Validate that the BMC credentials are correct if a connection error is observed on the tasks.bmc resource. Note that “IPMI over LAN” or “Redfish” must be enabled in the BMC configuration for the tasks.bmc resource to communicate successfully.
- If the machine is powered on but you see linuxkit is not running, then Tinkerbell failed to serve the node via iPXE. In this case, you would want to:
  - Check the Boots service logs from the machine where you are running the CLI to see if it received and/or responded to the request:
docker logs boots
  - Confirm no other DHCP service responded to the request and check for any errors in the BMC console. Other DHCP servers on the network can result in race conditions and should be avoided by configuring the other server to block all MAC addresses and exclude all IP addresses used by EKS Anywhere.
- If you see Welcome to LinuxKit, press Enter in the BMC console to access the LinuxKit terminal. Run the following commands to check if the tink-worker container is running:
docker ps -a
docker logs <container-id>
- If the machine has already started provisioning the OS and it’s in an irrecoverable state, get the workflow of the provisioning/provisioned machine using:
kubectl get workflows -n eksa-system
kubectl describe workflow/<workflow-name> -n eksa-system
Check all the actions and their status to determine if all actions have been executed successfully or not. If the stream-image action has failed, it’s likely due to a timeout or network-related issue. You can also provide your own image_url by specifying osImageURL under the datacenter spec.
vSphere troubleshooting
EKSA_VSPHERE_USERNAME is not set or is empty
❌ Validation failed {"validation": "vsphere Provider setup is valid", "error": "failed setup and validations: EKSA_VSPHERE_USERNAME is not set or is empty", "remediation": ""}
Two environment variables need to be set and exported in your environment to create clusters successfully. Be sure to use single quotes around your user name and password to avoid shell manipulation of these values.
export EKSA_VSPHERE_USERNAME='<vSphere-username>'
export EKSA_VSPHERE_PASSWORD='<vSphere-password>'
vSphere authentication failed
❌ Validation failed {"validation": "vsphere Provider setup is valid", "error": "error validating vCenter setup: vSphere authentication failed: govc: ServerFaultCode: Cannot complete login due to an incorrect user name or password.\n", "remediation": ""}
Error: failed to create cluster: validations failed
Two environment variables need to be set and exported in your environment to create clusters successfully. Be sure to use single quotes around your user name and password to avoid shell manipulation of these values.
export EKSA_VSPHERE_USERNAME='<vSphere-username>'
export EKSA_VSPHERE_PASSWORD='<vSphere-password>'
Issues detected with selected template
Issues detected with selected template. Details: - -1:-1:VALUE_ILLEGAL: No supported hardware versions among [vmx-15]; supported: [vmx-04, vmx-07, vmx-08, vmx-09, vmx-10, vmx-11, vmx-12, vmx-13].
Our upstream dependency on CAPV makes it a requirement that you use vSphere 6.7 update 3 or newer. Make sure your ESXi hosts are also up to date.
Waiting for external etcd to be ready
2022-01-19T15:56:57.734Z V3 Waiting for external etcd to be ready {"cluster": "mgmt"}
Debug this problem using techniques from Generic cluster unavailable .
Timed out waiting for the condition on deployments/capv-controller-manager
Failed to create cluster {"error": "error initializing capi in bootstrap cluster: error waiting for capv-controller-manager in namespace capv-system: error executing wait: error: timed out waiting for the condition on deployments/capv-controller-manager\n"}
Debug this problem using techniques from Generic cluster unavailable .
Timed out waiting for the condition on clusters/
Failed to create cluster {"error": "error waiting for workload cluster control plane to be ready: error executing wait: error: timed out waiting for the condition on clusters/test-cluster\n"}
This can be an issue with the number of control plane and worker node replicas defined in your cluster yaml file. Try to start off with a smaller number (3 or 5 is recommended for control plane) in order to bring up the cluster.
This error can also occur because your vCenter server is using self-signed certificates and you have insecure set to true in the generated cluster yaml.
To check if this is the case, run the commands below:
export KUBECONFIG=${PWD}/${CLUSTER_NAME}/generated/${CLUSTER_NAME}.kind.kubeconfig
kubectl get machines
If all the machines are in the Provisioning phase, this is most likely the issue.
To resolve the issue, set insecure to false and thumbprint to the TLS thumbprint of your vCenter server in the cluster yaml and try again.
"msg"="discovered IP address"
This log message can appear, with the control plane address as its value, in either the ${CLUSTER_NAME}/logs/capv-controller-manager.log file or the capv-controller-manager pod log, which can be extracted with the following commands:
export KUBECONFIG=${PWD}/${CLUSTER_NAME}/generated/${CLUSTER_NAME}.kind.kubeconfig
kubectl logs -f -n capv-system -l control-plane="controller-manager" -c manager
Make sure you choose an IP address in your network range that does not conflict with other VMs. See https://anywhere.eks.amazonaws.com/docs/reference/clusterspec/vsphere/#controlplaneconfigurationendpointhost-required
Generic cluster unavailable
The first thing to look at is: were virtual machines created on your target provider? In the case of vSphere, you should see some VMs in your folder and they should be up. Check the console and if you see:
[FAILED] Failed to start Wait for Network to be Configured.
Make sure your DHCP server is up and working.
Workload VM is created on vSphere but can not power on
A similar issue is that the VM powers on but does not show any logs on the console and does not have any IPs assigned.
This issue can occur if the resourcePool that the VM uses does not have enough CPU or memory resources to run a VM.
To resolve this issue, increase the CPU and/or memory reservations or limits for the resourcePool.
Workload VMs start but Kubernetes not working properly
If the workload VMs start, but Kubernetes does not start or is not working properly, you may want to log onto the VMs and check the logs there.
If Kubernetes is at least partially working, you may use kubectl
to get the IPs of the nodes:
kubectl get nodes -o=custom-columns="NAME:.metadata.name,IP:.status.addresses[2].address"
If Kubernetes is not working at all, you can get the IPs of the VMs from vCenter or using govc
.
When you get the external IP you can ssh into the nodes using the private ssh key associated with the public ssh key you provided in your cluster configuration:
ssh -i <ssh-private-key> <ssh-username>@<external-IP>
create command stuck on Creating new workload cluster
There can be a few reasons why the create command is stuck on Creating new workload cluster for over 30 minutes.
First, check the vSphere UI to see if any workload VMs are created.
If any VMs are created, check to see if they have any IPv4 IPs assigned to them.
If there are no IPv4 IPs assigned to them, this is most likely because you don’t have a DHCP server configured for the network configured in the cluster config yaml.
Ensure that you have DHCP running and run the create command again.
If there are any IPv4 IPs assigned, check if one of the VMs has the control plane IP specified in Cluster.spec.controlPlaneConfiguration.endpoint.host in the cluster config yaml.
If this IP is not present on any control plane VM, make sure the network has access to the following endpoints:
- vCenter endpoint (must be accessible to EKS Anywhere clusters)
- public.ecr.aws
- anywhere-assets.eks.amazonaws.com (to download the EKS Anywhere binaries, manifests and OVAs)
- distro.eks.amazonaws.com (to download EKS Distro binaries and manifests)
- d2glxqk2uabbnd.cloudfront.net (for EKS Anywhere and EKS Distro ECR container images)
- api.ecr.us-west-2.amazonaws.com (for EKS Anywhere package authentication matching your region)
- d5l0dvt14r5h8.cloudfront.net (for EKS Anywhere package ECR container images)
- api.github.com (only if GitOps is enabled)
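Reachability of these endpoints can be checked with a small loop. This is a sketch under assumptions: probe and check_endpoints are hypothetical helper names, the timeouts are arbitrary, and the check only tells you whether an HTTPS connection can be made from the machine where you run it, so run it from the same network as your cluster nodes. Any HTTP status (including 403 or 404) counts as reachable.

```shell
# probe HOST: succeeds when an HTTPS connection to HOST can be established.
# Any HTTP status counts as reachable; only connection-level failures fail.
probe() {
  curl --silent --output /dev/null --connect-timeout 5 --max-time 10 "https://$1"
}

# check_endpoints HOST...: prints "ok HOST" or "FAIL HOST" for each host
# and returns non-zero if any host was unreachable.
check_endpoints() {
  failed=0
  for host in "$@"; do
    if probe "$host"; then
      echo "ok $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return "$failed"
}
```

For example: check_endpoints public.ecr.aws anywhere-assets.eks.amazonaws.com distro.eks.amazonaws.com d2glxqk2uabbnd.cloudfront.net. Add your own vCenter endpoint, and api.github.com if GitOps is enabled.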
If the IPv4 IPs are assigned to the VM and you have the workload kubeconfig under <cluster-name>/<cluster-name>-eks-a-cluster.kubeconfig, you can use it to check the vsphere-cloud-controller-manager logs.
kubectl logs -n kube-system vsphere-cloud-controller-manager-<xxxxx> --kubeconfig <cluster-name>/<cluster-name>-eks-a-cluster.kubeconfig
If you see this message in the logs, it means your cluster nodes do not have access to vSphere, which is required for the cluster to reach a ready state.
Failed to connect to <vSphere-FQDN>: connection refused
In this case, you need to enable inbound traffic from your cluster nodes on your vCenter’s management network.
If VMs are created, but they do not get a network connection and DHCP is not configured for your vSphere deployment, you may need to create your own DHCP server.
If no VMs are created, check the capi-controller-manager, capv-controller-manager, and capi-kubeadm-control-plane-controller-manager logs using the commands mentioned in the Generic cluster unavailable section.
Cluster Deletion Fails
If cluster deletion fails, you may need to manually delete the VMs associated with the cluster.
The VMs should be named with the cluster name.
You can power them off and delete them from disk using the vCenter web user interface.
You may also use govc:
govc find -type VirtualMachine --name '<cluster-name>*'
This will give you a list of virtual machines that should be associated with your cluster. For each of the VMs you want to delete run:
VM_NAME=vm-to-destroy
govc vm.power -off -force $VM_NAME
govc object.destroy $VM_NAME
Troubleshooting GitOps integration
Cluster creation failure leaves outdated cluster configuration in GitHub.com repository
Failed cluster creation can sometimes leave behind cluster configuration files committed to your GitHub.com repository.
Make sure to delete these configuration files before you retry eksctl anywhere create cluster.
If these configuration files are not deleted, GitOps installation will fail but cluster creation will continue.
They’ll generally be located under the directory clusters/$CLUSTER_NAME if you used the default path in your flux gitops config.
Delete the entire directory named $CLUSTER_NAME.
Cluster creation failure leaves empty GitHub.com repository
Failed cluster creation can sometimes leave behind a completely empty GitHub.com repository. This can cause the GitOps installation to fail if you retry the creation of a cluster that uses this repository. If this happens, manually delete the empty GitHub.com repository before attempting cluster creation again.
Changes not syncing to cluster
Please remember that the only fields currently supported for GitOps are:
Cluster
Cluster.workerNodeGroupConfigurations.count
Cluster.workerNodeGroupConfigurations.machineGroupRef.name
Worker Nodes
VsphereMachineConfig.diskGiB
VsphereMachineConfig.numCPUs
VsphereMachineConfig.memoryMiB
VsphereMachineConfig.template
VsphereMachineConfig.datastore
VsphereMachineConfig.folder
VsphereMachineConfig.resourcePool
If you’ve changed these fields and they’re not syncing to the cluster as you’d expect, check the logs of the pod in the source-controller deployment in the flux-system namespace.
If flux is having a problem connecting to your GitHub repository, the problem will be logged here.
$ kubectl get pods -n flux-system
NAME READY STATUS RESTARTS AGE
helm-controller-7d644b8547-k8wfs 1/1 Running 0 4h15m
kustomize-controller-7cf5875f54-hs2bt 1/1 Running 0 4h15m
notification-controller-776f7d68f4-v22kp 1/1 Running 0 4h15m
source-controller-7c4555748d-7c7zb 1/1 Running 0 4h15m
$ kubectl logs source-controller-7c4555748d-7c7zb -n flux-system
A well behaved flux pod will simply log the ongoing reconciliation process, like so:
{"level":"info","ts":"2021-07-01T19:58:51.076Z","logger":"controller.gitrepository","msg":"Reconciliation finished in 902.725344ms, next run in 1m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system"}
{"level":"info","ts":"2021-07-01T19:59:52.012Z","logger":"controller.gitrepository","msg":"Reconciliation finished in 935.016754ms, next run in 1m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system"}
{"level":"info","ts":"2021-07-01T20:00:52.982Z","logger":"controller.gitrepository","msg":"Reconciliation finished in 970.03174ms, next run in 1m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system"}
If there are issues connecting to GitHub, you’ll instead see exceptions in the source-controller log stream.
For example, if the deploy key used by flux has been deleted, you’d see something like this:
{"level":"error","ts":"2021-07-01T20:04:56.335Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/youruser/gitops-vsphere-test', error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain"}
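In a long log stream, the error entries can be hard to spot among the reconciliation messages. The sketch below filters for them. It assumes the compact JSON format shown in the examples above (no spaces around the colon), and flux_errors is a hypothetical helper name.

```shell
# flux_errors: reads source-controller log lines on stdin and prints only
# those logged at error level. Prints nothing when there are no errors.
flux_errors() {
  grep -F '"level":"error"' || true
}
```

For example, kubectl logs deploy/source-controller -n flux-system | flux_errors prints only the error-level lines from the source-controller deployment.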
Other ways to troubleshoot GitOps integration
If you’re still having problems after deleting any empty EKS Anywhere created GitHub repositories and looking at the source-controller logs, you can look for additional issues by checking the deployments in the flux-system and eksa-system namespaces. Ensure they’re running and that their log streams are free from exceptions.
$ kubectl get deployments -n flux-system
NAME READY UP-TO-DATE AVAILABLE AGE
helm-controller 1/1 1 1 4h13m
kustomize-controller 1/1 1 1 4h13m
notification-controller 1/1 1 1 4h13m
source-controller 1/1 1 1 4h13m
$ kubectl get deployments -n eksa-system
NAME READY UP-TO-DATE AVAILABLE AGE
eksa-controller-manager 1/1 1 1 4h13m
Snow troubleshooting
Device outage
These are some conditions that can cause a device outage:
- Intentional outage (a planned power outage or an outage when moving devices, for example).
- Unintentional outage (a subset of devices or all devices are rebooted, or experience network disconnections from the LAN, which take the devices offline or isolate them from the cluster).
NOTE: If all Snowball Edge devices are moved to a different place and connected to a different local network, make sure you use the same subnet, netmask, and gateway for your network configuration. After moving, devices and all node instances need to maintain the original IP addresses. Then, follow the recover cluster procedure to get your cluster up and running again. Otherwise, it might be impossible to resume the cluster.
To recover a cluster
If a subset of devices or all devices experience an outage, see Downloading and Installing the Snowball Edge client to get the Snowball Edge client and then follow these steps:
- Reboot and unlock all affected devices manually.
// use reboot-device command to reboot device, this may take several minutes
$ path-to-snowballEdge_CLIENT reboot-device --endpoint https://snowball-ip --manifest-file path-to-manifest-file --unlock-code unlock-code
// use describe-device command to check the status of device
$ path-to-snowballEdge_CLIENT describe-device --endpoint https://snowball-ip --manifest-file path-to-manifest-file --unlock-code unlock-code
// when the State in the output of describe-device is LOCKED, run unlock-device
$ path-to-snowballEdge_CLIENT unlock-device --endpoint https://snowball-ip --manifest-file path-to-manifest-file --unlock-code unlock-code
// use describe-device command to check the status of device until device is unlocked
$ path-to-snowballEdge_CLIENT describe-device --endpoint https://snowball-ip --manifest-file path-to-manifest-file --unlock-code unlock-code
- Get all instance IDs that were part of the cluster by looking up the impacted device IP in the PROVIDERID column.
$ kubectl get machines -A --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
NAMESPACE NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
eksa-system machine-name-1 cluster-name node-name-1 aws-snow:///192.168.1.39/s.i-8319d8c75d54a32cc Running 82s v1.24.9-eks-1-24-7
eksa-system machine-name-2 cluster-name node-name-2 aws-snow:///192.168.1.39/s.i-8d7d3679a1713e403 Running 82s v1.24.9-eks-1-24-7
eksa-system machine-name-3 cluster-name node-name-3 aws-snow:///192.168.1.231/s.i-8201c356fb369c37f Running 81s v1.24.9-eks-1-24-7
eksa-system machine-name-4 cluster-name node-name-4 aws-snow:///192.168.1.39/s.i-88597731b5a4a9044 Running 81s v1.24.9-eks-1-24-7
eksa-system machine-name-5 cluster-name node-name-5 aws-snow:///192.168.1.77/s.i-822f0f46267ad4c6e Running 81s v1.24.9-eks-1-24-7
- Start all instances on the impacted devices as soon as possible.
$ aws ec2 start-instances --instance-id instance-id-1 instance-id-2 ... --endpoint http://snowball-ip:8008 --profile profile-name
- Check the balance status of the current cluster after the cluster is ready again.
$ kubectl get machines -A --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
- Check if you have unstacked etcd machines.
  - If you have unstacked etcd machines, check the provisioning of unstacked etcd machines. You can find the device IP in the PROVIDERID column.
    - If more than one unstacked etcd machine is provisioned on the same device and there are devices with no unstacked etcd machine, you need to rebalance unstacked etcd nodes. Follow the rebalance nodes procedure to rebalance your unstacked etcd nodes in order to recover high availability.
    - If your etcd nodes are evenly distributed, with each device having at most one etcd node, you are done with the recovery.
  - If you don’t have unstacked etcd machines, check the provisioning of control plane machines. You can find the device IP in the PROVIDERID column.
    - If more than one control plane machine is provisioned on the same device and there are devices with no control plane machine, you need to rebalance control plane nodes. Follow the rebalance nodes procedure to rebalance your control plane nodes in order to recover high availability.
    - If your control plane nodes are evenly distributed, with each device having at most one control plane node, you are done with the recovery.
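Reading device IPs out of the PROVIDERID column by eye gets tedious with many machines. The following sketch parses the kubectl get machines -A output shown in the recovery steps. It assumes PROVIDERID values of the form aws-snow:///<device-ip>/<instance-id>, as in the example output. Both function names are hypothetical helpers, not part of any EKS Anywhere tooling.

```shell
# instance_ids_on_device DEVICE_IP
# Reads `kubectl get machines -A` output on stdin and prints the instance ID
# of every machine whose PROVIDERID points at DEVICE_IP.
instance_ids_on_device() {
  awk -v ip="$1" '{
    for (i = 1; i <= NF; i++) {
      n = split($i, parts, "/")   # aws-snow: / (empty) / (empty) / IP / ID
      if (parts[1] == "aws-snow:" && parts[4] == ip) print parts[n]
    }
  }'
}

# machines_per_device
# Reads the same output and prints "COUNT DEVICE_IP" per device,
# sorted by device IP as plain text.
machines_per_device() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^aws-snow:\/\/\//) {
        split($i, parts, "/")
        count[parts[4]]++
      }
    }
  }
  END { for (ip in count) print count[ip], ip }' | LC_ALL=C sort -k2
}
```

For example, kubectl get machines -A --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig | instance_ids_on_device 192.168.1.39 lists the instances to start on that device. To check balance for one machine type, filter the output first (for instance with grep on your etcd or control plane machine names) and pipe the result to machines_per_device.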
How to rebalance nodes
- Confirm the machines you want to delete and get their node names from the NODENAME column.
You can determine which machines need to be deleted by referring to the AGE column. The newly generated machines have a short AGE. Delete those new etcd/control plane machine nodes which are not the only etcd/control plane machine nodes on their devices.
- Cordon each node so no further workloads are scheduled to run on it.
$ kubectl cordon node-name --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
- Drain machine nodes of all current workloads.
$ kubectl drain node-name --ignore-daemonsets --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
- Delete machine node.
$ kubectl delete node node-name --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
- Repeat this process until etcd/control plane machine nodes are evenly provisioned.
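The cordon, drain, and delete sequence can be wrapped so that a failure at any step stops the rest. This is a sketch only: remove_node and the KUBECTL override are hypothetical conveniences, and the sketch handles one node at a time on purpose.

```shell
# remove_node NODE KUBECONFIG
# Cordons, drains, and deletes one node, stopping at the first failure.
# Set KUBECTL=echo to print the commands instead of running them.
remove_node() {
  node=$1
  kubeconfig=$2
  kubectl_cmd=${KUBECTL:-kubectl}
  "$kubectl_cmd" cordon "$node" --kubeconfig="$kubeconfig" &&
  "$kubectl_cmd" drain "$node" --ignore-daemonsets --kubeconfig="$kubeconfig" &&
  "$kubectl_cmd" delete node "$node" --kubeconfig="$kubeconfig"
}
```

For example, remove_node node-name cluster-name/cluster-name-eks-a-cluster.kubeconfig. Running it first with KUBECTL=echo shows the three commands without touching the cluster.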
Device replacement
Device replacement might be required in the following cases:
- When a subset of devices are determined to be broken and you want to join a new device to the current cluster.
- When a subset of devices are offline and come back with a new device IP.
To upgrade a cluster with new devices:
- Add new certificates to the certificate file and new credentials to the credential file.
- Change the device list in your cluster yaml configuration file and use the eksctl anywhere upgrade cluster command.
$ eksctl anywhere upgrade cluster -f eks-a-cluster.yaml
Node outage
Unintentional instance outage
When an instance is in an exception state (for example, terminated or stopped for some reason), Amazon EKS Anywhere discovers it automatically and creates a replacement instance node after 5 minutes. The new node is provisioned using an even provisioning strategy: it is placed on the device with the fewest machines of the same type. Because more than one device can have the same number of machines of this type, there is no guarantee that the node will be provisioned on the original device.
Intentional node replacement
If you want to replace an unhealthy node which didn’t get detected by Amazon EKS Anywhere automatically, you can follow these steps.
NOTE: Do not delete all worker machine nodes or control plane nodes or etcd nodes at the same time. Make sure you delete machine nodes one by one.
- Cordon the node so no further workloads are scheduled to run on it.
$ kubectl cordon node-name --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
- Drain machine nodes of all current workloads.
$ kubectl drain node-name --ignore-daemonsets --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
- Delete machine nodes.
$ kubectl delete node node-name --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
- New nodes will be provisioned automatically. You can check the provision result with the get machines command.
$ kubectl get machines -A --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
Cluster Deletion Fails
If your Amazon EKS Anywhere cluster creation failed and the eksctl anywhere delete cluster -f eksa-cluster.yaml command cannot be run successfully, manually delete a few resources before trying the command again. Run the following commands from the computer on which you set up the AWS configuration and have the Snowball Edge Client installed. If you are using multiple Snowball Edge devices, run these commands on each.
// get the list of instance ids that are created for Amazon EKS Anywhere cluster,
// that can be identified by cluster name in the tag of the output
$ aws ec2 describe-instances --endpoint http://snowball-ip:8008 --profile profile-name
// the next two commands are for deleting DNI, this needs to be done before deleting instance
$ PATH_TO_Snowball_Edge_CLIENT/bin/snowballEdge describe-direct-network-interfaces --endpoint https://snowball-ip --manifest-file path-to-manifest-file --unlock-code unlock-code
// DNI arn can be found in the output of last command, which is associated with the specific instance id you get from describe-instances
$ PATH_TO_Snowball_Edge_CLIENT/bin/snowballEdge delete-direct-network-interface --direct-network-interface-arn DNI-ARN --endpoint https://snowball-ip --manifest-file path-to-manifest-file --unlock-code unlock-code
// delete instance
$ aws ec2 terminate-instances --instance-id instance-id-1 instance-id-2 --endpoint http://snowball-ip:8008 --profile profile-name
Generate a log file from the Snowball Edge device
You can also generate a log file from the Snowball Edge device for AWS Support. See AWS Snowball Edge Logs in this guide.
Nutanix troubleshooting
Error creating Nutanix client
Error: error creating nutanix client: username, password and endpoint are required
Verify that the required environment variables are set before creating clusters:
export EKSA_NUTANIX_USERNAME="<Nutanix-username>"
export EKSA_NUTANIX_PASSWORD="<Nutanix-password>"
Also, make sure the spec.endpoint is correctly configured in the NutanixDatacenterConfig. The value of spec.endpoint should be the IP or FQDN of Prism Central.
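The environment check above can be scripted so that a missing credential is caught before cluster creation starts. The following is a minimal sketch; check_nutanix_env is a hypothetical helper name, not part of eksctl anywhere, and it only tests that the two variables named above are non-empty.

```shell
# Sketch: fail early if the Nutanix credentials are not exported.
# check_nutanix_env is a hypothetical helper, not an eksctl command.
check_nutanix_env() {
  missing=0
  for name in EKSA_NUTANIX_USERNAME EKSA_NUTANIX_PASSWORD; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # Returns 0 when both variables are set, 1 otherwise.
  return "$missing"
}
```

Run check_nutanix_env in the same shell session you will use to create the cluster; a non-zero exit status means at least one variable still needs to be exported.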
x509: certificate signed by unknown authority
If the nutanix Provider setup is valid validation fails with the x509: certificate signed by unknown authority message, the certificate of the Prism Central endpoint is not trusted.
If Prism Central is configured with self-signed certificates, it is recommended to configure the additionalTrustBundle in the NutanixDatacenterConfig. More information can be found here.
3.2 - Generating a Support Bundle
This guide covers the use of the EKS Anywhere Support Bundle for troubleshooting and support. This allows you to gather cluster information, save it to your administrative machine, and perform analysis of the results.
EKS Anywhere leverages troubleshoot.sh to collect and analyze Kubernetes cluster logs, cluster resource information, and other relevant debugging information.
EKS Anywhere has two Support Bundle commands:
eksctl anywhere generate support-bundle will execute a support bundle on your cluster, collecting relevant information, archiving it locally, and performing analysis of the results.
eksctl anywhere generate support-bundle-config will generate a Support Bundle config yaml file for you to customize.
Do not add personally identifiable information (PII) or other confidential or sensitive information to your support bundle. If you provide the support bundle to get support from AWS, it will be accessible to other AWS services, including AWS Support.
Collecting a Support Bundle and running analyzers
eksctl anywhere generate support-bundle
generate support-bundle will allow you to quickly collect relevant logs and cluster resources and save them locally in an archive file. This archive can then be used to aid in further troubleshooting and debugging.
If you provide a cluster configuration file containing your cluster spec using the -f flag, generate support-bundle will customize the auto-generated support bundle collectors and analyzers to match the state of your cluster.
If you provide a support bundle configuration file using the --bundle-config flag, for example one generated with generate support-bundle-config, generate support-bundle will use the provided configuration when collecting information from your cluster and analyzing the results.
Flags:
      --bundle-config string   Bundle Config file to use when generating support bundle
  -f, --filename string        Filename that contains EKS-A cluster configuration
  -h, --help                   help for support-bundle
      --since string           Collect pod logs in the latest duration like 5s, 2m, or 3h.
      --since-time string      Collect pod logs after a specific datetime(RFC3339) like 2021-06-28T15:04:05Z
  -w, --w-config string        Kubeconfig file to use when creating support bundle for a workload cluster
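As an illustration of combining these flags, the sketch below wraps the command so that only recent pod logs are collected. collect_recent_bundle and the DRY_RUN switch are hypothetical conveniences for this example; only the -f and --since flags come from the list above.

```shell
# Sketch: collect a support bundle limited to recent pod logs.
# collect_recent_bundle and DRY_RUN are hypothetical, not part of eksctl.
collect_recent_bundle() {
  cluster_config="$1"
  since="${2:-2h}"   # same duration format as --since (5s, 2m, 3h)
  set -- eksctl anywhere generate support-bundle -f "$cluster_config" --since "$since"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "$*"
  else
    "$@"
  fi
}
```

For example, DRY_RUN=1 collect_recent_bundle myCluster.yaml 30m prints the command that would run, which is a convenient way to review it before collecting from a production cluster.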
Collecting and analyzing a bundle
You only need to run a single command to generate a support bundle, collect information and analyze the output:
eksctl anywhere generate support-bundle -f myCluster.yaml
This command will collect the information from your cluster and run an analysis of the collected information.
The collected information will be saved to your local disk in an archive which can be used for debugging and obtaining additional in-depth support.
The analysis will be printed to your console.
Collect phase:
$ ./bin/eksctl anywhere generate support-bundle -f ./testcluster100.yaml
Collecting support bundle cluster-info
Collecting support bundle cluster-resources
Collecting support bundle secret
Collecting support bundle logs
Analyzing support bundle
Analysis phase:
Analyze Results
------------
Check PASS
Title: gitopsconfigs.anywhere.eks.amazonaws.com
Message: gitopsconfigs.anywhere.eks.amazonaws.com is present on the cluster
------------
Check PASS
Title: vspheredatacenterconfigs.anywhere.eks.amazonaws.com
Message: vspheredatacenterconfigs.anywhere.eks.amazonaws.com is present on the cluster
------------
Check PASS
Title: vspheremachineconfigs.anywhere.eks.amazonaws.com
Message: vspheremachineconfigs.anywhere.eks.amazonaws.com is present on the cluster
------------
Check PASS
Title: capv-controller-manager Status
Message: capv-controller-manager is running.
------------
Check PASS
Title: capv-controller-manager Status
Message: capv-controller-manager is running.
------------
Check PASS
Title: coredns Status
Message: coredns is running.
------------
Check PASS
Title: cert-manager-webhook Status
Message: cert-manager-webhook is running.
------------
Check PASS
Title: cert-manager-cainjector Status
Message: cert-manager-cainjector is running.
------------
Check PASS
Title: cert-manager Status
Message: cert-manager is running.
------------
Check PASS
Title: capi-kubeadm-control-plane-controller-manager Status
Message: capi-kubeadm-control-plane-controller-manager is running.
------------
Check PASS
Title: capi-kubeadm-bootstrap-controller-manager Status
Message: capi-kubeadm-bootstrap-controller-manager is running.
------------
Check PASS
Title: capi-controller-manager Status
Message: capi-controller-manager is running.
------------
Check PASS
Title: capi-controller-manager Status
Message: capi-controller-manager is running.
------------
Check PASS
Title: capi-kubeadm-control-plane-controller-manager Status
Message: capi-kubeadm-control-plane-controller-manager is running.
------------
Check PASS
Title: capi-kubeadm-control-plane-controller-manager Status
Message: capi-kubeadm-control-plane-controller-manager is running.
------------
Check PASS
Title: capi-kubeadm-bootstrap-controller-manager Status
Message: capi-kubeadm-bootstrap-controller-manager is running.
------------
Check PASS
Title: clusters.anywhere.eks.amazonaws.com
Message: clusters.anywhere.eks.amazonaws.com is present on the cluster
------------
Check PASS
Title: bundles.anywhere.eks.amazonaws.com
Message: bundles.anywhere.eks.amazonaws.com is present on the cluster
------------
Archive phase:
a support bundle has been created in the current directory: {"path": "support-bundle-2021-09-02T19_29_41.tar.gz"}
Generating a custom Support Bundle configuration for your EKS Anywhere Cluster
EKS Anywhere will automatically generate a support bundle based on your cluster configuration; however, if you’d like to customize the support bundle to collect specific information, you can generate your own support bundle configuration yaml for EKS Anywhere to run on your cluster.
eksctl anywhere generate support-bundle-config will generate a default support bundle configuration and print it as yaml.
eksctl anywhere generate support-bundle-config -f myCluster.yaml will generate a support bundle configuration customized to your cluster and print it as yaml.
To run a customized support bundle configuration yaml file on your cluster, save this output to a file and run the command eksctl anywhere generate support-bundle using the flag --bundle-config.
eksctl anywhere generate support-bundle-config
Flags:
  -f, --filename string   Filename that contains EKS-A cluster configuration
  -h, --help              help for support-bundle-config
4 - EKS Anywhere curated package management
The main goal of EKS Anywhere curated packages is to make it easy to install, configure, and maintain operational components in an EKS Anywhere cluster. Curated packages provide secure and tested operational components that run on EKS Anywhere clusters. Please check out EKS Anywhere curated packages concepts and EKS Anywhere curated packages configurations for more details.
For proper curated package support, make sure the cluster Kubernetes version is v1.21 or above and the eksctl anywhere version is v0.11.0 or above (can be checked with the eksctl anywhere version command). Amazon EKS Anywhere Curated Packages are only available to customers with the Amazon EKS Anywhere Enterprise Subscription. To request a free trial, talk to your Amazon representative or connect with one here.
Setup authentication to use curated-packages
When you have been notified that your account has been given access to curated packages, create an IAM user in your account with a policy that only allows ECR read access to the Curated Packages repository; similar to this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ECRRead",
      "Effect": "Allow",
      "Action": [
        "ecr:DescribeImageScanFindings",
        "ecr:GetDownloadUrlForLayer",
        "ecr:DescribeRegistry",
        "ecr:DescribePullThroughCacheRules",
        "ecr:DescribeImageReplicationStatus",
        "ecr:ListTagsForResource",
        "ecr:ListImages",
        "ecr:BatchGetImage",
        "ecr:DescribeImages",
        "ecr:DescribeRepositories",
        "ecr:BatchCheckLayerAvailability"
      ],
      "Resource": "arn:aws:ecr:*:783794618700:repository/*"
    },
    {
      "Sid": "ECRLogin",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken"
      ],
      "Resource": "*"
    }
  ]
}
Note: Curated Packages supports pulling images from the following regions. Set the corresponding EKSA_AWS_REGION before cluster creation to choose which region to pull from. If it is not set, images are pulled from us-west-2.
"us-east-2",
"us-east-1",
"us-west-1",
"us-west-2",
"ap-northeast-3",
"ap-northeast-2",
"ap-southeast-1",
"ap-southeast-2",
"ap-northeast-1",
"ca-central-1",
"eu-central-1",
"eu-west-1",
"eu-west-2",
"eu-west-3",
"eu-north-1",
"sa-east-1"
Create credentials for this user and set and export the following environment variables:
export EKSA_AWS_ACCESS_KEY_ID="your*access*id"
export EKSA_AWS_SECRET_ACCESS_KEY="your*secret*key"
export EKSA_AWS_REGION="us-west-2"
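Because an unsupported or mistyped region only shows up later as an image pull failure, it can help to validate EKSA_AWS_REGION up front. The sketch below is one way to do that; resolve_packages_region is a hypothetical helper name, and the region list and the us-west-2 default are copied from the note above.

```shell
# Sketch: resolve and validate the region curated packages will pull from.
# resolve_packages_region is a hypothetical helper, not an eksctl command.
resolve_packages_region() {
  region="${EKSA_AWS_REGION:-us-west-2}"   # documented default when unset
  supported="us-east-2 us-east-1 us-west-1 us-west-2 ap-northeast-3 ap-northeast-2 ap-southeast-1 ap-southeast-2 ap-northeast-1 ca-central-1 eu-central-1 eu-west-1 eu-west-2 eu-west-3 eu-north-1 sa-east-1"
  for candidate in $supported; do
    if [ "$candidate" = "$region" ]; then
      echo "$region"
      return 0
    fi
  done
  echo "unsupported region: $region" >&2
  return 1
}
```

The function prints the effective region on success and returns a non-zero status for any value outside the list, so it can guard a cluster creation script.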
Make sure you are authenticated with the AWS CLI
export AWS_ACCESS_KEY_ID="your*access*id"
export AWS_SECRET_ACCESS_KEY="your*secret*key"
aws sts get-caller-identity
Login to docker
aws ecr get-login-password --region us-west-2 |docker login --username AWS --password-stdin 783794618700.dkr.ecr.us-west-2.amazonaws.com
Verify you can pull an image
docker pull 783794618700.dkr.ecr.us-west-2.amazonaws.com/emissary-ingress/emissary:v3.0.0-9ded128b4606165b41aca52271abe7fa44fa7109
If the image downloads successfully, it worked!
Discover curated packages
You can get a list of the available packages from the command line:
export CLUSTER_NAME=nameofyourcluster
export KUBECONFIG=${PWD}/${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig
eksctl anywhere list packages --kube-version 1.23
Example command output:
Package Version(s)
------- ----------
hello-eks-anywhere 0.1.2-a6847010915747a9fc8a412b233a2b1ee608ae76
adot 0.25.0-c26690f90d38811dbb0e3dad5aea77d1efa52c7b
cert-manager 1.9.1-dc0c845b5f71bea6869efccd3ca3f2dd11b5c95f
cluster-autoscaler 9.21.0-1.23-5516c0368ff74d14c328d61fe374da9787ecf437
harbor 2.5.1-ee7e5a6898b6c35668a1c5789aa0d654fad6c913
metallb 0.13.7-758df43f8c5a3c2ac693365d06e7b0feba87efd5
metallb-crds 0.13.7-758df43f8c5a3c2ac693365d06e7b0feba87efd5
metrics-server 0.6.1-eks-1-23-6-c94ed410f56421659f554f13b4af7a877da72bc1
emissary 3.3.0-cbf71de34d8bb5a72083f497d599da63e8b3837b
emissary-crds 3.3.0-cbf71de34d8bb5a72083f497d599da63e8b3837b
prometheus 2.41.0-b53c8be243a6cc3ac2553de24ab9f726d9b851ca
Generate a curated-packages config
The example shows how to install the harbor
package from the curated package list
.
export CLUSTER_NAME=nameofyourcluster
eksctl anywhere generate package harbor --cluster ${CLUSTER_NAME} --kube-version 1.23 > packages.yaml
Available curated packages and troubleshooting guides are listed below.
Install package controller after installation
If you created a cluster without the package controller or if the package controller was not properly configured, you may need to do some things to enable it.
Make sure you are authenticated with the AWS CLI. Use the credentials you set up for packages. These credentials should have limited capabilities:
export AWS_ACCESS_KEY_ID="your*access*id"
export AWS_SECRET_ACCESS_KEY="your*secret*key"
export EKSA_AWS_ACCESS_KEY_ID="your*access*id"
export EKSA_AWS_SECRET_ACCESS_KEY="your*secret*key"
Verify your credentials are working:
aws sts get-caller-identity
Login to docker
aws ecr get-login-password |docker login --username AWS --password-stdin 783794618700.dkr.ecr.us-west-2.amazonaws.com
Verify you can pull an image
docker pull 783794618700.dkr.ecr.us-west-2.amazonaws.com/emissary-ingress/emissary:v3.0.0-9ded128b4606165b41aca52271abe7fa44fa7109
If the image downloads successfully, it worked!
If you do not have the package controller installed (it is installed by default), install it now:
eksctl anywhere install packagecontroller -f cluster.yaml
If you had the package controller disabled, you may need to modify your cluster.yaml
to enable it.
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: billy
spec:
  packages:
    disable: false
You may need to create or update your credentials which you can do with a command like this. Set the environment variables to the proper values before running the command.
kubectl delete secret -n eksa-packages aws-secret
kubectl create secret -n eksa-packages generic aws-secret \
--from-literal=AWS_ACCESS_KEY_ID=${EKSA_AWS_ACCESS_KEY_ID} \
--from-literal=AWS_SECRET_ACCESS_KEY=${EKSA_AWS_SECRET_ACCESS_KEY} \
--from-literal=REGION=${EKSA_AWS_REGION}
If you recreate secrets, you can manually re-enable the cronjob and run the job to update the image pull secrets:
kubectl get cronjob -n eksa-packages cron-ecr-renew -o yaml | yq e '.spec.suspend |= false' - | kubectl apply -f -
kubectl create job -n eksa-packages --from=cronjob/cron-ecr-renew run-it-now
Upgrade the packages controller
Starting with EKS-A v0.15.0 (packages controller v0.3.9+) the package controller will upgrade automatically according to the selected bundle. For any version prior to v0.3.X, manual steps must be executed to upgrade.
- Ensure the namespace will be kept
kubectl annotate namespaces eksa-packages helm.sh/resource-policy=keep
- Uninstall the eks-anywhere-packages helm release
helm uninstall eks-anywhere-packages
- Remove the secret called aws-secret (we will need credentials when installing the new version)
kubectl delete secret -n eksa-packages aws-secret
- Install the new version using the latest eksctl-anywhere binary
eksctl anywhere install packagecontroller -f ${CLUSTER_NAME}.yaml
4.1 - Package Prerequisites
Prerequisites
Before installing any curated packages for EKS Anywhere, do the following:
- Check that the cluster Kubernetes version is v1.21 or above. For example, you could run kubectl get cluster -o yaml <cluster-name> | grep -i kubernetesVersion
- Check that the version of eksctl anywhere is v0.11.0 or above with the eksctl anywhere version command.
- It is recommended that the package controller is only installed on the management cluster.
- Check the existence of package controller:
kubectl get pods -n eksa-packages | grep "eks-anywhere-packages"
If the returned result is empty, you need to install the package controller.
- Install the package controller if it is not installed:
Note: This command is temporarily provided to ease integration with curated packages. This command will be deprecated in the future.
eksctl anywhere install packagecontroller -f $CLUSTER_NAME.yaml
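The two version checks above compare dotted version strings, which plain string comparison gets wrong (for example, v0.9.4 sorts after v0.11.0 alphabetically). A minimal sketch of a correct comparison follows; version_at_least is a hypothetical helper, and it assumes a sort that supports the -V version-ordering option, as GNU coreutils does.

```shell
# Sketch: succeed when the first version is greater than or equal to the second.
# version_at_least is a hypothetical helper; it relies on `sort -V`.
version_at_least() {
  actual="${1#v}"     # strip an optional leading "v"
  required="${2#v}"
  # The smaller of the two versions sorts first.
  lowest=$(printf '%s\n%s\n' "$actual" "$required" | sort -V | head -n 1)
  [ "$lowest" = "$required" ]
}
```

Pass it the version you read from the commands above, for example version_at_least v0.12.0 v0.11.0 for the CLI or version_at_least 1.23 1.21 for Kubernetes.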
4.2 - Curated Packages Troubleshooting
General debugging
The major component of Curated Packages is the package controller. If the container is not running or not running correctly, packages will not be installed. Generally it should be debugged like any other Kubernetes application. The first step is to check that the pod is running.
kubectl get pods -n eksa-packages
You should see at least two pods: the package controller in the Running state and one or more refresher pods in the Completed state.
NAME READY STATUS RESTARTS AGE
eks-anywhere-packages-69d7bb9dd9-9d47l 1/1 Running 0 14s
eksa-auth-refresher-w82nm 0/1 Completed 0 10s
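This first check can be automated by filtering the pod listing. The sketch below reads the output of kubectl get pods on standard input; controller_running is a hypothetical helper, and it assumes the controller pod name starts with eks-anywhere-packages- as in the example output above.

```shell
# Sketch: succeed only if a package controller pod is in the Running state.
# Reads `kubectl get pods -n eksa-packages` output on stdin.
# controller_running is a hypothetical helper name.
controller_running() {
  awk '$1 ~ /^eks-anywhere-packages-/ && $3 == "Running" { found = 1 }
       END { exit !found }'
}
```

Use it as kubectl get pods -n eksa-packages | controller_running and test the exit status; a non-zero status means the controller pod is missing or not running.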
The describe command might help to get more detail on why there is a problem:
kubectl describe pods -n eksa-packages
Logs of the controller can be seen in a normal Kubernetes fashion:
kubectl logs deploy/eks-anywhere-packages -n eksa-packages controller
To get the general state of the package controller, run the following command:
kubectl get packages,packagebundles,packagebundlecontrollers -A
You should see an active packagebundlecontroller and an available bundle. The packagebundlecontroller should indicate the active bundle. It may take a few minutes to download and activate the latest bundle. In this example, both packages are in the installed state, and the sammy packagebundlecontroller reports that an upgrade to a newer bundle is available.
NAMESPACE NAME PACKAGE AGE STATE CURRENTVERSION TARGETVERSION DETAIL
eksa-packages-sammy package.packages.eks.amazonaws.com/my-hello hello-eks-anywhere 42h installed 0.1.1-bc7dc6bb874632972cd92a2bca429a846f7aa785 0.1.1-bc7dc6bb874632972cd92a2bca429a846f7aa785 (latest)
eksa-packages-tlhowe package.packages.eks.amazonaws.com/my-hello hello-eks-anywhere 44h installed 0.1.1-083e68edbbc62ca0228a5669e89e4d3da99ff73b 0.1.1-083e68edbbc62ca0228a5669e89e4d3da99ff73b (latest)
NAMESPACE NAME STATE
eksa-packages packagebundle.packages.eks.amazonaws.com/v1-21-83 available
eksa-packages packagebundle.packages.eks.amazonaws.com/v1-23-70 available
eksa-packages packagebundle.packages.eks.amazonaws.com/v1-23-81 available
eksa-packages packagebundle.packages.eks.amazonaws.com/v1-23-82 available
eksa-packages packagebundle.packages.eks.amazonaws.com/v1-23-83 available
NAMESPACE NAME ACTIVEBUNDLE STATE DETAIL
eksa-packages packagebundlecontroller.packages.eks.amazonaws.com/sammy v1-23-70 upgrade available v1-23-83 available
eksa-packages packagebundlecontroller.packages.eks.amazonaws.com/tlhowe v1-21-83 active active
Package controller not running
If you do not see a pod or various resources for the package controller, it may be that it is not installed.
No resources found in eksa-packages namespace.
Most likely the cluster was created with an older version of the EKS Anywhere CLI. Curated packages became generally available with v0.11.0. Use the eksctl anywhere version command to verify you are running a new enough release. You can use the eksctl anywhere install packagecontroller command to install the package controller on an older release.
Error: this command is currently not supported
Error: this command is currently not supported
Curated packages became generally available with version v0.11.0. Use the version command to make sure you are running version v0.11.0 or later:
eksctl anywhere version
Error: cert-manager is not present in the cluster
Error: curated packages cannot be installed as cert-manager is not present in the cluster
This is most likely caused by an attempt to install curated packages on a workload cluster with an eksctl anywhere version older than v0.12.0. To use packages on workload clusters, upgrade eksctl anywhere to v0.12.0 or later. The package manager will remotely manage packages on the workload cluster from the management cluster.
Package registry authentication
Error: ImagePullBackOff on Package
If a package fails to start with ImagePullBackOff:
NAME READY STATUS RESTARTS AGE
generated-harbor-jobservice-564d6fdc87 0/1 ImagePullBackOff 0 2d23h
If a package pod cannot pull images, you may not have your AWS credentials set up properly. Verify that your credentials are working properly.
Make sure you are authenticated with the AWS CLI. Use the credentials you set up for packages. These credentials should have limited capabilities:
export AWS_ACCESS_KEY_ID="your*access*id"
export AWS_SECRET_ACCESS_KEY="your*secret*key"
aws sts get-caller-identity
Login to docker
aws ecr get-login-password |docker login --username AWS --password-stdin 783794618700.dkr.ecr.us-west-2.amazonaws.com
Verify you can pull an image
docker pull 783794618700.dkr.ecr.us-west-2.amazonaws.com/emissary-ingress/emissary:v3.0.0-9ded128b4606165b41aca52271abe7fa44fa7109
If the image downloads successfully, it worked!
You may need to create or update your credentials which you can do with a command like this. Set the environment variables to the proper values before running the command.
kubectl delete secret -n eksa-packages aws-secret
kubectl create secret -n eksa-packages generic aws-secret --from-literal=AWS_ACCESS_KEY_ID=${EKSA_AWS_ACCESS_KEY_ID} --from-literal=AWS_SECRET_ACCESS_KEY=${EKSA_AWS_SECRET_ACCESS_KEY} --from-literal=REGION=${EKSA_AWS_REGION}
If you recreate secrets, you can manually re-enable the cronjob and run the job to update the image pull secrets:
kubectl get cronjob -n eksa-packages cron-ecr-renew -o yaml | yq e '.spec.suspend |= false' - | kubectl apply -f -
kubectl create job -n eksa-packages --from=cronjob/cron-ecr-renew run-it-now
Warning: not able to trigger cron job
secret/aws-secret created
Warning: not able to trigger cron job, please be aware this will prevent the package controller from installing curated packages.
This is most likely caused by an attempt to install curated packages in a cluster that is running Kubernetes v1.20 or below. Note that curated packages only support Kubernetes v1.21 and above.
Package on workload clusters
Starting at eksctl anywhere version v0.12.0, packages on workload clusters are remotely managed by the management cluster. When using the following commands to interact with package resources for a workload cluster, make sure the kubeconfig is pointing to the management cluster that was used to create the workload cluster.
Package manager is not managing packages on workload cluster
If the package manager is not managing packages on a workload cluster, make sure the management cluster has various resources for the workload cluster:
kubectl get packages,packagebundles,packagebundlecontrollers -A
You should see a PackageBundleController for the workload cluster named with the name of the workload cluster and the status should be set. There should be a namespace for the workload cluster as well:
kubectl get ns | grep eksa-packages
Create a PackageBundleController for the workload cluster if it does not exist (where billy here is the cluster name):
cat <<! | kubectl apply -f -
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: PackageBundleController
metadata:
  name: billy
  namespace: eksa-packages
!
Workload cluster is disconnected
Cluster is disconnected:
NAMESPACE NAME ACTIVEBUNDLE STATE DETAIL
eksa-packages packagebundlecontroller.packages.eks.amazonaws.com/billy disconnected initializing target client: getting kubeconfig for cluster "billy": Secret "billy-kubeconfig" not found
In the example above, the secret does not exist. This may mean that the management cluster is not managing the cluster, the PackageBundleController name is wrong, or the secret was deleted.
This also may happen if the management cluster cannot communicate with the workload cluster or the workload cluster was deleted, although the detail would be different.
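When a management cluster manages many workload clusters, it can be useful to list only the disconnected ones. The sketch below filters the listing shown above; disconnected_clusters is a hypothetical helper, and it assumes the name column has the packagebundlecontroller.packages.eks.amazonaws.com/<cluster> form and that the state column reads disconnected, as in the example.

```shell
# Sketch: print the names of clusters whose PackageBundleController is
# disconnected. Reads `kubectl get packagebundlecontrollers -A` on stdin.
# disconnected_clusters is a hypothetical helper name.
disconnected_clusters() {
  awk '{
    name = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^packagebundlecontroller\./) {
        name = $i
        sub(/^.*\//, "", name)   # keep only the part after the last "/"
      }
      if ($i == "disconnected" && name != "") {
        print name
        break
      }
    }
  }'
}
```

The function scans fields rather than fixed column positions because the ACTIVEBUNDLE column is empty for a disconnected cluster, which shifts the remaining columns.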
Error: the server doesn’t have a resource type “packages”
All packages are remotely managed by the management cluster, and packages, packagebundles, and packagebundlecontrollers resources are all deployed on the management cluster. Please make sure the kubeconfig is pointing to the management cluster that was used to create the workload cluster while interacting with package-related resources.
Error: packagebundlecontrollers.packages.eks.amazonaws.com “clusterName” not found
This error occurs when a package command is run against a cluster that does not seem to be managed by the management cluster. To get a list of the clusters managed by the management cluster, run the following command:
eksctl anywhere get packagebundlecontroller
NAME ACTIVEBUNDLE STATE DETAIL
billy v1-21-87 active
There will be one packagebundlecontroller for each cluster that is being managed. The only valid cluster name in the above example is billy.
4.3 - Cert-Manager
If you have not already done so, make sure your cluster meets the package prerequisites. Be sure to refer to the troubleshooting guide in the event of a problem.
Important
- Starting at eksctl anywhere version v0.12.0, packages on workload clusters are remotely managed by the management cluster.
- While following this guide to install packages on a workload cluster, please make sure the kubeconfig is pointing to the management cluster that was used to create the workload cluster. The only exception is the kubectl create namespace command below, which should be run with kubeconfig pointing to the workload cluster.
Install on workload cluster
NOTE: The cert-manager package can only be installed on a workload cluster
- Generate the package configuration
eksctl anywhere generate package cert-manager --cluster <cluster-name> > cert-manager.yaml
- Add the desired configuration to cert-manager.yaml
Please see complete configuration options for all configuration options and their default values.
Example package file configuring a cert-manager package to run on a workload cluster.
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: my-cert-manager
  namespace: eksa-packages-<cluster-name>
spec:
  packageName: cert-manager
  targetNamespace: <namespace-to-install-component>
- Install Cert-Manager
eksctl anywhere create packages -f cert-manager.yaml
- Validate the installation
eksctl anywhere get packages --cluster <cluster-name>
Example command output
NAME PACKAGE AGE STATE CURRENTVERSION TARGETVERSION DETAIL
my-cert-manager cert-manager 15s installed 1.9.1-dc0c845b5f71bea6869efccd3ca3f2dd11b5c95f 1.9.1-dc0c845b5f71bea6869efccd3ca3f2dd11b5c95f (latest)
Update
To update the package configuration, update the cert-manager.yaml file and run the following command:
eksctl anywhere apply package -f cert-manager.yaml
Upgrade
Cert-Manager will automatically be upgraded when a new bundle is activated.
Uninstall
To uninstall cert-manager, simply delete the package
eksctl anywhere delete package --cluster <cluster-name> cert-manager
4.4 - Cluster Autoscaler
If you have not already done so, make sure your cluster meets the package prerequisites. Be sure to refer to the troubleshooting guide in the event of a problem.
Important
- Starting at eksctl anywhere version v0.12.0, packages on workload clusters are remotely managed by the management cluster.
- While following this guide to install packages on a workload cluster, please make sure the kubeconfig is pointing to the management cluster that was used to create the workload cluster. The only exception is the kubectl create namespace command below, which should be run with kubeconfig pointing to the workload cluster.
Choose a Deployment Approach
Each Cluster Autoscaler instance can target one cluster for autoscaling.
There are three ways to deploy a Cluster Autoscaler instance:
- Cluster Autoscaler deployed in the management cluster to autoscale the management cluster itself
- Cluster Autoscaler deployed in the management cluster to autoscale a remote workload cluster
- Cluster Autoscaler deployed in the workload cluster to autoscale the workload cluster itself
To read more about the tradeoffs of these different approaches, see here.
Install Cluster Autoscaler in management cluster
- Ensure you have configured at least one WorkerNodeGroup in your cluster to support autoscaling as outlined here
- Generate the package configuration
eksctl anywhere generate package cluster-autoscaler --cluster <cluster-name> > cluster-autoscaler.yaml
- Add the desired configuration to cluster-autoscaler.yaml
Please see complete configuration options for all configuration options and their default values.
Example package file configuring a cluster autoscaler package to run in the management cluster.
Note: Here, the <cluster-name> value represents the name of the management or workload cluster you would like to autoscale.
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: cluster-autoscaler
  namespace: eksa-packages-<cluster-name>
spec:
  packageName: cluster-autoscaler
  targetNamespace: <namespace-to-install-component>
  config: |-
    cloudProvider: "clusterapi"
    autoDiscovery:
      clusterName: "<cluster-name>"
- Install Cluster Autoscaler
eksctl anywhere create packages -f cluster-autoscaler.yaml
- Validate the installation
eksctl anywhere get packages --cluster <cluster-name>
Example command output
NAMESPACE NAME PACKAGE AGE STATE CURRENTVERSION TARGETVERSION DETAIL
eksa-packages-mgmt-v-vmc cluster-autoscaler cluster-autoscaler 18h installed 9.21.0-1.21-147e2a701f6ab625452fe311d5c94a167270f365 9.21.0-1.21-147e2a701f6ab625452fe311d5c94a167270f365 (latest)
Update
To update the package configuration, update the cluster-autoscaler.yaml file and run the following command:
eksctl anywhere apply package -f cluster-autoscaler.yaml
Upgrade
Cluster Autoscaler will automatically be upgraded when a new bundle is activated.
Uninstall
To uninstall Cluster Autoscaler, simply delete the package
eksctl anywhere delete package --cluster <cluster-name> cluster-autoscaler
Install Cluster Autoscaler in workload cluster
A few extra steps are required to install cluster autoscaler in a workload cluster instead of the management cluster.
First, retrieve the management cluster’s kubeconfig secret:
kubectl -n eksa-system get secrets <management-cluster-name>-kubeconfig -o yaml > mgmt-secret.yaml
Update the secret’s namespace to the namespace in the workload cluster that you would like to deploy the cluster autoscaler to. Then, apply the secret to the workload cluster.
kubectl --kubeconfig /path/to/workload/kubeconfig apply -f mgmt-secret.yaml
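The namespace change in the previous step can be done with a text filter instead of editing the file by hand. The sketch below is one possible approach; retarget_namespace is a hypothetical helper, and it assumes the manifest uses the two-space indentation that kubectl get -o yaml produces for metadata fields and that no other two-space-indented key is named namespace.

```shell
# Sketch: rewrite metadata.namespace in a manifest read from stdin.
# retarget_namespace is a hypothetical helper; $1 is the new namespace.
retarget_namespace() {
  sed "s/^  namespace: .*/  namespace: $1/"
}
```

For example, retarget_namespace <target-namespace> < mgmt-secret.yaml writes the updated manifest to standard output, which can be saved to a file and reviewed before it is applied to the workload cluster.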
Now apply this package configuration to the management cluster:
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: workload-cluster-autoscaler
  namespace: eksa-packages-<workload-cluster-name>
spec:
  packageName: cluster-autoscaler
  targetNamespace: <workload-cluster-namespace-to-install-components>
  config: |-
    cloudProvider: "clusterapi"
    autoDiscovery:
      clusterName: "<workload-cluster-name>"
    clusterAPIMode: "incluster-kubeconfig"
    clusterAPICloudConfigPath: "/etc/kubernetes/value"
    extraVolumeSecrets:
      cluster-autoscaler-cloud-config:
        mountPath: "/etc/kubernetes"
        name: "<management-cluster-name>-kubeconfig"
4.5 - Metrics Server
If you have not already done so, make sure your cluster meets the package prerequisites. Be sure to refer to the troubleshooting guide in the event of a problem.
Important
- Starting at eksctl anywhere version v0.12.0, packages on workload clusters are remotely managed by the management cluster.
- While following this guide to install packages on a workload cluster, please make sure the kubeconfig is pointing to the management cluster that was used to create the workload cluster. The only exception is the kubectl create namespace command below, which should be run with kubeconfig pointing to the workload cluster.
Install
- Generate the package configuration
eksctl anywhere generate package metrics-server --cluster <cluster-name> > metrics-server.yaml
- Add the desired configuration to metrics-server.yaml
Please see complete configuration options for all configuration options and their default values.
Example package file configuring a metrics server package to run on a management cluster.
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: metrics-server
  namespace: eksa-packages-<cluster-name>
spec:
  packageName: metrics-server
  targetNamespace: <namespace-to-install-component>
  config: |-
    args:
      - "--kubelet-insecure-tls"
- Install Metrics Server
eksctl anywhere create packages -f metrics-server.yaml
- Validate the installation
eksctl anywhere get packages --cluster <cluster-name>
Example command output
NAME PACKAGE AGE STATE CURRENTVERSION TARGETVERSION DETAIL
metrics-server metrics-server 8h installed 0.6.1-eks-1-23-6-b4c2524fabb3dd4c5f9b9070a418d740d3e1a8a2 0.6.1-eks-1-23-6-b4c2524fabb3dd4c5f9b9070a418d740d3e1a8a2 (latest)
Update
To update the package configuration, update the metrics-server.yaml file and run the following command:
eksctl anywhere apply package -f metrics-server.yaml
Upgrade
Metrics Server will automatically be upgraded when a new bundle is activated.
Uninstall
To uninstall Metrics Server, simply delete the package
eksctl anywhere delete package --cluster <cluster-name> metrics-server
4.6 - AWS Distro for OpenTelemetry (ADOT)
If you have not already done so, make sure your cluster meets the package prerequisites. Be sure to refer to the troubleshooting guide in the event of a problem.
Important
- Starting at eksctl anywhere version v0.12.0, packages on workload clusters are remotely managed by the management cluster.
- While following this guide to install packages on a workload cluster, make sure the kubeconfig is pointing to the management cluster that was used to create the workload cluster. The only exception is the kubectl create namespace command below, which should be run with kubeconfig pointing to the workload cluster.
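The kubeconfig rule above can be sketched as a small wrapper script. This is an illustration only: MGMT_KUBECONFIG, WORKLOAD_KUBECONFIG, and their default paths are hypothetical placeholders, and the run helper only prints each command unless DRY_RUN=0 is set, so the script can be tried without a cluster. The kubectl and eksctl commands themselves are the ones used in the Install steps of this page.

```shell
# Hypothetical kubeconfig locations; replace with your own files.
MGMT_KUBECONFIG="${MGMT_KUBECONFIG:-./mgmt.kubeconfig}"
WORKLOAD_KUBECONFIG="${WORKLOAD_KUBECONFIG:-./workload.kubeconfig}"
DRY_RUN="${DRY_RUN:-1}"

# run KUBECONFIG_PATH COMMAND...: print the command, and execute it with
# KUBECONFIG set to KUBECONFIG_PATH only when DRY_RUN=0.
run() {
  kubeconfig=$1
  shift
  echo "KUBECONFIG=$kubeconfig $*"
  if [ "$DRY_RUN" = "0" ]; then
    KUBECONFIG="$kubeconfig" "$@"
  fi
}

# The namespace is created on the workload cluster (the one exception).
run "$WORKLOAD_KUBECONFIG" kubectl create namespace observability

# Package commands go through the management cluster.
run "$MGMT_KUBECONFIG" eksctl anywhere create packages -f adot.yaml
```

Setting the variable per command, rather than exporting it once, keeps each command's target cluster visible on the line that runs it.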
Install
- Generate the package configuration
eksctl anywhere generate package adot --cluster <cluster-name> > adot.yaml
- Add the desired configuration to adot.yaml
See the complete configuration options for all options and their default values.
Example package file with daemonSet mode and default configuration:
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: my-adot
  namespace: eksa-packages-<cluster-name>
spec:
  packageName: adot
  targetNamespace: observability
  config: |
    mode: daemonset
Example package file with deployment mode and customized collector components to scrape the ADOT collector's own metrics:
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: my-adot
  namespace: eksa-packages-<cluster-name>
spec:
  packageName: adot
  targetNamespace: observability
  config: |
    mode: deployment
    replicaCount: 2
    config:
      receivers:
        prometheus:
          config:
            scrape_configs:
              - job_name: opentelemetry-collector
                scrape_interval: 10s
                static_configs:
                  - targets:
                      - ${MY_POD_IP}:8888
      processors:
        batch: {}
        memory_limiter: null
      exporters:
        logging:
          loglevel: debug
        prometheusremotewrite:
          endpoint: "<prometheus-remote-write-end-point>"
      extensions:
        health_check: {}
        memory_ballast: {}
      service:
        pipelines:
          metrics:
            receivers: [prometheus]
            processors: [batch]
            exporters: [logging, prometheusremotewrite]
        telemetry:
          metrics:
            address: 0.0.0.0:8888
- Create the namespace (if overriding targetNamespace, change observability to the value of targetNamespace)
kubectl create namespace observability
- Install adot
eksctl anywhere create packages -f adot.yaml
- Validate the installation
eksctl anywhere get packages --cluster <cluster-name>
Example command output:
NAME      PACKAGE   AGE   STATE       CURRENTVERSION                                    TARGETVERSION                                              DETAIL
my-adot   adot      19h   installed   0.25.0-c26690f90d38811dbb0e3dad5aea77d1efa52c7b   0.25.0-c26690f90d38811dbb0e3dad5aea77d1efa52c7b (latest)
Update
To update the package configuration, edit adot.yaml and run the following command:
eksctl anywhere apply package -f adot.yaml
Upgrade
ADOT is upgraded automatically when a new bundle is activated.
Uninstall
To uninstall ADOT, delete the package:
eksctl anywhere delete package --cluster <cluster-name> my-adot
4.7 - Prometheus
If you have not already done so, make sure your cluster meets the package prerequisites. Be sure to refer to the troubleshooting guide in the event of a problem.
Important
- Starting at eksctl anywhere version v0.12.0, packages on workload clusters are remotely managed by the management cluster.
- While following this guide to install packages on a workload cluster, make sure the kubeconfig is pointing to the management cluster that was used to create the workload cluster. The only exception is the kubectl create namespace command below, which should be run with kubeconfig pointing to the workload cluster.
Install
- Generate the package configuration
eksctl anywhere generate package prometheus --cluster <cluster-name> > prometheus.yaml
- Add the desired configuration to prometheus.yaml
See the complete configuration options for all options and their default values.
Example package file with default configuration, which enables prometheus-server and node-exporter:
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: generated-prometheus
  namespace: eksa-packages-<cluster-name>
spec:
  packageName: prometheus
Example package file with prometheus-server (or node-exporter) disabled:
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: generated-prometheus
  namespace: eksa-packages-<cluster-name>
spec:
  packageName: prometheus
  config: |
    # disable prometheus-server
    server:
      enabled: false
    # or disable node-exporter
    # nodeExporter:
    #   enabled: false
Example package file with prometheus-server deployed as a StatefulSet with replicaCount 2 and a scrape config that collects only prometheus-server's own metrics:
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: generated-prometheus
  namespace: eksa-packages-<cluster-name>
spec:
  packageName: prometheus
  targetNamespace: observability
  config: |
    server:
      replicaCount: 2
      statefulSet:
        enabled: true
    serverFiles:
      prometheus.yml:
        scrape_configs:
          - job_name: prometheus
            static_configs:
              - targets:
                  - localhost:9090
- Create the namespace (if overriding targetNamespace, change observability to the value of targetNamespace)
kubectl create namespace observability
- Install prometheus
eksctl anywhere create packages -f prometheus.yaml
- Validate the installation
eksctl anywhere get packages --cluster <cluster-name>
Example command output:
NAMESPACE                      NAME                   PACKAGE      AGE   STATE       CURRENTVERSION                                    TARGETVERSION                                              DETAIL
eksa-packages-<cluster-name>   generated-prometheus   prometheus   17m   installed   2.41.0-b53c8be243a6cc3ac2553de24ab9f726d9b851ca   2.41.0-b53c8be243a6cc3ac2553de24ab9f726d9b851ca (latest)
Update
To update the package configuration, edit prometheus.yaml and run the following command:
eksctl anywhere apply package -f prometheus.yaml
Upgrade
Prometheus is upgraded automatically when a new bundle is activated.
Uninstall
To uninstall Prometheus, delete the package:
eksctl anywhere delete package --cluster <cluster-name> generated-prometheus
4.8 - Emissary Ingress
If you have not already done so, make sure your cluster meets the package prerequisites. Be sure to refer to the troubleshooting guide in the event of a problem.
Important
- Starting at eksctl anywhere version v0.12.0, packages on workload clusters are remotely managed by the management cluster.
- While following this guide to install packages on a workload cluster, make sure the kubeconfig is pointing to the management cluster that was used to create the workload cluster. The only exception is the kubectl create namespace command below, which should be run with kubeconfig pointing to the workload cluster.
Install
- Generate the package configuration
eksctl anywhere generate package emissary --cluster <cluster-name> > emissary.yaml
- Add the desired configuration to emissary.yaml
See the complete configuration options for all options and their default values.
Example package file with standard configuration:
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: emissary
  namespace: eksa-packages-<cluster-name>
spec:
  packageName: emissary
- Install Emissary
eksctl anywhere create packages -f emissary.yaml
- Validate the installation
eksctl anywhere get packages --cluster <cluster-name>
Example command output:
NAMESPACE       NAME       PACKAGE    AGE     STATE       CURRENTVERSION                                   TARGETVERSION                                             DETAIL
eksa-packages   emissary   emissary   2m57s   installed   3.0.0-a507e09c2a92c83d65737835f6bac03b9b341467   3.0.0-a507e09c2a92c83d65737835f6bac03b9b341467 (latest)
Update
To update the package configuration, edit emissary.yaml and run the following command:
eksctl anywhere apply package -f emissary.yaml
Upgrade
Emissary is upgraded automatically when a new bundle is activated.
Uninstall
To uninstall Emissary, delete the package:
eksctl anywhere delete package --cluster <cluster-name> emissary
4.9 - Harbor
If you have not already done so, make sure your cluster meets the package prerequisites. Be sure to refer to the troubleshooting guide in the event of a problem.
Important
- Starting at eksctl anywhere version v0.12.0, packages on workload clusters are remotely managed by the management cluster.
- While following this guide to install packages on a workload cluster, make sure the kubeconfig is pointing to the management cluster that was used to create the workload cluster. The only exception is the kubectl create namespace command below, which should be run with kubeconfig pointing to the workload cluster.
Install
- Generate the package configuration
eksctl anywhere generate package harbor --cluster <cluster-name> > harbor.yaml
- Add the desired configuration to harbor.yaml
See the complete configuration options for all options and their default values.
Important
- All configuration options are listed in dot notation (e.g., expose.tls.enabled) in the doc, but they have to be transformed to hierarchical structures when specified in the config section in the YAML spec.
- Harbor web portal is exposed through NodePort by default, and its default port number is 30003 with TLS enabled and 30002 with TLS disabled.
- TLS is enabled by default for connections to Harbor web portal, and a secret resource named harbor-tls-secret is required for that purpose. It can be provisioned through cert-manager or manually with the following command using a self-signed certificate:
kubectl create secret tls harbor-tls-secret --cert=[path to certificate file] --key=[path to key file] -n eksa-packages
- secretKey has to be set as a string of 16 characters for encryption.
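Because secretKey must be exactly 16 characters, generating it can be less error-prone than typing it. One possible sketch, assuming a shell with /dev/urandom, head -c, od, and tr available: 8 random bytes are hex-encoded, which yields 16 characters. The variable name secret_key is illustrative.

```shell
# Hex-encode 8 random bytes: 8 bytes x 2 hex digits = 16 characters.
secret_key=$(head -c 8 /dev/urandom | od -An -tx1 | tr -d ' \n')

# Warn if the key is not the required length before it is pasted
# into the package config.
if [ "${#secret_key}" -ne 16 ]; then
  echo "secretKey must be 16 characters, got ${#secret_key}" >&2
fi

# Print a line that can be copied into the config section.
echo "secretKey: \"$secret_key\""
```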
TLS example with auto certificate generation:
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: my-harbor
  namespace: eksa-packages-<cluster-name>
spec:
  packageName: harbor
  config: |-
    secretKey: "use-a-secret-key"
    externalURL: https://harbor.eksa.demo:30003
    expose:
      tls:
        certSource: auto
        auto:
          commonName: "harbor.eksa.demo"
Non-TLS example:
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: my-harbor
  namespace: eksa-packages-<cluster-name>
spec:
  packageName: harbor
  config: |-
    secretKey: "use-a-secret-key"
    externalURL: http://harbor.eksa.demo:30002
    expose:
      tls:
        enabled: false
- Install Harbor
eksctl anywhere create packages -f harbor.yaml
- Check Harbor
eksctl anywhere get packages --cluster <cluster-name>
Example command output:
NAME        PACKAGE   AGE     STATE       CURRENTVERSION   TARGETVERSION     DETAIL
my-harbor   harbor    5m34s   installed   v2.5.1           v2.5.1 (latest)
Harbor web portal is accessible at whatever externalURL is set to. See complete configuration options for all default values.
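The Important note in the Install steps says that options documented in dot notation must be written as nested YAML inside config. The conversion is mechanical, as the following small sketch shows. dotted_to_yaml is a hypothetical helper written for illustration, not part of eksctl, and it handles a single key with a scalar value only.

```shell
# dotted_to_yaml KEY VALUE: print KEY (given in dot notation) as nested YAML.
dotted_to_yaml() {
  key=$1
  value=$2
  indent=""
  # Peel off one path segment per iteration while a dot remains.
  while [ "${key#*.}" != "$key" ]; do
    printf '%s%s:\n' "$indent" "${key%%.*}"
    key=${key#*.}
    indent="$indent  "
  done
  printf '%s%s: %s\n' "$indent" "$key" "$value"
}

dotted_to_yaml expose.tls.enabled false
```

The call prints expose:, tls:, and enabled: false on three lines, each level indented two more spaces than the last, which is the same shape as the expose block in the Non-TLS example.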
Update
To update the package configuration, edit harbor.yaml and run the following command:
eksctl anywhere apply package -f harbor.yaml
Upgrade
Note
- New versions of software packages will be automatically downloaded but not automatically installed. You can always manually run eksctl to check and install updates.
- Verify a new bundle is available
eksctl anywhere get packagebundle
Example command output:
NAME        VERSION   STATE
v1.25-120   1.25      active (upgrade available)
v1.26-120   1.26      inactive
- Upgrade Harbor
eksctl anywhere upgrade packages --bundle-version v1.26-120
- Check Harbor
eksctl anywhere get packages --cluster <cluster-name>
Example command output:
NAME        PACKAGE   AGE   STATE       CURRENTVERSION   TARGETVERSION     DETAIL
my-harbor   Harbor    14m   installed   v2.7.1           v2.7.1 (latest)
Uninstall
- Uninstall Harbor
Important
- By default, PVCs created for jobservice and registry are not removed during a package delete operation, which can be changed by leaving persistence.resourcePolicy empty.
eksctl anywhere delete package --cluster <cluster-name> my-harbor
4.10 - MetalLB
If you have not already done so, make sure your cluster meets the package prerequisites. Be sure to refer to the troubleshooting guide in the event of a problem.
Important
- Starting at eksctl anywhere version v0.12.0, packages on workload clusters are remotely managed by the management cluster.
- While following this guide to install packages on a workload cluster, make sure the kubeconfig is pointing to the management cluster that was used to create the workload cluster. The only exception is the kubectl create namespace command below, which should be run with kubeconfig pointing to the workload cluster.
Install
- Generate the package configuration
eksctl anywhere generate package metallb --cluster <cluster-name> > metallb.yaml
- Add the desired configuration to metallb.yaml
See the complete configuration options for all options and their default values.
Example package file with BGP configuration:
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: mylb
  namespace: eksa-packages-<cluster-name>
spec:
  packageName: metallb
  config: |
    IPAddressPools:
      - name: default
        addresses:
          - 10.220.0.93/32
          - 10.220.0.97-10.220.0.120
    BGPAdvertisements:
      - ipAddressPools:
          - default
    BGPPeers:
      - peerAddress: 10.220.0.2
        peerASN: 65000
        myASN: 65002
Example package file with ARP configuration:
apiVersion: packages.eks.amazonaws.com/v1alpha1
kind: Package
metadata:
  name: mylb
  namespace: eksa-packages
spec:
  packageName: metallb
  config: |
    IPAddressPools:
      - name: default
        addresses:
          - 10.220.0.93/32
          - 10.220.0.97-10.220.0.120
    L2Advertisements:
      - ipAddressPools:
          - default
- Create the namespace (if overriding targetNamespace, change metallb-system to the value of targetNamespace)
kubectl create namespace metallb-system
- Install MetalLB
eksctl anywhere create packages -f metallb.yaml
- Validate the installation
eksctl anywhere get packages --cluster <cluster-name>
Example command output:
NAME   PACKAGE   AGE   STATE       CURRENTVERSION                                    TARGETVERSION                                              DETAIL
mylb   metallb   22h   installed   0.13.5-ce5b5de19014202cebd4ab4c091830a3b6dfea06   0.13.5-ce5b5de19014202cebd4ab4c091830a3b6dfea06 (latest)
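When sizing IPAddressPools, it can help to count how many addresses a start-end range actually provides. The following is a minimal sketch in POSIX-style shell, using the range from the example configurations above. ip_to_int and range_size are hypothetical helpers for illustration and handle IPv4 start-end ranges only; a /32 entry such as 10.220.0.93/32 always contributes one address.

```shell
# ip_to_int A.B.C.D: print the IPv4 address as a single integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# range_size START-END: print the number of addresses in an inclusive range.
range_size() {
  start=${1%-*}
  end=${1#*-}
  echo $(( $(ip_to_int "$end") - $(ip_to_int "$start") + 1 ))
}

range_size 10.220.0.97-10.220.0.120
```

For the example pool this prints 24, so together with the single /32 address the pool offers 25 addresses to LoadBalancer services.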
Update
To update the package configuration, edit metallb.yaml and run the following command:
eksctl anywhere apply package -f metallb.yaml
Upgrade
MetalLB is upgraded automatically when a new bundle is activated.
Uninstall
To uninstall MetalLB, delete the package:
eksctl anywhere delete package --cluster <cluster-name> mylb