I have used the Kubernetes service on Google Cloud Platform, and it was a great service. I also wrote a blog post in this area, Running Spark on Kubernetes. Recently I used Azure Kubernetes Service (AKS) for a different project and ran into some issues. One of the most annoying was that I could not get an external IP for a load balancer on AKS. This post walks through how I identified the issue and the solution to the problem.
I used the example from Microsoft, Use Azure Kubernetes Service with Kafka on HDInsight, for my testing. The source code is available at https://github.com/Blackmist/Kafka-AKS-Test. The example is simple and straightforward, and the most important part is the file kafka-aks-test.yaml. Here is the content of the file.
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: kafka-aks-test
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  minReadySeconds: 5
  template:
    metadata:
      labels:
        app: kafka-aks-test
    spec:
      containers:
      - name: kafka-aks-test
        image: microsoft/kafka-aks-test:v1
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 250m
          limits:
            cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: kafka-aks-test
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: kafka-aks-test
We can see that the Service uses type LoadBalancer, so it should automatically get an external IP for the service's load balancer. Unfortunately, I could not get this external IP; the service was stuck in the pending state forever.
[root@ Kafka-AKS-Test]# kubectl get service kafka-aks-test --watch
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
kafka-aks-test   LoadBalancer   192.168.130.97   <pending>     80:32656/TCP   10s
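Instead of watching the table by eye, the same check can be scripted. The following is a minimal sketch, assuming a POSIX-style shell and a kubectl that is already configured for the cluster. The function names, the retry count, and the delay are my own illustrative choices, not part of the Microsoft example.

```shell
# Print the external IP of a LoadBalancer service.
# While the service is still pending, the ingress list is normally absent
# and kubectl prints nothing. external_ip is a hypothetical helper name.
external_ip() {
  kubectl get service "$1" \
    --output jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Poll until an IP shows up or the attempts run out.
# Prints the IP and returns 0 on success, returns 1 if still pending.
wait_for_external_ip() {
  svc="$1"
  attempts="${2:-30}"   # assumed default: 30 tries
  delay="${3:-10}"      # assumed default: 10 seconds between tries
  i=0
  while [ "$i" -lt "$attempts" ]; do
    ip="$(external_ip "$svc")"
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: wait_for_external_ip kafka-aks-test 30 10
```

If the function returns 1 after several minutes, the load balancer is most likely failing to provision rather than just being slow, and it is time to look at the service events.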
To make debugging simpler, I used the following two commands to create an NGINX service. This is a quick way to find out whether AKS is working or not.
kubectl run my-nginx --image=nginx --replicas=1 --port=80
kubectl expose deployment my-nginx --port=80 --type=LoadBalancer
I got the same issue. On AKS, a good way to find out what is going on with a service is the kubectl describe service command. Here is the output from this command.
[root@ AKS-Test]# kubectl describe service my-nginx
Name:                     my-nginx
Namespace:                default
Labels:                   run=my-nginx
Annotations:              <none>
Selector:                 run=my-nginx
Type:                     LoadBalancer
IP:                       <pending>
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  31478/TCP
Endpoints:                10.2.5.70:80
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type     Reason                      Age              From                Message
  ----     ------                      ----             ----                -------
  Warning  CreatingLoadBalancerFailed  2m (x3 over 3m)  service-controller  Error creating load balancer (will retry):
    failed to ensure load balancer for service default/my-nginx: [ensure(default/my-nginx): lb(kubernetes) -
    failed to ensure host in pool: "network.InterfacesClient#CreateOrUpdate: Failure responding to request:
    StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403
    Code=\"LinkedAuthorizationFailed\" Message=\"The client '11b7e54a-e1bc-4092-af66-b014c11d9b87' with
    object id '11b7e54a-e1bc-4092-af66-b014c11d9b87' has permission to perform action
    'Microsoft.Network/networkInterfaces/write' on scope
    '/subscriptions/763d9895-8916-4d35-8b43-d51b52642cef/resourceGroups/MC_exa-dev01-ue1-aksc2-vnet2-rg_exa-aksc2_eastus/providers/Microsoft.Network/networkInterfaces/aks-agentpool-40875261-nic-0';
    however, it does not have permission to perform action
    'Microsoft.Network/virtualNetworks/subnets/join/action' on the linked scope(s)
    '/subscriptions/763d9895-8916-4d35-8b43-d51b52642cef/resourceGroups/exa-dev01-ue1-vnet2-rg/providers/Microsoft.Network/virtualNetworks/exa-dev01-ue1-vnet2/subnets/snet-aks2'.\"",
    ensure(default/my-nginx): lb(kubernetes) - failed to ensure host in pool:
    "network.InterfacesClient#CreateOrUpdate: Failure responding to request: StatusCode=403 -- Original Error:
    autorest/azure: Service returned an error. Status=403 Code=\"LinkedAuthorizationFailed\" Message=\"The client
    '11b7e54a-e1bc-4092-af66-b014c11d9b87' with object id '11b7e54a-e1bc-4092-af66-b014c11d9b87' has permission
    to perform action 'Microsoft.Network/networkInterfaces/write' on scope . . . .
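The event message is long, but the two pieces that matter are the action the Service Principal is missing and the linked scope on which it is missing. Here is a minimal sketch for pulling both out of the describe output with grep and sed. It assumes the LinkedAuthorizationFailed message keeps the wording shown above, and the function names are hypothetical.

```shell
# Read `kubectl describe service` output on stdin and print the action
# that the client does NOT have permission to perform.
missing_action() {
  grep -o "does not have permission to perform action '[^']*'" \
    | sed "s/.*action '\([^']*\)'/\1/" \
    | sort -u
}

# Read the same output on stdin and print the linked scope (here, the
# subnet) on which the permission is missing.
linked_scope() {
  grep -o "on the linked scope(s) '[^']*'" \
    | sed "s/.*'\([^']*\)'/\1/" \
    | sort -u
}

# Example:
#   kubectl describe service my-nginx | missing_action
#   kubectl describe service my-nginx | linked_scope
```

The `sort -u` is there because the controller repeats the same error for each retry, so the raw matches contain duplicates.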
It seems this is a common issue that many people run into. I checked the AKS issues on GitHub and found one related to my problem, Azure AKS CreatingLoadBalancerFailed on AKS cluster with advanced networking. One of the recommendations was to add the AKS Service Principal (SP) to the subnet or VNet as Contributor. That did not work for me. I then tried adding the SP as Owner. That did not work either.
Running the command kubectl get all --all-namespaces shows everything related to Kubernetes on AKS.
[root@ ~]# kubectl get all --all-namespaces
NAMESPACE     NAME   READY   STATUS   RESTARTS   AGE
kube-system   pod/addon-http-application-routing-default-http-backend-66c97fw842d   1/1   Running   1   2d
kube-system   pod/addon-http-application-routing-external-dns-c547864b7-r7zts   1/1   Running   1   5d
kube-system   pod/addon-http-application-routing-nginx-ingress-controller-642qfcp   0/1   CrashLoopBackOff   4   1m
kube-system   pod/azureproxy-79c5db744-7ndvk   1/1   Running   4   5d
kube-system   pod/heapster-55f855b47-q5jtf   2/2   Running   0   2d
kube-system   pod/kube-dns-v20-7c556f89c5-5ngp5   3/3   Running   0   5d
kube-system   pod/kube-dns-v20-7c556f89c5-djf7d   3/3   Running   3   2d
kube-system   pod/kube-proxy-dpt28   1/1   Running   2   5d
kube-system   pod/kube-proxy-jq8hx   1/1   Running   1   5d
kube-system   pod/kube-proxy-v4xc5   1/1   Running   0   5d
kube-system   pod/kube-svc-redirect-77kj4   1/1   Running   2   5d
kube-system   pod/kube-svc-redirect-j9545   1/1   Running   1   5d
kube-system   pod/kube-svc-redirect-kvh2r   1/1   Running   0   5d
kube-system   pod/kubernetes-dashboard-546f987686-ws5nm   1/1   Running   0   2d
kube-system   pod/omsagent-4xn72   1/1   Running   2   5d
kube-system   pod/omsagent-fbjsp   1/1   Running   1   5d
kube-system   pod/omsagent-pvfrt   1/1   Running   0   5d
kube-system   pod/tiller-deploy-7ccf99cd64-tstvl   1/1   Running   1   23h
kube-system   pod/tunnelfront-55bbb6b96c-nhlbk   1/1   Running   0   5d

NAMESPACE     NAME   TYPE   CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
default       service/kubernetes   ClusterIP   192.168.0.1   <none>   443/TCP   1d
kube-system   service/addon-http-application-routing-default-http-backend   ClusterIP   192.168.89.103   <none>   80/TCP   5d
kube-system   service/addon-http-application-routing-nginx-ingress   LoadBalancer   192.168.205.83   <pending>   80:32704/TCP,443:32663/TCP   5d
kube-system   service/heapster   ClusterIP   192.168.2.201   <none>   80/TCP   5d
kube-system   service/kube-dns   ClusterIP   192.168.0.10   <none>   53/UDP,53/TCP   5d
kube-system   service/kubernetes-dashboard   ClusterIP   192.168.150.149   <none>   80/TCP   5d
kube-system   service/tiller-deploy   ClusterIP   192.168.34.240   <none>   44134/TCP   23h

NAMESPACE     NAME   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-system   daemonset.extensions/kube-proxy   3   3   3   3   3   beta.kubernetes.io/os=linux   5d
kube-system   daemonset.extensions/kube-svc-redirect   3   3   3   3   3   beta.kubernetes.io/os=linux   5d
kube-system   daemonset.extensions/omsagent   3   3   3   3   3   beta.kubernetes.io/os=linux   5d

NAMESPACE     NAME   DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.extensions/addon-http-application-routing-default-http-backend   1   1   1   1   5d
kube-system   deployment.extensions/addon-http-application-routing-external-dns   1   1   1   1   5d
kube-system   deployment.extensions/addon-http-application-routing-nginx-ingress-controller   1   1   1   0   5d
kube-system   deployment.extensions/azureproxy   1   1   1   1   5d
kube-system   deployment.extensions/heapster   1   1   1   1   5d
kube-system   deployment.extensions/kube-dns-v20   2   2   2   2   5d
kube-system   deployment.extensions/kubernetes-dashboard   1   1   1   1   5d
kube-system   deployment.extensions/tiller-deploy   1   1   1   1   23h
kube-system   deployment.extensions/tunnelfront   1   1   1   1   5d

NAMESPACE     NAME   DESIRED   CURRENT   READY   AGE
kube-system   replicaset.extensions/addon-http-application-routing-default-http-backend-66c97f5dc7   1   1   1   5d
kube-system   replicaset.extensions/addon-http-application-routing-external-dns-c547864b7   1   1   1   5d
kube-system   replicaset.extensions/addon-http-application-routing-nginx-ingress-controller-6449fd79f9   1   1   0   5d
kube-system   replicaset.extensions/azureproxy-79c5db744   1   1   1   5d
kube-system   replicaset.extensions/heapster-55f855b47   1   1   1   5d
kube-system   replicaset.extensions/heapster-56c6f9566f   0   0   0   5d
kube-system   replicaset.extensions/kube-dns-v20-7c556f89c5   2   2   2   5d
kube-system   replicaset.extensions/kubernetes-dashboard-546f987686   1   1   1   5d
kube-system   replicaset.extensions/tiller-deploy-7ccf99cd64   1   1   1   23h
kube-system   replicaset.extensions/tunnelfront-55bbb6b96c   1   1   1   5d

NAMESPACE     NAME   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-system   daemonset.apps/kube-proxy   3   3   3   3   3   beta.kubernetes.io/os=linux   5d
kube-system   daemonset.apps/kube-svc-redirect   3   3   3   3   3   beta.kubernetes.io/os=linux   5d
kube-system   daemonset.apps/omsagent   3   3   3   3   3   beta.kubernetes.io/os=linux   5d

NAMESPACE     NAME   DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/addon-http-application-routing-default-http-backend   1   1   1   1   5d
kube-system   deployment.apps/addon-http-application-routing-external-dns   1   1   1   1   5d
kube-system   deployment.apps/addon-http-application-routing-nginx-ingress-controller   1   1   1   0   5d
kube-system   deployment.apps/azureproxy   1   1   1   1   5d
kube-system   deployment.apps/heapster   1   1   1   1   5d
kube-system   deployment.apps/kube-dns-v20   2   2   2   2   5d
kube-system   deployment.apps/kubernetes-dashboard   1   1   1   1   5d
kube-system   deployment.apps/tiller-deploy   1   1   1   1   23h
kube-system   deployment.apps/tunnelfront   1   1   1   1   5d

NAMESPACE     NAME   DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/addon-http-application-routing-default-http-backend-66c97f5dc7   1   1   1   5d
kube-system   replicaset.apps/addon-http-application-routing-external-dns-c547864b7   1   1   1   5d
kube-system   replicaset.apps/addon-http-application-routing-nginx-ingress-controller-6449fd79f9   1   1   0   5d
kube-system   replicaset.apps/azureproxy-79c5db744   1   1   1   5d
kube-system   replicaset.apps/heapster-55f855b47   1   1   1   5d
kube-system   replicaset.apps/heapster-56c6f9566f   0   0   0   5d
kube-system   replicaset.apps/kube-dns-v20-7c556f89c5   2   2   2   5d
kube-system   replicaset.apps/kubernetes-dashboard-546f987686   1   1   1   5d
kube-system   replicaset.apps/tiller-deploy-7ccf99cd64   1   1   1   23h
kube-system   replicaset.apps/tunnelfront-55bbb6b96c   1   1   1   5d
Pay particular attention to the pod with the CrashLoopBackOff error. In our first AKS cluster, I saw this pod restart over 1000 times within 5 days. This is a pod that AKS itself uses internally; it is there before we deploy anything of our own.
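In a listing this long, the failing pod is easy to miss. A small filter can pick it out. This is a sketch that assumes the default column layout of kubectl get pods --all-namespaces (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the function name is my own.

```shell
# Read `kubectl get pods --all-namespaces` output on stdin and print
# "namespace/pod restarts" for every pod whose STATUS is CrashLoopBackOff.
crashlooping_pods() {
  awk '$4 == "CrashLoopBackOff" { print $1 "/" $2, $5 }'
}

# Example: kubectl get pods --all-namespaces | crashlooping_pods
```

For any pod it reports, kubectl logs with the --previous flag shows the output of the last crashed container, which is usually the next thing to read.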
I opened a ticket with Microsoft and got Microsoft Support to work with me. After a very long conference call, and even a complete reinstall of the AKS cluster, we finally figured out how to get around this issue. The key is to give the correct permission to the AKS Service Principal.
There is one drawback when deploying AKS with the Azure UI: you cannot specify the name of the Service Principal, and the SP is created automatically with a generated name. We had installed and uninstalled AKS multiple times, so we had several SP names, and it was confusing to decide which one we really cared about. Finding the correct SP name is a challenging task. Anyway, the following are the steps to add the correct permission to the AKS Service Principal.
1. Get Client ID
Run the following command to get client id.
[root@ AKS-Test]# az aks show -n exa-aksc2 -g exa-dev01-ue1-aksc2-vnet2-rg | grep clientId
    "clientId": "27ae6273-9706-4156-b546-607279623990"
2. Get SP Name
Click Azure Active Directory, then click App registrations. Change the dropdown from My Apps to All apps, then enter the clientId. The SP name should appear in the results.
3. Set Correct Permission for the SP
When AKS creates the cluster, it creates the SP shown above and grants it the Contributor role. This is the problem, because certain operations require Owner permissions. So we need to add the Owner role to the SP. All the resources used by the AKS cluster are under the MC_* resource group. In our case, it is MC_exa-dev01-ue1-aksc2-vnet2-rg_exa-aksc2_eastus.
Click Resource Group, then MC_exa-dev01-ue1-aksc2-vnet2-rg_exa-aksc2_eastus. Click Access Control (IAM), then click + Add. Choose the Owner role and select the SP found in step 2.
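The three steps above can also be done from the command line, which avoids hunting for the right SP in the portal. The following is only a sketch of that idea, not what Microsoft Support walked us through. It assumes the Azure CLI is logged in with rights to assign roles, that the cluster uses a Service Principal rather than a managed identity, and that the function names are hypothetical helpers of mine.

```shell
# Step 1: client ID of the cluster's Service Principal.
# Arguments: cluster name, cluster resource group.
aks_client_id() {
  az aks show --name "$1" --resource-group "$2" \
    --query servicePrincipalProfile.clientId --output tsv
}

# Step 2: display name of the SP, given its client ID.
sp_display_name() {
  az ad sp show --id "$1" --query displayName --output tsv
}

# Name of the auto-created MC_* resource group that holds the cluster's resources.
aks_node_resource_group() {
  az aks show --name "$1" --resource-group "$2" \
    --query nodeResourceGroup --output tsv
}

# Step 3: grant the Owner role to the SP on the MC_* resource group.
grant_owner_on_node_group() {
  client_id="$(aks_client_id "$1" "$2")" || return 1
  node_rg="$(aks_node_resource_group "$1" "$2")" || return 1
  # Resolve the full resource ID of the group to use as the role scope.
  scope="$(az group show --name "$node_rg" --query id --output tsv)" || return 1
  az role assignment create --assignee "$client_id" --role Owner --scope "$scope"
}

# Example: grant_owner_on_node_group exa-aksc2 exa-dev01-ue1-aksc2-vnet2-rg
```

Owner is a broad role, so it is worth running sp_display_name on the client ID first to confirm that the assignment is going to the SP you expect.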
After this change, our issue was gone. Here is the result from describe service. No error this time.
[root@exa-dev01-ue1-kfclient1-vm Kafka-AKS-Test]# kubectl describe service my-nginx
Name:                     my-nginx
Namespace:                default
Labels:                   run=my-nginx
Annotations:              <none>
Selector:                 run=my-nginx
Type:                     LoadBalancer
IP:                       10.242.237.5
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  32026/TCP
Endpoints:                10.2.10.70:80
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason                Age   From                Message
  ----    ------                ----  ----                -------
  Normal  EnsuringLoadBalancer  8s    service-controller  Ensuring load balancer
The deployment also looks good.
Sample output:
[root@exa-dev01-ue1-kfclient1-vm Kafka-AKS-Test]# kubectl describe deployment my-nginx
Name: my-nginx
Namespace: default
CreationTimestamp: Thu, 14 Jun 2018 15:03:23 +0000
Labels: run=my-nginx
Annotations: deployment.kubernetes.io/revision=1
Selector: run=my-nginx
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 1 max surge
Pod Template:
Labels: run=my-nginx
Containers:
my-nginx:
Image: nginx
Port: 80/TCP
Host Port: 0/TCP
Environment:
Mounts:
Volumes:
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
OldReplicaSets:
NewReplicaSet: my-nginx-9d5677d94 (1/1 replicas created)
Events:
For more information about our issue, you can check it out at https://github.com/Azure/AKS/issues/427.