Could not Get External IP for Load Balancer on Azure AKS

I used the Kubernetes service on Google Cloud Platform and it was a great service. I also wrote a blog, Running Spark on Kubernetes, in this area. Recently I used Azure Kubernetes Service (AKS) for a different project and ran into some issues. One of the most annoying ones was that I could not get an external IP for a load balancer on AKS. This blog discusses how I identified the issue and the solution for the problem.

I used the example from Microsoft, Use Azure Kubernetes Service with Kafka on HDInsight, for my testing. The source code can be accessed at https://github.com/Blackmist/Kafka-AKS-Test. The example is pretty simple and straightforward, and the most important part is the file kafka-aks-test.yaml. Here is the content of the file.

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: kafka-aks-test
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  minReadySeconds: 5
  template:
    metadata:
      labels:
        app: kafka-aks-test
    spec:
      containers:
        - name: kafka-aks-test
          image: microsoft/kafka-aks-test:v1
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 250m
            limits:
              cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: kafka-aks-test
spec:
  type: LoadBalancer
  ports:
    - port: 80
  selector:
    app: kafka-aks-test

We can see the Service uses type LoadBalancer, so it should automatically get an external IP for the service's load balancer. Unfortunately, I could not get this external IP and it was stuck in the Pending state forever.

[root@ Kafka-AKS-Test]# kubectl get service kafka-aks-test --watch
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
kafka-aks-test   LoadBalancer   192.168.130.97   <pending>     80:32656/TCP   10s

To make the debugging process simpler, I used the following two commands to create an NGINX service. This is a nice and quick way to find out whether AKS is working or not.

kubectl run my-nginx --image=nginx --replicas=1 --port=80
kubectl expose deployment my-nginx --port=80 --type=LoadBalancer

I got the same issue. For an AKS service, a good way to find out what's going on is the kubectl describe service command. Here is the output from this command.

[root@ AKS-Test]# kubectl describe service my-nginx
Name:                     my-nginx
Namespace:                default
Labels:                   run=my-nginx
Annotations:              <none>
Selector:                 run=my-nginx
Type:                     LoadBalancer
IP:                       <pending>
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  31478/TCP
Endpoints:                10.2.5.70:80
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type     Reason                      Age               From                Message
  ----     ------                      ----              ----                -------
  Warning  CreatingLoadBalancerFailed  2m (x3 over 3m)   service-controller  Error creating 
  load balancer (will retry): failed to ensure load balancer for service default/my-nginx: 
  [ensure(default/my-nginx): lb(kubernetes) - failed to ensure host in pool: "network.InterfacesClient#CreateOrUpdate: 
  Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service 
  returned an error. Status=403 Code=\"LinkedAuthorizationFailed\" Message=\"The client 
  '11b7e54a-e1bc-4092-af66-b014c11d9b87' with object id '11b7e54a-e1bc-4092-af66-b014c11d9b87' 
  has permission to perform action 'Microsoft.Network/networkInterfaces/write' on scope 
  '/subscriptions/763d9895-8916-4d35-8b43-d51b52642cef/resourceGroups/MC_exa-dev01-ue1-aksc2-
  vnet2-rg_exa-aksc2_eastus/providers/Microsoft.Network/networkInterfaces/aks-agentpool-40875261-nic-0'; 
  however, it does not have permission to perform action 'Microsoft.Network/virtualNetworks/
  subnets/join/action' on the linked scope(s) '/subscriptions/763d9895-8916-4d35-8b43-d51b52642cef/
  resourceGroups/exa-dev01-ue1-vnet2-rg/providers/Microsoft.Network/virtualNetworks/exa-dev01-ue1-vnet2/
  subnets/snet-aks2'.\"", ensure(default/my-nginx): lb(kubernetes) - failed to ensure host 
  in pool: "network.InterfacesClient#CreateOrUpdate: Failure responding to request: StatusCode=403 
  -- Original Error: autorest/azure: Service returned an error. Status=403 Code=\"LinkedAuthorizationFailed\" 
  Message=\"The client '11b7e54a-e1bc-4092-af66-b014c11d9b87' with object id '11b7e54a-e1bc-4092-af66-b014c11d9b87' 
  has permission to perform action 'Microsoft.Network/networkInterfaces/write' on scope
  . . . . 

It seems this is a common issue and many people have run into something similar. I checked the GitHub issues and found one related to my problem, Azure AKS CreatingLoadBalancerFailed on AKS cluster with advanced networking. One of the recommendations was to add the AKS Service Principal (SP) to the subnet or VNet as Contributor. That did not work for me. I then tried adding the SP as Owner. It didn't work either.
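For reference, a role assignment along those lines would look roughly like the sketch below. The subnet scope is taken from the error message above; the assignee placeholder is hypothetical and stands for the AKS Service Principal's client ID, which we find later in this post.

# Grant the AKS SP Contributor on the subnet (this alone did not fix the issue for us)
az role assignment create \
    --assignee <aks-sp-client-id> \
    --role Contributor \
    --scope "/subscriptions/763d9895-8916-4d35-8b43-d51b52642cef/resourceGroups/exa-dev01-ue1-vnet2-rg/providers/Microsoft.Network/virtualNetworks/exa-dev01-ue1-vnet2/subnets/snet-aks2"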

Running the command kubectl get all --all-namespaces lists everything related to Kubernetes on AKS.

[root@ ~]# kubectl get all --all-namespaces
NAMESPACE     NAME                                                                  READY     STATUS             RESTARTS   AGE
kube-system   pod/addon-http-application-routing-default-http-backend-66c97fw842d   1/1       Running            1          2d
kube-system   pod/addon-http-application-routing-external-dns-c547864b7-r7zts       1/1       Running            1          5d
kube-system   pod/addon-http-application-routing-nginx-ingress-controller-642qfcp   0/1       CrashLoopBackOff   4          1m
kube-system   pod/azureproxy-79c5db744-7ndvk                                        1/1       Running            4          5d
kube-system   pod/heapster-55f855b47-q5jtf                                          2/2       Running            0          2d
kube-system   pod/kube-dns-v20-7c556f89c5-5ngp5                                     3/3       Running            0          5d
kube-system   pod/kube-dns-v20-7c556f89c5-djf7d                                     3/3       Running            3          2d
kube-system   pod/kube-proxy-dpt28                                                  1/1       Running            2          5d
kube-system   pod/kube-proxy-jq8hx                                                  1/1       Running            1          5d
kube-system   pod/kube-proxy-v4xc5                                                  1/1       Running            0          5d
kube-system   pod/kube-svc-redirect-77kj4                                           1/1       Running            2          5d
kube-system   pod/kube-svc-redirect-j9545                                           1/1       Running            1          5d
kube-system   pod/kube-svc-redirect-kvh2r                                           1/1       Running            0          5d
kube-system   pod/kubernetes-dashboard-546f987686-ws5nm                             1/1       Running            0          2d
kube-system   pod/omsagent-4xn72                                                    1/1       Running            2          5d
kube-system   pod/omsagent-fbjsp                                                    1/1       Running            1          5d
kube-system   pod/omsagent-pvfrt                                                    1/1       Running            0          5d
kube-system   pod/tiller-deploy-7ccf99cd64-tstvl                                    1/1       Running            1          23h
kube-system   pod/tunnelfront-55bbb6b96c-nhlbk                                      1/1       Running            0          5d

NAMESPACE     NAME                                                          TYPE           CLUSTER-IP        EXTERNAL-IP   PORT(S)                      AGE
default       service/kubernetes                                            ClusterIP      192.168.0.1       <none>        443/TCP                      1d
kube-system   service/addon-http-application-routing-default-http-backend   ClusterIP      192.168.89.103    <none>        80/TCP                       5d
kube-system   service/addon-http-application-routing-nginx-ingress          LoadBalancer   192.168.205.83    <pending>     80:32704/TCP,443:32663/TCP   5d
kube-system   service/heapster                                              ClusterIP      192.168.2.201     <none>        80/TCP                       5d
kube-system   service/kube-dns                                              ClusterIP      192.168.0.10      <none>        53/UDP,53/TCP                5d
kube-system   service/kubernetes-dashboard                                  ClusterIP      192.168.150.149   <none>        80/TCP                       5d
kube-system   service/tiller-deploy                                         ClusterIP      192.168.34.240    <none>        44134/TCP                    23h

NAMESPACE     NAME                                     DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
kube-system   daemonset.extensions/kube-proxy          3         3         3         3            3           beta.kubernetes.io/os=linux   5d
kube-system   daemonset.extensions/kube-svc-redirect   3         3         3         3            3           beta.kubernetes.io/os=linux   5d
kube-system   daemonset.extensions/omsagent            3         3         3         3            3           beta.kubernetes.io/os=linux   5d

NAMESPACE     NAME                                                                            DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.extensions/addon-http-application-routing-default-http-backend       1         1         1            1           5d
kube-system   deployment.extensions/addon-http-application-routing-external-dns               1         1         1            1           5d
kube-system   deployment.extensions/addon-http-application-routing-nginx-ingress-controller   1         1         1            0           5d
kube-system   deployment.extensions/azureproxy                                                1         1         1            1           5d
kube-system   deployment.extensions/heapster                                                  1         1         1            1           5d
kube-system   deployment.extensions/kube-dns-v20                                              2         2         2            2           5d
kube-system   deployment.extensions/kubernetes-dashboard                                      1         1         1            1           5d
kube-system   deployment.extensions/tiller-deploy                                             1         1         1            1           23h
kube-system   deployment.extensions/tunnelfront                                               1         1         1            1           5d

NAMESPACE     NAME                                                                                       DESIRED   CURRENT   READY     AGE
kube-system   replicaset.extensions/addon-http-application-routing-default-http-backend-66c97f5dc7       1         1         1         5d
kube-system   replicaset.extensions/addon-http-application-routing-external-dns-c547864b7                1         1         1         5d
kube-system   replicaset.extensions/addon-http-application-routing-nginx-ingress-controller-6449fd79f9   1         1         0         5d
kube-system   replicaset.extensions/azureproxy-79c5db744                                                 1         1         1         5d
kube-system   replicaset.extensions/heapster-55f855b47                                                   1         1         1         5d
kube-system   replicaset.extensions/heapster-56c6f9566f                                                  0         0         0         5d
kube-system   replicaset.extensions/kube-dns-v20-7c556f89c5                                              2         2         2         5d
kube-system   replicaset.extensions/kubernetes-dashboard-546f987686                                      1         1         1         5d
kube-system   replicaset.extensions/tiller-deploy-7ccf99cd64                                             1         1         1         23h
kube-system   replicaset.extensions/tunnelfront-55bbb6b96c                                               1         1         1         5d

NAMESPACE     NAME                               DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
kube-system   daemonset.apps/kube-proxy          3         3         3         3            3           beta.kubernetes.io/os=linux   5d
kube-system   daemonset.apps/kube-svc-redirect   3         3         3         3            3           beta.kubernetes.io/os=linux   5d
kube-system   daemonset.apps/omsagent            3         3         3         3            3           beta.kubernetes.io/os=linux   5d

NAMESPACE     NAME                                                                      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/addon-http-application-routing-default-http-backend       1         1         1            1           5d
kube-system   deployment.apps/addon-http-application-routing-external-dns               1         1         1            1           5d
kube-system   deployment.apps/addon-http-application-routing-nginx-ingress-controller   1         1         1            0           5d
kube-system   deployment.apps/azureproxy                                                1         1         1            1           5d
kube-system   deployment.apps/heapster                                                  1         1         1            1           5d
kube-system   deployment.apps/kube-dns-v20                                              2         2         2            2           5d
kube-system   deployment.apps/kubernetes-dashboard                                      1         1         1            1           5d
kube-system   deployment.apps/tiller-deploy                                             1         1         1            1           23h
kube-system   deployment.apps/tunnelfront                                               1         1         1            1           5d

NAMESPACE     NAME                                                                                 DESIRED   CURRENT   READY     AGE
kube-system   replicaset.apps/addon-http-application-routing-default-http-backend-66c97f5dc7       1         1         1         5d
kube-system   replicaset.apps/addon-http-application-routing-external-dns-c547864b7                1         1         1         5d
kube-system   replicaset.apps/addon-http-application-routing-nginx-ingress-controller-6449fd79f9   1         1         0         5d
kube-system   replicaset.apps/azureproxy-79c5db744                                                 1         1         1         5d
kube-system   replicaset.apps/heapster-55f855b47                                                   1         1         1         5d
kube-system   replicaset.apps/heapster-56c6f9566f                                                  0         0         0         5d
kube-system   replicaset.apps/kube-dns-v20-7c556f89c5                                              2         2         2         5d
kube-system   replicaset.apps/kubernetes-dashboard-546f987686                                      1         1         1         5d
kube-system   replicaset.apps/tiller-deploy-7ccf99cd64                                             1         1         1         23h
kube-system   replicaset.apps/tunnelfront-55bbb6b96c                                               1         1         1         5d

Pay particular attention to the pod with the CrashLoopBackOff error. I saw this pod restart over 1,000 times within 5 days in our first AKS cluster. This pod is used internally by AKS before we can deploy anything else.

I opened a ticket with Microsoft and got Microsoft Support to work with me. After a very long conference call and even a complete reinstall of the AKS cluster, we finally figured out a way to get around this issue. The key is to give the correct permissions to the AKS Service Principal.

There is one drawback when deploying AKS with the Azure UI: you cannot specify the name of the Service Principal, and the SP is created automatically with a generated name. We have installed and uninstalled AKS multiple times, so we have a few SP names, and it is confusing to decide which one we really care about. Finding the correct SP name is a challenging task. Anyway, the following are the steps to add the correct permissions to the AKS Service Principal.

1. Get Client ID
Run the following command to get the client ID.

[root@ AKS-Test]# az aks show -n exa-aksc2 -g exa-dev01-ue1-aksc2-vnet2-rg | grep clientId
"clientId": "27ae6273-9706-4156-b546-607279623990"

2. Get SP Name
Click Azure Active Directory, then click App registrations. Change the dropdown from My Apps to All apps, then search for the clientId. It should show the SP name.
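Alternatively, the Azure CLI should be able to return the SP name directly from the clientId obtained in step 1; a quick sketch:

# Look up the service principal's display name by its client (application) ID
az ad sp show --id 27ae6273-9706-4156-b546-607279623990 --query displayName -o tsv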

3. Set Correct Permission for the SP
When AKS creates the cluster, it creates the SP shown above and grants it the Contributor role. This is the problem, as certain operations require Owner permissions, so we need to add the Owner role to the SP. All the resources used by the AKS cluster are under the MC_* resource group. In our case, it is MC_exa-dev01-ue1-aksc2-vnet2-rg_exa-aksc2_eastus.

Click Resource Groups, then MC_exa-dev01-ue1-aksc2-vnet2-rg_exa-aksc2_eastus. Click Access Control (IAM), then click + Add and assign the Owner role to the SP.
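The same role assignment can also be done from the CLI. The sketch below uses the clientId from step 1 and the subscription ID and MC_* resource group that appear earlier in this post; verify both against your own cluster.

# Grant the AKS service principal the Owner role on the MC_* resource group
az role assignment create \
    --assignee 27ae6273-9706-4156-b546-607279623990 \
    --role Owner \
    --scope "/subscriptions/763d9895-8916-4d35-8b43-d51b52642cef/resourceGroups/MC_exa-dev01-ue1-aksc2-vnet2-rg_exa-aksc2_eastus"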

After this change, our issue was gone. Here is the result from kubectl describe service. No error this time.

[root@exa-dev01-ue1-kfclient1-vm Kafka-AKS-Test]# kubectl describe service my-nginx
Name:                     my-nginx
Namespace:                default
Labels:                   run=my-nginx
Annotations:              <none>
Selector:                 run=my-nginx
Type:                     LoadBalancer
IP:                       10.242.237.5
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  32026/TCP
Endpoints:                10.2.10.70:80
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason                Age   From                Message
  ----    ------                ----  ----                -------
  Normal  EnsuringLoadBalancer  8s    service-controller  Ensuring load balancer

The deployment also looks good.
Sample output:

[root@exa-dev01-ue1-kfclient1-vm Kafka-AKS-Test]# kubectl describe deployment my-nginx
Name:                   my-nginx
Namespace:              default
CreationTimestamp:      Thu, 14 Jun 2018 15:03:23 +0000
Labels:                 run=my-nginx
Annotations:            deployment.kubernetes.io/revision=1
Selector:               run=my-nginx
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 1 max surge
Pod Template:
  Labels:  run=my-nginx
  Containers:
   my-nginx:
    Image:        nginx
    Port:         80/TCP
    Host Port:    0/TCP
    Environment:
    Mounts:
  Volumes:
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
OldReplicaSets:
NewReplicaSet:   my-nginx-9d5677d94 (1/1 replicas created)
Events:

For more information about our issue, you can check it out at https://github.com/Azure/AKS/issues/427.


Install H2O Driverless AI on Google Cloud Platform

I wrote many blogs about H2O and H2O Sparkling Water in the past. Today I am going to discuss the installation of H2O Driverless AI (H2O AI). H2O AI targets machine learning, especially deep learning. While H2O focuses more on algorithms, models, and prediction, H2O AI automates some of the most difficult data science and ML workflows to offer automatic visualizations and Machine Learning Interpretability (MLI). Here is the architecture of H2O AI.

There are some differences between installation environments. To check out the different environments, use the H2O Driverless AI installation document at http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/installing.html.

This blog only discusses the topics related to Google Cloud. Here are a few important things to know before the installation.
1. It requires a lot of memory and CPUs; if possible, use a GPU. I used 8 CPUs and 52 GB of memory on Google Cloud. If you can use a GPU, add the GPU option. I don't have access to GPUs in my account.
2. The OS is based on Ubuntu 16.04, and I believe that is the minimum version supported.
3. The OS disk size should be >= 64GB. I used 64GB.
4. Instead of installing a software package, H2O AI is distributed as a Docker image. Yes, Docker needs to be installed first.
5. If you plan to use Python to connect to H2O AI, the supported Python version is 3.6.

Ok, here is the installation procedure on GCP:
1. Create a new firewall rule
Click VPC Network -> Firewall Rules -> Create Firewall Rule
Input the following:
Name : h2oai
Description: The firewall rule for H2O driverless AI
Target tags: h2o
Source IP ranges: 0.0.0.0/0
Protocols and ports: tcp:12345,54321
Please note: H2O's documentation misses port 54321, which is used by the H2O Flow UI. You need to open this port; otherwise you cannot access the H2O Flow UI.
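If you prefer the gcloud CLI over the console, a roughly equivalent firewall rule can be created as in the sketch below, based on the values above:

# Create the firewall rule for the Driverless AI UI (12345) and the H2O Flow UI (54321)
gcloud compute firewall-rules create h2oai \
    --description="The firewall rule for H2O Driverless AI" \
    --target-tags=h2o \
    --source-ranges=0.0.0.0/0 \
    --allow=tcp:12345,tcp:54321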


2. Create a new VM instance
Name: h2otest
Zone: us-east1-c
Cores: 8 vCPU
Memory: 52 GB
Boot disk: 64 GB, Ubuntu 16.04
Service account: use your GCP service account
Network tags: h2o
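The same VM can be created from the CLI. The sketch below is an assumption on my part: n1-highmem-8 matches the 8 vCPU / 52 GB sizing above, and ubuntu-1604-lts is the public Ubuntu 16.04 image family; adjust the service account and other options as needed.

# Create the H2O Driverless AI VM with a 64 GB Ubuntu 16.04 boot disk
gcloud compute instances create h2otest \
    --zone=us-east1-c \
    --machine-type=n1-highmem-8 \
    --boot-disk-size=64GB \
    --image-family=ubuntu-1604-lts \
    --image-project=ubuntu-os-cloud \
    --tags=h2o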

3. Install and configure Docker
Log on to the h2otest VM instance and su to the root user.
Create a script, build.sh:

apt-get -y update
apt-get -y --no-install-recommends install \
  curl \
  apt-utils \
  python-software-properties \
  software-properties-common

add-apt-repository -y "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -

apt-get update
apt-get install -y docker-ce

Run the script

root@h2otest:~# chmod u+x build.sh
root@h2otest:~# ./build.sh

Create the required directories.

mkdir ~/tmp
mkdir ~/log
mkdir ~/data
mkdir ~/scripts
mkdir ~/license
mkdir ~/demo
mkdir -p ~/jupyter/notebooks

Adding the current user to the docker group is optional. I did it anyway.

root@h2otest:~# usermod -aG docker weidong.zhou
root@h2otest:~# id weidong.zhou
uid=1001(weidong.zhou) gid=1002(weidong.zhou) groups=1002(weidong.zhou),4(adm),20(dialout),24(cdrom),25(floppy),29(audio),30(dip),44(video),46(plugdev),109(netdev),110(lxd),1000(ubuntu),1001(google-sudoers),999(docker)

4. Download and Load H2O AI Docker Image
Download the docker image.

root@h2otest:~# wget https://s3-us-west-2.amazonaws.com/h2o-internal-release/docker/driverless-ai-docker-runtime-latest-release.gz
--2018-04-18 16:43:31--  https://s3-us-west-2.amazonaws.com/h2o-internal-release/docker/driverless-ai-docker-runtime-latest-release.gz
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.209.8
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.209.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2167098485 (2.0G) [application/gzip]
Saving to: ‘driverless-ai-docker-runtime-latest-release.gz’

driverless-ai-docker-runtime-latest-release.g 100%[==============================================================================================>]   2.02G  26.2MB/s    in 94s     

2018-04-18 16:45:05 (22.0 MB/s) - ‘driverless-ai-docker-runtime-latest-release.gz’ saved [2167098485/2167098485]

Load the Docker image.

root@h2otest:~# docker load < driverless-ai-docker-runtime-latest-release.gz
9d3227c1793b: Loading layer [==================================================>]  121.3MB/121.3MB
a1a54d352248: Loading layer [==================================================>]  15.87kB/15.87kB
. . . .
ed86b627a562: Loading layer [==================================================>]  1.536kB/1.536kB
7d38d6d61cec: Loading layer [==================================================>]  1.536kB/1.536kB
de539994349c: Loading layer [==================================================>]  3.584kB/3.584kB
8e992954a9eb: Loading layer [==================================================>]  3.584kB/3.584kB
ff71b3e896ef: Loading layer [==================================================>]  8.192kB/8.192kB
Loaded image: opsh2oai/h2oai-runtime:latest
root@h2otest:~# docker image ls
REPOSITORY               TAG                 IMAGE ID            CREATED             SIZE
opsh2oai/h2oai-runtime   latest              dff251c69407        12 days ago         5.46GB

5. Start H2O AI
Create a startup script, start_h2oai.sh. Please note: the H2O documentation has an error here as well, missing port 54321 for the H2O Flow UI. Then run the script.

root@h2otest:~# cat start_h2oai.sh 
#!/bin/bash

docker run \
    --rm \
    -u `id -u`:`id -g` \
    -p 12345:12345 \
    -p 54321:54321 \
    -p 8888:8888 \
    -p 9090:9090 \
    -v `pwd`/data:/data \
    -v `pwd`/log:/log \
    -v `pwd`/license:/license \
    -v `pwd`/tmp:/tmp \
    opsh2oai/h2oai-runtime

root@h2otest:~# chmod a+x start_h2oai.sh
root@h2otest:~# ./start_h2oai.sh 
---------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------
     version: 1.0.30

- Put data in the volume mounted at /data
- Logs are written to the volume mounted at /log/20180419-094058
- Connect to Driverless AI on port 12345 inside the container
- Connect to Jupyter notebook on port 8888 inside the container

Also create a script, ssh_h2oai.sh, to quickly open a shell in the H2O AI container without having to look up the container ID first.

root@h2otest:~# vi ssh_h2oai.sh
root@h2otest:~# cat ssh_h2oai.sh 
#!/bin/bash

CONTAINER_ID=`docker ps|grep h2oai-runtime|awk '{print $1}'`
docker exec -it $CONTAINER_ID bash
root@h2otest:~# chmod a+x ssh_h2oai.sh 
root@h2otest:~# ./ssh_h2oai.sh
root@09bd138f4f41:/#

6. Use H2O AI
Get the external IP of the H2O VM. In my case, it is 35.196.90.114. Then access the URL http://35.196.90.114:12345/. You will see the H2O AI evaluation agreement screen. Click I Agree to these Terms to continue.
The logon screen shows up. I used the following information to sign in.
Username: h2o
Password: h2o
Actually, it doesn't matter what you input; you can use any username to log in, as it is not checked. I know it has a feature to integrate with LDAP; I just didn't give it a try this time.

After signing in, it will ask you to input license information. Fill out your information at https://www.h2o.ai/try-driverless-ai/ and you will receive a 21-day trial license by email.

The first screen that shows up is the Datasets overview. You can add a dataset from one of three sources: File System, Hadoop File System, or Amazon S3. To use some sample data, I chose Amazon S3's AirlinesTest.csv.zip file.



For every dataset, there are two kinds of actions: Visualize or Predict.
Click Visualize, and many interesting visualization charts show up.



If you click Predict, the Experiment screen shows up. Choose a target column; in my example, I chose the ArrTime column. Click Launch Experiment.

Once the experiment finishes, it shows a list of options. For example, I clicked Interpret this model on original features.


For people familiar with the H2O Flow UI, H2O AI still has it; just click H2O-3 from the menu and the H2O Flow UI will show up.

In general, H2O AI has an impressive UI and tons of new features. No wonder it is not free. In the next blog, I am going to discuss how to configure the Python client to access H2O AI.

Fixing the i/o timeout Error when Using Kubernetes on Google Cloud Platform

Kubernetes is a nice offering on Google Cloud Platform. It is pretty easy to create a Kubernetes cluster and deploy software to it. I recently ran into a weird issue when using Kubernetes and would like to share the issue and solution in this blog.
I ran the kubectl get nodes command after creating a new Kubernetes cluster. It usually works without any issue. This time when I ran it, it looked hung and came back with the following error after a long time.

wzhou:@myhost tmp > kubectl get nodes
Unable to connect to the server: dial tcp 15.172.14.42:443: i/o timeout

It looked weird, as I didn't do anything significantly different from my other runs. After going through the steps I used to create the cluster, I realized I had created the cluster in a different zone this time. Ok, let me try to get the credentials for my cluster, wz-kube1.

wzhou:@myhost tmp > gcloud container clusters get-credentials wz-kube1
Fetching cluster endpoint and auth data.
ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=404, message=The resource "projects/cdh-gcp-test-139878/zones/us-central1-a/clusters/wz-kube1" was not found.
Could not find [wz-kube1] in [us-central1-a].
Did you mean [wz-kube1] in [us-east1-b]?

Ah, this pointed to the issue. Let me specify the zone explicitly; no error this time.

wzhou:@myhost tmp > gcloud container clusters get-credentials wz-kube1 --zone us-east1-b
Fetching cluster endpoint and auth data.
kubeconfig entry generated for wz-kube1.

Try the get nodes command again. It worked this time. Problem solved.

wzhou:@myhost tmp > kubectl get nodes
NAME                                      STATUS    ROLES     AGE       VERSION
gke-wz-kube1-default-pool-6d1150c9-dcqf   Ready     <none>    47m       v1.8.8-gke.0
gke-wz-kube1-default-pool-6d1150c9-dgfk   Ready     <none>    47m       v1.8.8-gke.0
gke-wz-kube1-default-pool-6d1150c9-nfs7   Ready     <none>    47m       v1.8.8-gke.0
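To avoid running into this again, you can set a default compute zone for gcloud so that later commands look in the right zone by default:

# Make us-east1-b the default zone for subsequent gcloud commands
gcloud config set compute/zone us-east1-b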

Create Cloudera Hadoop Cluster Using Cloudera Director on Google Cloud

I have a blog from several years ago discussing how to install a Cloudera Hadoop cluster. It took at least half a day to complete the installation in my VM cluster. In my last post, I discussed an approach to deploy a Hadoop cluster using Dataproc on Google Cloud Platform. It literally took less than two minutes to create a Hadoop cluster. Although it is good to have a cluster launched in a very short time, it does not have a nice UI like Cloudera Manager, as the Hadoop distribution used by Dataproc is not CDH. I could repeat my earlier blogs to build a Hadoop cluster using VM instances on Google Cloud Platform, but it would take some time and involve a lot of work. Actually, there is another way to create a Hadoop cluster on the cloud. Cloudera has a product called Cloudera Director. It currently supports not only Google Cloud, but AWS and Azure as well. It is designed to deploy CDH clusters faster and to make it easier to scale the cluster on the cloud. Another important feature is that Cloudera Director allows you to move your deployment scripts or steps easily from one cloud provider to another, so you are not locked into one cloud vendor. In this blog, I will show you the way to create a CDH cluster using Cloudera Director.

The first step is to start my Cloudera Director instance. In my case, I have already installed Cloudera Director based on the instructions from Cloudera. It is a pretty straightforward process and I am not going to repeat it here. The Cloudera Director instance is where you launch your CDH cluster deployment.

Both the Cloudera Director and Cloudera Manager UIs are browser-based, and you have to set up a secure connection between your local machine and the VM instances on the cloud. To achieve this, you need to configure a SOCKS proxy on your local machine that is used to connect to the Cloudera Director VM. It provides a secure way to connect to your VM on the cloud and lets you use the VM's internal IP and hostname in the web browser. Google has a nice note about the steps, Securely Connecting to VM Instances. Following this note will help you set up the SOCKS proxy.

Ok, here are the steps.
Logon to Cloudera Director
Open a terminal session locally, and run the following code:

gcloud compute ssh cdh-director-1 \
    --project cdh-director-173715 \
    --zone us-central1-c \
    --ssh-flag="-D" \
    --ssh-flag="1080" \
    --ssh-flag="-N"    

cdh-director-1 is the name of my Cloudera Director instance on Google Cloud and cdh-director-173715 is my Google Cloud project ID. After executing the above command, it looks hung and never completes. This is the CORRECT behavior. Do not kill or exit this session. Open a browser and type in the internal IP of the Cloudera Director instance with port number 7189. For my cdh-director-1 instance, the internal IP is 10.128.0.2.
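On the local machine, the browser then needs to be pointed at the SOCKS proxy listening on port 1080. Here is a sketch for Chrome; the binary name varies by OS and the profile directory is just a throwaway I made up for this example.

# Route all browser traffic through the SOCKS proxy created by the ssh session above
google-chrome \
    --proxy-server="socks5://localhost:1080" \
    --user-data-dir="$HOME/chrome-proxy-profile"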

After inputting the URL http://10.128.0.2:7189 for Cloudera Director, the login screen shows up. Log in as the admin user.

Deployment
After logging in, the initial setup wizard shows up. Click Let's get started.

In the Add Environment screen, input the information as follows. The Client ID JSON Key is the file you create during the initial setup of your Google project, together with the SSH keys.

In the next Add Cloudera Manager screen, I usually create the instance templates first. Click the Instance Template dropdown, then select Create a new instance template. I need at least three templates: one for Cloudera Manager, one for master nodes, and one for worker nodes. To save resources in my Google Cloud environment, I did not create a template for edge nodes. Here are the configurations for all three templates.

Cloudera Manager Template

Master Node Template

Worker Node Template

Input the following for Cloudera Manager. For my test, I used the Embedded Database. If it is for production, you need to set up an external database first and register it here.

After clicking Continue, the Add Cluster screen shows up. There is a gateway instance group, and I removed it by clicking Delete Group because I don't have an edge node here. Input the corresponding template and the number of instances for masters and workers.

After clicking Continue, the deployment starts.

After about 20 minutes, it completes. Click Continue.

Review Cluster
The nice Cloudera Director dashboard shows up.

You can also log in to Cloudera Manager from the link on Cloudera Director.

Nice and easy. An excellent product from Cloudera. For more information about deploying a CDH cluster on Google Cloud, you can also check out Cloudera's document, Getting Started on Google Cloud Platform.

Create Hadoop Cluster on Google Cloud Platform

There are many ways to create Hadoop clusters and I am going to show a few of them on Google Cloud Platform (GCP). The first approach is the standard way to build a Hadoop cluster, whether you do it in the cloud or on-premises: basically, create a group of VM instances and manually install a Hadoop cluster on them. Many people have blogs or articles about this approach and I am not going to repeat the steps here.

In this blog, I am going to discuss the approach using Google Cloud Dataproc, with which you can actually have a Hadoop cluster up and running within 2 minutes. Google Cloud Dataproc is a fully managed cloud service for running Apache Hadoop clusters in a simple and fast way. The following shows the steps to create a Hadoop cluster and submit a Spark job to the cluster.

Create a Hadoop Cluster
Click Dataproc -> Clusters

Then click Enable API

The Cloud Dataproc screen shows up. Click Create cluster

Input the following parameters:
Name : cluster-test1
Region : Choose us-central1
Zone : Choose us-central1-c

1. Master Node
Machine Type : The default is n1-standard-4, but I chose n1-standard-1 just for simple testing purposes.
Cluster Mode : There are 3 modes here: Single Mode (1 master, 0 workers), Standard Mode (1 master, N workers), and High Availability Mode (3 masters, N workers). Choose Standard Mode.
Primary disk size : For my testing, 10GB is enough.

2. Worker Nodes
Similar configuration to the master node. I use 3 worker nodes and the disk size is 15 GB. You might notice that there is an option to use local SSD storage. You can attach up to 8 local SSD devices to the VM instance. Each disk is 375 GB in size, and you cannot specify a 10GB disk size here. The local SSDs are physically attached to the host server and offer higher performance and lower latency than Google's persistent disk storage. Local SSDs are used for temporary data such as shuffle data in MapReduce. The data on local SSD storage is not persistent. For more information, please visit https://cloud.google.com/compute/docs/disks/local-ssd.

Another thing to mention is that Dataproc uses a Cloud Storage bucket instead of HDFS for the Hadoop cluster. Although the hadoop command still works and you won't feel anything different, the underlying storage is different. In my opinion, it is actually better, because a Google Cloud Storage bucket definitely has much better reliability and scalability than HDFS.

Click Create when everything is done. After a few minutes, the cluster is created.
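For reference, roughly the same cluster can be created from the command line. The sketch below mirrors the settings above; the machine types and disk sizes are the ones I chose, so adjust them as needed.

# Create a 1-master / 3-worker Dataproc cluster matching the console settings above
gcloud dataproc clusters create cluster-test1 \
    --region=us-central1 \
    --zone=us-central1-c \
    --master-machine-type=n1-standard-1 \
    --master-boot-disk-size=10GB \
    --num-workers=3 \
    --worker-machine-type=n1-standard-1 \
    --worker-boot-disk-size=15GB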

Click cluster-test1 and it should show the cluster information.

If you click the VM Instances tab, we can see one master and 3 worker instances.

Click the Configuration tab. It shows all the configuration information.

Submit a Spark Job
Click Cloud Dataproc -> Jobs.

Once the Submit Job screen shows up, input the job information (a CLI equivalent is sketched below), then click Submit.
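The exact job parameters from the screenshot are not reproduced here. As a stand-in, the sketch below submits the classic SparkPi example that ships with the Dataproc image (an assumption on my part, not necessarily the job I ran in the UI):

# Submit the SparkPi example jar to the cluster from the command line
gcloud dataproc jobs submit spark \
    --cluster=cluster-test1 \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000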

After the job completes, you should see the following:

To verify the result, I need to ssh to the master node to find out which ports are listening for connections. Click the dropdown to the right of SSH for the master node, then click Open in browser window.

Then run the netstat command.

cluster-test1-m:~$ netstat -a |grep LISTEN |grep tcp
tcp        0      0 *:10033                 *:*                     LISTEN     
tcp        0      0 *:10002                 *:*                     LISTEN     
tcp        0      0 cluster-test1-m.c.:8020 *:*                     LISTEN     
tcp        0      0 *:33044                 *:*                     LISTEN     
tcp        0      0 *:ssh                   *:*                     LISTEN     
tcp        0      0 *:52888                 *:*                     LISTEN     
tcp        0      0 *:58266                 *:*                     LISTEN     
tcp        0      0 *:35738                 *:*                     LISTEN     
tcp        0      0 *:9083                  *:*                     LISTEN     
tcp        0      0 *:34238                 *:*                     LISTEN     
tcp        0      0 *:nfs                   *:*                     LISTEN     
tcp        0      0 cluster-test1-m.c:10020 *:*                     LISTEN     
tcp        0      0 localhost:mysql         *:*                     LISTEN     
tcp        0      0 *:9868                  *:*                     LISTEN     
tcp        0      0 *:9870                  *:*                     LISTEN     
tcp        0      0 *:sunrpc                *:*                     LISTEN     
tcp        0      0 *:webmin                *:*                     LISTEN     
tcp        0      0 cluster-test1-m.c:19888 *:*                     LISTEN     
tcp6       0      0 [::]:10001              [::]:*                  LISTEN     
tcp6       0      0 [::]:44884              [::]:*                  LISTEN     
tcp6       0      0 [::]:50965              [::]:*                  LISTEN     
tcp6       0      0 [::]:ssh                [::]:*                  LISTEN     
tcp6       0      0 cluster-test1-m:omniorb [::]:*                  LISTEN     
tcp6       0      0 [::]:46745              [::]:*                  LISTEN     
tcp6       0      0 cluster-test1-m.c.:8030 [::]:*                  LISTEN     
tcp6       0      0 cluster-test1-m.c.:8031 [::]:*                  LISTEN     
tcp6       0      0 [::]:18080              [::]:*                  LISTEN     
tcp6       0      0 cluster-test1-m.c.:8032 [::]:*                  LISTEN     
tcp6       0      0 cluster-test1-m.c.:8033 [::]:*                  LISTEN     
tcp6       0      0 [::]:nfs                [::]:*                  LISTEN     
tcp6       0      0 [::]:33615              [::]:*                  LISTEN     
tcp6       0      0 [::]:56911              [::]:*                  LISTEN     
tcp6       0      0 [::]:sunrpc             [::]:*                  LISTEN  

Check out the directories.

cluster-test1-m:~$ hdfs dfs -ls /
17/09/12 12:12:24 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
Found 2 items
drwxrwxrwt   - mapred hadoop          0 2017-09-12 11:56 /tmp
drwxrwxrwt   - hdfs   hadoop          0 2017-09-12 11:55 /user
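Because the underlying storage is Cloud Storage, the same Hadoop CLI can also list a bucket directly through the GCS connector. A quick sketch; the bucket name below is hypothetical, so substitute the staging bucket Dataproc created for your cluster.

# List a Cloud Storage bucket with the Hadoop CLI via the GCS connector
hadoop fs -ls gs://my-dataproc-staging-bucket/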

There are a few UI screens available to check out the Hadoop cluster and job status.
HDFS NameNode (port 9870)

YARN Resource Manager (port 8088)

Spark Job History (port 18080)

The Dataproc approach is an easy way to create a Hadoop cluster. Although it is powerful, I miss a nice UI like Cloudera Manager. To install a Cloudera CDH cluster, I need to use a different approach, which I am going to discuss in a future blog.