There are many ways to create Hadoop clusters, and I am going to show a few of them on Google Cloud Platform (GCP). The first approach is the standard way to build a Hadoop cluster, whether in the cloud or on-premises: create a group of VM instances and manually install Hadoop on them. Many blogs and articles already cover this approach, so I am not going to repeat the steps here.
In this blog, I am going to discuss the approach using Google Cloud Dataproc, which can have a Hadoop cluster up and running within 2 minutes. Google Cloud Dataproc is a fully managed cloud service for running Apache Hadoop clusters in a simple and fast way. The following steps show how to create a Hadoop cluster and submit a Spark job to it.
Create a Hadoop Cluster
Click Dataproc -> Clusters
Then click Enable API
The Cloud Dataproc screen shows up. Click Create cluster
Input the following parameters:
Name : cluster-test1
Region : Choose us-central1
Zone : Choose us-central1-c
1. Master Node
Machine Type : The default is n1-standard-4, but I choose n1-standard-1 because this is only a simple test.
Cluster Mode : There are 3 modes here: Single Node (1 master, 0 workers), Standard (1 master, N workers), and High Availability (3 masters, N workers). Choose Standard.
Primary disk size : For my testing, 10 GB is enough.
2. Worker Nodes
The configuration is similar to the master node. I use 3 worker nodes with a disk size of 15 GB each. You might notice that there is an option to use local SSD storage. You can attach up to 8 local SSD devices to a VM instance. Each disk is 375 GB in size, so you cannot specify a 10 GB disk size here. Local SSDs are physically attached to the host server and offer higher performance and lower latency than Google’s persistent disk storage. They are used for temporary data such as shuffle data in MapReduce, and the data on them is not persistent. For more information, please visit https://cloud.google.com/compute/docs/disks/local-ssd.
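To make the local SSD sizing concrete, here is a small sketch of the arithmetic. It only uses the two figures quoted above (375 GB per device, at most 8 devices per VM instance); the function name is my own and is not part of any Google API.

```python
# Figures taken from the paragraph above; they may differ on newer machine types.
LOCAL_SSD_SIZE_GB = 375
MAX_LOCAL_SSDS_PER_VM = 8


def local_ssd_capacity_gb(num_disks):
    """Return the total local SSD capacity in GB for a given number of devices."""
    if not 0 <= num_disks <= MAX_LOCAL_SSDS_PER_VM:
        raise ValueError(
            f"num_disks must be between 0 and {MAX_LOCAL_SSDS_PER_VM}, got {num_disks}"
        )
    # Capacity only comes in whole 375 GB steps, so a 10 GB local SSD is not possible.
    return num_disks * LOCAL_SSD_SIZE_GB


if __name__ == "__main__":
    for n in (1, 4, 8):
        print(n, "local SSD(s):", local_ssd_capacity_gb(n), "GB")
```

So a worker with the maximum of 8 devices gets 3000 GB of temporary storage, and the smallest non-zero choice is 375 GB.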
Another thing to mention is that Dataproc clusters come with the Cloud Storage connector, so jobs can keep their data in a Cloud Storage bucket instead of HDFS. HDFS is still available on the cluster disks and the hadoop command still works, so you won’t feel anything different, but the underlying storage is different. In my opinion, it is actually better because Google Cloud Storage offers much better reliability and scalability than HDFS.
Click Create when everything is done. After a few minutes, the cluster is created.
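The same cluster can also be created from the command line instead of the console. The following is a minimal sketch that only builds the gcloud command, assuming the gcloud CLI is installed and authenticated. The helper name and its mode argument are my own; the values are the ones entered in the console above. I did not run the cluster this way in this post, so treat it as a starting point.

```python
import shlex


def build_create_cluster_command(
    name="cluster-test1",
    region="us-central1",
    zone="us-central1-c",
    master_machine_type="n1-standard-1",
    master_disk_gb=10,
    num_workers=3,
    worker_disk_gb=15,
    mode="standard",
):
    """Build the argv list for `gcloud dataproc clusters create`.

    mode is a hypothetical convenience argument: "single", "standard" or "ha",
    matching the three cluster modes offered in the console.
    """
    cmd = [
        "gcloud", "dataproc", "clusters", "create", name,
        "--region", region,
        "--zone", zone,
        "--master-machine-type", master_machine_type,
        "--master-boot-disk-size", f"{master_disk_gb}GB",
    ]
    if mode == "single":
        # 1 master, 0 workers: worker flags are left out.
        cmd.append("--single-node")
    elif mode in ("standard", "ha"):
        if mode == "ha":
            # High availability uses 3 masters.
            cmd += ["--num-masters", "3"]
        cmd += [
            "--num-workers", str(num_workers),
            "--worker-boot-disk-size", f"{worker_disk_gb}GB",
        ]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return cmd


if __name__ == "__main__":
    # Print the command for review; pass the list to subprocess.run to execute it.
    print(shlex.join(build_create_cluster_command()))
```

Printing the command first makes it easy to double check the region, zone and disk sizes before anything is billed.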
Click cluster-test1 and it should show the cluster information.
If you click the VM Instances tab, you can see there is one master and 3 worker instances.
Click the Configuration tab. It shows all the configuration information.
Submit a Spark Job
Click Cloud Dataproc -> Jobs.
Once the Submit Job screen shows up, input the following information, then click Submit.
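For reference, a Spark job can also be submitted with the gcloud CLI. The sketch below only builds the command. The job details are assumptions for illustration, because the values entered on the Submit Job screen are not listed in the text: I use the SparkPi example class, and the jar path is the location where Dataproc images usually place the Spark examples jar, which may differ on your image.

```python
import shlex


def build_submit_spark_command(cluster, region, main_class, jar_uris, job_args=()):
    """Build the argv list for `gcloud dataproc jobs submit spark`."""
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        "--cluster", cluster,
        "--region", region,
        "--class", main_class,
        "--jars", ",".join(jar_uris),
    ]
    if job_args:
        # Everything after "--" is passed to the Spark application itself.
        cmd.append("--")
        cmd.extend(str(arg) for arg in job_args)
    return cmd


if __name__ == "__main__":
    # Assumed example job: SparkPi with 1000 partitions.
    command = build_submit_spark_command(
        cluster="cluster-test1",
        region="us-central1",
        main_class="org.apache.spark.examples.SparkPi",
        jar_uris=["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        job_args=[1000],
    )
    print(shlex.join(command))
```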
After the job completes, you should see the following:
To verify the result, I need to SSH into the master node and find out which ports are listening for connections. Click the drop-down to the right of SSH for the master node, then click Open in browser window.
Then run the netstat command.
cluster-test1-m:~$ netstat -a |grep LISTEN |grep tcp
tcp 0 0 *:10033 *:* LISTEN
tcp 0 0 *:10002 *:* LISTEN
tcp 0 0 cluster-test1-m.c.:8020 *:* LISTEN
tcp 0 0 *:33044 *:* LISTEN
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 *:52888 *:* LISTEN
tcp 0 0 *:58266 *:* LISTEN
tcp 0 0 *:35738 *:* LISTEN
tcp 0 0 *:9083 *:* LISTEN
tcp 0 0 *:34238 *:* LISTEN
tcp 0 0 *:nfs *:* LISTEN
tcp 0 0 cluster-test1-m.c:10020 *:* LISTEN
tcp 0 0 localhost:mysql *:* LISTEN
tcp 0 0 *:9868 *:* LISTEN
tcp 0 0 *:9870 *:* LISTEN
tcp 0 0 *:sunrpc *:* LISTEN
tcp 0 0 *:webmin *:* LISTEN
tcp 0 0 cluster-test1-m.c:19888 *:* LISTEN
tcp6 0 0 [::]:10001 [::]:* LISTEN
tcp6 0 0 [::]:44884 [::]:* LISTEN
tcp6 0 0 [::]:50965 [::]:* LISTEN
tcp6 0 0 [::]:ssh [::]:* LISTEN
tcp6 0 0 cluster-test1-m:omniorb [::]:* LISTEN
tcp6 0 0 [::]:46745 [::]:* LISTEN
tcp6 0 0 cluster-test1-m.c.:8030 [::]:* LISTEN
tcp6 0 0 cluster-test1-m.c.:8031 [::]:* LISTEN
tcp6 0 0 [::]:18080 [::]:* LISTEN
tcp6 0 0 cluster-test1-m.c.:8032 [::]:* LISTEN
tcp6 0 0 cluster-test1-m.c.:8033 [::]:* LISTEN
tcp6 0 0 [::]:nfs [::]:* LISTEN
tcp6 0 0 [::]:33615 [::]:* LISTEN
tcp6 0 0 [::]:56911 [::]:* LISTEN
tcp6 0 0 [::]:sunrpc [::]:* LISTEN
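A listing like this is easier to scan once it is reduced to the ports alone. Here is a minimal Python sketch that pulls the listening ports out of netstat output; the function name is my own, and the sample lines are copied from the output above. Note that netstat prints a service name instead of a number when the port appears in /etc/services, so the omniorb entry is most likely port 8088 (an assumption on my part); running netstat with the -n option prints numbers only.

```python
def listening_ports(netstat_output):
    """Return the set of listening TCP ports (or service names) in netstat output."""
    ports = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local address, foreign address, state.
        if len(fields) < 6 or not fields[0].startswith("tcp") or fields[-1] != "LISTEN":
            continue
        # Split on the last colon so IPv6 addresses such as [::]:18080 work too.
        _, separator, port = fields[3].rpartition(":")
        if separator:
            ports.add(port)
    return ports


SAMPLE = """\
tcp 0 0 cluster-test1-m.c.:8020 *:* LISTEN
tcp 0 0 *:9870 *:* LISTEN
tcp6 0 0 cluster-test1-m:omniorb [::]:* LISTEN
tcp6 0 0 [::]:18080 [::]:* LISTEN
"""

if __name__ == "__main__":
    print(sorted(listening_ports(SAMPLE)))
```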
Check out the top-level directories.
cluster-test1-m:~$ hdfs dfs -ls /
17/09/12 12:12:24 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
Found 2 items
drwxrwxrwt - mapred hadoop 0 2017-09-12 11:56 /tmp
drwxrwxrwt - hdfs hadoop 0 2017-09-12 11:55 /user
There are a few UI screens available to check out the Hadoop cluster and job status.
HDFS NameNode (port 9870)
YARN Resource Manager (port 8088)
Spark Job History (port 18080)
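These ports are normally not reachable from outside the cluster network, so one common option is to open an SSH SOCKS tunnel to the master node and browse through it. The sketch below builds the UI URLs and a tunnel command. The helper names and the local port 1080 are my own choices, and the master host name assumes the default Dataproc naming of the cluster name followed by -m.

```python
import shlex

# Ports listed above.
WEB_UI_PORTS = {
    "HDFS NameNode": 9870,
    "YARN Resource Manager": 8088,
    "Spark Job History": 18080,
}


def web_ui_urls(master_host):
    """Map each UI name to its URL on the master node."""
    return {name: f"http://{master_host}:{port}" for name, port in WEB_UI_PORTS.items()}


def socks_tunnel_command(master_host, zone, local_port=1080):
    """Build a `gcloud compute ssh` command that opens a SOCKS proxy on localhost."""
    return [
        "gcloud", "compute", "ssh", master_host,
        "--zone", zone,
        "--",             # the remaining flags go to ssh itself
        "-D", str(local_port),
        "-N",             # do not start a remote shell, only forward traffic
    ]


if __name__ == "__main__":
    print(shlex.join(socks_tunnel_command("cluster-test1-m", "us-central1-c")))
    for name, url in web_ui_urls("cluster-test1-m").items():
        print(f"{name}: {url}")
```

With the tunnel running, point a browser at the SOCKS proxy on localhost:1080 and open the printed URLs.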
Dataproc is an easy way to deploy a Hadoop cluster. Although it is powerful, I miss a nice UI like Cloudera Manager. To install a Cloudera CDH cluster, I need to use a different approach, which I am going to discuss in a future blog post.