Install Cloudera Hadoop Cluster using Cloudera Manager

Three years ago I tried to build a Hadoop cluster using Cloudera Manager. The GUI looked nice, but the installation was painful and full of issues. I gave up after many failed attempts and went with a manual installation instead. It worked fine, and I have built several clusters that way since then. After several years working on Oracle Exadata, I came back to retry the Hadoop installation using Cloudera Manager, this time with a CDH 5 cluster. The installation experience was much better than three years ago. Not surprisingly, the installation still has some issues, and I could easily spot a few bugs along the way. But at least I managed to install a 3-node Hadoop cluster after several tries. The following are my steps during the installation.

First, let me give a little detail about my VM environment. I am using VirtualBox and built three VMs.
vmhost1: This is where the NameNode, Cloudera Manager and many other roles are located.
vmhost2: Data Node
vmhost3: Data Node

Note: the default replication factor for Hadoop is 3. With only two Data Nodes, blocks are under-replicated in my environment, so I had to adjust the replication factor from 3 to 2 after installation, just to get rid of some annoying alerts.
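The replication change can be sketched from the command line as well. This is a minimal sketch, not the exact procedure used here: pick_replication is a hypothetical helper, and it assumes the default for new files (dfs.replication) is changed separately in the HDFS configuration in Cloudera Manager. The script only prints the hdfs command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: pick a replication factor the cluster can satisfy.
# $1 = number of Data Nodes, $2 = desired factor (the HDFS default is 3).
pick_replication() {
    datanodes=$1
    desired=${2:-3}
    if [ "$datanodes" -lt "$desired" ]; then
        echo "$datanodes"
    else
        echo "$desired"
    fi
}

# Two Data Nodes (vmhost2 and vmhost3), so the factor drops from 3 to 2.
factor=$(pick_replication 2 3)

# Files already in HDFS keep the factor they were written with.
# Printed only: run it as the HDFS superuser to rewrite existing files.
# -w waits until the new replication level has been reached.
echo "hdfs dfs -setrep -w $factor /"
```

With a directory as the path, setrep applies to every file underneath it, so / covers the whole file system.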

  • OS: Oracle Linux 6.7, 64-bit
  • CPU: 1 CPU initially for all 3 VMs. Then I realized vmhost1 needs a lot of processing power, as the majority of the installation and configuration happens on node 1, so I gave vmhost1 2 CPUs. That proved still not enough, and vmhost1 tended to freeze after installation. After I bumped it up to 4 CPUs, vmhost1 looked fine. 1 CPU is enough for a Data Node host.
  • Memory: Initially I gave 3G to all 3 VMs, then bumped node 1 up to 5G before installation. That proved still not enough. After bumping vmhost1 up to 7G, the VM stopped freezing. Memory usage there is around 6.2G, so 7G is a good configuration. After installation, I reduced each Data Node's memory to 2G to free some memory. If not many jobs are running, memory usage is less than 1G on a Data Node. For just testing out Hadoop configuration, I can reduce the memory further to 1.5G per Data Node.
  • Network: Although I have 3 network adapters built into each VM, I actually use only two of them. One is configured as Internal Network, and this is what my cluster VMs use to communicate with each other. The other is configured as NAT, just to get an internet connection to download packages from the Cloudera site.
  • Storage: 30G. The actual size after installation is about 10~12G, depending on how many times the installation fails and is retried. A clean installation uses about 10G of space.
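The final CPU and memory sizes can also be set from the host's command line while the VMs are powered off. This is a small sketch that only prints the VBoxManage commands. It assumes the VM names registered in VirtualBox match the hostnames above, and vm_sizing_cmd is a hypothetical helper name.

```shell
#!/bin/sh
# Print (not run) the VBoxManage command that sets CPU count and memory.
# Assumption: the VirtualBox VM names are the same as the hostnames.
# $1 = VM name, $2 = number of CPUs, $3 = memory in MB.
vm_sizing_cmd() {
    name=$1
    cpus=$2
    mem_mb=$3
    echo "VBoxManage modifyvm $name --cpus $cpus --memory $mem_mb"
}

# The sizes that ended up working: 4 CPUs and 7G for vmhost1,
# 1 CPU and 2G for each Data Node.
vm_sizing_cmd vmhost1 4 7168
vm_sizing_cmd vmhost2 1 2048
vm_sizing_cmd vmhost3 1 2048
```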

Pre-Steps Before the Installation

Before doing the installation, make sure to configure the following on the VMs:
1. Set the SELinux policy to disabled. Modify the following parameter in the /etc/selinux/config file.
SELINUX=disabled

2. Disable firewall.
chkconfig iptables off

3. Set swappiness to 0 in the /etc/sysctl.conf file. The latest Cloudera CDH releases actually recommend a non-zero value, like 10. But for my little test, I set it to 0 like many people do.
vm.swappiness = 0

4. Disable IPv6 in the /etc/sysctl.conf file.
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.all.disable_ipv6 = 1

5. Configure passwordless SSH for the root user. This is a common step in Oracle RAC installations, so I do not repeat the steps here.
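Pre-steps 1, 3 and 4 are all edits to key/value files, so they can be scripted once and repeated on each VM. The following is a minimal sketch under a few assumptions: set_kv and apply_presteps are hypothetical helper names, and the script only touches scratch files so it is safe to try before pointing it at the real configuration.

```shell
#!/bin/sh
# Set a key in a config file: rewrite the line if the key is already present,
# append it otherwise. $4 is the separator (" = " suits sysctl.conf).
set_kv() {
    file=$1
    key=$2
    value=$3
    sep=${4:-" = "}
    if grep -q "^[[:space:]]*$key[[:space:]]*=" "$file"; then
        sed "s|^[[:space:]]*$key[[:space:]]*=.*|$key$sep$value|" "$file" > "$file.tmp"
        cat "$file.tmp" > "$file"   # overwrite in place, keeps owner and mode
        rm -f "$file.tmp"
    else
        echo "$key$sep$value" >> "$file"
    fi
}

# Pre-steps 1, 3 and 4. $1 = sysctl.conf path, $2 = SELinux config path.
apply_presteps() {
    set_kv "$2" SELINUX disabled "="
    set_kv "$1" vm.swappiness 0
    set_kv "$1" net.ipv6.conf.default.disable_ipv6 1
    set_kv "$1" net.ipv6.conf.all.disable_ipv6 1
}

# Demonstration on scratch files with made-up starting content.
scratch=$(mktemp -d)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$scratch/selinux"
printf 'vm.swappiness = 60\n' > "$scratch/sysctl.conf"
apply_presteps "$scratch/sysctl.conf" "$scratch/selinux"
cat "$scratch/selinux" "$scratch/sysctl.conf"
rm -rf "$scratch"
```

On a real VM the call would be apply_presteps /etc/sysctl.conf /etc/selinux/config as root, followed by sysctl -p to load the new kernel settings. The SELinux change takes effect after a reboot.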

OK, ready for the installation. Here are the steps.
1. Download and Run the Cloudera Manager Server Installer
Log on as the root user on vmhost1. All of the installation is done as root.
Download cloudera-manager-installer.bin from the Cloudera site, then run the following commands.

chmod u+x cloudera-manager-installer.bin
./cloudera-manager-installer.bin

The installer's first screen pops up. Just click Next or Yes on the rest of the screens.

If successful, you will see the following screen.

After you click Close, a browser window pops up and points to http://localhost:7180/. At this point, you can click the Finish button on the installation GUI and close it. Then move to the browser and wait for Cloudera Manager to start up. Note: it usually takes several minutes, so be patient.
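Instead of reloading the browser by hand, the wait can be scripted. This is a minimal sketch: wait_until and cm_up are hypothetical helper names, and it assumes curl is installed and that any HTTP answer on port 7180 means Cloudera Manager has started.

```shell
#!/bin/sh
# Run a probe command every $2 seconds until it succeeds or $1 seconds pass.
# Returns 0 on success, 1 on timeout.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    waited=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Probe: curl exits non-zero while nothing answers on port 7180.
cm_up() {
    curl -s -o /dev/null http://localhost:7180/
}
```

For example, wait_until 600 10 cm_up && echo "Cloudera Manager is up" checks every 10 seconds for up to 10 minutes.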

2. Logon Screen
After the login screen shows up, log on as the admin user, with admin as the password as well.

3. Choose Version
The next screen is for choosing which edition to use. The default option is Cloudera Enterprise Data Hub Edition Trial, but it has a 60-day limit. Although Cloudera Express has no time limit, the Express edition lacks a lot of features I would like to test out. So I went with the Enterprise 60-day trial.

4. Thank You Screen
Click Continue for the next Thank You screen.

5. Host Screen
Input vmhost[1-3].local, then click New Search. Note: make sure to use the FQDN. I had bad experiences with not using FQDNs in old versions of the CDH installation, and I am not going to waste my time finding out what happens without them.

After the following screen shows up, click New Search and the 3 hosts show up. Then click Continue.
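The host pattern and the FQDN requirement can be checked before typing anything into the search box. The sketch below uses two hypothetical helpers, expand_hosts and is_fqdn, and the FQDN test is deliberately rough: it only looks for a dot in the name.

```shell
#!/bin/sh
# Expand a pattern like vmhost[1-3].local into individual host names.
# $1 = prefix, $2 = first index, $3 = last index, $4 = domain suffix.
expand_hosts() {
    prefix=$1
    first=$2
    last=$3
    suffix=$4
    i=$first
    while [ "$i" -le "$last" ]; do
        echo "$prefix$i$suffix"
        i=$((i + 1))
    done
}

# Rough FQDN test: the name must contain at least one dot.
is_fqdn() {
    case $1 in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}

for h in $(expand_hosts vmhost 1 3 .local); do
    if is_fqdn "$h"; then
        echo "$h looks fully qualified"
    else
        echo "$h is a short name"
    fi
done
```

On each VM, hostname -f should print the same name that the loop prints for that host.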

6. Select Repository
On the Select Repository screen, the default option is to use Parcels. Unfortunately I had an issue using Parcels during the installation. It passed the installation step on all 3 hosts, but then got stuck downloading the latest Parcel file. After looking around, it seemed the issue was that the default release was the September version, while the latest Parcel was still pointing to the older August release. It looked like a version mismatch to me. Anyway, I am going to try out the Parcels option again in the future, but for this installation I changed to Packages. I intentionally did not choose the latest version, CDH 5.4.5. I would rather go with a version that was followed by a long gap before the next release. For example, there is about a one-month gap between CDH 5.4.3 and CDH 5.4.4. If 5.4.3 were not stable, Cloudera would have put out a new release a few days later rather than waiting a month. So I went with CDH 5.4.3.
Make sure to choose 5.4.3 for Navigator Key Trustee as well.

7. Java Installation
For Java installation, leave the box unchecked, which is the default, and click Continue.

8. Single User
For Enable Single User Mode, I did NOT check Single User Mode, as I want a cluster installation.

9. SSH Login Credentials
For SSH Login Credentials, input the root password. For Number of Simultaneous Installations, the default value is 10. This created a lot of headaches during my installation. Each host downloads its own copy of the packages from the Cloudera website. As my three VMs were fighting each other for the internet bandwidth of my host machine, a VM could sit there for several minutes waiting to download the next package. If the wait is more than 30 seconds, Cloudera Manager times out the installation for that host and marks it as failed. I am fine with the timeout, but not happy with what happens next. After I click Retry Failed Hosts, it rolls back the packages already installed on that VM and restarts from scratch on the next try. It could take hours before I got back to the same point. A more elegant approach would be to download once on one host and distribute the files to the other hosts for installation, and after a failure to retry from the failing point. Although the total download is only a few GB per host, the failed retries can easily make it 10GB per host. So I had to set Number of Simultaneous Installations to 1, limiting the installation to one VM at a time, to reduce my failure rate.
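The "retry from the failing point" idea can be illustrated for downloads done by hand. This is only a sketch of the general technique, not something Cloudera Manager offers: retry is a hypothetical helper that reruns a command a limited number of times.

```shell
#!/bin/sh
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as one attempt succeeds, 1 if all of them fail.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $n of $attempts failed" >&2
        n=$((n + 1))
        if [ "$n" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

Paired with a resumable downloader, each attempt continues the partial file instead of starting over, for example retry 5 30 wget -c "$PACKAGE_URL", where PACKAGE_URL is a placeholder for a file fetched manually.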

10. Installation
The majority of the installation time is spent here if you go with the Packages option. With the Parcels option, this step is very fast because the majority of the downloads happen on a different screen. The time spent in this step really depends on the following factors:
1. How fast your internet connection is. The faster, the better.
2. The download speed from the Cloudera site. Although my internet download speed can easily reach 12M per second, my actual download speed from Cloudera varied depending on the time of day. The majority of the time it was around 1~2M per second. Not great, but manageable. But sometimes it dropped to 100K per second, and that is when I had a higher chance of hitting the timeout and failing the installation. At one point I could not tolerate this any more, so I woke up at 2am and began my installation process. It was much faster. I could get peak download speeds of 10M per second, with about 4~7M on average, and saw only a few timeout failures on one host.
3. How many times the installation times out and has to be retried.
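To get a feel for how much the download speed matters, the time per host can be estimated with simple arithmetic. The 3 GB payload below is an assumption for illustration (the download is only described as "a few GB" per host), and eta_minutes is a hypothetical helper.

```shell
#!/bin/sh
# Whole minutes needed to download $1 MB at $2 KB per second (integer math).
eta_minutes() {
    size_mb=$1
    rate_kb=$2
    echo $(( size_mb * 1024 / rate_kb / 60 ))
}

# Assumed payload: 3 GB (3072 MB) per host.
# Rates: 100K, 1.5M and 5M per second, expressed in KB per second.
for rate in 100 1536 5120; do
    echo "$rate KB/s: $(eta_minutes 3072 "$rate") minutes"
done
```

Under that assumption, 100K per second works out to 524 minutes, which is close to nine hours for one host, while 1.5M per second takes about 34 minutes.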

If successful, the following screen shows.

11. Detect Version
After a successful installation, it shows the version screen.

12. Finish Screen
Finally, I can see this Finish screen. Life is good? Wrong! See my comment in the Cluster Setup step.

13. Cluster Setup
When I reached this step, I knew I was almost done. Just a few more steps, less than 30 minutes of work. After a long day, I went for dinner, planning to resume my configuration later. It proved to be the most expensive mistake I made during this installation. After dinner, I went back to the same screen and clicked Continue. It showed a Session Time Out error. Not a big deal, I thought, as the background process should know where I was in the installation. I opened the browser and typed in the URL, http://localhost:7180. Guess what: not the Cluster Setup screen, but the screen at step 4. I tried many ways and could not find a workaround. Out of ideas, I had to reinstall from step 4. What a pain! Another 7~8 hours of work. On my next installation I did not waste any time at this step and completed it as quickly as possible.

OK, back to this screen. I want to use both Impala and Spark, and could not find a combination that includes both except All Services. So I chose Custom Services and picked mainly the Core services plus Impala and Spark. Make sure to check Include Cloudera Navigator.

14. Role Assignment
I kept the defaults and clicked Continue.

15. Database Setup
Choose the default. Make sure to click Test Connection before clicking Continue.

16. Review
Click Continue.

17. Completion
It shows the progress during the setup.

Finally it shows the real completion screen.

After clicking Finish, you should see a screen similar to the following.
Life is good now. The powerful Cloudera Manager has many more nice features than three years ago. It was really worth my effort to go through the installation.