Configurations after CDH Installation

In the last post, I discussed the steps to install a 3-node Hadoop cluster using Cloudera Manager. In the next few posts, I am going to discuss some frequently used technologies, such as Hive, Sqoop, Impala and Spark.

There are a few things that need to be configured after the CDH Installation.

1. Configure NTPD. Start the ntpd process on every host. Otherwise, Cloudera Manager may display a health check failure: The host’s NTP service did not respond to a request for the clock offset.
# service ntpd status
# service ntpd start
# chkconfig ntpd on
# chkconfig --list ntpd
# ntpdc -np
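
To confirm that time is in sync across the whole cluster, you can loop over the hosts with ssh; a minimal sketch, assuming the hosts are named vmhost1 through vmhost3 and root can ssh between them:

for h in vmhost1 vmhost2 vmhost3; do
  echo "=== $h ==="
  ssh root@$h 'service ntpd status; ntpdc -np | head -5'
done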

2. Configure Replication Factor. As my little cluster has only 2 DataNodes, I need to reduce the replication factor from the default value of 3 to 2 to avoid the annoying under-replicated blocks errors. First, run the following command to change the replication factor of existing files to 2.

hadoop fs -setrep -R 2 /

Then go to HDFS Configuration in Cloudera Manager and change Replication Factor to 2.
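
To verify the change, you can check the replication factor of a single file and look for under-replicated blocks across the cluster; a minimal sketch (the file path below is just an example):

$ hdfs dfs -stat %r /user/wzhou/somefile.txt
$ sudo -u hdfs hdfs fsck / | grep -i under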

3. Change message logging level from INFO to WARN. I cannot believe how many INFO messages are logged; there is no way I can see a message for more than 3 seconds before it is refreshed away by a flood of INFO messages. In my opinion, the majority of the INFO messages are useless and should not be logged in the first place; they seem more like DEBUG messages to me. So before my little cluster goes crazy logging tons of useless messages, I need to quickly change the logging level from INFO to WARN. Another painful thing is that there are many log files from the various Hadoop components, located in many different places. I feel like I am sitting in a space shuttle cockpit and need to turn off many switches that are not in one central location.
[Image: space shuttle cockpit]
I could track down each log configuration file and fix the parameters one by one, but that would take some time and be too painful. The easiest way I found is to use Cloudera Manager to make the change. Basically, type "logging level" as the search term; it will pop up a long list of components with their logging levels, and you can change them one by one. You will not believe how many logging level parameters are in the system. After the change, it is recommended to restart the cluster as certain parameters become stale.
[Screenshot: changing logging levels from INFO to WARN in Cloudera Manager]
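
If you would rather edit the log4j settings directly than click through Cloudera Manager, the same effect usually comes from raising the root logger threshold; a minimal, illustrative sketch of the relevant log4j.properties lines (on a CM-managed cluster these files are regenerated, so the supported route is the configuration search above):

hadoop.root.logger=WARN,console
log4j.rootLogger=${hadoop.root.logger}
log4j.logger.org.apache.hadoop.mapreduce=WARN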

4. Configure Hue’s superuser and password. From the Cloudera Manager screen, click Hue to open the Hue screen. The weird part about Hue is that there is no pre-set superuser for administration: whoever logs on to Hue first becomes the superuser. I don’t understand why Hue doesn’t just take whatever user and password Cloudera Manager uses. Anyway, to make my life easier, I just use the same login user and password as Cloudera Manager: admin.
[Screenshot: Hue initial login screen]
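
If you ever need to create another Hue superuser from the command line (for example, after the first-login account is lost), Hue is a Django application and ships the usual management commands; a hedged sketch, assuming a parcel-based install (the path differs for package installs, e.g. /usr/lib/hue):

# cd /opt/cloudera/parcels/CDH/lib/hue
# sudo -u hue ./build/env/bin/hue createsuperuser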

5. Add a new user.
By default the hdfs user, not root, is the superuser for HDFS. So before doing any work on Hadoop, it is a good idea to create a separate OS user instead of using the hdfs user to execute Hadoop commands. Run the following commands on EVERY host in the cluster.
a. Logon as root user.
b. Create bigdata group.
# groupadd bigdata
# grep bigdata /etc/group

c. Add the new user, wzhou.
# useradd -G bigdata -m wzhou

If the user existed before the bigdata group was created, do the following instead:
# usermod -a -G bigdata wzhou

d. Change password
# passwd wzhou

e. Verify the user.
# id wzhou

f. Create the user home directory on HDFS.
# sudo -u hdfs hdfs dfs -mkdir /user/wzhou
# sudo -u hdfs hdfs dfs -ls /user

[root@vmhost1 ~]# sudo -u hdfs hdfs dfs -ls /user
Found 8 items
drwxrwxrwx   - mapred hadoop              0 2015-09-15 05:40 /user/history
drwxrwxr-t   - hive   hive                0 2015-09-15 05:44 /user/hive
drwxrwxr-x   - hue    hue                 0 2015-09-15 10:12 /user/hue
drwxrwxr-x   - impala impala              0 2015-09-15 05:46 /user/impala
drwxrwxr-x   - oozie  oozie               0 2015-09-15 05:47 /user/oozie
drwxr-x--x   - spark  spark               0 2015-09-15 05:41 /user/spark
drwxrwxr-x   - sqoop2 sqoop               0 2015-09-15 05:42 /user/sqoop2
drwxr-xr-x   - hdfs   supergroup          0 2015-09-20 11:23 /user/wzhou

g. Change the ownership of the directory.
# sudo -u hdfs hdfs dfs -chown wzhou:bigdata /user/wzhou
# hdfs dfs -ls /user

[root@vmhost1 ~]# sudo -u hdfs hdfs dfs -chown wzhou:bigdata /user/wzhou
[root@vmhost1 ~]# sudo -u hdfs hdfs dfs -ls /user
Found 8 items
drwxrwxrwx   - mapred hadoop           0 2015-09-15 05:40 /user/history
drwxrwxr-t   - hive   hive             0 2015-09-15 05:44 /user/hive
drwxrwxr-x   - hue    hue              0 2015-09-15 10:12 /user/hue
drwxrwxr-x   - impala impala           0 2015-09-15 05:46 /user/impala
drwxrwxr-x   - oozie  oozie            0 2015-09-15 05:47 /user/oozie
drwxr-x--x   - spark  spark            0 2015-09-15 05:41 /user/spark
drwxrwxr-x   - sqoop2 sqoop            0 2015-09-15 05:42 /user/sqoop2
drwxr-xr-x   - wzhou  bigdata          0 2015-09-20 11:23 /user/wzhou

h. Run a sample test.
Log on as the wzhou user and verify that the user can run a sample MapReduce job from hadoop-mapreduce-examples.jar.

[wzhou@vmhost1 hadoop-mapreduce]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 1000000
Number of Maps  = 10
Samples per Map = 1000000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
15/09/20 11:32:28 INFO client.RMProxy: Connecting to ResourceManager at vmhost1.local/192.168.56.71:8032
15/09/20 11:32:29 INFO input.FileInputFormat: Total input paths to process : 10
15/09/20 11:32:29 INFO mapreduce.JobSubmitter: number of splits:10
15/09/20 11:32:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1442764085933_0001
15/09/20 11:32:30 INFO impl.YarnClientImpl: Submitted application application_1442764085933_0001
15/09/20 11:32:30 INFO mapreduce.Job: The url to track the job: http://vmhost1.local:8088/proxy/application_1442764085933_0001/
15/09/20 11:32:30 INFO mapreduce.Job: Running job: job_1442764085933_0001
15/09/20 11:32:44 INFO mapreduce.Job: Job job_1442764085933_0001 running in uber mode : false
15/09/20 11:32:44 INFO mapreduce.Job:  map 0% reduce 0%
15/09/20 11:32:55 INFO mapreduce.Job:  map 10% reduce 0%
15/09/20 11:33:03 INFO mapreduce.Job:  map 20% reduce 0%
15/09/20 11:33:11 INFO mapreduce.Job:  map 30% reduce 0%
15/09/20 11:33:18 INFO mapreduce.Job:  map 40% reduce 0%
15/09/20 11:33:26 INFO mapreduce.Job:  map 50% reduce 0%
15/09/20 11:33:34 INFO mapreduce.Job:  map 60% reduce 0%
15/09/20 11:33:42 INFO mapreduce.Job:  map 70% reduce 0%
15/09/20 11:33:50 INFO mapreduce.Job:  map 80% reduce 0%
15/09/20 11:33:58 INFO mapreduce.Job:  map 90% reduce 0%
15/09/20 11:34:06 INFO mapreduce.Job:  map 100% reduce 0%
15/09/20 11:34:14 INFO mapreduce.Job:  map 100% reduce 100%
15/09/20 11:34:14 INFO mapreduce.Job: Job job_1442764085933_0001 completed successfully
15/09/20 11:34:15 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=124
		FILE: Number of bytes written=1258521
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=2680
		HDFS: Number of bytes written=215
		HDFS: Number of read operations=43
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Job Counters 
		Launched map tasks=10
		Launched reduce tasks=1
		Data-local map tasks=10
		Total time spent by all maps in occupied slots (ms)=65668
		Total time spent by all reduces in occupied slots (ms)=6387
		Total time spent by all map tasks (ms)=65668
		Total time spent by all reduce tasks (ms)=6387
		Total vcore-seconds taken by all map tasks=65668
		Total vcore-seconds taken by all reduce tasks=6387
		Total megabyte-seconds taken by all map tasks=67244032
		Total megabyte-seconds taken by all reduce tasks=6540288
	Map-Reduce Framework
		Map input records=10
		Map output records=20
		Map output bytes=180
		Map output materialized bytes=360
		Input split bytes=1500
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=360
		Reduce input records=20
		Reduce output records=0
		Spilled Records=40
		Shuffled Maps =10
		Failed Shuffles=0
		Merged Map outputs=10
		GC time elapsed (ms)=1026
		CPU time spent (ms)=8090
		Physical memory (bytes) snapshot=3877482496
		Virtual memory (bytes) snapshot=17644212224
		Total committed heap usage (bytes)=3034685440
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=1180
	File Output Format Counters 
		Bytes Written=97
Job Finished in 106.368 seconds
Estimated value of Pi is 3.14158440000000000000
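
Beyond the pi job, any of the other bundled examples works as a quick smoke test; a minimal wordcount sketch (the input and output paths below are just examples):

$ hdfs dfs -mkdir -p /user/wzhou/wc_in
$ hdfs dfs -put /etc/hosts /user/wzhou/wc_in
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/wzhou/wc_in /user/wzhou/wc_out
$ hdfs dfs -cat /user/wzhou/wc_out/part-r-00000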

To restart all services in the cluster, you can just click the Restart action on the cluster from the Cloudera Manager screen. However, if you want to start or stop a particular service, you need to know the dependencies between the services. Here is the start/stop order for all services on CDH 5.

Startup Sequence
1. Cloudera Management service
2. ZooKeeper
3. HDFS
4. Solr
5. Flume
6. HBase
7. Key-Value Store Indexer
8. MapReduce or YARN
9. Hive
10. Impala
11. Oozie
12. Sqoop
13. Hue

Stop Sequence
1. Hue
2. Sqoop
3. Oozie
4. Impala
5. Hive
6. MapReduce or YARN
7. Key-Value Store Indexer
8. HBase
9. Flume
10. Solr
11. HDFS
12. ZooKeeper
13. Cloudera Management Service
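
If you only need to bounce a single service in the right order, the Cloudera Manager REST API can do it from the command line as well; a hedged sketch using curl (the API version, cluster name and service name below are assumptions for illustration):

$ curl -u admin:admin -X POST 'http://vmhost1.local:7180/api/v10/clusters/Cluster%201/services/hive/commands/start'
$ curl -u admin:admin -X POST 'http://vmhost1.local:7180/api/v10/clusters/Cluster%201/services/hive/commands/stop'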

Ok, we are good here. In the next post, I am going to discuss loading data into a Hive table.
