Create Hadoop Cluster on Google Cloud Platform

There are many ways to create Hadoop clusters and I am going to show a few ways on Google Cloud Platform (GCP). The first approach is the standard way to build a Hadoop cluster, no matter whether you do it on cloud or on-premise. Basically create a group of VM instances and manually install Hadoop cluster on these VM instances. Many people have blogs or articles about this approach and I am not going to repeat the steps here.

In this blog, I am going to discuss the approach using Google Cloud Dataproc and you can actually have a Hadoop cluster up and running
within 2 minutes. Google Cloud Dataproc is a fully-managed cloud service for running Apache Hadoop cluster in a simple and fast way. The followings show the steps to create a Hadoop Cluster and submit a spark job to the cluster.

Create a Hadoop Cluster
Click Dataproc -> Clusters

Then click Enable API

Cloud Dataproc screen shows up. Click Create cluster

Input the following parameters:
Name : cluster-test1
Region : Choose use-central1
Zone : Choose us-central1-c

1. Master Node
Machine Type : The default is n1-standard-4, but I choose n1-standard-1 just for simple testing purpose.
Cluster Mode : There are 3 modes here. Single Mode (1 master, 0 worker), Standard Mode (1 master, N worker), and High Mode (3 masters, N workers). Choose Standard Mode.
Primary disk size : For my testing, 10GB 1s enough.

2. Worker Nodes
Similar configuration like Worker node. I use 3 worker nodes and disk size is 15 GB. You might notice that there is option to use local SSD storage. You can attach up to 8 local SSD devices to the VM instance. Each disk is 375 GB in size and you can not specify 10GB disk size here. The local SSDs are physically attached to the host server and offer higher performance and lower latency storage than Google’s persistent disk storage. The local SSDs is used for temporary data like shuffling data in MapReduce. The data on the local SSD storage is not persistent. For more information, please visit

Another thing to mention is that Dataproc uses Cloud Storage bucket instead of HDFS for the Hadoop cluster. Although the hadoop command is still working and you won’t feel anything different, the underline storage is different. In my opinion, it is actually better because Google Cloud Storage bucket definitely has much better reliability and scalability than HDFS.

Click Create when everything is done. After a few minutes, the cluster is created.

Click cluster-test1 and it should show the cluster information.

If click VM Instances tab, we can see there is one master and 3 worker instances.

Click Configuration tab. It shows all configuration information.

Submit a Spark Job
Click Cloud Dataproc -> Jobs.

Once Submit Job screen shows up, input the following information, then click Submit.

After the job completes, you should see the followings:

To verify the result, I need to ssh to the master node to find out which ports are listening for connections. Click the drop down on the right of SSH of master node, then click Open in browser window.

Then run the netstat command.

cluster-test1-m:~$ netstat -a |grep LISTEN |grep tcp
tcp        0      0 *:10033                 *:*                     LISTEN     
tcp        0      0 *:10002                 *:*                     LISTEN     
tcp        0      0 cluster-test1-m.c.:8020 *:*                     LISTEN     
tcp        0      0 *:33044                 *:*                     LISTEN     
tcp        0      0 *:ssh                   *:*                     LISTEN     
tcp        0      0 *:52888                 *:*                     LISTEN     
tcp        0      0 *:58266                 *:*                     LISTEN     
tcp        0      0 *:35738                 *:*                     LISTEN     
tcp        0      0 *:9083                  *:*                     LISTEN     
tcp        0      0 *:34238                 *:*                     LISTEN     
tcp        0      0 *:nfs                   *:*                     LISTEN     
tcp        0      0 cluster-test1-m.c:10020 *:*                     LISTEN     
tcp        0      0 localhost:mysql         *:*                     LISTEN     
tcp        0      0 *:9868                  *:*                     LISTEN     
tcp        0      0 *:9870                  *:*                     LISTEN     
tcp        0      0 *:sunrpc                *:*                     LISTEN     
tcp        0      0 *:webmin                *:*                     LISTEN     
tcp        0      0 cluster-test1-m.c:19888 *:*                     LISTEN     
tcp6       0      0 [::]:10001              [::]:*                  LISTEN     
tcp6       0      0 [::]:44884              [::]:*                  LISTEN     
tcp6       0      0 [::]:50965              [::]:*                  LISTEN     
tcp6       0      0 [::]:ssh                [::]:*                  LISTEN     
tcp6       0      0 cluster-test1-m:omniorb [::]:*                  LISTEN     
tcp6       0      0 [::]:46745              [::]:*                  LISTEN     
tcp6       0      0 cluster-test1-m.c.:8030 [::]:*                  LISTEN     
tcp6       0      0 cluster-test1-m.c.:8031 [::]:*                  LISTEN     
tcp6       0      0 [::]:18080              [::]:*                  LISTEN     
tcp6       0      0 cluster-test1-m.c.:8032 [::]:*                  LISTEN     
tcp6       0      0 cluster-test1-m.c.:8033 [::]:*                  LISTEN     
tcp6       0      0 [::]:nfs                [::]:*                  LISTEN     
tcp6       0      0 [::]:33615              [::]:*                  LISTEN     
tcp6       0      0 [::]:56911              [::]:*                  LISTEN     
tcp6       0      0 [::]:sunrpc             [::]:*                  LISTEN  

Check out directories.

cluster-test1-m:~$ hdfs dfs -ls /
17/09/12 12:12:24 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
Found 2 items
drwxrwxrwt   - mapred hadoop          0 2017-09-12 11:56 /tmp
drwxrwxrwt   - hdfs   hadoop          0 2017-09-12 11:55 /user

There are a few UI screens available to check out the Hadoop cluster and job status.
HDFS NameNode (port 9870)

YARN Resource Manager (port 8088)

Spark Job History (port 18080)

Dataproc approach is an easy deployment tool to create a Hadoop cluster. Although it is powerful, I miss the nice UI like Cloudera Manager. To install Cloudera CDH cluster, I need to use a different approach and I am going to discuss it in the future blog.


Google Cloud SQL vs Cloud DataStore vs BigTable vs BigQuery vs Spanner

Many people are familiar with Amazon AWS cloud, but Google Cloud Platform (GCP) is another interesting cloud provider. For Cloud DB storage option on GCP, Google provides the options like Cloud SQL, Cloud Datastore, Google BigTable, Google Cloud BigQuery, and Google Spanner. In this blog, I am going to discuss all of these five options, but mainly focusing on last three as I am more interested in the options that handle large amount of data.

Cloud SQL
If you want to have full relational database in supporting customized table views, stored procedures, tons of indexes and ACID compliance, Cloud SQL is probably your potential choice here. Google Cloud SQL is the database service that support two types of databases: MySQL and PostgreSQL. Both support High Available (HA) and Pay Per Use without Lock-in. It can scale up to 32 processor cores and more than 200GB RAM. Although this option might make your life easier in migrating your data to cloud, it does have all the limitations in MySQL and PostgreSQL, and not scaling well for huge data volume. There are many blogs about performance limitation on MySQL and PostgreSQL. I am not going to repeat here.

For more information about Cloud SQL, please visit

Cloud Datastore
Google Cloud DataStore is a cloud-based NoSQL database for web and mobile applications. It’s scalable NoSQL database and can automatically handles sharding and replication. It also supports ACID transaction, SQL-like queries and REST API. Unlike BigTable, Datastore is optimized for smaller set of data. Although Cloud Datastore is a NoSQL db and you don’t need to define a schema before storing a row, it actually uses more for ad hoc storage of structured data. Cloud Datastore does not have SQL, but have an API called GQL to perform some kind of queries. Here is one example of the query.

// List Google companies with less than 400 employees.
var companies = query.filter('name =', 'Google').filter('size <', 400);

Someone mentioned that Cloud Datastore actually originated from Google’s internal-use database, Megastore. Megastore is widely used inside Google. I couldn’t find Google’s official statement about the link between these two products. But from Google’s publication about Megastore, it does look quite similar to Cloud Datastore.

For more information about Cloud Datastore, please visit

Big Table
Google BigTable is Google’s cloud storage solution for low latency data access. It was originally developed in 2004 and was built on Google File System (GFS). There is one paper about BigTable: Bigtable: A Distributed Storage System for Structured Data. Now It is widely used in many Google’s core services like Google Search, Google Maps, and Gmail. It is designed in NoSQL architecture, but can still use row-based data format. With data read/write under 10 milliseconds, it is good for applications that have frequent data ingestion. It can be scaleable to hundreds of petabytes and handle millions of operations per second.

BigTable is compatible with HBase 1.0 API via extensions. Any move from HBase will be easier. BigTable has no SQL interface and you can only use API go Put/Get/Delete individual rows or run scan operations. BigTable can be easily integrated with other GCP tools, like Cloud Dataflow and Dataproc. BigTable is also the foundation for Cloud Datastore.

Unlike other clouds, GCP compute and storage are separate. You need to consider the following three parts when calculating the cost.
1. The type of Cloud instance, and the number of nodes in the instance.
2. The total amount of storage your tables use.
3. The amount of network bandwidth used. Please note: some part of network traffic is free.

It’s good and bad. The good part is that you don’t need to pay for the compute cost if your system is idle and you pay only the storage cost. The bad part is that it is not easy to forecast your compute usage if you have very large dataset.

As for pricing, I listed the cost to create a 10 node BigTable in us-east1-b zone and for production instance type only. The first shows SSD storage type while the second one shows HDD storage type.









This is the pricing as of June 10, 2017, it could change in the future without notice. There are some interesting observations:

  • The compute cost is the same no matter you choose SSD and HDD storage type. It makes sense as storage and compute are separated in GCP.
  • The Writes are the same for both cases. The Reads is about 20 times slower in HDD compared with SSD. But the scans for HDD drops just 20%. If you know your access pattern is mostly scan, HDD option seem not bad. Although I would not recommend HDD, just feel this observation is interesting and puzzling.
  • The cost to go with HDD storage is only 15% of the cost of going with SSD.

For more information , please visit

BigQuery is Google’s Cloud-based data warehousing solution. Unlike BigTable, it targets data in big picture and can query huge volume of data in a short time. As the data is stored in columnar data format, it is much faster in scanning large amounts of data compared with BigTable. BigQuery allows you to scale to petabyte and is great enterprise data warehouse for analytics. BigQuery is serverless. Serverless computing means computing resource can be spun up on-demand. It benefits users from zero server usage to full-scale usage without involving administrators and managing infrastructure. According to Google, BigQuery can scan Terabytes of data in seconds and Petabytes of data in minutes. For data ingestion, BigQuery allows you to load data from Google Cloud Storage, or Google Cloud DataStore, or stream into BigQuery storage.

However, BigQuery is really for OLAP type of query and scan large amount of data and is not designed for OLTP type queries. For small read/writes, it takes about 2 seconds while BigTable takes about 9 milliseconds for the same amount of data. BigTable is much better off for OLTP type of queries. Although BigQuery support atomic single-row operations, it lacks cross-row transaction support.

For pricing, there are some free operations. I won’t discuss more about the free operation, but just about standard pricing for the most important components in the BigQuery. There are two major components in the cost of using BigQuery: Storage Cost and Query Cost

For storage cost, it is $0.02 per GB/month. However, Google has a long term storage pricing, which is 50% off to $0.01 per GB/month. The definition of long term storage is the table that is not edited (APPEND, OVERWRITE or STEAMING) for 90 days. Each partition in the table is considered separate storage. So you could have standard pricing for some recent partitions while have long term storage pricing for some historical partitions. Even the data is in long term storage, there is no degradation of performance, durability and availability.

For query cost, the first 1 TB of data processed in a month is free, then it is $5 per TB. No charge for cached queries. As BigQuery is stored in columnar data format, the query cost is based on the columns selected. For enterprise with large amount of data and tons of applications, although the bill for data storage is predictable, the bill for query cost is not. The good news is that Google does offer a flat rate monthly cost model instead of on-demand pricing. For example, you can pay $10,000 for 500 BigQuery Slots and BigQuery automatically manages these slot quota.

For more documentation about BigQuery, please visit A sample of BigQuery screen is shown below:

Cloud Spanner
Cloud Spanner is a globally distributed database and was just officially released last month in May 2017. It is a versioned key-value store. From this perspective, it is similar to BigTable. However, it support general-purpose transactions and provide SQL-based query language.

Spanner was developed in 2011 and used internally for Google’s advertising backend, which is called F1. F1 was initially based on a MySQL database. As Google grows rapidly in its advertising revenue, so is the F1 MySQL database. The uncompressed dataset is in tens of TB. It’s definitely way beyond the comfort zone for MySQL. Even with tremendous effort in sharding scheme, the management of the database became very complex and costly. The last resharding of this MySQL database took two years of intense effort. Please note, even for a great company like Google with so many talented people, it still took two years efforts. I can’t image how other companies can survive this size of MySQL database. This is why I am usually cautious about using MySQL database with large size footprint.

Two features I like most in Cloud Spanner:
1. Replication Configuration
Data replication is handled automatically and transparently. But user application can control the way how data is stored. For example, if user data has the requirement to stay in USA only, you could specify to store data in US data centers only. If you want to improve the read performance and availability, you could increase the number of replicas used and geographic placement of replicas to make the data is close to the users as much as possible. If you want to have fast write throughput, you could decide how far replicas are from each other.

2. Globally-distributed database allowing consistent reads and writes.
This feature is critical if I want to have a consistent backup, or have consistent reads at global scale level. The implementation of this feature is using Google’s TrueTime. In stead of using only one source for the time reference, TrueTime is based on the time references from both GPS and atomic clocks. Google indicates the reason to use two different kinds of time reference because they have different failure models. Atomic clocks can fail over long periods of time like drift significantly while GPS can fail when receivers fail or radio interference. Usually you won’t see both time reference fails at the same time because they have different failure models.

Spanner is organized in a set of zones. Each zone has one zone master and 100 to 1000 spanserver. Each table is split into multiple tablets. A table’s state is stored in set of B-tree like structure files and Write-Ahead Log on a file system called Colossus. Colossus is a global distributed file system and the successor to Google File System (GFS). Spanner’s data model is not purely relational, but semi-relational. Each rows must have names, and each table is required to have an ordered set of one or more primary-key columns. Google publishes a Best Practice for Spanner Schema Design. For more information about the Spanner architecture, please check out Google’s research paper: Spanner: Google’s Globally-Distributed Database.

The following shows the option to create a Spanner instance with 10 nodes.

The storage cost is $0.30 per GB/month and $9 per hour per node. Each Spanner node can provide up to 10,000 QPS of reads or 2000 QPS of writes (writing single rows at 1KB data per row), and 2 TB disk storage. Google also recommend provision more spanner nodes to keep CPU utilization below 75%.

Among these five database storage options in GCP, I like Cloud Spanner the most as I feel it is the best to meet the requirement of supporting Relational databases and great scalability and availability. But if you want to find which option is really for you, Google has a nice decision tree to help you to determine the best option for you.

For more information about Cloud Spanner, please visit