Create Cloudera Hadoop Cluster Using Cloudera Director on Google Cloud

I wrote a blog several years ago discussing how to install a Cloudera Hadoop cluster. It took at least half a day to complete the installation in my VM cluster. In my last post, I discussed an approach to deploy a Hadoop cluster using Dataproc on Google Cloud Platform. It literally took less than two minutes to create a Hadoop cluster. Although it is good to have a cluster launched in a very short time, it does not have a nice UI like Cloudera Manager, as the Hadoop distribution used by Dataproc is not CDH. I could repeat my earlier blogs to build a Hadoop cluster using VM instances on Google Cloud Platform, but that would take time and involve a lot of work. Actually, there is another way to create a Hadoop cluster on the cloud. Cloudera has a product called Cloudera Director. It currently supports not only Google Cloud, but also AWS and Azure. It is designed to deploy CDH clusters faster and to make it easier to scale clusters on the cloud. Another important feature is that Cloudera Director lets you move your deployment scripts or steps easily from one cloud provider to another, so you are not locked into one cloud vendor. In this blog, I will show you how to create a CDH cluster using Cloudera Director.

The first step is to start my Cloudera Director instance. In my case, I have already installed Cloudera Director based on the instructions from Cloudera. It is a pretty straightforward process and I am not going to repeat it here. The Cloudera Director instance is where you launch your CDH cluster deployment.

Both the Cloudera Director and Cloudera Manager UIs are browser-based, so you have to set up a secure connection between your local machine and the VM instances on the cloud. To achieve this, you need to configure a SOCKS proxy on your local machine that is used to connect to the Cloudera Director VM. It provides a secure way to connect to your VM on the cloud and lets you use the VM's internal IP and hostname in the web browser. Google has a nice note about the steps, Securely Connecting to VM Instances. Following this note will help you set up the SOCKS proxy.
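Once the SSH tunnel is running, you point a dedicated browser instance at the proxy. A minimal sketch for Chrome (the binary path and profile directory vary by OS; port 1080 matches the ssh-flag used in the gcloud command):

```shell
# Launch a separate Chrome instance that routes all traffic through the
# SOCKS proxy on localhost:1080 (binary path and profile dir vary by OS).
/usr/bin/google-chrome \
    --proxy-server="socks5://localhost:1080" \
    --user-data-dir="$HOME/chrome-proxy-profile"
```

With this profile, internal IPs and hostnames of your VMs resolve through the tunnel.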

Ok, here are the steps.
Log on to Cloudera Director
Open a terminal session locally and run the following command:

gcloud compute ssh cdh-director-1 \
    --project cdh-director-173715 \
    --zone us-central1-c \
    --ssh-flag="-D" \
    --ssh-flag="1080"

cdh-director-1 is the name of my Cloudera Director instance on Google Cloud and cdh-director-173715 is my Google Cloud project id. After executing the above command, the session looks like it hangs and never completes. This is CORRECT behavior. Do not kill or exit this session. Open a browser and enter the internal IP of the Cloudera Director instance with port number 7189. For my cdh-director-1 instance, I use its internal IP.

After entering the URL for Cloudera Director, the login screen shows up. Log in as the admin user.

After login, the initial setup wizard shows up. Click Let’s get started.

In the Add Environment screen, input the information as follows. The Client ID JSON Key is the file you create during the initial setup of your Google project along with the SSH keys.

In the next Add Cloudera Manager screen, I usually create the instance templates first. Click the drop down of Instance Template, then select Create a new instance template. I need at least three templates: one for Cloudera Manager, one for Master nodes, and one for Worker nodes. To save resources in my Google Cloud environment, I did not create a template for Edge nodes. Here are the configurations for all three templates.

Cloudera Manager Template

Master Node Template

Worker Node Template

Input the following for Cloudera Manager. For my test, I use the Embedded Database. If this is for production, you need to set up an external database first and register it here.

After clicking Continue, the Add Cluster screen shows up. There is a gateway instance group and I removed it by clicking Delete Group because I don't have an edge node here. Input the corresponding template and number of instances for masters and workers.

After clicking Continue, the deployment starts.

After about 20 minutes, it completes. Click Continue.

Review Cluster
The nice Cloudera Director dashboard shows up.

You can also login to Cloudera Manager from the link on Cloudera Director.

Nice and easy. An excellent product from Cloudera. For more information about deploying a CDH cluster on Google Cloud, you can also check out Cloudera's document, Getting Started on Google Cloud Platform.


Resolve Sparklyr not Respond Issue on Port 8880

Recently I was approached by one of my clients to help them investigate a weird sparklyr issue. sparklyr is an interface between R and Spark introduced by RStudio about a year ago. The following is the sparklyr architecture.

When trying to do sc <- spark_connect in RStudio, we got two errors:

  • Failed while connecting to sparklyr to port (8880) for sessionid (3859): Gateway in port (8880) did not respond.
  • Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream

Here is the detailed message.

    > library(sparklyr)
    > library(dplyr) 
    > sc <- spark_connect(master = "yarn-client", config=spark_config(), version="1.6.0", spark_home = '/opt/cloudera/parcels/CDH/lib/spark/')
    Error in force(code) :
    Failed while connecting to sparklyr to port (8880) for sessionid (3859): Gateway in port (8880) did not respond.
    Path: /opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/lib/spark/bin/spark-submit
    Parameters: --class, sparklyr.Shell, --jars, '/usr/lib64/R/library/sparklyr/java/spark-csv_2.11-1.3.0.jar','/usr/lib64/R/library/sparklyr/java/commons-csv-1.1.jar','/usr/lib64/R/library/sparklyr/java/univocity-parsers-1.5.1.jar', '/usr/lib64/R/library/sparklyr/java/sparklyr-1.6-2.10.jar', 8880, 3859
    Log: /tmp/RtmzpSIMln/file9e23246605df7_spark.log
    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
                   at org.apache.spark.deploy.SparkSubmitArguments.handle(SparkSubmitArguments.scala:394)
                   at org.apache.spark.launcher.SparkSubmitOptionParser.parse(
                   at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:97)
                   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
                   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
                   at java.lang.ClassLoader.loadClass(
                   at sun.misc.Launcher$AppClassLoader.loadClass(
                   at java.lang.ClassLoader.loadClass(
                   ... 5 more

    Did some research and found many people had a similar issue. Ok, try their recommendations one by one as follows.

  • Set SPARK_HOME environment variable
  • Tried running Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark/"). No, not working.

  • Install latest version sparklyr
  • My client installed sparklyr less than one month ago. I don't see why this option makes sense. I didn't pursue this path.

  • Check Java Installation
  • The R on the same server uses the same version of Java without any issue. I don't see why the Java installation would be a major concern here. Ignore this one.

  • No Hadoop Installation
  • Someone said a Spark installation alone is not enough; you need a Hadoop installation as well. Clearly this does not fit our situation. The server is an edge node and has a Hadoop installation.

  • Do not have a valid kerberos ticket
  • Running system2('klist') did show no Kerberos ticket. Ok, I then opened a shell within RStudio Server by clicking Tools -> Shell, then issued the kinit command.
    Rerunning system2('klist') showed I had a valid Kerberos ticket. Try again. Still not working.
    Note: even though this did not fix the issue, the step is necessary for further action once the issue is fixed. So run it regardless of the result.
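For reference, the shell steps were simply the following (the principal is a placeholder for this client's Kerberos realm):

```shell
# Obtain a Kerberos ticket, then verify it is valid
# (the principal below is a placeholder).
kinit wzhou@EXAMPLE.COM
klist
```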

  • Create a different config and pass it to spark_connect
  • Someone recommended creating a new config and passing it in. It looks like a good idea. Unfortunately, it just doesn't work.

    wzconfig <- spark_config()
    wzconfig$`` <- "client"
    wzconfig$spark.driver.cores <- 1
    wzconfig$spark.executor.cores <- 2
    wzconfig$spark.executor.memory <- "4G"
    sc <- spark_connect(master = "yarn-client", config=wzconfig, version="1.6.0", spark_home = '/opt/cloudera/parcels/CDH/lib/spark/')

    Actually this recommendation is missing another key parameter. By default the total number of executors launched is 2. I would usually bump up this number a little to get better performance. You can set the total number of executors as follows.

    wzconfig$spark.executor.instances <- 3

    Although this approach looks promising, it is still not working. But it is definitely useful elsewhere for better controlling Spark resource usage.

  • Add remote address
  • Someone mentioned setting the remote address. I thought this could be another potential option, as I have resolved Spark issues related to local IPs in the past. So I added the following code to the configuration from the previous example; note the parameter sparklyr.gateway.address is the hostname of the active Resource Manager.

    wzconfig$sparklyr.gateway.remote <- TRUE
    wzconfig$sparklyr.gateway.address <- "" 

    Not working for this case.

  • Change deployment mode to yarn-cluster
  • This is probably the most unrealistic one. If connecting with master = "yarn-cluster", the Spark driver will be somewhere inside the Spark cluster. For our current case, I don't believe this is the right solution. Don't even try it.

  • Run a Spark example
  • Someone recommended running spark-submit to verify SparkPi can be run from the environment. This looks reasonable. The good thing is I figured out the issue before executing this one. But it is definitely a valid test to verify spark-submit.

    /opt/cloudera/parcels/SPARK2/lib/spark2/bin/spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode client --master yarn /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_2.11-2.1.0.jar 10
  • HA for yarn-cluster
  • There is an interesting post, Add support for `yarn-cluster` with high availability #905, discussing how the issue might relate to multiple Resource Managers. We use HA, so this post is interesting, but it might not fit our case because I feel we have not even reached the HA part yet, given the Class Not Found message.

  • Need to set JAVA_HOME
  • Verified it and we have it set. So this is not the issue.

  • My Solution
  • After reviewing or trying out some of the above solutions, I'd like to go back to my own way of thinking. I must say I am not an expert in R or RStudio, with very limited knowledge about how they work. But I do have an extensive background in Spark tuning and troubleshooting.

    I know the error message Gateway in port (8880) did not respond is always the first message that shows up and looks like the cause of the issue. But I thought differently. I believed the second error, NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream, looked more suspicious than the first one. Early this year I helped another client with a weird Spark job issue, which, in the end, was caused by an incorrect path. It seemed to me the path might not be right, causing the Spark issue, which then caused the first error of the port not responding.

    With this idea in mind, I focused more on path verification. I ran the command Sys.getenv() to get the environment as follows.

    > Sys.getenv()
    DISPLAY                 :0
    EDITOR                  vi
    GIT_ASKPASS             rpostback-askpass
    HADOOP_CONF_DIR         /etc/hadoop/conf.cloudera.hdfs
    HADOOP_HOME             /opt/cloudera/parcels/CDH
    HOME                    /home/wzhou
    JAVA_HOME               /usr/java/jdk1.8.0_144/jre
    LANG                    en_US.UTF-8
    LD_LIBRARY_PATH         /usr/lib64/R/lib::/lib:/usr/java/jdk1.8.0_92/jre/lib/amd64/server
    LN_S                    ln -s
    LOGNAME                 wzhou
    MAKE                    make
    PAGER                   /usr/bin/less
    PATH                    /usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
    R_BROWSER               /usr/bin/xdg-open
    R_BZIPCMD               /usr/bin/bzip2
    R_DOC_DIR               /usr/share/doc/R-3.3.2
    R_GZIPCMD               /usr/bin/gzip
    R_HOME                  /usr/lib64/R
    R_INCLUDE_DIR           /usr/include/R
    R_LIBS_SITE             /usr/local/lib/R/site-library:/usr/local/lib/R/library:/usr/lib64/R/library:/usr/share/R/library
    R_LIBS_USER             ~/R/x86_64-redhat-linux-gnu-library/3.3
    R_PAPERSIZE             a4
    R_PDFVIEWER             /usr/bin/xdg-open
    R_PLATFORM              x86_64-redhat-linux-gnu
    R_PRINTCMD              lpr
    R_RD4PDF                times,hyper
    R_SESSION_TMPDIR        /tmp/RtmpZf9YMN
    R_SHARE_DIR             /usr/share/R
    R_SYSTEM_ABI            linux,gcc,gxx,gfortran,?
    R_TEXI2DVICMD           /usr/bin/texi2dvi
    R_UNZIPCMD              /usr/bin/unzip
    RS_RPOSTBACK_PATH       /usr/lib/rstudio-server/bin/rpostback
    RSTUDIO                 1
    RSTUDIO_PANDOC          /usr/lib/rstudio-server/bin/pandoc
    RSTUDIO_WINUTILS        bin/winutils
    SED                     /bin/sed
    SPARK_HOME              /opt/cloudera/parcels/SPARK2/lib/spark2
    SSH_ASKPASS             rpostback-askpass
    TAR                     /bin/gtar
    USER                    wzhou
    YARN_CONF_DIR           /etc/hadoop/conf.cloudera.yarn

    Ahhh, I noticed the environment was missing the SPARK_DIST_CLASSPATH environment variable. So I set it using the command below just before sc <- spark_connect.

    Sys.setenv(SPARK_DIST_CLASSPATH = '/etc/hadoop/conf:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop/.//*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-hdfs/./:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-yarn/.//*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/.//*')

    Ok, try it again. Fantastic, it works!

    Ok, here is the real cause of the issue. sparklyr itself does not require you to specify a Java path. However, it does depend on spark-submit. When spark-submit is executed, it reads the Java path and submits the jar files to Spark accordingly. The cause of the issue is that when SPARK_DIST_CLASSPATH is not set, spark-submit does not work and the Spark executors cannot be launched.
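One way to avoid hard-coding the long parcel paths is to derive the value from the Hadoop client itself; a sketch, assuming the hadoop command is on the PATH of the edge node:

```shell
# Build SPARK_DIST_CLASSPATH from the locally installed Hadoop client
# instead of hard-coding parcel paths (assumes `hadoop` is on the PATH).
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
echo "$SPARK_DIST_CLASSPATH"
```

From R, the same value can be picked up with Sys.setenv(SPARK_DIST_CLASSPATH = system('hadoop classpath', intern = TRUE)).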

    Other Notes
    The following are some useful commands:

    spark_home_dir() or spark_home
    config <- spark_config()


    Also, here are a few useful articles about sparklyr and RStudio:
    RStudio’s R Interface to Spark on Amazon EMR
    How to Install RStudio Server on CentOS 7
    Using R with Apache Spark
    sparklyr: a test drive on YARN
    Analyzing a billion NYC taxi trips in Spark

    Create Hadoop Cluster on Google Cloud Platform

    There are many ways to create Hadoop clusters and I am going to show a few of them on Google Cloud Platform (GCP). The first approach is the standard way to build a Hadoop cluster, no matter whether you do it on the cloud or on-premise: basically, create a group of VM instances and manually install a Hadoop cluster on them. Many people have blogs or articles about this approach and I am not going to repeat the steps here.

    In this blog, I am going to discuss the approach using Google Cloud Dataproc, with which you can actually have a Hadoop cluster up and running within 2 minutes. Google Cloud Dataproc is a fully-managed cloud service for running Apache Hadoop clusters in a simple, fast way. The following shows the steps to create a Hadoop cluster and submit a Spark job to the cluster.

    Create a Hadoop Cluster
    Click Dataproc -> Clusters

    Then click Enable API

    Cloud Dataproc screen shows up. Click Create cluster

    Input the following parameters:
    Name : cluster-test1
    Region : Choose us-central1
    Zone : Choose us-central1-c

    1. Master Node
    Machine Type : The default is n1-standard-4, but I choose n1-standard-1 just for simple testing purposes.
    Cluster Mode : There are 3 modes here: Single Node (1 master, 0 workers), Standard (1 master, N workers), and High Availability (3 masters, N workers). Choose Standard.
    Primary disk size : For my testing, 10GB is enough.

    2. Worker Nodes
    Similar configuration to the Master node. I use 3 worker nodes and the disk size is 15 GB. You might notice there is an option to use local SSD storage. You can attach up to 8 local SSD devices to a VM instance. Each disk is 375 GB in size and you cannot specify a 10GB disk size here. The local SSDs are physically attached to the host server and offer higher performance and lower latency than Google's persistent disk storage. Local SSDs are used for temporary data like shuffle data in MapReduce. The data on local SSD storage is not persistent. For more information, please visit

    Another thing to mention is that Dataproc uses a Cloud Storage bucket instead of HDFS for the Hadoop cluster. Although the hadoop command still works and you won't notice anything different, the underlying storage is different. In my opinion, it is actually better, because a Google Cloud Storage bucket definitely has much better reliability and scalability than HDFS.
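Because the default file system is backed by the GCS connector, the familiar hadoop commands work against gs:// paths as well; a quick sketch (my-bucket is a placeholder bucket name):

```shell
# On a Dataproc node the GCS connector is preinstalled, so hadoop
# commands can address Cloud Storage directly (my-bucket is a placeholder).
hadoop fs -mkdir -p gs://my-bucket/test
hadoop fs -ls gs://my-bucket/
```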

    Click Create when everything is done. After a few minutes, the cluster is created.
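The same cluster can also be created from the gcloud CLI instead of the console; a sketch mirroring the wizard choices above (adjust project, region and zone for your environment):

```shell
# Create the Dataproc cluster from the command line with the same
# sizing as the console wizard above (zone and machine types per this test).
gcloud dataproc clusters create cluster-test1 \
    --zone us-central1-c \
    --master-machine-type n1-standard-1 \
    --master-boot-disk-size 10 \
    --num-workers 3 \
    --worker-machine-type n1-standard-1 \
    --worker-boot-disk-size 15
```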

    Click cluster-test1 and it should show the cluster information.

    If click VM Instances tab, we can see there is one master and 3 worker instances.

    Click Configuration tab. It shows all configuration information.

    Submit a Spark Job
    Click Cloud Dataproc -> Jobs.

    Once Submit Job screen shows up, input the following information, then click Submit.
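The same SparkPi job can be submitted from the command line as well; a sketch (the examples jar path is the usual Dataproc location, but verify it on your image version):

```shell
# Submit the SparkPi example to the cluster without using the console
# (jar path is the typical Dataproc location; verify on your image).
gcloud dataproc jobs submit spark \
    --cluster cluster-test1 \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
```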

    After the job completes, you should see the following:

    To verify the result, I need to ssh to the master node to find out which ports are listening for connections. Click the drop down to the right of SSH for the master node, then click Open in browser window.

    Then run the netstat command.

    cluster-test1-m:~$ netstat -a |grep LISTEN |grep tcp
    tcp        0      0 *:10033                 *:*                     LISTEN     
    tcp        0      0 *:10002                 *:*                     LISTEN     
    tcp        0      0 cluster-test1-m.c.:8020 *:*                     LISTEN     
    tcp        0      0 *:33044                 *:*                     LISTEN     
    tcp        0      0 *:ssh                   *:*                     LISTEN     
    tcp        0      0 *:52888                 *:*                     LISTEN     
    tcp        0      0 *:58266                 *:*                     LISTEN     
    tcp        0      0 *:35738                 *:*                     LISTEN     
    tcp        0      0 *:9083                  *:*                     LISTEN     
    tcp        0      0 *:34238                 *:*                     LISTEN     
    tcp        0      0 *:nfs                   *:*                     LISTEN     
    tcp        0      0 cluster-test1-m.c:10020 *:*                     LISTEN     
    tcp        0      0 localhost:mysql         *:*                     LISTEN     
    tcp        0      0 *:9868                  *:*                     LISTEN     
    tcp        0      0 *:9870                  *:*                     LISTEN     
    tcp        0      0 *:sunrpc                *:*                     LISTEN     
    tcp        0      0 *:webmin                *:*                     LISTEN     
    tcp        0      0 cluster-test1-m.c:19888 *:*                     LISTEN     
    tcp6       0      0 [::]:10001              [::]:*                  LISTEN     
    tcp6       0      0 [::]:44884              [::]:*                  LISTEN     
    tcp6       0      0 [::]:50965              [::]:*                  LISTEN     
    tcp6       0      0 [::]:ssh                [::]:*                  LISTEN     
    tcp6       0      0 cluster-test1-m:omniorb [::]:*                  LISTEN     
    tcp6       0      0 [::]:46745              [::]:*                  LISTEN     
    tcp6       0      0 cluster-test1-m.c.:8030 [::]:*                  LISTEN     
    tcp6       0      0 cluster-test1-m.c.:8031 [::]:*                  LISTEN     
    tcp6       0      0 [::]:18080              [::]:*                  LISTEN     
    tcp6       0      0 cluster-test1-m.c.:8032 [::]:*                  LISTEN     
    tcp6       0      0 cluster-test1-m.c.:8033 [::]:*                  LISTEN     
    tcp6       0      0 [::]:nfs                [::]:*                  LISTEN     
    tcp6       0      0 [::]:33615              [::]:*                  LISTEN     
    tcp6       0      0 [::]:56911              [::]:*                  LISTEN     
    tcp6       0      0 [::]:sunrpc             [::]:*                  LISTEN  

    Check out directories.

    cluster-test1-m:~$ hdfs dfs -ls /
    17/09/12 12:12:24 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
    Found 2 items
    drwxrwxrwt   - mapred hadoop          0 2017-09-12 11:56 /tmp
    drwxrwxrwt   - hdfs   hadoop          0 2017-09-12 11:55 /user

    There are a few UI screens available to check out the Hadoop cluster and job status.
    HDFS NameNode (port 9870)

    YARN Resource Manager (port 8088)

    Spark Job History (port 18080)

    The Dataproc approach is an easy way to create a Hadoop cluster. Although it is powerful, I miss a nice UI like Cloudera Manager. To install a Cloudera CDH cluster, I need to use a different approach, which I am going to discuss in a future blog.

    Google Cloud SQL vs Cloud DataStore vs BigTable vs BigQuery vs Spanner

    Many people are familiar with the Amazon AWS cloud, but Google Cloud Platform (GCP) is another interesting cloud provider. For cloud DB storage options on GCP, Google provides Cloud SQL, Cloud Datastore, Google BigTable, Google BigQuery, and Google Spanner. In this blog, I am going to discuss all five options, but mainly focus on the last three, as I am more interested in the options that handle large amounts of data.

    Cloud SQL
    If you want a full relational database supporting customized table views, stored procedures, tons of indexes and ACID compliance, Cloud SQL is probably your choice here. Google Cloud SQL is a database service that supports two types of databases: MySQL and PostgreSQL. Both support High Availability (HA) and pay-per-use without lock-in. It can scale up to 32 processor cores and more than 200GB of RAM. Although this option might make your life easier in migrating your data to the cloud, it does have all the limitations of MySQL and PostgreSQL and does not scale well for huge data volumes. There are many blogs about the performance limitations of MySQL and PostgreSQL; I am not going to repeat them here.

    For more information about Cloud SQL, please visit

    Cloud Datastore
    Google Cloud Datastore is a cloud-based NoSQL database for web and mobile applications. It is a scalable NoSQL database that automatically handles sharding and replication. It also supports ACID transactions, SQL-like queries and a REST API. Unlike BigTable, Datastore is optimized for smaller sets of data. Although Cloud Datastore is a NoSQL db and you don't need to define a schema before storing a row, it is actually used more for ad hoc storage of structured data. Cloud Datastore does not have SQL, but it has a query language called GQL to perform queries. Here is one example of a query.

    // List Google companies with less than 400 employees.
    var companies = query.filter('name =', 'Google').filter('size <', 400);

    Someone mentioned that Cloud Datastore actually originated from Google's internal-use database, Megastore, which is widely used inside Google. I couldn't find Google's official statement about the link between these two products, but from Google's publication about Megastore, it does look quite similar to Cloud Datastore.

    For more information about Cloud Datastore, please visit

    Big Table
    Google BigTable is Google's cloud storage solution for low latency data access. It was originally developed in 2004 and was built on Google File System (GFS). There is a paper about BigTable: Bigtable: A Distributed Storage System for Structured Data. It is now widely used in many of Google's core services like Google Search, Google Maps, and Gmail. It has a NoSQL architecture, but can still use a row-based data format. With data reads/writes under 10 milliseconds, it is good for applications with frequent data ingestion. It can scale to hundreds of petabytes and handle millions of operations per second.

    BigTable is compatible with the HBase 1.0 API via extensions, so any move from HBase will be easier. BigTable has no SQL interface; you can only use the API to Put/Get/Delete individual rows or run scan operations. BigTable can be easily integrated with other GCP tools, like Cloud Dataflow and Dataproc. BigTable is also the foundation for Cloud Datastore.
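Besides the HBase-compatible API, Google ships a small CLI called cbt for quick row-level operations; a sketch (instance, table and column family names are placeholders):

```shell
# Minimal Put/Get against Bigtable using the cbt CLI
# (instance/table/family names below are placeholders).
cbt -instance my-instance createtable mytable
cbt -instance my-instance createfamily mytable cf1
cbt -instance my-instance set mytable row1 cf1:greeting=hello
cbt -instance my-instance read mytable count=1
```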

    Unlike some other clouds, GCP compute and storage are separate. You need to consider the following three parts when calculating the cost:
    1. The type of Cloud Bigtable instance, and the number of nodes in the instance.
    2. The total amount of storage your tables use.
    3. The amount of network bandwidth used. Please note: some network traffic is free.

    This is both good and bad. The good part is that you don't pay for compute when your system is idle; you pay only the storage cost. The bad part is that it is not easy to forecast your compute usage if you have a very large dataset.

    As for pricing, I list the cost of creating a 10-node BigTable in the us-east1-b zone, for the production instance type only. The first screen shows the SSD storage type while the second one shows the HDD storage type.









    This is the pricing as of June 10, 2017; it could change in the future without notice. There are some interesting observations:

    • The compute cost is the same no matter whether you choose the SSD or HDD storage type. It makes sense, as storage and compute are separated in GCP.
    • Write performance is the same in both cases. Reads are about 20 times slower on HDD compared with SSD, but scans on HDD drop just 20%. If you know your access pattern is mostly scans, the HDD option seems not bad. Although I would not recommend HDD, I find this observation interesting and puzzling.
    • The cost of HDD storage is only 15% of the cost of SSD storage.

    For more information , please visit

    BigQuery
    BigQuery is Google's cloud-based data warehousing solution. Unlike BigTable, it targets the big picture and can query huge volumes of data in a short time. As the data is stored in a columnar format, it is much faster at scanning large amounts of data than BigTable. BigQuery allows you to scale to petabytes and is a great enterprise data warehouse for analytics. BigQuery is serverless: computing resources are spun up on demand, so users can go from zero usage to full-scale usage without involving administrators or managing infrastructure. According to Google, BigQuery can scan terabytes of data in seconds and petabytes of data in minutes. For data ingestion, BigQuery allows you to load data from Google Cloud Storage or Google Cloud Datastore, or stream it into BigQuery storage.

    However, BigQuery is really for OLAP-type queries that scan large amounts of data; it is not designed for OLTP-type queries. For small reads/writes, it takes about 2 seconds while BigTable takes about 9 milliseconds for the same amount of data, so BigTable is much better for OLTP-type queries. Although BigQuery supports atomic single-row operations, it lacks cross-row transaction support.

    For pricing, there are some free operations. I won't discuss the free operations here, just the standard pricing for the most important components of BigQuery. There are two major components in the cost of using BigQuery: storage cost and query cost.

    For storage, the cost is $0.02 per GB/month. However, Google has long-term storage pricing, which is 50% off, at $0.01 per GB/month. Long-term storage is defined as a table that has not been edited (APPEND, OVERWRITE or STREAMING) for 90 days. Each partition in a table is considered separately, so you could pay standard pricing for some recent partitions while paying long-term storage pricing for some historical partitions. Even when the data is in long-term storage, there is no degradation of performance, durability or availability.

    For query cost, the first 1 TB of data processed in a month is free; after that it is $5 per TB. There is no charge for cached queries. As BigQuery data is stored in a columnar format, the query cost is based on the columns selected. For an enterprise with a large amount of data and tons of applications, the bill for data storage is predictable, but the bill for queries is not. The good news is that Google offers a flat-rate monthly cost model instead of on-demand pricing. For example, you can pay $10,000 for 500 BigQuery slots and BigQuery automatically manages this slot quota.
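The on-demand math above can be sketched in a couple of lines (prices as quoted in this post; they may change):

```shell
# On-demand BigQuery query pricing: first 1 TB scanned per month is free,
# then $5 per TB (figures as quoted above; subject to change).
scanned_tb=10
billable_tb=$(( scanned_tb > 1 ? scanned_tb - 1 : 0 ))
cost=$(( billable_tb * 5 ))
echo "Scanning ${scanned_tb} TB costs about \$${cost} on demand"
```

For 10 TB scanned in a month, 9 TB is billable, or about $45.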

    For more documentation about BigQuery, please visit the BigQuery documentation. A sample BigQuery screen is shown below:

    Cloud Spanner
    Cloud Spanner is a globally distributed database that was officially released just last month, in May 2017. It is a versioned key-value store; from this perspective, it is similar to BigTable. However, it supports general-purpose transactions and provides a SQL-based query language.

    Spanner was developed in 2011 and used internally for Google's advertising backend, which is called F1. F1 was initially based on a MySQL database. As Google's advertising revenue grew rapidly, so did the F1 MySQL database. The uncompressed dataset was in the tens of TB, definitely way beyond the comfort zone for MySQL. Even with tremendous effort on the sharding scheme, management of the database became very complex and costly. The last resharding of this MySQL database took two years of intense effort. Please note, even for a great company like Google with so many talented people, it still took two years of effort. I can't imagine how other companies could survive a MySQL database of this size. This is why I am usually cautious about using MySQL databases with a large footprint.

    Two features I like most in Cloud Spanner:
    1. Replication Configuration
    Data replication is handled automatically and transparently, but the application can control how data is stored. For example, if user data is required to stay in the USA, you can specify that data be stored in US data centers only. If you want to improve read performance and availability, you can increase the number of replicas and adjust their geographic placement to keep data as close to users as possible. If you want fast write throughput, you can decide how far apart replicas are from each other.

    2. Globally-distributed database allowing consistent reads and writes.
    This feature is critical if I want a consistent backup, or consistent reads at global scale. It is implemented using Google's TrueTime. Instead of using only one time reference, TrueTime is based on time references from both GPS and atomic clocks. Google indicates that two different kinds of time reference are used because they have different failure modes: atomic clocks can drift significantly over long periods of time, while GPS can fail when receivers fail or there is radio interference. Because the failure modes differ, you usually won't see both time references fail at the same time.

    Spanner is organized as a set of zones. Each zone has one zone master and 100 to 1000 spanservers. Each table is split into multiple tablets. A table's state is stored in a set of B-tree-like files and a write-ahead log, on a file system called Colossus. Colossus is a globally distributed file system and the successor to Google File System (GFS). Spanner's data model is not purely relational, but semi-relational: each row must have a name, and each table is required to have an ordered set of one or more primary-key columns. Google publishes a Best Practice for Spanner Schema Design. For more information about the Spanner architecture, please check out Google's research paper, Spanner: Google's Globally-Distributed Database.

    The following shows the option to create a Spanner instance with 10 nodes.

    The storage cost is $0.30 per GB/month and $9 per hour per node. Each Spanner node can provide up to 10,000 QPS of reads or 2,000 QPS of writes (writing single rows at 1KB of data per row), and 2 TB of disk storage. Google also recommends provisioning enough Spanner nodes to keep CPU utilization below 75%.
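A rough sizing sketch based on the per-node limits above (the peak workload numbers are invented for illustration):

```shell
# Estimate the minimum Spanner node count from the per-node limits
# quoted above (the peak workload numbers below are invented examples).
peak_read_qps=35000   # reads: 10,000 QPS per node
peak_write_qps=5000   # writes: 2,000 QPS per node
data_tb=7             # storage: 2 TB per node
by_reads=$(( (peak_read_qps + 9999) / 10000 ))   # ceiling division
by_writes=$(( (peak_write_qps + 1999) / 2000 ))
by_storage=$(( (data_tb + 1) / 2 ))
nodes=$by_reads
if [ "$by_writes" -gt "$nodes" ]; then nodes=$by_writes; fi
if [ "$by_storage" -gt "$nodes" ]; then nodes=$by_storage; fi
echo "Provision at least ${nodes} nodes"
```

In practice you would add headroom on top of this estimate to keep CPU utilization below the recommended 75%.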

    Among these five database storage options on GCP, I like Cloud Spanner the most, as I feel it best meets the requirements of supporting relational databases with great scalability and availability. But if you want to find out which option is really for you, Google has a nice decision tree to help you determine the best option.

    For more information about Cloud Spanner, please visit

    Join Type: Bucket Join

    In the last three blogs, I discussed Common Join, Map Join and Skewed Join. Common Join is the default join type. Map Join is best used when one of the join tables is small and can fit into memory. Skewed Join improves query performance when data is skewed in the join keys. What happens when both join tables are very large and none of the above three joins works? This is where Bucket Join fits in.

    Bucket Join is also called Collocated Join. It is used when all join tables are large and the table data has been distributed by the join key. In this case, no data copy is necessary: it is a map-side join and the join can happen on the local node. Another condition for Bucket Join is that the number of buckets in one table must be equal to, or a multiple of, the number of buckets in the other table of the join.

    So at table creation time, make sure the buckets are created on the join columns, BEFORE the data is inserted into the table. Also set both parameters hive.optimize.bucketmapjoin and hive.enforce.bucketing to true before inserting data. One example of creating a bucketed table is shown below:

    CREATE TABLE mytable (
        name string,
        city string,
        employee_id int)
    CLUSTERED BY (employee_id) INTO 256 BUCKETS;

    The above join is also called Bucket Map Join. If the join tables have the same number of buckets and the data is also sorted on the join columns, a Sort Merge Bucket Map Join is used.

    How to Identify the Join
    When using the EXPLAIN command, you will see Sorted Merge Bucket Map Join Operator below Map Operator Tree after setting the following parameters:


    set hive.optimize.bucketmapjoin = true;
    set hive.optimize.bucketmapjoin.sortedmerge = true;
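Putting it together, a sketch of checking the plan from the hive CLI (the table and column names reuse the example above; othertable is a hypothetical second table bucketed on the same key):

```shell
# Ask Hive for the query plan with bucket map join enabled; look for
# "Sorted Merge Bucket Map Join Operator" in the output
# (othertable is a hypothetical second table bucketed on employee_id).
hive -e "
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
EXPLAIN
SELECT a.name, b.city
FROM mytable a
JOIN othertable b ON a.employee_id = b.employee_id;
"
```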