Access Sparkling Water via R Studio

In the last few blogs, I discussed the following topics related to Sparkling Water.
Sparking Water Shell: Cloud size under 12 Exception
H2O vs Sparkling Water
The H2O Flow UI is nice, but need to some mouse clicks to get what you want. In this blog, I am going to discuss how to access Sparkling Water via R Studio. You can access the same H2O frame from both Sparkling Water and R Studio.

Data Preparation
Create the following store transaction file, store_trans.txt

[~]# cat store_trans.txt 
3000251,20171111,1321,Austin,TX,1234,Food,Pizza,34910,8.10,Angla
3000252,20171111,1321,Austin,TX,7812,Food,Milk,16920,3.99,Rosina
3000753,20171112,2010,Houston,TX,3190,Food,Pizza,34910,8.10,Broderick
3000954,20171112,1442,Austin,TX,1234,Food,Pizza,34910,8.10,Jeanne
3008255,20171112,1651,Austin,TX,5134,Sports,Shoes,12950,45.99,Angla
3026256,20171112,1632,Austin,TX,1234,Food,Pizza,34910,8.10,Jeanne

Upload the file to HDFS.

[~]# hdfs dfs -put store_trans.txt /user/wzhou/test/data/
[~]# hdfs dfs -ls /user/wzhou/test/data
Found 2 items
-rw-r--r--   3 wzhou supergroup       2495 2017-11-14 09:58 /user/wzhou/test/data/stock.csv
-rw-r--r--   3 wzhou supergroup        400 2017-11-16 04:50 /user/wzhou/test/data/store_trans.txt

Start Sparkling Shell
Run the following code to start sparkling shell. I used only 2 instance here and you could specify more instances and it will spread into multiple instances.

kinit wzhou
bin/sparkling-shell \
–master yarn \
–conf spark.executor.instances=2 \
–conf spark.executor.memory=1g \
–conf spark.driver.memory=1g \
–conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
–conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
–conf spark.dynamicAllocation.enabled=false \
–conf spark.sql.autoBroadcastJoinThreshold=-1 \
–conf spark.locality.wait=30000 \
–conf spark.scheduler.minRegisteredResourcesRatio=1

Here is the execution result.

[root@worker--908ba3bf-d4af-4db1-bc3e-f7a9ba5372fa sparkling-water-2.2.2]# bin/sparkling-shell \
> --master yarn \
> --conf spark.executor.instances=2 \
> --conf spark.executor.memory=1g \
> --conf spark.driver.memory=1g \
> --conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
> --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
> --conf spark.dynamicAllocation.enabled=false \
> --conf spark.sql.autoBroadcastJoinThreshold=-1 \
> --conf spark.locality.wait=30000 \
> --conf spark.scheduler.minRegisteredResourcesRatio=1

-----
  Spark master (MASTER)     : yarn
  Spark home   (SPARK_HOME) : 
  H2O build version         : 3.14.0.7 (weierstrass)
  Spark build version       : 2.2.0
  Scala version             : 2.11
----

/usr/java/jdk1.8.0_131
lib_dir=/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/bin/../lib
bin_dir=/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/bin
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.142.0.8:4040
Spark context available as 'sc' (master = yarn, app id = application_1510827150310_0003).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

Start H2O Cluster
The H2O Cluster is running inside the Spark cluster.

import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(spark)
import h2oContext._

scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val h2oContext = H2OContext.getOrCreate(spark) 
h2oContext: org.apache.spark.h2o.H2OContext =                                   

Sparkling Water Context:
 * H2O name: sparkling-water-root_application_1510827150310_0003
 * cluster size: 2
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (2,worker--c7e86f76-a1fa-49e5-9074-d7bef4dbb43e.c.cdh-director-173715.internal,54321)
  (1,worker--908ba3bf-d4af-4db1-bc3e-f7a9ba5372fa.c.cdh-director-173715.internal,54321)
  ------------------------
  Open H2O Flow in browser: http://10.142.0.8:54323 (CMD + click in Mac OSX)

scala> import h2oContext._ 
import h2oContext._

Access H2O Flow UI
Open a browser, and type in http://10.142.0.8:54323. Port 54321 is the default port number. The reason why it is 54323 is that I have run another one that is using 54321 right now.
Click importFiles

Find out the store_tran.txt file and click Import

Click Parse these files

You should see the following and then click Parse

Click View

Right now we have the nice store_trans.hex H2O Frame.

Run command getFrames and we should be able to see the store_trans.hex frame.

You can then do tons of stuff, like model building, predication and so on. I am not going to discuss more in this blog. At this moment, let me switch to R Studio.

Load Required Libraries in R Studio
Run the following:

options(rsparkling.sparklingwater.version = "2.2.2")
library(rsparkling)
library(sparklyr)
library(h2o)
library(dplyr)

Here is the result

> options(rsparkling.sparklingwater.version = "2.2.2")
> library(rsparkling)
> library(sparklyr)
> library(h2o)
----------------------------------------------------------------------
Your next step is to start H2O:
    > h2o.init()
For H2O package documentation, ask for help:
    > ??h2o
After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

----------------------------------------------------------------------
Attaching package: ‘h2o’
The following objects are masked from ‘package:stats’:
    cor, sd, var
The following objects are masked from ‘package:base’:
    &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames, colnames<-, ifelse, is.character, is.factor, is.numeric, log, log10, log1p, log2, round, signif, trunc 
> library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
    filter, lag
The following objects are masked from ‘package:base’:
    intersect, setdiff, setequal, union

Connect to H2O Cluster
Do not run h2o.init() as the messages shown above. It will start a one node H2O cluster in your local node. I want to reuse the existing H2O cluster that is across multiple nodes. Run the following:
h2o.init(ip=”10.142.0.8″, port=54323)
h2o.clusterIsUp()

You should see the connection to the existing H2O cluster is successful. Also ignore version mismatch message.

Check cluster status and the status should be health in all nodes.

Let me see whether I can see the H2O frame I created from H2O Flow UI.

Excellent, I can see it. Let me do some operation against this frame.

Ok, we can see it is quite easy to access H2O from R Studio.

Advertisements