Use Python for H2O

I wrote serveral blogs related to H2O topic as follows:
H2O vs Sparkling Water
Sparking Water Shell: Cloud size under 12 Exception
Access Sparkling Water via R Studio
Running H2O Cluster in Background and at Specific Port Number
Weird Ref-count mismatch Message from H2O

While both H2O Flow UI and R Studio can access H2O cluster started by Sparkling Water Shell, both lack the powerful functionality to manipulate data like Python does. In this blog, I am going to discuss how to use Python for H2O.

There are two main topics related to python for H2O. The first is to start a H2O cluster and another one is to access an existing H2O cluster using python.

1. Start H2O Cluster using pysparkling
I discussed start a H2O cluster using sparkling-shell in blog Sparking Water Shell: Cloud size under 12 Exception. For python, need to use pysparkling.

/opt/sparkling-water-2.2.2/bin/pysparkling \
--master yarn \
--conf spark.ext.h2o.cloud.name=WeidongH2O-Cluster \
--conf spark.ext.h2o.client.port.base=26000 \
--conf spark.executor.instances=6 \
--conf spark.executor.memory=12g \
--conf spark.driver.memory=8g \
--conf spark.yarn.executor.memoryOverhead=4g \
--conf spark.yarn.driver.memoryOverhead=4g \
--conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
--conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--conf spark.locality.wait=30000 \
--conf spark.yarn.queue=HighPool \
--conf spark.scheduler.minRegisteredResourcesRatio=1

Then run the following python commands.

from pysparkling import *
from pyspark import SparkContext
from pyspark.sql import SQLContext
import h2o
hc = H2OContext.getOrCreate(sc)

The following is the sample output.

[wzhou@enkbda1node05 sparkling-water-2.2.2]$ /opt/sparkling-water-2.2.2/bin/pysparkling \
> --master yarn \
> --conf spark.ext.h2o.cloud.name=WeidongH2O-Cluster \
> --conf spark.ext.h2o.client.port.base=26000 \
> --conf spark.executor.instances=6 \
> --conf spark.executor.memory=12g \
> --conf spark.driver.memory=8g \
> --conf spark.yarn.executor.memoryOverhead=4g \
> --conf spark.yarn.driver.memoryOverhead=4g \
> --conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
> --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
> --conf spark.dynamicAllocation.enabled=false \
> --conf spark.sql.autoBroadcastJoinThreshold=-1 \
> --conf spark.locality.wait=30000 \
> --conf spark.yarn.queue=HighPool \
> --conf spark.scheduler.minRegisteredResourcesRatio=1
Python 2.7.13 |Anaconda 4.4.0 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/anaconda2/lib/python2.7/site-packages/pyspark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/12/10 15:00:20 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/12/10 15:00:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/12/10 15:00:21 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
17/12/10 15:00:24 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkSession available as 'spark'.
>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>> hc = H2OContext.getOrCreate(sc)
/opt/sparkling-water-2.2.2/py/build/dist/h2o_pysparkling_2.2-2.2.2.zip/pysparkling/context.py:111: UserWarning: Method H2OContext.getOrCreate with argument of type SparkContext is deprecated and parameter of type SparkSession is preferred.
17/12/10 15:01:45 WARN h2o.H2OContext: Method H2OContext.getOrCreate with an argument of type SparkContext is deprecated and parameter of type SparkSession is preferred.
Connecting to H2O server at http://192.168.10.41:26000... successful.
--------------------------  -------------------------------
H2O cluster uptime:         12 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    63.35 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         accepting new members, healthy
H2O connection url:         http://192.168.10.41:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  -------------------------------

Sparkling Water Context:
 * H2O name: WeidongH2O-Cluster
 * cluster size: 6
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (3,enkbda1node12.enkitec.com,26000)
  (1,enkbda1node13.enkitec.com,26000)
  (6,enkbda1node10.enkitec.com,26000)
  (5,enkbda1node09.enkitec.com,26000)
  (4,enkbda1node08.enkitec.com,26000)
  (2,enkbda1node11.enkitec.com,26000)
  ------------------------

  Open H2O Flow in browser: http://192.168.10.41:26000 (CMD + click in Mac OSX)


>>>
>>> h2o
<module 'h2o' from '/opt/sparkling-water-2.2.2/py/build/dist/h2o_pysparkling_2.2-2.2.2.zip/h2o/__init__.py'>
>>> h2o.cluster_status()
[WARNING] in <stdin> line 1:
    >>> ????
        ^^^^ Deprecated, use ``h2o.cluster().show_status(True)``.
--------------------------  -------------------------------
H2O cluster uptime:         14 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    63.35 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         locked, healthy
H2O connection url:         http://192.168.10.41:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  -------------------------------
Nodes info:     Node 1                                         Node 2                                         Node 3                                         Node 4                                         Node 5                                         Node 6
--------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------
h2o             enkbda1node08.enkitec.com/192.168.10.44:26000  enkbda1node09.enkitec.com/192.168.10.45:26000  enkbda1node10.enkitec.com/192.168.10.46:26000  enkbda1node11.enkitec.com/192.168.10.47:26000  enkbda1node12.enkitec.com/192.168.10.48:26000  enkbda1node13.enkitec.com/192.168.10.49:26000
healthy         True                                           True                                           True                                           True                                           True                                           True
last_ping       1513108921427                                  1513108920821                                  1513108920602                                  1513108920876                                  1513108920718                                  1513108921149
num_cpus        32                                             32                                             32                                             32                                             32                                             32
sys_load        1.24                                           1.38                                           0.42                                           1.08                                           0.75                                           0.83
mem_value_size  0                                              0                                              0                                              0                                              0                                              0
free_mem        11339923456                                    11337239552                                    11339133952                                    11332368384                                    11331912704                                    11338491904
pojo_mem        113671168                                      116355072                                      114460672                                      121226240                                      121681920                                      115102720
swap_mem        0                                              0                                              0                                              0                                              0                                              0
free_disk       422764871680                                   389836439552                                   422211223552                                   418693251072                                   422237437952                                   422187106304
max_disk        491885953024                                   491885953024                                   491885953024                                   491885953024                                   491885953024                                   491885953024
pid             1172                                           801                                            15879                                          17866                                          28980                                          30818
num_keys        0                                              0                                              0                                              0                                              0                                              0
tcps_active     0                                              0                                              0                                              0                                              0                                              0
open_fds        441                                            441                                            441                                            441                                            441                                            441
rpcs_active     0                                              0                                              0                                              0                                              0                                              0

2. Accessing Existing H2O Cluster from a Edge Node
To access an existing H2O cluster from a edge node also requires to use pysparkling command. But you don’t have to specify a lot of parameters just like above step to start a H2O cluster. Running without any parameters is fine for the client purpose.

First start the pysparkling by running /opt/sparkling-water-2.2.2/bin/pysparkling command.

Once in the python prompt, run the following to connect to an existing cluster. The key is run h2o.init(ip=”enkbda1node05.enkitec.com”, port=26000). This command will connect to the existing H2O cluster.

from pysparkling import *
from pyspark import SparkContext
from pyspark.sql import SQLContext
import h2o

h2o.init(ip="enkbda1node05.enkitec.com", port=26000)
h2o.cluster_status()

Run some testing code there.

print("Importing hdfs data")
stock_data = h2o.import_file("hdfs://ENKBDA1-ns/user/wzhou/work/test/stock_price.txt")
stock_data

print("Spliting data")
train,test = stock_data.split_frame(ratios=[0.9])

h2o.ls()

The following is the sample output.

[wzhou@ham-lnx-vs-0086 ~]$ /opt/sparkling-water-2.2.2/bin/pysparkling
Python 2.7.13 |Anaconda 4.4.0 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/

Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkSession available as 'spark'.
>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>>
>>> h2o.init(ip="enkbda1node05.enkitec.com", port=26000)
Warning: connecting to remote server but falling back to local... Did you mean to use `h2o.connect()`?
Checking whether there is an H2O instance running at http://enkbda1node05.enkitec.com:26000. connected.
--------------------------  ---------------------------------------
H2O cluster uptime:         5 mins 43 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    58.10 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         locked, healthy
H2O connection url:         http://enkbda1node05.enkitec.com:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  ---------------------------------------
>>> h2o.cluster_status()
[WARNING] in <stdin> line 1:
    >>> ????
        ^^^^ Deprecated, use ``h2o.cluster().show_status(True)``.
--------------------------  ---------------------------------------
H2O cluster uptime:         5 mins 43 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    58.10 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         locked, healthy
H2O connection url:         http://enkbda1node05.enkitec.com:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  ---------------------------------------
Nodes info:     Node 1                                         Node 2                                         Node 3                                         Node 4                                         Node 5                                         Node 6
--------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------
h2o             enkbda1node08.enkitec.com/192.168.10.44:26000  enkbda1node09.enkitec.com/192.168.10.45:26000  enkbda1node10.enkitec.com/192.168.10.46:26000  enkbda1node11.enkitec.com/192.168.10.47:26000  enkbda1node12.enkitec.com/192.168.10.48:26000  enkbda1node13.enkitec.com/192.168.10.49:26000
healthy         True                                           True                                           True                                           True                                           True                                           True
last_ping       1513109250525                                  1513109250725                                  1513109250400                                  1513109250218                                  1513109250709                                  1513109250536
num_cpus        32                                             32                                             32                                             32                                             32                                             32
sys_load        1.55                                           1.94                                           2.4                                            1.73                                           0.4                                            1.51
mem_value_size  0                                              0                                              0                                              0                                              0                                              0
free_mem        11339923456                                    10122031104                                    10076566528                                    10283636736                                    10280018944                                    10278896640
pojo_mem        113671168                                      1331563520                                     1377028096                                     1169957888                                     1173575680                                     1174697984
swap_mem        0                                              0                                              0                                              0                                              0                                              0
free_disk       422754385920                                   389839585280                                   422231146496                                   418692202496                                   422226952192                                   422176620544
max_disk        491885953024                                   491885953024                                   491885953024                                   491885953024                                   491885953024                                   491885953024
pid             1172                                           801                                            15879                                          17866                                          28980                                          30818
num_keys        0                                              0                                              0                                              0                                              0                                              0
tcps_active     0                                              0                                              0                                              0                                              0                                              0
open_fds        440                                            440                                            440                                            440                                            440                                            440
rpcs_active     0                                              0                                              0                                              0                                              0                                              0
>>> print("Importing hdfs data")
Importing hdfs data
>>> stock_data = h2o.import_file("hdfs://ENKBDA1-ns/user/wzhou/work/test/stock_price.txt")
Parse progress: | 100%

>>>
>>> stock_data
date                   close    volume    open    high     low
-------------------  -------  --------  ------  ------  ------
2016-09-23 00:00:00    24.05     56837   24.13  24.22   23.88
2016-09-22 00:00:00    24.1      56675   23.49  24.18   23.49
2016-09-21 00:00:00    23.38     70925   23.21  23.58   23.025
2016-09-20 00:00:00    23.07     35429   23.17  23.264  22.98
2016-09-19 00:00:00    23.12     34257   23.22  23.27   22.96
2016-09-16 00:00:00    23.16     83309   22.96  23.21   22.96
2016-09-15 00:00:00    23.01     43258   22.7   23.25   22.53
2016-09-14 00:00:00    22.69     33891   22.81  22.88   22.66
2016-09-13 00:00:00    22.81     59871   22.75  22.89   22.53
2016-09-12 00:00:00    22.85    109145   22.9   22.95   22.74

[65 rows x 6 columns]

>>> print("Spliting data")
Spliting data
>>> train,test = stock_data.split_frame(ratios=[0.9])
>>> h2o.ls()
                      key
0  py_1_sid_83d3_splitter
1         stock_price.hex

Let’s go to R Studio. We should see this H2O frame.

Let’s check it out from H2O Flow UI.

Cool, it’s there as well and we’re good here.

Advertisements

Weird Ref-count mismatch Message from H2O

H2O is a nice fast tools for data science work. I have discussed this topic in the following blogs:
H2O vs Sparkling Water
Sparking Water Shell: Cloud size under 12 Exception
Access Sparkling Water via R Studio
Running H2O Cluster in Background and at Specific Port Number
Recently we run into some weird issue when using H2O when using R Studio. For unknown reasons, it throws the following error messages:

score <- as.data.frame(main[,c('id', 'my_test_score')])

ERROR: Unexpected HTTP Status code: 500 Server Error (url = http://enkbda1node05.enkitec.com:26000/99/Rapids)

java.lang.IllegalStateException
 [1] "java.lang.IllegalStateException: Ref-count mismatch for vec $04ffc52a0000ffffffff$hdfs://ENKBDA1-ns/user/wzhou/test/data/test1/part-00000-8516f9f8-3d5f-164b-cf7e-18ca7e8d467f-d000.snappy.parquet: REFCNT = 2, should be 1"
 [2] "    water.rapids.Session.sanity_check_refs(Session.java:341)"                                                                                                                                                                                                     
 [3] "    water.rapids.Session.exec(Session.java:83)"                                                                                                                                                                                                                   
 [4] "    water.rapids.Rapids.exec(Rapids.java:93)"                                                                                                                                                                                                                     
 [5] "    water.api.RapidsHandler.exec(RapidsHandler.java:38)"                                                                                                                                                                                                          
 [6] "    sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)"                                                                                                                                                                                                 
 [7] "    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)"                                                                                                                                                                        
 [8] "    java.lang.reflect.Method.invoke(Method.java:498)"                                                                                                                                                                                                             
 [9] "    water.api.Handler.handle(Handler.java:63)"                                                                                                                                                                                                                    
[10] "    water.api.RequestServer.serve(RequestServer.java:448)"                                                                                                                                                                                                        
[11] "    water.api.RequestServer.doGeneric(RequestServer.java:297)"                                                                                                                                                                                                    
[12] "    water.api.RequestServer.doPost(RequestServer.java:223)"                                                                                                                                                                                                       
[13] "    javax.servlet.http.HttpServlet.service(HttpServlet.java:707)"                                                                                                                                                                                                 
[14] "    javax.servlet.http.HttpServlet.service(HttpServlet.java:790)"                                                                                                                                                                                                 
[15] "    ai.h2o.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)"                                                                                                                                                                                
[16] "    ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)"                                                                                                                                                                            
[17] "    ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)"                                                                                                                                                                    
[18] "    ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)"                                                                                                                                                                             
[19] "    ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)"                                                                                                                                                                     
[20] "    ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)"                                                                                                                                                                         
[21] "    ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"                                                                                                                                                                 
[22] "    ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"                                                                                                                                                                       
[23] "    water.JettyHTTPD$LoginHandler.handle(JettyHTTPD.java:189)"                                                                                                                                                                                                    
[24] "    ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"                                                                                                                                                                 
[25] "    ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"                                                                                                                                                                       
[26] "    ai.h2o.org.eclipse.jetty.server.Server.handle(Server.java:370)"                                                                                                                                                                                               
[27] "    ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)"                                                                                                                                                        
[28] "    ai.h2o.org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)"                                                                                                                                                         
[29] "    ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)"                                                                                                                                                              
[30] "    ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)"                                                                                                                                              
[31] "    ai.h2o.org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)"                                                                                                                                                                                      
[32] "    ai.h2o.org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)"                                                                                                                                                                                 
[33] "    ai.h2o.org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)"                                                                                                                                                                
[34] "    ai.h2o.org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)"                                                                                                                                                          
[35] "    ai.h2o.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)"                                                                                                                                                                      
[36] "    ai.h2o.org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)"                                                                                                                                                                       
[37] "    java.lang.Thread.run(Thread.java:745)"                                                                                                                                                                                                                        

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page,  : 
  
ERROR MESSAGE:

Ref-count mismatch for vec $04ffc52a0000ffffffff$hdfs://ENKBDA1-ns/user/wzhou/test/data/test1/part-00000-8516f9f8-3d5f-164b-cf7e-18ca7e8d467f-d000.snappy.parquet: REFCNT = 2, should be 1

Tried different ways and still got the same error. Had to bounce the H2O cluster and get rid of this error. Then further investigation found out the steps we can reproduce this issue. If we delete some or all H2O frames from H2O UI, we could run into this issue.

Basically, just run getFrames command, then select the frames want to be deleted, click Delete selected frames at the bottom. Run getFrames again. It should show those frames deleted. But if I go to R Studio, run h2o.ls(), it will show the exact Ref-count mismatch error. It seems like the frames were deleted from H2O UI perspective, but not from R Studio’s perspective.

Ok, we found out the cause. How to resolve it? Check out H2O source code at https://github.com/h2oai/h2o-3/blob/master/h2o-core/src/main/java/water/rapids/Session.java.

we could see that this error is thrown during sanity_check_refs function. The function also has the following interesting description

Check that ref counts are in a consistent state.
* This should only be called between calls to Rapids expressions (otherwise may blow false-positives).

It seems there is something not handled well in other part of the code. If inconsistent in the object references, some other part of the code might fail. It looks bugs to me.

After some investigation, found out h2o.removeAll() running from R Studio should fix this issue.

Running H2O Cluster in Background and at Specific Port Number

In a few of my previous blog, I discussed H2O vs Sparkling Water, Sparking Water Shell: Cloud size under 12 Exception, and Access Sparkling Water via R Studio.

When using H2O Sparkling Water, there are two common issues. The first one is that the default port number is 54321. However, if the H2O cluster has been bounced multiple times, the assigned port could be assigned to a different port number, actually next available port number, 54323. If the cluster is used by many data analysts, it is inconvenient to inform all of them every time the port number is changed. You want users remember only one port number.

The second issue is that the sparkling shell session can not be running in the background. If close the putty session running the sparkling shell, the H2O cluster is terminated.

This blog discusses the solution to work around the above two issues.

For port number, there are actually two parameters related: spark.ext.h2o.client.port.base and spark.ext.h2o.node.port.base. The spark.ext.h2o.client.port.base is the port number for H2O UI while spark.ext.h2o.node.port.base is the port used by H2O cluster internally for the communication among H2O nodes. Make sure these two port numbers should be different. Having these two are the same will cause issue. I also add spark.ext.h2o.cloud.name for the name of my H2O cluster.

I created two separate scripts: run_sparkling_shell.sh for the running command for sparkling shell and sparkling-shell-init.scala for starting up commands for H2O cluster in scala.

[wzhou@enkbda1node05 ~]$ cat run_sparkling_shell.sh
bin/sparkling-shell \
--master yarn \
--conf spark.ext.h2o.cloud.name=WeidongH2O-Cluster \
--conf spark.ext.h2o.client.port.base=26000 \
--conf spark.ext.h2o.node.port.base=26005 \
--conf spark.executor.instances=10 \
--conf spark.executor.memory=12g \
--conf spark.driver.memory=8g \
--conf spark.executor.cores=4 \
--conf spark.yarn.executor.memoryOverhead=4g \
--conf spark.yarn.driver.memoryOverhead=4g \
--conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
--conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--conf spark.locality.wait=30000 \
--conf spark.yarn.queue=WZTestPool \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
-i  sparkling-shell-init.scala

The code for sparkling-shell-init.scala.

import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(spark)
import h2oContext._

To execute sparkling-shell in background, my first try was to use nohup. It didn’t work. When calling sparkling-shell-init.scala script, it automatically adds :quit command at the end and terminate H2O cluster.

When I did work on Exadata, I used to use screen command a lot. It is a very useful tool for protecting long running critical job execution, like patching/upgrade and import/export. Therefore, I use the same trick in screen to help me to get around the background issue. Here are the steps.

1. Start screen session
Use screen -ls command to check whether I have screen available.

[wzhou@enkbda1node05 ~]$ screen -ls
No Sockets found in /var/run/screen/S-wzhou.

Start a screen session.
[wzhou@enkbda1node05 ~]$ screen

2. Start H2O Cluster
Run the script to start H2O cluster.

[wzhou@enkbda1node05 ~]$ ./run_sparkling_shell.sh

-----
  Spark master (MASTER)     : yarn
  Spark home   (SPARK_HOME) :
  H2O build version         : 3.14.0.7 (weierstrass)
  Spark build version       : 2.2.0
  Scala version             : 2.11
----

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/anaconda2/lib/python2.7/site-packages/pyspark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/12/09 10:12:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/12/09 10:12:58 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
17/12/09 10:12:59 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://192.168.10.14:4040
Spark context available as 'sc' (master = yarn, app id = application_2567590118914_1007).
Spark session available as 'spark'.
Loading sparkling-shell-init.scala...
import org.apache.spark.h2o._
h2oContext: org.apache.spark.h2o.H2OContext =

Sparkling Water Context:
 * H2O name: WeidongH2O-Cluster
 * cluster size: 10
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (2,enkbda1node08.enkitec.com,26005)
  (4,enkbda1node12.enkitec.com,26005)
  (9,enkbda1node17.enkitec.com,26005)
  (5,enkbda1node13.enkitec.com,26005)
  (7,enkbda1node15.enkitec.com,26005)
  (1,enkbda1node09.enkitec.com,26005)
  (8,enkbda1node04.enkitec.com,26005)
  (6,enkbda1node10.enkitec.com,26005)
  (3,enkbda1node11.enkitec.com,26005)
  (10,enkbda1node05.enkitec.com,26005)
  ------------------------

  Open H2O Flow in browser: http://192.168.10.14:26000 (CMD + click in Mac OSX)


import h2oContext._

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
scala&gt;

3. Verify H2O Cluster
Right now, close the session. Open a new session. Check out the existing screen session.

[wzhou@enkbda1node05 ~]$ screen -ls
There is a screen on:
        19044.pts-0.enkbda1node05       (Detached)
1 Socket in /var/run/screen/S-wzhou.

Now, attach to the existing session
[wzhou@enkbda1node05 ~]$ screen -x 19044

You should see the original session that was running sparking shell. The UI is still working as expected.

Access Sparkling Water via R Studio

In the last few blogs, I discussed the following topics related to Sparkling Water.
Sparking Water Shell: Cloud size under 12 Exception
H2O vs Sparkling Water
The H2O Flow UI is nice, but need to some mouse clicks to get what you want. In this blog, I am going to discuss how to access Sparkling Water via R Studio. You can access the same H2O frame from both Sparkling Water and R Studio.

Data Preparation
Create the following store transaction file, store_trans.txt

[~]# cat store_trans.txt 
3000251,20171111,1321,Austin,TX,1234,Food,Pizza,34910,8.10,Angla
3000252,20171111,1321,Austin,TX,7812,Food,Milk,16920,3.99,Rosina
3000753,20171112,2010,Houston,TX,3190,Food,Pizza,34910,8.10,Broderick
3000954,20171112,1442,Austin,TX,1234,Food,Pizza,34910,8.10,Jeanne
3008255,20171112,1651,Austin,TX,5134,Sports,Shoes,12950,45.99,Angla
3026256,20171112,1632,Austin,TX,1234,Food,Pizza,34910,8.10,Jeanne

Upload the file to HDFS.

[~]# hdfs dfs -put store_trans.txt /user/wzhou/test/data/
[~]# hdfs dfs -ls /user/wzhou/test/data
Found 2 items
-rw-r--r--   3 wzhou supergroup       2495 2017-11-14 09:58 /user/wzhou/test/data/stock.csv
-rw-r--r--   3 wzhou supergroup        400 2017-11-16 04:50 /user/wzhou/test/data/store_trans.txt

Start Sparkling Shell
Run the following code to start sparkling shell. I used only 2 instance here and you could specify more instances and it will spread into multiple instances.

kinit wzhou
bin/sparkling-shell \
–master yarn \
–conf spark.executor.instances=2 \
–conf spark.executor.memory=1g \
–conf spark.driver.memory=1g \
–conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
–conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
–conf spark.dynamicAllocation.enabled=false \
–conf spark.sql.autoBroadcastJoinThreshold=-1 \
–conf spark.locality.wait=30000 \
–conf spark.scheduler.minRegisteredResourcesRatio=1

Here is the execution result.

[root@worker--908ba3bf-d4af-4db1-bc3e-f7a9ba5372fa sparkling-water-2.2.2]# bin/sparkling-shell \
> --master yarn \
> --conf spark.executor.instances=2 \
> --conf spark.executor.memory=1g \
> --conf spark.driver.memory=1g \
> --conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
> --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
> --conf spark.dynamicAllocation.enabled=false \
> --conf spark.sql.autoBroadcastJoinThreshold=-1 \
> --conf spark.locality.wait=30000 \
> --conf spark.scheduler.minRegisteredResourcesRatio=1

-----
  Spark master (MASTER)     : yarn
  Spark home   (SPARK_HOME) : 
  H2O build version         : 3.14.0.7 (weierstrass)
  Spark build version       : 2.2.0
  Scala version             : 2.11
----

/usr/java/jdk1.8.0_131
lib_dir=/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/bin/../lib
bin_dir=/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/bin
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.142.0.8:4040
Spark context available as 'sc' (master = yarn, app id = application_1510827150310_0003).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

Start H2O Cluster
The H2O Cluster is running inside the Spark cluster.

import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(spark)
import h2oContext._

scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val h2oContext = H2OContext.getOrCreate(spark) 
h2oContext: org.apache.spark.h2o.H2OContext =                                   

Sparkling Water Context:
 * H2O name: sparkling-water-root_application_1510827150310_0003
 * cluster size: 2
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (2,worker--c7e86f76-a1fa-49e5-9074-d7bef4dbb43e.c.cdh-director-173715.internal,54321)
  (1,worker--908ba3bf-d4af-4db1-bc3e-f7a9ba5372fa.c.cdh-director-173715.internal,54321)
  ------------------------
  Open H2O Flow in browser: http://10.142.0.8:54323 (CMD + click in Mac OSX)

scala> import h2oContext._ 
import h2oContext._

Access H2O Flow UI
Open a browser, and type in http://10.142.0.8:54323. Port 54321 is the default port number. The reason why it is 54323 is that I have run another one that is using 54321 right now.
Click importFiles

Find out the store_tran.txt file and click Import

Click Parse these files

You should see the following and then click Parse

Click View

Right now we have the nice store_trans.hex H2O Frame.

Run command getFrames and we should be able to see the store_trans.hex frame.

You can then do tons of stuff, like model building, predication and so on. I am not going to discuss more in this blog. At this moment, let me switch to R Studio.

Load Required Libraries in R Studio
Run the following:

options(rsparkling.sparklingwater.version = "2.2.2")
library(rsparkling)
library(sparklyr)
library(h2o)
library(dplyr)

Here is the result

> options(rsparkling.sparklingwater.version = "2.2.2")
> library(rsparkling)
> library(sparklyr)
> library(h2o)
----------------------------------------------------------------------
Your next step is to start H2O:
    > h2o.init()
For H2O package documentation, ask for help:
    > ??h2o
After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

----------------------------------------------------------------------
Attaching package: ‘h2o’
The following objects are masked from ‘package:stats’:
    cor, sd, var
The following objects are masked from ‘package:base’:
    &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames, colnames<-, ifelse, is.character, is.factor, is.numeric, log, log10, log1p, log2, round, signif, trunc 
> library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
    filter, lag
The following objects are masked from ‘package:base’:
    intersect, setdiff, setequal, union

Connect to H2O Cluster
Do not run h2o.init() as the messages shown above. It will start a one node H2O cluster in your local node. I want to reuse the existing H2O cluster that is across multiple nodes. Run the following:
h2o.init(ip=”10.142.0.8″, port=54323)
h2o.clusterIsUp()

You should see the connection to the existing H2O cluster is successful. Also ignore version mismatch message.

Check cluster status and the status should be health in all nodes.

Let me see whether I can see the H2O frame I created from H2O Flow UI.

Excellent, I can see it. Let me do some operation against this frame.

Ok, we can see it is quite easy to access H2O from R Studio.

Sparking Water Shell: Cloud size under 12 Exception

In my last blog, I compared Sparking Water and H2O. Before I made Sparking-shell work, I run into a lot of issues. One of annoying errors was runtime exception: Cloud size under xx. Searched internet and found many people have the similar problems. There are many recommendations, ranging from downloading the latest and matching version, to set to certain parameters during startup. Unfortunately none of them were working for me. But finally I figured out the issue and would like to share my solution in this blog.

After I downloaded Sparking Water, unzipped the file, and run sparking-shell command as shown from http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.2/2/index.html. It looked good initially.

[sparkling-water-2.2.2]$ bin/sparkling-shell --conf "spark.executor.memory=1g"

-----
  Spark master (MASTER)     : local[*]
  Spark home   (SPARK_HOME) : /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2
  H2O build version         : 3.14.0.7 (weierstrass)
  Spark build version       : 2.2.0
  Scala version             : 2.11
----

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.132.110.145:4040
Spark context available as 'sc' (master = yarn, app id = application_3608740912046_1343).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

But when I run command val h2oContext = H2OContext.getOrCreate(spark), it gave me many errors as follows:

scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val h2oContext = H2OContext.getOrCreate(spark)
17/11/05 10:07:48 WARN internal.InternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
17/11/05 10:07:48 WARN internal.InternalH2OBackend: Due to non-deterministic behavior of Spark broadcast-based joins
We recommend to disable them by configuring `spark.sql.autoBroadcastJoinThreshold` variable to value `-1`:
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
17/11/05 10:07:48 WARN internal.InternalH2OBackend: The property 'spark.scheduler.minRegisteredResourcesRatio' is not specified!
We recommend to pass `--conf spark.scheduler.minRegisteredResourcesRatio=1`
17/11/05 10:07:48 WARN internal.InternalH2OBackend: Unsupported options spark.dynamicAllocation.enabled detected!
17/11/05 10:07:48 WARN internal.InternalH2OBackend:
The application is going down, since the parameter (spark.ext.h2o.fail.on.unsupported.spark.param,true) is true!
If you would like to skip the fail call, please, specify the value of the parameter to false.

java.lang.IllegalArgumentException: Unsupported argument: (spark.dynamicAllocation.enabled,true)
  at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$checkUnsupportedSparkOptions$1.apply(InternalBackendUtils.scala:46)
  at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$checkUnsupportedSparkOptions$1.apply(InternalBackendUtils.scala:38)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.h2o.backends.internal.InternalBackendUtils$class.checkUnsupportedSparkOptions(InternalBackendUtils.scala:38)
  at org.apache.spark.h2o.backends.internal.InternalH2OBackend.checkUnsupportedSparkOptions(InternalH2OBackend.scala:30)
  at org.apache.spark.h2o.backends.internal.InternalH2OBackend.checkAndUpdateConf(InternalH2OBackend.scala:60)
  at org.apache.spark.h2o.H2OContext.<init>(H2OContext.scala:90)
  at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:355)
  at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:383)
  ... 50 elided

You can see I need to pass in more parameters when starting sparking-shell. Change the parameters as follows:

bin/sparkling-shell \
--master yarn \
--conf spark.executor.memory=1g \
--conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
--conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--conf spark.locality.wait=30000 \
--conf spark.scheduler.minRegisteredResourcesRatio=1

Ok, this time it looked better, at least warning messages disappeared. But got error message java.lang.RuntimeException: Cloud size under 2.

[sparkling-water-2.2.2]$ bin/sparkling-shell \
> --master yarn \
> --conf spark.executor.memory=1g \
> --conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
> --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
> --conf spark.dynamicAllocation.enabled=false \
> --conf spark.sql.autoBroadcastJoinThreshold=-1 \
> --conf spark.locality.wait=30000 \
> --conf spark.scheduler.minRegisteredResourcesRatio=1

-----
  Spark master (MASTER)     : yarn
  Spark home   (SPARK_HOME) : /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2
  H2O build version         : 3.14.0.7 (weierstrass)
  Spark build version       : 2.2.0
  Scala version             : 2.11
----

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.132.110.145:4040
Spark context available as 'sc' (master = yarn, app id = application_3608740912046_1344).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val h2oContext = H2OContext.getOrCreate(spark)
java.lang.RuntimeException: Cloud size under 2
  at water.H2O.waitForCloudSize(H2O.java:1689)
  at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:117)
  at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:121)
  at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:355)
  at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:383)
  ... 50 elided

Not big deal. Anyway I didn’t specify the number of executors used and executor memory. It may complain about the size of the H2O cluster is too small. Add the following three parameters.

–conf spark.executor.instances=12 \
–conf spark.executor.memory=10g \
–conf spark.driver.memory=8g \

Rerun the whole thing got the same error with the size number changing to 12. It did not look right to me. Then I check out the H2O error logfile and found tons of messages as follows:

11-06 10:23:16.355 10.132.110.145:54325  30452  #09:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
11-06 10:23:16.790 10.132.110.145:54325  30452  #06:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused

It looks like Sparking Water can not connect to Spark cluster. After some investigation, I then realized I installed and run sparking-shell from edge node on the BDA. If the H2O cluster was running inside a Spark application, the communication of Spark cluster on BDA is through BDA’s private network, or InfiniteBand network. Edge node can not directly communicate to IB network on BDA. With this assumption in mind, I installed and run Sparking Water on one of BDA nodes, it worked perfectly without any issue. Problem solved!