Use Jupyter Notebook to Access H2O Driverless AI

I discussed the H2O Driverless AI installation in my last blog, Install H2O Driverless AI on Google Cloud Platform. The H2O AI docker image ships with a Jupyter Notebook deployment, so once H2O AI starts, we can use Jupyter Notebook directly. In this blog, I am going to discuss how to use Jupyter Notebook to connect to H2O AI.

To log in to Jupyter Notebook, I need to know the login token. It is usually shown in the console output when Jupyter starts. However, if I check the docker logs command, it only shows the output from H2O AI.

root@h2otest:~# docker ps
CONTAINER ID        IMAGE                    COMMAND             CREATED             STATUS              PORTS                                                                                                NAMES
5b803337e8b5        opsh2oai/h2oai-runtime   "./run.sh"          About an hour ago   Up About an hour    0.0.0.0:8888->8888/tcp, 0.0.0.0:9090->9090/tcp, 0.0.0.0:12345->12345/tcp, 0.0.0.0:54321->54321/tcp   h2oai

root@h2otest:~# docker logs h2oai
---------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------
     version: 1.0.30

- Put data in the volume mounted at /data
- Logs are written to the volume mounted at /log/20180424-140930
- Connect to Driverless AI on port 12345 inside the container
- Connect to Jupyter notebook on port 8888 inside the container

But the output at least tells me the log file location. SSH to the container and check out the Jupyter log.

root@h2otest:~# ./ssh_h2oai.sh 
root@5b803337e8b5:/# cd /log/20180424-140930
root@5b803337e8b5:/log/20180424-140930# ls -l
total 84
-rw-r--r-- 1 root root 61190 Apr 24 14:53 h2oai.log
-rw-r--r-- 1 root root 14340 Apr 24 15:14 h2o.log
-rw-r--r-- 1 root root  2700 Apr 24 14:58 jupyter.log
-rw-r--r-- 1 root root    52 Apr 24 14:09 procsy.log
root@5b803337e8b5:/log/20180424-140930# cat jupyter.log
config:
    /jupyter/.jupyter
    /h2oai_env/etc/jupyter
    /usr/local/etc/jupyter
    /etc/jupyter
data:
    /jupyter/.local/share/jupyter
    /h2oai_env/share/jupyter
    /usr/local/share/jupyter
    /usr/share/jupyter
runtime:
    /jupyter/.local/share/jupyter/runtime
[I 14:10:01.512 NotebookApp] Writing notebook server cookie secret to /jupyter/.local/share/jupyter/runtime/notebook_cookie_secret
[W 14:10:04.062 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[I 14:10:04.224 NotebookApp] Serving notebooks from local directory: /jupyter
[I 14:10:04.224 NotebookApp] 0 active kernels
[I 14:10:04.224 NotebookApp] The Jupyter Notebook is running at:
[I 14:10:04.224 NotebookApp] http://[all ip addresses on your system]:8888/?token=f1b8f6dc7fb0aab7caec278a2bf971249b765140e4b3b338
[I 14:10:04.224 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 14:10:04.224 NotebookApp] 
    
    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=f1b8f6dc7fb0aab7caec278a2bf971249b765140e4b3b338
[W 14:19:26.189 NotebookApp] 401 POST /login?next=%2Ftree%3F (10.142.0.2) 834.30ms referer=http://10.142.0.2:8888/login?next=%2Ftree%3F
[I 14:20:15.706 NotebookApp] 302 POST /login?next=%2Ftree%3F (10.142.0.2) 1.36ms

Although this approach worked the majority of the time, a few times I ran into an issue where the Jupyter login said the token was invalid. After some research, I found another way that is guaranteed to get the correct token: a json file under the /jupyter/.local/share/jupyter/runtime directory. The filename, nbserver-xx.json, changes each time H2O AI starts.

root@5b803337e8b5:/# ls -l /jupyter/.local/share/jupyter/runtime
total 12
-rw-r--r-T 1 root root  263 Apr 24 14:24 kernel-b225302b-f2d9-47ac-b99c-f1f55eb54021.json
-rw-r--r-- 1 root root  245 Apr 24 14:10 nbserver-51.json
-rw------- 1 root root 1386 Apr 24 14:10 notebook_cookie_secret
root@5b803337e8b5:/# cat /jupyter/.local/share/jupyter/runtime/nbserver-51.json
{
  "base_url": "/",
  "hostname": "localhost",
  "notebook_dir": "/jupyter",
  "password": false,
  "pid": 51,
  "port": 8888,
  "secure": false,
  "token": "f1b8f6dc7fb0aab7caec278a2bf971249b765140e4b3b338",
  "url": "http://localhost:8888/"

Based on that, I created a script to get the token without having to ssh to the container.

root@h2otest:~# cat get_jy_token.sh 
#!/bin/bash

# Find the nbserver-*.json filename inside the h2oai container, then grep the token line from it
JSON_FILENAME=`docker exec -it h2oai ls -l /jupyter/.local/share/jupyter/runtime | grep nbserver | awk '{print $9}' | tr -d "\r"`
#echo $JSON_FILENAME
docker exec -it h2oai grep token /jupyter/.local/share/jupyter/runtime/$JSON_FILENAME

Run the script to get the token.

root@h2otest:~# ./get_jy_token.sh 
  "token": "f1b8f6dc7fb0aab7caec278a2bf971249b765140e4b3b338",

Ok, let me go to the login screen and input the token.

The Jupyter screen shows up.

There are two sample notebooks installed by default. I tried to get them working, but the sample data in the docker image does not seem to work, and there is no detailed API document available at this moment. So I just did a few basic things to prove it works. The following is the code I entered in the notebook.

import h2oai_client
import numpy as np
import pandas as pd
# import h2o
import requests
import math
from h2oai_client import Client, ModelParameters, InterpretParameters

# Connect to the Driverless AI server
address = 'http://35.229.57.147:12345'
username = 'h2o'
password = 'h2o'
h2oai = Client(address = address, username = username, password = password)

# Load a csv file from the /data volume as a Driverless AI dataset
stock_path = '/data/stock_price.csv'
stockData = h2oai.create_dataset_sync(stock_path)
stockData.dump()


I went back to the H2O AI UI and found that three more stock_price datasets had been created by my Jupyter notebook.

So each time I run the command h2oai.create_dataset_sync(stock_path), it creates a new dataset; datasets with the same path are not deduplicated. To avoid duplication, I have to manually delete the duplicates from the UI. It's not a big deal, I just need to remember to clean up the duplicated datasets if I run the same notebook multiple times. Another way to get around this issue is to use a different login name. Since each login name only sees the datasets belonging to that user, you could have one login name for production use and a different one for development or testing. You can then safely remove the duplicated datasets under the development username without worrying about removing the wrong one.
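
One way to avoid the duplicates is to cache the dataset key locally and only call create_dataset_sync() when a path has not been loaded before. Below is just a sketch of the idea, not an official API: it reuses the h2oai client object created in the notebook above, and the .key attribute on the returned dataset object is my assumption based on the dump() output.

import json
import os

def get_or_create_dataset(h2oai, path, cache_file='dataset_keys.json'):
    # Reuse a previously created Driverless AI dataset for this path if we
    # recorded its key earlier; otherwise create it once and record the key.
    cache = {}
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            cache = json.load(f)
    if path not in cache:
        dataset = h2oai.create_dataset_sync(path)
        cache[path] = dataset.key    # assumed attribute on the returned object
        with open(cache_file, 'w') as f:
            json.dump(cache, f)
    return cache[path]

stock_key = get_or_create_dataset(h2oai, '/data/stock_price.csv')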

Install H2O Driverless AI on Google Cloud Platform

I wrote many blogs about H2O and H2O Sparkling Water in the past. Today I am going to discuss the installation of H2O Driverless AI (H2O AI). H2O AI targets machine learning, especially deep learning. While H2O focuses more on algorithms, models, and prediction, H2O AI automates some of the most difficult data science and ML workflows and offers automatic visualizations and Machine Learning Interpretability (MLI). Here is the architecture of H2O AI.

The installation differs somewhat between environments. For details on each environment, see the H2O Driverless AI installation document at http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/installing.html.

This blog covers only the Google Cloud installation. Here are a few important things to know before the installation.
1. It requires a lot of memory and CPUs, and GPUs if possible. I used 8 CPUs and 52 GB of memory on Google Cloud. If you can use GPUs, add the GPU option; I don't have access to GPUs in my account.
2. The OS is based on Ubuntu 16.04, which I believe is the minimum version supported.
3. The OS disk size should be >= 64 GB. I used 64 GB.
4. Instead of installing a software package, H2O AI ships as a Docker image, so Docker needs to be installed first.
5. If you plan to use python to connect to H2O AI, the supported python version is 3.6.

Ok, here is the installation procedure on GCP:
1. Create a new firewall rule
Click VPC Network -> Firewall Rules -> Create Firewall Rule
Input the following:
Name: h2oai
Description: The firewall rule for H2O driverless AI
Target tags: h2o
Source IP ranges: 0.0.0.0/0
Protocols and ports: tcp:12345,54321
Please note: H2O's documentation misses port 54321, which is used by the H2O Flow UI. This port needs to be opened, otherwise you cannot access the H2O Flow UI.


2. Create a new VM instance
Name: h2otest
Zone: us-east1-c
Cores: 8 vCPU
Memory: 52 GB
Boot disk: 64 GB, Ubuntu 16.04
Service account: use your GCP service account
Network tags: h2o

3. Install and configure Docker
Log on to the h2otest VM instance and su to the root user.
Create a script, build.sh

#!/bin/bash
# Install prerequisites, add the Docker apt repository and GPG key, then install docker-ce.

apt-get -y update
apt-get -y --no-install-recommends install \
  curl \
  apt-utils \
  python-software-properties \
  software-properties-common

add-apt-repository -y "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -

apt-get update
apt-get install -y docker-ce

Run the script

root@h2otest:~# chmod u+x build.sh
root@h2otest:~# ./build.sh

Create the required directories.

mkdir ~/tmp
mkdir ~/log
mkdir ~/data
mkdir ~/scripts
mkdir ~/license
mkdir ~/demo
mkdir -p ~/jupyter/notebooks

Adding the current user to the docker group is optional. I did it anyway.

root@h2otest:~# usermod -aG docker weidong.zhou
root@h2otest:~# id weidong.zhou
uid=1001(weidong.zhou) gid=1002(weidong.zhou) groups=1002(weidong.zhou),4(adm),20(dialout),24(cdrom),25(floppy),29(audio),30(dip),44(video),46(plugdev),109(netdev),110(lxd),1000(ubuntu),1001(google-sudoers),999(docker)

4. Download and Load H2O AI Docker Image
Download the docker image.

root@h2otest:~# wget https://s3-us-west-2.amazonaws.com/h2o-internal-release/docker/driverless-ai-docker-runtime-latest-release.gz
--2018-04-18 16:43:31--  https://s3-us-west-2.amazonaws.com/h2o-internal-release/docker/driverless-ai-docker-runtime-latest-release.gz
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.209.8
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.209.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2167098485 (2.0G) [application/gzip]
Saving to: ‘driverless-ai-docker-runtime-latest-release.gz’

driverless-ai-docker-runtime-latest-release.g 100%[==============================================================================================>]   2.02G  26.2MB/s    in 94s     

2018-04-18 16:45:05 (22.0 MB/s) - ‘driverless-ai-docker-runtime-latest-release.gz’ saved [2167098485/2167098485]

Load Docker image.

root@h2otest:~# docker load < driverless-ai-docker-runtime-latest-release.gz
9d3227c1793b: Loading layer [==================================================>]  121.3MB/121.3MB
a1a54d352248: Loading layer [==================================================>]  15.87kB/15.87kB
. . . .
ed86b627a562: Loading layer [==================================================>]  1.536kB/1.536kB
7d38d6d61cec: Loading layer [==================================================>]  1.536kB/1.536kB
de539994349c: Loading layer [==================================================>]  3.584kB/3.584kB
8e992954a9eb: Loading layer [==================================================>]  3.584kB/3.584kB
ff71b3e896ef: Loading layer [==================================================>]  8.192kB/8.192kB
Loaded image: opsh2oai/h2oai-runtime:latest
root@h2otest:~# docker image ls
REPOSITORY               TAG                 IMAGE ID            CREATED             SIZE
opsh2oai/h2oai-runtime   latest              dff251c69407        12 days ago         5.46GB

5. Start H2O AI
Create a startup script, start_h2oai.sh. Please note: the H2O documentation misses port 54321 for the H2O Flow UI. Then run the script.

root@h2otest:~# cat start_h2oai.sh 
#!/bin/bash

docker run \
    --rm \
    -u `id -u`:`id -g` \
    -p 12345:12345 \
    -p 54321:54321 \
    -p 8888:8888 \
    -p 9090:9090 \
    -v `pwd`/data:/data \
    -v `pwd`/log:/log \
    -v `pwd`/license:/license \
    -v `pwd`/tmp:/tmp \
    opsh2oai/h2oai-runtime

root@h2otest:~# chmod a+x start_h2oai.sh
root@h2otest:~# ./start_h2oai.sh 
---------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------
     version: 1.0.30

- Put data in the volume mounted at /data
- Logs are written to the volume mounted at /log/20180419-094058
- Connect to Driverless AI on port 12345 inside the container
- Connect to Jupyter notebook on port 8888 inside the container
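
With the container running, a quick way to confirm from the VM itself that the ports answer before touching the browser is a small python3 check (my own addition; it assumes python3 is available on the Ubuntu host):

import urllib.request

# 12345 is the Driverless AI UI, 8888 is the Jupyter notebook.
for port in (12345, 8888):
    resp = urllib.request.urlopen("http://localhost:%d/" % port, timeout=5)
    print(port, resp.status)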

Also create a script, ssh_h2oai.sh, to quickly ssh (docker exec) into the H2O AI container without having to look up the container id first.

root@h2otest:~# vi ssh_h2oai.sh
root@h2otest:~# cat ssh_h2oai.sh 
#!/bin/bash

CONTAINER_ID=`docker ps|grep h2oai-runtime|awk '{print $1}'`
docker exec -it $CONTAINER_ID bash
root@h2otest:~# chmod a+x ssh_h2oai.sh 
root@h2otest:~# ./ssh_h2oai.sh
root@09bd138f4f41:/#

6. Use H2O AI
Get the external IP of the H2O VM. In my case, it is 35.196.90.114. Then access the URL http://35.196.90.114:12345/. You will see the H2O AI evaluation agreement screen. Click I Agree to these Terms to continue.
The logon screen shows up. I used the following information to sign in.
Username: h2o
Password: h2o
Actually it doesn't matter what you input; you can use any username to log in, as it simply doesn't check. I know it has a feature to integrate with LDAP, I just didn't give it a try this time.

After signing in, it will ask you to input license information. Fill out your information at https://www.h2o.ai/try-driverless-ai/ and you will receive a 21-day trial license by email.

The first screen that shows up is the Datasets overview. You can add a dataset from one of three sources: File System, Hadoop File System, or Amazon S3. To use some sample data, I chose Amazon S3's AirlinesTest.csv.zip file.



For every dataset, there are two kinds of Actions: Visualize and Predict.
Click Visualize, and many interesting visualization charts show up.



If you click Predict, the Experiment screen shows up. Choose a Target Column; in my example, I chose the ArrTime column. Then click Launch Experiment.

Once finished, it shows a list of options. For example, I clicked Interpret this model on original features.


For people familiar with the H2O Flow UI, H2O AI still has this UI: just click H2O-3 from the menu and the H2O Flow UI will show up.

In general, H2O AI has an impressive UI and tons of new features; no wonder it is not free. In the next blog, I am going to discuss how to configure the python client to access H2O AI.

Parquet File Can not Be Read in Sparkling Water H2O

For the past few months, I have written several blogs related to H2O:
Use Python for H2O
H2O vs Sparkling Water
Sparking Water Shell: Cloud size under 12 Exception
Access Sparkling Water via R Studio
Running H2O Cluster in Background and at Specific Port Number
Weird Ref-count mismatch Message from H2O

Sparkling Water and H2O perform very well for data science projects. But when something works beautifully in one environment and not in another, it can be the most annoying thing in the world. This is exactly what happened at one of my clients.

They used to run Sparkling Water and H2O in an Oracle BDA environment and it worked great. However, even a full rack BDA (18 nodes) is not enough to run big datasets on H2O. Recently they moved to a much bigger CDH cluster (non-BDA environment) with CDH 5.13 installed. Sparkling Water still works, but there was one major issue: parquet files could not be read correctly, even though there is no issue reading the same parquet files from spark-shell and pyspark. This is really annoying, as parquet is one of the data formats heavily used by the client. After many failed tries and much investigation, I was finally able to figure out the issue and implement a workaround. This blog discusses the parquet reading issue and the workaround in Sparkling Water.

Create Test Data Set
I did the test in my CDH cluster (CDH 5.13). I first created a small test data set, stock.csv, and uploaded it to the /user/root directory on HDFS (a sketch of the upload follows the data below).

date,close,volume,open,high,low
9/23/16,24.05,56837,24.13,24.22,23.88
9/22/16,24.1,56675,23.49,24.18,23.49
9/21/16,23.38,70925,23.21,23.58,23.025
9/20/16,23.07,35429,23.17,23.264,22.98
9/19/16,23.12,34257,23.22,23.27,22.96
9/16/16,23.16,83309,22.96,23.21,22.96
9/15/16,23.01,43258,22.7,23.25,22.53
9/14/16,22.69,33891,22.81,22.88,22.66
9/13/16,22.81,59871,22.75,22.89,22.53
9/12/16,22.85,109145,22.9,22.95,22.74
9/9/16,23.03,115901,23.53,23.53,23.02
9/8/16,23.6,32717,23.8,23.83,23.55
9/7/16,23.85,143635,23.69,23.89,23.69
9/6/16,23.68,43577,23.78,23.79,23.43
9/2/16,23.84,31333,23.45,23.93,23.41
9/1/16,23.42,49547,23.45,23.48,23.26
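
The upload itself is a single hdfs put. For reference, here is the equivalent done from python (a sketch; it assumes the hdfs CLI is on the PATH of the node where stock.csv was created):

import subprocess

# Equivalent to: hdfs dfs -put -f stock.csv /user/root/stock.csv
subprocess.check_call(["hdfs", "dfs", "-put", "-f", "stock.csv", "/user/root/stock.csv"])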

Create a Parquet File
Run the following in spark2-shell to create a parquet file and make sure that I can read it back.

scala> val myTest=spark.read.format("csv").load("/user/root/stock.csv")
myTest: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 4 more fields]
 
scala> myTest.show(3)
+-------+-----+------+-----+-----+-----+
|    _c0|  _c1|   _c2|  _c3|  _c4|  _c5|
+-------+-----+------+-----+-----+-----+
|   date|close|volume| open| high|  low|
|9/23/16|24.05| 56837|24.13|24.22|23.88|
|9/22/16| 24.1| 56675|23.49|24.18|23.49|
+-------+-----+------+-----+-----+-----+
only showing top 3 rows
 
scala> myTest.write.format("parquet").save("/user/root/mytest.parquet")
 
scala> val readTest = spark.read.format("parquet").load("/user/root/mytest.parquet")
readTest: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 4 more fields]
 
scala> readTest.show(3)
+-------+-----+------+-----+-----+-----+
|    _c0|  _c1|   _c2|  _c3|  _c4|  _c5|
+-------+-----+------+-----+-----+-----+
|   date|close|volume| open| high|  low|
|9/23/16|24.05| 56837|24.13|24.22|23.88|
|9/22/16| 24.1| 56675|23.49|24.18|23.49|
+-------+-----+------+-----+-----+-----+
only showing top 3 rows

Start a Sparkling Water H2O Cluster
I started a Sparkling Water cluster with 2 nodes.

[root@a84-master--2df67700-f9d1-46f3-afcf-ba27a523e143 sparkling-water-2.2.7]# . /etc/spark2/conf.cloudera.spark2_on_yarn/spark-env.sh
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2
[root@a84-master--2df67700-f9d1-46f3-afcf-ba27a523e143 sparkling-water-2.2.7]# bin/sparkling-shell \
> --master yarn \
> --conf spark.executor.instances=2 \
> --conf spark.executor.memory=1g \
> --conf spark.driver.memory=1g \
> --conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
> --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
> --conf spark.dynamicAllocation.enabled=false \
> --conf spark.sql.autoBroadcastJoinThreshold=-1 \
> --conf spark.locality.wait=30000 \
> --conf spark.scheduler.minRegisteredResourcesRatio=1

-----
  Spark master (MASTER)     : yarn
  Spark home   (SPARK_HOME) : /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2
  H2O build version         : 3.16.0.4 (wheeler)
  Spark build version       : 2.2.1
  Scala version             : 2.11
----

/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.128.0.5:4040
Spark context available as 'sc' (master = yarn, app id = application_1518097883047_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val h2oContext = H2OContext.getOrCreate(spark) 
h2oContext: org.apache.spark.h2o.H2OContext =                                   

Sparkling Water Context:
 * H2O name: sparkling-water-root_application_1518097883047_0001
 * cluster size: 2
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (1,a84-worker--6e693a0a-d92c-4172-81b7-f2f07b6d5d7c.c.cdh-director-194318.internal,54321)
  (2,a84-worker--c0ecccae-cead-44f2-9f75-39aadb1d024a.c.cdh-director-194318.internal,54321)
  ------------------------

  Open H2O Flow in browser: http://10.128.0.5:54321 (CMD + click in Mac OSX)

scala> import h2oContext._
import h2oContext._
scala> 

Read Parquet File from H2O Flow UI
Open the H2O Flow UI and read the same parquet file.

After clicking Parse these files, I got a corrupted file.

Obviously, the parquet file was not read correctly. At this point, there were no error messages in the H2O console. If I continued to import the file, the H2O Flow UI threw the following error.

The H2O console would show the following error:

scala> 18/02/08 09:23:59 WARN servlet.ServletHandler: Error for /3/Parse
java.lang.NoClassDefFoundError: org/apache/parquet/hadoop/api/ReadSupport
	at water.parser.parquet.ParquetParser.correctTypeConversions(ParquetParser.java:104)
	at water.parser.parquet.ParquetParserProvider.createParserSetup(ParquetParserProvider.java:48)
	at water.parser.ParseSetup.getFinalSetup(ParseSetup.java:213)
	at water.parser.ParseDataset.forkParseDataset(ParseDataset.java:119)
	at water.parser.ParseDataset.parse(ParseDataset.java:43)
	at water.api.ParseHandler.parse(ParseHandler.java:36)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at water.api.Handler.handle(Handler.java:63)
	at water.api.RequestServer.serve(RequestServer.java:451)
	at water.api.RequestServer.doGeneric(RequestServer.java:296)
	at water.api.RequestServer.doPost(RequestServer.java:222)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
	at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)
	at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
	at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at water.JettyHTTPD$LoginHandler.handle(JettyHTTPD.java:192)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at ai.h2o.org.eclipse.jetty.server.Server.handle(Server.java:370)
	at ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
	at ai.h2o.org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
	at ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
	at ai.h2o.org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
	at ai.h2o.org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
	at ai.h2o.org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at ai.h2o.org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at ai.h2o.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at ai.h2o.org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.parquet.hadoop.api.ReadSupport
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 39 more

Solution
As the BDA had no issue with the same Sparkling Water H2O deployment and the BDA used CDH 5.10, I initially focused on the CDH version difference. I built three CDH clusters using three different CDH versions: 5.13, 5.12 and 5.10. All of them showed the exact same error. This ruled out the CDH version difference and shifted my focus to environment differences, especially the classpath and jar files. I tried setting JAVA_HOME, SPARK_HOME, and SPARK_DIST_CLASSPATH, and unfortunately none of them worked.

I noticed /etc/spark2/conf.cloudera.spark2_on_yarn/classpath.txt seems to have far fewer entries than classpath.txt under Spark 1.6. I tried adding back the missing entries. Still no luck.

I added two more parameters to get more information in the H2O logs.

--conf spark.ext.h2o.node.log.level=INFO \
--conf spark.ext.h2o.client.log.level=INFO \

This gave a little more useful information: it complained about the class ParquetFileWriter not being found.

$ cat h2o_10.54.225.9_54000-5-error.log
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR: DistributedException from /192.168.10.54:54005: 'org/apache/parquet/hadoop/ParquetFileWriter', caused by java.lang.NoClassDefFoundError: org/apache/parquet/hadoop/ParquetFileWriter
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.MRTask.getResult(MRTask.java:478)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.MRTask.getResult(MRTask.java:486)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.MRTask.doAll(MRTask.java:402)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.parser.ParseSetup.guessSetup(ParseSetup.java:283)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.api.ParseSetupHandler.guessSetup(ParseSetupHandler.java:40)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

My client found a temporary solution using h2odriver.jar, following the instructions from Using H2O on Hadoop. The command used is shown below:

cd /opt/h2o-3
hadoop jar h2odriver.jar -nodes 70 -mapperXmx 40g -output hdfs://PROD01ns/user/mytestuser/log24

Although this solution provides functionality similar to Sparkling Water, it has some critical performance issues:
1. The above command creates a 70-node H2O cluster. With Sparkling Water, the work would be evenly distributed to all available nodes, but the h2odriver.jar approach heavily uses only a few hadoop nodes. For a big dataset, the majority of activity happened on only 3~4 nodes, which pushed those nodes' cpu utilization close to 100%. For one big test dataset, it never completed parsing the file; it failed after a 20-minute run.
2. Unlike Sparkling Water, it actually reads the files during the parsing phase, not in the importing phase.
3. The performance is pretty bad compared with Sparkling Water. I guess Sparkling Water uses the underlying Spark to distribute the load evenly.

Anyway, this hadoop jar h2odriver.jar solution is not an ideal workaround for this issue.

Then I happened to read this blog: Incorrect results with corrupt binary statistics with the new Parquet reader. The article has nothing to do with my issue, but it mentioned parquet v1.8. I remembered seeing a note from a Sparkling Water developer discussing integrating with parquet v1.8 in the future for a certain parquet issue in H2O. Unfortunately I could not find the link to that discussion any more. But it inspired me to think that maybe the issue is that Sparkling Water depends on a certain parquet library that the current environment does not have. The standard CDH distribution and Spark2 seem to use parquet v1.5. Oracle BDA has much more software installed and maybe happened to have the correct library somewhere. The H2O-related jar file may contain this library, so what happens if I include the H2O jar somewhere in Sparkling Water?

With this idea in mind, I downloaded H2O from http://h2o-release.s3.amazonaws.com/h2o/rel-wheeler/4/index.html. After unzipping the file, h2o.jar is the one I need. I then modified sparkling-shell, changing the last line as follows to add the h2o.jar file to the --jars parameter.

#spark-shell --jars "$FAT_JAR_FILE" --driver-memory "$DRIVER_MEMORY" --conf spark.driver.extraJavaOptions="$EXTRA_DRIVER_PROPS" "$@"
H2O_JAR_FILE="/home/mytestuser/install/h2o-3.16.0.4/h2o.jar"
spark-shell --jars "$FAT_JAR_FILE,$H2O_JAR_FILE" --driver-memory "$DRIVER_MEMORY" --conf spark.driver.extraJavaOptions="$EXTRA_DRIVER_PROPS" "$@"

I restarted my H2O cluster and it worked!

Finally, after many days of work, Sparkling Water works again in the new cluster. Reloading the big test dataset took less than 1 minute with only 24 H2O nodes, and the load was evenly distributed across the cluster. Problem solved!

Use Python for H2O

I have written several blogs related to H2O:
H2O vs Sparkling Water
Sparking Water Shell: Cloud size under 12 Exception
Access Sparkling Water via R Studio
Running H2O Cluster in Background and at Specific Port Number
Weird Ref-count mismatch Message from H2O

While both the H2O Flow UI and R Studio can access an H2O cluster started by the Sparkling Water shell, both lack the powerful data manipulation functionality that Python has. In this blog, I am going to discuss how to use Python for H2O.

There are two main topics related to python for H2O: the first is starting an H2O cluster, and the second is accessing an existing H2O cluster using python.

1. Start H2O Cluster using pysparkling
I discussed starting an H2O cluster using sparkling-shell in the blog Sparking Water Shell: Cloud size under 12 Exception. For python, we need to use pysparkling.

/opt/sparkling-water-2.2.2/bin/pysparkling \
--master yarn \
--conf spark.ext.h2o.cloud.name=WeidongH2O-Cluster \
--conf spark.ext.h2o.client.port.base=26000 \
--conf spark.executor.instances=6 \
--conf spark.executor.memory=12g \
--conf spark.driver.memory=8g \
--conf spark.yarn.executor.memoryOverhead=4g \
--conf spark.yarn.driver.memoryOverhead=4g \
--conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
--conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--conf spark.locality.wait=30000 \
--conf spark.yarn.queue=HighPool \
--conf spark.scheduler.minRegisteredResourcesRatio=1

Then run the following python commands.

from pysparkling import *
from pyspark import SparkContext
from pyspark.sql import SQLContext
import h2o
hc = H2OContext.getOrCreate(sc)

The following is the sample output.

[wzhou@enkbda1node05 sparkling-water-2.2.2]$ /opt/sparkling-water-2.2.2/bin/pysparkling \
> --master yarn \
> --conf spark.ext.h2o.cloud.name=WeidongH2O-Cluster \
> --conf spark.ext.h2o.client.port.base=26000 \
> --conf spark.executor.instances=6 \
> --conf spark.executor.memory=12g \
> --conf spark.driver.memory=8g \
> --conf spark.yarn.executor.memoryOverhead=4g \
> --conf spark.yarn.driver.memoryOverhead=4g \
> --conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
> --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
> --conf spark.dynamicAllocation.enabled=false \
> --conf spark.sql.autoBroadcastJoinThreshold=-1 \
> --conf spark.locality.wait=30000 \
> --conf spark.yarn.queue=HighPool \
> --conf spark.scheduler.minRegisteredResourcesRatio=1
Python 2.7.13 |Anaconda 4.4.0 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/anaconda2/lib/python2.7/site-packages/pyspark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/12/10 15:00:20 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/12/10 15:00:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/12/10 15:00:21 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
17/12/10 15:00:24 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkSession available as 'spark'.
>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>> hc = H2OContext.getOrCreate(sc)
/opt/sparkling-water-2.2.2/py/build/dist/h2o_pysparkling_2.2-2.2.2.zip/pysparkling/context.py:111: UserWarning: Method H2OContext.getOrCreate with argument of type SparkContext is deprecated and parameter of type SparkSession is preferred.
17/12/10 15:01:45 WARN h2o.H2OContext: Method H2OContext.getOrCreate with an argument of type SparkContext is deprecated and parameter of type SparkSession is preferred.
Connecting to H2O server at http://192.168.10.41:26000... successful.
--------------------------  -------------------------------
H2O cluster uptime:         12 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    63.35 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         accepting new members, healthy
H2O connection url:         http://192.168.10.41:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  -------------------------------

Sparkling Water Context:
 * H2O name: WeidongH2O-Cluster
 * cluster size: 6
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (3,enkbda1node12.enkitec.com,26000)
  (1,enkbda1node13.enkitec.com,26000)
  (6,enkbda1node10.enkitec.com,26000)
  (5,enkbda1node09.enkitec.com,26000)
  (4,enkbda1node08.enkitec.com,26000)
  (2,enkbda1node11.enkitec.com,26000)
  ------------------------

  Open H2O Flow in browser: http://192.168.10.41:26000 (CMD + click in Mac OSX)


>>>
>>> h2o
<module 'h2o' from '/opt/sparkling-water-2.2.2/py/build/dist/h2o_pysparkling_2.2-2.2.2.zip/h2o/__init__.py'>
>>> h2o.cluster_status()
[WARNING] in <stdin> line 1:
    >>> ????
        ^^^^ Deprecated, use ``h2o.cluster().show_status(True)``.
--------------------------  -------------------------------
H2O cluster uptime:         14 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    63.35 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         locked, healthy
H2O connection url:         http://192.168.10.41:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  -------------------------------
Nodes info:     Node 1                                         Node 2                                         Node 3                                         Node 4                                         Node 5                                         Node 6
--------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------
h2o             enkbda1node08.enkitec.com/192.168.10.44:26000  enkbda1node09.enkitec.com/192.168.10.45:26000  enkbda1node10.enkitec.com/192.168.10.46:26000  enkbda1node11.enkitec.com/192.168.10.47:26000  enkbda1node12.enkitec.com/192.168.10.48:26000  enkbda1node13.enkitec.com/192.168.10.49:26000
healthy         True                                           True                                           True                                           True                                           True                                           True
last_ping       1513108921427                                  1513108920821                                  1513108920602                                  1513108920876                                  1513108920718                                  1513108921149
num_cpus        32                                             32                                             32                                             32                                             32                                             32
sys_load        1.24                                           1.38                                           0.42                                           1.08                                           0.75                                           0.83
mem_value_size  0                                              0                                              0                                              0                                              0                                              0
free_mem        11339923456                                    11337239552                                    11339133952                                    11332368384                                    11331912704                                    11338491904
pojo_mem        113671168                                      116355072                                      114460672                                      121226240                                      121681920                                      115102720
swap_mem        0                                              0                                              0                                              0                                              0                                              0
free_disk       422764871680                                   389836439552                                   422211223552                                   418693251072                                   422237437952                                   422187106304
max_disk        491885953024                                   491885953024                                   491885953024                                   491885953024                                   491885953024                                   491885953024
pid             1172                                           801                                            15879                                          17866                                          28980                                          30818
num_keys        0                                              0                                              0                                              0                                              0                                              0
tcps_active     0                                              0                                              0                                              0                                              0                                              0
open_fds        441                                            441                                            441                                            441                                            441                                            441
rpcs_active     0                                              0                                              0                                              0                                              0                                              0

2. Accessing an Existing H2O Cluster from an Edge Node
Accessing an existing H2O cluster from an edge node also requires the pysparkling command, but you don't have to specify a lot of parameters as in the previous step to start an H2O cluster. Running it without any parameters is fine for client purposes.

First, start pysparkling by running the /opt/sparkling-water-2.2.2/bin/pysparkling command.

Once at the python prompt, run the following to connect to an existing cluster. The key is to run h2o.init(ip="enkbda1node05.enkitec.com", port=26000), which connects to the existing H2O cluster.

from pysparkling import *
from pyspark import SparkContext
from pyspark.sql import SQLContext
import h2o

h2o.init(ip="enkbda1node05.enkitec.com", port=26000)
h2o.cluster_status()

Run some testing code there.

print("Importing hdfs data")
stock_data = h2o.import_file("hdfs://ENKBDA1-ns/user/wzhou/work/test/stock_price.txt")
stock_data

print("Spliting data")
train,test = stock_data.split_frame(ratios=[0.9])

h2o.ls()

The following is the sample output.

[wzhou@ham-lnx-vs-0086 ~]$ /opt/sparkling-water-2.2.2/bin/pysparkling
Python 2.7.13 |Anaconda 4.4.0 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/

Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkSession available as 'spark'.
>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>>
>>> h2o.init(ip="enkbda1node05.enkitec.com", port=26000)
Warning: connecting to remote server but falling back to local... Did you mean to use `h2o.connect()`?
Checking whether there is an H2O instance running at http://enkbda1node05.enkitec.com:26000. connected.
--------------------------  ---------------------------------------
H2O cluster uptime:         5 mins 43 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    58.10 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         locked, healthy
H2O connection url:         http://enkbda1node05.enkitec.com:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  ---------------------------------------
>>> h2o.cluster_status()
[WARNING] in <stdin> line 1:
    >>> ????
        ^^^^ Deprecated, use ``h2o.cluster().show_status(True)``.
--------------------------  ---------------------------------------
H2O cluster uptime:         5 mins 43 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    58.10 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         locked, healthy
H2O connection url:         http://enkbda1node05.enkitec.com:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  ---------------------------------------
Nodes info:     Node 1                                         Node 2                                         Node 3                                         Node 4                                         Node 5                                         Node 6
--------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------
h2o             enkbda1node08.enkitec.com/192.168.10.44:26000  enkbda1node09.enkitec.com/192.168.10.45:26000  enkbda1node10.enkitec.com/192.168.10.46:26000  enkbda1node11.enkitec.com/192.168.10.47:26000  enkbda1node12.enkitec.com/192.168.10.48:26000  enkbda1node13.enkitec.com/192.168.10.49:26000
healthy         True                                           True                                           True                                           True                                           True                                           True
last_ping       1513109250525                                  1513109250725                                  1513109250400                                  1513109250218                                  1513109250709                                  1513109250536
num_cpus        32                                             32                                             32                                             32                                             32                                             32
sys_load        1.55                                           1.94                                           2.4                                            1.73                                           0.4                                            1.51
mem_value_size  0                                              0                                              0                                              0                                              0                                              0
free_mem        11339923456                                    10122031104                                    10076566528                                    10283636736                                    10280018944                                    10278896640
pojo_mem        113671168                                      1331563520                                     1377028096                                     1169957888                                     1173575680                                     1174697984
swap_mem        0                                              0                                              0                                              0                                              0                                              0
free_disk       422754385920                                   389839585280                                   422231146496                                   418692202496                                   422226952192                                   422176620544
max_disk        491885953024                                   491885953024                                   491885953024                                   491885953024                                   491885953024                                   491885953024
pid             1172                                           801                                            15879                                          17866                                          28980                                          30818
num_keys        0                                              0                                              0                                              0                                              0                                              0
tcps_active     0                                              0                                              0                                              0                                              0                                              0
open_fds        440                                            440                                            440                                            440                                            440                                            440
rpcs_active     0                                              0                                              0                                              0                                              0                                              0
>>> print("Importing hdfs data")
Importing hdfs data
>>> stock_data = h2o.import_file("hdfs://ENKBDA1-ns/user/wzhou/work/test/stock_price.txt")
Parse progress: | 100%

>>>
>>> stock_data
date                   close    volume    open    high     low
-------------------  -------  --------  ------  ------  ------
2016-09-23 00:00:00    24.05     56837   24.13  24.22   23.88
2016-09-22 00:00:00    24.1      56675   23.49  24.18   23.49
2016-09-21 00:00:00    23.38     70925   23.21  23.58   23.025
2016-09-20 00:00:00    23.07     35429   23.17  23.264  22.98
2016-09-19 00:00:00    23.12     34257   23.22  23.27   22.96
2016-09-16 00:00:00    23.16     83309   22.96  23.21   22.96
2016-09-15 00:00:00    23.01     43258   22.7   23.25   22.53
2016-09-14 00:00:00    22.69     33891   22.81  22.88   22.66
2016-09-13 00:00:00    22.81     59871   22.75  22.89   22.53
2016-09-12 00:00:00    22.85    109145   22.9   22.95   22.74

[65 rows x 6 columns]

>>> print("Spliting data")
Spliting data
>>> train,test = stock_data.split_frame(ratios=[0.9])
>>> h2o.ls()
                      key
0  py_1_sid_83d3_splitter
1         stock_price.hex
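
Before heading over to R Studio, the split frames can feed straight into an H2O model from the same python prompt. This is just a minimal sketch of my own (not part of the original session); it uses the column names from the stock data shown above:

from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Fit a small GBM that predicts close from the other numeric columns,
# then score it against the held-out split.
gbm = H2OGradientBoostingEstimator(ntrees=50, max_depth=5)
gbm.train(x=["open", "high", "low", "volume"], y="close",
          training_frame=train, validation_frame=test)
print(gbm.model_performance(test_data=test))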

Let’s go to R Studio. We should see this H2O frame.

Let's check it out from the H2O Flow UI.

Cool, it’s there as well and we’re good here.

Weird Ref-count mismatch Message from H2O

H2O is a nice, fast tool for data science work. I have discussed this topic in the following blogs:
H2O vs Sparkling Water
Sparking Water Shell: Cloud size under 12 Exception
Access Sparkling Water via R Studio
Running H2O Cluster in Background and at Specific Port Number
Recently we ran into a weird issue when using H2O from R Studio. For unknown reasons, it threw the following error messages:

score <- as.data.frame(main[,c('id', 'my_test_score')])

ERROR: Unexpected HTTP Status code: 500 Server Error (url = http://enkbda1node05.enkitec.com:26000/99/Rapids)

java.lang.IllegalStateException
 [1] "java.lang.IllegalStateException: Ref-count mismatch for vec $04ffc52a0000ffffffff$hdfs://ENKBDA1-ns/user/wzhou/test/data/test1/part-00000-8516f9f8-3d5f-164b-cf7e-18ca7e8d467f-d000.snappy.parquet: REFCNT = 2, should be 1"
 [2] "    water.rapids.Session.sanity_check_refs(Session.java:341)"                                                                                                                                                                                                     
 [3] "    water.rapids.Session.exec(Session.java:83)"                                                                                                                                                                                                                   
 [4] "    water.rapids.Rapids.exec(Rapids.java:93)"                                                                                                                                                                                                                     
 [5] "    water.api.RapidsHandler.exec(RapidsHandler.java:38)"                                                                                                                                                                                                          
 [6] "    sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)"                                                                                                                                                                                                 
 [7] "    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)"                                                                                                                                                                        
 [8] "    java.lang.reflect.Method.invoke(Method.java:498)"                                                                                                                                                                                                             
 [9] "    water.api.Handler.handle(Handler.java:63)"                                                                                                                                                                                                                    
[10] "    water.api.RequestServer.serve(RequestServer.java:448)"                                                                                                                                                                                                        
[11] "    water.api.RequestServer.doGeneric(RequestServer.java:297)"                                                                                                                                                                                                    
[12] "    water.api.RequestServer.doPost(RequestServer.java:223)"                                                                                                                                                                                                       
[13] "    javax.servlet.http.HttpServlet.service(HttpServlet.java:707)"                                                                                                                                                                                                 
[14] "    javax.servlet.http.HttpServlet.service(HttpServlet.java:790)"                                                                                                                                                                                                 
[15] "    ai.h2o.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)"                                                                                                                                                                                
[16] "    ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)"                                                                                                                                                                            
[17] "    ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)"                                                                                                                                                                    
[18] "    ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)"                                                                                                                                                                             
[19] "    ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)"                                                                                                                                                                     
[20] "    ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)"                                                                                                                                                                         
[21] "    ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"                                                                                                                                                                 
[22] "    ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"                                                                                                                                                                       
[23] "    water.JettyHTTPD$LoginHandler.handle(JettyHTTPD.java:189)"                                                                                                                                                                                                    
[24] "    ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"                                                                                                                                                                 
[25] "    ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"                                                                                                                                                                       
[26] "    ai.h2o.org.eclipse.jetty.server.Server.handle(Server.java:370)"                                                                                                                                                                                               
[27] "    ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)"                                                                                                                                                        
[28] "    ai.h2o.org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)"                                                                                                                                                         
[29] "    ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)"                                                                                                                                                              
[30] "    ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)"                                                                                                                                              
[31] "    ai.h2o.org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)"                                                                                                                                                                                      
[32] "    ai.h2o.org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)"                                                                                                                                                                                 
[33] "    ai.h2o.org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)"                                                                                                                                                                
[34] "    ai.h2o.org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)"                                                                                                                                                          
[35] "    ai.h2o.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)"                                                                                                                                                                      
[36] "    ai.h2o.org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)"                                                                                                                                                                       
[37] "    java.lang.Thread.run(Thread.java:745)"                                                                                                                                                                                                                        

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page,  : 
  
ERROR MESSAGE:

Ref-count mismatch for vec $04ffc52a0000ffffffff$hdfs://ENKBDA1-ns/user/wzhou/test/data/test1/part-00000-8516f9f8-3d5f-164b-cf7e-18ca7e8d467f-d000.snappy.parquet: REFCNT = 2, should be 1

I tried different things and still got the same error; I had to bounce the H2O cluster to get rid of it. Further investigation then found the steps that reproduce the issue: if we delete some or all H2O frames from the H2O Flow UI, we can run into it.

Basically, just run the getFrames command, select the frames to be deleted, and click Delete selected frames at the bottom. Run getFrames again and it shows those frames deleted. But if I go to R Studio and run h2o.ls(), it shows the exact Ref-count mismatch error. It seems the frames were deleted from the H2O UI's perspective, but not from R Studio's perspective.

Ok, we found the cause. How do we resolve it? Check out the H2O source code at https://github.com/h2oai/h2o-3/blob/master/h2o-core/src/main/java/water/rapids/Session.java.

We can see that this error is thrown from the sanity_check_refs function. The function also has the following interesting description:

Check that ref counts are in a consistent state.
* This should only be called between calls to Rapids expressions (otherwise may blow false-positives).

It seems something is not handled well in another part of the code: if the object references are inconsistent, some other part of the code might fail. It looks like a bug to me.

After some investigation, I found that running h2o.removeAll() from R Studio fixes this issue.
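
For anyone driving H2O from the python client instead of R Studio, the equivalent cleanup looks like the sketch below (it assumes the standard h2o python module used in the earlier posts):

import h2o

# Attach to the running cluster, then drop all keys so the client's and
# server's views of the frames are consistent again.
h2o.init(ip="enkbda1node05.enkitec.com", port=26000)
h2o.remove_all()
h2o.ls()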