Use Jupyter Notebook to Access H2O Driverless AI

I discussed H2O Driverless AI installation in my last blog, Install H2O Driverless AI on Google Cloud Platform. H2O AI docker image contains the deployment of Jupyter Notebook. Once H2O AI starts, we can use Jupyter notebook directly. In this blog, I am going to discuss how to use Jupyter Notebook to connect to H2O AI.

To login Jupyter Notebook, I need to know the login token. It is usually shown in the console output at the ‎time starting Jupyter. However If I check out the Docker logs command, it shows the output from H2O AI.

root@h2otest:~# docker ps
CONTAINER ID        IMAGE                    COMMAND             CREATED             STATUS              PORTS                                                                                                NAMES
5b803337e8b5        opsh2oai/h2oai-runtime   "./run.sh"          About an hour ago   Up About an hour    0.0.0.0:8888->8888/tcp, 0.0.0.0:9090->9090/tcp, 0.0.0.0:12345->12345/tcp, 0.0.0.0:54321->54321/tcp   h2oai

root@h2otest:~# docker logs h2oai
---------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------
     version: 1.0.30

- Put data in the volume mounted at /data
- Logs are written to the volume mounted at /log/20180424-140930
- Connect to Driverless AI on port 12345 inside the container
- Connect to Jupyter notebook on port 8888 inside the container

But the output at least tells me the logfile location. SSH to the container and check out Jupyter log.

root@h2otest:~# ./ssh_h2oai.sh 
root@5b803337e8b5:/# cd /log/20180424-140930
root@5b803337e8b5:/log/20180424-140930# ls -l
total 84
-rw-r--r-- 1 root root 61190 Apr 24 14:53 h2oai.log
-rw-r--r-- 1 root root 14340 Apr 24 15:14 h2o.log
-rw-r--r-- 1 root root  2700 Apr 24 14:58 jupyter.log
-rw-r--r-- 1 root root    52 Apr 24 14:09 procsy.log
root@5b803337e8b5:/log/20180424-140930# cat jupyter.log
config:
    /jupyter/.jupyter
    /h2oai_env/etc/jupyter
    /usr/local/etc/jupyter
    /etc/jupyter
data:
    /jupyter/.local/share/jupyter
    /h2oai_env/share/jupyter
    /usr/local/share/jupyter
    /usr/share/jupyter
runtime:
    /jupyter/.local/share/jupyter/runtime
[I 14:10:01.512 NotebookApp] Writing notebook server cookie secret to /jupyter/.local/share/jupyter/runtime/notebook_cookie_secret
[W 14:10:04.062 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[I 14:10:04.224 NotebookApp] Serving notebooks from local directory: /jupyter
[I 14:10:04.224 NotebookApp] 0 active kernels
[I 14:10:04.224 NotebookApp] The Jupyter Notebook is running at:
[I 14:10:04.224 NotebookApp] http://[all ip addresses on your system]:8888/?token=f1b8f6dc7fb0aab7caec278a2bf971249b765140e4b3b338
[I 14:10:04.224 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 14:10:04.224 NotebookApp] 
    
    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=f1b8f6dc7fb0aab7caec278a2bf971249b765140e4b3b338
[W 14:19:26.189 NotebookApp] 401 POST /login?next=%2Ftree%3F (10.142.0.2) 834.30ms referer=http://10.142.0.2:8888/login?next=%2Ftree%3F
[I 14:20:15.706 NotebookApp] 302 POST /login?next=%2Ftree%3F (10.142.0.2) 1.36ms

Although this approach worked majority of time, I did run into issue for a few times that Jupyter login said the token is invalid. After some research, I found out another way that guarantees to get the correct token. It’s a json file under /jupyter/.local/share/jupyter/runtime directory. The filename nbserver-xx.json changes each time H2O AI starts.

root@5b803337e8b5:/# ls -l /jupyter/.local/share/jupyter/runtime
total 12
-rw-r--r-T 1 root root  263 Apr 24 14:24 kernel-b225302b-f2d9-47ac-b99c-f1f55eb54021.json
-rw-r--r-- 1 root root  245 Apr 24 14:10 nbserver-51.json
-rw------- 1 root root 1386 Apr 24 14:10 notebook_cookie_secret
root@5b803337e8b5:/# cat /jupyter/.local/share/jupyter/runtime/nbserver-51.json
{
  "base_url": "/",
  "hostname": "localhost",
  "notebook_dir": "/jupyter",
  "password": false,
  "pid": 51,
  "port": 8888,
  "secure": false,
  "token": "f1b8f6dc7fb0aab7caec278a2bf971249b765140e4b3b338",
  "url": "http://localhost:8888/"

Based on that, I created a script to get the token without ssh to the container.

root@h2otest:~# cat get_jy_token.sh 
#!/bin/bash

JSON_FILENAME=`docker exec -it h2oai ls -l /jupyter/.local/share/jupyter/runtime | grep nbserver |awk '{print $9}' | tr -d "\r"`
#echo $JSON_FILENAME
docker exec -it h2oai grep token /jupyter/.local/share/jupyter/runtime/$JSON_FILENAME

Run the script and got the token.

root@h2otest:~# ./get_jy_token.sh 
  "token": "f1b8f6dc7fb0aab7caec278a2bf971249b765140e4b3b338",

Ok, let me go to the login screen and input the token.

The Jupyter screen shows up.

There is two sample notebooks installed by default. I tried to make it working. However the sample data in docker image does not seem working. There is no detail API document available at this moment. So I just did a few basic stuff to prove it work. The following is the code I input in the notebook.

import h2oai_client
import numpy as np
import pandas as pd
# import h2o
import requests
import math
from h2oai_client import Client, ModelParameters, InterpretParameters

address = 'http://35.229.57.147:12345'
username = 'h2o'
password = 'h2o'
h2oai = Client(address = address, username = username, password = password)

stock_path = '/data/stock_price.csv'
stockData = h2oai.create_dataset_sync(stock_path)
stockData.dump()


I went back to H2O AI UI and found out three more stock_price dataset were created by my Jupyter notebook.

So each time I run the command h2oai.create_dataset_sync(stock_path), it creates a new dataset. The dataset with same path is not going to eliminated. To avoid duplication, I have to manually delete the duplicated one from UI. It’s not a big deal. Just need to remember to cleanup the duplicated dataset if run the same notebook multiple times. Another way to get around this issue is to use different login name. As different login name sees the datasets only belong to the current user, you could have a login name for production use and a different login name for development or testing. You can safely remove the duplicated dataset in the development username without worrying about removing the wrong one.

Advertisements

Install H2O Driverless AI on Google Cloud Platform

I wrote many blogs about H2O and H2O Sparkling Water in the past. Today I am going to discuss the installation of H2O Driverless AI (H2O AI). H2O AI targets machine learning, especially deep learning. While H2O focuses more on algorithm, models, and predication, H2O AI automates some of the most difficult data science and ML workflows to offer automatic visualizations and Machine Learning Interpretability (MLI). Here is the architecture of H2O AI.

There are some difference in different installation environment. To check out different environment, use H2O Driverless AI installation document at http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/installing.html.

This blog discusses the topic only related to Google Cloud. Here are a few important things to know before the installation.
1. It requires a lot of memory and CPUs, if possible use GPU. I uses 8 CPUs and 52 GB memory on Google cloud. If you can use GPU, add GPU option. For me, I don’t have the access to GPU in my account.
2. The OS is based on Ubuntu 16.04 and I believe it is the minimum version supported.
3. OS disk size should be >= 64GB. I used 64GB.
4. Instead of installation software package, H2O AI uses Docker image. Yes, Docker needs to be installed first.
5. If plan to use python to connect the H2O AI, the supported version of python is v3.6.

Ok, here is the installation procedure on GCP:
1. Create a new firewall rule
Click VPC Network -> Firewall Rules -> Create Firewall Rule
Input the following:
Name : h2oai
Description: The firewall rule for H2O driverless AI
Target tags: h2o
Source IP ranges: 0.0.0.0/0
Protocols and ports: tcp:12345,54321
Please note: H2O’s documentation misses the port 54321, which is used by H2O Flow UI. Needs to open this port. Otherwise you can not access H2O Flow UI.


2. Create a new VM instance
Name: h2otest
Zone: us-east1-c
Cores: 8 vCPU
Memory: 52 GB
Boot disk: 64 GB, Ubuntu 16.04
Service account: use your GCP service account
Network tags: h2o

3. Install and configure Docker
Logon to h2otest VM instance and su to root user.
Create a script, build.sh

apt-get -y update
apt-get -y --no-install-recommends install \
  curl \
  apt-utils \
  python-software-properties \
  software-properties-common

add-apt-repository -y "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -

apt-get update
apt-get install -y docker-ce

Run the script

root@h2otest:~# chmod u+x build.sh
root@h2otest:~# ./build.sh

Created required directories.

mkdir ~/tmp
mkdir ~/log
mkdir ~/data
mkdir ~/scripts
mkdir ~/license
mkdir ~/demo
mkdir -p ~/jupyter/notebooks

Adding current user to Docker container is optional. I did anyway.

root@h2otest:~# usermod -aG docker weidong.zhou
root@h2otest:~# id weidong.zhou
uid=1001(weidong.zhou) gid=1002(weidong.zhou) groups=1002(weidong.zhou),4(adm),20(dialout),24(cdrom),25(floppy),29(audio),30(dip),44(video),46(plugdev),109(netdev),110(lxd),1000(ubuntu),1001(google-sudoers),999(docker)

4. Download and Load H2O AI Docker Image
Download the docker image.

root@h2otest:~# wget https://s3-us-west-2.amazonaws.com/h2o-internal-release/docker/driverless-ai-docker-runtime-latest-release.gz
--2018-04-18 16:43:31--  https://s3-us-west-2.amazonaws.com/h2o-internal-release/docker/driverless-ai-docker-runtime-latest-release.gz
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.209.8
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.209.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2167098485 (2.0G) [application/gzip]
Saving to: ‘driverless-ai-docker-runtime-latest-release.gz’

driverless-ai-docker-runtime-latest-release.g 100%[==============================================================================================>]   2.02G  26.2MB/s    in 94s     

2018-04-18 16:45:05 (22.0 MB/s) - ‘driverless-ai-docker-runtime-latest-release.gz’ saved [2167098485/2167098485]

Load Docker image.

root@h2otest:~# docker load < driverless-ai-docker-runtime-latest-release.gz 9d3227c1793b: Loading layer [==================================================>]  121.3MB/121.3MB
a1a54d352248: Loading layer [==================================================>]  15.87kB/15.87kB
. . . .
ed86b627a562: Loading layer [==================================================>]  1.536kB/1.536kB
7d38d6d61cec: Loading layer [==================================================>]  1.536kB/1.536kB
de539994349c: Loading layer [==================================================>]  3.584kB/3.584kB
8e992954a9eb: Loading layer [==================================================>]  3.584kB/3.584kB
ff71b3e896ef: Loading layer [==================================================>]  8.192kB/8.192kB
Loaded image: opsh2oai/h2oai-runtime:latest
root@h2otest:~# docker image ls
REPOSITORY               TAG                 IMAGE ID            CREATED             SIZE
opsh2oai/h2oai-runtime   latest              dff251c69407        12 days ago         5.46GB

5. Start H2O AI
Create a startup script, start_h2oai.sh. Please note: H2O document has an error, missing port 54321 for H2O Flow UI. Then run the script

root@h2otest:~# cat start_h2oai.sh 
#!/bin/bash

docker run \
    --rm \
    -u `id -u`:`id -g` \
    -p 12345:12345 \
    -p 54321:54321 \
    -p 8888:8888 \
    -p 9090:9090 \
    -v `pwd`/data:/data \
    -v `pwd`/log:/log \
    -v `pwd`/license:/license \
    -v `pwd`/tmp:/tmp \
    opsh2oai/h2oai-runtime

root@h2otest:~# chmod a+x start_h2oai.sh
root@h2otest:~# ./start_h2oai.sh 
---------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------
     version: 1.0.30

- Put data in the volume mounted at /data
- Logs are written to the volume mounted at /log/20180419-094058
- Connect to Driverless AI on port 12345 inside the container
- Connect to Jupyter notebook on port 8888 inside the container

Also create a script, ssh_h2oai.sh to quickly ssh to the H2O AI container without knowing the container id first.

root@h2otest:~# vi ssh_h2oai.sh
root@h2otest:~# cat ssh_h2oai.sh 
#!/bin/bash

CONTAINER_ID=`docker ps|grep h2oai-runtime|awk '{print $1}'`
docker exec -it $CONTAINER_ID bash
root@h2otest:~# chmod a+x ssh_h2oai.sh 
root@h2otest:~# ./ssh_h2oai.sh
root@09bd138f4f41:/#

6. Use H2O AI
Get the external IP of the H2O VM. In my case, it is 35.196.90.114. Then access URL at http://35.196.90.114:12345/. You will see H2O AI evaluation agreement screen. Click I Agree to these Terms to continue.
The Logon screen shows up. I use the following information to sign in.
Username: h2o
Password: h2o
Actually it doesn’t matter what you input. You can use any username to login. It just didn’t check. I know it has the feature to integrate with LDAP. I just didn’t give a try this time.

After sign in, it will ask you to input license information. Fill out your information at https://www.h2o.ai/try-driverless-ai/ and you will receive a 21-day trail license in the email.

The first screen shows up is the Datasets overview. You can add dataset from one of three sources: File System, Hadoop File System, Amazon S3. To use some sample data, I chose Amazon S3‘s AirlinesTest.csv.zip file.



For every dataset, there are two kinds of Actions: Visualize or Predict
Click Visualize. Many interesting visualization charts show up.



If click Predict, Experiment screen shows up. Choose a Target Column. In my example, I chose ArrTime column. Click Launch Experiment

Once finished, it will show a list of options. For example, I clicked Interpret this model on original features


For people familiar with H2O Flow UI, H2O AI still has this UI, just click H2O-3 from the menu. The H2O Flow UI will show up.

In general, H2O AI has an impressive UI and tons of new stuff. No wonder it is not a free version. In the next blog, I am going to discuss how to configure python client to access H2O AI.