I have written several blog posts on H2O topics:
H2O vs Sparkling Water
Sparking Water Shell: Cloud size under 12 Exception
Access Sparkling Water via R Studio
Running H2O Cluster in Background and at Specific Port Number
Weird Ref-count mismatch Message from H2O
While both the H2O Flow UI and R Studio can access an H2O cluster started by the Sparkling Water shell, neither offers data-manipulation capabilities as powerful as Python's. In this post, I discuss how to use Python with H2O.
There are two main topics when using Python with H2O: starting an H2O cluster, and connecting to an existing H2O cluster.
1. Start H2O Cluster using pysparkling
I discussed starting an H2O cluster with sparkling-shell in the post Sparking Water Shell: Cloud size under 12 Exception. For Python, use pysparkling instead.
/opt/sparkling-water-2.2.2/bin/pysparkling \
  --master yarn \
  --conf spark.ext.h2o.cloud.name=WeidongH2O-Cluster \
  --conf spark.ext.h2o.client.port.base=26000 \
  --conf spark.executor.instances=6 \
  --conf spark.executor.memory=12g \
  --conf spark.driver.memory=8g \
  --conf spark.yarn.executor.memoryOverhead=4g \
  --conf spark.yarn.driver.memoryOverhead=4g \
  --conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
  --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.locality.wait=30000 \
  --conf spark.yarn.queue=HighPool \
  --conf spark.scheduler.minRegisteredResourcesRatio=1
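Typing this many flags by hand is error-prone. As a side note, the same launch command could be assembled programmatically; the sketch below is illustrative only (the `confs` dict and `build_command` helper are my own names, not part of Sparkling Water), and it only covers a subset of the flags above.

```python
# Sketch: assemble the pysparkling launch command from a dict of Spark confs.
# The values mirror the command shown above; adjust for your own cluster.
confs = {
    "spark.ext.h2o.cloud.name": "WeidongH2O-Cluster",
    "spark.ext.h2o.client.port.base": "26000",
    "spark.executor.instances": "6",
    "spark.executor.memory": "12g",
    "spark.driver.memory": "8g",
    "spark.dynamicAllocation.enabled": "false",
    "spark.yarn.queue": "HighPool",
}

def build_command(binary, master, confs):
    # Render "--conf key=value" pairs in a stable (sorted) order.
    parts = [binary, "--master", master]
    for key in sorted(confs):
        parts += ["--conf", "%s=%s" % (key, confs[key])]
    return " ".join(parts)

cmd = build_command("/opt/sparkling-water-2.2.2/bin/pysparkling", "yarn", confs)
print(cmd)
```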
Then run the following Python commands.
from pysparkling import *
from pyspark import SparkContext
from pyspark.sql import SQLContext
import h2o
hc = H2OContext.getOrCreate(sc)
The following is the sample output.
[wzhou@enkbda1node05 sparkling-water-2.2.2]$ /opt/sparkling-water-2.2.2/bin/pysparkling \
> --master yarn \
> --conf spark.ext.h2o.cloud.name=WeidongH2O-Cluster \
> --conf spark.ext.h2o.client.port.base=26000 \
> --conf spark.executor.instances=6 \
> --conf spark.executor.memory=12g \
> --conf spark.driver.memory=8g \
> --conf spark.yarn.executor.memoryOverhead=4g \
> --conf spark.yarn.driver.memoryOverhead=4g \
> --conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
> --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
> --conf spark.dynamicAllocation.enabled=false \
> --conf spark.sql.autoBroadcastJoinThreshold=-1 \
> --conf spark.locality.wait=30000 \
> --conf spark.yarn.queue=HighPool \
> --conf spark.scheduler.minRegisteredResourcesRatio=1
Python 2.7.13 |Anaconda 4.4.0 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/anaconda2/lib/python2.7/site-packages/pyspark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/12/10 15:00:20 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/12/10 15:00:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/12/10 15:00:21 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
17/12/10 15:00:24 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkSession available as 'spark'.
>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>> hc = H2OContext.getOrCreate(sc)
/opt/sparkling-water-2.2.2/py/build/dist/h2o_pysparkling_2.2-2.2.2.zip/pysparkling/context.py:111: UserWarning: Method H2OContext.getOrCreate with argument of type SparkContext is deprecated and parameter of type SparkSession is preferred.
17/12/10 15:01:45 WARN h2o.H2OContext: Method H2OContext.getOrCreate with an argument of type SparkContext is deprecated and parameter of type SparkSession is preferred.
Connecting to H2O server at http://192.168.10.41:26000... successful.
--------------------------  -------------------------------
H2O cluster uptime:         12 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    63.35 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         accepting new members, healthy
H2O connection url:         http://192.168.10.41:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  -------------------------------

Sparkling Water Context:
 * H2O name: WeidongH2O-Cluster
 * cluster size: 6
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (3,enkbda1node12.enkitec.com,26000)
  (1,enkbda1node13.enkitec.com,26000)
  (6,enkbda1node10.enkitec.com,26000)
  (5,enkbda1node09.enkitec.com,26000)
  (4,enkbda1node08.enkitec.com,26000)
  (2,enkbda1node11.enkitec.com,26000)
  ------------------------

  Open H2O Flow in browser: http://192.168.10.41:26000 (CMD + click in Mac OSX)

>>> h2o
<module 'h2o' from '/opt/sparkling-water-2.2.2/py/build/dist/h2o_pysparkling_2.2-2.2.2.zip/h2o/__init__.py'>
>>> h2o.cluster_status()
[WARNING] in <stdin> line 1:
    >>> ????
        ^^^^ Deprecated, use ``h2o.cluster().show_status(True)``.
--------------------------  -------------------------------
H2O cluster uptime:         14 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    63.35 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         locked, healthy
H2O connection url:         http://192.168.10.41:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  -------------------------------
Nodes info:     Node 1  Node 2  Node 3  Node 4  Node 5  Node 6
h2o             enkbda1node08.enkitec.com/192.168.10.44:26000  enkbda1node09.enkitec.com/192.168.10.45:26000  enkbda1node10.enkitec.com/192.168.10.46:26000  enkbda1node11.enkitec.com/192.168.10.47:26000  enkbda1node12.enkitec.com/192.168.10.48:26000  enkbda1node13.enkitec.com/192.168.10.49:26000
healthy         True  True  True  True  True  True
last_ping       1513108921427  1513108920821  1513108920602  1513108920876  1513108920718  1513108921149
num_cpus        32  32  32  32  32  32
sys_load        1.24  1.38  0.42  1.08  0.75  0.83
mem_value_size  0  0  0  0  0  0
free_mem        11339923456  11337239552  11339133952  11332368384  11331912704  11338491904
pojo_mem        113671168  116355072  114460672  121226240  121681920  115102720
swap_mem        0  0  0  0  0  0
free_disk       422764871680  389836439552  422211223552  418693251072  422237437952  422187106304
max_disk        491885953024  491885953024  491885953024  491885953024  491885953024  491885953024
pid             1172  801  15879  17866  28980  30818
num_keys        0  0  0  0  0  0
tcps_active     0  0  0  0  0  0
open_fds        441  441  441  441  441  441
rpcs_active     0  0  0  0  0  0
2. Accessing an Existing H2O Cluster from an Edge Node
Accessing an existing H2O cluster from an edge node also uses the pysparkling command, but you don't need to specify all the parameters from the previous step; running it without any parameters is fine for client use.
First, start pysparkling by running the /opt/sparkling-water-2.2.2/bin/pysparkling command.
Once at the Python prompt, run the following to connect to an existing cluster. The key is h2o.init(ip="enkbda1node05.enkitec.com", port=26000), which connects to the existing H2O cluster.
from pysparkling import *
from pyspark import SparkContext
from pyspark.sql import SQLContext
import h2o
h2o.init(ip="enkbda1node05.enkitec.com", port=26000)
h2o.cluster_status()
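As the sample output further down shows, h2o.init() against a remote host warns "Did you mean to use `h2o.connect()`?", and h2o.cluster_status() is deprecated in this version. A sketch of the newer equivalents, assuming the same host and port (the cluster_url helper is my own, not part of h2o):

```python
# Sketch: non-deprecated equivalents, assuming the cluster from this post.
H2O_HOST = "enkbda1node05.enkitec.com"
H2O_PORT = 26000

def cluster_url(host, port):
    # h2o.connect() also accepts a full URL via its url= parameter.
    return "http://%s:%d" % (host, port)

url = cluster_url(H2O_HOST, H2O_PORT)
print(url)

# These calls need a live cluster, so they are commented out here:
# import h2o
# h2o.connect(url=url)              # attach to the running cluster
# h2o.cluster().show_status(True)   # replaces deprecated h2o.cluster_status()
```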
Run some testing code there.
print("Importing hdfs data")
stock_data = h2o.import_file("hdfs://ENKBDA1-ns/user/wzhou/work/test/stock_price.txt")
stock_data
print("Splitting data")
train,test = stock_data.split_frame(ratios=[0.9])
h2o.ls()
The following is the sample output.
[wzhou@ham-lnx-vs-0086 ~]$ /opt/sparkling-water-2.2.2/bin/pysparkling
Python 2.7.13 |Anaconda 4.4.0 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/

Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkSession available as 'spark'.
>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>>
>>> h2o.init(ip="enkbda1node05.enkitec.com", port=26000)
Warning: connecting to remote server but falling back to local... Did you mean to use `h2o.connect()`?
Checking whether there is an H2O instance running at http://enkbda1node05.enkitec.com:26000. connected.
--------------------------  ---------------------------------------
H2O cluster uptime:         5 mins 43 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    58.10 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         locked, healthy
H2O connection url:         http://enkbda1node05.enkitec.com:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  ---------------------------------------
>>> h2o.cluster_status()
[WARNING] in <stdin> line 1:
    >>> ????
        ^^^^ Deprecated, use ``h2o.cluster().show_status(True)``.
--------------------------  ---------------------------------------
H2O cluster uptime:         5 mins 43 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    58.10 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         locked, healthy
H2O connection url:         http://enkbda1node05.enkitec.com:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  ---------------------------------------
Nodes info:     Node 1  Node 2  Node 3  Node 4  Node 5  Node 6
h2o             enkbda1node08.enkitec.com/192.168.10.44:26000  enkbda1node09.enkitec.com/192.168.10.45:26000  enkbda1node10.enkitec.com/192.168.10.46:26000  enkbda1node11.enkitec.com/192.168.10.47:26000  enkbda1node12.enkitec.com/192.168.10.48:26000  enkbda1node13.enkitec.com/192.168.10.49:26000
healthy         True  True  True  True  True  True
last_ping       1513109250525  1513109250725  1513109250400  1513109250218  1513109250709  1513109250536
num_cpus        32  32  32  32  32  32
sys_load        1.55  1.94  2.4  1.73  0.4  1.51
mem_value_size  0  0  0  0  0  0
free_mem        11339923456  10122031104  10076566528  10283636736  10280018944  10278896640
pojo_mem        113671168  1331563520  1377028096  1169957888  1173575680  1174697984
swap_mem        0  0  0  0  0  0
free_disk       422754385920  389839585280  422231146496  418692202496  422226952192  422176620544
max_disk        491885953024  491885953024  491885953024  491885953024  491885953024  491885953024
pid             1172  801  15879  17866  28980  30818
num_keys        0  0  0  0  0  0
tcps_active     0  0  0  0  0  0
open_fds        440  440  440  440  440  440
rpcs_active     0  0  0  0  0  0
>>> print("Importing hdfs data")
Importing hdfs data
>>> stock_data = h2o.import_file("hdfs://ENKBDA1-ns/user/wzhou/work/test/stock_price.txt")
Parse progress: | 100%
>>>
>>> stock_data
date                   close    volume    open    high     low
-------------------  -------  --------  ------  -------  ------
2016-09-23 00:00:00    24.05     56837   24.13   24.22    23.88
2016-09-22 00:00:00    24.1      56675   23.49   24.18    23.49
2016-09-21 00:00:00    23.38     70925   23.21   23.58    23.025
2016-09-20 00:00:00    23.07     35429   23.17   23.264   22.98
2016-09-19 00:00:00    23.12     34257   23.22   23.27    22.96
2016-09-16 00:00:00    23.16     83309   22.96   23.21    22.96
2016-09-15 00:00:00    23.01     43258   22.7    23.25    22.53
2016-09-14 00:00:00    22.69     33891   22.81   22.88    22.66
2016-09-13 00:00:00    22.81     59871   22.75   22.89    22.53
2016-09-12 00:00:00    22.85    109145   22.9    22.95    22.74

[65 rows x 6 columns]

>>> print("Splitting data")
Splitting data
>>> train,test = stock_data.split_frame(ratios=[0.9])
>>> h2o.ls()
                      key
0  py_1_sid_83d3_splitter
1         stock_price.hex
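One detail worth knowing about split_frame(ratios=[0.9]): to my understanding the split is probabilistic, so you get approximately 90/10 rather than an exact cut. A pure-Python sketch of that behaviour (no H2O needed; split_indices is an illustrative helper, not an h2o function):

```python
import random

def split_indices(n_rows, ratio, seed=42):
    # Each row independently lands in "train" with probability `ratio`,
    # mimicking the approximate behaviour of H2OFrame.split_frame.
    rng = random.Random(seed)
    train, test = [], []
    for i in range(n_rows):
        (train if rng.random() < ratio else test).append(i)
    return train, test

# 65 rows, as in the stock_price data above
train_idx, test_idx = split_indices(65, 0.9)
print(len(train_idx), len(test_idx))
```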
Let’s go to R Studio. We should see this H2O frame.
Let’s check it out from H2O Flow UI.
Cool, it’s there as well and we’re good here.