Parquet File Cannot Be Read in Sparkling Water H2O

Over the past few months, I have written several blog posts on H2O topics:
Use Python for H2O
H2O vs Sparkling Water
Sparking Water Shell: Cloud size under 12 Exception
Access Sparkling Water via R Studio
Running H2O Cluster in Background and at Specific Port Number
Weird Ref-count mismatch Message from H2O

Sparkling Water and H2O perform very well for data science projects. But when something works beautifully in one environment and fails in another, it can be the most maddening thing in the world. That is exactly what happened at one of my clients.

They had been using Sparkling Water and H2O in an Oracle BDA environment, where everything worked great. However, even a full-rack BDA (18 nodes) was not enough to run their biggest datasets on H2O. Recently they moved to a much larger CDH cluster (a non-BDA environment) running CDH 5.13. Sparkling Water still worked, but with one major issue: parquet files could not be read correctly. There was no problem reading the same parquet files from spark-shell or pyspark. This was a real headache, as parquet is one of the data formats the client uses heavily. After many failed attempts and much investigation, I finally figured out the issue and implemented a workaround. This post discusses the parquet reading issue in Sparkling Water and the workaround.

Create Test Data Set
I did the test in my CDH cluster (CDH 5.13). I first created a small test data set, stock.csv, and uploaded it to the /user/root directory on HDFS.

date,close,volume,open,high,low
9/23/16,24.05,56837,24.13,24.22,23.88
9/22/16,24.1,56675,23.49,24.18,23.49
9/21/16,23.38,70925,23.21,23.58,23.025
9/20/16,23.07,35429,23.17,23.264,22.98
9/19/16,23.12,34257,23.22,23.27,22.96
9/16/16,23.16,83309,22.96,23.21,22.96
9/15/16,23.01,43258,22.7,23.25,22.53
9/14/16,22.69,33891,22.81,22.88,22.66
9/13/16,22.81,59871,22.75,22.89,22.53
9/12/16,22.85,109145,22.9,22.95,22.74
9/9/16,23.03,115901,23.53,23.53,23.02
9/8/16,23.6,32717,23.8,23.83,23.55
9/7/16,23.85,143635,23.69,23.89,23.69
9/6/16,23.68,43577,23.78,23.79,23.43
9/2/16,23.84,31333,23.45,23.93,23.41
9/1/16,23.42,49547,23.45,23.48,23.26

Create a Parquet File
Run the following in spark2-shell to create a parquet file and verify that it can be read back.

scala> val myTest=spark.read.format("csv").load("/user/root/stock.csv")
myTest: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 4 more fields]
 
scala> myTest.show(3)
+-------+-----+------+-----+-----+-----+
|    _c0|  _c1|   _c2|  _c3|  _c4|  _c5|
+-------+-----+------+-----+-----+-----+
|   date|close|volume| open| high|  low|
|9/23/16|24.05| 56837|24.13|24.22|23.88|
|9/22/16| 24.1| 56675|23.49|24.18|23.49|
+-------+-----+------+-----+-----+-----+
only showing top 3 rows
 
scala> myTest.write.format("parquet").save("/user/root/mytest.parquet")
 
scala> val readTest = spark.read.format("parquet").load("/user/root/mytest.parquet")
readTest: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 4 more fields]
 
scala> readTest.show(3)
+-------+-----+------+-----+-----+-----+
|    _c0|  _c1|   _c2|  _c3|  _c4|  _c5|
+-------+-----+------+-----+-----+-----+
|   date|close|volume| open| high|  low|
|9/23/16|24.05| 56837|24.13|24.22|23.88|
|9/22/16| 24.1| 56675|23.49|24.18|23.49|
+-------+-----+------+-----+-----+-----+
only showing top 3 rows
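One side note on the test itself: the CSV is loaded without the header option, which is why the header row shows up as a data row and the columns are named _c0 through _c5. That does not matter for reproducing the H2O issue, but if proper column names are wanted, a small variation like the following (a sketch, not something this test needed) reads the header:

scala> // Treat the first line of the CSV as column names instead of data
scala> val myTest2 = spark.read.format("csv").option("header", "true").load("/user/root/stock.csv")
// myTest2 then has columns date, close, volume, open, high, low (all strings)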

Start a Sparkling Water H2O Cluster
I started a Sparkling Water cluster with 2 nodes.

[root@a84-master--2df67700-f9d1-46f3-afcf-ba27a523e143 sparkling-water-2.2.7]# . /etc/spark2/conf.cloudera.spark2_on_yarn/spark-env.sh
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2
[root@a84-master--2df67700-f9d1-46f3-afcf-ba27a523e143 sparkling-water-2.2.7]# bin/sparkling-shell \
> --master yarn \
> --conf spark.executor.instances=2 \
> --conf spark.executor.memory=1g \
> --conf spark.driver.memory=1g \
> --conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
> --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
> --conf spark.dynamicAllocation.enabled=false \
> --conf spark.sql.autoBroadcastJoinThreshold=-1 \
> --conf spark.locality.wait=30000 \
> --conf spark.scheduler.minRegisteredResourcesRatio=1

-----
  Spark master (MASTER)     : yarn
  Spark home   (SPARK_HOME) : /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2
  H2O build version         : 3.16.0.4 (wheeler)
  Spark build version       : 2.2.1
  Scala version             : 2.11
----

/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.128.0.5:4040
Spark context available as 'sc' (master = yarn, app id = application_1518097883047_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val h2oContext = H2OContext.getOrCreate(spark) 
h2oContext: org.apache.spark.h2o.H2OContext =                                   

Sparkling Water Context:
 * H2O name: sparkling-water-root_application_1518097883047_0001
 * cluster size: 2
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (1,a84-worker--6e693a0a-d92c-4172-81b7-f2f07b6d5d7c.c.cdh-director-194318.internal,54321)
  (2,a84-worker--c0ecccae-cead-44f2-9f75-39aadb1d024a.c.cdh-director-194318.internal,54321)
  ------------------------

  Open H2O Flow in browser: http://10.128.0.5:54321 (CMD + click in Mac OSX)

scala> import h2oContext._
import h2oContext._
scala> 

Read Parquet File from H2O Flow UI
Open the H2O Flow UI and import the same parquet file.

After clicking Parse these files, I got a corrupted file.

Obviously, the parquet file was not read correctly. At this point there were no error messages in the H2O console. If I continued to import the file anyway, the H2O Flow UI threw an error and the H2O console showed the following:

scala> 18/02/08 09:23:59 WARN servlet.ServletHandler: Error for /3/Parse
java.lang.NoClassDefFoundError: org/apache/parquet/hadoop/api/ReadSupport
	at water.parser.parquet.ParquetParser.correctTypeConversions(ParquetParser.java:104)
	at water.parser.parquet.ParquetParserProvider.createParserSetup(ParquetParserProvider.java:48)
	at water.parser.ParseSetup.getFinalSetup(ParseSetup.java:213)
	at water.parser.ParseDataset.forkParseDataset(ParseDataset.java:119)
	at water.parser.ParseDataset.parse(ParseDataset.java:43)
	at water.api.ParseHandler.parse(ParseHandler.java:36)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at water.api.Handler.handle(Handler.java:63)
	at water.api.RequestServer.serve(RequestServer.java:451)
	at water.api.RequestServer.doGeneric(RequestServer.java:296)
	at water.api.RequestServer.doPost(RequestServer.java:222)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
	at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
	at ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)
	at ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
	at ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at water.JettyHTTPD$LoginHandler.handle(JettyHTTPD.java:192)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at ai.h2o.org.eclipse.jetty.server.Server.handle(Server.java:370)
	at ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
	at ai.h2o.org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
	at ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
	at ai.h2o.org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
	at ai.h2o.org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
	at ai.h2o.org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at ai.h2o.org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at ai.h2o.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at ai.h2o.org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.parquet.hadoop.api.ReadSupport
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 39 more
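Incidentally, the failure is not specific to Flow. The same import can be attempted directly from the sparkling-shell session, since it goes through the same H2O parser. Here is a minimal sketch, assuming the water.fvec.H2OFrame URI constructor that Sparkling Water uses for HDFS imports:

scala> import water.fvec.H2OFrame
scala> // Ask H2O (not Spark) to import and parse the file; on this cluster it
scala> // should fail with the same NoClassDefFoundError the Flow UI hit.
scala> val hf = new H2OFrame(new java.net.URI("hdfs:///user/root/mytest.parquet"))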

Solution
Since the same Sparkling Water H2O deployment had no issue on BDA, which ran CDH 5.10, I initially focused on the CDH version difference. I built three CDH clusters using three different CDH versions: 5.13, 5.12 and 5.10. All of them showed exactly the same error. This ruled out the CDH version difference and shifted my focus to environment differences, especially the classpath and jar files. I tried setting JAVA_HOME, SPARK_HOME and SPARK_DIST_CLASSPATH; unfortunately, none of them worked.

I noticed that /etc/spark2/conf.cloudera.spark2_on_yarn/classpath.txt seemed to have far fewer entries than the classpath.txt under Spark 1.6. I tried adding back the missing entries. Still no luck.

I added two more parameters to get more information in the H2O logs:

--conf spark.ext.h2o.node.log.level=INFO \
--conf spark.ext.h2o.client.log.level=INFO \

This gave a little more useful information: the log complained that class ParquetFileWriter was not found.

$ cat h2o_10.54.225.9_54000-5-error.log
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR: DistributedException from /192.168.10.54:54005: 'org/apache/parquet/hadoop/ParquetFileWriter', caused by java.lang.NoClassDefFoundError: org/apache/parquet/hadoop/ParquetFileWriter
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.MRTask.getResult(MRTask.java:478)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.MRTask.getResult(MRTask.java:486)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.MRTask.doAll(MRTask.java:402)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.parser.ParseSetup.guessSetup(ParseSetup.java:283)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at water.api.ParseSetupHandler.guessSetup(ParseSetupHandler.java:40)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
01-17 04:55:47.406 10.54.225.9:54000     18567  #6115-112 ERRR:         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
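As an aside, a quick way to check whether a class is visible to the driver JVM is plain reflection from the Spark shell. This is only a hint, since the H2O nodes run inside executors with their own classpath, but it helps confirm which jar, if any, supplies the parquet classes (a diagnostic sketch):

scala> // Throws ClassNotFoundException if nothing on the driver classpath provides
scala> // the class; otherwise prints the jar the class was loaded from.
scala> val cls = Class.forName("org.apache.parquet.hadoop.ParquetFileWriter")
scala> println(cls.getProtectionDomain.getCodeSource.getLocation)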

My client found a temporary solution using h2odriver.jar, following the instructions from Using H2O on Hadoop. The command used is shown below:

cd /opt/h2o-3
hadoop jar h2odriver.jar -nodes 70 -mapperXmx 40g -output hdfs://PROD01ns/user/mytestuser/log24

Although this approach provides functionality similar to Sparkling Water, it has some critical performance issues:
1. The above command creates a 70-node H2O cluster. With Sparkling Water, the work would be distributed evenly across all available nodes, but the h2odriver.jar approach heavily used only a few Hadoop nodes. For a big dataset, the majority of the activity happened on just 3~4 nodes, driving their CPU utilization close to 100%. For one large test dataset, parsing never completed; it failed after running for 20 minutes.
2. Unlike Sparkling Water, it actually reads the files during the parsing phase, not during the importing phase.
3. The performance is quite poor compared with Sparkling Water. I suspect Sparkling Water uses the underlying Spark to distribute the load evenly.

In short, the hadoop jar h2odriver.jar approach is not an ideal workaround for this issue.

Then I happened to read the blog post Incorrect results with corrupt binary statistics with the new Parquet reader. The article has nothing to do with my issue, but it mentioned parquet v1.8. I remembered seeing a note from a Sparkling Water developer discussing integrating with parquet v1.8 in the future for a certain parquet issue in H2O; unfortunately, I can no longer find the link to that discussion. But it made me wonder whether the issue was that Sparkling Water depends on a certain parquet library that the current environment does not have. The standard CDH distribution and Spark2 seem to use parquet v1.5. Oracle BDA has much more software installed, and maybe it happened to have the correct library somewhere. Since the H2O jar file appeared to contain this library, what would happen if I included the H2O jar in Sparkling Water?

With this idea in mind, I downloaded H2O from http://h2o-release.s3.amazonaws.com/h2o/rel-wheeler/4/index.html. After unzipping the file, h2o.jar is the file I needed. I then modified sparkling-shell, changing the last line as follows to add the h2o.jar file to the --jars parameter.

#spark-shell --jars "$FAT_JAR_FILE" --driver-memory "$DRIVER_MEMORY" --conf spark.driver.extraJavaOptions="$EXTRA_DRIVER_PROPS" "$@"
H2O_JAR_FILE="/home/mytestuser/install/h2o-3.16.0.4/h2o.jar"
spark-shell --jars "$FAT_JAR_FILE,$H2O_JAR_FILE" --driver-memory "$DRIVER_MEMORY" --conf spark.driver.extraJavaOptions="$EXTRA_DRIVER_PROPS" "$@"

I restarted my H2O cluster. It worked!

Finally, after many days of work, Sparkling Water works again in the new cluster. Reloading the big test dataset took less than 1 minute with only 24 H2O nodes, and the load was evenly distributed across the cluster. Problem solved!
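One more note: even without patching the shell script, it should be possible to sidestep H2O's parquet parser entirely by letting Spark read the file and then publishing the DataFrame to H2O. A hedged sketch using the standard H2OContext conversion:

scala> // Spark, which reads parquet fine in this environment, does the parsing ...
scala> val df = spark.read.format("parquet").load("/user/root/mytest.parquet")
scala> // ... and the in-memory DataFrame is handed to H2O, so the H2O-side
scala> // parquet classes are never touched.
scala> val hf = h2oContext.asH2OFrame(df)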

Use Python for H2O

I have written several blog posts related to H2O:
H2O vs Sparkling Water
Sparking Water Shell: Cloud size under 12 Exception
Access Sparkling Water via R Studio
Running H2O Cluster in Background and at Specific Port Number
Weird Ref-count mismatch Message from H2O

While both the H2O Flow UI and R Studio can access an H2O cluster started by the Sparkling Water shell, neither has Python's power for manipulating data. In this blog post, I discuss how to use Python with H2O.

There are two main topics: starting an H2O cluster from Python, and accessing an existing H2O cluster from Python.

1. Start an H2O Cluster using pysparkling
I discussed starting an H2O cluster with sparkling-shell in the blog post Sparking Water Shell: Cloud size under 12 Exception. For Python, you need to use pysparkling.

/opt/sparkling-water-2.2.2/bin/pysparkling \
--master yarn \
--conf spark.ext.h2o.cloud.name=WeidongH2O-Cluster \
--conf spark.ext.h2o.client.port.base=26000 \
--conf spark.executor.instances=6 \
--conf spark.executor.memory=12g \
--conf spark.driver.memory=8g \
--conf spark.yarn.executor.memoryOverhead=4g \
--conf spark.yarn.driver.memoryOverhead=4g \
--conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
--conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--conf spark.locality.wait=30000 \
--conf spark.yarn.queue=HighPool \
--conf spark.scheduler.minRegisteredResourcesRatio=1

Then run the following Python commands.

from pysparkling import *
from pyspark import SparkContext
from pyspark.sql import SQLContext
import h2o
# Start (or join) the H2O cluster on top of the existing SparkContext
hc = H2OContext.getOrCreate(sc)

The following is the sample output.

[wzhou@enkbda1node05 sparkling-water-2.2.2]$ /opt/sparkling-water-2.2.2/bin/pysparkling \
> --master yarn \
> --conf spark.ext.h2o.cloud.name=WeidongH2O-Cluster \
> --conf spark.ext.h2o.client.port.base=26000 \
> --conf spark.executor.instances=6 \
> --conf spark.executor.memory=12g \
> --conf spark.driver.memory=8g \
> --conf spark.yarn.executor.memoryOverhead=4g \
> --conf spark.yarn.driver.memoryOverhead=4g \
> --conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
> --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
> --conf spark.dynamicAllocation.enabled=false \
> --conf spark.sql.autoBroadcastJoinThreshold=-1 \
> --conf spark.locality.wait=30000 \
> --conf spark.yarn.queue=HighPool \
> --conf spark.scheduler.minRegisteredResourcesRatio=1
Python 2.7.13 |Anaconda 4.4.0 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/anaconda2/lib/python2.7/site-packages/pyspark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/12/10 15:00:20 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/12/10 15:00:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/12/10 15:00:21 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
17/12/10 15:00:24 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkSession available as 'spark'.
>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>> hc = H2OContext.getOrCreate(sc)
/opt/sparkling-water-2.2.2/py/build/dist/h2o_pysparkling_2.2-2.2.2.zip/pysparkling/context.py:111: UserWarning: Method H2OContext.getOrCreate with argument of type SparkContext is deprecated and parameter of type SparkSession is preferred.
17/12/10 15:01:45 WARN h2o.H2OContext: Method H2OContext.getOrCreate with an argument of type SparkContext is deprecated and parameter of type SparkSession is preferred.
Connecting to H2O server at http://192.168.10.41:26000... successful.
--------------------------  -------------------------------
H2O cluster uptime:         12 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    63.35 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         accepting new members, healthy
H2O connection url:         http://192.168.10.41:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  -------------------------------

Sparkling Water Context:
 * H2O name: WeidongH2O-Cluster
 * cluster size: 6
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (3,enkbda1node12.enkitec.com,26000)
  (1,enkbda1node13.enkitec.com,26000)
  (6,enkbda1node10.enkitec.com,26000)
  (5,enkbda1node09.enkitec.com,26000)
  (4,enkbda1node08.enkitec.com,26000)
  (2,enkbda1node11.enkitec.com,26000)
  ------------------------

  Open H2O Flow in browser: http://192.168.10.41:26000 (CMD + click in Mac OSX)


>>>
>>> h2o
<module 'h2o' from '/opt/sparkling-water-2.2.2/py/build/dist/h2o_pysparkling_2.2-2.2.2.zip/h2o/__init__.py'>
>>> h2o.cluster_status()
[WARNING] in <stdin> line 1:
    >>> ????
        ^^^^ Deprecated, use ``h2o.cluster().show_status(True)``.
--------------------------  -------------------------------
H2O cluster uptime:         14 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    63.35 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         locked, healthy
H2O connection url:         http://192.168.10.41:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  -------------------------------
Nodes info:     Node 1                                         Node 2                                         Node 3                                         Node 4                                         Node 5                                         Node 6
--------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------
h2o             enkbda1node08.enkitec.com/192.168.10.44:26000  enkbda1node09.enkitec.com/192.168.10.45:26000  enkbda1node10.enkitec.com/192.168.10.46:26000  enkbda1node11.enkitec.com/192.168.10.47:26000  enkbda1node12.enkitec.com/192.168.10.48:26000  enkbda1node13.enkitec.com/192.168.10.49:26000
healthy         True                                           True                                           True                                           True                                           True                                           True
last_ping       1513108921427                                  1513108920821                                  1513108920602                                  1513108920876                                  1513108920718                                  1513108921149
num_cpus        32                                             32                                             32                                             32                                             32                                             32
sys_load        1.24                                           1.38                                           0.42                                           1.08                                           0.75                                           0.83
mem_value_size  0                                              0                                              0                                              0                                              0                                              0
free_mem        11339923456                                    11337239552                                    11339133952                                    11332368384                                    11331912704                                    11338491904
pojo_mem        113671168                                      116355072                                      114460672                                      121226240                                      121681920                                      115102720
swap_mem        0                                              0                                              0                                              0                                              0                                              0
free_disk       422764871680                                   389836439552                                   422211223552                                   418693251072                                   422237437952                                   422187106304
max_disk        491885953024                                   491885953024                                   491885953024                                   491885953024                                   491885953024                                   491885953024
pid             1172                                           801                                            15879                                          17866                                          28980                                          30818
num_keys        0                                              0                                              0                                              0                                              0                                              0
tcps_active     0                                              0                                              0                                              0                                              0                                              0
open_fds        441                                            441                                            441                                            441                                            441                                            441
rpcs_active     0                                              0                                              0                                              0                                              0                                              0

2. Access an Existing H2O Cluster from an Edge Node
Accessing an existing H2O cluster from an edge node also requires the pysparkling command, but you do not need to specify all the parameters from the previous step that started the cluster. Running it without any parameters is fine for client purposes.

First, start pysparkling by running the /opt/sparkling-water-2.2.2/bin/pysparkling command.

Once at the Python prompt, run the following to connect to an existing cluster. The key is h2o.init(ip="enkbda1node05.enkitec.com", port=26000), which connects to the existing H2O cluster.

from pysparkling import *
from pyspark import SparkContext
from pyspark.sql import SQLContext
import h2o

h2o.init(ip="enkbda1node05.enkitec.com", port=26000)
h2o.cluster_status()

Then run some test code there.

print("Importing hdfs data")
stock_data = h2o.import_file("hdfs://ENKBDA1-ns/user/wzhou/work/test/stock_price.txt")
stock_data

print("Spliting data")
train,test = stock_data.split_frame(ratios=[0.9])

h2o.ls()

The following is the sample output.

[wzhou@ham-lnx-vs-0086 ~]$ /opt/sparkling-water-2.2.2/bin/pysparkling
Python 2.7.13 |Anaconda 4.4.0 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/

Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkSession available as 'spark'.
>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>>
>>> h2o.init(ip="enkbda1node05.enkitec.com", port=26000)
Warning: connecting to remote server but falling back to local... Did you mean to use `h2o.connect()`?
Checking whether there is an H2O instance running at http://enkbda1node05.enkitec.com:26000. connected.
--------------------------  ---------------------------------------
H2O cluster uptime:         5 mins 43 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    58.10 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         locked, healthy
H2O connection url:         http://enkbda1node05.enkitec.com:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  ---------------------------------------
>>> h2o.cluster_status()
[WARNING] in <stdin> line 1:
    >>> ????
        ^^^^ Deprecated, use ``h2o.cluster().show_status(True)``.
--------------------------  ---------------------------------------
H2O cluster uptime:         5 mins 43 secs
H2O cluster version:        3.14.0.7
H2O cluster version age:    1 month and 21 days
H2O cluster name:           WeidongH2O-Cluster
H2O cluster total nodes:    6
H2O cluster free memory:    58.10 Gb
H2O cluster total cores:    192
H2O cluster allowed cores:  192
H2O cluster status:         locked, healthy
H2O connection url:         http://enkbda1node05.enkitec.com:26000
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  ---------------------------------------
Nodes info:     Node 1                                         Node 2                                         Node 3                                         Node 4                                         Node 5                                         Node 6
--------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------  ---------------------------------------------
h2o             enkbda1node08.enkitec.com/192.168.10.44:26000  enkbda1node09.enkitec.com/192.168.10.45:26000  enkbda1node10.enkitec.com/192.168.10.46:26000  enkbda1node11.enkitec.com/192.168.10.47:26000  enkbda1node12.enkitec.com/192.168.10.48:26000  enkbda1node13.enkitec.com/192.168.10.49:26000
healthy         True                                           True                                           True                                           True                                           True                                           True
last_ping       1513109250525                                  1513109250725                                  1513109250400                                  1513109250218                                  1513109250709                                  1513109250536
num_cpus        32                                             32                                             32                                             32                                             32                                             32
sys_load        1.55                                           1.94                                           2.4                                            1.73                                           0.4                                            1.51
mem_value_size  0                                              0                                              0                                              0                                              0                                              0
free_mem        11339923456                                    10122031104                                    10076566528                                    10283636736                                    10280018944                                    10278896640
pojo_mem        113671168                                      1331563520                                     1377028096                                     1169957888                                     1173575680                                     1174697984
swap_mem        0                                              0                                              0                                              0                                              0                                              0
free_disk       422754385920                                   389839585280                                   422231146496                                   418692202496                                   422226952192                                   422176620544
max_disk        491885953024                                   491885953024                                   491885953024                                   491885953024                                   491885953024                                   491885953024
pid             1172                                           801                                            15879                                          17866                                          28980                                          30818
num_keys        0                                              0                                              0                                              0                                              0                                              0
tcps_active     0                                              0                                              0                                              0                                              0                                              0
open_fds        440                                            440                                            440                                            440                                            440                                            440
rpcs_active     0                                              0                                              0                                              0                                              0                                              0
>>> print("Importing hdfs data")
Importing hdfs data
>>> stock_data = h2o.import_file("hdfs://ENKBDA1-ns/user/wzhou/work/test/stock_price.txt")
Parse progress: | 100%

>>>
>>> stock_data
date                   close    volume    open    high     low
-------------------  -------  --------  ------  ------  ------
2016-09-23 00:00:00    24.05     56837   24.13  24.22   23.88
2016-09-22 00:00:00    24.1      56675   23.49  24.18   23.49
2016-09-21 00:00:00    23.38     70925   23.21  23.58   23.025
2016-09-20 00:00:00    23.07     35429   23.17  23.264  22.98
2016-09-19 00:00:00    23.12     34257   23.22  23.27   22.96
2016-09-16 00:00:00    23.16     83309   22.96  23.21   22.96
2016-09-15 00:00:00    23.01     43258   22.7   23.25   22.53
2016-09-14 00:00:00    22.69     33891   22.81  22.88   22.66
2016-09-13 00:00:00    22.81     59871   22.75  22.89   22.53
2016-09-12 00:00:00    22.85    109145   22.9   22.95   22.74

[65 rows x 6 columns]

>>> print("Spliting data")
Spliting data
>>> train,test = stock_data.split_frame(ratios=[0.9])
>>> h2o.ls()
                      key
0  py_1_sid_83d3_splitter
1         stock_price.hex

Let’s go to R Studio. We should see this H2O frame.

Let’s check it out from H2O Flow UI.

Cool, it’s there as well and we’re good here.

Weird Ref-count mismatch Message from H2O

H2O is a nice, fast tool for data science work. I have discussed it in the following blog posts:
H2O vs Sparkling Water
Sparking Water Shell: Cloud size under 12 Exception
Access Sparkling Water via R Studio
Running H2O Cluster in Background and at Specific Port Number
Recently we ran into a weird issue when using H2O from R Studio. For unknown reasons, it threw the following error messages:

score <- as.data.frame(main[,c('id', 'my_test_score')])

ERROR: Unexpected HTTP Status code: 500 Server Error (url = http://enkbda1node05.enkitec.com:26000/99/Rapids)

java.lang.IllegalStateException
 [1] "java.lang.IllegalStateException: Ref-count mismatch for vec $04ffc52a0000ffffffff$hdfs://ENKBDA1-ns/user/wzhou/test/data/test1/part-00000-8516f9f8-3d5f-164b-cf7e-18ca7e8d467f-d000.snappy.parquet: REFCNT = 2, should be 1"
 [2] "    water.rapids.Session.sanity_check_refs(Session.java:341)"                                                                                                                                                                                                     
 [3] "    water.rapids.Session.exec(Session.java:83)"                                                                                                                                                                                                                   
 [4] "    water.rapids.Rapids.exec(Rapids.java:93)"                                                                                                                                                                                                                     
 [5] "    water.api.RapidsHandler.exec(RapidsHandler.java:38)"                                                                                                                                                                                                          
 [6] "    sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)"                                                                                                                                                                                                 
 [7] "    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)"                                                                                                                                                                        
 [8] "    java.lang.reflect.Method.invoke(Method.java:498)"                                                                                                                                                                                                             
 [9] "    water.api.Handler.handle(Handler.java:63)"                                                                                                                                                                                                                    
[10] "    water.api.RequestServer.serve(RequestServer.java:448)"                                                                                                                                                                                                        
[11] "    water.api.RequestServer.doGeneric(RequestServer.java:297)"                                                                                                                                                                                                    
[12] "    water.api.RequestServer.doPost(RequestServer.java:223)"                                                                                                                                                                                                       
[13] "    javax.servlet.http.HttpServlet.service(HttpServlet.java:707)"                                                                                                                                                                                                 
[14] "    javax.servlet.http.HttpServlet.service(HttpServlet.java:790)"                                                                                                                                                                                                 
[15] "    ai.h2o.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)"                                                                                                                                                                                
[16] "    ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)"                                                                                                                                                                            
[17] "    ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)"                                                                                                                                                                    
[18] "    ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)"                                                                                                                                                                             
[19] "    ai.h2o.org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)"                                                                                                                                                                     
[20] "    ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)"                                                                                                                                                                         
[21] "    ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"                                                                                                                                                                 
[22] "    ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"                                                                                                                                                                       
[23] "    water.JettyHTTPD$LoginHandler.handle(JettyHTTPD.java:189)"                                                                                                                                                                                                    
[24] "    ai.h2o.org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"                                                                                                                                                                 
[25] "    ai.h2o.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"                                                                                                                                                                       
[26] "    ai.h2o.org.eclipse.jetty.server.Server.handle(Server.java:370)"                                                                                                                                                                                               
[27] "    ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)"                                                                                                                                                        
[28] "    ai.h2o.org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)"                                                                                                                                                         
[29] "    ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)"                                                                                                                                                              
[30] "    ai.h2o.org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)"                                                                                                                                              
[31] "    ai.h2o.org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)"                                                                                                                                                                                      
[32] "    ai.h2o.org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)"                                                                                                                                                                                 
[33] "    ai.h2o.org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)"                                                                                                                                                                
[34] "    ai.h2o.org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)"                                                                                                                                                          
[35] "    ai.h2o.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)"                                                                                                                                                                      
[36] "    ai.h2o.org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)"                                                                                                                                                                       
[37] "    java.lang.Thread.run(Thread.java:745)"                                                                                                                                                                                                                        

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page,  : 
  
ERROR MESSAGE:

Ref-count mismatch for vec $04ffc52a0000ffffffff$hdfs://ENKBDA1-ns/user/wzhou/test/data/test1/part-00000-8516f9f8-3d5f-164b-cf7e-18ca7e8d467f-d000.snappy.parquet: REFCNT = 2, should be 1

We tried different approaches and still got the same error, and had to bounce the H2O cluster to get rid of it. Further investigation then uncovered the steps that reproduce the issue: deleting some or all H2O frames from the H2O Flow UI can trigger it.

Basically, run the getFrames command, select the frames to be deleted, and click Delete selected frames at the bottom. Run getFrames again and it shows those frames as deleted. But going to R Studio and running h2o.ls() produces exactly this Ref-count mismatch error. It seems the frames were deleted from the H2O UI's perspective, but not from R Studio's perspective.

OK, we found the cause. How do we resolve it? Checking the H2O source code at https://github.com/h2oai/h2o-3/blob/master/h2o-core/src/main/java/water/rapids/Session.java, we can see that this error is thrown in the sanity_check_refs function, which carries the following interesting description:

Check that ref counts are in a consistent state.
* This should only be called between calls to Rapids expressions (otherwise may blow false-positives).

It seems something is not handled well in another part of the code: if the object references are inconsistent, other code paths may fail. It looks like a bug to me.

After some investigation, I found that running h2o.removeAll() from R Studio fixes the issue.

Running H2O Cluster in Background and at Specific Port Number

In a few of my previous blog posts, I discussed H2O vs Sparkling Water, Sparking Water Shell: Cloud size under 12 Exception, and Access Sparkling Water via R Studio.

When using H2O Sparkling Water, there are two common issues. The first is that the default port number is 54321. However, if the H2O cluster has been bounced multiple times, it may come up on a different port, actually the next available port number, 54323. If the cluster is used by many data analysts, it is inconvenient to notify all of them every time the port number changes; you want users to remember only one port number.

The second issue is that the sparkling-shell session cannot run in the background: if you close the PuTTY session running sparkling-shell, the H2O cluster is terminated.

This blog post discusses workarounds for these two issues.

For the port number, there are actually two related parameters: spark.ext.h2o.client.port.base and spark.ext.h2o.node.port.base. spark.ext.h2o.client.port.base is the port number for the H2O UI, while spark.ext.h2o.node.port.base is the port used internally by the H2O cluster for communication among H2O nodes. Make sure these two port numbers are different; setting them to the same value will cause problems. I also set spark.ext.h2o.cloud.name to give my H2O cluster a name.

I created two separate scripts: run_sparkling_shell.sh, which contains the command to run sparkling-shell, and sparkling-shell-init.scala, which contains the Scala startup commands for the H2O cluster.

[wzhou@enkbda1node05 ~]$ cat run_sparkling_shell.sh
bin/sparkling-shell \
--master yarn \
--conf spark.ext.h2o.cloud.name=WeidongH2O-Cluster \
--conf spark.ext.h2o.client.port.base=26000 \
--conf spark.ext.h2o.node.port.base=26005 \
--conf spark.executor.instances=10 \
--conf spark.executor.memory=12g \
--conf spark.driver.memory=8g \
--conf spark.executor.cores=4 \
--conf spark.yarn.executor.memoryOverhead=4g \
--conf spark.yarn.driver.memoryOverhead=4g \
--conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
--conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--conf spark.locality.wait=30000 \
--conf spark.yarn.queue=WZTestPool \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
-i  sparkling-shell-init.scala

The code for sparkling-shell-init.scala:

// Start the H2O cluster on the Spark executors and expose its implicits
import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(spark)
import h2oContext._

To execute sparkling-shell in the background, my first try was nohup. It didn't work: when the sparkling-shell-init.scala script is invoked this way, a :quit command is automatically appended at the end, which terminates the H2O cluster.

When I worked on Exadata, I used the screen command a lot. It is a very useful tool for protecting long-running critical jobs, such as patching/upgrades and imports/exports. So I used the same screen trick to get around the background issue. Here are the steps.

1. Start a screen session
Use the screen -ls command to check whether any screen sessions already exist.

[wzhou@enkbda1node05 ~]$ screen -ls
No Sockets found in /var/run/screen/S-wzhou.

Start a screen session.
[wzhou@enkbda1node05 ~]$ screen

2. Start H2O Cluster
Run the script to start the H2O cluster.

[wzhou@enkbda1node05 ~]$ ./run_sparkling_shell.sh

-----
  Spark master (MASTER)     : yarn
  Spark home   (SPARK_HOME) :
  H2O build version         : 3.14.0.7 (weierstrass)
  Spark build version       : 2.2.0
  Scala version             : 2.11
----

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/anaconda2/lib/python2.7/site-packages/pyspark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/12/09 10:12:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/12/09 10:12:58 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
17/12/09 10:12:59 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://192.168.10.14:4040
Spark context available as 'sc' (master = yarn, app id = application_2567590118914_1007).
Spark session available as 'spark'.
Loading sparkling-shell-init.scala...
import org.apache.spark.h2o._
h2oContext: org.apache.spark.h2o.H2OContext =

Sparkling Water Context:
 * H2O name: WeidongH2O-Cluster
 * cluster size: 10
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (2,enkbda1node08.enkitec.com,26005)
  (4,enkbda1node12.enkitec.com,26005)
  (9,enkbda1node17.enkitec.com,26005)
  (5,enkbda1node13.enkitec.com,26005)
  (7,enkbda1node15.enkitec.com,26005)
  (1,enkbda1node09.enkitec.com,26005)
  (8,enkbda1node04.enkitec.com,26005)
  (6,enkbda1node10.enkitec.com,26005)
  (3,enkbda1node11.enkitec.com,26005)
  (10,enkbda1node05.enkitec.com,26005)
  ------------------------

  Open H2O Flow in browser: http://192.168.10.14:26000 (CMD + click in Mac OSX)


import h2oContext._

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
scala>

3. Verify H2O Cluster
Now close the session and open a new one. Check for the existing screen session.

[wzhou@enkbda1node05 ~]$ screen -ls
There is a screen on:
        19044.pts-0.enkbda1node05       (Detached)
1 Socket in /var/run/screen/S-wzhou.

Now attach to the existing session:
[wzhou@enkbda1node05 ~]$ screen -x 19044

You should see the original session that was running sparkling-shell. The UI still works as expected.

Move Microsoft Office 2011 to macOS High Sierra

The latest version of Mac OS is called macOS High Sierra, and it brings some nice changes. Instead of the HFS+ file system, a new file system, Apple File System (APFS), was introduced; important APFS enhancements include faster file copies, built-in encryption and crash-safe protections. Safari in this version not only becomes much faster (Apple claims it is the fastest desktop browser; I am not sure about that), but also can stop the most annoying thing on the web: autoplay videos. The new Safari also has a feature that lets you view sites in Reader mode. Interesting, I didn't know I could view sites in writer mode before! There are also some improvements to the video and iCloud experience, which I don't really care about.

An important note about Office 2011 caught my attention: Microsoft announced that it will no longer support Office 2011 on macOS High Sierra and that users should switch to Office 2016. I understand the company may want more revenue from selling Office 2016 to existing customers. But while many other programs are still supported and running on macOS High Sierra, it looks bad if Office 2011 does not work in High Sierra, especially since Word and Excel are among the most popular applications people use daily. If that were the case, I would hesitate to upgrade my Macs to the new version, High Sierra.

I have been there before. I used Parallels for my VMs on the Mac many years ago, and it worked great. I forget with which new Mac OS release it started, but Parallels announced that the old version of their software would not work on the new macOS, and existing customers had to pay an extra fee to upgrade to the new version if they wanted to keep using Parallels. If I remember correctly, Parallels has pulled the same trick almost every time a new Mac OS has been released since then. Luckily, I switched to VirtualBox at the time of that first announcement and have been completely rid of Parallels ever since. Although I miss some of Parallels' nice features, VirtualBox meets all of my VM requirements and it is free.

Similarly, if Microsoft Office 2011 does not work at all on macOS High Sierra, I will switch to Apple's Pages and Numbers, or to Google Docs. My use of Word and Excel at home is very basic, just some simple documents, so paying an annual subscription for the new Office 2016 does not make sense to me; my company laptop has an Office 2016 subscription anyway. But if I can keep using my paid Office 2011 on my home Mac, that would be perfect.

I checked the Microsoft Office support page, which has the following statement:

Office for Mac 2011

Word, Excel, PowerPoint, Outlook and Lync have not been tested on macOS 10.13 High Sierra, and no formal support for this configuration will be provided.

All applications in the Office for Mac 2011 suite* are reaching end of support on October 10th, 2017. As a reminder, after that date there will be no new security updates, non-security updates, free or paid assisted support options or technical content updates. Refer to the Microsoft Support Lifecycle for more information.

Interestingly, it sounds like users of Office 2011 will be on their own if they continue to use it on High Sierra, but the wording does not say that Office 2011 will not run on High Sierra; it just means that if you run into issues in Word or Excel, don't call Microsoft support. This is understandable: all applications in the Office for Mac 2011 suite reached the end of support on October 10th, 2017.

OK, this is better. I would see for myself whether Microsoft Office 2011 works on the new macOS High Sierra. I happen to have a Mac at home that badly needs a complete refresh: after many years of use and installs, a lot of software had accumulated and the Mac was terribly slow. I had wanted to do a completely fresh installation for a long time. One of my major concerns in the past was that a Time Machine restore during the OS upgrade would bring a lot of the garbage back. But with a complete wipe I would have to reinstall Office 2011, and I could no longer find my license key. Although there are articles about moving the Office 2011 license files when migrating to a new Mac, I was always skeptical and worried about what would happen if it didn't work. I was stuck in this situation for several years. This time, I had a good opportunity: I had to wipe my Mac for a fresh installation anyway, so I had to test the steps for moving the Office license files. If it failed, I would simply not use Office on this Mac.

After reading some articles about moving Office 2011 to a new Mac and doing some preparation work, I finally succeeded. Actually, I tested two scenarios by accident. On my first attempt I didn't wipe the hard drive, because I worried the OS installation would not complete if I did. I was wrong: there is a separate small partition containing the OS boot files and installer. Without wiping the default hard drive, the installation simply upgraded the machine to macOS High Sierra, and all of my garbage was still there; no wonder the installation was so slow. But at least I verified that Microsoft Office 2011 still worked in the upgrade-to-High-Sierra scenario. Then I tried the correct scenario: wipe the default hard disk first, then do the OS installation. It was much faster and succeeded, and on top of that the Office 2011 migration worked. Here are the steps:

Step 1. Backup Office 2011
Of course, I needed to do a full Time Machine backup before doing anything else. In addition to the backup, I also copied the following to an external drive.

The following directories are mainly for backup purposes and are not actually used during the migration steps, but it's good to back up these Office 2011 directories.

/Library/Application\ Support/Microsoft/
/Applications/Microsoft\ Office\ 2011/

Then back up the following three license-related files:

/Library/PrivilegedHelperTools/com.microsoft.office.licensing.helper
/Library/LaunchDaemons/com.microsoft.office.licensing.helper.plist
/Library/Preferences/com.microsoft.office.licensing.plist

Step 2. Wipe out Default Hard Drive
If you are just doing the High Sierra upgrade, do not do this step: it wipes out everything on the default partition.
After rebooting, immediately press the Option + Command + R keys at the same time. When the OS X Utilities screen shows up, use Disk Utility to wipe the default partition first.

Step 3. Install macOS High Sierra
Then click Reinstall OS X and go through the regular macOS installation screens to complete the High Sierra installation.

Step 4. Migrate Office 2011
Install the Office 2011 software; at the end it shows the license information screen.

Unfortunately, Quit was not working in Office 2011 and I had to force-kill it from Activity Monitor.

Copy the three backed-up license files from the external drive to the same paths on the new system.

/Library/PrivilegedHelperTools/com.microsoft.office.licensing.helper
/Library/LaunchDaemons/com.microsoft.office.licensing.helper.plist
/Library/Preferences/com.microsoft.office.licensing.plist

Start Word or Excel. Office 2011 works on macOS High Sierra. I am sure certain features may not be fully functional on the new OS, but basic functionality seems to work, and that's enough for my usage.