Recently I was approached by one of my clients to help them investigate a weird sparklyr issue. sparklyr is an interface between R and Spark introduced by RStudio about a year ago. The following is the sparklyr architecture.
When trying to run sc <- spark_connect in RStudio, we got two errors as follows.

Here is the detailed message.
> library(sparklyr)
> library(dplyr)
> sc <- spark_connect(master = "yarn-client", config=spark_config(), version="1.6.0",
                      spark_home = '/opt/cloudera/parcels/CDH/lib/spark/')
Error in force(code) :
  Failed while connecting to sparklyr to port (8880) for sessionid (3859):
  Gateway in port (8880) did not respond.
Path: /opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/lib/spark/bin/spark-submit
Parameters: --class, sparklyr.Shell, --jars,
  '/usr/lib64/R/library/sparklyr/java/spark-csv_2.11-1.3.0.jar',
  '/usr/lib64/R/library/sparklyr/java/commons-csv-1.1.jar',
  '/usr/lib64/R/library/sparklyr/java/univocity-parsers-1.5.1.jar',
  '/usr/lib64/R/library/sparklyr/java/sparklyr-1.6-2.10.jar', 8880, 3859
Log: /tmp/RtmzpSIMln/file9e23246605df7_spark.log
....
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
    at org.apache.spark.deploy.SparkSubmitArguments.handle(SparkSubmitArguments.scala:394)
    at org.apache.spark.launcher.SparkSubmitOptionParser.parse(SparkSubmitOptionParser.java:163)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:97)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 5 more
I did some research and found many people reporting similar issues. OK, let's try their recommendations one by one, as follows.
Try running Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark/"). No, not working.
My client installed sparklyr less than a month ago, so I don't see why this option makes sense. I didn't even pursue this path.
The R installation on the same server uses the same version of Java without any issue, so I don't see why the Java installation would be a major concern here. Ignore this one.
Someone said that a Spark installation alone is not enough and that a Hadoop installation is needed as well. Clearly this does not fit our situation: the server is an edge node and has a Hadoop installation.
Running system2('klist') did show that there was no Kerberos ticket. OK, I then opened a shell within RStudio Server by clicking Tools -> Shell and issued the kinit command.

Rerunning system2('klist') showed that I had a valid Kerberos ticket. Tried again; still not working.
Note: even though this step did not resolve the error, it is necessary for further work once the issue is fixed, so run it regardless of the result.
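The ticket check above can also be scripted from R, so it can run before every spark_connect attempt. This is a minimal sketch, assuming klist is on the PATH; the messages are mine, not sparklyr's:

```r
# Check for a valid Kerberos ticket from within R.
# With stdout = TRUE, system2() returns the command output; when the
# command exits non-zero (no ticket cache), the result carries a
# "status" attribute holding the exit code.
klist_out <- suppressWarnings(system2("klist", stdout = TRUE, stderr = TRUE))
has_ticket <- is.null(attr(klist_out, "status"))

if (has_ticket) {
  message("Valid Kerberos ticket found.")
} else {
  # kinit prompts for a password, so in RStudio Server it is easier
  # to run it interactively via Tools -> Shell.
  message("No Kerberos ticket found; run kinit in a shell first.")
}
```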
Someone recommended creating a new configuration and passing it in. It looked like a good idea. Unfortunately, it just doesn't work.
wzconfig <- spark_config()
wzconfig$`sparklyr.shell.deploy-mode` <- "client"
wzconfig$spark.driver.cores <- 1
wzconfig$spark.executor.cores <- 2
wzconfig$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client", config=wzconfig, version="1.6.0",
                    spark_home = '/opt/cloudera/parcels/CDH/lib/spark/')
Actually, this recommendation is missing another key parameter: by default, the total number of executors launched is 2. I would usually bump this number up a little to get better performance. You can set the total number of executors as follows.
wzconfig$spark.executor.instances <- 3
Although this approach looked promising, it still didn't work. It is, however, definitely a useful approach for other purposes, such as better controlling Spark resource usage.
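As a quick sanity check when tuning these values, the settings above translate into a total cluster footprint you can work out up front. A small sketch, using the numbers from the example config:

```r
# Mirror the example config values.
executor_instances <- 3   # spark.executor.instances
executor_cores     <- 2   # spark.executor.cores
executor_memory_gb <- 4   # spark.executor.memory ("4G")

# Total resources YARN will be asked for (excluding driver and
# memory overhead, which YARN adds on top of executor memory).
total_cores     <- executor_instances * executor_cores
total_memory_gb <- executor_instances * executor_memory_gb

cat(sprintf("Requested: %d executor cores, %d GB executor memory\n",
            total_cores, total_memory_gb))
```

This kind of arithmetic is worth doing before connecting, since asking for more than the YARN queue allows is another common reason executors fail to launch.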
Someone mentioned setting a remote gateway address. I thought this could be another potential option, as I have resolved Spark issues related to the local IP in the past. So I added the following code to the configuration from the previous example; note that the parameter sparklyr.gateway.address is the hostname of the active Resource Manager.
wzconfig$sparklyr.gateway.remote <- TRUE
wzconfig$sparklyr.gateway.address <- "cdhcluster01n03.mycompany.com"
Not working for this case.
This one is probably the most unrealistic. When connecting with master = "yarn-cluster", the Spark driver runs somewhere inside the Spark cluster. For our case, I don't believe this is the right solution, so I didn't even try it.
Someone recommended running a spark-submit to verify that SparkPi can be run from the environment. This looks reasonable. As it happens, I figured out the issue before executing this one, but it is definitely a valid and useful test of spark-submit.
/opt/cloudera/parcels/SPARK2/lib/spark2/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --deploy-mode client \
  --master yarn \
  /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_2.11-2.1.0.jar 10
There is an interesting post, Add support for `yarn-cluster` with high availability #905, discussing whether the issue might be related to multiple Resource Managers. We use HA, so this post is an interesting one, but it probably doesn't fit our case: I felt we had not even reached the HA part yet, given the Class Not Found message.
I verified it, and we do have it, so this is not the issue.
After reviewing or trying out some of the above solutions, I went back to my own way of thinking. I must say I am not an expert in R or RStudio, with very limited knowledge of how they work, but I do have an extensive background in Spark tuning and troubleshooting.
I know the error message Gateway in port (8880) did not respond is always the first to show up and looks like the cause of the issue, but I thought differently. The second error, NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream, looked more suspicious to me than the first one. Earlier this year I helped another client with a weird Spark job issue that, in the end, was caused by an incorrect path. It seemed to me that a bad path could break Spark and, in turn, cause the first error about the port not responding.
With this idea in mind, I focused on path verification. Run the command Sys.getenv() to list the environment, as follows.
> Sys.getenv()
DISPLAY                  :0
EDITOR                   vi
GIT_ASKPASS              rpostback-askpass
HADOOP_CONF_DIR          /etc/hadoop/conf.cloudera.hdfs
HADOOP_HOME              /opt/cloudera/parcels/CDH
HOME                     /home/wzhou
JAVA_HOME                /usr/java/jdk1.8.0_144/jre
LANG                     en_US.UTF-8
LD_LIBRARY_PATH          /usr/lib64/R/lib::/lib:/usr/java/jdk1.8.0_92/jre/lib/amd64/server
LN_S                     ln -s
LOGNAME                  wzhou
MAKE                     make
PAGER                    /usr/bin/less
PATH                     /usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
R_BROWSER                /usr/bin/xdg-open
R_BZIPCMD                /usr/bin/bzip2
R_DOC_DIR                /usr/share/doc/R-3.3.2
R_GZIPCMD                /usr/bin/gzip
R_HOME                   /usr/lib64/R
R_INCLUDE_DIR            /usr/include/R
R_LIBS_SITE              /usr/local/lib/R/site-library:/usr/local/lib/R/library:/usr/lib64/R/library:/usr/share/R/library
R_LIBS_USER              ~/R/x86_64-redhat-linux-gnu-library/3.3
R_PAPERSIZE              a4
R_PDFVIEWER              /usr/bin/xdg-open
R_PLATFORM               x86_64-redhat-linux-gnu
R_PRINTCMD               lpr
R_RD4PDF                 times,hyper
R_SESSION_TMPDIR         /tmp/RtmpZf9YMN
R_SHARE_DIR              /usr/share/R
R_SYSTEM_ABI             linux,gcc,gxx,gfortran,?
R_TEXI2DVICMD            /usr/bin/texi2dvi
R_UNZIPCMD               /usr/bin/unzip
R_ZIPCMD
RMARKDOWN_MATHJAX_PATH   /usr/lib/rstudio-server/resources/mathjax-26
RS_RPOSTBACK_PATH        /usr/lib/rstudio-server/bin/rpostback
RSTUDIO                  1
RSTUDIO_HTTP_REFERER     http://hadoop-edge06.mycompany.com:8787/
RSTUDIO_PANDOC           /usr/lib/rstudio-server/bin/pandoc
RSTUDIO_SESSION_STREAM   wzhou-d
RSTUDIO_USER_IDENTITY    wzhou
RSTUDIO_WINUTILS         bin/winutils
SED                      /bin/sed
SPARK_HOME               /opt/cloudera/parcels/SPARK2/lib/spark2
SSH_ASKPASS              rpostback-askpass
TAR                      /bin/gtar
USER                     wzhou
YARN_CONF_DIR            /etc/hadoop/conf.cloudera.yarn
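A listing like this is long, so when hunting for a missing variable it helps to filter it down to just the Spark- and Hadoop-related entries. A small sketch:

```r
# Sys.getenv() with no arguments returns a named character vector of
# the whole environment; keep only Spark/Hadoop/YARN-related entries.
env <- Sys.getenv()
cluster_vars <- env[grepl("SPARK|HADOOP|YARN", names(env))]
print(cluster_vars)
```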
Ahhh, I noticed that the environment was missing the SPARK_DIST_CLASSPATH variable. I then set it using the command below, just before sc <- spark_connect.
Sys.setenv(SPARK_DIST_CLASSPATH = '/etc/hadoop/con:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop/.//*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-hdfs/./:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-yarn/.//*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/.//*')
Ok, try it again. Fantastic, it works!
OK, here is the real cause of the issue. It's unnecessary to specify a Java path for sparklyr, as it does not require one directly. However, it does depend on spark-submit: when spark-submit executes, it reads the Java path and submits the jar files to Spark accordingly. The cause of the issue is that if SPARK_DIST_CLASSPATH is not set, spark-submit does not work and Spark executors cannot be launched.
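Rather than hardcoding that long classpath string, the same value can be derived from the Hadoop installation itself via the `hadoop classpath` subcommand. This is a sketch under the assumption that the hadoop binary is on the PATH of the edge node:

```r
# Build SPARK_DIST_CLASSPATH from `hadoop classpath` instead of
# hardcoding it, so the value tracks the installed CDH parcel.
# Assumes the `hadoop` command is available on this edge node.
hadoop_cp <- system2("hadoop", "classpath", stdout = TRUE)
Sys.setenv(SPARK_DIST_CLASSPATH = paste(hadoop_cp, collapse = ":"))
```

This keeps the setting correct across parcel upgrades, since the classpath is recomputed each time instead of being frozen in the script.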
The following are some useful commands:
spark_home_dir() or spark_home
config <- spark_config()
Also, there are a few useful articles about sparklyr and RStudio:
RStudio’s R Interface to Spark on Amazon EMR
How to Install RStudio Server on CentOS 7
Using R with Apache Spark
sparklyr: a test drive on YARN
Analyzing a billion NYC taxi trips in Spark