Recently I was approached by one of my clients to help them investigate a weird sparklyr issue. sparklyr is an interface between R and Spark introduced by RStudio about a year ago. The following diagram shows the sparklyr architecture.

When trying to run sc <- spark_connect in RStudio, we got two errors as follows:
Failed while connecting to sparklyr to port (8880) for sessionid (3859): Gateway in port (8880) did not respond.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
Here is the detailed message.
> library(sparklyr)
> library(dplyr)
> sc <- spark_connect(master = "yarn-client", config=spark_config(), version="1.6.0", spark_home = '/opt/cloudera/parcels/CDH/lib/spark/')
Error in force(code) :
Failed while connecting to sparklyr to port (8880) for sessionid (3859): Gateway in port (8880) did not respond.
Path: /opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/lib/spark/bin/spark-submit
Parameters: --class, sparklyr.Shell, --jars, '/usr/lib64/R/library/sparklyr/java/spark-csv_2.11-1.3.0.jar','/usr/lib64/R/library/sparklyr/java/commons-csv-1.1.jar','/usr/lib64/R/library/sparklyr/java/univocity-parsers-1.5.1.jar', '/usr/lib64/R/library/sparklyr/java/sparklyr-1.6-2.10.jar', 8880, 3859
Log: /tmp/RtmzpSIMln/file9e23246605df7_spark.log
....
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.deploy.SparkSubmitArguments.handle(SparkSubmitArguments.scala:394)
at org.apache.spark.launcher.SparkSubmitOptionParser.parse(SparkSubmitOptionParser.java:163)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:97)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more
I did some research and found many people reporting similar issues. OK, let's try their recommendations one by one.
Set the SPARK_HOME environment variable
Tried running Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark/"). No, not working.
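For reference, a minimal sketch of that attempt, mirroring the original connect call (the CDH path is the one from this cluster; adjust it for your own environment):
# Set SPARK_HOME for the current R session, then connect as before
Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark/")
sc <- spark_connect(master = "yarn-client", config = spark_config(), version = "1.6.0", spark_home = Sys.getenv("SPARK_HOME"))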
Install the latest version of sparklyr
My client installed sparklyr less than one month ago, so I don't see why this option makes sense. I didn't even pursue this path.
Check Java Installation
R on the same server uses the same version of Java without any issue, so I don't see why the Java installation would be a major concern here. Ignore this one.
No Hadoop Installation
Someone said a Spark-only installation is not enough and you need a Hadoop installation as well. Clearly this does not fit our situation: the server is an edge node and has a Hadoop installation.
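One quick way to confirm the edge node really has a Hadoop client is to call the hadoop CLI from R; a minimal sketch, assuming the hadoop binary is on the PATH:
# Print the Hadoop client version; an error here would suggest no Hadoop client on this node
system2("hadoop", "version")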
No valid Kerberos ticket
Running system2('klist') showed there was no Kerberos ticket. OK, I then opened a shell within RStudio Server (Tools -> Shell) and issued the kinit command.
Rerunning system2('klist') showed I had a valid Kerberos ticket. Tried again. Still not working.
Note: even though it did not fix the issue, this step is necessary for further work once the issue is fixed, so it still needs to be run no matter what the result is.
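A minimal sketch of that ticket check from the R console (the principal shown is hypothetical; kinit itself is best run from Tools -> Shell since it prompts for a password):
# Check for a valid Kerberos ticket from the R session
system2("klist")
# If no ticket is listed, obtain one from a shell, e.g.:
#   kinit wzhou@MYCOMPANY.COM
# then verify again
system2("klist")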
Create a different config and pass it to spark_connect
Someone recommended creating a new config and passing it in. It looked like a good idea. Unfortunately, it just didn't work.
wzconfig <- spark_config()
wzconfig$`sparklyr.shell.deploy-mode` <- "client"
wzconfig$spark.driver.cores <- 1
wzconfig$spark.executor.cores <- 2
wzconfig$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client", config=wzconfig, version="1.6.0", spark_home = '/opt/cloudera/parcels/CDH/lib/spark/')
Actually this recommendation is missing another key parameter. By default the total number of executors launched is 2. I would usually bump up this number a little to get better performance. You can set the total number of executors as follows.
wzconfig$spark.executor.instances <- 3
Although this approach looked promising, it still didn't work. But it is definitely a useful way to better control Spark resource usage for other purposes.
Add remote address
Someone mentioned setting the remote address. I thought this could be another potential option, as I had resolved Spark issues related to the local IP in the past. So I added the following code to the configuration from the previous example; note that the parameter sparklyr.gateway.address is the hostname of the active Resource Manager.
wzconfig$sparklyr.gateway.remote <- TRUE
wzconfig$sparklyr.gateway.address <- "cdhcluster01n03.mycompany.com"
Not working for this case.
Change deployment mode to yarn-cluster
This is probably the most unrealistic one. If connecting with master = "yarn-cluster", the Spark driver will run somewhere inside the Spark cluster. For our case, I don't believe this is the right solution. I didn't even try it.
Run Spark example
Someone recommended running spark-submit to verify that the SparkPi example can be run from the environment. This looks reasonable. Luckily, I figured out the issue before executing this one, but it is definitely a valid and good test of spark-submit.
/opt/cloudera/parcels/SPARK2/lib/spark2/bin/spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode client --master yarn /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_2.11-2.1.0.jar 10
HA for yarn-cluster
There is an interesting post, Add support for `yarn-cluster` with high availability #905, discussing how the issue might relate to multiple Resource Managers. We use HA, so this post is interesting, but it might not fit our case because I feel we had not even reached the HA part yet given the ClassNotFound message.
Need to set JAVA_HOME
Verified it, and we have it set. So this is not the issue.
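A quick way to verify it from the R session, a minimal sketch:
Sys.getenv("JAVA_HOME")
# [1] "/usr/java/jdk1.8.0_144/jre"   # the value on this server, matching the environment listing below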
My Solution
After reviewing or trying out some of the above solutions, I went back to my own way of thinking. I must say I am not an expert in R or RStudio, with very limited knowledge of how they work, but I do have an extensive background in Spark tuning and troubleshooting.
I know the error message Gateway in port (8880) did not respond is always the first message that shows up and looks like the cause of the issue, but I thought differently. The second error, NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream, looked more suspicious to me. Earlier this year I helped another client with a weird Spark job issue that, in the end, was caused by an incorrect path. It seemed to me that a bad path might have broken Spark, which in turn caused the first error about the port not responding.
With this idea in mind, I focused more on path verification. I ran the command Sys.getenv() to list the environment variables as follows.
> Sys.getenv()
DISPLAY :0
EDITOR vi
GIT_ASKPASS rpostback-askpass
HADOOP_CONF_DIR /etc/hadoop/conf.cloudera.hdfs
HADOOP_HOME /opt/cloudera/parcels/CDH
HOME /home/wzhou
JAVA_HOME /usr/java/jdk1.8.0_144/jre
LANG en_US.UTF-8
LD_LIBRARY_PATH /usr/lib64/R/lib::/lib:/usr/java/jdk1.8.0_92/jre/lib/amd64/server
LN_S ln -s
LOGNAME wzhou
MAKE make
PAGER /usr/bin/less
PATH /usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
R_BROWSER /usr/bin/xdg-open
R_BZIPCMD /usr/bin/bzip2
R_DOC_DIR /usr/share/doc/R-3.3.2
R_GZIPCMD /usr/bin/gzip
R_HOME /usr/lib64/R
R_INCLUDE_DIR /usr/include/R
R_LIBS_SITE /usr/local/lib/R/site-library:/usr/local/lib/R/library:/usr/lib64/R/library:/usr/share/R/library
R_LIBS_USER ~/R/x86_64-redhat-linux-gnu-library/3.3
R_PAPERSIZE a4
R_PDFVIEWER /usr/bin/xdg-open
R_PLATFORM x86_64-redhat-linux-gnu
R_PRINTCMD lpr
R_RD4PDF times,hyper
R_SESSION_TMPDIR /tmp/RtmpZf9YMN
R_SHARE_DIR /usr/share/R
R_SYSTEM_ABI linux,gcc,gxx,gfortran,?
R_TEXI2DVICMD /usr/bin/texi2dvi
R_UNZIPCMD /usr/bin/unzip
R_ZIPCMD
RMARKDOWN_MATHJAX_PATH /usr/lib/rstudio-server/resources/mathjax-26
RS_RPOSTBACK_PATH /usr/lib/rstudio-server/bin/rpostback
RSTUDIO 1
RSTUDIO_HTTP_REFERER http://hadoop-edge06.mycompany.com:8787/
RSTUDIO_PANDOC /usr/lib/rstudio-server/bin/pandoc
RSTUDIO_SESSION_STREAM wzhou-d
RSTUDIO_USER_IDENTITY wzhou
RSTUDIO_WINUTILS bin/winutils
SED /bin/sed
SPARK_HOME /opt/cloudera/parcels/SPARK2/lib/spark2
SSH_ASKPASS rpostback-askpass
TAR /bin/gtar
USER wzhou
YARN_CONF_DIR /etc/hadoop/conf.cloudera.yarn
Ahhh, I noticed the environment was missing the SPARK_DIST_CLASSPATH environment variable. I then set it using the command below, just before sc <- spark_connect.
Sys.setenv(SPARK_DIST_CLASSPATH = '/etc/hadoop/con:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop/.//*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-hdfs/./:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop-yarn/.//*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/.//*')
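Before reconnecting, a quick check like this confirms the variable is now visible to the R session (a minimal sketch):
Sys.getenv("SPARK_DIST_CLASSPATH")
# an empty string "" would mean the variable is still not set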
Ok, try it again. Fantastic, it works!
Cause
Ok, here is the real cause of the issue. It is unnecessary to specify a Java path for sparklyr, as it does not require one directly. However, it does depend on spark-submit. When spark-submit is executed, it reads the Java path and then submits the jar files to Spark accordingly. The cause of the issue is that when SPARK_DIST_CLASSPATH is not set, spark-submit does not work and the Spark executors cannot be launched.
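As a side note, instead of hard-coding that long classpath, it could likely be derived from the Hadoop client itself. This is only a sketch, assuming the hadoop binary is on the PATH of the edge node; it is not the exact command verified above:
# Derive the Hadoop classpath from the client and expose it to spark-submit
library(sparklyr)
Sys.setenv(SPARK_DIST_CLASSPATH = system("hadoop classpath", intern = TRUE))
sc <- spark_connect(master = "yarn-client", config = spark_config(), version = "1.6.0", spark_home = "/opt/cloudera/parcels/CDH/lib/spark/")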
Other Notes
The following are some useful commands:
ls()
spark_installed_versions()
sessionInfo()
spark_home_dir() or spark_home
path.expand(“~”)
Sys.getenv(“SPARK_HOME”)
spark_home_dir()
character(0)
config <- spark_config()
spark_install_dir()
sc
backend
monitor
output_file
spark_context
java_context
hive_context
master
method
app_name
config
config$sparklyr.cores.local
config$spark.sql.shuffle.partitions.local
config$spark.env.SPARK_LOCAL_IP.local
config$sparklyr.csv.embedded
config$`sparklyr.shell.driver-class-path`
Also, there are a few useful articles about sparklyr and RStudio:
RStudio’s R Interface to Spark on Amazon EMR
How to Install RStudio Server on CentOS 7
Using R with Apache Spark
sparklyr: a test drive on YARN
Analyzing a billion NYC taxi trips in Spark