How to prevent Spark Executors from getting Lost when using YARN client mode?

Tags: Apache Spark, Hadoop YARN

Apache Spark Problem Overview


I have a Spark job which runs fine locally with less data, but when I schedule it on YARN I keep getting the following error: slowly all executors get removed from the UI and the job fails.

15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 8 on myhost1.com: remote Rpc client disassociated
15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 6 on myhost2.com: remote Rpc client disassociated

I use the following command to submit the Spark job in yarn-client mode:

 ./spark-submit --class com.xyz.MySpark --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" --driver-java-options -XX:MaxPermSize=512m --driver-memory 3g --master yarn-client --executor-memory 2G --executor-cores 8 --num-executors 12  /home/myuser/myspark-1.0.jar

What is the problem here? I am new to Spark.

Apache Spark Solutions


Solution 1 - Apache Spark

I had a very similar problem: many executors were being lost no matter how much memory we allocated to them.

The solution, if you're using YARN, was to set --conf spark.yarn.executor.memoryOverhead=600; alternatively, if your cluster uses Mesos, you can try --conf spark.mesos.executor.memoryOverhead=600 instead.

In Spark 2.3.1+ the configuration option was renamed to --conf spark.executor.memoryOverhead=600 (the old spark.yarn.executor.memoryOverhead name is deprecated).

It seems we were not leaving sufficient memory for YARN itself, and containers were being killed because of it. After setting the overhead we hit different out-of-memory errors, but not the same lost-executor problem.
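As a sketch, the submit command from the question with the overhead setting added might look like this (600 MB is the value that worked in the answer above; tune it for your own workload, and use spark.executor.memoryOverhead on Spark 2.3.1+):

```shell
./spark-submit --class com.xyz.MySpark \
  --master yarn-client \
  --driver-memory 3g \
  --driver-java-options -XX:MaxPermSize=512m \
  --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" \
  --conf spark.yarn.executor.memoryOverhead=600 \
  --executor-memory 2G --executor-cores 8 --num-executors 12 \
  /home/myuser/myspark-1.0.jar
```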

Solution 2 - Apache Spark

You can follow this AWS blog post to calculate the memory overhead (and other Spark configs to tune): best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr
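For context, Spark's default executor memory overhead is max(384 MB, 10% of executor memory), which is why small executors like the 2G ones in the question sit at the 384 MB floor and can benefit from an explicit bump. A quick shell sketch of that default calculation (the variable names here are illustrative, not Spark settings):

```shell
#!/bin/sh
# Approximate Spark's default executor memory overhead:
# max(384 MB, 10% of executor memory), all values in MB.
executor_memory_mb=2048          # matches --executor-memory 2G from the question
overhead_mb=$(( executor_memory_mb / 10 ))
if [ "$overhead_mb" -lt 384 ]; then
  overhead_mb=384                # Spark enforces a 384 MB minimum
fi
echo "default overhead: ${overhead_mb} MB"
```

For 2G executors this prints 384 MB, so the container's total request is executor memory plus overhead, and YARN kills the container if the process exceeds it.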

Solution 3 - Apache Spark

When I had the same issue, deleting old logs and freeing up more HDFS space fixed it.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Umesh K | View Question on Stackoverflow
Solution 1 - Apache Spark | whaleberg | View Answer on Stackoverflow
Solution 2 - Apache Spark | piritocle | View Answer on Stackoverflow
Solution 3 - Apache Spark | Karn_way | View Answer on Stackoverflow