What does "Stage Skipped" mean in Apache Spark web UI?

Apache Spark, RDD

Apache Spark Problem Overview


In my Spark UI, a stage is shown as skipped. What does "skipped" mean?


Apache Spark Solutions


Solution 1 - Apache Spark

Typically, it means that the data was fetched from cache, so there was no need to re-execute the given stage. This is consistent with your DAG, which shows that the next stage requires a shuffle (reduceByKey). Whenever a shuffle is involved, Spark automatically keeps the generated data:

> Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed.

Solution 2 - Apache Spark

Suppose you have an initial DataFrame with some data. You apply a couple of transformations to it and then run multiple actions on the final DataFrame. If you cached a DataFrame, Spark materializes it when you call the first action and keeps it in memory in materialized form. When the next action is called, Spark walks the whole DAG again, sees that the DataFrame was cached, and skips the stages that produced it by using the materialized data already in memory.

When Spark skips a stage, it appears as skipped in the Spark UI. This speeds up your job, because Spark does not have to recompute the DAG from the root and can start from the cached DataFrame.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Aravind Yarram | View Question on Stackoverflow |
| Solution 1 - Apache Spark | zero323 | View Answer on Stackoverflow |
| Solution 2 - Apache Spark | Nikunj Kakadiya | View Answer on Stackoverflow |