What do the numbers on the progress bar mean in spark-shell?

Apache Spark Problem Overview


In my spark-shell, what do entries like the one below mean when I execute a function?

[Stage7:===========>                              (14174 + 5) / 62500]

Apache Spark Solutions


Solution 1 - Apache Spark

What you see is the console progress bar. [Stage 7: is the stage currently running, and (14174 + 5) / 62500] is (numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage. The bar itself is filled in proportion to numCompletedTasks / totalNumOfTasksInThisStage.
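As a rough illustration of how those fields map onto the line, the Scala sketch below parses a progress line back into its parts. ProgressLine and parse are hypothetical names, not part of Spark, and the regular expression is an assumption based only on the sample output in the question.

```scala
// Hypothetical helper, not part of Spark: splits a console progress line into its fields.
final case class ProgressLine(stageId: Int, completed: Int, active: Int, total: Int) {
  // Fraction of the stage's tasks that have finished (0.0 to 1.0).
  def fractionDone: Double = if (total == 0) 0.0 else completed.toDouble / total
}

object ProgressLine {
  // Assumed shape: "[Stage <id>:<bar>(<completed> + <active>) / <total>]".
  // \s* after "Stage" accepts both "Stage7" (as pasted in the question) and "Stage 7".
  private val Pattern = """\[Stage\s*(\d+):[=> ]*\((\d+) \+ (\d+)\) / (\d+)\]""".r

  def parse(line: String): Option[ProgressLine] = line.trim match {
    case Pattern(id, done, running, total) =>
      Some(ProgressLine(id.toInt, done.toInt, running.toInt, total.toInt))
    case _ => None
  }
}

object ParseDemo {
  def main(args: Array[String]): Unit = {
    val sample = "[Stage7:===========>                              (14174 + 5) / 62500]"
    // Prints Some(ProgressLine(7,14174,5,62500))
    println(ProgressLine.parse(sample))
  }
}
```

For the line in the question, fractionDone is 14174 / 62500, or about 0.227, so a little under a quarter of the stage's tasks had finished.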

It is shown only when spark.ui.showConsoleProgress is true (the default) and the log level in conf/log4j.properties is WARN or ERROR, that is, when INFO logging is disabled (!log.isInfoEnabled is true).
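If the bar gets in the way, the setting can be switched off. The fragment below is a minimal sketch for a standalone application, assuming Spark is on the classpath; the object name, application name and local[*] master are placeholders chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical application; only the showConsoleProgress key comes from the answer above.
object QuietProgressApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("quiet-progress-demo")            // placeholder name
      .setMaster("local[*]")                        // assumption: local run
      .set("spark.ui.showConsoleProgress", "false") // turn the console bar off
    val sc = new SparkContext(conf)
    try {
      // A small job with 8 tasks; with the setting above no bar is created for it.
      println(sc.parallelize(1 to 1000, 8).map(_ * 2).count())
    } finally sc.stop()
  }
}
```

In spark-shell the context already exists by the time you get a prompt, so there the same key would be passed at launch instead, for example with --conf spark.ui.showConsoleProgress=false.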

Here is the code in ConsoleProgressBar.scala that renders it:

private def show(now: Long, stages: Seq[SparkStageInfo]): Unit = {
  val width = TerminalWidth / stages.size
  val bar = stages.map { s =>
    val total = s.numTasks()
    val header = s"[Stage ${s.stageId()}:"
    val tailer = s"(${s.numCompletedTasks()} + ${s.numActiveTasks()}) / $total]"
    val w = width - header.length - tailer.length
    val bar = if (w > 0) {
      val percent = w * s.numCompletedTasks() / total
      (0 until w).map { i =>
        if (i < percent) "=" else if (i == percent) ">" else " "
      }.mkString("")
    } else {
      ""
    }
    header + bar + tailer
  }.mkString("")

  // only refresh if it's changed or after 1 minute (or the ssh connection will be closed
  // after being idle for some time)
  if (bar != lastProgressBar || now - lastUpdateTime > 60 * 1000L) {
    System.err.print(CR + bar)
    lastUpdateTime = now
  }
  lastProgressBar = bar
}
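To try the rendering logic outside Spark, here is a minimal standalone sketch of the same calculation. StageProgress and ProgressBarSketch are hypothetical stand-ins (the real method reads SparkStageInfo and the actual terminal width), and the 80-column default is an assumption.

```scala
// Hypothetical stand-in for SparkStageInfo: only the fields the bar needs.
final case class StageProgress(stageId: Int, numTasks: Int, numCompletedTasks: Int, numActiveTasks: Int)

object ProgressBarSketch {
  // Renders one line for all running stages, splitting terminalWidth columns evenly between them.
  def render(stages: Seq[StageProgress], terminalWidth: Int = 80): String =
    if (stages.isEmpty) ""
    else {
      val width = terminalWidth / stages.size
      stages.map { s =>
        val header = s"[Stage ${s.stageId}:"
        val tailer = s"(${s.numCompletedTasks} + ${s.numActiveTasks}) / ${s.numTasks}]"
        // Columns left for the bar once the header and the counters are placed.
        val w = width - header.length - tailer.length
        val bar =
          if (w > 0 && s.numTasks > 0) {
            // Integer division: number of columns filled with '='. Long avoids overflow.
            val filled = (w.toLong * s.numCompletedTasks / s.numTasks).toInt
            (0 until w).map { i =>
              if (i < filled) "=" else if (i == filled) ">" else " "
            }.mkString
          } else ""
        header + bar + tailer
      }.mkString
    }

  def main(args: Array[String]): Unit = {
    // The numbers from the question: stage 7, 14174 done, 5 running, 62500 total.
    println(render(Seq(StageProgress(7, 62500, 14174, 5))))
  }
}
```

With the question's numbers and 80 columns, the header takes 9 characters and the counters take 20, leaving w = 51 for the bar. 51 * 14174 / 62500 truncates to 11, so the bar is eleven = characters followed by > and padding.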

Solution 2 - Apache Spark

Let's assume you see the following (X, A, B and C are always non-negative integers):

[Stage X:==========>            (A + B) / C]

(For example, in the question X=7, A=14174, B=5 and C=62500.)

Here is what is going on at a high level: Spark breaks the work into stages, and each stage into tasks. This progress indicator means that Stage X consists of C tasks. During execution, A and B start at zero and keep changing: A is the number of tasks that have already finished, and B is the number of tasks currently executing.

For a stage with many more tasks than your cluster can run at once, expect B to grow to a number that corresponds to how many tasks your workers can run in parallel, after which A increases as tasks complete. Towards the end, as the last few tasks execute, B decreases until it reaches 0, at which point A equals C, the stage is done, and Spark moves on to the next stage. C stays constant the whole time: it is the total number of tasks in the stage and never changes.
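The way A and B evolve can be sketched with a toy simulation. This is a deliberate simplification, not how Spark schedules work: it assumes every task takes the same time and that tasks finish in lockstep waves, and StageSimulation, simulate and slots are hypothetical names.

```scala
object StageSimulation {
  // Simulates a stage of totalTasks equal-length tasks on `slots` parallel task slots.
  // Returns the (A, B) pair seen at each step: A = tasks finished, B = tasks running.
  def simulate(totalTasks: Int, slots: Int): Seq[(Int, Int)] = {
    require(totalTasks >= 0 && slots > 0, "need a non-negative task count and at least one slot")
    val snapshots = Seq.newBuilder[(Int, Int)]
    var completed = 0
    while (completed < totalTasks) {
      // Run as many tasks as there are free slots, or fewer if the stage is nearly done.
      val active = math.min(slots, totalTasks - completed)
      snapshots += ((completed, active))
      completed += active // the whole wave finishes together (simplifying assumption)
    }
    snapshots += ((completed, 0)) // stage finished: A == C and B == 0
    snapshots.result()
  }

  def main(args: Array[String]): Unit = {
    val total = 10
    // Prints "(0 + 4) / 10", "(4 + 4) / 10", "(8 + 2) / 10", "(10 + 0) / 10"
    simulate(total, slots = 4).foreach { case (a, b) => println(s"($a + $b) / $total") }
  }
}
```

On a real cluster tasks finish at different times, so A creeps up one task at a time while B hovers near the number of slots, but the overall shape is the same: B drops to 0 only in the last wave, when A reaches C.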

The ====> shows the fraction of tasks completed, as described above. At the beginning the > is at the far left, and it moves to the right as tasks complete.

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | rmckeown | View Question on Stack Overflow
Solution 1 - Apache Spark | yjshen | View Answer on Stack Overflow
Solution 2 - Apache Spark | gae123 | View Answer on Stack Overflow