Typically it means that the data was fetched from cache, so there was no need to re-execute the given stage. This is consistent with your DAG, which shows that the next stage requires shuffling (`reduceByKey`). Whenever shuffling is involved, Spark [automatically caches the generated data](https://spark.apache.org/docs/1.5.0/programming-guide.html#performance-impact):

> Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed.

Suppose you have an initial DataFrame with some data. You apply a couple of transformations to it and then run multiple actions on the final DataFrame. If you have cached a DataFrame, Spark materializes it when you call the first action and keeps it in memory in materialized form. When the next action is called, Spark walks the whole DAG again, sees that the DataFrame was cached, and skips those stages by reusing the materialized data that is already in memory.

When Spark skips a stage, you will see it marked as skipped in the Spark UI. This speeds up your job, because Spark does not have to compute the DAG from the root and can start its work after the cached DataFrame.
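The reuse described above can be illustrated without a cluster. The following is a toy model in plain Python, not Spark's scheduler: the class name `ToyScheduler`, the stage names, and the dictionary standing in for preserved shuffle files are all assumptions made for illustration. It only shows the idea that a stage whose shuffle output already exists is reported as skipped when a later action walks the same lineage.

```python
from collections import defaultdict


class ToyScheduler:
    """Toy stand-in for a DAG scheduler; NOT Spark's real implementation."""

    def __init__(self):
        self.shuffle_files = {}  # stage name -> preserved shuffle output
        self.log = []            # (stage name, "ran" or "skipped"), in order

    def map_stage(self, name, records):
        # If shuffle output for this stage was preserved, reuse it.
        if name in self.shuffle_files:
            self.log.append((name, "skipped"))
            return self.shuffle_files[name]
        grouped = defaultdict(list)
        for key, value in records:
            grouped[key].append(value)
        self.shuffle_files[name] = dict(grouped)
        self.log.append((name, "ran"))
        return self.shuffle_files[name]

    def run_action(self, records, action):
        # Every action walks the full lineage: map stage, then reduce stage.
        shuffled = self.map_stage("stage 0 (map side)", records)
        # Plays the role of reduceByKey with addition.
        reduced = {key: sum(values) for key, values in shuffled.items()}
        self.log.append(("stage 1 (reduce side)", "ran"))
        return action(reduced)


scheduler = ToyScheduler()
data = [("a", 1), ("b", 2), ("a", 3)]

count = scheduler.run_action(data, len)       # first action: both stages run
collected = scheduler.run_action(data, dict)  # second action: stage 0 is skipped

for stage, status in scheduler.log:
    print(stage, "->", status)
```

In this sketch the second action still runs the reduce side, which matches the picture in the UI: only the stages upstream of the preserved shuffle output are skipped.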

My `Application` class looks like this:
    
    public class Test extends Application {
    
    	private static Logger logger = LogManager.getRootLogger();
        
    	@Override
    	public void start(Stage primaryStage) throws Exception {
    
    		String resourcePath = "/resources/fxml/MainView.fxml";
    		URL location = getClass().getResource(resourcePath);
    		FXMLLoader fxmlLoader = new FXMLLoader(location);
    
    		Scene scene = new Scene(fxmlLoader.load(), 500, 500);
    
    		primaryStage.setScene(scene);
    		primaryStage.show();
    	}
        
    	public static void main(String[] args) {
    		launch(args);
    	}
    }
    
The `FXMLLoader` creates an instance of the corresponding controller (given in the `FXML` file via `fx:controller`) by invoking first the default constructor and then the `initialize` method:
    
    public class MainViewController {
    
    	public MainViewController() {
    		System.out.println("first");
    	}
    
    	@FXML
    	public void initialize() {
    		System.out.println("second");
    	}
    }

The output is:

    first
    second

So, why does the `initialize` method exist? What is the difference between using a constructor and using the `initialize` method to set up the things the controller requires?

Thanks for your suggestions!

JavaFX FXML controller - constructor vs initialize method

I am currently developing a Node.js project and found out that there is no built-in functionality to create JSDoc comments for functions/methods.

I am aware of the TypeScript definitions that exist, but I couldn't really find anything to match what I need.

WebStorm, for example, has some pretty neat JSDoc functionalities. Can one somehow achieve a similar functionality?

Is there a way to generate JSDoc comments in Visual Studio Code

From my Spark UI. What does it mean by skipped?

[![enter image description here][1]][1]


  [1]: http://i.stack.imgur.com/cyvd1.png

What does "Stage Skipped" mean in Apache Spark web UI?



I tried to start Spark 1.6.0 (spark-1.6.0-bin-hadoop2.4) on Mac OS Yosemite 10.10.5 using

    "./bin/spark-shell".

It fails with the error below. I also tried installing different versions of Spark, but all of them give the same error. This is the second time I'm running Spark; my previous run worked fine.
    
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
    To adjust logging level use sc.setLogLevel("INFO")
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
          /_/
    
    Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79)
    Type in expressions to have them evaluated.
    Type :help for more information.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
    16/01/04 13:49:40 ERROR SparkContext: Error initializing SparkContext.
    java.net.BindException: Can't assign requested address: Service 'sparkDriver' failed after 16 retries!
    	at sun.nio.ch.Net.bind0(Native Method)
    	at sun.nio.ch.Net.bind(Net.java:444)
    	at sun.nio.ch.Net.bind(Net.java:436)
    	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
    	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:125)
    	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:485)
    	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1089)
    	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:430)
    	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:415)
    	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:903)
    	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:198)
    	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:348)
    	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
    	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    	at java.lang.Thread.run(Thread.java:745)
    java.net.BindException: Can't assign requested address: Service 'sparkDriver' failed after 16 retries!
    	at sun.nio.ch.Net.bind0(Native Method)
    	at sun.nio.ch.Net.bind(Net.java:444)
    	at sun.nio.ch.Net.bind(Net.java:436)
    	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
    	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:125)
    	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:485)
    	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1089)
    	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:430)
    	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:415)
    	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:903)
    	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:198)
    	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:348)
    	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
    	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    	at java.lang.Thread.run(Thread.java:745)
    
    java.lang.NullPointerException
    	at org.apache.spark.sql.SQLContext$.createListenerAndUI(SQLContext.scala:1367)
    	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    	at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
    	at $iwC$$iwC.<init>(<console>:15)
    	at $iwC.<init>(<console>:24)
    	at <init>(<console>:26)
    	at .<init>(<console>:30)
    	at .<clinit>(<console>)
    	at .<init>(<console>:7)
    	at .<clinit>(<console>)
    	at $print(<console>)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:606)
    	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
    	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
    	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
    	at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
    	at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)
    	at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
    	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
    	at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:159)
    	at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
    	at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:108)
    	at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
    	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
    	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
    	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
    	at org.apache.spark.repl.Main$.main(Main.scala:31)
    	at org.apache.spark.repl.Main.main(Main.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:606)
    	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    
    <console>:16: error: not found: value sqlContext
             import sqlContext.implicits._
                    ^
    <console>:16: error: not found: value sqlContext
             import sqlContext.sql

Then I added

    export SPARK_LOCAL_IP="127.0.0.1"

to `spark-env.sh`, and the error changed to:

   

    ERROR : No route to host
    java.net.ConnectException: No route to host
    	at java.net.Inet6AddressImpl.isReachable0(Native Method)
    	at java.net.Inet6AddressImpl.isReachable(Inet6AddressImpl.java:77)
    	at java.net.InetAddress.isReachable(InetAddress.java:475)
    ...
    <console>:10: error: not found: value sqlContext
           import sqlContext.implicits._
                  ^
    <console>:10: error: not found: value sqlContext
           import sqlContext.sql

Mac spark-shell Error initializing SparkContext

I'm new to Apache Spark, and apparently I installed apache-spark with Homebrew on my MacBook:

    Last login: Fri Jan  8 12:52:04 on console
    user@MacBook-Pro-de-User-2:~$ pyspark
    Python 2.7.10 (default, Jul 13 2015, 12:05:58)
    [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    16/01/08 14:46:44 INFO SparkContext: Running Spark version 1.5.1
    16/01/08 14:46:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/01/08 14:46:47 INFO SecurityManager: Changing view acls to: user
    16/01/08 14:46:47 INFO SecurityManager: Changing modify acls to: user
    16/01/08 14:46:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
    16/01/08 14:46:50 INFO Slf4jLogger: Slf4jLogger started
    16/01/08 14:46:50 INFO Remoting: Starting remoting
    16/01/08 14:46:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.64:50199]
    16/01/08 14:46:51 INFO Utils: Successfully started service 'sparkDriver' on port 50199.
    16/01/08 14:46:51 INFO SparkEnv: Registering MapOutputTracker
    16/01/08 14:46:51 INFO SparkEnv: Registering BlockManagerMaster
    16/01/08 14:46:51 INFO DiskBlockManager: Created local directory at /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/blockmgr-769e6f91-f0e7-49f9-b45d-1b6382637c95
    16/01/08 14:46:51 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
    16/01/08 14:46:52 INFO HttpFileServer: HTTP File server directory is /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/spark-8e4749ea-9ae7-4137-a0e1-52e410a8e4c5/httpd-1adcd424-c8e9-4e54-a45a-a735ade00393
    16/01/08 14:46:52 INFO HttpServer: Starting HTTP Server
    16/01/08 14:46:52 INFO Utils: Successfully started service 'HTTP file server' on port 50200.
    16/01/08 14:46:52 INFO SparkEnv: Registering OutputCommitCoordinator
    16/01/08 14:46:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
    16/01/08 14:46:52 INFO SparkUI: Started SparkUI at http://192.168.1.64:4040
    16/01/08 14:46:53 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
    16/01/08 14:46:53 INFO Executor: Starting executor ID driver on host localhost
    16/01/08 14:46:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50201.
    16/01/08 14:46:53 INFO NettyBlockTransferService: Server created on 50201
    16/01/08 14:46:53 INFO BlockManagerMaster: Trying to register BlockManager
    16/01/08 14:46:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:50201 with 530.0 MB RAM, BlockManagerId(driver, localhost, 50201)
    16/01/08 14:46:53 INFO BlockManagerMaster: Registered BlockManager
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
          /_/
    
    Using Python version 2.7.10 (default, Jul 13 2015 12:05:58)
    SparkContext available as sc, HiveContext available as sqlContext.
    >>>

I would like to start experimenting in order to learn more about MLlib. However, I use PyCharm to write Python scripts. The problem is that when I try to import pyspark in PyCharm, PyCharm cannot find the module. I tried adding the path to PyCharm as follows:

[![cant link pycharm with spark][1]][1]

Then from a [blog][2] I tried this:

    import os
    import sys
    
    # Path for spark source folder
    os.environ['SPARK_HOME']="/Users/user/Apps/spark-1.5.2-bin-hadoop2.4"
    
    # Append pyspark  to Python Path
    sys.path.append("/Users/user/Apps/spark-1.5.2-bin-hadoop2.4/python/pyspark")
    
    try:
        from pyspark import SparkContext
        from pyspark import SparkConf
        print ("Successfully imported Spark Modules")
    
    except ImportError as e:
        print ("Can not import Spark Modules", e)
        sys.exit(1)

I still cannot use PySpark from PyCharm. Any idea how to "link" PyCharm with apache-pyspark?
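A common cause of this kind of `ImportError` is that the interpreter needs the directory that *contains* the `pyspark` package (`$SPARK_HOME/python`), plus the bundled Py4J archive under `$SPARK_HOME/python/lib`, rather than the `pyspark` directory itself. The following is a minimal sketch, assuming the usual layout of a Spark binary distribution; the helper name `spark_python_paths` is hypothetical, and the demo builds a fake directory tree (with a made-up Py4J version) so that it runs without Spark installed.

```python
import glob
import os
import tempfile


def spark_python_paths(spark_home):
    """Return the sys.path entries needed to import pyspark from spark_home."""
    python_dir = os.path.join(spark_home, "python")
    # Py4J ships as a versioned zip; match it instead of hard-coding a version.
    pattern = os.path.join(python_dir, "lib", "py4j-*-src.zip")
    return [python_dir] + sorted(glob.glob(pattern))


# Demo on a fake tree that mimics the assumed distribution layout.
fake_home = tempfile.mkdtemp()
os.makedirs(os.path.join(fake_home, "python", "pyspark"))
os.makedirs(os.path.join(fake_home, "python", "lib"))
# Placeholder file name; a real distribution has its own Py4J version here.
open(os.path.join(fake_home, "python", "lib", "py4j-0.0-src.zip"), "w").close()

paths = spark_python_paths(fake_home)
for entry in paths:
    print(entry)
```

With a real installation, one would set `os.environ["SPARK_HOME"]` and extend `sys.path` with the returned entries before running `from pyspark import SparkContext`. For a Homebrew install, the relevant directory is probably the `libexec` folder, since that is the one containing `python/`.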

**Update:**

Then I searched for the apache-spark and Python paths in order to set PyCharm's environment variables:

apache-spark path:

    user@MacBook-Pro-User-2:~$ brew info apache-spark
    apache-spark: stable 1.6.0, HEAD
    Engine for large-scale data processing
    https://spark.apache.org/
    /usr/local/Cellar/apache-spark/1.5.1 (649 files, 302.9M) *
      Poured from bottle
    From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb


python path:

    user@MacBook-Pro-User-2:~$ brew info python
    python: stable 2.7.11 (bottled), HEAD
    Interpreted, interactive, object-oriented programming language
    https://www.python.org
    /usr/local/Cellar/python/2.7.10_2 (4,965 files, 66.9M) *

Then with the above information I tried to set the environment variables as follows:

[![configuration 1][3]][3]

**Any idea of how to correctly link PyCharm with pyspark?**

Then, when I run a Python script with the above configuration, I get this exception:

    /usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/spark_examples/test_1.py
    Traceback (most recent call last):
      File "/Users/user/PycharmProjects/spark_examples/test_1.py", line 1, in <module>
        from pyspark import SparkContext
    ImportError: No module named pyspark


**UPDATE:**
Then I tried the configurations proposed by @zero323:

Configuration 1:

    /usr/local/Cellar/apache-spark/1.5.1/ 

[![conf 1][4]][4]


out:


    user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1$ ls
    CHANGES.txt           NOTICE                libexec/
    INSTALL_RECEIPT.json  README.md
    LICENSE               bin/



Configuration 2:

    /usr/local/Cellar/apache-spark/1.5.1/libexec 

[![enter image description here][5]][5]

out:

    user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1/libexec$ ls
    R/        bin/      data/     examples/ python/
    RELEASE   conf/     ec2/      lib/      sbin/


  [1]: http://i.stack.imgur.com/SCMrY.png
  [2]: http://renien.github.io/blog/accessing-pyspark-pycharm/
  [3]: http://i.stack.imgur.com/TOsDo.png
  [4]: http://i.stack.imgur.com/i9dZu.png
  [5]: http://i.stack.imgur.com/Bq2YP.png




How to link PyCharm with PySpark?

I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition.  I would also like to use the Spark SQL partitionBy API.  So I could do that like this:

    df.coalesce(1)
        .write
        .partitionBy("entity", "year", "month", "day", "status")
        .mode(SaveMode.Append)
        .parquet(s"$location")

I've tested this and it doesn't seem to perform well.  This is because there is only one partition to work on in the dataset, so all the partitioning, compression and saving of files has to be done by one CPU core.

I could rewrite this to do the partitioning manually (using filter with the distinct partition values for example) before calling coalesce.  

But is there a better way to do this using the standard Spark SQL API?
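One approach often suggested for this is to repartition the data by the same columns that are passed to `partitionBy`, so that all rows for a given partition value end up in a single task and that task writes a single file. The sketch below is a plain-Python toy model of that idea, not Spark code: `hash_partition` and `files_written` are hypothetical helpers, and each "file" is just a (task, key) pair.

```python
from collections import defaultdict


def hash_partition(rows, key_fn, num_tasks):
    """Send each row to a task chosen by hashing its key (toy repartition)."""
    tasks = defaultdict(list)
    for row in rows:
        tasks[hash(key_fn(row)) % num_tasks].append(row)
    return tasks


def files_written(tasks, key_fn):
    """Each task writes one file per distinct key it holds (toy partitionBy)."""
    return {(task_id, key_fn(row)) for task_id, rows in tasks.items() for row in rows}


def by_entity(row):
    return row["entity"]


# 12 rows cycling through three entity values.
rows = [{"entity": e, "value": i} for i, e in enumerate(["x", "y", "z"] * 4)]

# Rows spread round-robin over 4 tasks: every task ends up holding every key.
round_robin = defaultdict(list)
for i, row in enumerate(rows):
    round_robin[i % 4].append(row)

# Rows repartitioned by key: each key lives in exactly one task.
repartitioned = hash_partition(rows, by_entity, num_tasks=4)

print(len(files_written(round_robin, by_entity)))    # 12 files
print(len(files_written(repartitioned, by_entity)))  # 3 files, one per entity
```

Unlike `coalesce(1)`, this keeps several tasks busy while still producing one file per key, although a heavily skewed key would still be handled by a single task.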

DataFrame partitionBy to a single Parquet file (per partition)

I have a Spark data frame `df`. Is there a way of selecting a subset of its columns using a list of column names?

    scala> df.columns
    res0: Array[String] = Array("a", "b", "c", "d")

I know I can do something like `df.select("b", "c")`. But suppose I have a list containing a few column names, `val cols = List("b", "c")`. Is there a way to pass this to `df.select`? `df.select(cols)` throws an error. I am looking for something like `df.select(*cols)` in Python.

Unpacking a list to select multiple columns from a spark data frame

When I run the parsing code on a 1 GB dataset, it completes without any error. But when I attempt 25 GB of data at a time, I get the errors below. I'm trying to understand how I can avoid these failures. Happy to hear any suggestions or ideas.

Different errors:
							
	org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

	org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-xxxxxxxx

	org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/mnt/yarn/nm/usercache/xxxx/appcache/application_1450751731124_8446/blockmgr-8a7b17b8-f4c3-45e7-aea8-8b0a7481be55/08/shuffle_0_224_0.data, offset=12329181, length=2104094}

Cluster Details:
> Yarn: 8 Nodes  
Total cores: 64  
Memory: 500 GB  
Spark Version: 1.5  

Spark submit statement:

	spark-submit --master yarn-cluster \
							--conf spark.dynamicAllocation.enabled=true \
							--conf spark.shuffle.service.enabled=true \
							--executor-memory 4g \
							--driver-memory 16g \
							--num-executors 50 \
							--deploy-mode cluster \
							--executor-cores 1 \
							--class my.parser \
							myparser.jar \
							-input xxx \
							-output xxxx
	
One of the stack traces:

	at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:460)
	at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:456)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
	at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:456)
	at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:183)
	at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:47)
	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)




FetchFailedException or MetadataFetchFailedException when processing big data set

I am trying to leverage Spark partitioning. I was trying to do something like

    data.write.partitionBy("key").parquet("/location")

The issue is that each partition creates a huge number of Parquet files, which results in slow reads when I read from the root directory.

To avoid that I tried

    data.coalesce(numPart).write.partitionBy("key").parquet("/location")

This, however, creates `numPart` Parquet files in each partition.
My partitions differ in size, so I would ideally like a separate coalesce per partition. However, this doesn't look easy: I would need to visit every partition, coalesce it to a certain number of files, and store it at a separate location.

How should I use partitioning to avoid many files after write?


| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Aravind Yarram | View Question on Stackoverflow |
| Solution 1 - Apache Spark | zero323 | View Answer on Stackoverflow |
| Solution 2 - Apache Spark | Nikunj Kakadiya | View Answer on Stackoverflow |
