You can do it the same way [`SQLContext.createDataFrame`][1] does it:

    import org.apache.spark.sql.catalyst.ScalaReflection
    import org.apache.spark.sql.types.StructType
    val schema = ScalaReflection.schemaFor[TestCase].dataType.asInstanceOf[StructType]


  [1]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L349

I know this question is almost a year old, but I came across it and thought others who do the same might want to know about the approach I have just learned:


    import org.apache.spark.sql.Encoders
    val mySchema = Encoders.product[MyCaseClass].schema



In case someone wants to do this for a custom Java bean:

    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
    ExpressionEncoder.javaBean(Event.class).schema().json()

Instead of manually reproducing the logic for creating the implicit `Encoder` object that gets passed to `toDF`, one can use that directly (or, more precisely, implicitly in the same way as `toDF`):


```
// spark: SparkSession

import org.apache.spark.sql.Encoder
import spark.implicits._

implicitly[Encoder[MyCaseClass]].schema
```

Unfortunately, this actually suffers from the same problem as using `org.apache.spark.sql.catalyst` or `Encoders` as in the other answers: [the `Encoder` trait][4] is experimental.

**How does this work?** The `toDF` method on `Seq` comes from a `DatasetHolder`, which is created via the implicit [`localSeqToDatasetHolder`][1] that is imported via `spark.implicits._`. That function is defined like this:

```
implicit def localSeqToDatasetHolder[T](s: Seq[T])(implicit arg0: Encoder[T]): DatasetHolder[T]
```

As you can see, it takes an `implicit` `Encoder[T]` argument, which, for a `case class`, can be computed via [`newProductEncoder`][2] (also imported via `spark.implicits._`). We can reproduce this implicit logic to get an `Encoder` for our case class via the convenience method [`scala.Predef.implicitly`][3] (in scope by default, because it's from `Predef`), which simply returns its requested implicit argument:

```
def implicitly[T](implicit e: T): T
```
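To see this mechanism in isolation, here is a minimal sketch that does not depend on Spark. `SchemaFor` is a hypothetical stand-in for `Encoder`, `describe` is a hypothetical method shaped like `localSeqToDatasetHolder`, and the `"id: long"` string is made up for illustration; none of them are part of Spark or the Scala library.

```scala
// Hypothetical type class standing in for Spark's Encoder (assumption, not Spark API).
trait SchemaFor[T] {
  def schema: Seq[String]
}

final case class TestCase(id: Long)

object SchemaFor {
  // Implicit instance; the compiler finds it because it lives in SchemaFor's companion object.
  implicit val testCaseSchema: SchemaFor[TestCase] = new SchemaFor[TestCase] {
    def schema: Seq[String] = Seq("id: long")
  }
}

object ImplicitlyDemo {
  // A method with an implicit parameter, similar in shape to localSeqToDatasetHolder.
  def describe[T](values: Seq[T])(implicit ev: SchemaFor[T]): Seq[String] = ev.schema

  // implicitly asks the compiler for the same instance directly, with no Seq involved.
  val viaImplicitly: Seq[String] = implicitly[SchemaFor[TestCase]].schema
  val viaMethod: Seq[String] = describe(Seq.empty[TestCase])

  def main(args: Array[String]): Unit = {
    println(viaImplicitly)
    println(viaMethod)
  }
}
```

Both values come from the same implicit instance, which is why summoning the `Encoder` with `implicitly` yields the same schema that `toDF` would use, without building a `Seq` or a `DataFrame` first.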

[1]: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits@localSeqToDatasetHolder[T](s:Seq[T])(implicitevidence$7:org.apache.spark.sql.Encoder[T]):org.apache.spark.sql.DatasetHolder[T]
[2]: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits@newProductEncoder[T&lt;:Product](implicitevidence$8:reflect.runtime.universe.TypeTag[T]):org.apache.spark.sql.Encoder[T]
[3]: https://www.scala-lang.org/api/2.11.12/index.html#scala.Predef$@implicitly[T](implicite:T):T
[4]: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Encoder

Our React Native Redux app uses JWT tokens for authentication. Many actions require such tokens, and a lot of them are dispatched simultaneously, e.g. when the app loads.

E.g.

    componentDidMount() {
        dispatch(loadProfile());
        dispatch(loadAssets());
        ...
    }

Both `loadProfile` and `loadAssets` require JWT. We save the token in the state and `AsyncStorage`. My question is how to handle token expiration.

Originally I was going to use middleware for handling token expiration:


    // jwt-middleware.js

    export function refreshJWTToken({ dispatch, getState }) {
      return (next) => (action) => {
        if (isExpired(getState().auth.token)) {
          return dispatch(refreshToken())
            .then(() => next(action))
            .catch(e => console.log('error refreshing token', e));
        }
        return next(action);
      };
    }


The problem I ran into is that the token refresh happens for both the `loadProfile` and `loadAssets` actions, because the token is already expired at the time each of them is dispatched. Ideally I would like to "pause" actions that require authentication until the token is refreshed. Is there a way to do that with middleware?





How to use Redux to refresh JWT token?

Let's say there are two arrays...

    var array1 = ["a", "b", "c"]
    var array2 = ["b", "c", "a"]

I'd like the result of the comparison of these two arrays to be true, and the following...

    var array1 = ["a", "b", "c"]
    var array2 = ["b", "c", "a", "d"]

...to be false. How can I achieve that in Swift? I tried to convert both arrays to sets but for some reason Set() keeps removing some (usually duplicated) objects that the array contains.

Any help would be appreciated.

How do I check in Swift if two arrays contain the same elements regardless of the order in which those elements appear in?

If I wanted to create a `StructType` (i.e. a `DataFrame.schema`) out of a `case class`, is there a way to do it without creating a `DataFrame`? I can easily do:

    case class TestCase(id: Long)
    val schema = Seq[TestCase]().toDF.schema

But it seems overkill to actually create a `DataFrame` when all I want is the schema.

(If you are curious, the reason behind the question is that I am defining a `UserDefinedAggregateFunction`, and to do so you override a couple of methods that return `StructTypes` and I use case classes.)

Generate a Spark StructType / Schema from a case class


I have a PySpark dataframe

    +-------+--------------+----+----+
    |address|          date|name|food|
    +-------+--------------+----+----+
    |1111111|20151122045510| Yin|gre |
    |1111111|20151122045501| Yin|gre |
    |1111111|20151122045500| Yln|gra |
    |1111112|20151122065832| Yun|ddd |
    |1111113|20160101003221| Yan|fdf |
    |1111111|20160703045231| Yin|gre |
    |1111114|20150419134543| Yin|fdf |
    |1111115|20151123174302| Yen|ddd |
    |2111115|      20123192| Yen|gre |
    +-------+--------------+----+----+

that I want to transform to use with pyspark.ml. I can use a StringIndexer to convert the name column to a numeric category:

    indexer = StringIndexer(inputCol="name", outputCol="name_index").fit(df)
    df_ind = indexer.transform(df)
    df_ind.show()
    +-------+--------------+----+----------+----+
    |address|          date|name|name_index|food|
    +-------+--------------+----+----------+----+
    |1111111|20151122045510| Yin|       0.0|gre |
    |1111111|20151122045501| Yin|       0.0|gre |
    |1111111|20151122045500| Yln|       2.0|gra |
    |1111112|20151122065832| Yun|       4.0|ddd |
    |1111113|20160101003221| Yan|       3.0|fdf |
    |1111111|20160703045231| Yin|       0.0|gre |
    |1111114|20150419134543| Yin|       0.0|fdf |
    |1111115|20151123174302| Yen|       1.0|ddd |
    |2111115|      20123192| Yen|       1.0|gre |
    +-------+--------------+----+----------+----+


How can I transform several columns with StringIndexer (for example, `name` and `food`, each with its own `StringIndexer`) and then use [VectorAssembler](https://stackoverflow.com/questions/32606294/create-feature-vector-programmatically-in-spark-ml-pyspark) to generate a feature vector? Or do I have to create a `StringIndexer` for each column?

**EDIT**: This is not a duplicate because I need to do this programmatically for several data frames with different column names. I can't use `VectorIndexer` or `VectorAssembler` because the columns are not numerical.

**EDIT 2**: A tentative solution is

    indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df).transform(df) for column in df.columns ]

where I now create a list of three dataframes, each identical to the original plus the transformed column. Now I need to join them to form the final dataframe, but that's very inefficient.

Apply StringIndexer to several columns in a PySpark Dataframe

I'd like to perform some basic stemming on a Spark Dataframe column by replacing substrings. What's the quickest way to do this?

In my current use case, I have a list of addresses that I want to normalize. For example this dataframe:

    id     address
    1       2 foo lane
    2       10 bar lane
    3       24 pants ln

Would become

    id     address
    1       2 foo ln
    2       10 bar ln
    3       24 pants ln


Pyspark replace strings in Spark dataframe column

True... it has been discussed quite a lot.

However, there is a lot of ambiguity and some of the answers provided ... including duplicating JAR references in the jars/executor/driver configuration or options.

### The ambiguous and/or omitted details

The following ambiguous, unclear, and/or omitted details should be clarified for each option:

- How ClassPath is affected
  - Driver
  - Executor (for tasks running)
  - Both
  - Not at all
- Separation character: comma, colon, semicolon
- If provided files are automatically distributed
  - for the tasks (to each executor)
  - for the remote Driver (if run in cluster mode)
- Type of URI accepted: local file, [HDFS][1], HTTP, etc.
- If copied *into* a common location, where that location is (HDFS, local?)

### The options which it affects:

1. `--jars`
2. [`SparkContext.addJar(...)`][2] method
3. [`SparkContext.addFile(...)`][3] method
4. `--conf spark.driver.extraClassPath=...` or `--driver-class-path ...`
5. `--conf spark.driver.extraLibraryPath=...`, or `--driver-library-path ...`
6. `--conf spark.executor.extraClassPath=...`
7. `--conf spark.executor.extraLibraryPath=...`
8. not to forget, the last parameter of the spark-submit is also a .jar file.

I am aware of where to find the [main Apache Spark documentation][4], specifically the parts about [how to submit][5] and the [options][6] available, as well as the [JavaDoc][7]. However, that still left quite a few gaps for me, although it answered some of my questions.

I hope that it is not all that complex, and that someone can give me a clear and concise answer.

If I were to guess from documentation, it seems that `--jars`, and the `SparkContext` `addJar` and `addFile` methods are the ones that will automatically distribute files, while the other options merely modify the ClassPath.

Would it be safe to assume that for simplicity, I can add additional application JAR files using the three main options at the same time?

```lang-none
spark-submit --jars additional1.jar,additional2.jar \
  --driver-library-path additional1.jar:additional2.jar \
  --conf spark.executor.extraLibraryPath=additional1.jar:additional2.jar \
  --class MyClass main-application.jar
```

I found a nice article on [an answer to another posting][8]. However, nothing new was learned. The poster does make a good remark on the difference between a *local driver* (yarn-client) and *remote driver* (yarn-cluster). It is definitely important to keep in mind.

  [1]: https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system
  [2]: http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addJar(java.lang.String)
  [3]: http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addFile(java.lang.String)
  [4]: http://spark.apache.org/docs/latest/quick-start.html
  [5]: http://spark.apache.org/docs/latest/submitting-applications.html
  [6]: http://spark.apache.org/docs/latest/configuration.html
  [7]: http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html
  [8]: https://stackoverflow.com/a/34516023/744133






Add JAR files to a Spark job - spark-submit

I'm trying to filter a PySpark dataframe that has `None` as a row value:

    df.select('dt_mvmt').distinct().collect()
    
    [Row(dt_mvmt=u'2016-03-27'),
     Row(dt_mvmt=u'2016-03-28'),
     Row(dt_mvmt=u'2016-03-29'),
     Row(dt_mvmt=None),
     Row(dt_mvmt=u'2016-03-30'),
     Row(dt_mvmt=u'2016-03-31')]

and I can filter correctly with a string value:

    df[df.dt_mvmt == '2016-03-31']
    # some results here

but this fails:

    df[df.dt_mvmt == None].count()
    0
    df[df.dt_mvmt != None].count()
    0

But there are definitely values in each category. What's going on?

Filter Pyspark dataframe column with None value

I'm trying to concatenate two PySpark dataframes with some columns that are only on one of them:

    from pyspark.sql.functions import randn, rand

    df_1 = sqlContext.range(0, 10)

    +--+
    |id|
    +--+
    | 0|
    | 1|
    | 2|
    | 3|
    | 4|
    | 5|
    | 6|
    | 7|
    | 8|
    | 9|
    +--+

    df_2 = sqlContext.range(11, 20)

    +--+
    |id|
    +--+
    | 10|
    | 11|
    | 12|
    | 13|
    | 14|
    | 15|
    | 16|
    | 17|
    | 18|
    | 19|
    +--+

    df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
    df_2 = df_2.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal_2"))

and now I want to generate a third dataframe. I would like something like pandas `concat`:

    df_1.show()
    +---+--------------------+--------------------+
    | id|             uniform|              normal|
    +---+--------------------+--------------------+
    |  0|  0.8122802274304282|  1.2423430583597714|
    |  1|  0.8642043127063618|  0.3900018344856156|
    |  2|  0.8292577771850476|  1.8077401259195247|
    |  3|   0.198558705368724| -0.4270585782850261|
    |  4|0.012661361966674889|   0.702634599720141|
    |  5|  0.8535692890157796|-0.42355804115129153|
    |  6|  0.3723296190171911|  1.3789648582622995|
    |  7|  0.9529794127670571| 0.16238718777444605|
    |  8|  0.9746632635918108| 0.02448061333761742|
    |  9|   0.513622008243935|  0.7626741803250845|
    +---+--------------------+--------------------+

    df_2.show()
    +---+--------------------+--------------------+
    | id|             uniform|            normal_2|
    +---+--------------------+--------------------+
    | 11|  0.3221262660507942|  1.0269298899109824|
    | 12|  0.4030672316912547|   1.285648175568798|
    | 13|  0.9690555459609131|-0.22986601831364423|
    | 14|0.011913836266515876|  -0.678915153834693|
    | 15|  0.9359607054250594|-0.16557488664743034|
    | 16| 0.45680471157575453| -0.3885563551710555|
    | 17|  0.6411908952297819|  0.9161177183227823|
    | 18|  0.5669232696934479|  0.7270125277020573|
    | 19|   0.513622008243935|  0.7626741803250845|
    +---+--------------------+--------------------+

    #do some concatenation here, how?

    df_concat.show()

    | id|             uniform|              normal| normal_2   |
    +---+--------------------+--------------------+------------+
    |  0|  0.8122802274304282|  1.2423430583597714| None       |
    |  1|  0.8642043127063618|  0.3900018344856156| None       |
    |  2|  0.8292577771850476|  1.8077401259195247| None       |
    |  3|   0.198558705368724| -0.4270585782850261| None       |
    |  4|0.012661361966674889|   0.702634599720141| None       |
    |  5|  0.8535692890157796|-0.42355804115129153| None       |
    |  6|  0.3723296190171911|  1.3789648582622995| None       |
    |  7|  0.9529794127670571| 0.16238718777444605| None       |
    |  8|  0.9746632635918108| 0.02448061333761742| None       |
    |  9|   0.513622008243935|  0.7626741803250845| None       |
    | 11|  0.3221262660507942|  None              | 0.123      |
    | 12|  0.4030672316912547|  None              |0.12323     |
    | 13|  0.9690555459609131|  None              |0.123       |
    | 14|0.011913836266515876|  None              |0.18923     |
    | 15|  0.9359607054250594|  None              |0.99123     |
    | 16| 0.45680471157575453|  None              |0.123       |
    | 17|  0.6411908952297819|  None              |1.123       |
    | 18|  0.5669232696934479|  None              |0.10023     |
    | 19|   0.513622008243935|  None              |0.916332123 |
    +---+--------------------+--------------------+------------+
    

Is that possible?

Concatenate two PySpark dataframes

All,

Is there an elegant and accepted way to flatten a Spark SQL table (Parquet) with columns that are of nested `StructType`?

For example

If my schema is:

    foo
     |_bar
     |_baz
    x
    y
    z

How do I select it into a flattened tabular form without resorting to manually running

    df.select("foo.bar","foo.baz","x","y","z")

In other words, how do I obtain the result of the above code programmatically given just a `StructType` and a `DataFrame`?




Automatically and Elegantly flatten DataFrame in Spark SQL

For a set of dataframes

    val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x")
    val df2 = sc.parallelize(1 to 4).map(i => (i,i*100)).toDF("id","y")
    val df3 = sc.parallelize(1 to 4).map(i => (i,i*1000)).toDF("id","z")

to union all of them I do

    df1.unionAll(df2).unionAll(df3)

Is there a more elegant and scalable way of doing this for any number of dataframes, for example from

    Seq(df1, df2, df3) 

Spark unionAll multiple dataframes

Suppose I have a defined schema for loading 10 CSV files in a folder. Is there a way to load the tables automatically using Spark SQL? I know this can be done with an individual dataframe for each file (as shown below), but can it be automated with a single command that points at a folder rather than at a single file?

    df = sqlContext.read
           .format("com.databricks.spark.csv")
           .option("header", "true")
           .load("../Downloads/2008.csv")



| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | David Griffin | View Question on Stackoverflow |
| Solution 1 - Apache Spark | Tzach Zohar | View Answer on Stackoverflow |
| Solution 2 - Apache Spark | Kurt | View Answer on Stackoverflow |
| Solution 3 - Apache Spark | Art | View Answer on Stackoverflow |
| Solution 4 - Apache Spark | huon | View Answer on Stackoverflow |
