How to get name of dataframe column in PySpark?

Tags: Apache Spark, Pyspark, Apache Spark-Sql, Columnname

Apache Spark Problem Overview


In pandas, this can be done by column.name.

But how to do the same when it's a column of a Spark dataframe?

E.g. the calling program has a Spark dataframe: spark_df

>>> spark_df.columns
['admit', 'gre', 'gpa', 'rank']

This program calls my function: my_function(spark_df['rank'])
In my_function, I need the name of the column, i.e. 'rank'.

If it were a pandas dataframe, we could use this:

>>> pandas_df['rank'].name
'rank'

Apache Spark Solutions


Solution 1 - Apache Spark

You can get the names from the schema by doing

spark_df.schema.names

Printing the schema can be useful to visualize it as well

spark_df.printSchema()
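
A minimal sketch of this answer; the Spark session setup and the single sample row below are assumptions added for illustration, matching the column names from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical one-row dataframe with the question's columns
spark_df = spark.createDataFrame(
    [(1, 380, 3.61, 3)], ["admit", "gre", "gpa", "rank"]
)

print(spark_df.schema.names)  # ['admit', 'gre', 'gpa', 'rank']
print(spark_df.columns)       # same list of names
spark_df.printSchema()        # tree view of column names and types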

Solution 2 - Apache Spark

The only way is to go down to the underlying JVM level.

df.col._jc.toString().encode('utf8')

This is also how it is converted to a str in the pyspark code itself.

From pyspark/sql/column.py:

def __repr__(self):
    return 'Column<%s>' % self._jc.toString().encode('utf8')
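
For the question's use case, a sketch along these lines should work; spark_df is the dataframe from the question, and the .encode('utf8') step is dropped since on Python 3 it would return bytes rather than a str:

def my_function(col):
    # Ask the underlying JVM Column object for its string form;
    # for a plain (unaliased) column this is just the column name.
    name = col._jc.toString()
    print(name)
    return name

my_function(spark_df["rank"])  # prints 'rank'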

Solution 3 - Apache Spark

If you want the column names of your dataframe, you can use the pyspark.sql.DataFrame API. I'm not sure whether it supports explicitly indexing a DataFrame by column name. I received this traceback:

>>> df.columns['High']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str

However, accessing the columns attribute on your dataframe, as you have done, will return a list of column names:

df.columns will return ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

If you want the column datatypes, you can use the dtypes attribute:

df.dtypes will return [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]

If you want a particular column, you'll need to access it by index:

df.columns[2] will return 'High'
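
A small sketch of these lookups, assuming df is the stock-price dataframe from this answer; since df.columns is a plain Python list, a name has to be turned into an index first:

# Find a column's position by name, then read it back by index
idx = df.columns.index("High")   # -> 2
print(df.columns[idx])           # -> 'High'

# df.dtypes is a list of (name, type) pairs; a dict gives lookup by name
print(dict(df.dtypes)["High"])   # -> 'double'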

Solution 4 - Apache Spark

Python

As @numeral correctly said, column._jc.toString() works fine in the case of unaliased columns.

In the case of aliased columns (i.e. column.alias("whatever")), the alias can be extracted even without regular expressions: str(column).split(" AS ")[1].split("`")[1].
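
A sketch of that extraction; the column below is hypothetical, and the exact repr of a Column differs between Spark versions, so the split indices may need adjusting:

from pyspark.sql import functions as F

column = F.col("rank").alias("whatever")
text = str(column)  # e.g. "Column<b'rank AS `whatever`'>" on older PySpark versions

# Take the part after " AS " and pull the name out of the backticks
alias = text.split(" AS ")[1].split("`")[1]
print(alias)  # 'whatever'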

I don't know Scala syntax, but I'm sure the same can be done there.

Solution 5 - Apache Spark

I found that the answer is very, very simple...

// This is Java, but it should be the same in PySpark
Column col = ds.col("colName"); //the column object
String theNameOftheCol = col.toString();

The variable theNameOftheCol now holds "colName".
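
A rough PySpark equivalent of the Java snippet, assuming ds is a dataframe with a column named "colName"; note that in PySpark str(col) wraps the name (e.g. Column<'colName'>), so the JVM toString() from Solution 2 is closer to the bare name:

col = ds["colName"]          # the Column object
print(str(col))              # e.g. "Column<'colName'>" -- wrapped repr
print(col._jc.toString())    # 'colName' -- the bare expression string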

Solution 6 - Apache Spark

I hope these options can serve as more universal ones. Cases covered:

  • column not having an alias
  • column having an alias
  • column having several consecutive aliases
  • column names surrounded with backticks

No regex:

str(col).replace("`", "").split("'")[-2].split(" AS ")[-1]

Using regex:

import re
re.search(r"'.*?`?(\w+)`?'", str(col)).group(1)
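
A quick sketch exercising both snippets on some of the listed cases; the test columns are hypothetical, and the exact Column repr varies across Spark versions, so treat the printed values as illustrative:

import re
from pyspark.sql import functions as F

# Hypothetical test columns covering some of the cases above
test_columns = [
    F.col("rank"),                        # no alias
    F.col("rank").alias("position"),      # one alias
    F.col("rank").alias("a").alias("b"),  # consecutive aliases
]

for col in test_columns:
    no_regex = str(col).replace("`", "").split("'")[-2].split(" AS ")[-1]
    with_regex = re.search(r"'.*?`?(\w+)`?'", str(col)).group(1)
    print(no_regex, with_regex)
# Illustrative output: 'rank', 'position', 'b' from both variants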

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type              | Original Author | Original Content on Stackoverflow
Question                  | Kaushik Acharya | View Question on Stackoverflow
Solution 1 - Apache Spark | David           | View Answer on Stackoverflow
Solution 2 - Apache Spark | numeral         | View Answer on Stackoverflow
Solution 3 - Apache Spark | Pat             | View Answer on Stackoverflow
Solution 4 - Apache Spark | ciurlaro        | View Answer on Stackoverflow
Solution 5 - Apache Spark | Yucci Mel       | View Answer on Stackoverflow
Solution 6 - Apache Spark | ZygD            | View Answer on Stackoverflow