Get datatype of column using PySpark

Apache Spark | Pyspark | Apache Spark-Sql

Apache Spark Problem Overview


We are reading data from a MongoDB collection. A collection column can hold values of two different types, e.g. (bson.Int64, int) or (int, float).

I am trying to get the datatype of each column using PySpark.

My problem is that some columns contain mixed datatypes.

Assume quantity and weight are the columns:

quantity           weight
---------          --------
12300              656
123566000000       789.6767
1238               56.22
345                23
345566677777789    21

Actually, we never defined a datatype for any column of the Mongo collection.

When I query the count from the PySpark DataFrame

dataframe.count()

I get an exception like this:

"Cannot cast STRING into a DoubleType (value: BsonString{value='200.0'})"

Apache Spark Solutions


Solution 1 - Apache Spark

Your question is broad, thus my answer will also be broad.

To get the data types of your DataFrame columns, you can use dtypes, e.g.:

>>> df.dtypes
[('age', 'int'), ('name', 'string')]

This means your column age is of type int and name is of type string.
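As a minimal, hypothetical sketch (assuming an active SparkSession named spark; the column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(25, "Alice")], "age INT, name STRING")
print(df.dtypes)  # [('age', 'int'), ('name', 'string')]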

Solution 2 - Apache Spark

For anyone else who came here looking for an answer to the exact question in the post title (i.e. the datatype of a single column, not of all columns): I have been unable to find a simple built-in way to do so.

Luckily it's trivial to get the type using dtypes:

def get_dtype(df, colname):
    return [dtype for name, dtype in df.dtypes if name == colname][0]

get_dtype(my_df, 'column_name')

(Note that this will only return the type of the first matching column if there are multiple columns with the same name.)
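Hypothetical usage, assuming the same kind of SparkSession setup as in Solution 1 (names are illustrative):

df = spark.createDataFrame([(25, "Alice")], "age INT, name STRING")
get_dtype(df, "age")   # 'int'
get_dtype(df, "name")  # 'string'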

Solution 3 - Apache Spark

import pandas as pd
pd.set_option('display.max_colwidth', None)  # prevent truncating of columns in Jupyter

def count_column_types(spark_df):
    """Count the number of columns per type."""
    return (pd.DataFrame(spark_df.dtypes, columns=["name", "type"])
            .groupby("type", as_index=False)
            .agg(count=("name", "count"),
                 names=("name", lambda x: " | ".join(set(x)))))

Example output in a Jupyter notebook for a Spark DataFrame with 4 columns:

count_column_types(my_spark_df)

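As a hedged illustration in place of the original screenshot: for a hypothetical DataFrame with columns id (bigint), price (double), name (string) and city (string), the function would return a pandas DataFrame along these lines (the order inside names is not deterministic because of set()):

     type  count        names
0  bigint      1           id
1  double      1        price
2  string      2  name | city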

Solution 4 - Apache Spark

I don't know how you are reading from MongoDB, but if you are using the MongoDB connector, the datatypes will be automatically converted to Spark types. To get the Spark SQL types, just use the schema attribute like this:

df.schema
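For context, a hedged sketch of such a read (assuming an active SparkSession named spark; the format name and option keys differ between connector versions, "mongo" with a uri option being the 2.x/3.x convention):

df = (spark.read
      .format("mongo")
      .option("uri", "mongodb://host:27017/db.collection")
      .load())

df.schema        # StructType listing the Spark SQL type of every field
df.printSchema() # the same information as a readable tree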

Solution 5 - Apache Spark

It looks like your actual data and your metadata have different types: the actual data is of type string while the metadata says double.

As a solution, I would recommend you recreate the table with the correct datatypes.
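If recreating the source data is not feasible, one hedged workaround (assuming the connector honours a user-supplied schema and an active SparkSession named spark) is to read the mixed columns as strings and cast them inside Spark; values that cannot be cast become null instead of raising:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Read the mixed columns as plain strings so the scan itself cannot fail,
# then cast to the intended types afterwards.
schema = StructType([
    StructField("quantity", StringType()),
    StructField("weight", StringType()),
])
df = spark.read.format("mongo").schema(schema).load()  # format name varies by connector version
df = (df.withColumn("quantity", F.col("quantity").cast("long"))
        .withColumn("weight", F.col("weight").cast("double")))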

Solution 6 - Apache Spark

df.dtypes to get a list of (colname, dtype) pairs, e.g.

[('age', 'int'), ('name', 'string')]

df.schema to get the schema as a StructType of StructFields, e.g.

StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))

df.printSchema() to get a tree view of the schema, e.g.

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)
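A small, hedged addition: StructType also supports lookup by field name, so the Spark SQL type of a single column can be read straight off the schema:

df.schema["age"].dataType                 # IntegerType() (repr varies by Spark version)
df.schema["age"].dataType.simpleString()  # 'int'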

Solution 7 - Apache Spark

I am assuming you are looking to get the data type of the data you read.

input_data = [Read from Mongo DB operation]

You can use

type(input_data) 

to inspect the type of the object you read.
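For instance (a hypothetical read, assuming an active SparkSession named spark), this reports the Python class of the object rather than the column types:

input_data = spark.read.format("mongo").load()  # format name varies by connector version
type(input_data)  # <class 'pyspark.sql.dataframe.DataFrame'>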

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Sreenuvasulu | View Question on Stackoverflow
Solution 1 - Apache Spark | eliasah | View Answer on Stackoverflow
Solution 2 - Apache Spark | ropeladder | View Answer on Stackoverflow
Solution 3 - Apache Spark | gench | View Answer on Stackoverflow
Solution 4 - Apache Spark | Luis A.G. | View Answer on Stackoverflow
Solution 5 - Apache Spark | Henrique Florencio | View Answer on Stackoverflow
Solution 6 - Apache Spark | qwr | View Answer on Stackoverflow
Solution 7 - Apache Spark | ganeiy | View Answer on Stackoverflow