
PySpark: DataFrame Data Type Conversion

The data types available in PySpark include: ArrayType, BinaryType, BooleanType, CalendarIntervalType, DateType, HiveStringType, MapType, NullType, NumericType, ObjectType, StringType, StructType, TimestampType.
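As a quick illustration of how these types are used when defining a schema explicitly, here is a minimal sketch (the column names and sample rows are made up for illustration) that builds a StructType from StructField entries and then inspects the resulting column types:

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, ArrayType
)

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Hypothetical schema: a string name, an integer age, and an array-of-strings tags column
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("tags", ArrayType(StringType()), True),
])

rows = [("Alice", 30, ["a", "b"]), ("Bob", 25, ["c"])]
df = spark.createDataFrame(rows, schema=schema)

df.printSchema()   # name: string, age: integer, tags: array<string>
print(df.dtypes)   # [('name', 'string'), ('age', 'int'), ('tags', 'array<string>')]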


There are three ways to convert the data type of a DataFrame column: cast() (used with withColumn() or select()), selectExpr(), and plain SQL via spark.sql():

# Convert a StringType column to an integer type
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, BooleanType, DateType, ArrayType

# Way 1: cast() -- the three withColumn() calls below are equivalent
df.withColumn("age", df.age.cast(IntegerType()))
df.withColumn("age", df.age.cast("int"))
df.withColumn("age", df.age.cast("integer"))
# cast() also works with select()
df.select(col("age").cast("int").alias("age"))
# Way 2: selectExpr() with a SQL cast expression
df.selectExpr("cast(age as int) age")
# Way 3: plain SQL via spark.sql() (requires a temp view such as CastExample)
spark.sql("SELECT INT(age), BOOLEAN(isGraduated), DATE(jobStartDate) from CastExample")

Full code example:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = [("James",34,"2006-01-01","true","M",3000.60),
    ("Michael",33,"1980-01-10","true","F",3300.80),
    ("Robert",37,"06-01-1992","false","M",5000.50)
columns = ["firstname","age","jobStartDate","isGraduated","gender","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
from pyspark.sql.functions import col
from pyspark.sql.types import StringType, BooleanType, DateType
# Way 1: cast() via withColumn()
df2 = df.withColumn("age", col("age").cast(StringType())) \
    .withColumn("isGraduated", col("isGraduated").cast(BooleanType())) \
    .withColumn("jobStartDate", col("jobStartDate").cast(DateType()))
df2.printSchema()
# Way 2: selectExpr() with SQL cast expressions
df3 = df2.selectExpr("cast(age as int) age",
    "cast(isGraduated as string) isGraduated",
    "cast(jobStartDate as string) jobStartDate")
df3.printSchema()
df3.show(truncate=False)
# Way 3: plain SQL against a temp view
df3.createOrReplaceTempView("CastExample")
df4 = spark.sql("SELECT STRING(age), BOOLEAN(isGraduated), DATE(jobStartDate) from CastExample")
df4.printSchema()
df4.show(truncate=False)
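A caveat for this example (an observation, not part of the original text): Robert's jobStartDate of "06-01-1992" does not match the yyyy-MM-dd format a plain cast to DateType expects, so after the conversions above it comes back as null rather than raising an error. A minimal check for such rows might look like:

from pyspark.sql.functions import col

# Rows whose jobStartDate could not be parsed by the cast show up as null here
df4.filter(col("jobStartDate").isNull()).show(truncate=False)

Rows flagged this way would need an explicit parse with a matching pattern (e.g. to_date with "MM-dd-yyyy") before casting.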

Converting a list-formatted string into a list

Still unresolved:

# How to convert such a string into an actual list is still unresolved
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, BooleanType, DateType, ArrayType, StructType, StructField, StringType
spark = SparkSession.builder.appName('SparkByExamples.com').\
    getOrCreate()
# Sample data: each row pairs an id string with a list of tag strings
data = [("1", ['test', 'test2', 'test3']), ("2", ['test4', 'test', 'test6']),
    ("3", ['test6', 'test9', 'test7'])]