添加链接
link之家
链接快照平台
  • 输入网页链接,自动生成快照
  • 标签化管理网页链接

spark-sql处理json字符串的函数

整理了spark-sql处理json字符串的几个函数

  1. from_json
  2. schema_of_json
  3. explode


from_json

from_json(column, schema_string):用schema_string的格式,来解析column。 用schema_string的格式可以用schema_of_json获取。


例子:

select from_json('[{"text":"Tea"},{"text":"Apple"}]', 'ARRAY<STRUCT<text: STRING>>')[0]['text'] as q;
Tea


schema_of_json

schema_of_json(s):解析字符串s,返回描述json格式的字符串。


字典:STRUCT,字典里是一些Key-Value键值对。

列表:ARRAY,列表里是一些相同的元素。


例子:

select schema_of_json('[{"text":"a"},{"b":"a"}, {"c":1}]')
ARRAY<STRUCT<`b`: STRING, `c`: BIGINT, `text`: STRING>>


select schema_of_json('{"text":"a","b":1.2, "c":1}')
STRUCT<`b`: DOUBLE, `c`: BIGINT, `text`: STRING>


select schema_of_json('{"text":["a", 2, "b", "a"],"b":1.2, "c":1}')
STRUCT<`b`: DOUBLE, `c`: BIGINT, `text`: ARRAY<STRING>>


对于复杂的字符串,如果python json.loads报错,schema_of_json可能报错。



explode

explode(from_json(...)):将from_json的内容展成多行(应用于ARRAY,每个元素1行)


例子:

select explode(from_json('[{"text":"Tea"},{"text":"Apple"}]', 'ARRAY<STRUCT<text: STRING>>')) as q;
{"text":"Tea"}
{"text":"Apple"}



SELECT explode(r.json) FROM (SELECT from_json('[{"Attr_INT":1, "ATTR_DOUBLE":10.201, "ATTR_DATE": "2021-01-01"},{"Attr_INT":1, "ATTR_DOUBLE":10.201, "ATTR_DATE": "2021-02-01"}]','array<struct<ATTR_DATE:string,ATTR_DOUBLE:double,Attr_INT:bigint>>') AS json) r;
{"ATTR_DATE":"2021-01-01","ATTR_DOUBLE":10.201,"Attr_INT":1}
{"ATTR_DATE":"2021-02-01","ATTR_DOUBLE":10.201,"Attr_INT":1}


SELECT r1.col.Attr_INT, r1.col.ATTR_DATE, r1.col.ATTR_DOUBLE FROM (SELECT explode(r.json) AS col FROM (SELECT from_json('[{"Attr_INT":1, "ATTR_DOUBLE":10.201, "ATTR_DATE": "2021-01-01"},{"Attr_INT":1, "ATTR_DOUBLE":10.201, "ATTR_DATE": "2021-02-01"}]','array<struct<ATTR_DATE:string,ATTR_DOUBLE:double,Attr_INT:bigint>>') AS json) r) AS r1;
Attr_INT	ATTR_DATE	ATTR_DOUBLE