python - Pyspark - converting json string to DataFrame

I have a test2.json file that contains simple JSON:

{  "Name": "something",  "Url": "https://stackoverflow.com",  "Author": "jangcy",  "BlogEntries": 100,  "Caller": "jangcy"}
I have uploaded my file to blob storage and I create a DataFrame from it:
df = spark.read.json("/example/data/test2.json")
then I can see it without any problems:
df.show()
+------+-----------+------+---------+--------------------+
|Author|BlogEntries|Caller|     Name|                 Url|
+------+-----------+------+---------+--------------------+
|jangcy|        100|jangcy|something|https://stackover...|
+------+-----------+------+---------+--------------------+
Second scenario:
I have exactly the same JSON string declared within my notebook:
newJson = '{  "Name": "something",  "Url": "https://stackoverflow.com",  "Author": "jangcy",  "BlogEntries": 100,  "Caller": "jangcy"}'
I can print it etc. But now if I'd like to create a DataFrame from it:
df = spark.read.json(newJson)
I get the 'Relative path in absolute URI' error:
'java.net.URISyntaxException: Relative path in absolute URI: {  "Name":%20%22something%22,%20%20%22Url%22:%20%22https:/stackoverflow.com%22,%20%20%22Author%22:%20%22jangcy%22,%20%20%22BlogEntries%22:%20100,%20%20%22Caller%22:%20%22jangcy%22%7D'
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 249, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI: {  "Name":%20%22something%22,%20%20%22Url%22:%20%22https:/stackoverflow.com%22,%20%20%22Author%22:%20%22jangcy%22,%20%20%22BlogEntries%22:%20100,%20%20%22Caller%22:%20%22jangcy%22%7D'
Should I apply additional transformations to the newJson string? If so, what should they be? Please forgive me if this is too trivial, as I am very new to Python and Spark.
I am using Jupyter notebook with PySpark3 Kernel.
Thanks in advance.
                It is apparently part of the "ingestion pipeline" help section. Therefore, it renames the field at indexing time, not at query time.
– Thierry Barnier
                Jan 15, 2020 at 13:25
spark.read.json treats a plain string as a file path, so pass it an RDD of JSON strings instead. You can do the following:
newJson = '{"Name":"something","Url":"https://stackoverflow.com","Author":"jangcy","BlogEntries":100,"Caller":"jangcy"}'
df = spark.read.json(sc.parallelize([newJson]))
df.show(truncate=False)
which should give:
+------+-----------+------+---------+-------------------------+
|Author|BlogEntries|Caller|Name     |Url                      |
+------+-----------+------+---------+-------------------------+
|jangcy|100        |jangcy|something|https://stackoverflow.com|
+------+-----------+------+---------+-------------------------+
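If you have several JSON strings in memory, the same idea can be wrapped in a small helper. This is only a sketch: `compact_json_records` and `json_strings_to_df` are hypothetical names, and `spark` is assumed to be the existing SparkSession from the notebook. The helper checks each string with Python's `json` module first, so malformed input fails early in Python instead of showing up later as a `_corrupt_record` column.

```python
import json


def compact_json_records(json_strings):
    """Validate each JSON string and re-serialize it on a single line."""
    compacted = []
    for raw in json_strings:
        # json.loads raises json.JSONDecodeError (a ValueError) on malformed input.
        parsed = json.loads(raw)
        compacted.append(json.dumps(parsed, separators=(",", ":")))
    return compacted


def json_strings_to_df(spark, json_strings):
    """Build a DataFrame from in-memory JSON strings (hypothetical helper).

    `spark` is assumed to be an existing SparkSession.
    """
    records = compact_json_records(json_strings)
    # Passing an RDD of strings makes spark.read.json parse the contents
    # instead of treating the argument as a file path.
    return spark.read.json(spark.sparkContext.parallelize(records))
```

With the string from the question this would be called as `json_strings_to_df(spark, [newJson]).show(truncate=False)`.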
                I have the same JSON in a DataFrame as a column. I am unable to parse it. How can I achieve this? Please help.
– Naveen Srikanth
                Nov 26, 2018 at 6:10
                There is a built-in function for that: spark.apache.org/docs/latest/api/java/org/apache/spark/sql/… @NaveenSrikanth
– Ramesh Maharjan
                Nov 26, 2018 at 6:35
                Thanks, Ramesh. I am trying to work with the same logic. Will there be any performance impact when reading 400-500 million JSON documents?
– Naveen Srikanth
                Nov 26, 2018 at 6:48
                I used spark.read.json(sc.wholeTextFiles("s3a://jsonparser_coding/").values()), Ramesh. I was able to read it successfully :) :)
– Naveen Srikanth
                Nov 26, 2018 at 8:35
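For the JSON-in-a-column case raised in the comments, one built-in that can parse a string column is `pyspark.sql.functions.from_json`. Whether that is the function the truncated link above points to is an assumption. The sketch below also assumes a hypothetical column name `json_str` and a schema matching the sample record from the question. The schema is given as a DDL string, which needs Spark 2.3 or later; on older versions a `StructType` would be passed instead.

```python
# Schema for the sample record, written as a DDL string.
# The field types are an assumption based on the JSON in the question.
SAMPLE_SCHEMA = "Name STRING, Url STRING, Author STRING, BlogEntries BIGINT, Caller STRING"


def parse_json_column(df, json_col="json_str", schema=SAMPLE_SCHEMA):
    """Expand a string column holding JSON into top-level columns.

    `json_col` is a hypothetical column name; adjust it to the real one.
    """
    # Imported lazily so this snippet can be loaded without PySpark installed.
    from pyspark.sql import functions as F

    # from_json turns each JSON string into a struct with the given schema.
    parsed = df.withColumn("parsed", F.from_json(F.col(json_col), schema))
    # "parsed.*" flattens the struct fields into separate columns.
    return parsed.select("parsed.*")
```

Rows whose string cannot be parsed with the given schema come back as nulls rather than raising an error, so it is worth checking for them on real data.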