How to export a Hive table into a CSV file?

I used this Hive query to export a table into a CSV file.
INSERT OVERWRITE DIRECTORY '/user/data/output/test' select column1, column2 from table1;
The generated file '000000_0' does not have a comma separator.
Is this the right way to generate a CSV file? If not, please let me know how I can generate the CSV file.
Comment (2023-07-21): Another question: when I save a big Hive table to several blocks on HDFS, I found that the schema sometimes differs, i.e. the column types may have changed. How can this be prevented?
You can also set the property hive.cli.print.header=true before the SELECT to ensure that a header is created along with the data and copied to the file.
For example:
hive -e 'set hive.cli.print.header=true; select * from your_Table' | sed 's/[\t]/,/g'  > /home/yourfile.csv
If you don't want to write to the local file system, pipe the output of the sed command back into HDFS using the hadoop fs -put command.
It may also be convenient to SFTP to your files using something like Cyberduck, or you can use scp to connect via terminal / command prompt.
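The sed substitution above turns tabs into commas but does not quote anything, so a value that itself contains a comma or a double quote produces a malformed row. The following is a minimal sketch of a quoting variant, assuming tab-separated input on stdin as produced by hive -e; the function name tsv_to_csv is illustrative and not part of Hive. It cannot recover values that contain a literal tab, because those are already indistinguishable from field separators in the CLI output.

```shell
#!/bin/bash
# tsv_to_csv (hypothetical helper): read tab-separated lines on stdin,
# write CSV on stdout with every field wrapped in double quotes.
# Embedded double quotes are doubled, the convention most CSV readers expect.
tsv_to_csv() {
  awk -F'\t' -v OFS=',' '{
    for (i = 1; i <= NF; i++) {
      gsub(/"/, "\"\"", $i)   # double each embedded quote character
      $i = "\"" $i "\""       # wrap the field in quotes
    }
    print
  }'
}
```

It would replace the sed stage of the pipeline, for example: hive -e 'set hive.cli.print.header=true; select * from your_Table' | tsv_to_csv > /home/yourfile.csv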
4 Comments
Aman Mathur (2015-06-25): With this command, Hive data types such as 'double' are not carried over to the CSV, so when I read the CSV everything is read as a string.
Arthur Lekane (2019-02-26): In version 3 of Hive, where the Hive CLI is replaced by Beeline, the output of queries is slightly different because it contains formatting.
Albin Chandy (2021-10-11): I tried this for exporting a Hive query to local and HDFS files, but the resulting file can't be read from a Spark session: the header is not identified properly.
Jianwu Chen (2022-01-24): This approach works most of the time, but it breaks if there is a '\t' in a query result value. How can we solve this issue?

If you're using Hive 0.11 or later you can use the INSERT statement with the LOCAL keyword.
Example:
insert overwrite local directory '/home/carter/staging' row format delimited fields terminated by ',' select * from hugetable;
Note that this may create multiple files and you may want to concatenate them on the client side after it's done exporting.
Using this approach means you don't need to worry about the format of the source tables, can export based on an arbitrary SQL query, and can select your own delimiters and output formats.
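The client-side concatenation mentioned above could look like the following sketch. It assumes the export directory holds only uncompressed text part files (names such as 000000_0) and that the output file lives outside that directory; the function name merge_parts and the header string are illustrative, not something Hive provides.

```shell
#!/bin/bash
# merge_parts DIR OUT [HEADER]  (hypothetical helper)
# Concatenate every regular file in DIR into OUT, optionally preceded by
# one header line. Hidden files (names starting with a dot) are skipped
# because the * glob does not match them by default.
merge_parts() {
  local dir=$1 out=$2 header=${3:-}
  local part
  : > "$out"                         # create or truncate the output file
  if [ -n "$header" ]; then
    printf '%s\n' "$header" >> "$out"
  fi
  for part in "$dir"/*; do           # the glob expands in sorted name order
    if [ -f "$part" ]; then
      cat "$part" >> "$out"
    fi
  done
}
```

For the export above this might be called as merge_parts /home/carter/staging /home/carter/hugetable.csv "col1,col2", with the header string written by hand to match the selected columns.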
3 Comments
mike (2017-06-14): Thank you, this created a folder with multiple CSV files. Is there any way to put everything into one file? Also, is there any way to include a header (column names) in the CSV file?
user2205916 (2018-05-24): How do you concatenate them on the client side after exporting?
Ravi Chandra (2018-11-27): For me this command produced a bunch of files ending with the extension .snappy, which looks like a compressed format. I am not sure how to uncompress them. I know how to merge files locally using the command cat file1 file2 > file on my local machine.

4 Comments
Brett Bonner (2015-08-02): This will export as tab-separated.
JGS (2016-05-12): It is working: hive -e 'use <database or schema name>; select * from <table_name>;' > <absolute path for the csv file>/<csv file name>.csv
Li haonan (2019-10-23): Note that in a large company you normally have to assign a queue name to a job like this, which is where -hiveconf comes into play; otherwise you can't run it.
lboniotti (2020-03-05): @Lihaonan, how do I assign a queue name in the query?

You cannot specify a delimiter for the query output. After generating the output (as you did), you can change the delimiter to a comma.
It comes with the default delimiter \001 (an invisible character).
hadoop fs -cat /user/data/output/test/* |tr "\01" "," >>outputwithcomma.csv
check this also
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select * from table; 
is the correct answer.
If the number of records is really big, then depending on the number of files generated, the following command would give only a partial result.
hive -e 'select * from some_table' > /home/yourfile.csv
2 Comments
sAguinaga (2019-05-31): How do I deal with this error message: User user_id does not have privileges for QUERY?
Petro (2019-12-04): Check Ranger's policies for permission errors with Hive.

Recent versions of Hive come with this feature.
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp' 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
select * from table;
This way you can choose your own delimiter and file name.
Just be careful with OVERWRITE: it will try to delete everything in the specified folder.
I have used simple Linux shell piping plus perl to convert Hive-generated output from TSV to CSV.
hive -e "SELECT col1, col2, … FROM table_name" | perl -lpe 's/"/\\"/g; s/^|$/"/g; s/\t/","/g' > output_file.csv
(I got the updated perl regex from someone on Stack Overflow some time ago)
The result will look like regular CSV:
"col1","col2","col3"... and so on
#!/bin/bash
hive -e "insert overwrite local directory '/LocalPath/'
row format delimited fields terminated by ','
select * from Mydatabase.Mytable limit 100"
cat /LocalPath/* > /LocalPath/table.csv
I used limit 100 to limit the size of data since I had a huge table, but you can delete it to export the entire table.
Here you can export the data directly from the Hive warehouse directory instead of querying the Hive table.
First give the Hive warehouse path, then the local path where you want to store the .csv file.
The command for this is below:
hadoop fs -cat /user/hdusr/warehouse/HiveDb/tableName/* > /users/hadoop/test/nilesh/sample.csv
set hive.execution.engine=tez;
set hive.merge.tezfiles=true;
set hive.exec.compress.output=false;
INSERT OVERWRITE DIRECTORY '/tmp/job/'
ROW FORMAT DELIMITED
FIELDS TERMINATED by ','
NULL DEFINED AS ''
STORED AS TEXTFILE
SELECT * from table;
I had a similar issue and this is how I was able to address it.
Step 1 - Loaded the data from hive table into another table as follows
  DROP TABLE IF EXISTS TestHiveTableCSV;
  CREATE TABLE TestHiveTableCSV ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' AS
  SELECT Column List FROM TestHiveTable;
Step 2 - Copied the blob from hive warehouse to the new location with appropriate extension
  Start-AzureStorageBlobCopy `
  -DestContext $destContext `
  -SrcContainer "Source Container" `
  -SrcBlob "hive/warehouse/TestHiveTableCSV/000000_0" `
  -DestContainer "Destination Container" `
  -DestBlob "CSV/TestHiveTable.csv"
Hope this helps!
Best Regards,
Dattatrey Sindol (Datta)
http://dattatreysindol.com
There are ways to change the default delimiter, as shown by other answers.
There are also ways to convert the raw output to csv with some bash scripting. There are 3 delimiters to consider though, not just \001. Things get a bit more complicated when your hive table has maps. 
I wrote a bash script that can handle all 3 default delimiters (\001 \002 and \003) from hive and output a csv. The script and some more info are here:
  Hive Default Delimiters to CSV
  Hive's default delimiters are
Field Delimiter => Control-A ('\001')
Collection Item Delimiter => Control-B ('\002')
Map Key Delimiter => Control-C ('\003')
  There are ways to change these delimiters when exporting tables but
  sometimes you might still get stuck needing to convert this to csv. 
  Here's a quick bash script that can handle a DB export that's
  segmented in multiple files and has the default delimiters. It will
  output a single CSV file.
  It is assumed that the segments all have the naming convention 000*_0
INDIRECTORY="path/to/input/directory"
for f in "$INDIRECTORY"/000*_0; do
  echo "Processing $f file.."
  # cat -v prints the control characters as ^A, ^B and ^C so sed can match them
  cat -v "$f" |
      LC_ALL=C sed -e "s/^/\"/g" |
      LC_ALL=C sed -e "s/\^A/\",\"/g" |
      LC_ALL=C sed -e "s/\^C\^B/\"\":\"\"\"\",\"\"/g" |
      LC_ALL=C sed -e "s/\^B/\"\",\"\"/g" |
      LC_ALL=C sed -e "s/\^C/\"\":\"\"/g" |
      LC_ALL=C sed -e "s/$/\"/g" > "$f-temp"
done
echo "you,can,echo,your,header,here,if,you,like" > "$INDIRECTORY/final_output.csv"
cat "$INDIRECTORY"/*-temp >> "$INDIRECTORY/final_output.csv"
rm "$INDIRECTORY"/*-temp
More explanation on the gist
In case you are doing it from Windows, you can use the Python script hivehoney to extract table data to a local CSV file.
It will:
Log in to the bastion host.
pbrun. 
kinit. 
beeline (with your query). 
echo from beeline to a file on Windows.
Execute it like this:
set PROXY_HOST=your_bastion_host
set SERVICE_USER=you_func_user
set LINUX_USER=your_SOID
set LINUX_PWD=your_pwd
python hh.py --query_file=query.sql
As Carter Shanklin said, with this command we will obtain a csv file with the results of the query in the path specified:
insert overwrite local directory '/home/carter/staging' row format delimited fields terminated by ',' select * from hugetable;
The problem with this solution is that the CSV obtained won't have headers, and the file created does not have a .csv name (so we have to rename it).

As user1922900 said, with the following command we will obtain a CSV file with the results of the query in the specified file and with headers:
hive -e 'select * from some_table' | sed 's/[\t]/,/g' > /home/yourfile.csv
With this solution we will get a CSV file with the result rows of our query, but with log messages between these rows too. I tried this as a solution to the problem, but without success.

So, to solve all these issues I created a script that executes a list of queries, creates a folder (with a timestamp) where it stores the results, renames the files obtained, removes the unnecessary files, and adds the respective headers.
 #!/bin/bash
 # Arrays require bash, not plain sh.
 QUERIES=("select * from table1" "select * from table2")
 # The original left $timestamp undefined; this date format is an assumption.
 timestamp=$(date +%Y%m%d%H%M%S)
 directoryname="ScriptResults$timestamp"
 mkdir "$directoryname"
 counter=1
 for query in "${QUERIES[@]}"; do
     tablename="query$counter"
     outdir="/data/2/DOMAIN_USERS/SANUK/users/$USER/$tablename"
     # Export the rows, then write the header line to a separate file.
     hive -S -e "INSERT OVERWRITE LOCAL DIRECTORY '$outdir' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' $query ;"
     hive -S -e "set hive.cli.print.header=true; $query limit 1" | head -1 | sed 's/[\t]/,/g' > "$outdir/header.csv"
     # Append the data to the header and move the result into the results folder.
     cat "$outdir/000000_0" >> "$outdir/header.csv"
     mv "$outdir/header.csv" "$directoryname/$tablename.csv"
     rm -rf "$outdir"
     counter=$((counter+1))
 done
Below is the end-to-end solution that I use to export Hive table data to HDFS as a single named CSV file with a header.

(it is unfortunate that it's not possible to do with one HQL statement)

It consists of several commands, but it's quite intuitive, I think, and it does not rely on the internal representation of Hive tables, which may change from time to time.

Replace "DIRECTORY" with "LOCAL DIRECTORY" if you want to export the data to a local filesystem versus HDFS.
# cleanup the existing target HDFS directory, if it exists
sudo -u hdfs hdfs dfs -rm -f -r /tmp/data/my_exported_table_name/*
# export the data using Beeline CLI (it will create a data file with a surrogate name in the target HDFS directory)
beeline -u jdbc:hive2://my_hostname:10000 -n hive -e "INSERT OVERWRITE DIRECTORY '/tmp/data/my_exported_table_name' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM my_exported_table_name"
# set the owner of the target HDFS directory to whatever UID you'll be using to run the subsequent commands (root in this case)
sudo -u hdfs hdfs dfs -chown -R root:hdfs /tmp/data/my_exported_table_name
# write the CSV header record to a separate file (make sure that its name sorts before the name of the data file in the target HDFS directory)
# also, obviously, make sure that the number and the order of fields is the same as in the data file
echo 'field_name_1,field_name_2,field_name_3,field_name_4,field_name_5' | hadoop fs -put - /tmp/data/my_exported_table_name/.header.csv
# concatenate all (2) files in the target HDFS directory into the final CSV data file with a header
# (this is where the sort order of the file names is important)
hadoop fs -cat /tmp/data/my_exported_table_name/* | hadoop fs -put - /tmp/data/my_exported_table_name/my_exported_table_name.csv
# give the permissions for the exported data to other users as necessary
sudo -u hdfs hdfs dfs -chmod -R 777 /tmp/data/my_exported_table_name
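The reason the header file is named .header.csv is the sort order of file names: in byte-wise order a leading dot sorts before the digits that start Hive's data file names. The sketch below only demonstrates that ordering locally; the helper name first_in_sort_order is illustrative, and that hadoop fs -cat expands the * glob in this same order is taken from the answer rather than verified here.

```shell
#!/bin/bash
# first_in_sort_order NAME...  (hypothetical helper)
# Print whichever name comes first in byte-wise (C locale) sort order.
first_in_sort_order() {
  printf '%s\n' "$@" | LC_ALL=C sort | head -n 1
}

# '.' is byte 0x2E and '0' is byte 0x30, so the dot-prefixed name wins.
first_in_sort_order 000000_0 .header.csv   # prints .header.csv
```

A header file named header.csv (without the leading dot) would sort after 000000_0 and end up at the bottom of the concatenated file.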
None of the above options worked perfectly for me. There are a few issues I wanted to solve:
If there's a tab in a value, it shouldn't break the CSV output
The header needs to be added automatically without any manual work
Struct, array, or map fields should be JSON encoded
So I created a UDF to do that. (I was a bit surprised that Hive doesn't have this support built in.)
Usage:
ADD JAR ivy://org.jsonex:HiveUDF:0.1.24?transitive=true;
CREATE TEMPORARY FUNCTION to_csv AS 'org.jsonex.hiveudf.ToCSVUDF';
SELECT to_csv(*) FROM someTable;  -- Default separator and headers
SELECT to_csv('{noHead:true}', *) FROM someTable;  -- No headers
SELECT to_csv('{headers:[,,,col3,]}', *) FROM someTable; -- Custom Headers
SELECT to_csv('{fieldSep:|,quoteChar:\"\\'\"}', *) FROM someTable; -- Custom fieldSep and quoteChar