python - After using Dask pivot_table I lose the index column

link之家

链接快照平台

输入网页链接，自动生成快照
标签化管理网页链接

Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams

I am loosing index column after I use pivot_table for Dask Dataframe and save data to Parquet file.

import dask.dataframe as dd
import pandas as pd
df=pd.DataFrame()
df["Index"]=[1,2,3,1,2,3]
df["Field"]=["A","A","A","B","B","B"]
df["Value"]=[10,20,30,100,120,130]
My dataframe:
   Index Field  Value
0      1     A     10
1      2     A     20
2      3     A     30
3      1     B    100
4      2     B    120
5      3     B    130
Dask code:
ddf=dd.from_pandas(df,2)
ddf=ddf.categorize("Field")
ddf=ddf.pivot_table(values="Value", index="Index", columns="Field")
dd.to_parquet("1.parq",ddf)
dd.read_parquet("1.parq").compute()
This gives an error:
  ValueError: Multiple possible indexes exist: ['A', 'B'].  Please
  select one with index='index-name'
I can select A or B as index, but I am missing the Index column.
I tried dd.to_parquet("1.parq",ddf, write_index=True), but it gives me the following error:
  TypeError: cannot insert an item into a CategoricalIndex that is not
  already an existing category
Can someone help me save the table with the column "Index" into the Parquet file? 
ddf.pivot_table(values="Value", index="Index", columns="Field").compute() gives result as expected:
Field     A      B
Index             
1      10.0  100.0
2      20.0  120.0
3      30.0  130.0
And using Pandas is not a solution, because my data is 20 GB.
EDIT:
I tried 
ddf.columns = list(ddf.columns)
dd.to_parquet("1.parq",ddf, write_index=True)
And it gives me a new error:
  dask.async.TypeError: expected list of bytes
Google shows that those kind of errors arise from Tornado asynchronous library.
There are two problems here:
pivot_table produces a column index which is categorical, because you made the original column "Field" categorical. Writing the index to parquet calls reset_index on the data-frame, and pandas cannot add a new value to the columns index, because it is categorical. You can avoid this using ddf.columns = list(ddf.columns).
The index column has object dtype but actually contains integers. Integers are not one of the types expected in an object column, therefore you should convert it.
The whole block now looks like:
ddf = dd.from_pandas(df,2)
ddf = ddf.categorize("Field")
ddf = ddf.pivot_table(values="Value", index="Index", columns="Field")
ddf.columns = list(ddf.columns)
ddf = ddf.reset_index()
ddf['index'] = ddf.index.astype('int64')
dd.to_parquet("1.parq", ddf)
                1. Without categorize I have the following error for pivot_table: ValueError: 'columns' must be category dtype.  2. ddf.columns = list(ddf.columns) and write_index  gave another error: dask.async.TypeError: expected list of bytes. So both did not help.
– keiv.fly
                Mar 7, 2017 at 17:10
                I also tried ddf.columns=pd.Index(list(ddf.columns)). Now the class is the same as before, but it still throws an error: dask.async.TypeError: expected list of bytes. Do you have any other ideas?
– keiv.fly
                Mar 7, 2017 at 17:26
                That worked, thanks! And it is great that there is someone who can answer questions about dask! As I understand you are connected to the development of dask. I think it would be a good idea to reduce the lines to ddf.pivot_table(); dd.to_parquet()
– keiv.fly
                Mar 7, 2017 at 21:09
                I agree completely - please feel free to raise an issue. I must be honest and say I don't exactly understand what happens within dataframe.pivot_table.
– mdurant
                Mar 7, 2017 at 21:43
        Thanks for contributing an answer to Stack Overflow!
Please be sure to answer the question. Provide details and share your research!
But avoid …
Asking for help, clarification, or responding to other answers.
Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.