Reshaping Dataframe using Pivot and Melt in Apache Spark and pandas
Data cleaning is one of the most important and tedious parts of the data science workflow: often mentioned, but rarely discussed in depth. Reflecting on my daily workflow, reshaping DataFrames is one of the most common operations I do to get data into the desired format. Reshaping a DataFrame means transforming the table structure, perhaps by adding or removing columns or rows, or by aggregating certain rows and producing a new column to summarize the aggregation result. In this post I won’t cover everything about reshaping, but I will discuss the two most frequently used operations, pivot and melt. The solutions I discuss are in Spark, to be more specific PySpark; I will also give brief solutions for pandas, but if you want a detailed explanation of the pandas solutions I would recommend you read this post.
Data is usually stored in stacked format or record format, which can contain repeated values, meaning the data is not normalized. When you want to summarize such data in a tabular view, pivot can be a very useful transformation. Let’s take the example of an online business which sells music and books; its (oversimplified) sales data is shown in the table below.
| product | category | quarter | profit |
| --- | --- | --- | --- |
| how to build a house | book | q4 | 40 |
| mind and body | music | q4 | 40 |
The table above has four columns: product and category are pretty much self-explanatory, while the quarter and profit columns describe the profit contributed by each product in that quarter. One observation that can be made is that the table aggregates each product’s profit for each quarter, but this is difficult to grasp at first glance. This format of the table can be described as:
- stacked format : individual observations for each product are stacked on top of each other.
- record format : each row is the record for a single song or book.
- long format : if there are a million music products, the table will be long while the number of columns stays small.
If we want to summarize the quarterly profit for each category of product in tabular format, then the table below would be more appropriate.
The table above is much more intuitive compared to TABLE A, and this is what the pivot operation helps us achieve. Pivot takes the unique values of a specific column and turns them into one or more new columns, using those unique values as the column names. In our example, q1 to q4 were the unique values of the quarter column, so a new column is created for each quarter; these newly added columns are called pivot columns. Spark’s DataFrame API provides a pivot function (on the result of groupBy) for this transformation. A pivot transformation involves four parameters, which are as follows:
- Pivot column is the column whose unique values will become the pivot columns. In our example, quarter is the pivot column.
- Value column is the column whose values will be aggregated and mapped to the index column. Since profit is what we are aggregating in this example, it is the value column.
- Index column is the column you want to use as the index for the pivot columns. For this example, category is the index column.
- Aggregation function is used in case there is more than one row for the column we are pivoting on. We have used the sum function for this example: if there is more than one row for the books category in q4, we sum their q4 profits. You can change the aggregation function depending on the question you are trying to answer; one could use the average function to find the average profit of books sold in each quarter.
Let’s try a code example; below is the PySpark implementation of the pivot transformation:
from pyspark.sql import SQLContext
We first create an RDD of the products, then create a DataFrame from that RDD. In the next step the pivot transformation is applied. As you might have noticed, you don’t explicitly pass all four parameters to the pivot function: pivot itself only takes the name of the pivot column, while groupBy supplies the index column and the aggregation call supplies the value column and aggregation function. The result of the above code snippet is TABLE B.
The pivot method is also available in the pandas library and takes the same four parameters described above; the difference is that in pandas a single method call is provided with all four parameters. The code below first converts the Spark DataFrame to a pandas DataFrame and then applies the pivot_table function; the resulting DataFrame will look like TABLE B.
# convert from spark DataFrame to pandas DataFrame
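A minimal pandas sketch; the sample rows below are assumptions standing in for the result of the Spark-to-pandas conversion:

```python
import pandas as pd

# Sample rows mirroring TABLE A; in the post this frame would come
# from spark_df.toPandas(), which is assumed here.
sales_pd = pd.DataFrame({
    "product": ["how to build a house", "mind and body"],
    "category": ["book", "music"],
    "quarter": ["q4", "q4"],
    "profit": [40, 40],
})

# All four parameters go in one call: index column, pivot column
# (columns=), value column (values=), and the aggregation function.
pivoted_pd = sales_pd.pivot_table(index="category", columns="quarter",
                                  values="profit", aggfunc="sum")
```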
One common mistake people make with the pivot transformation is applying it to a column with numeric values, while the pivot function is supposed to be used on a column with categorical values. In our example, we applied it to the quarter column, which has only categorical values.
The melt transformation is the opposite of the pivot transformation: it converts data from wide (unstacked) format to stacked/long format. A pivot gives you a quick overview (tabular view) of the data, which is good for human analysis but makes complex operations like grouping difficult. Tables in wide format are pretty and summarize the data well, but they are hard to analyze, and that is where the melt operation helps. Let’s take the example of pewforum.org income data for various religious groups in the US. A tabular view of the data is as follows:
This table gives a good summary of the income of each religious group, but it would be better to have a table with three columns: religion, income, and the number of individuals in that income bracket. That format lets us do more complex analysis, such as grouping the table by income category and finding which religious traditions are missing or present in a particular income category.
A melt operation is not provided by Spark’s API, but it is not that difficult to build one. Below is the code that implements the melt function.
from pyspark.sql.functions import array, col, explode, lit, struct
Let’s apply the melt_df function to the religious group income dataset. melt_df takes five parameters:
- df : the DataFrame on which the operation will be carried out.
- id_vars : an array of columns that will serve as the index to which the melted values are matched. In our example religion is the only id_vars column, as we want to map it to the various income classes.
- value_vars : while id_vars identifies the index of the values, these are the columns from which the actual values will be extracted.
- var_name : the name of the variable column in the resulting DataFrame.
- value_name : the name of the value column in the resulting DataFrame.
from pyspark.sql import SQLContext
The above code gives the output shown in the table below.
pandas provides a melt function, shown in the snippet below; the parameters are the same as explained previously. In the example below we create a pandas DataFrame and then apply the melt operation; the results are the same as in the previous example.
# religious_df is the dataframe which stores above table
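A minimal pandas sketch, with placeholder rows assumed for religious_df since the full table isn’t reproduced here:

```python
import pandas as pd

# Placeholder stand-in for religious_df; the counts are assumptions,
# not the real survey numbers.
religious_df = pd.DataFrame({
    "religion": ["Evangelical", "Catholic"],
    "under_30k": [4, 5],
    "over_30k": [3, 6],
})

# Same five parameters as melt_df, in a single built-in call.
melted_pd = religious_df.melt(id_vars=["religion"],
                              value_vars=["under_30k", "over_30k"],
                              var_name="income", value_name="count")
```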
We just saw how to implement the pivot and melt transformations, which reshape a DataFrame. To summarize what we have learnt: the pivot operation helps to quickly summarize a table in tabular format, which is easy for human analysis. In contrast, tabular-format data is not very helpful for complex analysis, and this is where the melt operation comes in, converting a table from wide format to long format. There is quite a bit more to say about which format is suitable in which situation; that is the topic of this post, where I have discussed various heuristics of tidying data.