Manipulating data frame with Deedle in FSharp - Part 2
Manipulating data frame with Deedle in FSharp - Part 2
Today, I would like to show more functionalities of Deedle
and how they can be used in a real life scenario.
If you never heard of Deedle
before, you can find A primer on Deedle
(Part 1) here.
In real life, data rarely (if not never) come in a format exploitable for analysis therefore, a first step of manipulation of data is always needed before analysis. This step involves taking data from CSVs or databases, renaming, categorizing, reordering, transposing or pivoting tables and data. Without any tools or libraries, it would be a pain to do these common operations.
Deedle
is a library which simplifies data exploration by providing functions to execute common manipulation on dataframes and timeseries.
http://bluemountaincapital.github.io/Deedle/
Even though I went through some of the feature in Part 1, I would like to share more features with you which are extremely useful when manipulating dataframes.
This post is composed by three parts:
- A reminder on what is a Deedle Frame and Series
- Statistical calculations
- Pivot table
1. A reminder on what is a Deedle Frame and Series
Deedle works with the concept of dataframe and series.
A dataframe is a table which can contain elements of different type (held as obj
).
1.1 Frame
The type of a frame is Frame<R, C>
where R
is the type of the row key and C
is the type of the column name.
Do not confuse it with the type of the content of the cells.
Using the frame, you can get the rows in two ways Frame.rows
and Frame.getRows
.
The difference is that one returns RowSeries
and the other one returns a Series<R, Series<C, T>>
where T
is the type of your data.
I tend to ues Frame.rows
. Frame.rows
transform your frame into a RowSeries<R, C>
. It is more practical as we don’t need to care about the individual type of each Series
.
In fact there is a special type for Series
where the content is of type obj
; ObjectSeries<K>
where K
is the key type.
Here are the functionalities I use the most with Frame.
Frame.filterRowValues
expenses
|> Frame.filterRowValues(fun c -> c.GetAs<string>("Category") = "Supermarket")
Takes a function as parameter which takes the row ObjectSeries<C>
, where C
is the column type, as input and filter all rows for which the function return true.
Frame.fillMissingWith
expenses
|> Frame.fillMissingWith 0.
Fills all the missing values with the value provided.
Frame.getCol - Frame.getNumericCols
expenses
|> Frame.getCol "Title"
expenses
|> Frame.getNumericCols
getCol
gets a particular column and returns a series. getNumericCols
gets all the numeric columns and drop all other columns.
It has the advantage of making the whole frame content of type float
.
Frame.groupRowsByString - Frame.groupRowsByUsing
expenses
|> Frame.groupRowsByString "Category"
expenses
|> Frame.groupRowsUsing(fun _ c -> monthToString (c.GetAs<DateTime>("Date").Month) + " " + string (c.GetAs<DateTime>("Date").Year))
groupRowsByString
groups the frame by a column where the content is of type string
.
groupRowsUsing
groups the frame using a predicate which takes as input the row key and the row as a ObjectSeries<C>
where C
is the type of the column key.
Frame.nest
expenses
|> Frame.groupByString "Category"
|> Frame.nest
|> Series.observations
|> Seq.map (fun (category, subFrame) -> ...do something clever...)
Returns a Series which has the the first key frame as key and the grouped part of the frame as value.
1.2 Series
The type of a series is Series<K,V>
where K
is the type of the key and V
is the type of the value.
This is important to understand; frame type constraints to not include the type of the content whereas series type constraints contain the type of the content.
As mentioned earlier, there is also ObjectSeries<K>
where K
is the type of the key and all content is obj.
ObjectSeries
is returned when using Frame.rows
.
Here are the functionalities I use the most with Series.
Series.mapValues
amounts
|> Series.mapValues (fun amount -> Math.Abs amount)
Maps over each values of the series.
Series.dropMissing
amounts
|> Series.dropMissing
Drops all pais for which the value is <missing>
.
Series.observations
amounts
|> Series.observations
|> Seq.map (fun (date, amount) -> ...do something clever...)
Returns a sequence of key value pair.
2. Statistical calculations
Deedle
provide a Stats
module which contains commonly used statistical calculations and provide three variantes for each operation.
For example, mean
has Stats.mean
, Stats.expendingMean
and Stats.movingMean
.
The normal calculation applies to all values of the series.
Expending
calculation is done by taking the range from inception till the current value [0, x]
and apply the calculation.
Moving
calculation are done by taking n values before the current value [x - (n -1), x]
and apply the calculation.
The stats available are:
Stats.count - gets the count
Stats.kurt - calculates the kurtosis
Stats.max - gets the max
Stats.min - gets the min
Stats.mean - calculates the mean
Stats.skew - calculates the skewness
Stats.stdDev - calculates the standard deviation
Stats.variance - calculates the variance
When a Frame is passed to Stats
, the calculation is applied to every column of the Frame and a Series of the result is returned.
When a Series is passed to Stats
, the calculation is applied to the Series and the result is returned.
Stats.level
Another interesting function with Stats
is levelX
.
For example, Stats.levelSum
definition is level:('K -> 'L) -> series:(Series<'K, float>) -> Series<'L, float>
where K
is the key type, L
is the transformed key based on the level selected.
It describes on which key level should the calculation be applied.
It is used after a groupBy
but it needs to operate on Series, therefore I usually get the numeric columns out of the frame before applying a Stats.level
.
The following will calculate the sum for all expenses grouped by date and return a series where the key is a DateTime
and the value the sum of all expenses for that particular date.
expenses
|> Frame.groupRowsBy "Date"
|> Frame.getNumericCols
|> Series.mapValues (Stats.levelSum fst)
3. Pivot table
Pivot table is one of the killer feature of Deedle.
pivotTable
allows you to produce a new frame by grouping cells based on predicate applied on rows and columns.
expenses
|> Frame.pivotTable
(fun _ r -> r.GetAs<string>("Category"))
(fun _ r -> r.GetAs<DateTime>("Date"))
(Stats.sum >> Series.get "Amount")
pivotTable
takes 3 functions as argument:
- The first function is used as a selector for the row key
- The second function is used as a selector for the column key
- The third function is used as an aggregate function to be applied to the subframe resulting from the filtering of the row and column keys
For example, we have Expense type:
type Expense = {
Date: DateTime
Label: string
Amount: float
} with
static member Create(date, label, amount) =
{ Date = date; Label = label; Amount = amount }
And we have Frame of expenses:
[ Expense.Create(new DateTime(2016, 2, 1), "Supermarket", 15.)
Expense.Create(new DateTime(2016, 2, 10), "Supermarket", 25.)
Expense.Create(new DateTime(2016, 2, 16), "Clothes", 15.)
Expense.Create(new DateTime(2016, 3, 10), "Supermarket", 65.) ]
|> Frame.ofRecords
Date Label Amount
0 -> 10/02/2016 00:00:00 Supermarket 15
1 -> 10/02/2016 00:00:00 Supermarket 25
2 -> 10/02/2016 00:00:00 Clothes 15
3 -> 10/03/2016 00:00:00 Supermarket 65
A common operation would be to get the number of expenses for a category per day.
To do that, we use pivotTable
:
[ Expense.Create(new DateTime(2016, 2, 10), "Supermarket", 15.)
Expense.Create(new DateTime(2016, 2, 10), "Supermarket", 25.)
Expense.Create(new DateTime(2016, 2, 10), "Clothes", 15.)
Expense.Create(new DateTime(2016, 3, 15), "Supermarket", 65.) ]
|> Frame.ofRecords
|> Frame.pivotTable
(fun _ r -> r.GetAs<DateTime>("Date").ToShortDateString())
(fun _ r -> r.GetAs<string>("Label"))
Frame.countRows
|> Frame.fillMissingWith 0
This operation results in the following frame:
Supermarket Clothes
10/02/2016 -> 2 1
15/03/2016 -> 1 0
Thanks to pivotTable
, we get a Frame which exactly match our requirements.
We could also get the sum of all amounts. We could also pivot by the month or even pivot on the amount itself.
There are endless possibilities to extract data and visualize it in a way that will help us understand it better.
Conclusion
Deedle
is a really powerful tool to manipulate data through dataframe and series.
A lot more can be done, what I showed here is only a small subset of the functionalities.
I have used Deedle in one of my personal project, so if you need examples, you can refer to the part where I manipulate the Frames.
I hope this tutorial was useful for you, if you liked it let me know and I will continue to post more on that subject.
Like always, if you have any comment leave it here or hit me on Twitter @Kimserey_Lam.
See you next time!
Hi!
ReplyDeleteYour last example, how would you be able to show the sum of your 'Amount' in each cell instead of the count of occurrences? So the first cell in the top row would be 40. Then 15, 65, 0.
Thank you
Hi, the last example is different as it is show casing pivoting the frame. Your question, as I understand it, is:
ReplyDelete"How can we show the amount per label?"
In order to achieve this, we could use the grouping per label which would give us a way to sum the inner groups. I actually discussed in Part 1 https://kimsereyblog.blogspot.sg/2016/04/a-primer-on-manipulating-data-frame.html
Hope this helps, thanks.