Manipulating data frame with Deedle in FSharp - Part 2

Manipulating data frame with Deedle in FSharp - Part 2

Today, I would like to show more functionalities of Deedle and how they can be used in a real life scenario. If you never heard of Deedle before, you can find A primer on Deedle (Part 1) here.

In real life, data rarely (if not never) come in a format exploitable for analysis therefore, a first step of manipulation of data is always needed before analysis. This step involves taking data from CSVs or databases, renaming, categorizing, reordering, transposing or pivoting tables and data. Without any tools or libraries, it would be a pain to do these common operations.

Deedle is a library which simplifies data exploration by providing functions to execute common manipulation on dataframes and timeseries.

http://bluemountaincapital.github.io/Deedle/

Even though I went through some of the feature in Part 1, I would like to share more features with you which are extremely useful when manipulating dataframes.

This post is composed by three parts:

  1. A reminder on what is a Deedle Frame and Series
  2. Statistical calculations
  3. Pivot table

1. A reminder on what is a Deedle Frame and Series

Deedle works with the concept of dataframe and series. A dataframe is a table which can contain elements of different type (held as obj).

1.1 Frame

The type of a frame is Frame<R, C> where R is the type of the row key and C is the type of the column name. Do not confuse it with the type of the content of the cells.

Using the frame, you can get the rows in two ways Frame.rows and Frame.getRows. The difference is that one returns RowSeries and the other one returns a Series<R, Series<C, T>> where T is the type of your data.

I tend to ues Frame.rows. Frame.rows transform your frame into a RowSeries<R, C>. It is more practical as we don’t need to care about the individual type of each Series. In fact there is a special type for Series where the content is of type obj; ObjectSeries<K> where K is the key type.

Here are the functionalities I use the most with Frame.

Frame.filterRowValues

 expenses
|> Frame.filterRowValues(fun c -> c.GetAs<string>("Category") = "Supermarket")

Takes a function as parameter which takes the row ObjectSeries<C>, where C is the column type, as input and filter all rows for which the function return true.

Frame.fillMissingWith

expenses
|> Frame.fillMissingWith 0.

Fills all the missing values with the value provided.

Frame.getCol - Frame.getNumericCols

expenses
|> Frame.getCol "Title"

expenses
|> Frame.getNumericCols

getCol gets a particular column and returns a series. getNumericCols gets all the numeric columns and drop all other columns. It has the advantage of making the whole frame content of type float.

Frame.groupRowsByString - Frame.groupRowsByUsing

expenses
|> Frame.groupRowsByString "Category"

expenses
|> Frame.groupRowsUsing(fun _ c ->  monthToString (c.GetAs<DateTime>("Date").Month) + " " + string (c.GetAs<DateTime>("Date").Year))

groupRowsByString groups the frame by a column where the content is of type string. groupRowsUsing groups the frame using a predicate which takes as input the row key and the row as a ObjectSeries<C> where C is the type of the column key.

Frame.nest

expenses
|> Frame.groupByString "Category"
|> Frame.nest
|> Series.observations
|> Seq.map (fun (category, subFrame) -> ...do something clever...)

Returns a Series which has the the first key frame as key and the grouped part of the frame as value.

1.2 Series

The type of a series is Series<K,V> where K is the type of the key and V is the type of the value. This is important to understand; frame type constraints to not include the type of the content whereas series type constraints contain the type of the content.

As mentioned earlier, there is also ObjectSeries<K> where K is the type of the key and all content is obj. ObjectSeries is returned when using Frame.rows.

Here are the functionalities I use the most with Series.

Series.mapValues

amounts
|> Series.mapValues (fun amount -> Math.Abs amount) 

Maps over each values of the series.

Series.dropMissing

amounts
|> Series.dropMissing

Drops all pais for which the value is <missing>.

Series.observations

amounts
|> Series.observations
|> Seq.map (fun (date, amount) -> ...do something clever...)

Returns a sequence of key value pair.

2. Statistical calculations

Deedle provide a Stats module which contains commonly used statistical calculations and provide three variantes for each operation. For example, mean has Stats.mean, Stats.expendingMean and Stats.movingMean.

The normal calculation applies to all values of the series. Expending calculation is done by taking the range from inception till the current value [0, x] and apply the calculation. Moving calculation are done by taking n values before the current value [x - (n -1), x] and apply the calculation.

The stats available are:

Stats.count - gets the count
Stats.kurt - calculates the kurtosis 
Stats.max - gets the max
Stats.min - gets the min
Stats.mean - calculates the mean
Stats.skew - calculates the skewness
Stats.stdDev - calculates the standard deviation
Stats.variance - calculates the variance

When a Frame is passed to Stats, the calculation is applied to every column of the Frame and a Series of the result is returned. When a Series is passed to Stats, the calculation is applied to the Series and the result is returned.

Stats.level

Another interesting function with Stats is levelX. For example, Stats.levelSum definition is level:('K -> 'L) -> series:(Series<'K, float>) -> Series<'L, float> where K is the key type, L is the transformed key based on the level selected. It describes on which key level should the calculation be applied. It is used after a groupBy but it needs to operate on Series, therefore I usually get the numeric columns out of the frame before applying a Stats.level.

The following will calculate the sum for all expenses grouped by date and return a series where the key is a DateTime and the value the sum of all expenses for that particular date.

expenses
|> Frame.groupRowsBy "Date"
|> Frame.getNumericCols
|> Series.mapValues (Stats.levelSum fst)

3. Pivot table

Pivot table is one of the killer feature of Deedle. pivotTable allows you to produce a new frame by grouping cells based on predicate applied on rows and columns.

expenses
|> Frame.pivotTable
    (fun _ r -> r.GetAs<string>("Category"))
    (fun _ r -> r.GetAs<DateTime>("Date"))
    (Stats.sum >> Series.get "Amount")

pivotTable takes 3 functions as argument:

  1. The first function is used as a selector for the row key
  2. The second function is used as a selector for the column key
  3. The third function is used as an aggregate function to be applied to the subframe resulting from the filtering of the row and column keys

For example, we have Expense type:

type Expense = {
    Date: DateTime
    Label: string
    Amount: float
} with
    static member Create(date, label, amount) =
        { Date = date; Label = label; Amount = amount }

And we have Frame of expenses:

[ Expense.Create(new DateTime(2016, 2, 1), "Supermarket", 15.)
  Expense.Create(new DateTime(2016, 2, 10), "Supermarket", 25.)
  Expense.Create(new DateTime(2016, 2, 16), "Clothes", 15.)
  Expense.Create(new DateTime(2016, 3, 10), "Supermarket", 65.) ]
|> Frame.ofRecords
     Date                Label       Amount 
0 -> 10/02/2016 00:00:00 Supermarket 15     
1 -> 10/02/2016 00:00:00 Supermarket 25     
2 -> 10/02/2016 00:00:00 Clothes     15     
3 -> 10/03/2016 00:00:00 Supermarket 65       

A common operation would be to get the number of expenses for a category per day. To do that, we use pivotTable:

[ Expense.Create(new DateTime(2016, 2, 10), "Supermarket", 15.)
  Expense.Create(new DateTime(2016, 2, 10), "Supermarket", 25.)
  Expense.Create(new DateTime(2016, 2, 10), "Clothes", 15.)
  Expense.Create(new DateTime(2016, 3, 15), "Supermarket", 65.) ]
|> Frame.ofRecords
|> Frame.pivotTable
    (fun _ r -> r.GetAs<DateTime>("Date").ToShortDateString())
    (fun _ r -> r.GetAs<string>("Label"))
    Frame.countRows
|> Frame.fillMissingWith 0

This operation results in the following frame:

              Supermarket Clothes 
10/02/2016 -> 2           1       
15/03/2016 -> 1           0          

Thanks to pivotTable, we get a Frame which exactly match our requirements. We could also get the sum of all amounts. We could also pivot by the month or even pivot on the amount itself. There are endless possibilities to extract data and visualize it in a way that will help us understand it better.

Conclusion

Deedle is a really powerful tool to manipulate data through dataframe and series. A lot more can be done, what I showed here is only a small subset of the functionalities. I have used Deedle in one of my personal project, so if you need examples, you can refer to the part where I manipulate the Frames. I hope this tutorial was useful for you, if you liked it let me know and I will continue to post more on that subject. Like always, if you have any comment leave it here or hit me on Twitter @Kimserey_Lam. See you next time!

Comments

  1. Hi!
    Your last example, how would you be able to show the sum of your 'Amount' in each cell instead of the count of occurrences? So the first cell in the top row would be 40. Then 15, 65, 0.

    Thank you

    ReplyDelete
  2. Hi, the last example is different as it is show casing pivoting the frame. Your question, as I understand it, is:

    "How can we show the amount per label?"

    In order to achieve this, we could use the grouping per label which would give us a way to sum the inner groups. I actually discussed in Part 1 https://kimsereyblog.blogspot.sg/2016/04/a-primer-on-manipulating-data-frame.html

    Hope this helps, thanks.

    ReplyDelete

Post a Comment

Popular posts from this blog

Microsoft Orleans logs warnings and errors

A complete SignalR with ASP Net Core example with WSS, Authentication, Nginx

SDK-Style project and project.assets.json