### Manipulating data frame with Deedle in FSharp - Part 2

# Manipulating data frame with Deedle in FSharp - Part 2

Today, I would like to show more functionalities of `Deedle`

and how they can be used in a real life scenario.
If you never heard of `Deedle`

before, you can find A primer on `Deedle`

(Part 1) here.

In real life, data rarely (if not never) come in a format exploitable for analysis therefore, a first step of manipulation of data is always needed before analysis. This step involves taking data from CSVs or databases, renaming, categorizing, reordering, transposing or pivoting tables and data. Without any tools or libraries, it would be a pain to do these common operations.

`Deedle`

is a library which simplifies data exploration by providing functions to execute common manipulation on dataframes and timeseries.

http://bluemountaincapital.github.io/Deedle/

Even though I went through some of the feature in Part 1, I would like to share more features with you which are extremely useful when manipulating dataframes.

This post is composed by three parts:

- A reminder on what is a Deedle Frame and Series
- Statistical calculations
- Pivot table

## 1. A reminder on what is a Deedle Frame and Series

Deedle works with the concept of dataframe and series.
A dataframe is a table which can contain elements of different type (held as `obj`

).

### 1.1 Frame

The type of a frame is `Frame<R, C>`

where `R`

is the type of the **row key** and `C`

is the type of the **column name**.
Do not confuse it with the type of the content of the cells.

Using the frame, you can get the rows in two ways `Frame.rows`

and `Frame.getRows`

.
The difference is that one returns `RowSeries`

and the other one returns a `Series<R, Series<C, T>>`

where `T`

is the type of your data.

I tend to ues `Frame.rows`

. `Frame.rows`

transform your frame into a `RowSeries<R, C>`

. It is more practical as we don’t need to care about the individual type of each `Series`

.
In fact there is a special type for `Series`

where the content is of type `obj`

; `ObjectSeries<K>`

where `K`

is the **key type**.

Here are the functionalities I use the most with Frame.

#### Frame.filterRowValues

```
expenses
|> Frame.filterRowValues(fun c -> c.GetAs<string>("Category") = "Supermarket")
```

Takes a function as parameter which takes the row `ObjectSeries<C>`

, where `C`

is the column type, as input and filter all rows for which the function return true.

#### Frame.fillMissingWith

```
expenses
|> Frame.fillMissingWith 0.
```

Fills all the missing values with the value provided.

#### Frame.getCol - Frame.getNumericCols

```
expenses
|> Frame.getCol "Title"
expenses
|> Frame.getNumericCols
```

`getCol`

gets a particular column and returns a series. `getNumericCols`

gets all the numeric columns and drop all other columns.
It has the advantage of making the whole frame content of type `float`

.

#### Frame.groupRowsByString - Frame.groupRowsByUsing

```
expenses
|> Frame.groupRowsByString "Category"
expenses
|> Frame.groupRowsUsing(fun _ c -> monthToString (c.GetAs<DateTime>("Date").Month) + " " + string (c.GetAs<DateTime>("Date").Year))
```

`groupRowsByString`

groups the frame by a column where the content is of type `string`

.
`groupRowsUsing`

groups the frame using a predicate which takes as input the **row key** and the row as a `ObjectSeries<C>`

where `C`

is the type of the **column key**.

#### Frame.nest

```
expenses
|> Frame.groupByString "Category"
|> Frame.nest
|> Series.observations
|> Seq.map (fun (category, subFrame) -> ...do something clever...)
```

Returns a Series which has the the **first key frame** as key and the **grouped part of the frame** as value.

### 1.2 Series

The type of a series is `Series<K,V>`

where `K`

is the type of the **key** and `V`

is the type of the **value**.
This is important to understand; frame type constraints to not include the type of the content whereas series type constraints contain the type of the content.

As mentioned earlier, there is also `ObjectSeries<K>`

where `K`

is the type of the **key** and all content is obj.
`ObjectSeries`

is returned when using `Frame.rows`

.

Here are the functionalities I use the most with Series.

#### Series.mapValues

```
amounts
|> Series.mapValues (fun amount -> Math.Abs amount)
```

Maps over each values of the series.

#### Series.dropMissing

```
amounts
|> Series.dropMissing
```

Drops all pais for which the value is `<missing>`

.

#### Series.observations

```
amounts
|> Series.observations
|> Seq.map (fun (date, amount) -> ...do something clever...)
```

Returns a sequence of key value pair.

## 2. Statistical calculations

`Deedle`

provide a `Stats`

module which contains commonly used statistical calculations and provide three variantes for each operation.
For example, `mean`

has `Stats.mean`

, `Stats.expendingMean`

and `Stats.movingMean`

.

The normal calculation applies to all values of the series.
`Expending`

calculation is done by taking the range from inception till the current value `[0, x]`

and apply the calculation.
`Moving`

calculation are done by taking n values before the current value `[x - (n -1), x]`

and apply the calculation.

The stats available are:

```
Stats.count - gets the count
Stats.kurt - calculates the kurtosis
Stats.max - gets the max
Stats.min - gets the min
Stats.mean - calculates the mean
Stats.skew - calculates the skewness
Stats.stdDev - calculates the standard deviation
Stats.variance - calculates the variance
```

When a Frame is passed to `Stats`

, the calculation is applied to every column of the Frame and a Series of the result is returned.
When a Series is passed to `Stats`

, the calculation is applied to the Series and the result is returned.

#### Stats.level

Another interesting function with `Stats`

is `levelX`

.
For example, `Stats.levelSum`

definition is `level:('K -> 'L) -> series:(Series<'K, float>) -> Series<'L, float>`

where `K`

is the key type, `L`

is the transformed key based on the level selected.
It describes on which key level should the calculation be applied.
It is used after a `groupBy`

but it needs to operate on Series, therefore I usually get the numeric columns out of the frame before applying a `Stats.level`

.

The following will calculate the sum for all expenses grouped by date and return a series where the key is a `DateTime`

and the value the sum of all expenses for that particular date.

```
expenses
|> Frame.groupRowsBy "Date"
|> Frame.getNumericCols
|> Series.mapValues (Stats.levelSum fst)
```

## 3. Pivot table

Pivot table is one of the killer feature of Deedle.
`pivotTable`

allows you to produce a new frame by grouping cells based on predicate applied on rows and columns.

```
expenses
|> Frame.pivotTable
(fun _ r -> r.GetAs<string>("Category"))
(fun _ r -> r.GetAs<DateTime>("Date"))
(Stats.sum >> Series.get "Amount")
```

`pivotTable`

takes 3 functions as argument:

- The first function is used as a selector for the row key
- The second function is used as a selector for the column key
- The third function is used as an aggregate function to be applied to the subframe resulting from the filtering of the row and column keys

For example, we have Expense type:

```
type Expense = {
Date: DateTime
Label: string
Amount: float
} with
static member Create(date, label, amount) =
{ Date = date; Label = label; Amount = amount }
```

And we have Frame of expenses:

```
[ Expense.Create(new DateTime(2016, 2, 1), "Supermarket", 15.)
Expense.Create(new DateTime(2016, 2, 10), "Supermarket", 25.)
Expense.Create(new DateTime(2016, 2, 16), "Clothes", 15.)
Expense.Create(new DateTime(2016, 3, 10), "Supermarket", 65.) ]
|> Frame.ofRecords
```

```
Date Label Amount
0 -> 10/02/2016 00:00:00 Supermarket 15
1 -> 10/02/2016 00:00:00 Supermarket 25
2 -> 10/02/2016 00:00:00 Clothes 15
3 -> 10/03/2016 00:00:00 Supermarket 65
```

A common operation would be to get the number of expenses for a category per day.
To do that, we use `pivotTable`

:

```
[ Expense.Create(new DateTime(2016, 2, 10), "Supermarket", 15.)
Expense.Create(new DateTime(2016, 2, 10), "Supermarket", 25.)
Expense.Create(new DateTime(2016, 2, 10), "Clothes", 15.)
Expense.Create(new DateTime(2016, 3, 15), "Supermarket", 65.) ]
|> Frame.ofRecords
|> Frame.pivotTable
(fun _ r -> r.GetAs<DateTime>("Date").ToShortDateString())
(fun _ r -> r.GetAs<string>("Label"))
Frame.countRows
|> Frame.fillMissingWith 0
```

This operation results in the following frame:

```
Supermarket Clothes
10/02/2016 -> 2 1
15/03/2016 -> 1 0
```

Thanks to `pivotTable`

, we get a Frame which exactly match our requirements.
We could also get the sum of all amounts. We could also pivot by the month or even pivot on the amount itself.
There are endless possibilities to extract data and visualize it in a way that will help us understand it better.

## Conclusion

`Deedle`

is a really powerful tool to manipulate data through dataframe and series.
A lot more can be done, what I showed here is only a small subset of the functionalities.
I have used Deedle in one of my personal project, so if you need examples, you can refer to the part where I manipulate the Frames.
I hope this tutorial was useful for you, if you liked it let me know and I will continue to post more on that subject.
Like always, if you have any comment leave it here or hit me on Twitter @Kimserey_Lam.
See you next time!

Hi!

ReplyDeleteYour last example, how would you be able to show the sum of your 'Amount' in each cell instead of the count of occurrences? So the first cell in the top row would be 40. Then 15, 65, 0.

Thank you

Hi, the last example is different as it is show casing pivoting the frame. Your question, as I understand it, is:

ReplyDelete"How can we show the amount per label?"

In order to achieve this, we could use the grouping per label which would give us a way to sum the inner groups. I actually discussed in Part 1 https://kimsereyblog.blogspot.sg/2016/04/a-primer-on-manipulating-data-frame.html

Hope this helps, thanks.