Recently a colleague asked me for advice about using LINQ to group and sort a table in a specific way. It contains multiple columns by which to group, and a single time-stamp column by which to sort. Consider this data as an example:
| Column1 | Column2 | Column3 | Timestamp |
|---|---|---|---|
| C | C | C | 6 |
| A | B | C | 4 |
| C | A | A | 5 |
| B | B | C | 1 |
| C | C | C | 3 |
| C | C | C | 1 |
| C | B | C | 7 |
| B | A | A | 5 |
| A | C | B | 7 |
| A | B | B | 9 |
The first step is to group by Column1, and then sort the groups by their lowest time-stamp. For equal values of the time-stamp, the order is arbitrary. Repeat the step for Column2 and Column3, preserving the previous grouping. The result would be
| Column1 | Column2 | Column3 | Timestamp |
|---|---|---|---|
| C | C | C | 6 |
| C | C | C | 3 |
| C | C | C | 1 |
| C | A | A | 5 |
| C | B | C | 7 |
| B | B | C | 1 |
| B | A | A | 5 |
| A | B | C | 4 |
| A | B | B | 9 |
| A | C | B | 7 |
The requirements can be quite literally translated into the following LINQ query
IEnumerable<DataRow> sorted =
// Column1
from row in table.AsEnumerable()
group row by row.Field<string>(0)
into grouped1
orderby grouped1.Min(x => x.Field<int>("Timestamp"))
from row in
// Column2
from row in grouped1
group row by row.Field<string>(1)
into grouped2
orderby grouped2.Min(x => x.Field<int>("Timestamp"))
from row in
// Column3
from row in grouped2
group row by row.Field<string>(2)
into grouped3
orderby grouped3.Min(x => x.Field<int>("Timestamp"))
from row in grouped3
select row
select row
select row;
While this does indeed achieve the desired results, there are a few drawbacks to this solution.
Most importantly the code is not easily comprehensible, which is always a major concern for production code. One of the most compelling features of LINQ, the possibility to express data queries in a clear fashion, is lost here. The nested grouping/sorting can be somewhat clarified by suitable comments and indentation. However, while excessive comments are better than none, sometimes they are a sign that the code is badly structured. Even worse, the indentation is actually flattened by the Visual Studio code formatting.
Another shortcoming is that the number of columns is fixed. While this was fine by the requirements, a more general solution would be welcome.
Let’s take a closer look at the query to understand how it can be improved. The crucial piece of the query is the rather obscure construct
from row in
from row in grouped1
[...]
select row
which is nested three times in the query.
At first this is quite confusing because the first from gets translated into a call to SelectMany, while the second one only marks the beginning of a sub-query. All becomes clearer once one realizes that the first from row in together with its matching select row does nothing more but to flatten the enclosed sequence. This is best visible with the innermost appearance of this pattern
from row in grouped3 select row
which gets translated into the idiom SelectMany(grouped3 => grouped3) 1. Remember that SelectMany is the LINQ equivalent of the monadic Bind. As such it flattens the sequence of the grouped values that grouped3 takes on, which are each of type IGrouping<string, DataRow> and therefore also of type IEnumerable<DataRow>, into a single IEnumerable<DataRow>.
Equipped with this insight we can reformulate the query as a recursive method. The parameter rows represents the rows to be grouped and sorted, groupColumns are the columns by which to group, and colIndex is the current index into groupColumns.
static IEnumerable<DataRow> RecursiveGroupSort(IEnumerable<DataRow> rows, IList<DataColumn> groupColumns, int colIndex)
{
return
from row in rows
group row by row.Field<string>(groupColumns[colIndex])
into grouped
orderby grouped.Min(x => x.Field<int>("Timestamp"))
from row in (colIndex < groupColumns.Count - 1 ? RecursiveGroupSort(grouped, groupColumns, colIndex + 1) : grouped)
select row;
}
The expression
(colIndex < groupColumns.Count - 1 ? RecursiveGroupSort(grouped, groupColumns, colIndex + 1) : grouped)
represents the sub-query, which is either a recursive call, or the innermost sequence of grouped rows. Actually the parentheses here are not necessary, but they make the code slightly more legible.
The method RecursiveGroupSort takes an arbitrary number of columns to group by. Furthermore the code is much clearer than the original unrolled query. One might go even further from here and implement this as an anonymous recursion, but the resulting code would most certainly be much less comprehensible.
- Actually for performance reasons it gets translated into the overload
SelectMany (grouped3 => grouped3, (grouped3, row) => row), see Wes Dyers marvelous blog post for further explanation. ↩
