Recently a colleague asked me for advice about using LINQ to group and sort a table in a specific way. It contains multiple columns by which to group, and a single time-stamp column by which to sort. Consider this data as an example:
Column1 | Column2 | Column3 | Timestamp |
---|---|---|---|
C | C | C | 6 |
A | B | C | 4 |
C | A | A | 5 |
B | B | C | 1 |
C | C | C | 3 |
C | C | C | 1 |
C | B | C | 7 |
B | A | A | 5 |
A | C | B | 7 |
A | B | B | 9 |
The first step is to group by Column1, and then sort the groups by their lowest time-stamp. For equal values of the time-stamp, the order is arbitrary. Repeat the step for Column2 and Column3, preserving the previous grouping. The result would be
Column1 | Column2 | Column3 | Timestamp |
---|---|---|---|
C | C | C | 6 |
C | C | C | 3 |
C | C | C | 1 |
C | A | A | 5 |
C | B | C | 7 |
B | B | C | 1 |
B | A | A | 5 |
A | B | C | 4 |
A | B | B | 9 |
A | C | B | 7 |
The requirements can be quite literally translated into the following LINQ query
IEnumerable<DataRow> sorted = // Column1 from row in table.AsEnumerable() group row by row.Field<string>(0) into grouped1 orderby grouped1.Min(x => x.Field<int>("Timestamp")) from row in // Column2 from row in grouped1 group row by row.Field<string>(1) into grouped2 orderby grouped2.Min(x => x.Field<int>("Timestamp")) from row in // Column3 from row in grouped2 group row by row.Field<string>(2) into grouped3 orderby grouped3.Min(x => x.Field<int>("Timestamp")) from row in grouped3 select row select row select row;
While this does indeed achieve the desired results, there are a few drawbacks to this solution.
Most importantly the code is not easily comprehensible, which is always a major concern for production code. One of the most compelling features of LINQ, the possibility to express data queries in a clear fashion, is lost here. The nested grouping/sorting can be somewhat clarified by suitable comments and indentation. However, while excessive comments are better than none, sometimes they are a sign that the code is badly structured. Even worse, the indentation is actually flattened by the Visual Studio code formatting.
Another shortcoming is that the number of columns is fixed. While this was fine by the requirements, a more general solution would be welcome.
Let’s take a closer look at the query to understand how it can be improved. The crucial piece of the query is the rather obscure construct
from row in from row in grouped1 [...] select row
which is nested three times in the query.
At first this is quite confusing because the first from
gets translated into a call to SelectMany
, while the second one only marks the beginning of a sub-query. All becomes clearer once one realizes that the first from row in
together with its matching select row
does nothing more but to flatten the enclosed sequence. This is best visible with the innermost appearance of this pattern
from row in grouped3 select row
which gets translated into the idiom SelectMany(grouped3 => grouped3)
1. Remember that SelectMany
is the LINQ equivalent of the monadic Bind
. As such it flattens the sequence of the grouped values that grouped3
takes on, which are each of type IGrouping<string, DataRow>
and therefore also of type IEnumerable<DataRow>
, into a single IEnumerable<DataRow>
.
Equipped with this insight we can reformulate the query as a recursive method. The parameter rows
represents the rows to be grouped and sorted, groupColumns
are the columns by which to group, and colIndex
is the current index into groupColumns
.
static IEnumerable<DataRow> RecursiveGroupSort(IEnumerable<DataRow> rows, IList<DataColumn> groupColumns, int colIndex) { return from row in rows group row by row.Field<string>(groupColumns[colIndex]) into grouped orderby grouped.Min(x => x.Field<int>("Timestamp")) from row in (colIndex < groupColumns.Count - 1 ? RecursiveGroupSort(grouped, groupColumns, colIndex + 1) : grouped) select row; }
The expression
(colIndex < groupColumns.Count - 1 ? RecursiveGroupSort(grouped, groupColumns, colIndex + 1) : grouped)
represents the sub-query, which is either a recursive call, or the innermost sequence of grouped rows. Actually the parentheses here are not necessary, but they make the code slightly more legible.
The method RecursiveGroupSort
takes an arbitrary number of columns to group by. Furthermore the code is much clearer than the original unrolled query. One might go even further from here and implement this as an anonymous recursion, but the resulting code would most certainly be much less comprehensible.
- Actually for performance reasons it gets translated into the overload
SelectMany (grouped3 => grouped3, (grouped3, row) => row)
, see Wes Dyers marvelous blog post for further explanation. ↩