Recursive LINQ

Recently a colleague asked me for advice about using LINQ to group and sort a table in a specific way. It contains multiple columns by which to group, and a single time-stamp column by which to sort. Consider this data as an example:

Column1	Column2	Column3	Timestamp
C	C	C	6
A	B	C	4
C	A	A	5
B	B	C	1
C	C	C	3
C	C	C	1
C	B	C	7
B	A	A	5
A	C	B	7
A	B	B	9

The unsorted table.

The first step is to group by Column1, and then sort the groups by their lowest time-stamp. For equal values of the time-stamp, the order is arbitrary. Repeat the step for Column2 and Column3, preserving the previous grouping. The result would be

Column1	Column2	Column3	Timestamp
C	C	C	6
C	C	C	3
C	C	C	1
C	A	A	5
C	B	C	7
B	B	C	1
B	A	A	5
A	B	C	4
A	B	B	9
A	C	B	7

The grouped/sorted result.

The requirements can be quite literally translated into the following LINQ query

IEnumerable<DataRow> sorted =
// Column1
    from row in table.AsEnumerable()
    group row by row.Field<string>(0)
        into grouped1
    orderby grouped1.Min(x => x.Field<int>("Timestamp"))
    from row in
// Column2
        from row in grouped1
        group row by row.Field<string>(1)
            into grouped2
        orderby grouped2.Min(x => x.Field<int>("Timestamp"))
        from row in
// Column3
            from row in grouped2
            group row by row.Field<string>(2)
                into grouped3
            orderby grouped3.Min(x => x.Field<int>("Timestamp"))
            from row in grouped3
            select row
        select row
    select row;

While this does indeed achieve the desired results, there are a few drawbacks to this solution.

Most importantly the code is not easily comprehensible, which is always a major concern for production code. One of the most compelling features of LINQ, the possibility to express data queries in a clear fashion, is lost here. The nested grouping/sorting can be somewhat clarified by suitable comments and indentation. However, while excessive comments are better than none, sometimes they are a sign that the code is badly structured. Even worse, the indentation is actually flattened by the Visual Studio code formatting.

Another shortcoming is that the number of columns is fixed. While this was fine by the requirements, a more general solution would be welcome.

Let’s take a closer look at the query to understand how it can be improved. The crucial piece of the query is the rather obscure construct

from row in
    from row in grouped1
[...]
select row

which is nested three times in the query.

At first this is quite confusing because the first from gets translated into a call to SelectMany, while the second one only marks the beginning of a sub-query. All becomes clearer once one realizes that the first from row in together with its matching select row does nothing more but to flatten the enclosed sequence. This is best visible with the innermost appearance of this pattern

from row in grouped3
select row

which gets translated into the idiom SelectMany(grouped3 => grouped3) ¹. Remember that SelectMany is the LINQ equivalent of the monadic Bind. As such it flattens the sequence of the grouped values that grouped3 takes on, which are each of type IGrouping<string, DataRow> and therefore also of type IEnumerable<DataRow>, into a single IEnumerable<DataRow>.

Equipped with this insight we can reformulate the query as a recursive method. The parameter rows represents the rows to be grouped and sorted, groupColumns are the columns by which to group, and colIndex is the current index into groupColumns.

static IEnumerable<DataRow> RecursiveGroupSort(IEnumerable<DataRow> rows, IList<DataColumn> groupColumns, int colIndex)
{
    return
        from row in rows
        group row by row.Field<string>(groupColumns[colIndex])
            into grouped
        orderby grouped.Min(x => x.Field<int>("Timestamp"))
        from row in (colIndex < groupColumns.Count - 1 ? RecursiveGroupSort(grouped, groupColumns, colIndex + 1) : grouped)
        select row;
}

The expression

(colIndex < groupColumns.Count - 1 ? RecursiveGroupSort(grouped, groupColumns, colIndex + 1) : grouped)

represents the sub-query, which is either a recursive call, or the innermost sequence of grouped rows. Actually the parentheses here are not necessary, but they make the code slightly more legible.

The method RecursiveGroupSort takes an arbitrary number of columns to group by. Furthermore the code is much clearer than the original unrolled query. One might go even further from here and implement this as an anonymous recursion, but the resulting code would most certainly be much less comprehensible.

Actually for performance reasons it gets translated into the overload SelectMany (grouped3 => grouped3, (grouped3, row) => row), see Wes Dyers marvelous blog post for further explanation. ↩

Recursive LINQ

Like this:

Leave a ReplyCancel reply

Recursive LINQ

Share this:

Like this:

Leave a ReplyCancel reply