Home > Articles > Programming > C#

More Effective C#: Item 36: Understand How Query Expressions Map to Method Calls

  • Print
  • + Share This
  • 💬 Discuss
LINQ is built on two concepts: A query language and a translation from that query language to a set of methods. Every query expression has a mapping to some method call or calls. You should understand this mapping from two different perspectives: From the perspective of a class user and as a class designer.

LINQ is built on two concepts: A query language and a translation from that query language to a set of methods. The C# compiler converts query expressions in that query language into method calls. Every query expression has a mapping to some method call or calls. You should understand this mapping from two different perspectives. From the perspective of a class user, you need to understand that your query expressions are nothing more than method calls. A where clause translates to a call to a method named Where(), with the proper set of parameters. As a class designer, you should evaluate the implementations of those methods provided by the base framework and determine if you can create better implementations for your types. If not, you should simply defer to the base library versions. However, when you can create a better version, you must make sure that you fully understand the translation from query expressions into method calls. It’s your responsibility to ensure that your method signatures correctly handle every translation case. For some of the query expressions, that’s rather obvious. However, a couple of the more complicated expressions are a little more difficult to comprehend.

The full Query Expression Pattern contains eleven different methods. The following is the definition from the C# language Specification, §7.15.3 (reprinted with permission from Microsoft Corporation):

delegate R Func<T1,R>(T1 arg1);
delegate R Func<T1,T2,R>(T1 arg1, T2 arg2);
class C
{
	public C<T> Cast<T>();
}
class C<T> : C
{
	public C<T> Where(Func<T,bool> predicate);
	public C<U> Select<U>(Func<T,U> selector);
	public C<V> SelectMany<U,V>(Func<T,C<U>> selector,
		Func<T,U,V> resultSelector);
	public C<V> Join<U,K,V>(C<U> inner, Func<T,K> outerKeySelector,
		Func<U,K> innerKeySelector, Func<T,U,V> resultSelector);
	public C<V> GroupJoin<U,K,V>(C<U> inner, 
            Func<T,K> outerKeySelector,
		Func<U,K> innerKeySelector, Func<T,C<U>,V> resultSelector);
	public O<T> OrderBy<K>(Func<T,K> keySelector);
	public O<T> OrderByDescending<K>(Func<T,K> keySelector);
	public C<G<K,T>> GroupBy<K>(Func<T,K> keySelector);
	public C<G<K,E>> GroupBy<K,E>(Func<T,K> keySelector,
		Func<T,E> elementSelector);
}
class O<T> : C<T>

{
	public O<T> ThenBy<K>(Func<T,K> keySelector);
	public O<T> ThenByDescending<K>(Func<T,K> keySelector);
}
class G<K,T> : C<T>

{
	public K Key { get; }
}

The .NET base library provides two general purpose reference implementations of this pattern. System.Linq.Enumerable provides extension methods on IEnumerable<T> that implement the query expression pattern. System.Linq.Queryable provides a similar set of extension method on IQueryable<T> that support a query provider’s ability to translate queries into some other format for execution. (For example, the LINQ to SQL implementation converts query expressions to SQL queries that are executed by the SQL database engine). As a class user, that means you are probably using one of those two reference implementations for most of your queries. Secondly, as a class author, it means that if you create a data source that implements IEnumerable<T>, or IQueryable<T> (or a closed generic type from IEnumerable<T> or IQueryable<T>) they your type already implements the Query Expression Pattern. You’re type has that implementation using the extension methods defined in the base library.

Before we go farther, you should understand that the C# language does not enforce any execution semantics on the query expression pattern. You could create a method that matches the signature of one of the query methods and does anything internally. The compiler cannot verify that your Where method satisfies the expectations of the query expression pattern. All it can do is insure that the syntactic contract is satisfied. This isn’t any different than any interface method. You could create an interface method that did anything, whether or not that met users’ expectations. Of course, that doesn’t mean you should ever consider such a plan. If you implement any of the Query Expression Pattern methods, you should do ensure that its behavior is consistent with the reference implementations, both syntactically and semantically. Other than performance differences, callers should not be able to determine if your method is being used, or if the reference implementations are being used.

Translating from query expressions to method invocations is a rather complicated iterative process. The compiler repeatedly translates expressions to methods until all expressions have been translated. Furthermore, the compiler has a specified order for which query expressions are translated to method calls. I’m not explaining them in that order. The compiler order is easy for the compiler, and is documented in the C# Language Specification. I chose an order that makes it easier to explain to humans. For our purposes, I’ll discuss some of the translations in smaller, simpler examples.

Looking at this query, we’ll examine where, select, and range variables:

int[] numbers = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
var smallNumbers = from n in numbers
                   where n < 5
                   select n;

from n in numbers’ binds the range variable ‘n’ to each value in numbers. The where clause defines a filter, which would be translated into a where method. ‘where n < 5’ translates to the following:

numbers.Where((n) => n < 5);

Where is nothing more than a filter. The output of Where is a proper subset of the input sequence, containing only those elements that satisfy the predicate. The input and output sequences must contain the same type, and a correct Where method must not modify the items in the input sequence. (User defined Predicates may modify items, but that’s not the responsibility of the query expression patthern.)

That where method can be implemented as either an instance method accessible to numbers, or an extension method matching the type of numbers. From the example above, numbers is an array of int. Therefore, n, in the method call above must be an integer.

Where is certainly the simplest of the translations from query expression to method call. Before we go on, let’s dig a little deeper into how this works, and what that means for the translations. The compiler completes its translation from query expression to method call before any overload resolution or type binding. The compiler does not know if there are any candidate methods when the compiler translates the query expression to a method call. It doesn’t examine the type, it doesn’t look for any candidate extension methods. It simply translates the query expression into the method call. After all queries have been translated into method call syntax, the compiler performs the work of searching for candidate methods and then determining the best match.

Next, you can grow that simple example to include the select expression in the query. Select clauses are translated into Select methods. However, there are special cases when the Select method can be optimized away. The query above is a degenerate select, selecting the range variable. Degenerate select queries can be optimized away, because the output sequence is not equal to the input sequence. The query above has a where clause, which breaks that identity relationship between the input sequence and the output sequence. Therefore, the final method call version of the query above is this:

var smallNumbers = numbers.Where(n => n < 5);

The select clause is removed because it is redundant. That’s safe because the select operates on an immediate result from another query expression (where in this example). When the select does not operate on the immediate result of another expression, it cannot be optimized away. Therefore this query:

var allNumbers = from n in numbers select n;

Will be translated into this method call:

var allNumbers = numbers.Select(n => n);

While we’re covering select, select will often be used to transform or project one input element into a different element, or into a different type. This query modifies the value of the result:

int[] numbers = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
var smallNumbers = from n in numbers
                   where n < 5
                   select n * n;

Or, you could transform the input sequence into a different type as follows:

int [] numbers = {0,1,2,3,4,5,6,7,8,9};
var squares = from n in numbers
              select new { Number = n, Square = n * n};

The select clause will map to a Select method matching the signature in the Query Expression Pattern:

var squares = numbers.Select(n => 
    new { Number = n, Square = n * n});

Select will transform the input type into the output type. A proper select method must produce exactly one output element for every input element. Also, a proper implementation of Select must not modify the items in the input sequence.

That’s the end of the simpler query expressions. Now, we’ll discuss some of the less obvious transformations.

Ordering relations map to OrderBy and ThenBy methods, or OrderByDescending, and ThenByDescending. This query:

var people = from e in employees
             where e.Age > 30
             orderby e.LastName, e.FirstName, e.Age
             select e;

Translates into:

var people = employees.Where(e => e.Age > 30).
    OrderBy(e => e.LastName).
    ThenBy(e => e.FirstName).
    ThenBy(e => e.Age);

Notice in the definition of the query expression pattern that ThenBy operates on a sequence returned by OrderBy or ThenBy. Those sequences can contain markers that enable ThenBy to operate on the sorted subranges where the sort keys are equal.

This transformation is not the same if the orderby clauses are expressed as different clauses. The following query sorts the sequence entirely by LastName, then sorts the entire sequence again by FirstName, and then sorts again by Age:

// Not correct. Sorts the entire sequence 3 times.
var people = from e in employees
             where e.Age > 30
             orderby e.LastName
             orderby e.FirstName
             orderby e.Age
             select e;

As separate queries, you could specify that any of the orderby clauses use descending order:

var people = from e in employees
             where e.Age > 30
             orderby e.LastName descending
             thenby e.FirstName
             thenby e.Age
             select e; 

Orderby creates a different sequence type as its output so that thenby clauses can be more efficient. Also, so that the types are correct for the overall query. OrderBy cannot operate on an unordered sequence, only on a sorted sequence (typed as O<T> above). Subranges are already sorted and marked. If you create your own orderby and thenby methods for a type, you must adhere to this rule. You’ll need to add some identifier to each sorted subrange so that any subsequence thenby clause can work properly. ThenBy methods need to be typed to take the output of an OrderBy or ThenBy method, and sort each sub-range correctly. Everything I’ve said about OrderBy and ThenBy also apply to OrderByDescending and ThenByDesecnding. In fact, if your type should have a custom version of any of those methods, you should almost always implement all four of them.

The remaining expression translations involve multiple steps. Those queries involve groupings, or multiple from clauses that introduce continuations. Query expressions that contain continuations will be transformed into nested queries. Then those nested queries will be transformed into methods. This is a reasonably simple query with a continuation:

var results = from e in employees
              group e by e.Department into d
              select new { Department = d.Key, Size = d.Count() }; 

Before any other translations are performed, the continuation is translated into a nested query:

var results = from d in 
    from e in employees group e by e.Department
    select new { Department = d.Key, Size = d.Count()};

Once the nested query is created, the methods translate into:

var results = employees.GroupBy(e => e.Department).
    Select(d => new { Department = d.Key, Size = d.Count()});

The query I used above shows a GroupBy that returns a single sequence. The other GroupBy method in the query expression pattern will return a sequence of groups where each group contains a key and a list of values:

var results = from e in employees
    group e by e.Department into d
    select new { Department = d.Key, Employees = d.AsEnumerable()}; 

Obviously, that maps to the following method calls:

var results2 = employees.GroupBy(e => e.Department).
    Select(d => new { Department = d.Key, 
        Employees = d.AsEnumerable()});

GroupBy methods produce a seqeunce of Key / Value List pairs where the keys are the group selectors, and the values are the sequence of items in that group. The query select clause may create new objects for the values in each group. However, the output should always be a sequence of key / value pairs where the value contains some element created by each item in the input sequence that belongs to that particular group.

The last methods to understand are the SelectMany, Join, and GroupJoin. These three methods get more complicated because they all work with multiple input sequences. The methods that implement these translations perform the enumerations across multiple sequences and then flatten the resulting sequences into a single output sequence. SelectMany performs a cross join on the two source sequences. For example, this query:

int[] odds = {1,3,5,7};
int[] evens = {2,4,6,8};
var pairs = from oddNumber in odds
    from evenNumber  in evens
    select new {oddNumber, evenNumber, Sum=oddNumber+evenNumber};

Produces a sequence with 16 elements:

1,2, 3
1,4, 5
1,6, 7
1,8, 9
3,2, 5
3,4, 7
3,6, 9
3,8, 11
5,2, 7
5,4, 9
5,6, 11
5,8, 13
7,2, 9
7,4, 11
7,6, 13
7,8, 15

Query expressions that contain multiple select clauses are translated into a SelectMany method call. The example query above would be translated into the follow SelectMany call:

int[] odds = { 1, 3, 5, 7 };
int[] evens = { 2, 4, 6, 8 };
var values = odds.SelectMany(oddNumber => evens,
    (oddNumber, evenNumber) =>
    new { oddNumber, evenNumber, Sum = oddNumber + evenNumber });

The first parameter to the SelectMany is a function that maps each element in the first source sequence to the sequence of elements in the second source sequence. The second parameter, the output selector, creates the projections from the pairs of items in both sequences.

SelectMany() must iterate the first sequence. For each value in the first sequence, it iterates the second sequence, producing the result value from the pair of input values. The output selected will be called for each element in a flattened sequence of every combination of values from both sequences. One possible implementation of SelectMany is:

static IEnumerable<TOutput> SelectMany<T1, T2, TOutput>(this IEnumerable<T1> src,
    Func<T1, IEnumerable<T2>> inputSelector,
    Func<T1, T2, TOutput> resultSelector)
{
    foreach (T1 first in src)
    {
        foreach (T2 second in inputSelector(first))
            yield return resultSelector(first, second);
    }
}

The first input sequence is iterated. The second input sequence is iterated, using the current value on the input sequence. That’s important because the input selector on the second sequence may depend on the current value in the first sequence. Then, as each pair of elements is generated, the result selector is called on each pair.

If your query has more expressions, and SelectMany does not create the final result, SelectMany creates a tuple that contains one item from each input sequence. Sequences of that tuple are the input sequence for later expressions. For example, this modified version of the original query:

int[] odds = { 1, 3, 5, 7 };
int[] evens = { 2, 4, 6, 8 };
var values = from oddNumber in odds
            from evenNumber in evens
            where oddNumber > evenNumber
            select new { oddNumber, evenNumber, 
            Sum = oddNumber + evenNumber };

Produces this SelectMany method call:

odds.SelectMany(oddNumber => evens, 
    (oddNumber, evenNumber) => 
    new {oddNumber, evenNumber});

And the full query is then translated into this statement:

var values = odds.SelectMany(oddNumber => evens,
    (oddNumber, evenNumber) =>
    new { oddNumber, evenNumber })
    .Where(pair => pair.oddNumber > pair.evenNumber).
    Select(pair => new { 
        pair.oddNumber, 
        pair.evenNumber, 
        Sum = pair.oddNumber + pair.evenNumber });

You can see another interesting property in how SelectMany gets treated when the compiler translates multiple from clauses into SelectMany method calls. SelectMany composes well. More than two from clauses will produce more than one SelectMany() method call. The resulting pair from the first SelectMany() call will be fed into the second SelectMany() which will produce a triple. The triple will contain all combinations of all three sequences. This query:

var triples = from n in new int[] { 1, 2, 3 }
              from s in new string[] { "one", "two", "three" }
              from r in new string[] { "I", "II", "III" }
              select new { Arabic = n, Word = s, Roman = r };

Will be translated into the following method calls:

var numbers = new int[] {1,2,3};
var words = new string[] {"one", "two", "three"};
var romanNumerals = new string[] { "I", "II", "III" };
var triples = numbers.SelectMany(n => words,
    (n, s) => new { n, s}).
    SelectMany(pair => romanNumerals,
    (pair,n) => new { Arabic = pair.n, Word = pair.s, Roman = n });

You can see how you can extend from 3 to any arbitrary number of input sequences by applying more SelectMany() calls. These later examples also demonstrate how SelectMany can introduce anonymous types into your queries. The sequence returned from SelectMany() is a sequence of some anonymous type.

Two more translations to understand: Join and GroupJoin. Join and GroupJoin both obviously will be applied on join expressions. GroupJoin will always be used when the join expression contains an into clause. Join will be used when the join expression does not contain an into clause.

A join without an into looks like this:

var numbers = new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
var labels = new string[] { "0", "1", "2", "3", "4", "5" };
var query = from num in numbers
            join label in labels on num.ToString() equals label
            select new { num, label };

It translates into the following:

var query = numbers.Join(labels, num => num.ToString(), label => label,
    (num, label) => new { num, label });

The into clause creates a list of subdivided results:

var groups = from p in projects
             join t in tasks on p equals t.Parent into projTasks
             select new { Project = p, projTasks };

That translates into a GroupJoin:

var groups = projects.GroupJoin(tasks,
    p => p, t => t.Parent, (p, projTasks) =>
        new { Project = p, TaskList = projTasks });

The entire process of converting all expressions into method calls is very involved, and often tasks several steps.

The good news is that for the most part, you can happily go about your work secure in the knowledge that the compiler does the correct translation and because your type implements IEnumerable<T>, users of your type are getting the correct behavior.

But you may have that nagging urge to create your own version of one or more of the methods which implement the query expression pattern Maybe your collection type is always sorted on some key, and you can short circuit the OrderBy method. Maybe your type exposes lists of lists, which means you may find the GroupBy and GroupJoin can be implemented more efficiently. More ambitiously, maybe you intend to create your own provider and you’ll be implementing the entire pattern. That being the case, you need to understand the behavior of each query method, and know what should go into your implementation. Refer back to the examples above and make sure you understand the expected behavior of each query method before you embark on creating your own implementations.

Many of the custom types you define model some form of collection. The developers that use our types will expect to use our collections in the same way that they use every other collection type, with the built in query syntax. As long as you support the IEnumerable<T> interface for any type that models a collection, you’ll meet that expectation. However, your types may be able to improve upon the default implementation by making use of the internal specifics in your type. When you choose to do that, ensure that your type matches the contract from the query pattern, in all forms.

  • + Share This
  • 🔖 Save To Your Account

Discussions

comments powered by Disqus