Performance improvements in LINQ in .NET 7


LINQ (Language-Integrated Query) is a simple and convenient query language that allows you to express complex operations concisely. However, this simplicity comes at the price of execution speed and extra memory allocations. In most situations this has no significant effect, but in cases where performance is critical, these limitations can be quite unpleasant.

The .NET 7 update has enhanced the performance of the following methods: Enumerable.Max, Enumerable.Min, Enumerable.Average, and Enumerable.Sum.

Let's see how their performance increased using the following benchmark:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Generic;
using System.Linq;

[MemoryDiagnoser(displayGenColumns: false)]
public partial class Program
{
    static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    [Params(10, 10000)]
    public int Size { get; set; }

    private IEnumerable<int> items;

    [GlobalSetup]
    public void Setup()
    {
        items = Enumerable.Range(1, Size).ToArray();
    }

    [Benchmark]
    public int Min() => items.Min();

    [Benchmark]
    public int Max() => items.Max();

    [Benchmark]
    public double Average() => items.Average();

    [Benchmark]
    public int Sum() => items.Sum();
}

The results show that the execution time for finding the minimum element of an array has decreased considerably: about 10 times for small arrays and about 20 times for arrays of 10,000 elements. The other methods show similar improvements (except Sum, where the size of the collection had little effect on the results).

It is also worth noting that in .NET 7 no additional memory is allocated when methods are called.

In .NET 6, all operations on arrays are much faster than on lists, and the same is true for small collections in .NET 7. But as the number of elements increases, lists catch up with arrays in performance!

According to the test results, performance on lists improved by up to 31 times.

But how could this be achieved?

Let's take a closer look at the implementation of the Min method. Here is how it is implemented in .NET 6:

public static int Min(this IEnumerable<int> source)
{
    if (source == null)
    {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.source);
    }

    int value;
    using (IEnumerator<int> e = source.GetEnumerator())
    {
        if (!e.MoveNext())
        {
            ThrowHelper.ThrowNoElementsException();
        }

        value = e.Current;
        while (e.MoveNext())
        {
            int x = e.Current;
            if (x < value)
            {
                value = x;
            }
        }
    }

    return value;
}

The method is quite simple: it takes the first element of the IEnumerable<int> collection as the current minimum, then advances through the remaining elements with MoveNext, comparing each one to the current minimum and keeping the smaller value until the end of the collection.

The new version of the Min method is quite different:

public static int Min(this IEnumerable<int> source) => MinInteger(source);

The MinInteger method is applied to collections of integers. Let's examine it in more detail:

private static T MinInteger<T>(this IEnumerable<T> source)
    where T : struct, IBinaryInteger<T>
{
    T value;

    if (source.TryGetSpan(out ReadOnlySpan<T> span))
    {
        if (Vector.IsHardwareAccelerated && span.Length >= Vector<T>.Count * 2)
        {
            // Optimized implementation
            return ....;
        }
    }

    // Implementation as in .NET 6
}

In .NET 7, one of the biggest improvements to LINQ is part of a much larger set of optimizations throughout the runtime: vectorization.

Vectorization is the process of changing an implementation to use vector instructions, which are SIMD (Single Instruction, Multiple Data) instructions capable of processing multiple pieces of data at the same time.

Imagine that you wanted to determine whether an array of 1,000,000 bytes contained any zeros. You could loop over every byte in the array looking for 0, in which case you'd be performing 1,000,000 reads and comparisons. But what if you instead treated the array of 1,000,000 bytes as a span of 250,000 Int32 values? You'd then only need to perform 250,000 read and comparison operations, and since the cost of reading and comparing an Int32 is generally no more expensive than the cost of reading and comparing a byte, you'd have just quadrupled the throughput of your loop. What if you instead handled it as a span of 125,000 Int64 values? What if you could process even more of the data at a time? That's vectorization.
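To make the idea concrete, here is a minimal sketch of the widening trick described above. It is not the BCL implementation: it reinterprets the bytes as uints with MemoryMarshal.Cast and uses the classic "has a zero byte" bit trick so that a single comparison tests four bytes at once.

```csharp
using System;
using System.Runtime.InteropServices;

static class ZeroScan
{
    // Scalar baseline: one read and one comparison per byte.
    public static bool ContainsZeroScalar(ReadOnlySpan<byte> data)
    {
        foreach (byte b in data)
            if (b == 0) return true;
        return false;
    }

    // Widened version: reinterpret the buffer as uints and test
    // four bytes per iteration.
    public static bool ContainsZeroWidened(ReadOnlySpan<byte> data)
    {
        ReadOnlySpan<uint> words = MemoryMarshal.Cast<byte, uint>(data);
        foreach (uint w in words)
        {
            // This expression sets the high bit of any byte of w
            // that was zero; if the result is non-zero, a zero byte
            // is present somewhere in these four bytes.
            if (((w - 0x01010101u) & ~w & 0x80808080u) != 0)
                return true;
        }

        // Check the trailing bytes that didn't fill a whole uint.
        for (int i = words.Length * 4; i < data.Length; i++)
            if (data[i] == 0) return true;

        return false;
    }
}
```

The widened version performs roughly a quarter of the loop iterations of the scalar one; vectorization takes the same idea further by processing 16, 32, or 64 bytes per instruction.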

Modern hardware provides the ability to process 128 bits, 256 bits, even 512 bits at a time (referred to as the width of the vector), with a plethora of instructions for performing various operations over a vector of data at a time. As you might guess, using these instructions can result in absolutely massive performance speedups. Many of these instructions were surfaced for direct consumption as “hardware intrinsics” in .NET Core 3.1 and .NET 5, but using those directly requires advanced know-how and is only recommended when absolutely necessary.

Higher-level support has previously been exposed via the Vector<T> type, which lets you write code in terms of Vector<T>; the JIT then compiles that usage down to the best available instructions for the given system. Vector<T> is referred to as "variable width" because, depending on the system the code actually ends up running on, it might map to 128-bit or 256-bit instructions. Because of that variable nature, the operations you can perform with it are somewhat limited. .NET 7 introduces the new fixed-width Vector128<T> and Vector256<T> types, which are much more flexible.
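For illustration, here is a small sketch using the fixed-width Vector128<T>: it always holds exactly 128 bits (four ints), regardless of which instruction set the JIT ultimately targets.

```csharp
using System;
using System.Runtime.Intrinsics;

// Two fixed-width vectors of four ints each.
Vector128<int> a = Vector128.Create(5, 3, 9, 1);
Vector128<int> b = Vector128.Create(2, 7, 4, 8);

// Element-wise minimum of the two vectors, computed as one
// vector operation instead of four scalar comparisons.
Vector128<int> mins = Vector128.Min(a, b);

for (int i = 0; i < Vector128<int>.Count; i++)
    Console.WriteLine(mins.GetElement(i));
```

Each lane of the result is the smaller of the corresponding lanes of the inputs, which is exactly the building block the vectorized Min uses.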

Many of the public APIs in .NET 7 itself are now vectorized using one or more of these approaches.

The Span type has also undergone some changes in the new version of C#. Since C# 11 introduced the option to declare ref fields in a ref struct, the internal representation of Span<T> has changed. Previously, a special internal type, ByReference<T>, was used to store a reference to the beginning of the memory region, but it performed no safety checks. Now ref fields are used for this purpose, which makes working with memory safer.
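To show what a ref field looks like, here is a minimal, hypothetical span-like ref struct (an illustration of the C# 11 feature, not the real Span<T> source):

```csharp
using System;
using System.Runtime.CompilerServices;

ref struct TinySpan<T>
{
    // A ref field: stores a managed reference to the first element.
    // Before C# 11 this required the internal ByReference<T> type.
    private readonly ref T _reference;
    private readonly int _length;

    public TinySpan(ref T reference, int length)
    {
        _reference = ref reference;
        _length = length;
    }

    public int Length => _length;

    public ref T this[int index]
    {
        get
        {
            // Unlike ByReference<T>, callers go through this bounds check.
            if ((uint)index >= (uint)_length)
                throw new IndexOutOfRangeException();
            return ref Unsafe.Add(ref _reference, index);
        }
    }
}
```

Because the indexer returns a ref, reads and writes go straight to the underlying memory, just as with the real Span<T>.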

But let's get back to the Min method. When you get ReadOnlySpan, the method tries to perform a vector search using the Vector<T> type. To do this, the following condition should be met:

if (Vector.IsHardwareAccelerated && span.Length >= Vector<T>.Count * 2)

The first part of the condition checks whether the Vector.IsHardwareAccelerated property returns true. Let's take a look at the implementation of this property:

public static bool IsHardwareAccelerated
{
    [Intrinsic]
    get => false;
}

The [Intrinsic] attribute is applied to the getter. It indicates that the value returned by IsHardwareAccelerated can be replaced by the JIT compiler. The property returns true if hardware acceleration can be applied to vector operations through built-in JIT support; otherwise, it returns false. To enable hardware acceleration, build for the x64 platform in the Release configuration, or build for AnyCPU with the "Prefer 32-bit" setting disabled.
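As a quick check, you can print what the JIT reports on your machine; the values depend on the CPU and on the build configuration described above.

```csharp
using System;
using System.Numerics;

// Whether the JIT replaced the getter with hardware-accelerated support.
Console.WriteLine($"Hardware accelerated: {Vector.IsHardwareAccelerated}");

// How many elements of each type fit in one Vector<T>
// (e.g. 8 ints and 32 bytes on a machine with 256-bit vectors).
Console.WriteLine($"Vector<int>.Count:  {Vector<int>.Count}");
Console.WriteLine($"Vector<byte>.Count: {Vector<byte>.Count}");
```
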

To fulfill the second part of the condition, the size of Span should be at least 2 times the size of the vector.

How is the size of this vector calculated?

The Vector<T> data type is used in the new implementation of the Min method to create a vector. This type is a wrapper over three other types: Vector64<T>, Vector128<T>, and Vector256<T>. They contain vectorized data of the corresponding length. For example, Vector128<T> can store 16 8-bit, 8 16-bit, 4 32-bit, or 2 64-bit values. The type to use is selected depending on whether the processor supports it.

Thus, if the Vector128<T> type is used when executing the method, then the Span type obtained from the array or list must contain 8 or more elements of the int type.

If all conditions are met, the method uses the advantages of the ReadOnlySpan and Vector types to optimize finding the minimum:

private static T MinInteger<T>(this IEnumerable<T> source)
    where T : struct, IBinaryInteger<T>
{
    ....
    if (Vector.IsHardwareAccelerated && span.Length >= Vector<T>.Count * 2)
    {
        var mins = new Vector<T>(span);
        index = Vector<T>.Count;
        do
        {
            mins = Vector.Min(mins, new Vector<T>(span.Slice(index)));
            index += Vector<T>.Count;
        }
        while (index + Vector<T>.Count <= span.Length);

        value = mins[0];
        for (int i = 1; i < Vector<T>.Count; i++)
        {
            if (mins[i] < value)
            {
                value = mins[i];
            }
        }
        ....
    }
    ....
}

Let's explore how the vectorized implementation of finding a minimum works. A vector is created from the first N Span elements (N depends on the type of vector; for Vector128<int> it is 4 elements). This vector is compared with a vector of the following N elements in Vector.Min. The resulting vector contains the minimum value of each pair of elements of the two given vectors. Then the next part of Span is taken, and the search continues. You only need to find the minimum value among those stored in the resulting vector.
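Putting the pieces together, here is a self-contained sketch of the same technique for int spans. It is a simplification, not the actual BCL code: it assumes the span holds at least 2 * Vector<int>.Count elements and handles the tail by re-reading the last full vector, which is safe for Min because overlapping elements don't change the result.

```csharp
using System;
using System.Numerics;

static class VectorizedMin
{
    // Simplified sketch; assumes span.Length >= 2 * Vector<int>.Count.
    public static int Min(ReadOnlySpan<int> span)
    {
        // Start with the first vector-width chunk.
        var mins = new Vector<int>(span);
        int index = Vector<int>.Count;

        // Fold the rest of the span into `mins`, one chunk at a time.
        do
        {
            mins = Vector.Min(mins, new Vector<int>(span.Slice(index)));
            index += Vector<int>.Count;
        }
        while (index + Vector<int>.Count <= span.Length);

        // Cover the tail by re-reading the last full vector; the overlap
        // is harmless because taking the minimum twice changes nothing.
        mins = Vector.Min(
            mins, new Vector<int>(span.Slice(span.Length - Vector<int>.Count)));

        // Reduce the vector's lanes to a single scalar minimum.
        int value = mins[0];
        for (int i = 1; i < Vector<int>.Count; i++)
        {
            if (mins[i] < value)
            {
                value = mins[i];
            }
        }

        return value;
    }
}
```
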

The advantage of using a vector is that all operations on its elements occur simultaneously.

An example of using Span and vectorization to optimize the Min method is shown above. But what about the other methods? You can find similar optimizations in the Max, Average, and Sum methods. The optimized versions of these methods are available for both arrays and lists of the int, long, float, double, and decimal types.
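For Sum the same pattern applies, with one twist: the tail can't be handled by overlapping reads (elements would be counted twice), so a scalar loop finishes it off. The sketch below illustrates the idea only; the real implementation also performs overflow checks, which this version omits.

```csharp
using System;
using System.Numerics;

static class VectorizedSum
{
    // Simplified, overflow-unchecked sketch of a vectorized Sum.
    public static int Sum(ReadOnlySpan<int> span)
    {
        var acc = Vector<int>.Zero;
        int index = 0;

        // Accumulate full vector-width chunks lane by lane.
        while (index + Vector<int>.Count <= span.Length)
        {
            acc += new Vector<int>(span.Slice(index));
            index += Vector<int>.Count;
        }

        // Horizontally add the lanes of the accumulator.
        int sum = Vector.Sum(acc);

        // Finish the leftover elements with a scalar loop.
        for (; index < span.Length; index++)
        {
            sum += span[index];
        }

        return sum;
    }
}
```
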

Conclusion

Microsoft developers used Span and hardware-accelerated vector operations to bring a visible enhancement to LINQ performance in .NET 7. Now we can use these methods even in cases where execution speed and memory allocation are critical.