Should I always use Parallel.ForEach because more threads MUST speed up everything?

Tags: C# | .Net 4.0 | Foreach | Parallel Processing

C# Problem Overview


Does it make sense to use a Parallel.ForEach loop in place of every normal foreach loop?

When should I start using Parallel.ForEach? Only when iterating over 1,000,000 items?

C# Solutions


Solution 1 - C#

No, it doesn't make sense for every foreach. Some reasons:

  • Your code may not actually be parallelizable. For example, if you're using the "results so far" for the next iteration and the order is important.
  • If you're aggregating (e.g. summing values) then there are ways of using Parallel.ForEach for this, but you shouldn't just do it blindly (see the sketch after this list).
  • If your work will complete very fast anyway, there's no benefit, and it may well slow things down.
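As a hedged sketch of the aggregation point (this example is mine, not part of the original answer): Parallel.ForEach has an overload that takes a per-thread seed and a final merge step, so each thread sums locally and synchronisation happens only once per thread.

using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class AggregationSketch
{
    static void Main()
    {
        int[] values = Enumerable.Range(1, 1000000).ToArray();
        long total = 0;

        Parallel.ForEach(
            values,
            () => 0L,                                          // per-thread initial subtotal
            (value, loopState, subtotal) => subtotal + value,  // accumulate locally, lock-free
            subtotal => Interlocked.Add(ref total, subtotal)); // merge each thread's subtotal once

        Console.WriteLine(total); // 500000500000
    }
}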

Basically nothing in threading should be done blindly. Think about where it actually makes sense to parallelize. Oh, and measure the impact to make sure the benefit is worth the added complexity. (Debugging, for example, becomes harder.) TPL is great, but it's no free lunch.

Solution 2 - C#

No, you should definitely not do that. The important point here is not really the number of iterations, but the work to be done. If your work is really simple, executing 1,000,000 delegates in parallel will add huge overhead and will most likely be slower than a traditional single-threaded solution. You can get around this by partitioning the data, so that you execute chunks of work instead.

E.g. consider the situation below:

const int Count = 1000000; // any reasonably large count; the original snippet leaves Count undefined
var Input = Enumerable.Range(1, Count).ToArray();
var Result = new double[Count];

Parallel.ForEach(Input, (value, loopState, index) => { Result[index] = value * Math.PI; });

The operation here is so simple, that the overhead of doing this in parallel will dwarf the gain of using multiple cores. This code runs significantly slower than a regular foreach loop.

By using a partition we can reduce the overhead and actually observe a gain in performance.

// Partitioner.Create lives in System.Collections.Concurrent
Parallel.ForEach(Partitioner.Create(0, Input.Length), range => {
   for (var index = range.Item1; index < range.Item2; index++) {
      Result[index] = Input[index] * Math.PI;
   }
});

The moral of the story here is that parallelism is hard, and you should only employ it after looking closely at the situation at hand. Additionally, you should profile the code both before and after adding parallelism.

Remember that regardless of any potential performance gain parallelism always adds complexity to the code, so if the performance is already good enough, there's little reason to add the complexity.

Solution 3 - C#

The short answer is no, you should not just use Parallel.ForEach or related constructs on each loop that you can. Parallel has some overhead, which is not justified in loops with few, fast iterations. Also, break is significantly more complex inside these loops.
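To make the point about break concrete, here is a minimal sketch (not from the original answer) of how early exit works in a parallel loop: you go through ParallelLoopState rather than a break statement, and iterations already running on other threads still finish.

using System;
using System.Linq;
using System.Threading.Tasks;

class BreakSketch
{
    static void Main()
    {
        var items = Enumerable.Range(0, 1000).ToArray();

        ParallelLoopResult result = Parallel.ForEach(items, (item, loopState) =>
        {
            if (item == 500)
            {
                // Break() only *requests* that no iterations beyond this index start;
                // iterations already in flight on other threads run to completion.
                // Stop() would abandon all remaining iterations instead.
                loopState.Break();
            }
        });

        Console.WriteLine("IsCompleted: {0}, LowestBreakIteration: {1}",
            result.IsCompleted, result.LowestBreakIteration);
    }
}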

Parallel.ForEach is a request to schedule the loop as the task scheduler sees fit, based on number of iterations in the loop, number of CPU cores on the hardware and current load on that hardware. Actual parallel execution is not always guaranteed, and is less likely if there are fewer cores, the number of iterations is low and/or the current load is high.

See also "Does Parallel.ForEach limits the number of active threads?" and "Does Parallel.For use one Task per iteration?" on Stack Overflow.

The long answer:

We can classify loops by how they fall on two axes:

  1. Few iterations through to many iterations.
  2. Each iteration is fast through to each iteration is slow.

A third factor is whether the tasks vary much in duration – for instance, if you are calculating points on the Mandelbrot set, some points are quick to calculate and some take much longer.

When there are few, fast iterations, it's probably not worth using parallelisation in any way; most likely it will end up slower due to the overheads. Even if parallelisation does speed up a particular small, fast loop, it's unlikely to be of interest: the gains will be small and it's not a performance bottleneck in your application, so optimise for readability, not performance.

Where a loop has very few, slow iterations and you want more control, you may consider using Tasks to handle them, along the lines of:

// "actions" is assumed to be a collection of slow Action delegates
var tasks = new List<Task>(actions.Length); 
foreach (var action in actions) 
{ 
    tasks.Add(Task.Factory.StartNew(action)); 
} 
Task.WaitAll(tasks.ToArray());

Where there are many iterations, Parallel.ForEach is in its element.

The Microsoft documentation states that

> When a parallel loop runs, the TPL partitions the data source so that the loop can operate on multiple parts concurrently. Behind the scenes, the Task Scheduler partitions the task based on system resources and workload. When possible, the scheduler redistributes work among multiple threads and processors if the workload becomes unbalanced.

This partitioning and dynamic re-scheduling is going to be harder to do effectively as the number of loop iterations decreases, and is more necessary if the iterations vary in duration and in the presence of other tasks running on the same machine.

I ran some code.

The test results below are from a machine with nothing else running on it, and no other threads from the .NET thread pool in use. This is not typical (in fact, in a web-server scenario it is wildly unrealistic). In practice, you may not see any parallelisation with a small number of iterations.

The test code is:

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

namespace ParallelTests 
{ 
    class Program 
    { 
        private static int Fibonacci(int x) 
        { 
            if (x <= 1) 
            { 
                return 1; 
            } 
            return Fibonacci(x - 1) + Fibonacci(x - 2); 
        } 

        private static void DummyWork() 
        { 
            var result = Fibonacci(10); 
            // Inspect the result so it is not optimised away. 
            // We know that the exception is never thrown. The compiler does not. 
            if (result > 300) 
            { 
                throw new Exception("failed to to it"); 
            } 
        } 

        private const int TotalWorkItems = 2000000; 

        private static void SerialWork(int outerWorkItems) 
        { 
            int innerLoopLimit = TotalWorkItems / outerWorkItems; 
            for (int index1 = 0; index1 < outerWorkItems; index1++) 
            { 
                InnerLoop(innerLoopLimit); 
            } 
        } 

        private static void InnerLoop(int innerLoopLimit) 
        { 
            for (int index2 = 0; index2 < innerLoopLimit; index2++) 
            { 
                DummyWork(); 
            } 
        } 

        private static void ParallelWork(int outerWorkItems) 
        { 
            int innerLoopLimit = TotalWorkItems / outerWorkItems; 
            var outerRange = Enumerable.Range(0, outerWorkItems); 
            Parallel.ForEach(outerRange, index1 => 
            { 
                InnerLoop(innerLoopLimit); 
            }); 
        } 

        private static void TimeOperation(string desc, Action operation) 
        { 
            Stopwatch timer = new Stopwatch(); 
            timer.Start(); 
            operation(); 
            timer.Stop(); 

            string message = string.Format("{0} took {1:mm}:{1:ss}.{1:ff}", desc, timer.Elapsed); 
            Console.WriteLine(message); 
        } 

        static void Main(string[] args) 
        { 
            TimeOperation("serial work: 1", () => Program.SerialWork(1)); 
            TimeOperation("serial work: 2", () => Program.SerialWork(2)); 
            TimeOperation("serial work: 3", () => Program.SerialWork(3)); 
            TimeOperation("serial work: 4", () => Program.SerialWork(4)); 
            TimeOperation("serial work: 8", () => Program.SerialWork(8)); 
            TimeOperation("serial work: 16", () => Program.SerialWork(16)); 
            TimeOperation("serial work: 32", () => Program.SerialWork(32)); 
            TimeOperation("serial work: 1k", () => Program.SerialWork(1000)); 
            TimeOperation("serial work: 10k", () => Program.SerialWork(10000)); 
            TimeOperation("serial work: 100k", () => Program.SerialWork(100000)); 

            TimeOperation("parallel work: 1", () => Program.ParallelWork(1)); 
            TimeOperation("parallel work: 2", () => Program.ParallelWork(2)); 
            TimeOperation("parallel work: 3", () => Program.ParallelWork(3)); 
            TimeOperation("parallel work: 4", () => Program.ParallelWork(4)); 
            TimeOperation("parallel work: 8", () => Program.ParallelWork(8)); 
            TimeOperation("parallel work: 16", () => Program.ParallelWork(16)); 
            TimeOperation("parallel work: 32", () => Program.ParallelWork(32)); 
            TimeOperation("parallel work: 64", () => Program.ParallelWork(64)); 
            TimeOperation("parallel work: 1k", () => Program.ParallelWork(1000)); 
            TimeOperation("parallel work: 10k", () => Program.ParallelWork(10000)); 
            TimeOperation("parallel work: 100k", () => Program.ParallelWork(100000)); 

            Console.WriteLine("done"); 
            Console.ReadLine(); 
        } 
    } 
} 

The results on a 4-core Windows 7 machine are:

serial work: 1 took 00:02.31 
serial work: 2 took 00:02.27 
serial work: 3 took 00:02.28 
serial work: 4 took 00:02.28 
serial work: 8 took 00:02.28 
serial work: 16 took 00:02.27 
serial work: 32 took 00:02.27 
serial work: 1k took 00:02.27 
serial work: 10k took 00:02.28 
serial work: 100k took 00:02.28 

parallel work: 1 took 00:02.33 
parallel work: 2 took 00:01.14 
parallel work: 3 took 00:00.96 
parallel work: 4 took 00:00.78 
parallel work: 8 took 00:00.84 
parallel work: 16 took 00:00.86 
parallel work: 32 took 00:00.82 
parallel work: 64 took 00:00.80 
parallel work: 1k took 00:00.77 
parallel work: 10k took 00:00.78 
parallel work: 100k took 00:00.77 
done

Running the code compiled against .NET 4 and .NET 4.5 gives much the same results.

The serial work runs are all the same. It doesn't matter how you slice it, it runs in about 2.28 seconds.

The parallel work with 1 iteration takes slightly longer than no parallelism at all; 2 items is shorter, so is 3, and with 4 or more iterations it all takes about 0.8 seconds.

It is using all cores, but not with 100% efficiency. If the serial work were divided 4 ways with no overhead, it would complete in 0.57 seconds (2.28 / 4 = 0.57).

In other scenarios I saw no speed-up at all with 2-3 parallel iterations. You do not have fine-grained control over that with Parallel.ForEach, and the algorithm may decide to "partition" them into just 1 chunk and run it on 1 core if the machine is busy.

Solution 4 - C#

There is no lower limit for doing parallel operations. If you have only 2 items to work on but each one will take a while, it might still make sense to use Parallel.ForEach. On the other hand, if you have 1,000,000 items but they don't do very much, the parallel loop might not go any faster than the regular loop.

For example, I wrote a simple program to time nested loops where the outer loop ran both with a for loop and with Parallel.ForEach. I timed it on my 4-CPU (dual-core, hyperthreaded) laptop.
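The answer doesn't include the program itself; this is a minimal sketch of that kind of harness, with illustrative names and a made-up unit of inner work (not the original author's code).

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class NestedLoopTimer
{
    // Simulated per-item work: spin through the inner loop accumulating a value
    // so the JIT cannot trivially discard it.
    static long InnerWork(int innerIterations)
    {
        long sum = 0;
        for (int i = 0; i < innerIterations; i++)
        {
            sum += i;
        }
        return sum;
    }

    static void Time(int outerIterations, int innerIterations)
    {
        Console.WriteLine("{0} outer iterations, {1} inner iterations:", outerIterations, innerIterations);

        var timer = Stopwatch.StartNew();
        for (int i = 0; i < outerIterations; i++)
        {
            InnerWork(innerIterations);
        }
        timer.Stop();
        Console.WriteLine("for loop: {0}", timer.Elapsed);

        timer.Restart();
        Parallel.ForEach(Enumerable.Range(0, outerIterations), i => InnerWork(innerIterations));
        timer.Stop();
        Console.WriteLine("ForEach : {0}", timer.Elapsed);
    }

    static void Main()
    {
        Time(2, 100000000);   // few items, each slow
        Time(100000000, 2);   // many items, each fast
    }
}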

Here's a run with only 2 items to work on, but each takes a while:

2 outer iterations, 100000000 inner iterations:
for loop: 00:00:00.1460441
ForEach : 00:00:00.0842240

Here's a run with millions of items to work on, but they don't do very much:

100000000 outer iterations, 2 inner iterations:
for loop: 00:00:00.0866330
ForEach : 00:00:02.1303315

The only real way to know is to try it.

Solution 5 - C#

In general, once you go above a thread per core, each extra thread involved in an operation will make it slower, not faster.

However, if part of each operation will block (the classic example being waiting on disk or network I/O, another being producers and consumers that are out of synch with each other) then more threads than cores can begin to speed things up again, because tasks can be done while other threads are unable to make progress until the I/O operation returns.
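A rough sketch of that idea (mine, not part of the original answer; the URL and degree of parallelism are purely illustrative): for work dominated by blocking network I/O, allowing more concurrent workers than cores can improve throughput, because most workers are waiting rather than computing.

using System;
using System.Linq;
using System.Net;
using System.Threading.Tasks;

class IoBoundSketch
{
    static void Main()
    {
        // Hypothetical work list; each item spends most of its time waiting on the network.
        var urls = Enumerable.Repeat("http://example.com/", 32).ToArray();

        // For CPU-bound work, exceeding the core count usually hurts.
        // For blocking I/O, oversubscribing (here: 16 workers) can help.
        Parallel.ForEach(
            urls,
            new ParallelOptions { MaxDegreeOfParallelism = 16 },
            url =>
            {
                using (var client = new WebClient())
                {
                    var content = client.DownloadString(url);
                    Console.WriteLine("{0}: {1} chars", url, content.Length);
                }
            });
    }
}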

For this reason, when single-core machines were the norm, the only real justifications for multi-threading were either blocking of the sort I/O introduces, or improving responsiveness (slightly slower to perform a task, but much quicker to start responding to user input again).

Still, these days single-core machines are increasingly rare, so it would appear that you should be able to make everything at least twice as fast with parallel processing.

This will still not be the case if order is important, or if something inherent to the task forces it to have a synchronised bottleneck, or if the number of operations is so small that the increase in speed from parallel processing is outweighed by the overheads involved in setting up that parallel processing. It may or may not be the case if a shared resource requires threads to block on other threads performing the same parallel operation (depending on the degree of lock contention).

Also, if your code is inherently multithreaded to begin with, you can be in a situation where you are essentially competing for resources with yourself (a classic case being ASP.NET code handling simultaneous requests). Here the advantage of parallel operation may mean that a single test operation on a 4-core machine approaches 4 times the performance, but once the number of requests needing the same task performed reaches 4, then since each of those 4 requests is trying to use every core, it becomes little better than if they each had a core to themselves (perhaps slightly better, perhaps slightly worse). The benefits of parallel operation hence disappear as the use changes from a single-request test to a real-world multitude of requests.

Solution 6 - C#

You shouldn't blindly replace every single foreach loop in your application with a parallel foreach. More threads don't necessarily mean that your application will work faster. You need to slice the task into smaller tasks which can run in parallel if you want to really benefit from multiple threads. If your algorithm is not parallelizable, you won't get any benefit.

Solution 7 - C#

No. You need to understand what the code is doing and whether it is amenable to parallelization. Dependencies between your data items can make it hard to parallelize: if a thread uses the value calculated for the previous element, it has to wait until that value is calculated anyway, and can't run in parallel. You also need to understand your target architecture, though you will typically have a multicore CPU on just about anything you buy these days. Even on a single core you can get some benefit from more threads, but only if you have some blocking tasks. You should also keep in mind that there is overhead in creating and organizing the parallel threads. If this overhead is a significant fraction of (or more than) the time your task takes, you could slow it down.
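For instance (an illustrative example, not from the original answer), a running total has a loop-carried dependency: each iteration reads the previous iteration's result, so the iterations cannot simply be run in parallel.

using System;

class DependencySketch
{
    static void Main()
    {
        double[] values = { 1, 2, 3, 4, 5 };
        var runningTotal = new double[values.Length];

        runningTotal[0] = values[0];
        for (int i = 1; i < values.Length; i++)
        {
            // Depends on iteration i - 1; a naive Parallel.For here would be wrong.
            runningTotal[i] = runningTotal[i - 1] + values[i];
        }

        Console.WriteLine(string.Join(", ", runningTotal)); // 1, 3, 6, 10, 15
    }
}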

Solution 8 - C#

These are my benchmarks showing that pure serial is slowest, comparing various levels of partitioning.

using System;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        NativeDllCalls(true, 1, 400000000, 0);  // Seconds:     0.67 |)   595,203,995.01 ops
        NativeDllCalls(true, 1, 400000000, 3);  // Seconds:     0.91 |)   439,052,826.95 ops
        NativeDllCalls(true, 1, 400000000, 4);  // Seconds:     0.80 |)   501,224,491.43 ops
        NativeDllCalls(true, 1, 400000000, 8);  // Seconds:     0.63 |)   635,893,653.15 ops
        NativeDllCalls(true, 4, 100000000, 0);  // Seconds:     0.35 |) 1,149,359,562.48 ops
        NativeDllCalls(true, 400, 1000000, 0);  // Seconds:     0.24 |) 1,673,544,236.17 ops
        NativeDllCalls(true, 10000, 40000, 0);  // Seconds:     0.22 |) 1,826,379,772.84 ops
        NativeDllCalls(true, 40000, 10000, 0);  // Seconds:     0.21 |) 1,869,052,325.05 ops
        NativeDllCalls(true, 1000000, 400, 0);  // Seconds:     0.24 |) 1,652,797,628.57 ops
        NativeDllCalls(true, 100000000, 4, 0);  // Seconds:     0.31 |) 1,294,424,654.13 ops
        NativeDllCalls(true, 400000000, 0, 0);  // Seconds:     1.10 |)   364,277,890.12 ops
    }

    // ServiceContracts.ForNativeCall is the author's own native-call wrapper (not shown here).
    static void NativeDllCalls(bool useStatic, int nonParallelIterations, int parallelIterations = 0, int maxParallelism = 0)
    {
        if (useStatic)
        {
            Iterate<string, object>(
                (msg, cntxt) => {
                    ServiceContracts.ForNativeCall.SomeStaticCall(msg);
                }
                , "test", null, nonParallelIterations, parallelIterations, maxParallelism);
        }
        else
        {
            var instance = new ServiceContracts.ForNativeCall();
            Iterate(
                (msg, cntxt) => {
                    cntxt.SomeCall(msg);
                }
                , "test", instance, nonParallelIterations, parallelIterations, maxParallelism);
        }
    }

    static void Iterate<T, C>(Action<T, C> action, T testMessage, C context, int nonParallelIterations, int parallelIterations = 0, int maxParallelism = 0)
    {
        var start = DateTime.UtcNow;
        if (nonParallelIterations == 0)
            nonParallelIterations = 1; // normalize values

        if (parallelIterations == 0)
            parallelIterations = 1;

        if (parallelIterations > 1)
        {
            ParallelOptions options;
            if (maxParallelism == 0) // default max parallelism
                options = new ParallelOptions();
            else
                options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };

            if (nonParallelIterations > 1)
            {
                // Partitioned: each parallel task runs a serial chunk of iterations.
                Parallel.For(0, parallelIterations, options, (j) => {
                    for (int i = 0; i < nonParallelIterations; ++i)
                    {
                        action(testMessage, context);
                    }
                });
            }
            else
            {
                // No inner serial loop: one delegate invocation per parallel iteration.
                Parallel.For(0, parallelIterations, options, (j) => {
                    action(testMessage, context);
                });
            }
        }
        else
        {
            for (int i = 0; i < nonParallelIterations; ++i)
            {
                action(testMessage, context);
            }
        }

        var end = DateTime.UtcNow;

        Console.WriteLine("\tSeconds: {0,8:0.00} |) {1,16:0,000.00} ops",
            (end - start).TotalSeconds, (Math.Max(parallelIterations, 1) * nonParallelIterations / (end - start).TotalSeconds));
    }
}

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type      Original Author    Original Content on Stackoverflow
Question          Elisabeth          View Question on Stackoverflow
Solution 1 - C#   Jon Skeet          View Answer on Stackoverflow
Solution 2 - C#   Brian Rasmussen    View Answer on Stackoverflow
Solution 3 - C#   Anthony            View Answer on Stackoverflow
Solution 4 - C#   Gabe               View Answer on Stackoverflow
Solution 5 - C#   Jon Hanna          View Answer on Stackoverflow
Solution 6 - C#   Darin Dimitrov     View Answer on Stackoverflow
Solution 7 - C#   tvanfosson         View Answer on Stackoverflow
Solution 8 - C#   AaronLS            View Answer on Stackoverflow