Branch prediction optimization

Consider the following C++ code:

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! with this, the next loop runs faster
    std::sort(data, data + arraySize);

    // test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}

• Without the std::sort(data, data + arraySize) call, the code runs in 11.54 seconds.
• With the sorted data, the code runs in 1.93 seconds.

Now, to eliminate the branch, replace:

if (data[c] >= 128)
    sum += data[c];

with:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];
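
The trick works because (data[c] - 128) >> 31 sign-extends the comparison into a mask: all ones when data[c] < 128 and zero otherwise, so ~t & data[c] keeps the value only when it is at least 128. Below is a minimal sketch of the same idea as a standalone helper; the helper name is purely illustrative, and the trick assumes arithmetic right shift of a negative signed int, which is implementation-defined before C++20 but holds on the compilers tested here.

// Branchless "add if >= 128" - hypothetical helper, not from the original post.
inline void addIfAtLeast128(long long& sum, int x)
{
    int t = (x - 128) >> 31;   // 0 when x >= 128, -1 (all bits set) when x < 128
    sum += ~t & x;             // ~t masks x to itself or to zero, no branch taken
}

// usage inside the primary loop: addIfAtLeast128(sum, data[c]);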

------------------------------------------

Core i7 920 @ 3.5 GHz -- Visual Studio 2010 - x64 Release

// Branch - Random
seconds = 11.777

// Branch - Sorted
seconds = 2.352

// Branchless - Random
seconds = 2.564

// Branchless - Sorted
seconds = 2.587

------------------------------------------

Core i3 M330 @ 2.13 GHz -- Visual Studio 2011 Beta - x64 Release

// Branch - Random
seconds = 20.026

// Branch - Sorted
seconds = 3.044

// Branchless - Random
seconds = 7.075

// Branchless - Sorted
seconds = 7.242

------------------------------------------

Observations:

• With the Branch: There is a huge difference between the sorted and unsorted data.
• With the Hack (branchless): There is no difference between sorted and unsorted data.
• The hack is actually a tad slower than the branch when the data is sorted.

------------------------------------------
Additional Notes:

• GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move, so there is no difference between the sorted and unsorted data: both are fast. (A source-level form that usually maps to a conditional move is sketched after these notes.)

• VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

• Intel Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test loop to defeat the benchmark...

• If you give the Intel Compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

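For reference, here is one source-level way to express the loop so that a compiler is more likely to emit a conditional move (or a vector compare-and-select). This is an illustrative sketch, not the code GCC actually produces, and compilers are free to lower it differently:

// Select-style rewrite of the primary loop: the contribution is chosen
// with a ternary instead of branching around the addition.
for (unsigned c = 0; c < arraySize; ++c)
    sum += (data[c] >= 128) ? data[c] : 0;

Whether a given compiler actually emits a conditional move for this pattern depends on the version and flags, so it is worth checking the generated assembly.
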
------------------------------------------
Request:

Please adapt the optimization technique that the Intel Compiler is using.
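
In the meantime, here is a minimal sketch of what the interchange looks like when done by hand on the benchmark above, assuming the goal is the same as ICC's: move the data-dependent test out of the hot repetition loop so it is evaluated only once per element. Once interchanged, the inner repetition loop collapses to a multiplication:

// Manual loop interchange: the unpredictable comparison now runs once per
// element instead of 100000 times per element.
long long sum = 0;
for (unsigned c = 0; c < arraySize; ++c)
{
    if (data[c] >= 128)
        sum += 100000LL * data[c];   // the 100000-iteration inner loop reduces to a multiply
}

This only helps because the benchmark repeats the same pass 100000 times; it does not remove the misprediction cost of a single pass over random data.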

Extracted from http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array
