Branch prediction optimization
Consider the following C++ code:
#include <algorithm>
#include <ctime>
#include <iostream>
int main()
{
// generate data
const unsigned arraySize = 32768;
int data[arraySize];
for (unsigned c = 0; c < arraySize; ++c)
data[c] = std::rand() % 256;
// !!! with this, the next loop runs faster
std::sort(data, data + arraySize);
// test
clock_t start = clock();
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i)
{
// primary loop
for (unsigned c = 0; c < arraySize; ++c)
{
if (data[c] >= 128)
sum += data[c];
}
}
double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
std::cout << elapsedTime << std::endl;
std::cout << "sum = " << sum << std::endl;
}
• Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
• With the sorted data, the code runs in 1.93 seconds
Now, to eleminate the branch replace:
if (data[c] >= 128)
sum += data[c];
with:
int t = (data[c] - 128) >> 31;
sum += ~t & data[c];
------------------------------------------
Core i7 920 @ 3.5 GHz -- Visual Studio 2010 - x64 Release
// Branch - Random
seconds = 11.777
// Branch - Sorted
seconds = 2.352
// Branchless - Random
seconds = 2.564
// Branchless - Sorted
seconds = 2.587
------------------------------------------
Core i3 M330 @ 2.13 GHz -- Visual Studio 2011 Beta - x64 Release
// Branch - Random
seconds = 20.026
// Branch - Sorted
seconds = 3.044
// Branchless - Random
seconds = 7.075
// Branchless - Sorted
seconds = 7.242
------------------------------------------
Observations:
• With the Branch: There is a huge difference between the sorted and unsorted data.
• With the Hack (branch-less): There is no difference between sorted and unsorted data.
• The hack is actually a tad slower than with the branch when the data is sorted
------------------------------------------
Additional Notes:
• GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.
• VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
• Intel Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
• If you give the Intel Compiler the branchless code, it just out-right vectorizes it... and is just as fast as with the branch (with the loop interchange).
------------------------------------------
Request:
Please adapt the optimization technique which Intel Compiler is using.
Extracted from http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array
1 comment
-
Real McCoy
commented
With loop-interchanging (http://en.wikipedia.org/wiki/Loop_interchange),
On Core i3 M330 @ 2.13 GHz -- Visual Studio 2011 Beta - x64 Release
// Branch - Random
seconds = 1.988// Branch - Sorted
seconds = 1.992// Branchless - Random
seconds = 2.628// Branchless - Sorted
seconds = 2.645