Thursday, March 31, 2011

My Quest: Sort 100 Million Integers With OpenMP

Over the past couple of weeks I have been teaching myself OpenMP. I wonted to see how much faster I could make my previous Quick Sort with Insertion Sort implementation. What is so great about OpenMP is it allows you to transform a single threaded code bit into a multi-threaded one with just a few lines of code. Here is my implementation:


void QuickSortImpl::Sort(vector< int >::iterator start, vector< int >::iterator end)
{
int numProcs = omp_get_num_procs();
omp_set_num_threads(numProcs);
#pragma omp parallel
{
#pragma omp single nowait
Sort_h(start, end-1);
}
}

void QuickSortImpl::Sort_h(vector< int >::iterator start, vector< int >::iterator end)
{
if(start < end)
{
if(end - start < 20)
{
InsertionSort(start, end);
}
else
{
int q = Partition(start, end);
#pragma omp task
{
Sort_h(start, start + (q - 1));
}
Sort_h(start + (q + 1), end);
}
}
}

int QuickSortImpl::Partition(vector< int >::iterator start, vector< int >::iterator end)
{
int partitionIndex = (end - start)/2;
int privotValue = start[partitionIndex];
int tmp = start[partitionIndex];
start[partitionIndex] = *end;
*end = tmp;
int swapIndex = 0;
for (int i = 0; i < end - start; i++)
{
if(start[i] <= privotValue)
{
tmp = start[swapIndex];
start[swapIndex] = start[i];
start[i] = tmp;
swapIndex++;
}
}
tmp = start[swapIndex];
start[swapIndex] = *end;
*end = tmp;
return swapIndex;
}
void QuickSortImpl::InsertionSort(vector< int >::iterator start, vector< int >::iterator end)
{
for(int j = 1; j < = end - start; j++)
{
int key = start[j];
int i = j - 1;
while(i >= 0 && start[i] > key)
{
start[i+1] = start[i];
i = i - 1;
}
start[i+1] = key;
}
}


I chose to only task off one of my recursive calls in the sort_h routine. This is because there is a small overhead associated with sending off a task to the queue of waiting threads to execute. The actual improvement to performance was fairly minimal at about 0.2-0.3 seconds. This trick would have been more noticeable if there was more iterations of sort_h.

This implementation sorted my random list of integers in roughly 5.5 seconds. That means it is roughly a factor of three speed up over the single threaded version. It also better utilizes my desktop's CPUs but spiking all four cores for a few seconds to complete the operation.

Besides the pretty impressive speed up, what strikes me the most interesting is how little code changed from my previous implementation. Literally 4 additional lines of code was required to turn a single threaded Quick Sort into a parallel Quick Sort that scales pretty well with the number of CPU's you have (on most desktops). This just goes to show you the power of OpenMP. I highly recommend anyone needing to write multi-threaded c++ code to look at OpenMP.

Download: OpenMPQuickSort