## Thursday, March 31, 2011

### My Quest: Sort 100 Million Integers With OpenMP

Over the past couple of weeks I have been teaching myself OpenMP. I wonted to see how much faster I could make my previous Quick Sort with Insertion Sort implementation. What is so great about OpenMP is it allows you to transform a single threaded code bit into a multi-threaded one with just a few lines of code. Here is my implementation:

`void QuickSortImpl::Sort(vector< int >::iterator start, vector< int >::iterator end){    int numProcs = omp_get_num_procs();    omp_set_num_threads(numProcs);    #pragma omp parallel    {        #pragma omp single nowait        Sort_h(start, end-1);    }}void QuickSortImpl::Sort_h(vector< int >::iterator start, vector< int >::iterator end){    if(start < end)    {        if(end - start < 20)        {            InsertionSort(start, end);        }        else        {            int q = Partition(start, end);            #pragma omp task            {                Sort_h(start, start + (q - 1));            }            Sort_h(start + (q + 1), end);        }    }}int QuickSortImpl::Partition(vector< int >::iterator start, vector< int >::iterator end){    int partitionIndex =  (end - start)/2;    int privotValue = start[partitionIndex];    int tmp = start[partitionIndex];    start[partitionIndex] = *end;    *end = tmp;    int swapIndex = 0;    for (int i = 0; i < end - start; i++)    {        if(start[i] <= privotValue)        {            tmp = start[swapIndex];            start[swapIndex] = start[i];            start[i] = tmp;            swapIndex++;        }    }    tmp = start[swapIndex];    start[swapIndex] = *end;    *end = tmp;    return swapIndex;}void QuickSortImpl::InsertionSort(vector< int >::iterator start, vector< int >::iterator end){    for(int j = 1; j < = end - start; j++)    {        int key = start[j];        int i = j - 1;        while(i >= 0 && start[i] > key)        {            start[i+1] = start[i];            i = i - 1;        }        start[i+1] = key;    }}`

I chose to only task off one of my recursive calls in the sort_h routine. This is because there is a small overhead associated with sending off a task to the queue of waiting threads to execute. The actual improvement to performance was fairly minimal at about 0.2-0.3 seconds. This trick would have been more noticeable if there was more iterations of sort_h.

This implementation sorted my random list of integers in roughly 5.5 seconds. That means it is roughly a factor of three speed up over the single threaded version. It also better utilizes my desktop's CPUs but spiking all four cores for a few seconds to complete the operation.

Besides the pretty impressive speed up, what strikes me the most interesting is how little code changed from my previous implementation. Literally 4 additional lines of code was required to turn a single threaded Quick Sort into a parallel Quick Sort that scales pretty well with the number of CPU's you have (on most desktops). This just goes to show you the power of OpenMP. I highly recommend anyone needing to write multi-threaded c++ code to look at OpenMP.