Archived page.

This is an archived copy of my old (academic) homepage formerly hosted at Chalmers. It is no longer being updated.




Parallel Primitives Library:

  • a.k.a. chag::pp
  • Based on our paper on Efficient Stream Compaction.
  • Authors: Markus Billeter, Ola Olsson, Ulf Assarsson
  • Download: Source
  • License: MIT

Introduction

Our parallel primitives library, a.k.a. chag::pp, is based on the paper Efficient Stream Compaction on Wide-SIMD Many-Core Architectures by myself, Ola Olsson and Ulf Assarsson. The paper can be downloaded here. The paper and its results were presented at the High Performance Graphics conference in 2009; presentation slides are available here.

The chag::pp library provides efficient implementations of

  • reduction
  • prefix operations (scan)
  • compaction
  • radix sort

Using chag::pp requires CUDA 2.3 or newer. We have tested the code on the following platforms:

  • CUDA 3.1, Linux 64bit, GCC 4.5.0, GTX480
  • CUDA 3.1, Linux 64bit, GCC 4.5.9, GTX260
  • CUDA 2.3, Linux 64bit, GCC 4.3.3, GTX280
  • CUDA 2.3, Linux 64bit, GCC 4.3.3, GTX260
  • CUDA 2.3, WinXP 64bit, Visual Studio 2005 SP1, GTX280

(Please update to the 20100811 version if you're using CUDA 3.1 on a 64-bit machine.)


News

Update: 2011-09-05

  • WARNING: There appears to be a bug in sort() when the code is compiled for Fermi-class cards. The problem is not yet resolved. :-(
  • Added an inclusive prefix sum operation, chag::pp::prefix_inclusive(). It takes the same parameters as the (older) exclusive prefix function.
  • Fixed a bug where a custom operator in chag::pp::prefix() would cause incorrect results. (Thanks to C. Allen Waddle for reporting this!)
  • Added a workaround for an NVCC code generation bug in release 3.1, V0.2.1221. This file demonstrates the bug (and a workaround) in ~50 lines of code.

Update: 2009-10-14

  • chag::pp is now available under the MIT license

Download

The library is header-only; there is no need to build or link against any binaries.


Examples

For more complete documentation, see the section Documentation below.

The following code demonstrates the compact() function with a user-provided predicate. Thrust is used for device memory allocation and host-device transfers; chag::pp itself only deals with device memory.

-- {File: compact.cu - Language: CUDA} --
#include <chag/pp/compact.cuh>         /* chag::pp::compact() */
namespace pp = chag::pp;

#include <thrust/device_malloc.h>      /* device_malloc() */
namespace t = thrust;

#include <cstdio>

/* Predicate function */
struct Predicate
{
    __device__ bool operator() (uint2 value) const
    {
        return value.x > value.y;
    }
};

int main( void )
{
    enum { ELEMENTS = 4*1024*1024 + 4321 };

    t::device_ptr<uint2> src = t::device_malloc<uint2>(ELEMENTS);
    t::device_ptr<uint2> out = t::device_malloc<uint2>(ELEMENTS);

    t::device_ptr<pp::SizeType> count = t::device_malloc<pp::SizeType>(1);

    /* Generate data for 'src' */

    // Compaction
    pp::compact(
        src.get(),              /* Input start pointer */
        src.get()+ELEMENTS,     /* Input end pointer */
        out.get(),              /* Output start pointer */
        count.get(),            /* Storage for valid element count */
        Predicate()             /* Predicate */
    );

    // Output
    printf( "Valid elements: %d\n", int(*count) );

    return 0;
}

Some CUDA-related code, e.g. generating the input data and freeing the buffers, is omitted. The library only deals with CUDA device memory, i.e. all pointers are assumed to point to device memory.
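The data-generation step is left as a comment in the listing above. As an illustration only (the fill kernel below is not part of chag::pp; its name and launch configuration are assumptions), 'src' could be filled by a small CUDA kernel:

-- {Language: CUDA} --
/* Hypothetical fill kernel: element i gets (i, ELEMENTS-i), so that roughly
   half of the elements satisfy the predicate value.x > value.y. */
__global__ void fillSrc( uint2* aDest, unsigned aCount )
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;

    if( i < aCount )
        aDest[i] = make_uint2( i, aCount - i );
}

/* Launch from main(), before the call to pp::compact():
     fillSrc<<< (ELEMENTS+255)/256, 256 >>>( src.get(), ELEMENTS );
*/

Kernel launches on the default CUDA stream execute in order, so no explicit synchronization should be needed before the compaction call (assuming chag::pp issues its kernels on the default stream).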

TODO


BibTeX

Please use the following BibTeX-entry for citations:
@inproceedings{1572795,
	 author = {Billeter, Markus and Olsson, Ola and Assarsson, Ulf},
	 title = {Efficient stream compaction on wide SIMD many-core 
		architectures},
	 booktitle = {HPG '09: Proceedings of the Conference on High 
		Performance Graphics 2009},
	 year = {2009},
	 isbn = {978-1-60558-603-8},
	 pages = {159--166},
	 location = {New Orleans, Louisiana},
	 doi = {http://doi.acm.org/10.1145/1572769.1572795},
	 publisher = {ACM},
	 address = {New York, NY, USA},
}

Documentation

chag::pp provides three kinds of interfaces:

  • STL-like algorithms
  • Class-based API
  • In-kernel utility functions

The first two interfaces expose the same functionality; the class-based API allows for more fine-tuning. The in-kernel utility functions are aimed at developers writing their own CUDA kernels (they are also used to build the kernels behind the first two interfaces).

TODO: create & upload doxygen documentation

STL-like interface

Operations (usage sketches follow the list):
  • Reduction (Header: <chag/pp/reduce.cuh>)

    -- {Language: CUDA} --
    template< typename T >
    inline void reduce( const T* aStart, const T* aEnd, T* aOutput );
    template< typename T, class Op >
    inline void reduce( const T* aStart, const T* aEnd, T* aOutput, const Op& aOperator );

    Parallel reduction with operator+ or a user-specified operator. Parameters:

    • aStart, aEnd: Input range [aStart, aEnd)
    • aOutput: Output, single element
    • aOperator: User specified operator. See below.
  • Prefix Sum (Header: <chag/pp/prefix.cuh>)

    -- {Language: CUDA} --
    template< typename T >
    inline void prefix( const T* aStart, const T* aEnd, T* aOutput, T* aTotal = 0 );
    template< typename T, class Op >
    inline void prefix( const T* aStart, const T* aEnd, T* aOutput, T* aTotal, const Op& aOperator );

    Exclusive prefix sum (scan). Parameters:

    • aStart, aEnd: Input range [aStart, aEnd)
    • aOutput: Output, (aEnd-aStart) elements
    • aTotal: Output total
    • aOperator: User specified operator. See below.
  • Compaction (Header: <chag/pp/compact.cuh>)

    -- {Language: CUDA} --
    template< typename T >
    inline void compact( const T* aStart, const T* aEnd, T* aOutput,
        SizeType* aNumValid = 0 );
    template< typename T, class Predicate >
    inline void compact( const T* aStart, const T* aEnd, T* aOutput,
        SizeType* aNumValid, const Predicate& aPredicate );

    Compaction (stream reduction). Parameters:

    • aStart, aEnd: Input range [aStart, aEnd)
    • aOutput: Output, at most (aEnd-aStart) elements
    • aNumValid: Output, number of valid elements written to aOutput
    • aPredicate: User specified predicate. See below.
  • Split (Header: <chag/pp/split.cuh>)

    -- {Language: CUDA} --
    template< typename T >
    inline void split( const T* aStart, const T* aEnd, T* aOutput,
        SizeType* aNumValid = 0 );
    template< typename T, class Predicate >
    inline void split( const T* aStart, const T* aEnd, T* aOutput,
        SizeType* aNumValid, const Predicate& aPredicate );

    Split: partitions the input so that elements for which the predicate holds precede those for which it does not. Parameters:

    • aStart, aEnd: Input range [aStart, aEnd)
    • aOutput: Output, (aEnd-aStart) elements
    • aNumValid: Output, number of elements for which the predicate holds
    • aPredicate: User specified predicate. See below.
  • Radix Sort (Header: <chag/pp/sort.cuh>)

    -- {Language: CUDA} --
    template< typename T >
    inline T* sort( const T* aStart, const T* aEnd, T* aPing, T* aPong );

    template<
        unsigned Low, unsigned High, template <unsigned,typename> class Pred, typename T
    > inline T* sort( const T* aStart, const T* aEnd, T* aPing, T* aPong );

    Radix sort. Parameters:

    • aStart, aEnd: Input range [aStart, aEnd)
    • aPing: output buffer A, (aEnd-aStart) elements
    • aPong: output buffer B, (aEnd-aStart) elements
    • return value: Buffer that holds the output data, i.e. one of aPing or aPong.

    A split operation is performed for each value in the range [Low ... High), using the predicate Pred<Value,T>. The method returns either aPing or aPong, depending on the number of splits performed. The aPong buffer may alias the aStart buffer.
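The following sketch shows the reduction and prefix declarations above in use, with the default operator+ overloads. Buffer names and the element count are arbitrary choices for the example, not part of the library.

-- {Language: CUDA} --
#include <chag/pp/reduce.cuh>   /* chag::pp::reduce() */
#include <chag/pp/prefix.cuh>   /* chag::pp::prefix() */
namespace pp = chag::pp;

#include <thrust/device_malloc.h>
namespace t = thrust;

int main( void )
{
    enum { ELEMENTS = 1024*1024 };

    t::device_ptr<unsigned> in = t::device_malloc<unsigned>(ELEMENTS);
    t::device_ptr<unsigned> scanned = t::device_malloc<unsigned>(ELEMENTS);
    t::device_ptr<unsigned> sum = t::device_malloc<unsigned>(1);
    t::device_ptr<unsigned> total = t::device_malloc<unsigned>(1);

    /* ... fill 'in' with data (see the compaction example above) ... */

    /* Parallel reduction with the default operator+; the sum of all elements
       ends up in the single element pointed to by 'sum'. */
    pp::reduce( in.get(), in.get()+ELEMENTS, sum.get() );

    /* Exclusive prefix sum; 'total' receives the sum of all input elements. */
    pp::prefix( in.get(), in.get()+ELEMENTS, scanned.get(), total.get() );

    return 0;
}

The inclusive variant added in the 2011-09-05 update, chag::pp::prefix_inclusive(), takes the same parameters as prefix().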
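For the radix sort, the essential point is that the sorted data ends up in either aPing or aPong, so the returned pointer must always be used. A minimal sketch (again, buffer names are illustrative):

-- {Language: CUDA} --
#include <chag/pp/sort.cuh>     /* chag::pp::sort() */
namespace pp = chag::pp;

#include <thrust/device_malloc.h>
namespace t = thrust;

int main( void )
{
    enum { ELEMENTS = 1024*1024 };

    t::device_ptr<unsigned> keys = t::device_malloc<unsigned>(ELEMENTS);
    t::device_ptr<unsigned> ping = t::device_malloc<unsigned>(ELEMENTS);
    t::device_ptr<unsigned> pong = t::device_malloc<unsigned>(ELEMENTS);

    /* ... fill 'keys' with data ... */

    /* Sort; the result ends up in either 'ping' or 'pong', depending on the
       number of split passes performed. Always use the returned pointer. */
    unsigned* sorted = pp::sort( keys.get(), keys.get()+ELEMENTS,
        ping.get(), pong.get() );

    /* Since aPong may alias the input, 'pong' could be replaced by 'keys'
       to save one buffer. */

    return 0;
}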


Todo

  • Documentation
  • Upload examples and tests
  • Updated performance graphs
  • Tweaks. Several minor optimizations have not been implemented in the new code.