Archived page.
This is an archived copy of my old (academic) homepage formerly hosted at Chalmers. It is no longer being updated.
Our parallel primitives library, a.k.a. chag::pp, is based on the paper Efficient Stream Compaction on Wide-SIMD Many-Core Architectures by myself, Ola Olsson and Ulf Assarsson. The paper can be downloaded here. The paper and its results were presented at the High Performance Graphics conference in 2009; presentation slides are available here.
The chag::pp library provides efficient implementations of reduction, prefix sum (scan), stream compaction, split, and radix sort.
Using chag::pp requires CUDA 2.3. We have tested the code on the following platforms:
(Please update to the 20100811 version if you're using CUDA 3.1 on a 64-bit machine.)
Update: 2011-09-05
Update: 2009-10-14
The library consists of headers only; there is no need to build or link against any binaries.
For more complete documentation, see the section Documentation below.
The following code demonstrates the compact() method with a user-provided predicate function. Thrust is used for host-device communication; chag::pp only deals with device memory.
-- {File: compact.cu - Language: CUDA} --
#include <chag/pp/compact.cuh> /* chag::pp::compact() */
namespace pp = chag::pp;

#include <thrust/device_malloc.h> /* device_malloc() */
namespace t = thrust;

#include <cstdio>

/* Predicate function */
struct Predicate
{
    __device__ bool operator() (uint2 value) const
    {
        return value.x > value.y;
    }
};

int main( void )
{
    enum { ELEMENTS = 4*1024*1024 + 4321 };

    t::device_ptr<uint2> src = t::device_malloc<uint2>(ELEMENTS);
    t::device_ptr<uint2> out = t::device_malloc<uint2>(ELEMENTS);

    t::device_ptr<pp::SizeType> count = t::device_malloc<pp::SizeType>(1);

    /* Generate data for 'src' */

    // Compaction
    pp::compact(
        src.get(),          /* Input start pointer */
        src.get()+ELEMENTS, /* Input end pointer */
        out.get(),          /* Output start pointer */
        count.get(),        /* Storage for valid element count */
        Predicate()         /* Predicate */
    );

    // Output
    printf( "Valid elements: %d\n", int(*count) );

    return 0;
}
CUDA-related code, e.g. memory allocations, is omitted. The library only deals with CUDA device memory, i.e. all pointers are assumed to point to device memory.
TODO
@inproceedings{1572795,
  author    = {Billeter, Markus and Olsson, Ola and Assarsson, Ulf},
  title     = {Efficient stream compaction on wide SIMD many-core architectures},
  booktitle = {HPG '09: Proceedings of the Conference on High Performance Graphics 2009},
  year      = {2009},
  isbn      = {978-1-60558-603-8},
  pages     = {159--166},
  location  = {New Orleans, Louisiana},
  doi       = {http://doi.acm.org/10.1145/1572769.1572795},
  publisher = {ACM},
  address   = {New York, NY, USA},
}
chag::pp provides three kinds of interfaces:
The first two interfaces expose the same kind of functionality; the class-based API allows for more fine-tuning. The in-kernel utility functions are aimed at people who develop their own CUDA kernels (they are also used to build the kernels involved in the first two interfaces).
TODO: create & upload doxygen documentation
-- {Language: CUDA} --
template< typename T >
inline void reduce( const T* aStart, const T* aEnd, T* aOutput );

template< typename T, class Op >
inline void reduce( const T* aStart, const T* aEnd, T* aOutput, const Op& aOperator );
Parallel reduction with operator+ or a user-specified binary operator. Parameters: aStart and aEnd delimit the input range, aOutput points to (device) storage for the single result, and aOperator is the optional user-supplied operator.
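As a quick illustration, here is a minimal sketch of a host-side reduce() call using the default operator+. The header name <chag/pp/reduce.cuh> is an assumption, chosen by analogy with compact.cuh above; Thrust is again used only for device allocation.

-- {Language: CUDA} --
#include <chag/pp/reduce.cuh> /* assumed header name, analogous to compact.cuh */
namespace pp = chag::pp;

#include <thrust/device_malloc.h>
namespace t = thrust;

int main( void )
{
    enum { ELEMENTS = 1024*1024 };

    t::device_ptr<unsigned> src = t::device_malloc<unsigned>(ELEMENTS);
    t::device_ptr<unsigned> sum = t::device_malloc<unsigned>(1);

    /* Generate data for 'src' */

    // Sum all elements with the default operator+; the result is written
    // to device memory at sum.get().
    pp::reduce( src.get(), src.get()+ELEMENTS, sum.get() );

    return 0;
}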
-- {Language: CUDA} --
template< typename T >
inline void prefix( const T* aStart, const T* aEnd, T* aOutput, T* aTotal = 0 );

template< typename T, class Op >
inline void prefix( const T* aStart, const T* aEnd, T* aOutput, T* aTotal, const Op& aOperator );
Exclusive prefix sum (scan). Parameters: aStart and aEnd delimit the input range, aOutput points to an output buffer of the same size, aTotal optionally receives the sum of all input elements (pass 0 to ignore it), and aOperator is the optional user-supplied operator.
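A similarly minimal sketch of an exclusive prefix sum follows. The header name <chag/pp/prefix.cuh> is an assumption, and the sketch assumes aTotal receives the sum of all input elements.

-- {Language: CUDA} --
#include <chag/pp/prefix.cuh> /* assumed header name, analogous to compact.cuh */
namespace pp = chag::pp;

#include <thrust/device_malloc.h>
namespace t = thrust;

int main( void )
{
    enum { ELEMENTS = 1024*1024 };

    t::device_ptr<unsigned> src = t::device_malloc<unsigned>(ELEMENTS);
    t::device_ptr<unsigned> out = t::device_malloc<unsigned>(ELEMENTS);
    t::device_ptr<unsigned> total = t::device_malloc<unsigned>(1);

    /* Generate data for 'src' */

    // Exclusive scan of 'src' into 'out'; the sum of all elements is
    // written to 'total' (pass 0 if the total is not needed).
    pp::prefix( src.get(), src.get()+ELEMENTS, out.get(), total.get() );

    return 0;
}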
-- {Language: CUDA} --
template< typename T >
inline void compact( const T* aStart, const T* aEnd, T* aOutput,
    SizeType* aNumValid = 0 );

template< typename T, class Predicate >
inline void compact( const T* aStart, const T* aEnd, T* aOutput,
    SizeType* aNumValid, const Predicate& aPredicate );
Compaction (stream reduction). Parameters: aStart and aEnd delimit the input range, aOutput points to the output buffer, aNumValid optionally receives the number of valid elements, and aPredicate is the optional user-supplied predicate identifying valid elements. See the compact.cu example above for a complete usage example.
-- {Language: CUDA} --
template< typename T >
inline void split( const T* aStart, const T* aEnd, T* aOutput,
    SizeType* aNumValid = 0 );

template< typename T, class Predicate >
inline void split( const T* aStart, const T* aEnd, T* aOutput,
    SizeType* aNumValid, const Predicate& aPredicate );
Split. Parameters: aStart and aEnd delimit the input range, aOutput points to the output buffer, aNumValid optionally receives the number of elements satisfying the predicate, and aPredicate is the optional user-supplied predicate.
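A usage sketch for split(), mirroring the compact() example above and reusing the same Predicate functor. The header name <chag/pp/split.cuh> is an assumption, as is the behaviour that aOutput receives all input elements (both partitions) and aNumValid the size of the first partition.

-- {Language: CUDA} --
#include <chag/pp/split.cuh> /* assumed header name, analogous to compact.cuh */
namespace pp = chag::pp;

#include <thrust/device_malloc.h>
namespace t = thrust;

/* Predicate functor, as in the compact() example */
struct Predicate
{
    __device__ bool operator() (uint2 value) const
    {
        return value.x > value.y;
    }
};

int main( void )
{
    enum { ELEMENTS = 1024*1024 };

    t::device_ptr<uint2> src = t::device_malloc<uint2>(ELEMENTS);
    t::device_ptr<uint2> out = t::device_malloc<uint2>(ELEMENTS);
    t::device_ptr<pp::SizeType> count = t::device_malloc<pp::SizeType>(1);

    /* Generate data for 'src' */

    // Split 'src' by the predicate; 'out' is assumed to receive all
    // ELEMENTS elements, and 'count' the number of elements for which
    // the predicate holds.
    pp::split( src.get(), src.get()+ELEMENTS, out.get(), count.get(), Predicate() );

    return 0;
}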
-- {Language: CUDA} --
template< typename T >
inline T* sort( const T* aStart, const T* aEnd, T* aPing, T* aPong );

template<
    unsigned Low, unsigned High, template <unsigned,typename> class Pred, typename T
> inline T* sort( const T* aStart, const T* aEnd, T* aPing, T* aPong );
Radix sort. Parameters: aStart and aEnd delimit the input range; aPing and aPong are two buffers used alternately during sorting.
A split operation is performed for each value in the range [Low ... High), using the predicate Pred<Value,T>. The method returns either aPing or aPong, depending on the number of splits that were performed. The aPong buffer may alias the aStart buffer.
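Finally, a sketch of the simple sort() overload. Both aPing and aPong must be able to hold the full input, and the sorted data is read through the returned pointer; the header name <chag/pp/sort.cuh> is an assumption.

-- {Language: CUDA} --
#include <chag/pp/sort.cuh> /* assumed header name, analogous to compact.cuh */
namespace pp = chag::pp;

#include <thrust/device_malloc.h>
namespace t = thrust;

int main( void )
{
    enum { ELEMENTS = 1024*1024 };

    t::device_ptr<unsigned> src  = t::device_malloc<unsigned>(ELEMENTS);
    t::device_ptr<unsigned> ping = t::device_malloc<unsigned>(ELEMENTS);
    t::device_ptr<unsigned> pong = t::device_malloc<unsigned>(ELEMENTS);

    /* Generate data for 'src' */

    // Sort 'src'; the sorted sequence ends up in either 'ping' or 'pong',
    // and 'sorted' points at whichever buffer holds it.
    unsigned* sorted = pp::sort( src.get(), src.get()+ELEMENTS, ping.get(), pong.get() );

    (void)sorted; /* use the sorted data here */
    return 0;
}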