DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within global memory.
.
- Overview
- A run-length encoding computes a simple compressed representation of a sequence of input elements such that each maximal "run" of consecutive same-valued data items is encoded as a single data value along with a count of the elements in that run.
- Usage Considerations
- Dynamic parallelism. DeviceRunLengthEncode methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported. When calling these methods from kernel code, be sure to define the
CUB_CDP macro in your compiler's macro definitions.
- Performance
- The work-complexity of run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
- The following chart illustrates DeviceRunLengthEncode::RunLengthEncode performance across different CUDA architectures for
int32 items. Segments have lengths uniformly sampled from [1,1000].
- Performance plots for other scenarios can be found in the detailed method descriptions below.
Definition at line 78 of file device_run_length_encode.cuh.
|
| template<typename InputIteratorT , typename UniqueOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT > |
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t | Encode (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, UniqueOutputIteratorT d_unique_out, LengthsOutputIteratorT d_counts_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
| | Computes a run-length encoding of the sequence d_in. More...
|
| |
| template<typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT > |
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t | NonTrivialRuns (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OffsetsOutputIteratorT d_offsets_out, LengthsOutputIteratorT d_lengths_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
| | Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence d_in. More...
|
| |
template<typename InputIteratorT , typename UniqueOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
| CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceRunLengthEncode::Encode |
( |
void * |
d_temp_storage, |
|
|
size_t & |
temp_storage_bytes, |
|
|
InputIteratorT |
d_in, |
|
|
UniqueOutputIteratorT |
d_unique_out, |
|
|
LengthsOutputIteratorT |
d_counts_out, |
|
|
NumRunsOutputIteratorT |
d_num_runs_out, |
|
|
int |
num_items, |
|
|
cudaStream_t |
stream = 0, |
|
|
bool |
debug_synchronous = false |
|
) |
| |
|
inlinestatic |
Computes a run-length encoding of the sequence d_in.
- For the ith run encountered, the first key of the run and its length are written to
d_unique_out[i] and d_counts_out[i], respectively.
- The total number of runs encountered is written to
d_num_runs_out.
- The
== equality operator is used to determine whether values are equivalent
- When
d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
- Performance
- The following charts illustrate saturated encode performance across different CUDA architectures for
int32 and int64 items, respectively. Segments have lengths uniformly sampled from [1,1000].
- The following charts are similar, but with segment lengths uniformly sampled from [1,10]:
- Snippet
- The code snippet below illustrates the run-length encoding of a sequence of
int values.
int num_items;
int *d_in;
int *d_unique_out;
int *d_counts_out;
int *d_num_runs_out;
...
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cudaMalloc(&d_temp_storage, temp_storage_bytes);
- Template Parameters
-
| InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
| UniqueOutputIteratorT | [inferred] Random-access output iterator type for writing unique output items (may be a simple pointer type) |
| LengthsOutputIteratorT | [inferred] Random-access output iterator type for writing output counts (may be a simple pointer type) |
| NumRunsOutputIteratorT | [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type) |
- Parameters
-
| [in] | d_temp_storage | Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
| [in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
| [in] | d_in | Pointer to the input sequence of keys |
| [out] | d_unique_out | Pointer to the output sequence of unique keys (one key per run) |
| [out] | d_counts_out | Pointer to the output sequence of run-lengths (one count per run) |
| [out] | d_num_runs_out | Pointer to total number of runs |
| [in] | num_items | Total number of associated key+value pairs (i.e., the length of d_in_keys and d_in_values) |
| [in] | stream | [optional] CUDA stream to launch kernels within. Default is stream0. |
| [in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false. |
Definition at line 149 of file device_run_length_encode.cuh.
template<typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
| CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceRunLengthEncode::NonTrivialRuns |
( |
void * |
d_temp_storage, |
|
|
size_t & |
temp_storage_bytes, |
|
|
InputIteratorT |
d_in, |
|
|
OffsetsOutputIteratorT |
d_offsets_out, |
|
|
LengthsOutputIteratorT |
d_lengths_out, |
|
|
NumRunsOutputIteratorT |
d_num_runs_out, |
|
|
int |
num_items, |
|
|
cudaStream_t |
stream = 0, |
|
|
bool |
debug_synchronous = false |
|
) |
| |
|
inlinestatic |
Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence d_in.
- For the ith non-trivial run, the run's starting offset and its length are written to
d_offsets_out[i] and d_lengths_out[i], respectively.
- The total number of runs encountered is written to
d_num_runs_out.
- The
== equality operator is used to determine whether values are equivalent
- When
d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
- Performance
- Snippet
- The code snippet below illustrates the identification of non-trivial runs within a sequence of
int values.
int num_items;
int *d_in;
int *d_offsets_out;
int *d_lengths_out;
int *d_num_runs_out;
...
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cudaMalloc(&d_temp_storage, temp_storage_bytes);
- Template Parameters
-
| InputIteratorT | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
| OffsetsOutputIteratorT | [inferred] Random-access output iterator type for writing run-offset values (may be a simple pointer type) |
| LengthsOutputIteratorT | [inferred] Random-access output iterator type for writing run-length values (may be a simple pointer type) |
| NumRunsOutputIteratorT | [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type) |
- Parameters
-
| [in] | d_temp_storage | Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
| [in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
| [in] | d_in | Pointer to input sequence of data items |
| [out] | d_offsets_out | Pointer to output sequence of run-offsets (one offset per non-trivial run) |
| [out] | d_lengths_out | Pointer to output sequence of run-lengths (one count per non-trivial run) |
| [out] | d_num_runs_out | Pointer to total number of runs (i.e., length of d_offsets_out) |
| [in] | num_items | Total number of associated key+value pairs (i.e., the length of d_in_keys and d_in_values) |
| [in] | stream | [optional] CUDA stream to launch kernels within. Default is stream0. |
| [in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false. |
Definition at line 247 of file device_run_length_encode.cuh.
The documentation for this struct was generated from the following file: