CUB
DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within global memory.
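When calling these methods from kernel code, define the CUB_CDP macro in your compiler's macro definitions.
[Performance chart: int32 keys.]
[Performance chart: fp32 values. Segments are identified by int32 keys, and have lengths uniformly sampled from [1,1000].]
[Performance chart: int32 items. Segments have lengths uniformly sampled from [1,1000].]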
Definition at line 89 of file device_reduce.cuh.
Static Public Methods
| template<typename InputIterator, typename OutputIterator, typename ReductionOp> | |
| static CUB_RUNTIME_FUNCTION cudaError_t | Reduce (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, ReductionOp reduction_op, cudaStream_t stream=0, bool debug_synchronous=false) |
| Computes a device-wide reduction using the specified binary reduction_op functor. | |
| template<typename InputIterator, typename OutputIterator> | |
| static CUB_RUNTIME_FUNCTION cudaError_t | Sum (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
| Computes a device-wide sum using the addition ('+') operator. | |
| template<typename InputIterator, typename OutputIterator> | |
| static CUB_RUNTIME_FUNCTION cudaError_t | Min (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
| Computes a device-wide minimum using the less-than ('<') operator. | |
| template<typename InputIterator, typename OutputIterator> | |
| static CUB_RUNTIME_FUNCTION cudaError_t | ArgMin (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
| Finds the first device-wide minimum using the less-than ('<') operator, also returning the index of that item. | |
| template<typename InputIterator, typename OutputIterator> | |
| static CUB_RUNTIME_FUNCTION cudaError_t | Max (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
| Computes a device-wide maximum using the greater-than ('>') operator. | |
| template<typename InputIterator, typename OutputIterator> | |
| static CUB_RUNTIME_FUNCTION cudaError_t | ArgMax (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
| Finds the first device-wide maximum using the greater-than ('>') operator, also returning the index of that item. | |
| template<typename KeyInputIterator, typename KeyOutputIterator, typename ValueInputIterator, typename ValueOutputIterator, typename NumSegmentsIterator, typename ReductionOp> | |
| CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t | ReduceByKey (void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIterator d_keys_in, KeyOutputIterator d_keys_out, ValueInputIterator d_values_in, ValueOutputIterator d_values_out, NumSegmentsIterator d_num_segments, ReductionOp reduction_op, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
| Reduces segments of values, where segments are demarcated by corresponding runs of identical keys. | |
| template<typename InputIterator, typename OutputIterator, typename CountsOutputIterator, typename NumSegmentsIterator> | |
| CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t | RunLengthEncode (void *d_temp_storage, size_t &temp_storage_bytes, InputIterator d_in, OutputIterator d_compacted_out, CountsOutputIterator d_counts_out, NumSegmentsIterator d_num_segments, int num_items, cudaStream_t stream=0, bool debug_synchronous=false) |
| Counts the segment lengths in the sequence d_in, where segments are demarcated by runs of identical values. | |
Reduce() [inline, static]
Computes a device-wide reduction using the specified binary reduction_op functor.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
When calling this method from kernel code, define the CUB_CDP macro in your compiler's macro definitions.
[Snippet: example over int items.]
| InputIterator | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
| OutputIterator | [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type) |
| ReductionOp | [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b) |
| [in] | d_temp_storage | Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
| [in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
| [in] | d_in | Pointer to the input sequence of data items |
| [out] | d_out | Pointer to the output aggregate |
| [in] | num_items | Total number of input items (i.e., length of d_in) |
| [in] | reduction_op | Binary reduction functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
| [in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
| [in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
Definition at line 149 of file device_reduce.cuh.
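The functor requirement above can be illustrated with a sequential, host-only reference. This is a sketch of the semantics only, not of CUB's parallel device implementation. `HostReduce` and `CustomMin` are hypothetical names introduced here, and the sketch assumes at least one input item so that the first item can seed the aggregate.

```cpp
#include <cassert>

// Hypothetical functor with the documented shape:
// T operator()(const T &a, const T &b).
struct CustomMin
{
    template <typename T>
    T operator()(const T &a, const T &b) const
    {
        return (b < a) ? b : a;
    }
};

// Hypothetical host-side reference: folds reduction_op over in[0..num_items)
// from left to right and writes the single aggregate to *out.
template <typename T, typename ReductionOp>
void HostReduce(const T *in, T *out, int num_items, ReductionOp reduction_op)
{
    assert(num_items >= 1);  // assumption of this sketch
    T aggregate = in[0];
    for (int i = 1; i < num_items; ++i)
        aggregate = reduction_op(aggregate, in[i]);
    *out = aggregate;
}
```

A parallel reduction is free to combine items in a different grouping than this left-to-right loop, so a functor passed to a device-wide reduction should be associative.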
Sum() [inline, static]
Computes a device-wide sum using the addition ('+') operator.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
When calling this method from kernel code, define the CUB_CDP macro in your compiler's macro definitions.
[Performance charts: int32 and int64 items, respectively.]
[Snippet: example over int items.]
| InputIterator | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
| OutputIterator | [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type) |
| [in] | d_temp_storage | Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
| [in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
| [in] | d_in | Pointer to the input sequence of data items |
| [out] | d_out | Pointer to the output aggregate |
| [in] | num_items | Total number of input items (i.e., length of d_in) |
| [in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
| [in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
Definition at line 226 of file device_reduce.cuh.
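The d_temp_storage and temp_storage_bytes parameters imply a two-call pattern: one call with NULL to learn the required size, then a second call with an allocation of that size. The sketch below mimics only that calling convention on the host. `MockSum` and `SumWithTempStorage` are hypothetical stand-ins, the scratch size is invented for illustration, and the integer return code loosely stands in for cudaError_t. With the real method, the scratch buffer would be a device allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical host-only stand-in that mimics the calling convention only.
// Returns 0 on success and 1 when the scratch buffer is too small.
int MockSum(void *temp_storage, std::size_t &temp_storage_bytes,
            const int *in, int *out, int num_items)
{
    // Invented scratch requirement: room for one partial sum.
    const std::size_t required_bytes = sizeof(long long);

    if (temp_storage == NULL)
    {
        // Size query: report the requirement and do no work.
        temp_storage_bytes = required_bytes;
        return 0;
    }
    if (temp_storage_bytes < required_bytes)
        return 1;

    long long *partial = static_cast<long long *>(temp_storage);
    *partial = 0;
    for (int i = 0; i < num_items; ++i)
        *partial += in[i];
    *out = static_cast<int>(*partial);
    return 0;
}

// Two-phase usage: query the size, allocate, run, release.
int SumWithTempStorage(const int *in, int num_items)
{
    void *temp_storage = NULL;
    std::size_t temp_storage_bytes = 0;
    int out = 0;

    int error = MockSum(temp_storage, temp_storage_bytes, in, &out, num_items);
    assert(error == 0 && temp_storage_bytes > 0);

    temp_storage = std::malloc(temp_storage_bytes);
    assert(temp_storage != NULL);
    error = MockSum(temp_storage, temp_storage_bytes, in, &out, num_items);
    assert(error == 0);

    std::free(temp_storage);
    return out;
}
```

Because the size query does no work, the output location is left untouched by the first call.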
Min() [inline, static]
Computes a device-wide minimum using the less-than ('<') operator.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
When calling this method from kernel code, define the CUB_CDP macro in your compiler's macro definitions.
[Snippet: example over int items.]
| InputIterator | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
| OutputIterator | [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type) |
| [in] | d_temp_storage | Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
| [in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
| [in] | d_in | Pointer to the input sequence of data items |
| [out] | d_out | Pointer to the output aggregate |
| [in] | num_items | Total number of input items (i.e., length of d_in) |
| [in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
| [in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
Definition at line 298 of file device_reduce.cuh.
ArgMin() [inline, static]
Finds the first device-wide minimum using the less-than ('<') operator, also returning the index of that item.
If the input d_in has value type T, the output d_out must have value type ItemOffsetPair<T, int>. The minimum value is written to d_out.value and its location in the input array is written to d_out.offset.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
When calling this method from kernel code, define the CUB_CDP macro in your compiler's macro definitions.
[Snippet: example over int items.]
| InputIterator | [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type) |
| OutputIterator | [inferred] Output iterator type for recording the reduced aggregate (having value type ItemOffsetPair<T, int>) (may be a simple pointer type) |
| [in] | d_temp_storage | Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
| [in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
| [in] | d_in | Pointer to the input sequence of data items |
| [out] | d_out | Pointer to the output aggregate |
| [in] | num_items | Total number of input items (i.e., length of d_in) |
| [in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
| [in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
Definition at line 375 of file device_reduce.cuh.
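The "first minimum" rule and the value/offset output can be illustrated with a sequential host-side reference. This is a sketch of the semantics only. `HostItemOffsetPair` is a hypothetical mirror of the documented output type, `HostArgMin` is a hypothetical name, and the sketch assumes at least one input item.

```cpp
#include <cassert>

// Hypothetical host-side mirror of the documented output type:
// a value paired with its offset in the input.
template <typename T>
struct HostItemOffsetPair
{
    T   value;
    int offset;
};

// Hypothetical sequential reference for the ArgMin semantics.
template <typename T>
HostItemOffsetPair<T> HostArgMin(const T *in, int num_items)
{
    assert(num_items >= 1);  // assumption of this sketch
    HostItemOffsetPair<T> best = { in[0], 0 };
    for (int i = 1; i < num_items; ++i)
    {
        // Strict '<' keeps the earliest occurrence when the minimum repeats.
        if (in[i] < best.value)
        {
            best.value  = in[i];
            best.offset = i;
        }
    }
    return best;
}
```

Replacing the strict '<' with a strict '>' gives the corresponding "first maximum" reference for ArgMax.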
Max() [inline, static]
Computes a device-wide maximum using the greater-than ('>') operator.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
When calling this method from kernel code, define the CUB_CDP macro in your compiler's macro definitions.
[Snippet: example over int items.]
| InputIterator | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
| OutputIterator | [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type) |
| [in] | d_temp_storage | Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
| [in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
| [in] | d_in | Pointer to the input sequence of data items |
| [out] | d_out | Pointer to the output aggregate |
| [in] | num_items | Total number of input items (i.e., length of d_in) |
| [in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
| [in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
Definition at line 451 of file device_reduce.cuh.
ArgMax() [inline, static]
Finds the first device-wide maximum using the greater-than ('>') operator, also returning the index of that item.
If the input d_in has value type T, the output d_out must have value type ItemOffsetPair<T, int>. The maximum value is written to d_out.value and its location in the input array is written to d_out.offset.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
When calling this method from kernel code, define the CUB_CDP macro in your compiler's macro definitions.
[Snippet: example over int items.]
| InputIterator | [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type) |
| OutputIterator | [inferred] Output iterator type for recording the reduced aggregate (having value type ItemOffsetPair<T, int>) (may be a simple pointer type) |
| [in] | d_temp_storage | Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
| [in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
| [in] | d_in | Pointer to the input sequence of data items |
| [out] | d_out | Pointer to the output aggregate |
| [in] | num_items | Total number of input items (i.e., length of d_in) |
| [in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
| [in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console. Default is false. |
Definition at line 528 of file device_reduce.cuh.
ReduceByKey() [inline, static]
Reduces segments of values, where segments are demarcated by corresponding runs of identical keys.
Segments are reduced using the specified binary reduction_op functor. Each "run" of consecutive, identical keys in d_keys_in is used to identify a corresponding segment of values in d_values_in. The first key in the ith segment is copied to d_keys_out[i], and the value aggregate for that segment is written to d_values_out[i]. The total number of segments discovered is written to d_num_segments.
The == equality operator is used to determine whether keys are equivalent.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
When calling this method from kernel code, define the CUB_CDP macro in your compiler's macro definitions.
[Performance charts: fp32 and fp64 values, respectively. Segments are identified by int32 keys, and have lengths uniformly sampled from [1,1000].]
[Snippet: example over int values grouped by runs of associated int keys.]
| KeyInputIterator | [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type) |
| KeyOutputIterator | [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type) |
| ValueInputIterator | [inferred] Random-access input iterator type for reading input values (may be a simple pointer type) |
| ValueOutputIterator | [inferred] Random-access output iterator type for writing output values (may be a simple pointer type) |
| NumSegmentsIterator | [inferred] Output iterator type for recording the number of segments encountered (may be a simple pointer type) |
| ReductionOp | [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b) |
| [in] | d_temp_storage | Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
| [in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
| [in] | d_keys_in | Pointer to consecutive runs of input keys |
| [out] | d_keys_out | Pointer to output keys (one key per run) |
| [in] | d_values_in | Pointer to consecutive runs of input values |
| [out] | d_values_out | Pointer to output value aggregates (one aggregate per run) |
| [out] | d_num_segments | Pointer to total number of segments |
| [in] | reduction_op | Binary reduction functor (e.g., an instance of cub::Sum, cub::Min, cub::Max, etc.) |
| [in] | num_items | Total number of associated key+value pairs (i.e., the length of d_keys_in and d_values_in) |
| [in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
| [in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false. |
Definition at line 648 of file device_reduce.cuh.
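The run-based segmentation described above can be traced with a sequential host-side reference. This is a sketch of the semantics only, not of the device implementation. `HostReduceByKey` and `HostPlus` are hypothetical names, and the output arrays are assumed to have room for num_items entries.

```cpp
#include <cassert>

// Hypothetical addition functor with the documented functor shape.
struct HostPlus
{
    template <typename T>
    T operator()(const T &a, const T &b) const
    {
        return a + b;
    }
};

// Hypothetical sequential reference for the reduce-by-key semantics.
template <typename K, typename V, typename ReductionOp>
void HostReduceByKey(const K *keys_in, K *keys_out,
                     const V *values_in, V *values_out,
                     int *num_segments, ReductionOp reduction_op,
                     int num_items)
{
    assert(num_items >= 0);
    int segments = 0;
    for (int i = 0; i < num_items; ++i)
    {
        if (i > 0 && keys_in[i] == keys_in[i - 1])
        {
            // Same run: fold the value into the current aggregate.
            values_out[segments - 1] =
                reduction_op(values_out[segments - 1], values_in[i]);
        }
        else
        {
            // New run: record its first key and seed its aggregate.
            keys_out[segments]   = keys_in[i];
            values_out[segments] = values_in[i];
            ++segments;
        }
    }
    *num_segments = segments;
}
```

Only consecutive equal keys form a segment, so a key that reappears later in the input starts a new segment rather than joining the earlier one.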
RunLengthEncode() [inline, static]
Counts the segment lengths in the sequence d_in, where segments are demarcated by runs of identical values.
Computes a run-length encoding of d_in, where segments are identified by "runs" of consecutive, identical values. The length of the ith segment is written to d_counts_out[i]. The unique values are also compacted, i.e., the first value in the ith segment is copied to d_compacted_out[i]. The total number of segments discovered is written to d_num_segments.
The == equality operator is used to determine whether values are equivalent.
When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
When calling this method from kernel code, define the CUB_CDP macro in your compiler's macro definitions.
[Performance charts: int32 and int64 items, respectively. Segments have lengths uniformly sampled from [1,1000].]
[Snippet: example over int values.]
| InputIterator | [inferred] Random-access input iterator type for reading input items (may be a simple pointer type) |
| OutputIterator | [inferred] Random-access output iterator type for writing compacted output items (may be a simple pointer type) |
| CountsOutputIterator | [inferred] Random-access output iterator type for writing output counts (may be a simple pointer type) |
| NumSegmentsIterator | [inferred] Output iterator type for recording the number of segments encountered (may be a simple pointer type) |
| [in] | d_temp_storage | Device allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done. |
| [in,out] | temp_storage_bytes | Reference to size in bytes of d_temp_storage allocation |
| [in] | d_in | Pointer to the input sequence of data items |
| [out] | d_compacted_out | Pointer to the output sequence of compacted items (one item per run) |
| [out] | d_counts_out | Pointer to the output sequence of run lengths (one count per run) |
| [out] | d_num_segments | Pointer to total number of segments |
| [in] | num_items | Total number of input items (i.e., length of d_in) |
| [in] | stream | [optional] CUDA stream to launch kernels within. Default is stream 0. |
| [in] | debug_synchronous | [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false. |
Definition at line 754 of file device_reduce.cuh.
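The relationship between d_in, d_compacted_out, d_counts_out, and d_num_segments can be traced with a sequential host-side reference. This is a sketch of the semantics only, not of the device implementation. `HostRunLengthEncode` is a hypothetical name, and the output arrays are assumed to have room for num_items entries.

```cpp
#include <cassert>

// Hypothetical sequential reference for the run-length encoding semantics.
template <typename T>
void HostRunLengthEncode(const T *in, T *compacted_out, int *counts_out,
                         int *num_segments, int num_items)
{
    assert(num_items >= 0);
    int segments = 0;
    for (int i = 0; i < num_items; ++i)
    {
        if (i > 0 && in[i] == in[i - 1])
        {
            // Same run: extend the current segment's length.
            ++counts_out[segments - 1];
        }
        else
        {
            // New run: copy its first value and start its count at 1.
            compacted_out[segments] = in[i];
            counts_out[segments]    = 1;
            ++segments;
        }
    }
    *num_segments = segments;
}
```

The result is the same as a reduce-by-key over the input in which every value is 1 and the reduction is addition.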