Find extra global load by PointKernel · Pull Request #688 · NVIDIA/cuCollections · GitHub

Find extra global load #688


Draft · wants to merge 3 commits into base: dev


PointKernel
Member

TBD

PointKernel commented Feb 28, 2025

With flat storage, this is the first time we have seen the new implementation outperform the old one:

# static_multimap_count_uniform_capacity

## [0] Quadro RTX 8000

|  Key  |  Value  |  Distribution  |  NumInputs  |  Occupancy  |  Multiplicity  |  MatchingRate  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|---------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |   I32   |    UNIFORM     |    8000     |     0.5     |       1        |       1        |  22.151 us |       6.99% |  22.583 us |       9.20% |   0.431 us |   1.95% |   SAME   |
|  I32  |   I32   |    UNIFORM     |    80000    |     0.5     |       1        |       1        |  36.416 us |       9.23% |  40.158 us |       6.30% |   3.741 us |  10.27% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |   800000    |     0.5     |       1        |       1        | 401.024 us |      18.67% | 361.821 us |      15.44% | -39.203 us |  -9.78% |   SAME   |
|  I32  |   I32   |    UNIFORM     |   8000000   |     0.5     |       1        |       1        |   2.875 ms |       3.00% |   2.808 ms |       0.38% | -66.925 us |  -2.33% |   FAST   |
|  I32  |   I32   |    UNIFORM     |  80000000   |     0.5     |       1        |       1        |  27.423 ms |       0.27% |  27.661 ms |       0.05% | 237.853 us |   0.87% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |    8000     |     0.5     |       1        |       1        |  23.135 us |      10.60% |  22.316 us |       4.53% |  -0.819 us |  -3.54% |   SAME   |
|  I64  |   I64   |    UNIFORM     |    80000    |     0.5     |       1        |       1        |  42.780 us |      12.69% |  40.214 us |       2.53% |  -2.567 us |  -6.00% |   FAST   |
|  I64  |   I64   |    UNIFORM     |   800000    |     0.5     |       1        |       1        | 437.193 us |      10.34% | 417.448 us |       3.58% | -19.745 us |  -4.52% |   FAST   |
|  I64  |   I64   |    UNIFORM     |   8000000   |     0.5     |       1        |       1        |   3.634 ms |       1.04% |   3.688 ms |       0.30% |  53.928 us |   1.48% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  80000000   |     0.5     |       1        |       1        |  34.195 ms |       0.16% |  36.329 ms |       0.03% |   2.134 ms |   6.24% |   SLOW   |

# static_multimap_count_uniform_occupancy

## [0] Quadro RTX 8000

|  Key  |  Value  |  Distribution  |  NumInputs  |  Occupancy  |  Multiplicity  |  MatchingRate  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|-------|---------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|---------------|---------|----------|
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.1     |       1        |       1        |  31.307 ms |       0.25% |  34.631 ms |       0.03% |      3.324 ms |  10.62% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.2     |       1        |       1        |  31.272 ms |       0.70% |  35.834 ms |       7.71% |      4.562 ms |  14.59% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.3     |       1        |       1        |  31.523 ms |       0.21% |  34.598 ms |       0.02% |      3.075 ms |   9.75% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.4     |       1        |       1        |  32.404 ms |       0.18% |  34.578 ms |       0.02% |      2.174 ms |   6.71% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  34.506 ms |       0.12% |  34.577 ms |       0.05% |     70.820 us |   0.21% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.6     |       1        |       1        |  38.479 ms |       0.17% |  34.555 ms |       0.02% |  -3923.830 us | -10.20% |   FAST   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.7     |       1        |       1        |  45.102 ms |       0.27% |  34.566 ms |       0.30% | -10535.599 us | -23.36% |   FAST   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.8     |       1        |       1        |  58.036 ms |       0.18% |  34.547 ms |       0.10% | -23489.061 us | -40.47% |   FAST   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.9     |       1        |       1        |  91.570 ms |       0.13% |  34.509 ms |       0.04% | -57060.548 us | -62.31% |   FAST   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.1     |       1        |       1        |  39.615 ms |       0.90% |  47.578 ms |       0.03% |      7.963 ms |  20.10% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.2     |       1        |       1        |  37.303 ms |      10.07% |  50.532 ms |      11.09% |     13.229 ms |  35.46% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.3     |       1        |       1        |  36.864 ms |       2.83% |  47.059 ms |       8.22% |     10.195 ms |  27.65% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.4     |       1        |       1        |  39.353 ms |       0.13% |  45.794 ms |       2.97% |      6.441 ms |  16.37% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  43.383 ms |       0.14% |  45.381 ms |       0.04% |      1.998 ms |   4.60% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.6     |       1        |       1        |  49.443 ms |       0.13% |  45.377 ms |       0.03% |  -4066.805 us |  -8.23% |   FAST   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.7     |       1        |       1        |  58.406 ms |       0.14% |  45.401 ms |       0.11% | -13004.237 us | -22.27% |   FAST   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.8     |       1        |       1        |  73.871 ms |       0.13% |  45.388 ms |       0.25% | -28482.198 us | -38.56% |   FAST   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.9     |       1        |       1        | 116.161 ms |       0.09% |  45.324 ms |       0.02% | -70836.809 us | -60.98% |   FAST   |

# static_multimap_count_uniform_multiplicity

## [0] Quadro RTX 8000

|  Key  |  Value  |  Distribution  |  NumInputs  |  Occupancy  |  Multiplicity  |  MatchingRate  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|-------|---------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|---------------|---------|----------|
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  34.627 ms |       0.22% |  34.585 ms |       0.08% |    -42.315 us |  -0.12% |   FAST   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       2        |       1        |  36.470 ms |       0.16% |  34.572 ms |       0.03% |  -1897.150 us |  -5.20% |   FAST   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       4        |       1        |  44.291 ms |       1.03% |  34.562 ms |       0.03% |  -9728.790 us | -21.97% |   FAST   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       8        |       1        |  57.922 ms |       0.15% |  34.614 ms |       0.24% | -23307.597 us | -40.24% |   FAST   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       16       |       1        |  79.989 ms |       0.14% |  34.487 ms |       0.08% | -45501.636 us | -56.88% |   FAST   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  43.424 ms |       0.17% |  45.384 ms |       0.02% |      1.960 ms |   4.51% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       2        |       1        |  47.078 ms |       2.18% |  45.372 ms |       0.03% |  -1705.635 us |  -3.62% |   FAST   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       4        |       1        |  59.126 ms |       0.24% |  45.352 ms |       0.01% | -13773.945 us | -23.30% |   FAST   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       8        |       1        |  81.022 ms |       0.05% |  45.320 ms |       0.02% | -35701.331 us | -44.06% |   FAST   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       16       |       1        | 121.849 ms |       0.05% |  45.283 ms |       0.03% | -76566.178 us | -62.84% |   FAST   |

# static_multimap_count_uniform_matching_rate

## [0] Quadro RTX 8000

|  Key  |  Value  |  Distribution  |  NumInputs  |  Occupancy  |  Multiplicity  |  MatchingRate  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|-------|---------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|-------------|---------|----------|
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.1       |  33.128 ms |       0.16% |  34.591 ms |       0.13% |    1.463 ms |   4.42% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.2       |  33.426 ms |       0.21% |  34.576 ms |       0.03% |    1.150 ms |   3.44% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.3       |  33.546 ms |       0.24% |  34.652 ms |       0.22% |    1.106 ms |   3.30% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.4       |  33.651 ms |       0.14% |  34.578 ms |       0.03% |  927.283 us |   2.76% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.5       |  33.850 ms |       0.19% |  34.574 ms |       0.02% |  723.245 us |   2.14% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.6       |  34.085 ms |       0.38% |  34.617 ms |       0.10% |  532.194 us |   1.56% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.7       |  34.247 ms |       0.23% |  34.607 ms |       0.17% |  359.530 us |   1.05% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.8       |  34.492 ms |       0.37% |  34.583 ms |       0.08% |   90.867 us |   0.26% |   SLOW   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.9       |  34.599 ms |       0.22% |  34.575 ms |       0.02% |  -23.695 us |  -0.07% |   FAST   |
|  I32  |   I32   |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  34.756 ms |       0.42% |  34.611 ms |       0.28% | -145.127 us |  -0.42% |   FAST   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.1       |  38.548 ms |       0.19% |  45.392 ms |       0.04% |    6.844 ms |  17.76% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.2       |  39.206 ms |       0.12% |  45.378 ms |       0.05% |    6.172 ms |  15.74% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.3       |  39.927 ms |       0.25% |  45.382 ms |       0.04% |    5.455 ms |  13.66% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.4       |  40.482 ms |       0.17% |  45.372 ms |       0.05% |    4.890 ms |  12.08% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.5       |  40.977 ms |       0.19% |  45.369 ms |       0.01% |    4.392 ms |  10.72% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.6       |  41.524 ms |       0.29% |  45.367 ms |       0.02% |    3.843 ms |   9.26% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.7       |  42.006 ms |       0.13% |  45.407 ms |       0.04% |    3.400 ms |   8.09% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.8       |  42.647 ms |       0.19% |  45.388 ms |       0.05% |    2.741 ms |   6.43% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       1        |      0.9       |  43.034 ms |       0.15% |  45.379 ms |       0.03% |    2.344 ms |   5.45% |   SLOW   |
|  I64  |   I64   |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  43.521 ms |       0.11% |  45.421 ms |       0.25% |    1.900 ms |   4.37% |   SLOW   |
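
For readers checking the tables by hand, the `Diff` and `%Diff` columns appear to follow `Diff = Cmp Time - Ref Time` and `%Diff = Diff / Ref Time`. That relationship is inferred from the numbers above, not taken from the comparison tool's documentation, and the helper name `compare` is hypothetical. A minimal sketch using the I32 row at occupancy 0.9 from the occupancy table:

```python
def compare(ref_ms: float, cmp_ms: float) -> tuple[float, float, float]:
    """Return (diff_ms, pct_diff, speedup) for a reference and a compared time.

    Assumed definitions (inferred from the tables, not from tool docs):
      diff     = cmp - ref
      pct_diff = 100 * diff / ref
      speedup  = ref / cmp
    """
    diff_ms = cmp_ms - ref_ms
    pct_diff = 100.0 * diff_ms / ref_ms
    speedup = ref_ms / cmp_ms
    return diff_ms, pct_diff, speedup


# I32, occupancy 0.9: Ref Time 91.570 ms, Cmp Time 34.509 ms.
diff_ms, pct_diff, speedup = compare(91.570, 34.509)
print(f"{diff_ms:.3f} ms, {pct_diff:.2f}%, {speedup:.2f}x")
```

Under these assumed definitions the row's reported -62.31% is reproduced, which corresponds to roughly a 2.65x shorter runtime for the compared build at that occupancy.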

The additional global memory loads are still present, although the overhead has dropped from 35% more to 28% more. The number of LEA instructions is unchanged, but flat storage significantly reduces thread divergence, which improves runtime performance:
[image]

PointKernel added the help wanted, helps: rapids, and topic: performance labels Feb 28, 2025