Path to the Fastest #1BRC Solution

This repository contains 10 snapshots of the fastest One Billion Row Challenge solution to describe the incremental path to the final version. It outlines the different optimization techniques and provides measurements of their impact. For general information on the contest and all submitted solutions, please visit the main #1brc repository.

There is also a lot of other great content about the challenge if you are looking for more detailed information, for example a step-by-step guide from Marko Topolnik, a talk by Roy van Rijn, or a podcast with Gunnar Morling.

In the chart and table below, the 10 steps of the fastest solution are outlined. Each intermediate step is checked into the repository as a separate file. The summary chart shows how the optimizations applied in the first few versions made the biggest difference compared to the reference implementation. As the solution got more advanced, the incremental progress got smaller over time, while the complexity of the solution continued to rise:

[Chart: Step-by-Step Progress of the Winning #1BRC Solution]

The performance measurements were performed on an Intel 13th Gen Core i9-13900K while restricting the program to the first 8 cores using taskset -c 0-7. This setup is comparable to the contest grading system, with only a few percent difference in measured times.

| Code | SLOC | Time (s) | Delta | Description |
|------|------|----------|-------|-------------|
| ref | 55 | 125.3 | - | Naive single-threaded reference implementation. |
| ref | 55 | 115.1 | -8.2% | Exact same code as above, but using the GraalVM JIT compiler: an easy win by just switching the JDK distribution! |
| v1 | 143 | 5.67 | -95.1% | Split the input into 8 segments and process them in parallel. Map the file into memory and use a byte buffer for parsing. Multiply input values by 10 and use int instead of double for intermediate calculations. Use a simple custom hash table implementation. |
| v2 | 165 | 4.24 | -25.2% | Use sun.misc.Unsafe instead of a byte buffer to avoid bounds checks and indirections when parsing. |
| v3 | 197 | 3.10 | -26.8% | Perform scanning and collision checking 4 or 8 bytes at a time. |
| v4 | 186 | 2.32 | -25.1% | Add branchless temperature parsing and SWAR scanning for the delimiter (see the sketch after the table). |
| v5 | 257 | 2.15 | -7.3% | Minor code shape improvements to better specialize on the cases where the city name is <= 8 bytes or <= 16 bytes. |
| v6 | 271 | 1.99 | -7.5% | Reduce the hash table size, while still avoiding collisions in the example data set, by adjusting the hash function. |
| v7 | 307 | 1.84 | -7.5% | Add a trick that spawns a subprocess to avoid the long wait until the Linux kernel unmapping of the file takes place. |
| v8 | 362 | 1.77 | -3.8% | Better work distribution between threads by processing the input in 2MB chunks. Improved instruction-level parallelism by always processing three entries in parallel in the same thread. |
| v9 | 343 | 1.71 | -3.4% | Save an instruction per row by folding an addition into the x86 address calculation, and another by performing the name length check on the mask value. |
| v10 | 349 | 1.45 | -15.4% | Remove branch mispredictions by always processing 16 bytes at a time and using masking instead of branches to process the city names. |
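
To make the SWAR technique from v4 concrete, below is a minimal, self-contained sketch. It is illustrative only, not the contest code, and it assumes the 8 input bytes were loaded as a little-endian long (as on x86):

// SWAR (SIMD within a register) sketch: locate the first ';' within 8 bytes
// loaded as a little-endian long, without any per-byte branches.
final class SwarScan {
    private static final long SEMICOLONS = 0x3B3B3B3B3B3B3B3BL; // ';' in every byte

    // Returns the byte index (0..7) of the first ';' in word, or 8 if absent.
    static int firstSemicolon(long word) {
        long x = word ^ SEMICOLONS; // bytes equal to ';' become 0x00
        // Classic zero-byte detection: sets 0x80 in each zero byte. False
        // positives can only occur above the first real match, so the lowest
        // set bit is always correct.
        long matchBits = (x - 0x0101010101010101L) & ~x & 0x8080808080808080L;
        return Long.numberOfTrailingZeros(matchBits) >>> 3; // 64 >>> 3 == 8 when absent
    }

    public static void main(String[] args) {
        long word = 0x2E32313B6F6C734FL; // "Oslo;12." with 'O' as the lowest byte
        System.out.println(firstSemicolon(word)); // prints 4
    }
}

One such check replaces up to eight byte comparisons, which is why v3/v4 can scan and collision-check 4 or 8 bytes at a time.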

GraalVM

As shown in the table above, just switching to the GraalVM JDK distribution and using the Graal JIT compiler gives an 8% speedup for the reference implementation. Many participants therefore used GraalVM for their submissions. As submission times dropped below 10 seconds, native image, with its instant startup characteristics, also became important. For best performance, the following native image flags were used:

-O3                           # Running with the highest optimization level enabled.
-H:TuneInlinerExploration=1   # Maximum inlining exploration.
-march=native                 # Produce machine code best for the target machine.
--gc=epsilon                  # Disable garbage collection as it is not necessary.

Incremental Progress

As outlined in the following chart, the improvement of the first version over the reference implementation was very large, at more than 95%. This was primarily due to parallelism (a factor of 8) and optimized data structures (int instead of double values). Those two aspects are often the most important ways to speed up a task.
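
As a small illustration of the int-instead-of-double idea, here is a sketch (not the contest code, and not yet the branchless v4 variant) that parses a temperature with exactly one decimal digit directly into tenths of a degree:

// Sketch of the "int instead of double" idea from v1 (illustrative only):
// "-12.3" becomes the int -123, so min/max/sum run on cheap integer
// arithmetic; the division by 10.0 happens once per station at output time.
static int parseTenths(byte[] line, int start, int end) {
    int i = start;
    boolean negative = line[i] == '-';
    if (negative) {
        i++;
    }
    int value = 0;
    for (; i < end; i++) {
        if (line[i] != '.') { // skip the decimal point
            value = value * 10 + (line[i] - '0');
        }
    }
    return negative ? -value : value; // "-12.3" -> -123
}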

[Chart: Improvement Over Previous Version]

Branch Misses

While the incremental improvements between versions became smaller and smaller, the last version makes a big jump without any algorithmic or data structure changes. The improvement comes from a large increase in instructions per cycle, achieved by eliminating branch misses. Looking at the number of instructions executed, the processor actually has to do more work, but because of the dramatically improved instruction-level parallelism, it can do so in fewer cycles and therefore less time. The following chart illustrates this:

[Chart: Instructions/Cycle and Branch Misses]

This GitHub comment on the PR shows the detailed perf stat output that underlines this. The important parts are captured in the following simplified output from perf stat:

BEFORE:
Performance counter stats for './target/CalculateAverage_thomaswue_image':
      12,061.39 msec task-clock                                       
   25,250,300,213      cpu_core/cycles/                                  
   42,740,848,935      cpu_core/instructions/           #    1.69  insn per cycle              
    3,938,912,502      cpu_core/branches/               #  326.572 M/sec                       
      246,089,883      cpu_core/branch-misses/          #    6.25% of all branches             
            TopdownL1 (cpu_core)                 #     24.9 %  tma_bad_speculation      

AFTER:
Performance counter stats for './target/CalculateAverage_thomaswue_image':
       9,778.04 msec task-clock
   21,654,496,541      cpu_core/cycles/                                   
   49,328,285,361      cpu_core/instructions/           #    2.28  insn per cycle            
    3,690,652,954      cpu_core/branches/               #  377.443 M/sec                      
       14,111,445      cpu_core/branch-misses/          #    0.38% of all branches          
            TopdownL1 (cpu_core)                 #     2.2 %  tma_bad_speculation    

Instead of missing 6.25% of all branches, the new version misses only 0.38%. This removes a major bottleneck to instruction retirement, and tma_bad_speculation drops from 24.9% down to 2.2%.

The algorithmic change here was to stop distinguishing the case of a city name <= 8 characters from the case of a city name <= 16 characters via branching. Instead, the program always reads ahead 16 characters and then uses masking to make sure only the actual city name is used for the hash code calculation and the comparison with the stored hash key. This does mean that for city names <= 8 characters, the processor now does more work. However, assuming a random distribution of city names in the input set, it avoids on average one branch miss every two rows.
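
The following is a minimal sketch of that masking step, illustrative only and not the author's exact code. It assumes matchBits is the SWAR match word produced by a delimiter scan like the one sketched after the table above (0x80 set in the byte holding the first ';', or 0 if this 8-byte word contains none):

// Illustrative sketch of the v10 masking idea: always load 16 name bytes as
// two longs and zero everything from the ';' onward, so hashing and key
// comparison are branch-free with respect to the name length.
static long maskToDelimiter(long word, long matchBits) {
    // Long.lowestOneBit isolates the first match; >>> 7 moves it from bit
    // 8*i+7 down to bit 8*i, so subtracting 1 keeps exactly the i bytes
    // before the ';'. With no match (matchBits == 0) this yields -1L and
    // keeps all 8 bytes.
    long keep = (Long.lowestOneBit(matchBits) >>> 7) - 1;
    return word & keep;
}

Both 8-byte halves of the 16-byte read go through such a mask; for a name shorter than 8 characters the second word is masked to zero entirely, which is exactly the extra but perfectly predictable work described above.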

Using perf stat and looking at the cpu_core/branch-misses and tma_bad_speculation data is a good way to understand whether the performance of your program is limited by branch mispredictions.
