-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Parallelize stereo fusion; needs pre-loading of entire workspace #1148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This will not build for now since it has dependencies on #1129. I will fix any residual build issues once that PR gets merged. |
src/mvs/fusion.cc
Outdated
@@ -383,25 +440,35 @@ void StereoFusion::Fuse() { | |||
// Set the current pixel as visited. | |||
fused_pixel_mask.Set(row, col, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will end up with race conditions here, because you simultaneously read/write to the masks from multiple threads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is true, and ideally this should have been an atomic update, but unfortunately the OpenMP standard (at least on Windows) does not allow atomic operations with function calls, only simple updates like x = val
or x += val
, etc. Wrapping this in a critical section kills the performance since all threads get blocked instead of just the ones that may conflict so this was also a no-go from a performance standpoint.
I've let that race condition exist since all threads are only attempting to set the same value (i.e. make the mask value true
) so they're not adversely impacting the result of fusion. Worst case, by my understanding, is that we reconsider a pixel that should have already been masked out as visited. This causes some potential duplication of work, but from a performance standpoint its practically negligible, and definitely much preferred to using a critical section.
If there is a better way to go about it I can update accordingly. If it so happens that atomic works better in Linux/Mac I can add the #pragma omp atomic
with a specific guard to skip it on windows. Otherwise I would have to change the data type of the mask from Mat<bool>
to std::vector
or something similar so I can directly access the data without a need for a function call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I managed to make this work with #pragma omp atomic
by accessing the underlying pointer of Mat<bool>
. Of course I'll have to performance test it on my end to ensure there is no adverse performance impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I tried won't work either. Apparently std::vector<bool>
does not have a .data()
member, unlike other types of vectors. I managed to make it work by converting the mask to Mat<int>
instead of Mat<bool>
and doing this:
const int pixel_index = depth_map_sizes_.at(image_idx).first * row + col;
#pragma omp atomic
fused_pixel_mask.GetPtr()[pixel_index]++;
Note that I'm using increment and not assignment here since Windows only supports OpenMP 2.0 where atomic can only handle +=
, -=
, ++
, and --
.
If you think that's a valid way of handling this then I'll go ahead and update the PR. Thankfully making this atomic does not impact performance for parallel execution, at least in the few datasets I tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your suggested fix doesn't resolve the actual problem causing potentially many duplicated fused points. I think this effect is exacerbated by the fact that you start spawning threads for each pixel row, i.e. nearby pixels will be processed by different threads, which are very likely to be fused into the same point.
I think we could minimize this effect by using some striding between different threads. For example:
const int kRowStride = max(1, height / num_threads);
#pragma omp parallel for schedule(dynamic) num_threads(num_threads)
for (int row_start = 0; row_start < height; row_start += kRowStride) {
const int row_end = min(height, row_start + kRowStride);
for (int row = row_start; row < row_end; ++row) {
for (int col = 0; col < width; ++col) {
...
}
}
}
In this case, elements in the fusion queue of different threads are further apart and are less likely to race with each other.
I was not so much concerned about the atomic update, which will obviously be a problem for bool
, which is typically implemented using bitshifting. We could switch to Mat<char>
and just use mask.Set(row, col, 1)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I just tested the current setup (chunk size of 1) against the single threaded run and multi-threaded runs with chunk sizes of 10, 20, and 50 (that should be the same as setting a row stride).
Interestingly, the processing time goes slightly up when increasing the chunk size, and the reconstructed points are slightly less than the single threaded point clouds, with the exception of chunk size of 1. Below is a summary of the findings.
- Reference single threaded run: Runtime 11.5 min; points: 7,570,936
The comparison results show the relative number of points and the % change:
- Chunk size 1 (current schedule): Runtime: 1.45 min; points: +20,568 (+0.27%)
- Chunk size 10: Runtime: 1.50 min; points: -13353 (-0.18%)
- Chunk size 20: Runtime: 1.66 min; points: -6130 (-0.08%)
- Chunk size 50: Runtime: 2.09 min; points: -2552 (-0.03%)
Those differences are really small, thankfully, and in light of that I think it would make sense to configure the scheduling with a chunk size of 10.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chunk size of 10 seems like a good trade off. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to do the thread pool - was just a suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing how colmap does not natively use OpenMP in favor of the ThreadPool, I'm afraid that a solution with OpenMP would pollute the code with quite a few guards checking if OpenMP is available.
I think I have a clear idea now on how to go about this with the ThreadPool, so I'll give it a try and update the PR. Should be cleaner overall with that approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks.
src/mvs/fusion.cc
Outdated
for (data.row = 0; data.row < height; ++data.row) { | ||
for (data.col = 0; data.col < width; ++data.col) { | ||
if (fused_pixel_mask.Get(data.row, data.col)) { | ||
#pragma omp parallel for schedule(dynamic) num_threads(num_threads) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any OpenMP code needs to be wrapped in #ifdef OPENMP_ENABLED
, because it's an optional dependency in Python. Alternatively and preferably, we could use the ThreadPool in util/threading.h
. To this end, we could implement a StereoFusionKernel class that holds the necessary state (fused pixel vectors, etc.) as its members.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it; I'll wrap it as such, and use your suggestion for an attempt to use the existing ThreadPool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick question on wrapping with #ifdef OPENMP_ENABLED
. I thought that the #pragma
s are ignored in cases where they cannot be resolved (i.e. OpenMP missing and finding a #pragma omp
directive). Is this not the case with the colmap setup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will certainly trigger an unknown pragma warning. If you then also treat warnings as errors, compilation will fail.
src/mvs/fusion.cc
Outdated
} else { | ||
workspace_.reset(new NoCacheWorkspace(workspace_options)); | ||
workspace_->Load(image_names); | ||
num_threads = omp_get_max_threads(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to make the number of threads configurable in the options. It could be initialized to -1 and you use GetEffectiveNumThreads
to determine the number of threads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a good suggestion and an easy change to make :)
Replaced OpenMP with ThreadPool for both fusion and workspace. Also removed all the members from Fusion that were only really used internally in the |
src/exe/colmap.cc
Outdated
auto& max_bound = options.stereo_fusion->bounding_box.second; | ||
file >> min_bound(0) >> min_bound(1) >> min_bound(2); | ||
file >> max_bound(0) >> max_bound(1) >> max_bound(2); | ||
file.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.close() is not necessary as the destructor will do the job.
src/mvs/workspace.h
Outdated
@@ -42,12 +42,17 @@ | |||
namespace colmap { | |||
namespace mvs { | |||
|
|||
class ConfidenceMap; | |||
|
|||
class Workspace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's better to called to NoCache
version Workspace
and the cached one as CachedWorkspace
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK sounds good, I'll flip the inheritance structure around. In that case it would make more sense then to have the default behavior with cache disabled for fusion. This will also impact patch-match where we actually want the cached workspace, but that would just be a change in the workspace initialization.
src/mvs/workspace.cc
Outdated
confidence_maps_.resize(num_images); | ||
normal_maps_.resize(num_images); | ||
|
||
auto LoadWorkspaceData = [&, this](int image_idx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const int image_idx (in COLMAP we use const even for basic types)
src/mvs/workspace.cc
Outdated
} | ||
}; | ||
|
||
int num_threads = GetEffectiveNumThreads(options_.num_threads); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
src/mvs/workspace.cc
Outdated
std::cout << StringPrintf("Loading workspace data with %d threads...", | ||
num_threads) | ||
<< std::endl; | ||
for (int i = 0; i < image_names.size(); ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size_t i
src/mvs/fusion.h
Outdated
int image_idx; | ||
int row; | ||
int col; | ||
int traversal_depth; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you restore the previous default values here? It's generally good practice to keep these to avoid uninitialized memory, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good; I only removed them because now we never create a default FusionData
struct and we only use it with the constructor through emplace_back
src/mvs/fusion.cc
Outdated
const Eigen::Vector3f next_proj = | ||
P_.at(next_image_idx) * xyz.homogeneous(); | ||
next_data.col = static_cast<int>(std::round(next_proj(0) / next_proj(2))); | ||
next_data.row = static_cast<int>(std::round(next_proj(1) / next_proj(2))); | ||
int next_col = static_cast<int>(std::round(next_proj(0) / next_proj(2))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const and also next_row
src/mvs/fusion.cc
Outdated
} | ||
} | ||
|
||
void StereoFusion::Fuse(int thread_id, int image_idx, int row, int col) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please make the parameters const
<< std::endl; | ||
ThreadPool thread_pool(num_threads); | ||
|
||
const int kRowStride = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to add a short explanation, why we use a stride here.
src/mvs/fusion.cc
Outdated
fused_points_visibility_.shrink_to_fit(); | ||
fused_points_.reserve(total_fused_points); | ||
fused_points_visibility_.reserve(total_fused_points); | ||
for (size_t task_id = 0; task_id < task_fused_points_.size(); ++task_id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
task_id -> thread_id for consistency
@anmatako You will have to merge with master. note that I modularized exe/colmap.cc into individual source files, since it grew way beyond what we should put into a single source file. |
yes now that we are getting closer with these changes I'll do a merge from master and also resolve some other issues with confidence maps and calculated normals that were introduced from the PatchMatchNet PR. I'll do a good cleanup and make sure this builds standalone. |
Merge from main repo
Merge from main
The default Workspace now works without a cache (similar to old NoCacheWorkspace) and the cached version is now explicitly named as CachedWorkspace.
src/mvs/workspace.h
Outdated
const Bitmap& GetBitmap(const int image_idx); | ||
const DepthMap& GetDepthMap(const int image_idx); | ||
const NormalMap& GetNormalMap(const int image_idx); | ||
virtual inline const Bitmap& GetBitmap(const int image_idx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest moving these method definitions to the .cc file. Since it's a virtual method, it won't be able to inline these anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for all the inline methods below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK sounds good
@anmatako Thanks, the PR looks great now. I trust you checked correctness that these changes lead to equivalent (+- small changes) fusion results as before? I left two minor comments, then I am happy to merge. Thanks! |
Thanks! I've tested quite a lot in our pipelines and the differences are really minor compared to single threaded fusion; less than 0.1% in the number fused points and visually nothing stood out as an obvious outlier. |
Thanks very much. This is a great improvement! |
…map#1148) * Parallelize stereo fusion; needs pre-loading of entire workspace * Make mask update atomic * Fix bug in accessing depth map size * Allow configurable num_threads in fusion and workspace * Fix merging error * Use ThreadPool instead of OpenMP for parallel fusion * Update for PR comments * Change Workspace naming conventions The default Workspace now works without a cache (similar to old NoCacheWorkspace) and the cached version is now explicitly named as CachedWorkspace. * Fix build errors * Update fusion.h * Update fusion.h * Address PR comments Co-authored-by: Antonios Matakos <anmatako@dss.microsoft.us> Co-authored-by: Johannes Schönberger <joschonb@microsoft.com>
Stereo fusion can now be done in parallel as long as the entire workspace gets pre-loaded. A new
NoCacheWorkspace
class can do that when initializing fusion. Parallel execution is the new default so to use the existing functionality with cache and sequential fusionstereo_fusion
must be used with--StereoFusion.use_cache 1
.Also added options to:
ImageReaderOptions