Parallelize stereo fusion; needs pre-loading of entire workspace #1148

anmatako · 2021-02-24T06:02:25Z

Stereo fusion can now be done in parallel as long as the entire workspace gets pre-loaded. A new NoCacheWorkspace class can do that when initializing fusion. Parallel execution is the new default so to use the existing functionality with cache and sequential fusion stereo_fusion must be used with --StereoFusion.use_cache 1.

Also added options to:

Use image masks that excludes pixels from the depth maps. Work similarly to masks in ImageReaderOptions
Restrict the fused point cloud within a specified bounding box described from its two 3D corner points. The bounding box is in plan text format with space separated numeric values for the corner points, e.g. ` .

anmatako · 2021-02-24T06:02:54Z

This will not build for now since it has dependencies on #1129. I will fix any residual build issues once that PR gets merged.

src/mvs/fusion.cc

ahojnnes · 2021-02-24T09:43:14Z

src/mvs/fusion.cc

@@ -383,25 +440,35 @@ void StereoFusion::Fuse() {
    // Set the current pixel as visited.
    fused_pixel_mask.Set(row, col, true);


You will end up with race conditions here, because you simultaneously read/write to the masks from multiple threads.

Yes this is true, and ideally this should have been an atomic update, but unfortunately the OpenMP standard (at least on Windows) does not allow atomic operations with function calls, only simple updates like x = val or x += val, etc. Wrapping this in a critical section kills the performance since all threads get blocked instead of just the ones that may conflict so this was also a no-go from a performance standpoint.

I've let that race condition exist since all threads are only attempting to set the same value (i.e. make the mask value true) so they're not adversely impacting the result of fusion. Worst case, by my understanding, is that we reconsider a pixel that should have already been masked out as visited. This causes some potential duplication of work, but from a performance standpoint its practically negligible, and definitely much preferred to using a critical section.

If there is a better way to go about it I can update accordingly. If it so happens that atomic works better in Linux/Mac I can add the #pragma omp atomic with a specific guard to skip it on windows. Otherwise I would have to change the data type of the mask from Mat<bool> to std::vector or something similar so I can directly access the data without a need for a function call.

I think I managed to make this work with #pragma omp atomic by accessing the underlying pointer of Mat<bool>. Of course I'll have to performance test it on my end to ensure there is no adverse performance impact.

What I tried won't work either. Apparently std::vector<bool> does not have a .data() member, unlike other types of vectors. I managed to make it work by converting the mask to Mat<int> instead of Mat<bool> and doing this:

const int pixel_index = depth_map_sizes_.at(image_idx).first * row + col; #pragma omp atomic fused_pixel_mask.GetPtr()[pixel_index]++;

Note that I'm using increment and not assignment here since Windows only supports OpenMP 2.0 where atomic can only handle +=, -=, ++, and --.

If you think that's a valid way of handling this then I'll go ahead and update the PR. Thankfully making this atomic does not impact performance for parallel execution, at least in the few datasets I tested.

I think your suggested fix doesn't resolve the actual problem causing potentially many duplicated fused points. I think this effect is exacerbated by the fact that you start spawning threads for each pixel row, i.e. nearby pixels will be processed by different threads, which are very likely to be fused into the same point.

I think we could minimize this effect by using some striding between different threads. For example:

const int kRowStride = max(1, height / num_threads); #pragma omp parallel for schedule(dynamic) num_threads(num_threads) for (int row_start = 0; row_start < height; row_start += kRowStride) { const int row_end = min(height, row_start + kRowStride); for (int row = row_start; row < row_end; ++row) { for (int col = 0; col < width; ++col) { ... } } }

In this case, elements in the fusion queue of different threads are further apart and are less likely to race with each other.

I was not so much concerned about the atomic update, which will obviously be a problem for bool, which is typically implemented using bitshifting. We could switch to Mat<char> and just use mask.Set(row, col, 1).

So I just tested the current setup (chunk size of 1) against the single threaded run and multi-threaded runs with chunk sizes of 10, 20, and 50 (that should be the same as setting a row stride).

Interestingly, the processing time goes slightly up when increasing the chunk size, and the reconstructed points are slightly less than the single threaded point clouds, with the exception of chunk size of 1. Below is a summary of the findings.

Reference single threaded run: Runtime 11.5 min; points: 7,570,936

The comparison results show the relative number of points and the % change:

Chunk size 1 (current schedule): Runtime: 1.45 min; points: +20,568 (+0.27%)

Chunk size 10: Runtime: 1.50 min; points: -13353 (-0.18%)

Chunk size 20: Runtime: 1.66 min; points: -6130 (-0.08%)

Chunk size 50: Runtime: 2.09 min; points: -2552 (-0.03%)

Those differences are really small, thankfully, and in light of that I think it would make sense to configure the scheduling with a chunk size of 10.

Chunk size of 10 seems like a good trade off. Thanks!

No need to do the thread pool - was just a suggestion.

Seeing how colmap does not natively use OpenMP in favor of the ThreadPool, I'm afraid that a solution with OpenMP would pollute the code with quite a few guards checking if OpenMP is available.

I think I have a clear idea now on how to go about this with the ThreadPool, so I'll give it a try and update the PR. Should be cleaner overall with that approach.

Looks great, thanks.

src/exe/colmap.cc

src/mvs/fusion.cc

ahojnnes · 2021-03-05T08:34:33Z

src/mvs/fusion.cc

-    for (data.row = 0; data.row < height; ++data.row) {
-      for (data.col = 0; data.col < width; ++data.col) {
-        if (fused_pixel_mask.Get(data.row, data.col)) {
+#pragma omp parallel for schedule(dynamic) num_threads(num_threads)


Any OpenMP code needs to be wrapped in #ifdef OPENMP_ENABLED, because it's an optional dependency in Python. Alternatively and preferably, we could use the ThreadPool in util/threading.h. To this end, we could implement a StereoFusionKernel class that holds the necessary state (fused pixel vectors, etc.) as its members.

Got it; I'll wrap it as such, and use your suggestion for an attempt to use the existing ThreadPool.

Quick question on wrapping with #ifdef OPENMP_ENABLED. I thought that the #pragmas are ignored in cases where they cannot be resolved (i.e. OpenMP missing and finding a #pragma omp directive). Is this not the case with the colmap setup?

It will certainly trigger an unknown pragma warning. If you then also treat warnings as errors, compilation will fail.

ahojnnes · 2021-03-05T08:37:34Z

src/mvs/fusion.cc

+  } else {
+    workspace_.reset(new NoCacheWorkspace(workspace_options));
+    workspace_->Load(image_names);
+    num_threads = omp_get_max_threads();


It would be nice to make the number of threads configurable in the options. It could be initialized to -1 and you use GetEffectiveNumThreads to determine the number of threads.

Yes, that's a good suggestion and an easy change to make :)

anmatako · 2021-03-07T17:43:09Z

Replaced OpenMP with ThreadPool for both fusion and workspace. Also removed all the members from Fusion that were only really used internally in the Fuse function. Now there's much less of the ugliness with the thread_id indexing. Performance took a minor hit, but overall it's not significant.

ahojnnes · 2021-03-07T19:02:23Z

src/exe/colmap.cc

+      auto& max_bound = options.stereo_fusion->bounding_box.second;
+      file >> min_bound(0) >> min_bound(1) >> min_bound(2);
+      file >> max_bound(0) >> max_bound(1) >> max_bound(2);
+      file.close();


.close() is not necessary as the destructor will do the job.

ahojnnes · 2021-03-07T19:08:47Z

src/mvs/workspace.h

@@ -42,12 +42,17 @@
 namespace colmap {
 namespace mvs {

+class ConfidenceMap;
+
 class Workspace {


Maybe it's better to called to NoCache version Workspace and the cached one as CachedWorkspace.

OK sounds good, I'll flip the inheritance structure around. In that case it would make more sense then to have the default behavior with cache disabled for fusion. This will also impact patch-match where we actually want the cached workspace, but that would just be a change in the workspace initialization.

ahojnnes · 2021-03-07T19:09:47Z

src/mvs/workspace.cc

+  confidence_maps_.resize(num_images);
+  normal_maps_.resize(num_images);
+
+  auto LoadWorkspaceData = [&, this](int image_idx) {


const int image_idx (in COLMAP we use const even for basic types)

ahojnnes · 2021-03-07T19:10:07Z

src/mvs/workspace.cc

+    }
+  };
+
+  int num_threads = GetEffectiveNumThreads(options_.num_threads);


ahojnnes · 2021-03-07T19:10:19Z

src/mvs/workspace.cc

+  std::cout << StringPrintf("Loading workspace data with %d threads...",
+                            num_threads)
+            << std::endl;
+  for (int i = 0; i < image_names.size(); ++i) {


ahojnnes · 2021-03-07T19:11:15Z

src/mvs/fusion.h

+    int image_idx;
+    int row;
+    int col;
+    int traversal_depth;


Can you restore the previous default values here? It's generally good practice to keep these to avoid uninitialized memory, I think.

Sounds good; I only removed them because now we never create a default FusionData struct and we only use it with the constructor through emplace_back

ahojnnes · 2021-03-07T19:11:56Z

src/mvs/fusion.cc

      const Eigen::Vector3f next_proj =
          P_.at(next_image_idx) * xyz.homogeneous();
-      next_data.col = static_cast<int>(std::round(next_proj(0) / next_proj(2)));
-      next_data.row = static_cast<int>(std::round(next_proj(1) / next_proj(2)));
+      int next_col = static_cast<int>(std::round(next_proj(0) / next_proj(2)));


const and also next_row

ahojnnes · 2021-03-07T19:12:26Z

src/mvs/fusion.cc

+  }
+}
+
+void StereoFusion::Fuse(int thread_id, int image_idx, int row, int col) {


please make the parameters const

ahojnnes · 2021-03-07T19:13:57Z

src/mvs/fusion.cc

+            << std::endl;
+  ThreadPool thread_pool(num_threads);
+
+  const int kRowStride = 10;


It would be great to add a short explanation, why we use a stride here.

ahojnnes · 2021-03-07T19:15:01Z

src/mvs/fusion.cc

-  fused_points_visibility_.shrink_to_fit();
+  fused_points_.reserve(total_fused_points);
+  fused_points_visibility_.reserve(total_fused_points);
+  for (size_t task_id = 0; task_id < task_fused_points_.size(); ++task_id) {


task_id -> thread_id for consistency

ahojnnes · 2021-03-07T19:16:12Z

@anmatako You will have to merge with master. note that I modularized exe/colmap.cc into individual source files, since it grew way beyond what we should put into a single source file.

anmatako · 2021-03-07T19:27:31Z

@anmatako You will have to merge with master. note that I modularized exe/colmap.cc into individual source files, since it grew way beyond what we should put into a single source file.

yes now that we are getting closer with these changes I'll do a merge from master and also resolve some other issues with confidence maps and calculated normals that were introduced from the PatchMatchNet PR.

I'll do a good cleanup and make sure this builds standalone.

Merge from main repo

Merge from main

The default Workspace now works without a cache (similar to old NoCacheWorkspace) and the cached version is now explicitly named as CachedWorkspace.

ahojnnes · 2021-03-09T22:00:23Z

src/mvs/workspace.h

-  const Bitmap& GetBitmap(const int image_idx);
-  const DepthMap& GetDepthMap(const int image_idx);
-  const NormalMap& GetNormalMap(const int image_idx);
+  virtual inline const Bitmap& GetBitmap(const int image_idx) {


I suggest moving these method definitions to the .cc file. Since it's a virtual method, it won't be able to inline these anyway.

Same for all the inline methods below.

OK sounds good

ahojnnes · 2021-03-09T22:02:12Z

@anmatako Thanks, the PR looks great now. I trust you checked correctness that these changes lead to equivalent (+- small changes) fusion results as before? I left two minor comments, then I am happy to merge. Thanks!

anmatako · 2021-03-09T22:06:29Z

@anmatako Thanks, the PR looks great now. I trust you checked correctness that these changes lead to equivalent (+- small changes) fusion results as before? I left two minor comments, then I am happy to merge. Thanks!

Thanks! I've tested quite a lot in our pipelines and the differences are really minor compared to single threaded fusion; less than 0.1% in the number fused points and visually nothing stood out as an obvious outlier.

ahojnnes · 2021-03-09T22:39:41Z

Thanks very much. This is a great improvement!

…map#1148) * Parallelize stereo fusion; needs pre-loading of entire workspace * Make mask update atomic * Fix bug in accessing depth map size * Allow configurable num_threads in fusion and workspace * Fix merging error * Use ThreadPool instead of OpenMP for parallel fusion * Update for PR comments * Change Workspace naming conventions The default Workspace now works without a cache (similar to old NoCacheWorkspace) and the cached version is now explicitly named as CachedWorkspace. * Fix build errors * Update fusion.h * Update fusion.h * Address PR comments Co-authored-by: Antonios Matakos <anmatako@dss.microsoft.us> Co-authored-by: Johannes Schönberger <joschonb@microsoft.com>

Parallelize stereo fusion; needs pre-loading of entire workspace

562715a

ahojnnes reviewed Feb 24, 2021

View reviewed changes

Antonios Matakos and others added 3 commits February 24, 2021 11:17

Make mask update atomic

4dd82ad

Fix bug in accessing depth map size

8be4bc3

Merge branch 'dev' into public/fusion

a9089f6

ahojnnes reviewed Mar 5, 2021

View reviewed changes

src/exe/colmap.cc Outdated Show resolved Hide resolved

src/mvs/fusion.cc Outdated Show resolved Hide resolved

src/mvs/fusion.cc Outdated Show resolved Hide resolved

ahojnnes reviewed Mar 5, 2021

View reviewed changes

Antonios Matakos added 3 commits March 5, 2021 16:16

Allow configurable num_threads in fusion and workspace

3f71379

Fix merging error

bb389a4

Use ThreadPool instead of OpenMP for parallel fusion

8ee0fb6

ahojnnes reviewed Mar 7, 2021

View reviewed changes

src/mvs/workspace.cc Outdated

}

};

int num_threads = GetEffectiveNumThreads(options_.num_threads);

Copy link

Contributor

ahojnnes Mar 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

ahojnnes reviewed Mar 7, 2021

View reviewed changes

Antonios Matakos and others added 5 commits March 7, 2021 11:33

Update for PR comments

0c77f64

Merge pull request #3 from colmap/dev

485534e

Merge from main repo

Merge pull request #6 from colmap/dev

6ead0b9

Merge from main

Merge remote-tracking branch 'public/public/dev' into public/fusion

3549720

Change Workspace nami 7438 ng conventions

98575fa

The default Workspace now works without a cache (similar to old NoCacheWorkspace) and the cached version is now explicitly named as CachedWorkspace.

Antonios Matakos and others added 5 commits March 8, 2021 13:32

Fix build errors

64e3042

Update fusion.h

fe087e1

Update fusion.h

05d5815

Merge branch 'dev' into public/fusion

c8e5ca7

Merge branch 'dev' into public/fusion

0c0f4e2

ahojnnes reviewed Mar 9, 2021

View reviewed changes

Address PR comments

56d3cfb

ahojnnes enabled auto-merge (squash) March 9, 2021 22:39

ahojnnes merged commit fd3e5a1 into colmap:dev Mar 9, 2021

anmatako deleted the public/fusion branch March 10, 2021 22:25

ibrarmalik mentioned this pull request Apr 12, 2022

Use masks for stereo fusion on automatic reconstruction #1488

Merged

		@@ -383,25 +440,35 @@ void StereoFusion::Fuse() {
		// Set the current pixel as visited.
		fused_pixel_mask.Set(row, col, true);

Parallelize stereo fusion; needs pre-loading of entire workspace #1148

Parallelize stereo fusion; needs pre-loading of entire workspace #1148

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!