Parallelize stereo fusion; needs pre-loading of entire workspace by anmatako · Pull Request #1148 · colmap/colmap · GitHub

Parallelize stereo fusion; needs pre-loading of entire workspace #1148


Merged
merged 18 commits into colmap:dev on Mar 9, 2021

Conversation

anmatako
Contributor

Stereo fusion can now be done in parallel as long as the entire workspace gets pre-loaded. A new NoCacheWorkspace class does this when initializing fusion. Parallel execution is the new default, so to keep the existing cached, sequential fusion behavior, stereo_fusion must be run with --StereoFusion.use_cache 1.
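
For illustration, a typical invocation that keeps the old cached, sequential behavior might look like the following (the workspace and output paths are placeholders; only the --StereoFusion.use_cache flag comes from this PR):

colmap stereo_fusion --workspace_path ./dense --output_path ./dense/fused.ply --StereoFusion.use_cache 1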

Also added options to:

  • Use image masks that exclude pixels from the depth maps. These work similarly to masks in ImageReaderOptions.
  • Restrict the fused point cloud to within a specified bounding box described by its two 3D corner points. The bounding box file is in plain text format with space-separated numeric values for the corner points (see the sketch after this list).
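
As a sketch of the bounding-box option described above (the helper name, the path argument, and the example file contents are hypothetical; the stream-parsing pattern mirrors the diff quoted later in this review):

#include <fstream>
#include <string>
#include <utility>

#include <Eigen/Core>

// Example file contents (min corner, then max corner), whitespace-separated:
//   -10.0 -10.0 -2.0  10.0 10.0 5.0
std::pair<Eigen::Vector3f, Eigen::Vector3f> ReadBoundingBox(
    const std::string& path) {
  std::ifstream file(path);
  Eigen::Vector3f min_bound, max_bound;
  file >> min_bound(0) >> min_bound(1) >> min_bound(2);
  file >> max_bound(0) >> max_bound(1) >> max_bound(2);
  return {min_bound, max_bound};
}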

@anmatako
Contributor Author
anmatako commented Feb 24, 2021

This will not build for now since it has dependencies on #1129. I will fix any residual build issues once that PR gets merged.

@@ -383,25 +440,35 @@ void StereoFusion::Fuse() {
// Set the current pixel as visited.
fused_pixel_mask.Set(row, col, true);
Contributor

You will end up with race conditions here, because you simultaneously read/write to the masks from multiple threads.

Contributor Author

Yes, this is true, and ideally this should have been an atomic update, but unfortunately the OpenMP standard (at least on Windows) does not allow atomic operations with function calls, only simple updates like x = val or x += val. Wrapping this in a critical section kills performance since all threads get blocked instead of just the ones that may conflict, so that was also a no-go from a performance standpoint.

I've let that race condition exist since all threads are only attempting to set the same value (i.e. make the mask value true), so they do not adversely impact the result of fusion. The worst case, by my understanding, is that we reconsider a pixel that should already have been masked out as visited. This causes some potential duplication of work, but from a performance standpoint it's practically negligible, and definitely much preferred to using a critical section.

If there is a better way to go about it I can update accordingly. If it so happens that atomic works better on Linux/Mac, I can add the #pragma omp atomic with a specific guard to skip it on Windows. Otherwise I would have to change the data type of the mask from Mat<bool> to std::vector or something similar so I can directly access the data without needing a function call.
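
A minimal sketch of the restriction being described (names are illustrative): under MSVC's OpenMP 2.0, the statement after #pragma omp atomic must be a simple scalar update, so a member-function call cannot be made atomic directly, while a critical section serializes all threads:

// Illustrative only: "visited" stands in for the mask's underlying storage.
void MarkVisited(int* visited, const int pixel_index) {
  // Accepted by OpenMP 2.0: simple scalar updates such as x += e, x++, x--.
#pragma omp atomic
  visited[pixel_index]++;
}

// Not accepted as an atomic update (the statement is a function call), and a
// critical section would block every thread rather than only conflicting ones:
// #pragma omp critical
// { fused_pixel_mask.Set(row, col, true); }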

Contributor Author

I think I managed to make this work with #pragma omp atomic by accessing the underlying pointer of Mat<bool>. Of course I'll have to performance test it on my end to ensure there is no adverse performance impact.

Contributor Author
@anmatako anmatako Feb 24, 2021

What I tried won't work either. Apparently std::vector<bool> does not have a .data() member, unlike other types of vectors. I managed to make it work by converting the mask to Mat<int> instead of Mat<bool> and doing this:

const int pixel_index = depth_map_sizes_.at(image_idx).first * row + col;
#pragma omp atomic
fused_pixel_mask.GetPtr()[pixel_index]++;

Note that I'm using increment and not assignment here since Windows only supports OpenMP 2.0 where atomic can only handle +=, -=, ++, and --.

If you think that's a valid way of handling this then I'll go ahead and update the PR. Thankfully making this atomic does not impact performance for parallel execution, at least in the few datasets I tested.

Contributor

I think your suggested fix doesn't resolve the actual problem, which is potentially many duplicated fused points. I think this effect is exacerbated by the fact that threads are spawned per pixel row, i.e. nearby pixels will be processed by different threads and are very likely to be fused into the same point.

I think we could minimize this effect by using some striding between different threads. For example:

const int kRowStride = std::max(1, height / num_threads);
#pragma omp parallel for schedule(dynamic) num_threads(num_threads)
for (int row_start = 0; row_start < height; row_start += kRowStride) {
    const int row_end = std::min(height, row_start + kRowStride);
    for (int row = row_start; row < row_end; ++row) {
        for (int col = 0; col < width; ++col) {
            ...
        }
    }
}

In this case, elements in the fusion queue of different threads are further apart and are less likely to race with each other.

I was not so much concerned about the atomic update, which will obviously be a problem for bool, which is typically implemented using bitshifting. We could switch to Mat<char> and just use mask.Set(row, col, 1).

Contributor Author

So I just tested the current setup (chunk size of 1) against the single-threaded run and multi-threaded runs with chunk sizes of 10, 20, and 50 (which should be equivalent to setting a row stride).

Interestingly, the processing time goes up slightly when increasing the chunk size, and the reconstructed point counts are slightly lower than the single-threaded point cloud, with the exception of chunk size 1. Below is a summary of the findings.

  • Reference single-threaded run: Runtime: 11.5 min; points: 7,570,936

The comparison results show the relative number of points and the % change:

  1. Chunk size 1 (current schedule): Runtime: 1.45 min; points: +20,568 (+0.27%)
  2. Chunk size 10: Runtime: 1.50 min; points: -13,353 (-0.18%)
  3. Chunk size 20: Runtime: 1.66 min; points: -6,130 (-0.08%)
  4. Chunk size 50: Runtime: 2.09 min; points: -2,552 (-0.03%)

Those differences are really small, thankfully, and in light of that I think it would make sense to configure the scheduling with a chunk size of 10.

Contributor

Chunk size of 10 seems like a good trade off. Thanks!

Contributor

No need to do the thread pool - was just a suggestion.

Contributor Author

Seeing how colmap does not natively use OpenMP, favoring its ThreadPool instead, I'm afraid that a solution with OpenMP would pollute the code with quite a few guards checking whether OpenMP is available.

I think I have a clear idea now on how to go about this with the ThreadPool, so I'll give it a try and update the PR. Should be cleaner overall with that approach.

Contributor

Looks great, thanks.

for (data.row = 0; data.row < height; ++data.row) {
for (data.col = 0; data.col < width; ++data.col) {
if (fused_pixel_mask.Get(data.row, data.col)) {
#pragma omp parallel for schedule(dynamic) num_threads(num_threads)
Contributor
@ahojnnes ahojnnes Mar 5, 2021

Any OpenMP code needs to be wrapped in #ifdef OPENMP_ENABLED, because OpenMP is an optional dependency. Alternatively and preferably, we could use the ThreadPool in util/threading.h. To this end, we could implement a StereoFusionKernel class that holds the necessary state (fused pixel vectors, etc.) as its members.
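
A sketch of the guard asked for here, using the OPENMP_ENABLED macro named in this comment (the loop body is elided):

#ifdef OPENMP_ENABLED
#pragma omp parallel for schedule(dynamic) num_threads(num_threads)
#endif
for (int row = 0; row < height; ++row) {
  // ... per-pixel fusion work ...
}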

Contributor Author

Got it; I'll wrap it as such, and use your suggestion for an attempt to use the existing ThreadPool.

Contributor Author

Quick question on wrapping with #ifdef OPENMP_ENABLED: I thought the #pragmas are ignored when they cannot be resolved (i.e. OpenMP missing and a #pragma omp directive is found). Is this not the case with the colmap setup?

Contributor

It will certainly trigger an unknown pragma warning. If you then also treat warnings as errors, compilation will fail.

} else {
workspace_.reset(new NoCacheWorkspace(workspace_options));
workspace_->Load(image_names);
num_threads = omp_get_max_threads();
Contributor

It would be nice to make the number of threads configurable in the options. It could be initialized to -1 and you use GetEffectiveNumThreads to determine the number of threads.

Contributor Author

Yes, that's a good suggestion and an easy change to make :)
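
A minimal sketch of the suggested wiring (the struct and member names are approximated; GetEffectiveNumThreads is the helper named in the review comments):

struct StereoFusionOptions {
  // -1 lets GetEffectiveNumThreads fall back to the hardware thread count.
  int num_threads = -1;
};

// At fusion/workspace initialization:
const int num_threads = GetEffectiveNumThreads(options_.num_threads);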

@anmatako
Contributor Author
anmatako commented Mar 7, 2021

Replaced OpenMP with ThreadPool for both fusion and workspace. Also removed all the members from Fusion that were only really used internally in the Fuse function. Now there's much less of the ugliness with the thread_id indexing. Performance took a minor hit, but overall it's not significant.
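
A rough sketch of that structure, pieced together from the diff fragments quoted below (ThreadPool, kRowStride, task_fused_points_, and the Fuse(thread_id, ...) signature appear in the review; the lambda body, the GetThreadIndex call, and the merge step are paraphrased rather than quoted from the diff):

ThreadPool thread_pool(num_threads);
const int kRowStride = 10;
for (int row_start = 0; row_start < height; row_start += kRowStride) {
  thread_pool.AddTask([&, row_start]() {
    const int row_end = std::min(height, row_start + kRowStride);
    const int thread_id = thread_pool.GetThreadIndex();
    for (int row = row_start; row < row_end; ++row) {
      for (int col = 0; col < width; ++col) {
        // Fuse pixel (row, col) into the per-thread buffer
        // task_fused_points_[thread_id] so no locking is needed.
      }
    }
  });
}
thread_pool.Wait();
// The per-thread buffers are then concatenated into fused_points_, as in the
// reserve/merge snippet quoted further down in this review.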

auto& max_bound = options.stereo_fusion->bounding_box.second;
file >> min_bound(0) >> min_bound(1) >> min_bound(2);
file >> max_bound(0) >> max_bound(1) >> max_bound(2);
file.close();
Contributor

.close() is not necessary as the destructor will do the job.

@@ -42,12 +42,17 @@
namespace colmap {
namespace mvs {

class ConfidenceMap;

class Workspace {
Contributor

Maybe it's better to call the NoCache version Workspace and the cached one CachedWorkspace.

Contributor Author
@anmatako anmatako Mar 7, 2021

OK, sounds good; I'll flip the inheritance structure around. In that case it would make more sense to have the default behavior with cache disabled for fusion. This will also impact patch-match, where we actually want the cached workspace, but that would just be a change in the workspace initialization.
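
The resulting shape of the hierarchy, as described in the commit message later in this PR (a sketch with abbreviated method lists, not the exact header):

// Default workspace: pre-loads its data, no cache.
class Workspace {
 public:
  virtual ~Workspace() = default;
  virtual const Bitmap& GetBitmap(const int image_idx);
  virtual const DepthMap& GetDepthMap(const int image_idx);
  virtual const NormalMap& GetNormalMap(const int image_idx);
  // ...
};

// Cached variant, used where memory is constrained (e.g. patch-match).
class CachedWorkspace : public Workspace {
 public:
  const Bitmap& GetBitmap(const int image_idx) override;
  const DepthMap& GetDepthMap(const int image_idx) override;
  const NormalMap& GetNormalMap(const int image_idx) override;
  // ...
};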

confidence_maps_.resize(num_images);
normal_maps_.resize(num_images);

auto LoadWorkspaceData = [&, this](int image_idx) {
Contributor

const int image_idx (in COLMAP we use const even for basic types)

}
};

int num_threads = GetEffectiveNumThreads(options_.num_threads);
Contributor

const

std::cout << StringPrintf("Loading workspace data with %d threads...",
num_threads)
<< std::endl;
for (int i = 0; i < image_names.size(); ++i) {
Contributor

size_t i

src/mvs/fusion.h Outdated
int image_idx;
int row;
int col;
int traversal_depth;
Contributor

Can you restore the previous default values here? It's generally good practice to keep these to avoid uninitialized memory, I think.

Contributor Author

Sounds good; I only removed them because we now never create a default FusionData struct and only use it with the constructor through emplace_back.
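
A sketch of what restoring the defaults could look like while keeping the emplace_back-only construction (the particular default values here are illustrative, not taken from the original header):

struct FusionData {
  int image_idx = -1;  // illustrative defaults; they only guard against
  int row = -1;        // accidental default construction in the future
  int col = -1;
  int traversal_depth = -1;

  FusionData() = default;
  FusionData(const int image_idx, const int row, const int col,
             const int traversal_depth)
      : image_idx(image_idx),
        row(row),
        col(col),
        traversal_depth(traversal_depth) {}
};

// Existing call sites keep working, e.g.:
// fusion_queue.emplace_back(image_idx, row, col, traversal_depth + 1);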

const Eigen::Vector3f next_proj =
P_.at(next_image_idx) * xyz.homogeneous();
next_data.col = static_cast<int>(std::round(next_proj(0) / next_proj(2)));
next_data.row = static_cast<int>(std::round(next_proj(1) / next_proj(2)));
int next_col = static_cast<int>(std::round(next_proj(0) / next_proj(2)));
Contributor

const and also next_row

}
}

void StereoFusion::Fuse(int thread_id, int image_idx, int row, int col) {
Contributor

please make the parameters const

<< std::endl;
ThreadPool thread_pool(num_threads);

const int kRowStride = 10;
Contributor

It would be great to add a short explanation, why we use a stride here.
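
For example, a comment along these lines, summarizing the reasoning from the earlier striding discussion (the wording is only a suggestion):

// Process rows in chunks of kRowStride per task so that neighboring rows are
// mostly handled by the same thread. Rows handled by different threads are
// then further apart and are less likely to fuse the same point concurrently,
// which keeps duplicated fused points to a minimum.
const int kRowStride = 10;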

fused_points_visibility_.shrink_to_fit();
fused_points_.reserve(total_fused_points);
fused_points_visibility_.reserve(total_fused_points);
for (size_t task_id = 0; task_id < task_fused_points_.size(); ++task_id) {
Contributor

task_id -> thread_id for consistency

@ahojnnes
Contributor
ahojnnes commented Mar 7, 2021

@anmatako You will have to merge with master. Note that I modularized exe/colmap.cc into individual source files, since it grew way beyond what we should put into a single source file.

@anmatako
Contributor Author
anmatako commented Mar 7, 2021

> @anmatako You will have to merge with master. Note that I modularized exe/colmap.cc into individual source files, since it grew way beyond what we should put into a single source file.

Yes, now that we are getting closer with these changes I'll do a merge from master and also resolve some other issues with confidence maps and calculated normals that were introduced by the PatchMatchNet PR.

I'll do a good cleanup and make sure this builds standalone.

Antonios Matakos and others added 5 commits March 7, 2021 11:33
Merge from main repo
The default Workspace now works without a cache (similar to old
NoCacheWorkspace) and the cached version is now explicitly named as
CachedWorkspace.
const Bitmap& GetBitmap(const int image_idx);
const DepthMap& GetDepthMap(const int image_idx);
const NormalMap& GetNormalMap(const int image_idx);
virtual inline const Bitmap& GetBitmap(const int image_idx) {
Contributor

I suggest moving these method definitions to the .cc file. Since these are virtual methods, the compiler won't be able to inline them anyway.

Contributor

Same for all the inline methods below.

Contributor Author

OK sounds good

@ahojnnes
Contributor
ahojnnes commented Mar 9, 2021

@anmatako Thanks, the PR looks great now. I trust you checked that these changes lead to equivalent (± small differences) fusion results as before? I left two minor comments, then I am happy to merge. Thanks!

@anmatako
Contributor Author
anmatako commented Mar 9, 2021

> @anmatako Thanks, the PR looks great now. I trust you checked that these changes lead to equivalent (± small differences) fusion results as before? I left two minor comments, then I am happy to merge. Thanks!

Thanks! I've tested quite a lot in our pipelines and the differences are really minor compared to single-threaded fusion: less than 0.1% in the number of fused points, and visually nothing stood out as an obvious outlier.

@ahojnnes
Contributor
ahojnnes commented Mar 9, 2021

Thanks very much. This is a great improvement!

@ahojnnes ahojnnes enabled auto-merge (squash) March 9, 2021 22:39
@ahojnnes ahojnnes merged commit fd3e5a1 into colmap:dev Mar 9, 2021
@anmatako anmatako deleted the public/fusion branch March 10, 2021 22:25
lucasthahn pushed a commit to tne-ai/colmap that referenced this pull request Aug 17, 2022
…map#1148)

* Parallelize stereo fusion; needs pre-loading of entire workspace

* Make mask update atomic

* Fix bug in accessing depth map size

* Allow configurable num_threads in fusion and workspace

* Fix merging error

* Use ThreadPool instead of OpenMP for parallel fusion

* Update for PR comments

* Change Workspace naming conventions

The default Workspace now works without a cache (similar to old
NoCacheWorkspace) and the cached version is now explicitly named as
CachedWorkspace.

* Fix build errors

* Update fusion.h

* Update fusion.h

* Address PR comments

Co-authored-by: Antonios Matakos <anmatako@dss.microsoft.us>
Co-authored-by: Johannes Schönberger <joschonb@microsoft.com>