Change linking order to avoid using gloo in pytorch dynamic libraries #3750
Signed-off-by: cheng.chen c392410997@gmail.com
Description
${Pytorch_LIBRARIES} contains libtorch_cpu.so, which also exports gloo-related symbols. If ${Pytorch_LIBRARIES} comes before gloo/compatible_gloo in the link order, some gloo functions (those under the gloo::rendezvous namespace) are dynamically linked to libtorch_cpu.so instead of being statically linked to the gloo/compatible_gloo bundled with Horovod. This causes incompatibility errors when PyTorch and Horovod use different gloo versions.
In my case, PyTorch and Horovod use different versions of gloo, and Horovod fails with:
[1]<stderr>:*** Error in python3: malloc(): memory corruption: 0x000014578c00b6a0 ***
I traced the error to https://github.com/horovod/horovod/blob/master/horovod/common/gloo/gloo_context.cc#L85, where the call resolves to libtorch_cpu.so rather than to Horovod's own gloo, which leads to the failure at runtime. Placing ${Pytorch_LIBRARIES} after gloo/compatible_gloo ensures that mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so statically links to the third-party gloo/compatible_gloo in Horovod.
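The reordering described above can be sketched in CMake roughly as follows. This is a minimal illustration, not the actual diff: the target name example_mpi_lib and the source file example.cc are hypothetical, and it assumes that gloo (or compatible_gloo) is a static library target already defined elsewhere in the build and that ${Pytorch_LIBRARIES} has already been populated.

```cmake
# Hypothetical target and source names, used only for illustration;
# the real Horovod target and source list differ.
add_library(example_mpi_lib SHARED example.cc)

# Assumption: "gloo" is a static library target defined elsewhere
# (it could equally be compatible_gloo).
# Listing it BEFORE ${Pytorch_LIBRARIES} lets the linker satisfy gloo
# symbols from the static archive first, so they are not left to be
# resolved against libtorch_cpu.so at load time.
target_link_libraries(example_mpi_lib PRIVATE
    gloo
    ${Pytorch_LIBRARIES})
```

With the two entries swapped, the linker sees libtorch_cpu.so first and may bind the overlapping gloo symbols to it, which is the situation this change avoids.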
Review process to land