10000 Remove std::regex use for better cxx11 abi compatibility by trevor-m · Pull Request #3584 · horovod/horovod · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Remove std::regex use for better cxx11 abi compatibility #3584

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 24, 2022

Conversation

trevor-m
Copy link
Contributor
@trevor-m trevor-m commented Jun 22, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Since TensorFlow has moved to _GLIBCXX_USE_CXX11_ABI=1, we have ran into some issues with Horovod related to the use of std::regex when we build TF/Horovod on manylinux2014. HorovodBroadcastInplaceOp was crashing when std::regex was used.

To fix the problem, I replaced the regex with something simpler which is probably also faster and more lightweight.

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@trevor-m trevor-m force-pushed the tmorris-remove-regex branch from 8b642ca to bfd191c Compare June 22, 2022 00:32
Signed-off-by: Trevor Morris <tmorris@nvidia.com>
@trevor-m trevor-m force-pushed the tmorris-remove-regex branch from bfd191c to def3f29 Compare June 22, 2022 00:45
@maxhgerlach maxhgerlach self-requested a review June 22, 2022 15:59
@maxhgerlach
Copy link
Collaborator

Very interesting! The issue sounds similar to this Stack Overflow question: https://stackoverflow.com/questions/51382355/stdregex-and-dual-abi std::regex appears to be particularly sensitive to mixing different C++ ABIs.

So if in your case TensorFlow is built with _GLIBCXX_USE_CXX11_ABI=1, then horovod/tensorflow/mpi_ops.cc should also be compiled for the C++11 ABI, via

set(CMAKE_CXX_FLAGS "${Tensorflow_COMPILE_FLAGS} ${CMAKE_CXX_FLAGS}")

Would it help if the rest of Horovod (sources in horovod/common) was also compiled with _GLIBCXX_USE_CXX11_ABI=1 then? Or is the issue on manylinux2014 caused by other libraries that are linked with Horovod?

Your change looks great, @trevor-m. I'm just a little afraid that there might be more hidden corruption if the ABIs are mixed.

@trevor-m
Copy link
Contributor Author

Very interesting! The issue sounds similar to this Stack Overflow question: https://stackoverflow.com/questions/51382355/stdregex-and-dual-abi std::regex appears to be particularly sensitive to mixing different C++ ABIs.

So if in your case TensorFlow is built with _GLIBCXX_USE_CXX11_ABI=1, then horovod/tensorflow/mpi_ops.cc should also be compiled for the C++11 ABI, via

set(CMAKE_CXX_FLAGS "${Tensorflow_COMPILE_FLAGS} ${CMAKE_CXX_FLAGS}")

Would it help if the rest of H 8000 orovod (sources in horovod/common) was also compiled with _GLIBCXX_USE_CXX11_ABI=1 then? Or is the issue on manylinux2014 caused by other libraries that are linked with Horovod?

Your change looks great, @trevor-m. I'm just a little afraid that there might be more hidden corruption if the ABIs are mixed.

Ah, that could be it. I didn't realize only the TF lib would be built with a _GLIBCXX_USE_CXX11_ABI that matches TensorFlow while the other binaries might not be.

I definitely understand and agree with your concern about more hidden corruption. Perhaps we can look into it a little more and do a follow-up PR? I can check if changing the horovod _GLIBCXX_USE_CXX11_ABI makes a difference with the error I was seeing.

@maxhgerlach
Copy link
Collaborator

I definitely understand and agree with your concern about more hidden corruption. Perhaps we can look into it a little more and do a follow-up PR?

I'm definitely in favor of merging this current PR (once the CI passes). But it would be great to better understand the root cause.

I can check if changing the horovod _GLIBCXX_USE_CXX11_ABI makes a difference with the error I was seeing.

If you find the time, that would be fabulous!

@github-actions
Copy link

Unit Test Results

     893 files  +  30       893 suites  +30   9h 31m 51s ⏱️ + 3m 35s
     781 tests ±    0       737 ✔️ ±    0       44 💤 ±    0  0 ±0 
19 061 runs  +746  13 666 ✔️ +586  5 395 💤 +160  0 ±0 

Results for commit def3f29. ± Comparison against base commit 48e0aff.

@github-actions
Copy link

Unit Test Results (with flaky tests)

  1 074 files  +  27    1 074 suites  +27   10h 30m 1s ⏱️ + 23m 23s
     781 tests ±    0       736 ✔️  -     1       44 💤 ±  0  1 +1 
22 987 runs  +592  15 921 ✔️ +515  7 065 💤 +76  1 +1 

For more details on these failures, see this check.

Results for commit def3f29. 8000 ± Comparison against base commit 48e0aff.

@maxhgerlach
Copy link
Collaborator

Would it help if the rest of Horovod (sources in horovod/common) was also compiled with _GLIBCXX_USE_CXX11_ABI=1 then?

Actually, I was mistaken there: It is already the case that all Horovod sources are compiled with -D_GLIBCXX_USE_CXX11_ABI set to match the TensorFlow build when the object files are linked into the horovod/tensorflow/mpi_lib.cpython-*.so. So I don't think your crash is related to how that is set when compiling Horovod.

Alternatively, could the crash be related to some other library that your are linking to / importing in Python, where D_GLIBCXX_USE_CXX11_ABI is set differently?

In any case, making this conflict less likely to turn up by dropping std::regex makes sense to me.

@maxhgerlach maxhgerlach merged commit c7bc877 into horovod:master Jun 24, 2022
@trevor-m
Copy link
Contributor Author

Thanks @maxhgerlach!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0