Add Distributed Multigrid. by pratikvn · Pull Request #1269 · ginkgo-project/ginkgo · GitHub

Add Distributed Multigrid. #1269


Merged
merged 6 commits into develop on Apr 19, 2024

Conversation

pratikvn
Member
@pratikvn pratikvn commented Feb 6, 2023

This PR updates the multigrid class to handle distributed matrices, and hence allows using multigrid as a preconditioner or solver in a distributed setting.

Major changes

  1. Store row and column partition objects in the Matrix class to use within Multigrid.
  2. Template the memory allocation and multigrid core functions on VectorType and allow dynamic switching between the non-distributed and distributed vector types (a sketch of this dispatch follows below).
  3. Store matrix_data object in the distributed matrix class to be able to generate coarse matrices.

Of course, as there is no distributed coarse-grid generation method yet, we still cannot use distributed multigrid out of the box, but that will be added in a future PR.
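To make point 2 concrete, here is a minimal sketch (not the PR's actual implementation) of templating the work-vector allocation on VectorType and switching between the dense and the distributed vector at runtime. The helper names `create_workspace` and `is_distributed` are hypothetical, and MPI support (GINKGO_BUILD_MPI) is assumed.

```cpp
#include <memory>
#include <type_traits>

#include <ginkgo/ginkgo.hpp>

// Hypothetical helpers, for illustration only.
using dist_vec = gko::experimental::distributed::Vector<double>;
using dense_vec = gko::matrix::Dense<double>;
using dist_mtx = gko::experimental::distributed::Matrix<double, int, gko::int64>;

// Allocate a single-column work vector matching the given system matrix.
template <typename VectorType>
std::unique_ptr<VectorType> create_workspace(
    std::shared_ptr<const gko::Executor> exec, const gko::LinOp* mtx)
{
    if constexpr (std::is_same_v<VectorType, dist_vec>) {
        // Distributed path: global size from the matrix, local size from its
        // local block (this is what the local-size query is needed for).
        auto dist = gko::as<dist_mtx>(mtx);
        return dist_vec::create(
            exec, dist->get_communicator(),
            gko::dim<2>{dist->get_size()[0], 1},
            gko::dim<2>{dist->get_local_matrix()->get_size()[0], 1});
    } else {
        // Non-distributed path: a plain dense vector.
        return dense_vec::create(exec, gko::dim<2>{mtx->get_size()[0], 1});
    }
}

// Runtime switch between the two instantiations, based on the matrix type.
inline bool is_distributed(const gko::LinOp* mtx)
{
    return dynamic_cast<const gko::experimental::distributed::DistributedBase*>(
               mtx) != nullptr;
}
```

The solver core would then pick the distributed or dense instantiation based on `is_distributed(system_matrix)`.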

Points of discussion

  1. We probably need to store the partition objects in the distributed matrix class, but I am open to other alternatives.
  2. For convenience, we also probably need to store the matrix_data object (or device_matrix_data), but I am not very happy about this.

Issues

  1. The mixed precision version of distributed multigrid does not yet work and needs to be looked into.

@pratikvn pratikvn added mod:core This is related to the core module. type:multigrid This is related to multigrid 1:ST:need-feedback The PR is somewhat ready but feedback on a blocking topic is required before a proper review. type:distributed-functionality labels Feb 6, 2023
@pratikvn pratikvn self-assigned this Feb 6, 2023
@ginkgo-bot ginkgo-bot added the type:solver This is related to the solvers label Feb 6, 2023
@MarcelKoch MarcelKoch self-requested a review February 6, 2023 13:07
@upsj upsj requested review from upsj, greole and yhmtsai February 6, 2023 13:29
@greole
Collaborator
greole commented Feb 9, 2023

Would that allow using Multigrid as a preconditioner without Schwarz?

@pratikvn pratikvn force-pushed the dist-schwarz branch 4 times, most recently from 8274219 to fc978a0 on February 9, 2023 14:15
@MarcelKoch
Member

Right now, you use the partition only to get the local size of the matrix, which you can also get from the local matrix. The stored matrix data is not used at all. I would suggest removing these until they are actually necessary.

@pratikvn pratikvn force-pushed the dist-schwarz branch 2 times, most recently from cf7cbcf to de2adf6 on February 10, 2023 14:13
Base automatically changed from dist-schwarz to develop February 11, 2023 12:27
@pratikvn
Member Author

@greole, yes, but a coarse-grid generation algorithm (one that is distributed-capable) is necessary. Meaning we need the equivalent of AMGX, which generates the triplet (R, A_c, P).
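As a rough illustration of the intended usage (assuming the existing Multigrid factory interface and a distributed-capable PGM level such as the one later proposed in #1403; the parameter values are placeholders):

```cpp
#include <memory>

#include <ginkgo/ginkgo.hpp>

using dist_mtx = gko::experimental::distributed::Matrix<double, int, gko::int64>;

// Build a distributed multigrid solver; only meaningful once a
// distributed-capable coarsening method (e.g. a distributed PGM) exists.
std::unique_ptr<gko::solver::Multigrid> make_distributed_mg(
    std::shared_ptr<const gko::Executor> exec, std::shared_ptr<dist_mtx> A)
{
    auto factory =
        gko::solver::Multigrid::build()
            // The mg_level generates the (R, A_c, P) triplet on each level.
            .with_mg_level(gko::multigrid::Pgm<double, int>::build())
            .with_max_levels(9u)
            .with_min_coarse_rows(64u)
            .with_criteria(gko::stop::Iteration::build().with_max_iters(100u))
            .on(exec);
    return factory->generate(A);
}
```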

@pratikvn
Member Author

@MarcelKoch, yes, I don't intend to merge this yet. I just wanted to show what changes could be necessary. At present, we do not need the partition or the matrix_data.

Member
yhmtsai commented Feb 19, 2023

@pratikvn, could you rebase this? I think some of the changes are related to the Schwarz PR.

@pratikvn pratikvn force-pushed the dist-mg-update branch 2 times, most recently from 55ad491 to ce4d26a on February 20, 2023 07:57
Comment on lines 525 to 526
this->run_mg_cycle<VectorType>(cycle, level + 1, next_level_matrix, g.get(),
e.get(), next_mode);
Member

It cannot be VectorType: different levels may have different precisions.

#endif


template <typename VectorType>
Member

Same issue here.

@MarcelKoch MarcelKoch removed their request for review October 25, 2023 09:07
@yhmtsai yhmtsai force-pushed the dist-mg-update branch 2 times, most recently from 3f09518 to a3f7ba4 on January 18, 2024 08:50
@MarcelKoch MarcelKoch requested a review from greole April 5, 2024 09:24
@MarcelKoch MarcelKoch added this to the Ginkgo 1.8.0 milestone Apr 5, 2024
@yhmtsai yhmtsai assigned yhmtsai and unassigned pratikvn Apr 8, 2024
@yhmtsai yhmtsai added the 1:ST:ready-for-review This PR is ready for review label Apr 8, 2024
Member Author
@pratikvn pratikvn left a comment

Some comments.


auto mg = mg_factory->generate(mtx);

ASSERT_NO_THROW(mg->apply(b, x));
Member Author

Maybe a test that shows that the result is the same as in the non-distributed case would also make sense?

Member

I think the distributed coarsening method is usually different from the non-distributed one?

Member Author

As you are testing the solver here, I think checking if the two solutions are within some tolerance makes sense.

Member

Ah, in this situation I cannot do that, because we do not have an actual distributed coarsening method yet.
I think the current test should be enough, because the distributed multigrid itself only handles the vector allocation. These tests ensure the allocation is correct and the corresponding apply works correctly.

Member Author

Ah, yes. Maybe make a note/TODO here for now; it should be okay to add it later.
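For reference, a sketch of what such a tolerance comparison could look like once a distributed coarsening method exists. The fixture members (dist_mg_factory, non_dist_mg_factory, dist_mtx, non_dist_mtx, dist_b, dist_x, b, x) and the local_rows span are hypothetical, modeled on Ginkgo's existing distributed solver tests; this is not part of this PR.

```cpp
// Sketch of a possible future test: compare the distributed solution
// against the non-distributed (replicated) one.
TEST_F(Multigrid, DistributedSolutionMatchesNonDistributed)
{
    auto dist_mg = dist_mg_factory->generate(dist_mtx);
    auto mg = non_dist_mg_factory->generate(non_dist_mtx);

    dist_mg->apply(dist_b, dist_x);
    mg->apply(b, x);

    // Compare the locally owned rows of the distributed solution against the
    // corresponding rows of the replicated solution, up to a
    // precision-dependent tolerance.
    GKO_ASSERT_MTX_NEAR(dist_x->get_local_vector(),
                        x->create_submatrix(local_rows, gko::span{0, 1}),
                        r<value_type>::value);
}
```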

* DistributedLocalSize is a feature class providing `get_local_size`, which
* returns the size of the local operator.
*/
class DistributedLocalSize {
Member Author

Why a separate interface class? Can this not be in DistributedBase?

Collaborator

Additionally, the question arises of what to do with the non_local_size.

Member

I was also thinking about putting it into DistributedBase, but I wonder whether that is out of scope, because DistributedBase only handles the communicator right now.
non_local_size cannot go into this base class because the vector has no notion of a non_local_size.
Alternatively, we could define the non_local_size of a vector to be dim<2>(0, 0).
I do not see a use for non_local_size outside of the matrix.

Collaborator

I don't mean putting non_local_size into the DistributedLocalSize interface class, but if we have a DistributedLocalSize, we should probably also implement a counterpart that handles non_local_size.

Member

No, I only mean the functionality, not putting it into DistributedLocalSize.
I agree that users may expect something similar to get_local_size() for the non_local_matrix().
For me, if we need get_non_local_size(), we would have to put get_local_size() into DistributedLocalSize and get_non_local_size() into DistributedNonLocalSize, because the vector does not have non-local information, so we cannot put them together into DistributedBase. Having one in DistributedBase and the other in DistributedNonLocalSize would also be inconsistent in my mind (unless we set non_local_size = 0 in the vector case).

Member Author

I think it should be okay to have only local_size in DistributedBase. A separate interface class is too clunky IMO. But given that this will probably be reworked anyway, I don't have a strong opinion, and we can discuss it later.

Member

I have moved it to DistributedBase now.
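For illustration, roughly the shape this ends up with (an assumed sketch, not the PR's exact declarations): the local-size query sits next to the communicator that DistributedBase already carries, and the distributed Matrix and Vector provide it.

```cpp
#include <ginkgo/ginkgo.hpp>

// Assumed sketch of the interface after the move; the real DistributedBase
// may differ in details such as constness and how the override is provided.
class DistributedBaseSketch {
public:
    explicit DistributedBaseSketch(gko::experimental::mpi::communicator comm)
        : comm_{comm}
    {}

    // Already present: access to the communicator.
    gko::experimental::mpi::communicator get_communicator() const
    {
        return comm_;
    }

    // Newly added: size of the locally owned block, implemented by the
    // distributed Matrix and Vector so Multigrid can size its work vectors.
    virtual gko::dim<2> get_local_size() const = 0;

    virtual ~DistributedBaseSketch() = default;

private:
    gko::experimental::mpi::communicator comm_;
};
```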

Collaborator
@greole greole left a comment

Some more comments from my side. Looks good to me in general. Maybe the tests can be simplified to avoid the implementation of the Dummy classes.


@yhmtsai
Member
yhmtsai commented Apr 14, 2024

@greole, yes, the coarsening may affect the non_local_matrix, but the coarsening method will take care of that, not the multigrid itself.
In distributed multigrid, the distributed matrix for each level has already been prepared, so we only need to handle the vectors according to each level's distributed matrix (that is where the local size is used, for creating the vectors).

@pratikvn pratikvn removed the 1:ST:need-feedback The PR is somewhat ready but feedback on a blocking topic is required before a proper review. label Apr 16, 2024
Member Author
@pratikvn pratikvn left a comment

I think from my side, this looks good to be merged.

yhmtsai and others added 2 commits April 18, 2024 11:54
Co-authored-by: Pratik Nayak <pratikvn@protonmail.com>
Member
@yhmtsai yhmtsai left a comment

Due to a GitHub PR limitation, I am approving on behalf of @pratikvn.

Collaborator
@greole greole left a comment

LGTM! Just a final question: in order to use it, we also need to merge a distributed coarsening method, like PGM?

@pratikvn
Member Author

@greole, yes, #1403 would need to be merged as well. @yhmtsai, please feel free to merge this when you are ready. If you need me to merge it, let me know.

@yhmtsai yhmtsai added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Apr 18, 2024
@yhmtsai yhmtsai force-pushed the dist-mg-update branch 2 times, most recently from e4387bc to 95ead72 on April 19, 2024 09:19
yhmtsai and others added 4 commits April 19, 2024 13:29
Co-authored-by: Gregor Olenik <gregor.olenik@web.de>
Co-authored-by: Gregor Olenik <gregor.olenik@web.de>
Co-authored-by: Pratik Nayak <pratikvn@protonmail.com>
@yhmtsai yhmtsai merged commit c60c77e into develop Apr 19, 2024
12 of 15 checks passed
@yhmtsai yhmtsai deleted the dist-mg-update branch April 19, 2024 18:25
Labels
1:ST:ready-to-merge This PR is ready to merge. mod:core This is related to the core module. type:distributed-functionality type:multigrid This is related to multigrid type:solver This is related to the solvers
5 participants