Update main with development by lkurija1 · Pull Request #359 · cisco-open/flame · GitHub

Update main with development #359


Closed
wants to merge 1 commit into from

Conversation

@lkurija1 lkurija1 (Contributor) commented Mar 3, 2023

TensorFlow compatibility was added for the new optimizers: fedavg, fedadam, fedadagrad, and fedyogi.

A shell script for testing all 8 possible combinations of optimizers and frameworks is included. The medmnist example can now be run with Keras (the folder structure was refactored to include a trainer and aggregator for Keras).

A typo in fedavg.py has also been fixed.

Hierarchical FL didn't work with gRPC as the backend because the groupby field was not considered in the metaserver service and the p2p backend.

In addition, a middle aggregator hangs even after a job is completed. The deadlock occurs because the p2p backend cleanup code is called as part of a channel's cleanup; however, in a middle aggregator the p2p backend is responsible for tasks across all channels, so its cleanup couldn't finish while a broadcast task in the other channel was still pending. The bug is fixed here by moving the p2p backend cleanup code outside of the channel cleanup code.
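The shape of that fix, as a minimal hedged sketch (class and attribute names are hypothetical, not the actual Flame source):

```python
# Hypothetical sketch of the fix: the shared p2p backend is cleaned up once,
# after all channels are done, instead of inside each channel's cleanup.
class MiddleAggregator:
    def cleanup(self):
        # per-channel cleanup no longer tears down the shared p2p backend
        for channel in self.channels:
            channel.cleanup()

        # backend cleanup runs last, so a pending broadcast task in another
        # channel can no longer deadlock it
        self.p2p_backend.cleanup()
```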

Documentation for using the metaserver lets users run examples with a local broker; it also covers local MQTT brokers. Using a local broker decreases the chance of job ID collisions.

The config.json for the mnist example was modified to make it easier to switch to a local broker, and the README now explains how to do this for the other examples.

Asynchronous FL is implemented for two-tier topology and three-tier hierarchical topology.

The main algorithm is based on the following two papers:
- https://arxiv.org/pdf/2111.04877.pdf
- https://arxiv.org/pdf/2106.06639.pdf

Two examples of asynchronous FL are also added: one for a two-tier topology and one for a three-tier hierarchical topology.

This implementation includes the core algorithm but doesn't include the SecAgg algorithm (presented in the papers), which is outside the scope of this change.
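As a rough illustration only (not Flame's implementation), buffered asynchronous aggregation in the spirit of those papers looks roughly like the sketch below; all names are made up for the example:

```python
# Generic sketch of buffered asynchronous aggregation: the aggregator applies
# updates as soon as a small buffer fills, rather than waiting for all trainers.
def async_round(server, agg_goal: int):
    buffer = []
    while len(buffer) < agg_goal:
        buffer.append(server.receive_update())  # hypothetical API: next arriving update
    server.apply_aggregate(buffer)               # hypothetical API: update global weights
    server.version += 1                          # advance the global model version
```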

For asyncfl, a client (trainer) should send a delta, obtained by subtracting the local weights from the original global weights after training. In the previous implementation, the whole set of local weights was sent to the server (aggregator), which causes loss divergence.
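A minimal sketch of that delta computation, assuming PyTorch-style state_dict weights (the function names are illustrative, not Flame's API):

```python
import copy

def compute_delta(global_weights: dict, local_weights: dict) -> dict:
    """delta = original global weights - locally trained weights."""
    return {k: global_weights[k] - local_weights[k] for k in global_weights}

# trainer side, roughly:
#   saved_global = copy.deepcopy(weights)   # weights received from the aggregator
#   ... local training updates `weights` ...
#   send(compute_delta(saved_global, weights))
```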

Supporting the delta update requires refactoring the synchronous FL aggregators (horizontal/{top_aggregator.py, middle_aggregator.py}) as well as the optimizers' do() function.

The changes here support delta update universally across all types of modes (horizontal synchronous, asynchronous, and hybrid).

Model architectures can have integer tensors. Applying aggregation to those tensors results in a type mismatch and throws a runtime error: "RuntimeError: result type Float can't be cast to the desired output type Long".

Integer tensors don't matter in backpropagation, so as a workaround we typecast back to the original dtype whenever it differs from the dtype of the weighted tensors used for aggregation. This way, the model architecture can be kept as is.
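A hedged sketch of that workaround, assuming PyTorch tensors (this is not the actual Flame aggregation code):

```python
import torch

def aggregate_param(weighted_tensors: list, original: torch.Tensor) -> torch.Tensor:
    result = torch.stack(weighted_tensors).sum(dim=0)  # float accumulation
    if result.dtype != original.dtype:
        # integer tensors (e.g., BatchNorm's num_batches_tracked) are not
        # trained, so casting back keeps the architecture intact without
        # affecting backpropagation
        result = result.to(original.dtype)
    return result
```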

To enable library-only execution for the hybrid example, its configuration files are updated accordingly. The revised configuration contains both a local MQTT broker and a p2p broker, with the p2p broker selected.

Since the Flame SDK supports asynchronous FL, we add an example of asynchronous hierarchical FL for the control plane.

The examples folder in the top-level directory contained some outdated and irrelevant files, which are now removed.

Due to grpc/grpc#25364, when two p2p backends (which rely on grpc and asyncio) are defined, the hybrid mode example throws an exception: 'BlockingIOError: [Errno 35] Resource temporarily unavailable'. The issue still appears unresolved upstream. As a temporary workaround, we use two different types of backends: MQTT for one and p2p for the other. This means that when this example is executed, both the metaserver and an MQTT broker (e.g., mosquitto) must be running on the local machine.

Distributed mode has a bug: deepcopy(self.weights) is called in _update_weights() before 'weights' is defined as a member variable. To address this issue, self.weights is initialized in __init__().
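An illustrative sketch of that fix (hypothetical class name, not the actual Flame trainer):

```python
from copy import deepcopy

class DistributedTrainer:
    def __init__(self):
        self.weights = None  # defined up front so deepcopy() never hits an undefined attribute

    def _update_weights(self, new_weights):
        if self.weights is not None:
            prev = deepcopy(self.weights)  # previously raised AttributeError before training began
        self.weights = new_weights
```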

Also, to run a distributed example locally, configuration files are revised.

This example is similar to those in the FedProx paper, although it currently does not simulate stragglers and uses a different dataset/architecture.

A few things were changed to make modifying trainers straightforward: a function was added in util.py, and the trainer gained another class variable containing information about the client-side regularizer.

Additionally, tests are automated (mu = 1, 0.1, 0.01, 0.001, 0), so running the example generates or modifies the existing files needed to provide the proper configuration for an experiment. A sketch of the regularizer is shown below.
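For reference, a hedged sketch of a FedProx-style proximal term, mu/2 * ||w - w_global||^2; this only illustrates the client-side regularizer the example configures, not the exact code added to util.py:

```python
import torch

def fedprox_term(model: torch.nn.Module, global_weights: dict, mu: float) -> torch.Tensor:
    prox = torch.zeros(())
    for name, param in model.named_parameters():
        prox = prox + torch.sum((param - global_weights[name]) ** 2)
    return 0.5 * mu * prox

# per-batch loss, roughly: loss = task_loss + fedprox_term(model, saved_global, mu)
```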


The deployer's job template file is hard-coded, which makes it hard to use a different template file at deployment time. Using a different template file is useful when the underlying infrastructure differs (e.g., k8s vs knative). To support that, the template folder and file are now provided as config variables.

Also, the deployer's config info was fed in as command-line arguments, which is cumbersome, so the config parsing is refactored to read the info from a configuration file.

During testing of the deployer change, a bug in the library was identified; the fix is included here too.

Finally, the local DNS configuration in flame.sh is updated so that it works correctly across different Linux distributions; flame.sh is tested on Arch Linux and Ubuntu.



Description

Please provide a meaningful description of what this change will do, or is for. Bonus points for including links to
related issues, other PRs, or technical references.

Note that by not including a description, you are asking reviewers to do extra work to understand the context of this
change, which may lead to your PR taking much longer to review, or result in it not being reviewed at all.

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

* Add group associations to roles (#319)

* Create diagnose script (#325)

Co-authored-by: Alex Ungurean <aungurea@cisco.com>

* Sync up the generated code from openapi generator with what we have currently (#331)

* applied formatting (#341)

* pre-commit setup and dev reqs (#342)

* Refactor config handling with pydantic (#332)

* Make sdk config backwards compatible. (#355)

* Fix merge conflicts between development and main branch (#353)

* optimizer compatibility with tensorflow and example for medmnist keras/pytorch (#320)

Tensorflow compatibility for new optimizers was added, which included fedavg, fedadam, fedadagrad, and fedyogi.

A shell script for testing all 8 possible combinations of optimizers and frameworks is included.
This allows the medmnist example to be run with keras (the folder structure was refactored to include a trainer and aggregator for keras).

The typo in fedavg.py has now been fixed.

* feat+fix: grpc support for hierarchical fl (#321)

Hierarchical fl didn't work with grpc as backend. This is because
groupby field was not considered in metaserver service and p2p
backend.

In addition, a middle aggregator hangs even after a job is
completed. This deadlock occurs because p2p backend cleanup code is
called as a part of a channel cleanup. However, in a middle
aggregator, p2p backend is responsible for tasks across all
channels. The p2p cleanup code couldn't finish cleanup because
a broadcast task in the other channel can't finish. This bug is
fixed here by moving the p2p backend cleanup code outside of channel
cleanup code.

* documentation for metaserver/mqtt local (#322)

Documentation for using metaserver will allow users to run examples with a local broker.
It also allows for mqtt local brokers.
This decreases the chances of any job ID collisions.

Modifications to the config.json for the mnist example were made in order to make it easier to switch to a local broker.
The readme does indicate how to do this for other examples now.

Co-authored-by: vboxuser <vboxuser@Ubuntu.myguest.virtualbox.org>

* feat: asynchronous fl (#323)

Asynchronous FL is implemented for two-tier topology and three-tier
hierarchical topology.

The main algorithm is based on the following two papers:
- https://arxiv.org/pdf/2111.04877.pdf
- https://arxiv.org/pdf/2106.06639.pdf

Two examples for asynchronous fl are also added. One is for a two-tier
topology and the other for a three-tier hierarchical topology.

This implementation includes the core algorithm but doesn't include the
SecAgg algorithm (presented in the papers), which is outside the scope of
this change.

* fix+refactor: asyncfl loss divergence (#330)

For asyncfl, a client (trainer) should send delta by subtracting local
weights from original global weights after training. In the current
implementation, the whole local weights were sent to a
server (aggregator). This causes loss divergence.

Supporting delta update requires refactoring of aggregators of
synchronous fl (horizontal/{top_aggregator.py, middle_aggregator.py})
as well as optimizers' do() function.

The changes here support delta update universally across all types of
modes (horizontal synchronous, asynchronous, and hybrid).

* fix: conflict between integer tensor and float tensor (#335)

Model architectures can have integer tensors. Applying aggregation on
those tensors results in a type mismatch and throws a runtime error:
"RuntimeError: result type Float can't be cast to the desired output
type Long"

Integer tensors don't matter in back propagation. So, as a workaround
to the issue, we typecast to the original dtype when the original type
is different from the dtype of weighted tensors for aggregation. In
this way, we can keep the model architecture as is.

* refactor: config for hybrid example in library (#334)

To enable library-only execution for hybrid example, its configuration
files are updated accordingly. The revised configuration has local
mqtt and p2p broker config and p2p broker is selected.

* misc: asynchronous hierarchical fl example (#340)

Since the Flame SDK supports asynchronous FL, we add an example of an
asynchronous hierarchical FL for control plane.

* chore: clean up examples folder (#336)

The examples folder at the top level directory has some outdated and
irrelevant files. Those are now removed from the folder.

* fix: workaround for hybrid mode with two p2p backends (#345)

Due to grpc/grpc#25364, when two p2p
backends (which rely on grpc and asyncio) are defined, the hybrid mode
example throws an exception: 'BlockingIOError: [Errno 35] Resource
temporarily unavailable'. The issue still appears unresolved. As a
temporary workaround, we use two different types of backends: mqtt for
one and p2p for the other. This means that when this example is
executed, both metaserver and a mqtt broker (e.g., mosquitto) must be
running in the local machine.

* fix: distributed mode (#344)

Distributed mode has a bug: deepcopy(self.weights) is called in
_update_weights() before 'weights' is defined as a member variable.
To address this issue, self.weights is initialized in __init__().

Also, to run a distributed example locally, configuration files are
revised.

* example/implementation for fedprox (#339)

This example is similar to the ones seen in the fedprox paper, although it currently does not simulate stragglers and uses a different dataset/architecture.

A few things were changed in order for there to be a simple process for modifying trainers.
This includes a function in util.py and another class variable in the trainer containing information on the client side regularizer.

Additionally, tests are automated (mu=1,0.1,0.01,0.001,0) so running the example generates or modifies existing files in order to provide the proper configuration for an experiment.

* Create diagnose script (#348)

* Create diagnose script

* Make the script executable

---------

Co-authored-by: Alex Ungurean <aungurea@cisco.com>

* refactor+fix: configurable deployer / lib regularizer fix (#351)

deployer's job template file is hard-coded, which makes it hard to use
different template file at deployment time. Using a different
template file is useful when underlying infrastructure is
different (e.g., k8s vs knative). To support that, template folder and
file is fed as config variables.

Also, deployer's config info is fed as command argument, which is
cumbersome. So, the config parsing part is refactored such that the
info is fed as a configuration file.

During the testing of deployer change, a bug in the library
is identified. The fix for it is added here too.

Finally, the local dns configuration in flame.sh is updated so that it
can be done correctly across different linux distributions (e.g.,
archlinux and ubuntu). The tests for flame.sh are under archlinux and
ubuntu.

* Add missing merge fix

* Make sdk config backwards compatible. (#355)

---------

Co-authored-by: GustavBaumgart <98069699+GustavBaumgart@users.noreply.github.com>
Co-authored-by: Myungjin Lee <myungjin@users.noreply.github.com>
Co-authored-by: vboxuser <vboxuser@Ubuntu.myguest.virtualbox.org>
Co-authored-by: alexandruuBytex <56033021+alexandruuBytex@users.noreply.github.com>
Co-authored-by: Alex Ungurean <aungurea@cisco.com>
Co-authored-by: elqurio <119978637+elqurio@users.noreply.github.com>

---------

Co-authored-by: openwithcode <123649857+openwithcode@users.noreply.github.com>
Co-authored-by: alexandruuBytex <56033021+alexandruuBytex@users.noreply.github.com>
Co-authored-by: Alex Ungurean <aungurea@cisco.com>
Co-authored-by: GustavBaumgart <98069699+GustavBaumgart@users.noreply.github.com>
Co-authored-by: Myungjin Lee <myungjin@users.noreply.github.com>
Co-authored-by: vboxuser <vboxuser@Ubuntu.myguest.virtualbox.org>
@codecov-commenter

Codecov Report

Merging #359 (81ed30e) into main (6532a3b) will decrease coverage by 0.66%.
The diff coverage is 29.31%.


@@            Coverage Diff             @@
##             main     #359      +/-   ##
==========================================
- Coverage   15.04%   14.39%   -0.66%     
==========================================
  Files          48       48              
  Lines        2824     2771      -53     
==========================================
- Hits          425      399      -26     
+ Misses       2381     2354      -27     
  Partials       18       18              
Impacted Files                      Coverage Δ
cmd/deployer/cmd/root.go            17.30% <13.04%> (-16.03%) ⬇️
cmd/flamectl/cmd/get_design.go      12.90% <50.00%> (ø)
cmd/controller/app/job/builder.go   62.13% <100.00%> (ø)


@openwithcode openwithcode changed the base branch from main to development March 3, 2023 15:13
@openwithcode openwithcode changed the base branch from development to main March 3, 2023 15:14
@lkurija1 lkurija1 changed the base branch from main to development March 3, 2023 15:15