Add Modern Optimizers in Levanter by WhenWen · Pull Request #955 · stanford-crfm/levanter

Open · wants to merge 5 commits into main

Conversation

WhenWen (Contributor) commented Apr 28, 2025

Implemented a list of modern optimizers.

  1. ADOPT

    • Implemented in adopt.py
    • Reference: Taniguchi et al. (2024). ADOPT: Modified Adam Can Converge with Any β₂ with the Optimal Rate. arXiv:2411.02853
  2. Muon

  3. SCION

    • Implemented in scion.py
    • Reference: Pethick et al. Training Deep Learning Models with Norm-Constrained LMOs. arXiv:2502.07529
  4. MARS

    • Implemented in mars.py
    • Reference: Yuan, H., Liu, Y., Wu, S., Zhou, X., & Gu, Q. (2024). MARS: Unleashing the Power of Variance Reduction for Training Large Models. arXiv:2411.10438
  5. Cautious

    • Implemented in cautious.py
    • Reference: Liang, K., Chen, L., Liu, B., & Liu, Q. (2024). Cautious Optimizers: Improving Training with One Line of Code. arXiv:2411.16085 (a minimal sketch of this masking idea follows the list)
  6. Kron (Variant of PSGD)

  7. RMSProp with Momentum

    • Implemented in rmsprop.py
    • Reference: Tieleman, T., & Hinton, G. (2012). RMSProp: Divide the gradient by a running average of its recent magnitude
  8. SOAP

    • Implemented in soap.py
    • Reference: Vyas, N., Morwani, D., Zhao, R., Kwun, M., Shapira, I., Brandfonbrener, D., … & Kakade, S. (2024). SOAP: Improving and Stabilizing Shampoo using Adam. arXiv:2409.11321
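
For context on item 5, the Cautious modification is essentially a one-line mask applied on top of any base optimizer: update components whose sign disagrees with the current gradient are zeroed out, and the remainder is rescaled. The following is a minimal optax-style sketch of that idea, not the PR's cautious.py; the function names here are illustrative.

import jax
import jax.numpy as jnp
import optax


def cautious_mask(updates, grads):
    """Zero update components whose sign disagrees with the gradient, then
    rescale so the average update magnitude stays roughly unchanged."""
    def _mask(u, g):
        m = (u * g > 0).astype(u.dtype)             # 1 where update and gradient agree in sign
        scale = m.size / jnp.maximum(m.sum(), 1.0)  # compensate for the zeroed entries
        return u * m * scale
    return jax.tree.map(_mask, updates, grads)


def cautious(base: optax.GradientTransformation) -> optax.GradientTransformation:
    """Wrap any base transform (e.g. optax.scale_by_adam()) with the cautious mask."""
    def update_fn(grads, state, params=None):
        updates, state = base.update(grads, state, params)
        return cautious_mask(updates, grads), state
    return optax.GradientTransformation(base.init, update_fn)


# Illustrative usage: cautious Adam with a fixed learning rate.
optimizer = optax.chain(cautious(optax.scale_by_adam()), optax.scale(-3e-4))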

WhenWen requested a review from dlwh April 28, 2025 17:00
dlwh (Member) commented Apr 29, 2025

@WhenWen can you fix

FAILED tests/test_train_lm.py::test_train_lm - AttributeError: 'AdamConfig' object has no attribute 'nesterov'
FAILED tests/test_train_lm.py::test_train_lm_fp8 - AttributeError: 'AdamConfig' object has no attribute 'nesterov'

WhenWen (Contributor, Author) commented Apr 30, 2025

> @WhenWen can you fix
>
> FAILED tests/test_train_lm.py::test_train_lm - AttributeError: 'AdamConfig' object has no attribute 'nesterov'
> FAILED tests/test_train_lm.py::test_train_lm_fp8 - AttributeError: 'AdamConfig' object has no attribute 'nesterov'

Sorry about this; I forgot to merge the config file for Nesterov AdamW. It is done now.
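
For readers, here is a rough sketch of what the new `nesterov` switch maps to, assuming the config ultimately builds an optax optimizer (recent optax releases expose a `nesterov` keyword on `optax.adamw`). `build_adamw` is an illustrative helper, not Levanter's actual config code; only the `nesterov` field name is taken from the test failure above.

import optax


def build_adamw(learning_rate: float, weight_decay: float, nesterov: bool = False):
    """AdamW, optionally with the Nesterov-style momentum correction (NAdamW)."""
    return optax.adamw(
        learning_rate=learning_rate,
        b1=0.9,
        b2=0.95,
        weight_decay=weight_decay,
        nesterov=nesterov,  # requires an optax version that supports this keyword
    )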

import chex

class ScaleByCautiousState(NamedTuple):
"""State for the Mars algorithm."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""State for the Mars algorithm."""
"""State for the Cautious algorithm."""

@dataclass
class ScionConfig(OptimizerConfig):
"""
Scion optimizer configuration: Momentum Orthogonalized by Newton-Schulz.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this right?
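
Side note for this thread: "Momentum Orthogonalized by Newton-Schulz" is the usual expansion of Muon, while the Scion paper frames its update as a norm-constrained LMO; for matrix parameters both rely on the same Newton-Schulz orthogonalization primitive. Below is a rough JAX sketch of that primitive with the commonly used quintic coefficients; it is not the code under review.

import jax.numpy as jnp


def newton_schulz_orthogonalize(G: jnp.ndarray, steps: int = 5) -> jnp.ndarray:
    """Approximately map the singular values of G toward 1, i.e. replace G by a
    near-orthogonal matrix with the same row/column space (quintic Newton-Schulz)."""
    a, b, c = 3.4445, -4.7750, 2.0315          # coefficients popularized by Muon implementations
    X = G / (jnp.linalg.norm(G) + 1e-7)        # Frobenius norm bounds the spectral norm, so this is safe
    transposed = X.shape[0] > X.shape[1]
    if transposed:                             # iterate on the short side for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X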


def partition(self, tensor):
"""Partition tensor into blocks."""
print('difference')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rm?

def partition(self, tensor):
"""Partition tensor into blocks."""
print('difference')
print(tensor.shape, self._shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rm


def create_mask(self, params):
"""
Creates a mask that labels parameters as 'mini' or 'adamw' based on their
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc comment isn't quite accurate
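
For orientation, here is a generic sketch of the pattern this docstring describes: a labeling mask that routes parameters to different transforms via optax.multi_transform. The 'mini'/'adamw' labels come from the snippet above; the 2-D selection rule and the per-branch optimizers are assumptions for illustration, not the PR's actual logic.

import jax
import optax


def create_label_mask(params):
    """Assign each parameter a label: matrix-shaped parameters go to the 'mini'
    branch, everything else to 'adamw' (the ndim == 2 rule is illustrative)."""
    return jax.tree.map(lambda p: "mini" if p.ndim == 2 else "adamw", params)


# Illustrative wiring: a different transform per label.
optimizer = optax.multi_transform(
    {"mini": optax.sgd(0.02, momentum=0.95), "adamw": optax.adamw(3e-4)},
    create_label_mask,
)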



class ScaleByCautiousState(NamedTuple):
"""State for the Mars algorithm."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""State for the Mars algorithm."""
"""State for the Cautious Adam algorithm."""

**kwargs,
) -> base.GradientTransformation:
"""
Implements PSGD Kron from https://github.com/lixilinx/psgd_torch.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe put this link in the config class too
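
A sketch of what that suggestion might look like, assuming the PR defines a KronConfig analogous to the ScionConfig shown earlier; the class name, import path, and field below are illustrative, not the PR's actual code.

from dataclasses import dataclass

from levanter.optim.config import OptimizerConfig  # import path assumed


@dataclass
class KronConfig(OptimizerConfig):
    """Configuration for PSGD Kron.

    Reference implementation: https://github.com/lixilinx/psgd_torch
    """

    # Illustrative field; the PR's actual fields may differ.
    preconditioner_update_probability: float = 0.1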

Comment on lines +268 to +269

def init_fn(params, return_partition_specs_only=False):
dlwh (Member): i'm not checking this logic carefully

params_sharding_ = jax.tree.map(lambda x: x.spec, params_sharding_)
updates, updates_struct = jax.tree.flatten(updates)
scanned_layers_ = jax.tree.leaves(scanned_layers_)
print(f"kron scanned_layers_: {scanned_layers_}")
dlwh (Member): rm

scanned_layers_ = jax.tree.leaves(scanned_layers_)
print(f"kron scanned_layers_: {scanned_layers_}")
params_sharding_ = jax.tree.leaves(params_sharding_)
print(f"kron params_sharding_: {params_sharding_}")
dlwh (Member): rm
