Description
Heya,
Thanks for your continued work in building better DEQs.
The main selling point of DEQs is that the solver can take as many steps as required to converge without increasing memory usage. This doesn't seem to hold for your implementation of `broyden`, which starts off with:
```python
Us = torch.zeros(bsz, total_hsize, seq_len, max_iters).to(dev)
VTs = torch.zeros(bsz, max_iters, total_hsize, seq_len).to(dev)
```
and therefore has a memory cost that grows linearly with `max_iters`, even though the ops aren't tracked by autograd. Anderson acceleration likewise keeps the previous `m` states in memory, where `m` is usually larger than the number of solver iterations actually needed anyway. Don't these solvers contradict the claim of constant memory cost?
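For concreteness, here's a rough back-of-the-envelope sketch of what those two buffers alone cost. The sizes below are made-up placeholders I picked for illustration (not values from your code); only the allocation pattern matches the snippet above:

```python
import torch

# Hypothetical sizes, chosen only to illustrate the scaling.
bsz, total_hsize, seq_len, max_iters = 16, 512, 256, 30

# Same allocation pattern as in the broyden solver quoted above (CPU here for simplicity).
Us = torch.zeros(bsz, total_hsize, seq_len, max_iters)
VTs = torch.zeros(bsz, max_iters, total_hsize, seq_len)

bytes_per_elem = Us.element_size()  # 4 bytes for float32
total_bytes = (Us.numel() + VTs.numel()) * bytes_per_elem
print(f"Us + VTs: {total_bytes / 2**30:.2f} GiB")  # scales linearly with max_iters
```

With these placeholder sizes that's already on the order of half a GiB, and doubling `max_iters` doubles it.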
On a related note, I've found it quite hard to modify these solvers even after going over the theory. Are there any notes or resources you could point to that would help people understand your implementation? Thanks!