Struggling with >1 billion term expression trees. How to make CSE tractable with optimizations? #28095

csp256 · 2025-05-27T19:05:43Z

I have some modestly complicated robotics-style code: a couple hundred lines of code operating on scalars and 3-vectors. I want to take the Jacobian of this (3x20; 3 outputs, 20 inputs), run cse(optimizations='basic') on it, and ultimately emit the code into both Python and C++. This works on smaller problems, but cse() is intractably slow for my entire task.

When I walk the expression tree of my function (not its Jacobian), I get 5 million operations, not including leaf nodes like Symbol and Integer. Counting the actual operations I see 1m (1 million) adds, 2.3m muls, 0.2m pows, and 1.7m sin/cos. For example:

	# <class 'sympy.core.add.Add'>                            :: 951569 subtotal
	#         9 args      2 times,        16 adds
	#         3 args 156911 times,    313822 adds
	#         2 args 576316 times,    576316 adds
	#        13 args   2572 times,     30864 adds
	#        12 args   2572 times,     28292 adds
	#         4 args    304 times,       912 adds
	#         6 args    256 times,      1280 adds
	#         7 args      3 times,        18 adds
	#         8 args      7 times,        49 adds
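
A tally of this shape can be gathered by walking the tree and bucketing every non-leaf node by class and argument count. What follows is a minimal sketch, not the script that produced the numbers above: `op_histogram` and `binary_ops` are hypothetical helper names, the toy expression is made up, and an n-argument Add or Mul is assumed to cost n - 1 binary operations.

```python
from collections import Counter

from sympy import Add, Mul, cos, preorder_traversal, sin, symbols


def op_histogram(expr):
    """Count non-leaf nodes, keyed by (class, number of args)."""
    hist = Counter()
    for node in preorder_traversal(expr):
        if node.args:  # Symbol, Integer, etc. have no args and are skipped
            hist[(type(node), len(node.args))] += 1
    return hist


def binary_ops(hist, cls):
    """Binary operations implied by the n-ary nodes of class `cls`."""
    return sum((nargs - 1) * times
               for (c, nargs), times in hist.items() if c is cls)


x, y = symbols("x y")
expr = sin(x) * (x + y + 1) + cos(x + y)  # toy expression

hist = op_histogram(expr)
for (cls, nargs), times in sorted(hist.items(), key=str):
    print(f"{cls.__name__:>4}: {nargs} args {times} times")
print("adds:", binary_ops(hist, Add), "muls:", binary_ops(hist, Mul))
```

Because `preorder_traversal` revisits a repeated subexpression every time it occurs, this measures the size of the tree rather than the number of unique nodes.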

Running cse(f, optimizations='basic', order='none') on this takes 3.5 minutes and results in just 405 operations total, with 131 adds, 231 muls, 31 pows, and 12 sin/cos. It is more efficient than my carefully hand-written code, and performance is a concern. Love it!

	# <class 'sympy.core.add.Add'>                            :: 131 subtotal
	#         2 args     32 times,        32 adds
	#         3 args     13 times,        26 adds
	#        11 args      1 times,        10 adds
	#        10 args      2 times,        18 adds
	#         4 args      2 times,         6 adds
	#         6 args      4 times,        20 adds
	#        13 args      1 times,        12 adds
	#         8 args      1 times,         7 adds
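
For readers who have not used this call, here is a minimal sketch of the same cse() invocation on a toy pair of expressions (the expressions and the `back_substitute` helper are made up for illustration). It prints the replacements and checks numerically that substituting them back reproduces the inputs.

```python
from sympy import cos, cse, sin, symbols

x, y = symbols("x y")
exprs = [
    sin(x + y) ** 2 + cos(x + y) * sin(x + y),
    (x + y) * cos(x + y),
]

# Same options as in the post: 'basic' pre/post optimizations, no reordering.
replacements, reduced = cse(exprs, optimizations="basic", order="none")
for sym, sub in replacements:
    print(sym, "=", sub)
print(reduced)


def back_substitute(replacements, reduced):
    """Undo cse by substituting the temporaries back, last one first."""
    out = list(reduced)
    for sym, sub in reversed(replacements):
        out = [e.subs(sym, sub) for e in out]
    return out


# Numerical sanity check at an arbitrary point.
point = {x: 0.3, y: 0.7}
rebuilt = back_substitute(replacements, reduced)
errors = [abs(float((o - r).subs(point))) for o, r in zip(exprs, rebuilt)]
print(errors)
```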

Unfortunately, computing the Jacobian of my function takes 25 minutes and results in a tree with 1 billion (10^9) terms: 187 million adds, 450 million muls, 44 million pows, and 75 million sin/cos. I've left cse(jac_f, optimizations='basic', order='none') running for 18 hours and it still has not completed.

It might still return in a "reasonable" length of time, but since the number of terms in the expression tree grows exponentially with the depth of the computation, I am certainly close to the point where this becomes infeasible.

So far I have tried explicitly breaking the dependency. I used the multivariate chain rule on the Jacobian of f of g of x, and carefully prevented f from "seeing" g, so I could instead optimize multiple smaller expression trees.

https://wikimedia.org/api/rest_v1/media/math/render/svg/8e1b5731ed718474bd9d8fa61241cad7e0c7337a

To evaluate the Jacobian of f(g(x)) w.r.t. x, I take the output dimension of g(x) and create an ImmutableMatrix of new symbols, g_out, with no symbolic dependency on x. I plug g_out into the Jacobian of f, which gives me more, smaller (shallower) expression trees, and run cse on each individually. To actually evaluate Jac_f_g, I do a tiny bit of semi-manual surgery: I take the output of g(x) and plug it into Jac_f in place of g_out.

I've only just now experimentally implemented this idea with lambdify(cse=True) just to get an idea of the scaling properties of cse:

from sympy import ImmutableMatrix, lambdify, symbols

# Finds the Jacobian of f(g(x)) w.r.t. x
def simple_chain_rule(f, g, x):
    g_ = g(x)
    common_dim = g_.shape[0]  # output dimension of g (g is assumed to return a column Matrix)
    # Placeholder symbols with no symbolic dependency on x
    g_out = symbols(f"g_out:{common_dim}")

    Jac_f = f(g_out).jacobian(g_out)
    Jac_g = g_.jacobian(x)

    # Each piece gets its own, independent cse pass
    g_lamb = lambdify(x, g_, cse=True)
    Jac_f_lamb = lambdify(ImmutableMatrix(g_out), Jac_f, cse=True)
    Jac_g_lamb = lambdify(x, Jac_g, cse=True)

    def out_func(inp):
        g0 = g_lamb(inp)
        a = Jac_f_lamb(g0)
        a = a.reshape(a.shape[0], a.shape[1])  # Is there a good way of avoiding this?
        b = Jac_g_lamb(inp)
        return a.dot(b)  # chain rule: Jac_f(g(x)) * Jac_g(x)

    return out_func

I haven't really tested this yet, but it seems to work on a toy multivariate problem. Of course this approach will not optimize away common sub expressions shared between 2 or more of Jac_f, g, and Jac_g.
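
One way to check the decoupling idea itself, independent of lambdify, is a purely symbolic comparison on a toy problem. The sketch below uses made-up `f` and `g` (not the robotics code): it builds the Jacobian of f in placeholder symbols, substitutes g(x) back in, multiplies by the Jacobian of g, and compares the result against a direct call to .jacobian().

```python
from sympy import Matrix, cos, simplify, sin, symbols, zeros

x = Matrix(symbols("x:3"))


def g(v):  # toy inner function, R^3 -> R^3
    return Matrix([v[0] * v[1], sin(v[2]), v[0] + v[2] ** 2])


def f(u):  # toy outer function, R^3 -> R^2
    return Matrix([u[0] * cos(u[1]), u[1] + u[2] * u[0]])


g_x = g(x)
# Placeholder symbols standing in for the outputs of g
g_out = Matrix(symbols(f"g_out:{g_x.shape[0]}"))

jac_f = f(g_out).jacobian(g_out)  # contains only g_out symbols, never x
jac_g = g_x.jacobian(x)

# Chain rule: J_{f o g}(x) = J_f(g(x)) * J_g(x)
chained = jac_f.subs(dict(zip(g_out, g_x))) * jac_g
direct = f(g_x).jacobian(x)

difference = (chained - direct).applyfunc(simplify)
print(difference)
```

On a problem of realistic size the direct Jacobian is exactly what one is trying to avoid building, so a check like this only makes sense on small test cases.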

Is this my best path forward? Is there something I can do to make this more efficient, easier, etc? Or should I simply emit source code implementing the cse-optimized form of my function (not its Jacobian), then rely on (say) C++ templates to facilitate forward mode automatic differentiation to compute the Jacobian?

Is there a function already in SymPy which can do something analogous to my simple_chain_rule(f, g, x) function? Or is there something I am missing outright?

csp256 (Author) · 2025-05-27T20:48:19Z

My question stands, but I would like to note it took 19 hours, 8 minutes to run, and the output has gone from >1 billion operations to just 3,679 FLOPs: 1,090 adds, 2,512 muls, 65 pows, and 12 sin/cos. (With the weights in the summary below, that comes to 5,416 weighted FLOPs.)

	# Weighted FLOPs: 5416.0
	# Equal-weight FLOPs: 3679

	# sin                                                     :: weight 14.0 *    6 = 84.0 subtotal
	#         1 args      6 times,         6 eq_cost
	# cos                                                     :: weight 14.0 *    6 = 84.0 subtotal
	#         1 args      6 times,         6 eq_cost
	# <class 'sympy.core.add.Add'>                            :: weight 1.0 * 1090 = 1090.0 subtotal
	#         2 args    122 times,       122 eq_cost
	#         3 args    124 times,       248 eq_cost
	#        11 args      2 times,        20 eq_cost
	#        10 args      3 times,        27 eq_cost
	#         6 args     28 times,       140 eq_cost
	#         4 args     63 times,       189 eq_cost
	#        13 args      6 times,        72 eq_cost
	#         8 args      1 times,         7 eq_cost
	#         9 args     21 times,       168 eq_cost
	#         7 args      7 times,        42 eq_cost
	#         5 args     11 times,        44 eq_cost
	#        12 args      1 times,        11 eq_cost
	# <class 'sympy.core.mul.Mul'>                            :: weight 1.5 * 2512 = 3768.0 subtotal
	#         2 args   1164 times,      1164 eq_cost
	#         4 args     56 times,       168 eq_cost
	#         3 args    393 times,       786 eq_cost
	#         5 args     29 times,       116 eq_cost
	#         6 args     52 times,       260 eq_cost
	#         7 args      3 times,        18 eq_cost
	# <class 'sympy.core.power.Pow'>                          :: weight 6.0 *   65 = 390.0 subtotal
	#         2 args     65 times,        65 eq_cost
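
The weighted total in a summary like this one can be reproduced with a short walk over the tree. This is a sketch rather than the script used here: `flop_estimate` is a hypothetical name, the weights are copied from the summary above, any class without a listed weight is assumed to cost 1.0, and an n-argument Add or Mul is counted as n - 1 operations.

```python
from sympy import Add, Mul, Pow, cos, preorder_traversal, sin, symbols

# Per-operation weights taken from the summary above.
WEIGHTS = {Add: 1.0, Mul: 1.5, Pow: 6.0, sin: 14.0, cos: 14.0}


def flop_estimate(expr):
    """Return (equal-weight FLOPs, weighted FLOPs) for an expression tree."""
    equal, weighted = 0, 0.0
    for node in preorder_traversal(expr):
        if not node.args:  # leaves (Symbol, Integer, ...) cost nothing
            continue
        # n-ary Add/Mul need n - 1 binary operations; other nodes count once
        n_ops = len(node.args) - 1 if isinstance(node, (Add, Mul)) else 1
        equal += n_ops
        weighted += WEIGHTS.get(type(node), 1.0) * n_ops
    return equal, weighted


x, y = symbols("x y")
expr = sin(x) * (x + y + 1) + cos(x + y) + x ** 2  # toy expression
equal, weighted = flop_estimate(expr)
print(equal, weighted)
```

For the output of cse(), the same function would be applied to every replacement and every reduced expression, and the results summed.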

This is just barely usable for my case. However, adding a couple extra 'nice to have' things to my function takes the Jacobian from 1 billion terms to half a trillion, where cse() might take a year to run.

Also, is there any standard way of gathering statistics about large expression trees? If not, should I clean up my function emitting these summaries and make a pull request?

moorepants (Collaborator) · 2025-05-28T04:44:25Z

We have a function here: https://github.com/sympy/sympy/blob/master/sympy/simplify/_cse_diff.py that you can try. Note that it is private (i.e. subject to change without deprecation).

csp256 (Author) · 2025-05-28T17:14:37Z

Thank you!!

It seems that it still somehow keeps a lot of common subexpressions between Jacobian terms? (I copied the _forward_jacobian source locally and forced cse() to use optimizations='basic', then included everything else from that file.)

However, a second pass of cse given a list containing all the replacements and Jacobian terms appears to almost just work? And it takes under a minute to run, instead of a full day!
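
A rough sketch of what such a second pass can look like, using only the public cse(). The first-pass replacements and Jacobian terms below are hand-written stand-ins, not the output of the private _forward_jacobian helpers, and the `y` prefix is just an assumption to keep the two generations of temporaries visually apart.

```python
from sympy import cos, cse, numbered_symbols, simplify, sin, symbols

a, b = symbols("a b")
x0, x1 = symbols("x0 x1")

# Stand-in for a first pass: temporaries plus Jacobian terms written in them.
first_pass = [(x0, a * b), (x1, sin(x0))]
jac_terms = [x1 * cos(a + b), x0 * cos(a + b)]  # cos(a + b) is still shared

# Second pass: feed replacement right-hand sides and terms to cse together.
n = len(first_pass)
repl2, reduced2 = cse([rhs for _, rhs in first_pass] + jac_terms,
                      symbols=numbered_symbols("y"))

# The first n reduced expressions redefine the old temporaries.
defs = {sym: rhs for (sym, _), rhs in zip(first_pass, reduced2[:n])}
defs.update(repl2)
terms2 = reduced2[n:]
print(repl2, terms2)


def resolve(expr):
    """Substitute every temporary away (only used to verify the result)."""
    while expr.free_symbols & set(defs):
        expr = expr.subs(defs)
    return expr


expected = [sin(a * b) * cos(a + b), a * b * cos(a + b)]
residuals = [simplify(resolve(t) - e) for t, e in zip(terms2, expected)]
print(residuals)
```

When emitting code from the merged definitions, the old and new temporaries may depend on each other in either direction, so they would need to be ordered by dependency first.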

This functionality is very useful to me. If I can get this working satisfactorily, I would like to help make this a part of the public interface of SymPy. I've never contributed to open source before, so I'll review https://docs.sympy.org/latest/contributing/index.html

3 replies

moorepants May 29, 2025
Collaborator

If you want to help make it public, then I suggest opening an issue to propose some design ideas for doing so. The current issue is that making it behave like a call to .jacobian() does not necessarily speed anything up because it has to back replace all the common subexpressions. Our current plan is to use it internally in SymPy in various places and after we get a better understanding of the performance and usage needs, then we could formulate what it should look like as a public function.

csp256 May 29, 2025
Author

I am encountering a couple other pain points, so I think I'll make a small general purpose library for this application (as I have to solve this problem anyways), then use that as the basis for a conversation when I do ultimately open an issue to solicit feedback on what sympy can/should provide.

There are a few sympy features that are either missing or, more likely, I am simply unaware of. I reckon I'll open another Q&A discussion for those.

Thanks for your help, I'll let you know how this goes once I unbreak everything. 😅

csp256 May 30, 2025
Author

> The current issue is that making it behave like a call to .jacobian() does not necessarily speed anything up because it has to back replace all the common subexpressions.

It's funny you should say this. It took me a while to understand what you even meant, because I think I do not want to do that at all. I have been using _forward_jacobian_cse() directly, and keeping the factored terms around.

Regardless, _forward_jacobian() itself still also results in a significant speedup for me. And the backsubstituted Jacobian expression tree operation count is somehow still smaller than what I get with a naive call to .jacobian() (by a factor of about 5).

My use case is to essentially immediately generate code for the Jacobian, so no further symbolic manipulation is desired except to pull out common sub expressions between Jacobian terms... and back substituting first seems to only slow that down. (It slows it down so much I haven't yet tested to see if it would generate even more efficient code.)
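
As an illustration of going straight from factored terms to code without back-substituting, here is a minimal sketch that prints each replacement as a C-style temporary followed by the reduced outputs. The toy expressions and the output array name `out` are made up; `ccode` is SymPy's C printer, whose output for simple expressions like these is also valid C++.

```python
from sympy import cos, cse, sin, symbols
from sympy.printing import ccode

x, y = symbols("x y")
exprs = [sin(x + y) * cos(x + y), (x + y) * sin(x + y)]  # toy outputs

replacements, reduced = cse(exprs)

# One temporary per replacement, in the order cse returns them
# (each temporary only refers to earlier ones), then the outputs.
lines = [f"const double {sym} = {ccode(sub)};" for sym, sub in replacements]
lines += [f"out[{i}] = {ccode(e)};" for i, e in enumerate(reduced)]
print("\n".join(lines))
```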
