Description
Currently, switching between lazy and eager can incur a huge overhead even when both sides use the same device. This is mainly due to IR graph execution and the conversion of the tensor device type. The latter is not actually necessary; I think it exists for historical reasons (XRT), which can be seen from the interface names TransferToServer/TransferFromServer. Even a transfer from a GPU to the same GPU has to be routed through the CPU.
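A minimal sketch of the round-trip described above, assuming a GPU-backed torch_xla setup (e.g. PJRT_DEVICE=CUDA) where the XLA device and cuda:0 are the same physical GPU; the comments describe the current upstream behavior, not the PoC:

```python
import torch
import torch_xla.core.xla_model as xm

xla_device = xm.xla_device()  # backed by the same physical GPU as cuda:0
t_xla = torch.randn(1024, 1024, device=xla_device)

# Today this materializes the lazy tensor (TransferFromServer -> host buffer)
# and then re-uploads it to the GPU as an eager CUDA tensor, i.e. a
# device -> host -> device copy even though no device change is needed.
t_cuda = t_xla.to('cuda')

# The reverse direction pays the same host round-trip (TransferToServer).
t_back = t_cuda.to(xla_device)
```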
I'm implementing a PoC so that xla_tensor.to('cuda') and cuda_tensor.to('xla') are actually zero-copy. So far it can run an eager/lazy mixed MNIST.
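To make the target concrete, here is a hedged sketch of the kind of eager/lazy mix the PoC aims to support. The model split and training step are hypothetical, the zero-copy behavior of .to() at the boundary is the PoC's (upstream would copy through the host), and a GPU-backed XLA device is assumed:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

xla_device = xm.xla_device()

conv = nn.Conv2d(1, 8, 3).to('cuda')             # eager part runs on CUDA
fc = nn.Linear(8 * 26 * 26, 10).to(xla_device)   # lazy part runs via XLA

x = torch.randn(32, 1, 28, 28, device='cuda')
h = conv(x)                                      # eager CUDA kernels
h = h.flatten(1).to(xla_device)                  # with the PoC: zero-copy handoff to lazy
logits = fc(h)
loss = logits.sum()
loss.backward()                                  # gradients flow back across the boundary
xm.mark_step()                                   # flush the pending lazy graph
```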
But there may still be problems here: I used the _to_copy op, yet no copy actually happens, and I wonder whether that will cause issues in the backward pass during training.
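A small sanity check for that concern. Autograd records a copy node (ToCopyBackward0) for a cross-device .to(), and its backward is expected to land the gradient back on the source device; whether aliasing the storage in the forward (the PoC's assumed zero-copy semantics) keeps that contract is exactly what I'm unsure about:

```python
import torch
import torch_xla.core.xla_model as xm

xla_device = xm.xla_device()

a = torch.randn(4, 4, device=xla_device, requires_grad=True)
b = a.to('cuda')      # records a ToCopyBackward0 node, zero copy or not
c = (b * 2).sum()
c.backward()          # backward of the copy should place the grad on `a`

xm.mark_step()
print(a.grad.device)  # expected: the XLA device, regardless of zero copy
```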
I am currently considering how to implement zero copy while ensuring correctness, and would like to know if the community has any relevant experience.