Thanks for your wonderful work! I have a question: how does the visual specialist (e.g., stablevideo) receive both the textual instruction and the task features?
It seems that the textual instruction is a sequence of words, while the task features are matrices or tensors. How are the two combined into a single input for the visual specialist?