How does the visual specialist (e.g., stablevideo) receive both textual instruction and task features?

Thanks for your wonderful work! I have a question: how does the visual specialist (e.g., stablevideo) receive both the textual instruction and the task features?

It seems that the textual instructions are sequences of words, while the task features are matrices or tensors. How are the two combined into a single input for the visual specialist?