A multimodal model that generates reasoning outputs from both image and text inputs.