A pipeline for multimodal reasoning tasks that take both image and text inputs.
Discovered on HuggingFace.