A pipeline for processing image and text input translating them into text outputs.
Discovered on HuggingFace via HuggingFace:unknown