Analyzing Robustness of Modern Multimodal Transformer Architecture

Author: 
John Wahlig
Adviser(s): 
Priya Panda
Philipp Strack
Abstract: 

Multimodal transformers have emerged as a viable solution for machine perception tasks that span multiple modes of sensory information, such as image recognition, captioning, and translation. These models can process visual, textual, and other forms of input and synthesize them so that all aspects can be analyzed cohesively. However, multimodal transformers can struggle when one or more of the input modalities is noisy, corrupted, or missing. In this project, we analyze the impact of noisy modality data on the performance of a multimodal transformer architecture, using OpenAI’s CLIP (Contrastive Language-Image Pretraining) as an experimental subject. We also explore strategies for mitigating the performance degradation caused by noise; specifically, we investigate a technique called “text ensembling,” in which supplying the model with additional prompts makes it more resistant to noise. Text ensembling broadens the array of prompts the model can select from, in a sense diversifying the model’s input and enhancing its ability to separate key information from noisy text data, since the variety of prompts enables the model to detect common threads even among prompts that are superficially quite distinct. For example, the prompts “a rendition of a sedan” and “a jpeg corrupted photo of the sedan” could inform the model that “sedan” is common to both, even though the two prompts are linguistically different. This prevents the model from overfitting to specific prompts, and therefore enhances its overall robustness. Our data show that, by introducing more prompt variety through text ensembling, the model becomes less affected by noise and is more robust as a result.
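The prompt-averaging idea behind text ensembling can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the `embed_text` function below is a hypothetical stand-in for CLIP's text encoder (real usage would call something like `model.encode_text` from OpenAI's `clip` package), and the template strings are examples in the spirit of the abstract.

```python
import numpy as np

# Hypothetical stand-in for CLIP's text encoder: a fixed random projection
# of a bag-of-bytes vector. A real implementation would tokenize the prompt
# and call the CLIP model's text encoder instead.
_rng = np.random.default_rng(0)
_proj = _rng.normal(size=(256, 64))

def embed_text(prompt: str) -> np.ndarray:
    bag = np.zeros(256)
    for byte in prompt.encode("utf-8"):
        bag[byte] += 1.0
    v = bag @ _proj
    return v / np.linalg.norm(v)  # unit-normalized, as CLIP embeddings are

# Example prompt templates; the {} slot receives the class name.
PROMPT_TEMPLATES = [
    "a photo of a {}.",
    "a rendition of a {}.",
    "a jpeg corrupted photo of the {}.",
]

def ensemble_class_embedding(class_name: str) -> np.ndarray:
    # Embed every templated prompt, average, and renormalize: the shared
    # class name dominates while template-specific wording averages out.
    embs = np.stack([embed_text(t.format(class_name)) for t in PROMPT_TEMPLATES])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

def classify(image_emb: np.ndarray, class_names: list[str]) -> str:
    # Zero-shot classification: cosine similarity of the image embedding
    # against each ensembled class embedding; pick the best match.
    weights = np.stack([ensemble_class_embedding(c) for c in class_names])
    return class_names[int(np.argmax(weights @ image_emb))]
```

Because the ensembled class embedding is an average over many phrasings, no single prompt's wording dominates, which is the mechanism by which ensembling reduces sensitivity to noisy or unusual text.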

Term: 
Fall 2023