About

I am a Ph.D. student in Computer Science at Virginia Tech, advised by Dr. Chris Thomas at the Sanghani Center for Artificial Intelligence and Data Analytics. My research focuses on multimodal learning and vision-language models, with an emphasis on building retrieval systems that capture diverse, non-literal meaning across images, text, video, and audio.

Research

My primary line of work addresses a core limitation in cross-modal retrieval: standard contrastive objectives collapse representations into single-point embeddings that miss figurative, cultural, and abstract associations. To counter this, I develop multi-prompt embedding strategies and maximal matching objectives that preserve semantic diversity during training. This line of work led to MaxMatch (ACL 2025, Main), which improves cross-modal retrieval by up to 7.1% while preventing representation collapse. Most recently, Lenses (CVPR 2026) introduces a dataset and retrieval model that capture meaning through five interpretive perspectives (literal, figurative, abstract, emotional, and background), showing that a single image can support multiple valid interpretations.
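To make the idea concrete, the sketch below shows one way a maximal-matching contrastive objective can be written when each image is paired with several diverse text embeddings (e.g., from multiple prompts). The function name, tensor shapes, and symmetric InfoNCE formulation are illustrative assumptions of mine, not the exact objective used in MaxMatch.

```python
import torch
import torch.nn.functional as F

def max_match_contrastive_loss(img_emb, txt_embs, temperature=0.07):
    """Illustrative maximal-matching contrastive loss (a sketch, not the paper's objective).

    img_emb:  (B, D)    one embedding per image
    txt_embs: (B, K, D) K diverse text embeddings per image (e.g., multiple prompts)

    Each image is scored against a text set by its best-matching candidate,
    so the K candidates are not all pulled onto a single point.
    """
    img = F.normalize(img_emb, dim=-1)          # (B, D)
    txt = F.normalize(txt_embs, dim=-1)         # (B, K, D)

    # Similarity of every image to every candidate of every text set: (B, B, K)
    sim = torch.einsum("bd,nkd->bnk", img, txt) / temperature

    # Maximal matching step: keep only the best-matching candidate per pair,
    # leaving the other candidates free to encode alternative meanings.
    logits = sim.max(dim=-1).values             # (B, B)

    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric InfoNCE over the max-matched scores (image-to-text and text-to-image).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example usage with random embeddings: 8 images, 4 prompt variants each.
loss = max_match_contrastive_loss(torch.randn(8, 512), torch.randn(8, 4, 512))
```

The key design choice in this sketch is taking the maximum over candidate similarities before the softmax, which rewards at least one candidate for matching well rather than forcing all candidates toward the same target.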

I also work on grounded reasoning for vision-language models, contributing to benchmarks and evaluation pipelines that test whether VLMs truly understand compositional and abstract visual content. This includes JourneyBench (NeurIPS 2024), a challenging vision-language understanding benchmark, and ENTER (NeurIPS 2024 MAR Workshop, Spotlight), which introduces event-based interpretable reasoning for video question answering.

Selected Publications

  • [CVPR 2026] Lenses: Toward Polysemous Vision-Language Understanding.
  • [ACL 2025] Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval.
  • [NeurIPS 2024] JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark.
  • [NeurIPS 2024 Workshop, Spotlight] ENTER: Event-Based Interpretable Reasoning for VideoQA.
  • [CVPR 2025 Workshop] Real-Time Ultra-Fine-Grained Surgical Instrument Classification.

See the full list on the publications page.