Hani Alomari

I’m a third-year Ph.D. student in Computer Science at Virginia Tech, affiliated with the Sanghani Center for Artificial Intelligence and Data Analytics. I work at the intersection of computer vision and natural language processing, with a focus on multimodal learning and vision-language models. My research centers on developing robust, semantically aligned representations across modalities to support cross-modal retrieval, reasoning, and interpretability. Under the guidance of Dr. Chris Thomas, I design embedding models and architectures that capture nuanced, non-literal relationships while preserving modality-specific details and promoting semantic diversity.

Before joining Virginia Tech, I earned both my M.Sc. and B.Sc. in Computer Science from Jordan University of Science and Technology. I have collaborated with Prof. Rehab M. Duwairi and Dr. Malak A. Abdullah on projects involving information retrieval, emotion classification, and propaganda detection. I have also contributed to deep learning research for medical imaging tasks.

Specific research interests include:

  • Cross-modal retrieval across images, text, video, and audio

  • Learning diverse and semantically meaningful embeddings for multimodal alignment

  • Structured information extraction and representation from multimodal data

  • Knowledge structures and reasoning in vision-language models