GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations
Researchers have developed GraSP-VL, a method that makes better use of frozen vision-language model (VLM) embeddings by treating embedding length as a semantic granularity interface. The approach learns a shared prefix transform so that shorter prefixes capture coarse semantic roles and longer prefixes reveal finer distinctions. Experiments on the COCO and Flickr30K datasets show that GraSP-VL reorganizes VLM embeddings into a truncatable semantic prefix interface and outperforms simple compression techniques.
AI IMPACT: Lets users trade embedding length for semantic granularity when working with frozen vision-language model embeddings.
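To make the idea of a truncatable prefix concrete, here is a minimal sketch of how a prefix of a transformed embedding might be compared at different lengths. It is an illustration under stated assumptions, not the GraSP-VL implementation: the summary does not describe how the shared transform is trained, so the matrix `W` below is an identity placeholder, and the function names (`apply_transform`, `truncate_and_normalize`, `prefix_similarity`) and the random embeddings are hypothetical.

```python
import math
import random


def apply_transform(embedding, matrix):
    """Multiply an embedding by a square matrix (each row yields one output dim)."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in matrix]


def truncate_and_normalize(vector, k):
    """Keep the first k dimensions and L2-normalize the resulting prefix."""
    prefix = vector[:k]
    norm = math.sqrt(sum(v * v for v in prefix))
    return [v / norm for v in prefix] if norm > 0 else prefix


def prefix_similarity(a, b, matrix, k):
    """Cosine similarity of two embeddings using only their length-k prefixes."""
    pa = truncate_and_normalize(apply_transform(a, matrix), k)
    pb = truncate_and_normalize(apply_transform(b, matrix), k)
    return sum(x * y for x, y in zip(pa, pb))


rng = random.Random(0)
d = 8

# Assumption: identity matrix standing in for the learned shared transform.
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

# Assumption: random vectors standing in for frozen VLM image/text embeddings.
image_emb = [rng.gauss(0, 1) for _ in range(d)]
text_emb = [rng.gauss(0, 1) for _ in range(d)]

# Shorter prefixes would give a coarser comparison, longer ones a finer one.
for k in (2, 4, 8):
    print(k, round(prefix_similarity(image_emb, text_emb, W, k), 3))
```

With a learned transform in place of the identity, the same truncation step would let a caller pick the prefix length that matches the granularity a task needs, without re-running the frozen VLM.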