By aligning the embeddings of six modalities in a common space, ImageBind enables cross-modal retrieval of different types of content that aren’t observed together, the addition of embeddings from different modalities to naturally compose their semantics, and audio-to-image generation by using our audio embeddings with a pretrained DALL-E 2 decoder designed to work with CLIP text embeddings. ImageBind is a multimodal model that joins a recent series of Meta's open source AI tools.
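To make the retrieval and composition ideas concrete, here is a minimal sketch in Python with NumPy. It does not use the ImageBind code or weights: the three-dimensional vectors are hand-made stand-ins for real embeddings, and the helper names (`normalize`, `retrieve`, `compose`) are hypothetical. It only illustrates that, once every modality lives in one space, retrieval reduces to cosine similarity and composition to vector addition.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve(query, gallery, k=1):
    """Return indices of the k gallery rows most similar to the query (cosine)."""
    scores = normalize(gallery) @ normalize(query)
    return np.argsort(-scores)[:k]


def compose(a, b):
    """Combine two embeddings by adding their unit vectors and renormalizing."""
    return normalize(normalize(a) + normalize(b))


# Toy stand-ins for image embeddings (assumed values, not model outputs).
image_gallery = np.array([
    [1.0, 0.0, 0.0],  # image 0: a dog
    [0.0, 1.0, 0.0],  # image 1: a beach
    [0.7, 0.7, 0.0],  # image 2: a dog on a beach
])

# Toy stand-ins for an audio clip of barking and a photo of a beach.
audio_bark = np.array([0.9, 0.1, 0.0])
image_beach = np.array([0.1, 0.9, 0.0])

# Cross-modal retrieval: use the audio embedding to search the image gallery.
top_audio = retrieve(audio_bark, image_gallery)

# Composition: add the audio and image embeddings, then search again.
top_composed = retrieve(compose(audio_bark, image_beach), image_gallery)

print(top_audio, top_composed)
```

In this toy setup the barking audio alone retrieves the dog image, while the sum of the barking audio and the beach photo retrieves the image that contains both concepts. With a real model the embeddings would come from the modality-specific encoders rather than being written by hand.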