MAEB: Evaluating Audio Embeddings at Scale

Community Article · Published February 24, 2026

Key Takeaways:

  • MAEB brings audio into the MTEB ecosystem with 98 tasks, 100+ languages, and baselines for 50+ models: the first unified evaluation framework for audio embeddings.
  • No single model dominates: large audio-language models show the most promise overall, but every model family has clear blind spots.
  • Available now in MTEB v2.8. Run 100+ audio tasks with the same interface you already know.

Audio embedding models are everywhere: powering voice assistants, music recommendation, speaker verification, and multilingual transcription. Yet until now, there's been no good way to answer a deceptively simple question: which model is actually best?

Researchers have had to cobble together results from isolated benchmarks, each with its own protocol, its own datasets, its own metrics. A model that looks great on environmental sound classification might be useless for multilingual speech retrieval. A strong speech encoder might fail entirely on music. Without a unified testbed, these gaps stay hidden.

We built MAEB (Massive Audio Embedding Benchmark) to change that. It joins the MTEB ecosystem alongside text and image embeddings, bringing the same principled, community-maintained evaluation approach to audio.

[Figure: MAEB overview]

What MAEB Covers

MAEB spans 98 tasks across 7 categories and 11 acoustic domains, from the familiar to the overlooked:

Classification tests whether embeddings carry enough information for downstream prediction. Using 8-shot linear probing (just 8 examples per class), we evaluate emotion recognition, genre detection, intent classification, speaker counting, and more. The deliberately small training set keeps results honest: it's a test of the embedding, not the classifier.
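The k-shot probing protocol described above is simple to sketch. The snippet below is a minimal illustration with synthetic embeddings standing in for real audio features; it is not the benchmark's actual harness.

```python
# Sketch of 8-shot linear probing: train a linear classifier on just 8
# examples per class, then score on the rest. Embeddings are synthetic
# stand-ins for real audio features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, dim, k_shot = 4, 32, 8

# Fake "embeddings": one Gaussian cluster per class.
centers = rng.normal(size=(n_classes, dim))
X = np.vstack([c + 0.5 * rng.normal(size=(50, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), 50)

# Sample exactly k_shot training examples per class; the rest is the test set.
train_idx = np.concatenate(
    [rng.choice(np.where(y == c)[0], size=k_shot, replace=False)
     for c in range(n_classes)]
)
test_idx = np.setdiff1d(np.arange(len(y)), train_idx)

probe = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
acc = probe.score(X[test_idx], y[test_idx])
print(f"8-shot probe accuracy: {acc:.3f}")
```

Because the classifier sees so little data, its accuracy mostly reflects how well-separated the embedding space already is.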

Zero-Shot Classification asks models to match audio directly to text descriptions like "this is a sound of ocean waves," with no task-specific training. This is where you find out whether audio-text alignment actually generalizes.
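Mechanically, zero-shot classification reduces to a nearest-prompt search in embedding space. Below is a minimal sketch; the embeddings are random stand-ins for what a real audio-text model such as CLAP would produce.

```python
# Zero-shot classification sketch: embed the clip and each candidate text
# prompt, then pick the prompt with the highest cosine similarity.
# Embeddings here are synthetic stand-ins for a real audio-text model.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
prompts = ["this is a sound of ocean waves",
           "this is a sound of a dog barking",
           "this is a sound of a violin"]

# Stand-in embeddings; a real model produces aligned audio/text vectors.
text_embs = rng.normal(size=(len(prompts), 16))
audio_emb = text_embs[0] + 0.1 * rng.normal(size=16)  # clip resembling prompt 0

scores = [cosine(audio_emb, t) for t in text_embs]
pred = prompts[int(np.argmax(scores))]
print(pred)  # expected: the ocean-waves prompt
```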

Clustering is perhaps the most unforgiving task type. With no labels at all, embeddings must naturally organize similar audio together. Models that look strong on supervised tasks often fall apart here, revealing that their representations aren't as structured as they appear.
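A typical label-free clustering evaluation looks like the sketch below: cluster the embeddings, then score agreement with hidden ground truth. The data is synthetic; V-measure is one common choice of agreement metric, not necessarily the only one MAEB uses.

```python
# Clustering evaluation sketch: k-means over embeddings, scored against
# hidden labels with V-measure. Embeddings are synthetic stand-ins.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(2)
centers = rng.normal(scale=4.0, size=(3, 8))
X = np.vstack([c + rng.normal(size=(40, 8)) for c in centers])
true = np.repeat(np.arange(3), 40)

pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
v = v_measure_score(true, pred)
print(f"V-measure: {v:.3f}")
```

Nothing supervises the grouping, which is exactly why models tuned for discrimination can score near zero here.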

Retrieval covers audio-to-audio search as well as cross-modal search in both directions: finding audio given text and text given audio. MAEB's retrieval suite spans 156 languages through the FLEURS dataset, making multilingual cross-modal retrieval a first-class evaluation target.
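Cross-modal retrieval scoring boils down to ranking candidates by similarity to each query. The sketch below computes Recall@1 over synthetic paired embeddings; real evaluations typically also report Recall@k and nDCG.

```python
# Retrieval scoring sketch: rank candidate audio embeddings by cosine
# similarity to each text query and compute Recall@1.
# Embeddings are synthetic stand-ins for a real aligned audio-text model.
import numpy as np

rng = np.random.default_rng(3)
n, dim = 20, 32
audio = rng.normal(size=(n, dim))
# Paired "text" embeddings: the aligned audio vector plus noise.
text = audio + 0.2 * rng.normal(size=(n, dim))

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

sims = normalize(text) @ normalize(audio).T   # (queries, candidates)
top1 = sims.argmax(axis=1)
recall_at_1 = float((top1 == np.arange(n)).mean())
print(f"Recall@1: {recall_at_1:.2f}")
```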

Pair Classification asks models to compare two audio clips and predict a relationship: same accent, same sound category, same emotional tone. It's a direct probe of the embedding space geometry.
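Because pair classification probes geometry directly, a bare-bones version needs nothing more than a similarity threshold. The sketch below uses synthetic pairs; a real evaluation would sweep the threshold and report the best F1 or average precision rather than fixing it.

```python
# Pair classification sketch: score each pair of embeddings by cosine
# similarity and threshold it to predict "same" vs "different".
# Pairs are synthetic stand-ins for real audio clip embeddings.
import numpy as np

rng = np.random.default_rng(4)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

base = rng.normal(size=(10, 16))
same_pairs = [(v, v + 0.3 * rng.normal(size=16)) for v in base]  # label 1
diff_pairs = [(v, rng.normal(size=16)) for v in base]            # label 0

pairs = same_pairs + diff_pairs
labels = [1] * len(same_pairs) + [0] * len(diff_pairs)
scores = [cos(a, b) for a, b in pairs]

threshold = 0.5  # illustrative; real evaluations sweep this
preds = [int(s > threshold) for s in scores]
acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)
print(f"pair accuracy: {acc:.2f}")
```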

Reranking goes a step further than retrieval by evaluating discrimination over hard negative candidates: clips that are plausible but wrong. Good retrieval doesn't guarantee good reranking.
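The distinction from retrieval is that the candidate pool is small and adversarial. A minimal reranking check, sketched below with synthetic vectors, asks whether the one true positive outranks correlated-but-wrong negatives (an MRR-style score).

```python
# Reranking sketch: one positive plus several "hard" negatives that are
# correlated with the query, scored by the rank of the positive.
# All embeddings are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(5)
dim = 24
query = rng.normal(size=dim)
positive = query + 0.2 * rng.normal(size=dim)
# Hard negatives: plausible (query-correlated) but noisier than the positive.
negatives = [query + 1.5 * rng.normal(size=dim) for _ in range(9)]
candidates = [positive] + negatives

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

order = np.argsort([-cos(query, c) for c in candidates])
rank_of_positive = int(np.where(order == 0)[0][0]) + 1
print(f"reciprocal rank: {1 / rank_of_positive:.2f}")
```

With easy negatives, nearly any embedding ranks the positive first; hard negatives are what separate good rerankers from good retrievers.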

Multi-Label Classification handles the messy reality that audio rarely fits a single category. A clip from AudioSet might be "music," "outdoor," and "crowd" all at once. MAEB evaluates models on complex tagging tasks including bioacoustic species detection.
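Multi-label evaluation replaces the single-label probe with a per-label one and a ranking-aware metric. The sketch below uses a one-vs-rest linear probe over synthetic multi-hot labels; label-ranking average precision is one reasonable metric choice, not necessarily MAEB's exact one.

```python
# Multi-label tagging sketch: one-vs-rest linear probe over multi-hot
# labels, scored with label-ranking average precision (LRAP).
# Features and labels are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import label_ranking_average_precision_score
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(6)
n, dim, n_labels = 200, 16, 3
W = rng.normal(size=(dim, n_labels))
X = rng.normal(size=(n, dim))
Y = (X @ W > 0.5).astype(int)  # multi-hot targets: a clip can carry several tags

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X[:150], Y[:150])
probs = clf.predict_proba(X[150:])
lrap = label_ranking_average_precision_score(Y[150:], probs)
print(f"LRAP: {lrap:.3f}")
```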

No Single Model Wins


After evaluating 53 models, the headline finding is clear: no model is universally strong.

Large audio-language models like LCO-Embedding-Omni-7B lead the overall leaderboard (52.2% average), with particularly strong cross-modal retrieval (50.3%) and zero-shot classification (64.5%). Qwen2-Audio-7B ranks second overall and first on audio-only tasks, excelling at reranking (80.8%). These models benefit from rich language model backbones and multimodal pretraining, but they're large, and even they have glaring weaknesses.

Contrastive models like CLAP variants are excellent for environmental sound and music tasks but essentially fail at multilingual speech. On the MInDS-14 intent classification benchmark across 14 languages, CLAP scores hover around random chance for most languages. Whisper and MMS show the opposite pattern: strong on speech but weak outside the speech domain.

The most striking gap is clustering. Even the best model for this category (clap-htsat-fused) reaches only 22.7%. Top overall models do even worse: LCO-Embedding-Omni-7B scores just 1.7% on clustering despite leading everywhere else. This is a systemic failure: current audio embeddings are trained for supervised discrimination, not semantic organization.

There's also a fundamental acoustic vs. linguistic trade-off that MAEB makes visible. On VoxPopuli, CLAP-htsat-unfused achieves 94.4% on gender identification but only 30.0% on language identification. Whisper-medium flips this entirely (59.2% gender vs. 99.4% language). These aren't the same skill, and current architectures can't do both well simultaneously.

Multilingual Audio Is a Largely Unsolved Problem

MAEB evaluates across 165 languages. What the results show is sobering.

SIB-FLEURS topic classification results reveal a large resource gap. High-resource languages frequently achieve 50–80% accuracy, while many low-resource languages such as Umbundu, Yoruba, and Xhosa remain below ~30% even for the strongest models.

This disparity becomes catastrophic for cross-modal tasks. While audio-to-audio retrieval maintains reasonable performance across languages (50–99% on JamAlt), cross-modal audio-text retrieval collapses in multilingual settings. On FLEURS spanning 102 languages, most CLAP models score below 3% for the vast majority of language pairs, often below 1%, a direct consequence of training predominantly on English data.

The LCO-Embedding models are the notable exception, maintaining strong cross-modal retrieval across the majority of the 102 FLEURS languages, a result that shows what's possible when speech-text alignment is done at scale and across languages.

Better Encoders → Better Audio LLMs

One of the more practically useful findings is that encoder quality on MAEB predicts downstream Audio LLM performance. We took the encoders underlying four Audio LLMs (Qwen2-Audio, SALMONN, LTU, and Pengi), evaluated each encoder on a 26-task MAEB+ subset aligned with MMAU's Speech, Music, and Sound domains, and compared those scores against each LLM's published MMAU results. A positive correlation emerged (R²=0.86). The sample is small (n=4), so the result is preliminary, but the signal is meaningful: encoder embedding quality, as measured by MAEB, is predictive of how well the full system performs on multimodal audio reasoning tasks.
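For readers unfamiliar with the statistic, an R² like this comes from a least-squares fit between the two score sets. The numbers below are illustrative placeholders, not the actual MAEB or MMAU scores.

```python
# How an R^2 like the one above is computed: fit a least-squares line
# between encoder benchmark scores and downstream scores.
# Values are illustrative placeholders, not the paper's actual numbers.
import numpy as np

encoder_scores = np.array([41.0, 48.5, 52.0, 58.5])  # hypothetical MAEB+ subset scores
downstream = np.array([50.2, 55.0, 60.1, 66.0])      # hypothetical MMAU scores

slope, intercept = np.polyfit(encoder_scores, downstream, 1)
pred = slope * encoder_scores + intercept
ss_res = np.sum((downstream - pred) ** 2)
ss_tot = np.sum((downstream - downstream.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"R^2 = {r2:.3f}")
```

With n=4, a single outlier would move R² substantially, which is why the finding is framed as preliminary.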

Designed for Real-World Use

MAEB+ contains 98 tasks; while comprehensive, it can be expensive to run. The MAEB benchmark (30 tasks) was constructed through principled filtering: removing redundant tasks (Spearman ρ > 0.8 correlation with a retained task), keeping unique domain coverage regardless of cost, and prioritizing linguistic breadth. The result is a 2.2–3.3× speedup with only minimal loss in ranking accuracy (Pearson r=0.981 with the full MAEB+ collection).
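The redundancy-filtering step described above is easy to sketch: treat each task as a vector of per-model scores and drop a task whenever it correlates too strongly with one already kept. The scores below are synthetic stand-ins for real leaderboard data.

```python
# Sketch of redundancy filtering: drop a task when its per-model score
# vector has Spearman rho > 0.8 with an already-retained task.
# Scores are synthetic stand-ins for real leaderboard data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)
n_models = 12
scores = {
    "task_a": rng.uniform(0, 100, n_models),
    "task_b": None,  # filled below: near-duplicate of task_a
    "task_c": rng.uniform(0, 100, n_models),
}
scores["task_b"] = scores["task_a"] + rng.normal(scale=2.0, size=n_models)

retained = []
for name, vec in scores.items():
    redundant = any(
        spearmanr(vec, scores[kept]).correlation > 0.8 for kept in retained
    )
    if not redundant:
        retained.append(name)

print(retained)  # task_b should be filtered out as redundant with task_a
```

Note the greedy, order-dependent nature of this filter: the first task seen in a correlated pair is the one that survives.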

A small audio model can complete the full MAEB in 2 GPU hours on an A100. For researchers without access to large compute clusters, MAEB(audio-only), a 19-task subset restricted to audio encoders, makes evaluation even more accessible.


Getting Started

MAEB is available in MTEB v2.8 with the same interface you already know:

import mteb

# All 108 audio tasks available in MTEB
audio_tasks = mteb.get_tasks(modalities=["audio"])
len(audio_tasks) # 108

# All registered audio models
models = mteb.get_model_metas()
audio_models = [m for m in models if "audio" in m.modalities]
len(audio_models) # 56

# Evaluate -- same as always
task = audio_tasks[0] # e.g. CREMA_D
model = audio_models[0] # e.g. google/vggish
mteb.evaluate(model, task)

Or run the full MAEB benchmark directly:

pip install mteb[audio]
mteb run -b 'MAEB(audio)' -m openai/whisper-medium

The benchmark supports CLAP, Whisper, CNN-based models, wav2vec variants, LLM-based models, and any custom implementation that follows MTEB's model interface.
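A custom model only needs to expose an encoding entry point. The sketch below is hypothetical and assumes the audio interface mirrors MTEB's text one, where a model provides an `encode()` method returning a `(n_samples, dim)` numpy array; check the MTEB docs for the exact signature before relying on it.

```python
# Hypothetical sketch of a custom model for MTEB's interface, assuming it
# mirrors the text interface: an encode() method returning (n, dim) arrays.
# The feature extractor here is a toy placeholder.
import numpy as np

class MyAudioEncoder:
    """Toy encoder: summary statistics over raw waveforms (a placeholder
    for a real learned feature extractor)."""

    def encode(self, audio_batch, **kwargs):
        feats = []
        for waveform in audio_batch:          # each item: 1-D float array
            w = np.asarray(waveform, dtype=np.float32)
            feats.append([w.mean(), w.std(), w.min(), w.max()])
        return np.asarray(feats, dtype=np.float32)

# Smoke test on fake waveforms.
batch = [np.random.randn(16000) for _ in range(3)]   # 1 s at 16 kHz
embs = MyAudioEncoder().encode(batch)
print(embs.shape)  # (3, 4)
```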

What Comes Next

MAEB is a starting point, not an endpoint. The benchmark already points toward where the field needs to go: unified training objectives that don't sacrifice one acoustic domain for another, multilingual contrastive pretraining at scale, and architectures that can simultaneously capture acoustic properties and linguistic content.

The leaderboard, code, and all 98 tasks in MAEB+ are publicly available. We're building MAEB to evolve with the field, and we'd love contributions from the community to help it do exactly that.

Authors & Contributors:

Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha, Rahul Chand, Silky Singh, Kaitlyn Wang, Ali Sartaz Khan, Marc Moussa Nasser, Sufen Fong, Pengfei He, Alan Xiao, Ayush Sunil Munot, Aditya Shrivastava, Artem Gazizov, Niklas Muennighoff, and Kenneth Enevoldsen.

📄 Read the paper · 🏆 View the leaderboard
