ECAPA-TDNN Voice Encoder (from Qwen3-TTS 1.7B)

Standalone ECAPA-TDNN voice encoder extracted from Qwen/Qwen3-TTS-12Hz-1.7B-Base. Produces 2048-dimensional x-vector speaker embeddings from audio.

The encoder follows the ECAPA-TDNN architecture (Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification) and uses Res2Net blocks, squeeze-excitation attention, and attentive statistical pooling.

The encoder was extracted from the *-Base model variant only. The *-CustomVoice and *-VoiceDesign variants do not support speaker embeddings and will not work; do not attempt to use them for speaker encoding.

Usage

Recommended: AutoProcessor + AutoModel

import librosa
import torch
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained(
    "marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B", trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    "marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B", trust_remote_code=True,
)
model.eval()

audio, sr = librosa.load("audio.wav", sr=None, mono=True)
inputs = processor(audio, sampling_rate=sr)

with torch.no_grad():
    embedding = model(**inputs).last_hidden_state  # (1, 2048)
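Since the embeddings are plain vectors, two recordings can be compared for speaker similarity with cosine similarity. A minimal sketch; random placeholder vectors stand in for real `model(...).last_hidden_state` outputs, since the comparison function itself is the point:

```python
import torch
import torch.nn.functional as F

def speaker_similarity(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """Cosine similarity between two (1, 2048) or (2048,) x-vectors."""
    return F.cosine_similarity(emb_a.reshape(1, -1), emb_b.reshape(1, -1)).item()

torch.manual_seed(0)
# Placeholders standing in for two real encoder outputs
emb_a = torch.randn(1, 2048)
emb_b = torch.randn(1, 2048)

print(speaker_similarity(emb_a, emb_a))  # same vector -> 1.0
print(speaker_similarity(emb_a, emb_b))  # unrelated random vectors -> near 0
```

Same-speaker recordings typically score well above unrelated speakers; the exact decision threshold depends on your data and should be calibrated.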

Pipeline API

from transformers import pipeline

pipe = pipeline(
    "feature-extraction",
    model="marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B",
    trust_remote_code=True,
)

# From file path
embedding = pipe("audio.wav")  # list, shape (1, 2048)

# From numpy array
import librosa
audio, sr = librosa.load("audio.wav", sr=None, mono=True)
embedding = pipe(audio, sampling_rate=sr)
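The feature-extraction pipeline returns nested Python lists rather than arrays. A sketch of flattening the output back into a vector, with a dummy list standing in for a real `pipe(...)` result:

```python
import numpy as np

# Stand-in for the nested list that `pipe("audio.wav")` returns
embedding_list = [[0.1] * 2048]

embedding = np.asarray(embedding_list, dtype=np.float32).squeeze()
print(embedding.shape)  # (2048,)
```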

Saving & Loading Embeddings

Speaker embeddings can be stored and shared as SafeTensors files.

Save to SafeTensors

import torch
from safetensors.torch import save_file

# embedding: torch.Tensor of shape (2048,) or (1, 2048)
embedding = embedding.squeeze()  # ensure 1D
save_file({"speaker_embedding": embedding}, "my_voice.safetensors")

Load from SafeTensors

from safetensors.torch import load_file

tensors = load_file("my_voice.safetensors")
embedding = tensors["speaker_embedding"]  # (2048,)

Use the key "speaker_embedding" by convention; this matches the field name used by Qwen3-TTS and vLLM-Omni.

Using Embeddings with Qwen3-TTS

These embeddings are designed to drive voice cloning in the Qwen3-TTS family. There are two main inference paths: the qwen_tts Python package and the vLLM-Omni serving API.

qwen_tts (offline)

The qwen_tts package wraps the TTS model and exposes generate_voice_clone. To inject a pre-computed embedding without needing the original reference audio on disk, construct a VoiceClonePromptItem directly:

import torch
import soundfile as sf
from dataclasses import dataclass
from typing import Optional
from safetensors.torch import load_file

# The prompt item dataclass (mirrors qwen_tts.inference.qwen3_tts_model)
@dataclass
class VoiceClonePromptItem:
    ref_code: Optional[torch.Tensor]       # None when using x-vector only
    ref_spk_embedding: torch.Tensor        # (2048,)
    x_vector_only_mode: bool
    icl_mode: bool
    ref_text: Optional[str] = None

# 1. Load a saved embedding
embedding = load_file("my_voice.safetensors")["speaker_embedding"]  # (2048,)

# 2. Build the prompt item; no reference audio needed
prompt = VoiceClonePromptItem(
    ref_code=None,
    ref_spk_embedding=embedding,
    x_vector_only_mode=True,
    icl_mode=False,
)

# 3. Load the TTS model
from qwen_tts import Qwen3TTSModel

tts = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base", device_map="cuda:0",
)

# 4. Generate speech; the prompt is reusable across any text
wavs, sr = tts.generate_voice_clone(
    text="Hello from a stored embedding!",
    language="English",
    voice_clone_prompt=prompt,
)
sf.write("output.wav", wavs[0], sr)

x_vector_only_mode=True conditions generation on the speaker embedding alone, with no reference audio codes or transcript. Quality may be slightly reduced compared to full in-context voice cloning with a reference recording and transcript, but it lets you synthesize from stored embeddings without any audio files.

vLLM-Omni (online serving)

When serving Qwen3-TTS with vLLM-Omni, you can pass a pre-computed embedding directly via the speaker_embedding field in the API request.

The speaker_embedding field requires the ht branch of our vLLM-Omni fork. There is an upstream PR pending; use the fork until it is merged.

import httpx
from safetensors.torch import load_file

embedding = load_file("my_voice.safetensors")["speaker_embedding"]

response = httpx.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "model": "qwen3-tts-1.7b-base",
        "input": "Hello from a stored voice embedding.",
        "task_type": "Base",
        "speaker_embedding": embedding.tolist(),  # flat list of 2048 floats
        "response_format": "wav",
        "language": "Auto",
    },
    headers={"Authorization": "Bearer EMPTY"},
)

with open("output.wav", "wb") as f:
    f.write(response.content)

Embedding Arithmetic

Interpolate between voices using SLERP or weighted averaging:

import numpy as np

def slerp(v0, v1, t):
    """Spherical linear interpolation between two embeddings."""
    v0_n = v0 / (np.linalg.norm(v0) + 1e-8)
    v1_n = v1 / (np.linalg.norm(v1) + 1e-8)
    omega = np.arccos(np.clip(np.dot(v0_n, v1_n), -1, 1))
    if omega < 1e-6:
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * omega) / np.sin(omega)) * v0 + \
           (np.sin(t * omega) / np.sin(omega)) * v1

blended = slerp(embedding_a.numpy(), embedding_b.numpy(), t=0.5)
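For blending more than two voices, a norm-preserving weighted average is a simple alternative to pairwise SLERP. A sketch; rescaling the mix to the mean input norm is a heuristic to keep the result in the same magnitude range as real embeddings, not something the model requires:

```python
import numpy as np

def blend(embeddings, weights):
    """Weighted average of x-vectors, rescaled to the mean input norm."""
    embs = np.stack([np.asarray(e, dtype=np.float64) for e in embeddings])
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()  # normalize weights to sum to 1
    mixed = (w[:, None] * embs).sum(axis=0)
    target_norm = np.linalg.norm(embs, axis=1).mean()
    return mixed * target_norm / (np.linalg.norm(mixed) + 1e-8)

rng = np.random.default_rng(0)
a, b, c = rng.standard_normal((3, 2048))  # placeholders for real embeddings
mix = blend([a, b, c], [0.5, 0.3, 0.2])
print(mix.shape)  # (2048,)
```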

Model Details

| Property | Value |
|---|---|
| Architecture | ECAPA-TDNN |
| Embedding dimension | 2048 |
| Input | 128-bin log-mel spectrogram |
| Sample rate | 24000 Hz |
| Parameters | ~12.2M |
| Source model | Qwen/Qwen3-TTS-12Hz-1.7B-Base |
| License | Apache 2.0 |

Architecture

Input mel (batch, time, 128)
  → TDNN (128 → 512, k=5, d=1)
  → SE-Res2Net (512 → 512, k=3, d=2)
  → SE-Res2Net (512 → 512, k=3, d=3)
  → SE-Res2Net (512 → 512, k=3, d=4)
  → Multi-layer Feature Aggregation (1536 → 1536, k=1, d=1)
  → Attentive Statistics Pooling
  → Linear (3072 → 2048)
  → Output embedding (batch, 2048)
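The widths in the diagram follow from two concatenations: multi-layer feature aggregation stacks the three SE-Res2Net outputs channel-wise, and attentive statistics pooling concatenates a weighted mean and a weighted standard deviation. The bookkeeping, spelled out:

```python
res2net_channels = 512  # output width of each SE-Res2Net block
n_blocks = 3            # three SE-Res2Net stages feed the aggregation

mfa_width = n_blocks * res2net_channels  # 1536: channel-wise concat
asp_width = 2 * mfa_width                # 3072: [weighted mean; weighted std]
embedding_dim = 2048                     # final linear projection

print(mfa_width, asp_width, embedding_dim)  # 1536 3072 2048
```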

Audio Preprocessing

The model expects log-mel spectrograms with these parameters:

| Parameter | Value |
|---|---|
| Sample rate | 24000 Hz |
| FFT size | 1024 |
| Hop length | 256 |
| Window length | 1024 |
| Mel bins | 128 |
| Frequency range | 0–12000 Hz |
| Mel scale | Slaney |
| Compression | log(clamp(x, min=1e-5)) |
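With these parameters and the reflect padding of (n_fft - hop) / 2 = 384 samples per side used in the manual preprocessing example later in this document, the frame count for a clip of L samples works out to floor((L - 256) / 256) + 1, about 93 frames per second at 24 kHz. A quick check of the arithmetic:

```python
def n_mel_frames(n_samples: int, n_fft: int = 1024, hop: int = 256) -> int:
    """Frame count after reflect-padding (n_fft - hop) // 2 on each side."""
    padded = n_samples + 2 * ((n_fft - hop) // 2)
    return (padded - n_fft) // hop + 1

print(n_mel_frames(24000))      # 93 frames for 1 s of 24 kHz audio
print(n_mel_frames(3 * 24000))  # 281 frames for 3 s
```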

Dependencies

  • torch
  • transformers
  • librosa (for audio loading and mel filterbank computation)
  • numpy

Related Models

The 0.6B and 1.7B encoders produce embeddings of different dimensions (1024 vs 2048). They are not interchangeable; do not mix embeddings from different model sizes.
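A cheap guard against accidentally mixing sizes is to check the dimension before using a stored embedding (1024 would indicate the 0.6B encoder, 2048 this 1.7B one). A sketch with a hypothetical helper:

```python
import torch

EXPECTED_DIM = 2048  # this 1.7B encoder; the 0.6B encoder produces 1024

def check_embedding(emb: torch.Tensor) -> torch.Tensor:
    """Fail fast if an embedding came from the wrong encoder size."""
    emb = emb.squeeze()
    if emb.shape != (EXPECTED_DIM,):
        raise ValueError(f"expected ({EXPECTED_DIM},), got {tuple(emb.shape)}")
    return emb

ok = check_embedding(torch.randn(1, 2048))  # passes, returns a (2048,) tensor
```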

Alternative: Manual Mel Preprocessing

If you need full control over the mel spectrogram computation (e.g. for integration into a custom pipeline), you can bypass the feature extractor:

import torch
import librosa
from librosa.filters import mel as librosa_mel_fn
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B", trust_remote_code=True,
)
model.eval()

audio, sr = librosa.load("audio.wav", sr=None, mono=True)
if sr != 24000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)

y = torch.from_numpy(audio).unsqueeze(0).float()
mel_basis = torch.from_numpy(
    librosa_mel_fn(sr=24000, n_fft=1024, n_mels=128, fmin=0, fmax=12000)
).float()
padding = (1024 - 256) // 2
y = torch.nn.functional.pad(y.unsqueeze(1), (padding, padding), mode="reflect").squeeze(1)
spec = torch.stft(
    y, 1024, hop_length=256, win_length=1024,
    window=torch.hann_window(1024), center=False, return_complex=True,
)
mel = torch.log(torch.clamp(torch.matmul(mel_basis, torch.abs(spec)), min=1e-5))
mel = mel.transpose(1, 2)  # (1, time, 128)

with torch.no_grad():
    embedding = model(input_values=mel).last_hidden_state  # (1, 2048)

Citation

@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}
@article{ecapa-tdnn,
  title={ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification},
  author={Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  journal={Proc. Interspeech},
  year={2020}
}