# ECAPA-TDNN Voice Encoder (from Qwen3-TTS 1.7B)
Standalone ECAPA-TDNN voice encoder extracted from Qwen/Qwen3-TTS-12Hz-1.7B-Base. Produces 2048-dimensional x-vector speaker embeddings from audio.
The encoder follows the ECAPA-TDNN architecture (Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification) and uses Res2Net blocks, squeeze-excitation attention, and attentive statistical pooling.
This encoder was extracted from the `*-Base` model variant only. The `*-CustomVoice` and `*-VoiceDesign` variants do NOT support speaker embeddings and will not work; do not attempt to use them for speaker encoding.
## Usage

### Recommended: AutoProcessor + AutoModel

```python
import librosa
import torch
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained(
    "marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B", trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    "marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B", trust_remote_code=True,
)
model.eval()

audio, sr = librosa.load("audio.wav", sr=None, mono=True)
inputs = processor(audio, sampling_rate=sr)
with torch.no_grad():
    embedding = model(**inputs).last_hidden_state  # (1, 2048)
```
### Pipeline API

```python
from transformers import pipeline

pipe = pipeline(
    "feature-extraction",
    model="marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B",
    trust_remote_code=True,
)

# From a file path
embedding = pipe("audio.wav")  # list, shape (1, 2048)

# From a numpy array
import librosa
audio, sr = librosa.load("audio.wav", sr=None, mono=True)
embedding = pipe(audio, sampling_rate=sr)
```
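Once you have embeddings, comparing two speakers reduces to cosine similarity. A minimal sketch; the variable names are illustrative, and no decision threshold is published for this model, so treat any cutoff as something to calibrate yourself:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two speaker embeddings (any shape, flattened)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# e.g. emb_a = np.asarray(pipe("speaker_a.wav")).ravel()
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(2048)
emb_b = emb_a + 0.1 * rng.standard_normal(2048)  # stand-in for a second clip of the same voice
print(cosine_similarity(emb_a, emb_b))  # near 1.0; unrelated voices score much lower
```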
## Saving & Loading Embeddings

Speaker embeddings can be stored and shared as SafeTensors files.

### Save to SafeTensors

```python
import torch
from safetensors.torch import save_file

# embedding: torch.Tensor of shape (2048,) or (1, 2048)
embedding = embedding.squeeze()  # ensure 1D
save_file({"speaker_embedding": embedding}, "my_voice.safetensors")
```

### Load from SafeTensors

```python
from safetensors.torch import load_file

tensors = load_file("my_voice.safetensors")
embedding = tensors["speaker_embedding"]  # (2048,)
```

Use the key `"speaker_embedding"` by convention; it matches the field name used by Qwen3-TTS and vLLM-Omni.
## Using Embeddings with Qwen3-TTS

These embeddings are designed to drive voice cloning in the Qwen3-TTS family. There are two main inference paths: the `qwen_tts` Python package and the vLLM-Omni serving API.

### qwen_tts (offline)

The `qwen_tts` package wraps the TTS model and exposes `generate_voice_clone`. To inject a pre-computed embedding without needing the original reference audio on disk, construct a `VoiceClonePromptItem` directly:
```python
import torch
import soundfile as sf
from dataclasses import dataclass
from typing import Optional
from safetensors.torch import load_file

# The prompt item dataclass (mirrors qwen_tts.inference.qwen3_tts_model)
@dataclass
class VoiceClonePromptItem:
    ref_code: Optional[torch.Tensor]  # None when using x-vector only
    ref_spk_embedding: torch.Tensor   # (2048,)
    x_vector_only_mode: bool
    icl_mode: bool
    ref_text: Optional[str] = None

# 1. Load a saved embedding
embedding = load_file("my_voice.safetensors")["speaker_embedding"]  # (2048,)

# 2. Build the prompt item: no reference audio needed
prompt = VoiceClonePromptItem(
    ref_code=None,
    ref_spk_embedding=embedding,
    x_vector_only_mode=True,
    icl_mode=False,
)

# 3. Load the TTS model
from qwen_tts import Qwen3TTSModel
tts = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base", device_map="cuda:0",
)

# 4. Generate speech: reusable across any text
wavs, sr = tts.generate_voice_clone(
    text="Hello from a stored embedding!",
    language="English",
    voice_clone_prompt=prompt,
)
sf.write("output.wav", wavs[0], sr)
```
`x_vector_only_mode=True` skips the text encoder and uses only the speaker embedding. Quality may be slightly reduced compared to full voice cloning with a reference transcript, but it lets you synthesize from stored embeddings without any audio files.
### vLLM-Omni (online serving)

When serving Qwen3-TTS with vLLM-Omni, you can pass a pre-computed embedding directly via the `speaker_embedding` field in the API request.

The `speaker_embedding` field requires the `ht` branch of our vLLM-Omni fork. There is an upstream PR pending; use the fork until it is merged.
```python
import httpx
from safetensors.torch import load_file

embedding = load_file("my_voice.safetensors")["speaker_embedding"]

response = httpx.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "model": "qwen3-tts-1.7b-base",
        "input": "Hello from a stored voice embedding.",
        "task_type": "Base",
        "speaker_embedding": embedding.tolist(),  # flat list of 2048 floats
        "response_format": "wav",
        "language": "Auto",
    },
    headers={"Authorization": "Bearer EMPTY"},
)
with open("output.wav", "wb") as f:
    f.write(response.content)
```
## Embedding Arithmetic

Interpolate between voices using SLERP or weighted averaging:

```python
import numpy as np

def slerp(v0, v1, t):
    """Spherical linear interpolation between two embeddings."""
    v0_n = v0 / (np.linalg.norm(v0) + 1e-8)
    v1_n = v1 / (np.linalg.norm(v1) + 1e-8)
    omega = np.arccos(np.clip(np.dot(v0_n, v1_n), -1, 1))
    if omega < 1e-6:  # vectors nearly parallel: fall back to linear interpolation
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * omega) / np.sin(omega)) * v0 + \
           (np.sin(t * omega) / np.sin(omega)) * v1

blended = slerp(embedding_a.numpy(), embedding_b.numpy(), t=0.5)
```
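For the weighted-averaging path, here is a sketch for blending more than two voices; rescaling the mix to the mean input norm is my own choice to keep the result in a plausible range, not something the model card prescribes:

```python
import numpy as np

def blend(embeddings, weights):
    """Weighted average of several speaker embeddings, rescaled to the
    mean norm of the inputs."""
    embeddings = [np.asarray(e, dtype=np.float64).ravel() for e in embeddings]
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    mixed = sum(w * e for w, e in zip(weights, embeddings))
    target_norm = np.mean([np.linalg.norm(e) for e in embeddings])
    return mixed * (target_norm / (np.linalg.norm(mixed) + 1e-8))

rng = np.random.default_rng(0)
a, b, c = rng.standard_normal((3, 2048))
mix = blend([a, b, c], [0.5, 0.3, 0.2])
print(mix.shape)  # (2048,)
```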
## Model Details
| Property | Value |
|---|---|
| Architecture | ECAPA-TDNN |
| Embedding dimension | 2048 |
| Input | 128-bin log-mel spectrogram |
| Sample rate | 24000 Hz |
| Parameters | ~12.2M |
| Source model | Qwen/Qwen3-TTS-12Hz-1.7B-Base |
| License | Apache 2.0 |
## Architecture

```
Input mel (batch, time, 128)
  → TDNN (128 → 512, k=5, d=1)
  → SE-Res2Net (512 → 512, k=3, d=2)
  → SE-Res2Net (512 → 512, k=3, d=3)
  → SE-Res2Net (512 → 512, k=3, d=4)
  → Multi-layer Feature Aggregation (1536 → 1536, k=1, d=1)
  → Attentive Statistics Pooling
  → Linear (3072 → 2048)
  → Output embedding (batch, 2048)
```
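The Attentive Statistics Pooling stage is what collapses the time axis: per-channel attention weights are computed over frames, then the attention-weighted mean and standard deviation are concatenated (1536 → 3072, feeding the final linear layer). A simplified sketch; the full ECAPA-TDNN variant also conditions the attention on global context statistics, which is omitted here:

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Simplified attentive statistics pooling: softmax attention over time,
    then weighted mean and std concatenated (channels -> 2 * channels)."""
    def __init__(self, channels: int = 1536, bottleneck: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv1d(channels, bottleneck, 1), nn.Tanh(),
            nn.Conv1d(bottleneck, channels, 1),
        )

    def forward(self, x):  # x: (batch, channels, time)
        w = torch.softmax(self.attn(x), dim=-1)          # attention over frames
        mu = (x * w).sum(dim=-1)                         # weighted mean
        var = (x * x * w).sum(dim=-1) - mu * mu          # weighted variance
        return torch.cat([mu, var.clamp(min=1e-8).sqrt()], dim=-1)

pooled = AttentiveStatsPooling()(torch.randn(2, 1536, 50))
print(pooled.shape)  # torch.Size([2, 3072])
```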
## Audio Preprocessing
The model expects log-mel spectrograms with these parameters:
| Parameter | Value |
|---|---|
| Sample rate | 24000 Hz |
| FFT size | 1024 |
| Hop length | 256 |
| Window length | 1024 |
| Mel bins | 128 |
| Frequency range | 0–12000 Hz |
| Mel scale | Slaney |
| Compression | log(clamp(x, min=1e-5)) |
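Given these parameters, the mel frame count is predictable: the audio is reflect-padded by (FFT size − hop length) / 2 on each side (as in the manual preprocessing example in this card) and framed without centering. A quick cross-check against `torch.stft`:

```python
import torch

N_FFT, HOP, WIN = 1024, 256, 1024
PAD = (N_FFT - HOP) // 2  # symmetric reflect padding applied before the STFT

def num_mel_frames(n_samples: int) -> int:
    """Frame count of the STFT with center=False after symmetric padding."""
    return (n_samples + 2 * PAD - N_FFT) // HOP + 1

# Cross-check on 3 seconds of audio at 24 kHz
n = 3 * 24000
y = torch.zeros(1, n)
y = torch.nn.functional.pad(y.unsqueeze(1), (PAD, PAD), mode="reflect").squeeze(1)
spec = torch.stft(
    y, N_FFT, hop_length=HOP, win_length=WIN,
    window=torch.hann_window(WIN), center=False, return_complex=True,
)
print(spec.shape[-1], num_mel_frames(n))  # 281 281
```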
## Dependencies

- `torch`
- `transformers`
- `librosa` (for audio loading and mel filterbank computation)
- `numpy`
## Related Models

- marksverdhei/Qwen3-Voice-Embedding-12Hz-0.6B: 1024-dim embeddings from the 0.6B model

The 0.6B and 1.7B encoders produce embeddings of different dimensions (1024 vs 2048). They are not interchangeable; do not mix embeddings from different model sizes.
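A cheap guard against mixing sizes is to validate the dimension before use. The helper and the variant table below are my own, matching the dimensions stated above:

```python
import torch

# Embedding dimension per encoder size, per the model cards
EXPECTED_DIM = {"0.6B": 1024, "1.7B": 2048}

def check_embedding(emb: torch.Tensor, variant: str = "1.7B") -> torch.Tensor:
    """Fail fast if an embedding came from the wrong encoder size."""
    emb = emb.squeeze()
    expected = EXPECTED_DIM[variant]
    if emb.shape != (expected,):
        raise ValueError(
            f"expected a ({expected},) embedding for {variant}, got {tuple(emb.shape)}"
        )
    return emb

check_embedding(torch.zeros(1, 2048))            # ok: squeezed to (2048,)
# check_embedding(torch.zeros(1024))             # raises ValueError
```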
## Alternative: Manual Mel Preprocessing

If you need full control over the mel spectrogram computation (e.g. for integration into a custom pipeline), you can bypass the feature extractor:

```python
import torch
import librosa
from librosa.filters import mel as librosa_mel_fn
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B", trust_remote_code=True,
)
model.eval()

audio, sr = librosa.load("audio.wav", sr=None, mono=True)
if sr != 24000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
y = torch.from_numpy(audio).unsqueeze(0).float()

mel_basis = torch.from_numpy(
    librosa_mel_fn(sr=24000, n_fft=1024, n_mels=128, fmin=0, fmax=12000)
).float()

padding = (1024 - 256) // 2
y = torch.nn.functional.pad(y.unsqueeze(1), (padding, padding), mode="reflect").squeeze(1)
spec = torch.stft(
    y, 1024, hop_length=256, win_length=1024,
    window=torch.hann_window(1024), center=False, return_complex=True,
)
mel = torch.log(torch.clamp(torch.matmul(mel_basis, torch.abs(spec)), min=1e-5))
mel = mel.transpose(1, 2)  # (1, time, 128)

with torch.no_grad():
    embedding = model(input_values=mel).last_hidden_state  # (1, 2048)
```
## Citation

```bibtex
@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}

@article{ecapa-tdnn,
  title={ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification},
  author={Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  journal={Proc. Interspeech},
  year={2020}
}
```