Using a manifold vocoder for spectral voice and style conversion

Tuan Dinh; Alexander Kain; Kris Tjaden

doi:10.21437/Interspeech.2019-1176

Institute on Development and Disability

Research output: Contribution to journal › Conference article › peer-review

4 Scopus citations

Abstract

We propose a new type of spectral feature that is both compact and interpolable, and thus ideally suited for regression approaches that involve averaging. The feature is realized by means of a speaker-independent variational autoencoder (VAE), which learns a latent space based on the low-dimensional manifold of high-resolution speech spectra. In vocoding experiments, we showed that using a 12-dimensional VAE feature (VAE-12) resulted in significantly better perceived speech quality compared to a 12-dimensional MCEP feature. In voice conversion experiments, using VAE-12 resulted in significantly better perceived speech quality as compared to 40-dimensional MCEPs, with similar speaker accuracy. In habitual to clear style conversion experiments, we significantly improved the speech intelligibility for one of three speakers, using a custom skip-connection deep neural network, with the average keyword recall accuracy increasing from 24% to 46%.
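As a rough illustration of what "interpolable" means for a 12-dimensional latent feature, the sketch below shows three standard VAE building blocks: the KL term against a standard normal prior, the reparameterized sample, and a linear blend of two latent vectors. It is not the authors' implementation. The encoder and decoder networks are omitted, and every name here (`LATENT_DIM`, `kl_to_standard_normal`, `reparameterize`, `interpolate`) is hypothetical.

```python
import math
import random

# Assumption: 12 matches the VAE-12 feature size named in the abstract.
LATENT_DIM = 12


def kl_to_standard_normal(mu, logvar):
    """KL divergence between a diagonal Gaussian N(mu, exp(logvar)) and N(0, I)."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar)
    )


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), the usual VAE sampling step."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, logvar)
    ]


def interpolate(z_a, z_b, alpha):
    """Linear blend of two latent vectors; alpha=0 gives z_a, alpha=1 gives z_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical encoder outputs for two spectral frames.
    mu_a, logvar_a = [0.0] * LATENT_DIM, [0.0] * LATENT_DIM
    mu_b, logvar_b = [2.0] * LATENT_DIM, [-1.0] * LATENT_DIM

    z_a = reparameterize(mu_a, logvar_a, rng)
    z_b = reparameterize(mu_b, logvar_b, rng)
    z_mid = interpolate(z_a, z_b, 0.5)

    print("KL(frame a):", kl_to_standard_normal(mu_a, logvar_a))
    print("midpoint latent length:", len(z_mid))
```

In a trained model, a blended vector such as `z_mid` would be passed through the decoder to obtain a spectrum. Whether that spectrum sounds natural depends on the learned latent space, which this sketch does not model.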

Original language	English (US)
Pages (from-to)	1388-1392
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2019-September
DOIs	https://doi.org/10.21437/Interspeech.2019-1176
State	Published - 2019
Event	20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria
Duration	Sep 15 2019 → Sep 19 2019

Keywords

Intelligibility
Speech coding
Style conversion
Variational autoencoder
Voice conversion

ASJC Scopus subject areas

Language and Linguistics
Human-Computer Interaction
Signal Processing
Software
Modeling and Simulation

Access to Document

10.21437/Interspeech.2019-1176

Cite this

@article{2c387a7701244ea0a83a1a8d22911786,
  title = "Using a manifold vocoder for spectral voice and style conversion",
  abstract = "We propose a new type of spectral feature that is both compact and interpolable, and thus ideally suited for regression approaches that involve averaging. The feature is realized by means of a speaker-independent variational autoencoder (VAE), which learns a latent space based on the low-dimensional manifold of high-resolution speech spectra. In vocoding experiments, we showed that using a 12-dimensional VAE feature (VAE-12) resulted in significantly better perceived speech quality compared to a 12-dimensional MCEP feature. In voice conversion experiments, using VAE-12 resulted in significantly better perceived speech quality as compared to 40-dimensional MCEPs, with similar speaker accuracy. In habitual to clear style conversion experiments, we significantly improved the speech intelligibility for one of three speakers, using a custom skip-connection deep neural network, with the average keyword recall accuracy increasing from 24% to 46%.",
  keywords = "Intelligibility, Speech coding, Style conversion, Variational autoencoder, Voice conversion",
  author = "Tuan Dinh and Alexander Kain and Kris Tjaden",
  note = "Publisher Copyright: {\textcopyright} 2019 ISCA; 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 ; Conference date: 15-09-2019 Through 19-09-2019",
  year = "2019",
  doi = "10.21437/Interspeech.2019-1176",
  language = "English (US)",
  volume = "2019-September",
  pages = "1388--1392",
  journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
  issn = "2308-457X",
}

TY  - JOUR
T1  - Using a manifold vocoder for spectral voice and style conversion
AU  - Dinh, Tuan
AU  - Kain, Alexander
AU  - Tjaden, Kris
PY  - 2019
Y1  - 2019
AB  - We propose a new type of spectral feature that is both compact and interpolable, and thus ideally suited for regression approaches that involve averaging. The feature is realized by means of a speaker-independent variational autoencoder (VAE), which learns a latent space based on the low-dimensional manifold of high-resolution speech spectra. In vocoding experiments, we showed that using a 12-dimensional VAE feature (VAE-12) resulted in significantly better perceived speech quality compared to a 12-dimensional MCEP feature. In voice conversion experiments, using VAE-12 resulted in significantly better perceived speech quality as compared to 40-dimensional MCEPs, with similar speaker accuracy. In habitual to clear style conversion experiments, we significantly improved the speech intelligibility for one of three speakers, using a custom skip-connection deep neural network, with the average keyword recall accuracy increasing from 24% to 46%.
KW  - Intelligibility
KW  - Speech coding
KW  - Style conversion
KW  - Variational autoencoder
KW  - Voice conversion
UR  - http://www.scopus.com/inward/record.url?scp=85074714751&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85074714751&partnerID=8YFLogxK
U2  - 10.21437/Interspeech.2019-1176
DO  - 10.21437/Interspeech.2019-1176
M3  - Conference article
AN  - SCOPUS:85074714751
SN  - 2308-457X
VL  - 2019-September
SP  - 1388
EP  - 1392
JO  - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF  - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2  - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019
Y2  - 15 September 2019 through 19 September 2019
ER  - 
