Artificial Generation of Realistic Voices

Dhruva Mahajan(1*), Ashish Gapat(2), Lalita Moharkar(3), Prathamesh Sawant(4), Kapil Dongardive(5)

(1) Mumbai University
(2) Mumbai University
(3) Mumbai University
(4) Mumbai University
(5) Mumbai University
(*) Corresponding Author

Abstract


In this paper, we propose an end-to-end text-to-speech system in which a user supplies input text that is synthesized into an artificial voice at the output. The goal is a text-to-speech model, that is, a model capable of generating speech after being trained on speech datasets. The pipeline that produces the output sequence is organized into three parts: a Speaker Encoder, a Synthesizer, and a Vocoder. Trained on these datasets, the model generates voices while maintaining the naturalness of speech throughout; to preserve this naturalness we implement a zero-shot adaptation technique. The primary capability of the model is voice regeneration, which has a variety of applications in the field of speech synthesis. With the help of the speaker encoder, the model can synthesize speech in a user's own voice: the user records a short sample through the microphone in the GUI, and the voice regeneration component then generates similar voice waveforms for any input text.
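To make the data flow concrete, the sketch below shows how the three stages could be chained for zero-shot voice cloning. This is a minimal illustration in Python with stub classes and hypothetical names (SpeakerEncoder, Synthesizer, Vocoder, clone_voice), not the paper's implementation; a working system would replace the stubs with trained networks, for example a GE2E-style speaker encoder, a sequence-to-sequence synthesizer, and a WaveNet-style vocoder.

import numpy as np

class SpeakerEncoder:
    """Maps a short reference recording to a fixed-size speaker embedding."""
    def embed(self, reference_wav: np.ndarray) -> np.ndarray:
        # Stub: a real encoder runs the waveform through a trained network.
        rng = np.random.default_rng(abs(hash(reference_wav.tobytes())) % 2**32)
        e = rng.standard_normal(256)
        return e / np.linalg.norm(e)  # embeddings are typically L2-normalized

class Synthesizer:
    """Generates a mel spectrogram from text, conditioned on the embedding."""
    def synthesize(self, text: str, speaker_embedding: np.ndarray) -> np.ndarray:
        # Stub: returns a (frames x mel_bins) array; a real synthesizer is a
        # sequence-to-sequence model that attends over the input characters.
        n_frames = 20 * max(len(text), 1)
        return np.zeros((n_frames, 80))

class Vocoder:
    """Converts a mel spectrogram into a time-domain waveform."""
    def infer(self, mel: np.ndarray) -> np.ndarray:
        # Stub: a real neural vocoder generates audio conditioned on the mel frames.
        hop_length = 256  # audio samples per spectrogram frame
        return np.zeros(mel.shape[0] * hop_length)

def clone_voice(reference_wav: np.ndarray, text: str) -> np.ndarray:
    """Zero-shot pipeline: one short reference clip, arbitrary input text."""
    embedding = SpeakerEncoder().embed(reference_wav)   # who is speaking
    mel = Synthesizer().synthesize(text, embedding)     # what is being said
    return Vocoder().infer(mel)                         # audible waveform

if __name__ == "__main__":
    mic_sample = np.random.default_rng(0).standard_normal(22050 * 3)  # ~3 s clip
    wav = clone_voice(mic_sample, "Hello from the cloned voice.")
    print(f"Generated {wav.size} audio samples")

Because the speaker identity enters only through the embedding, the same text can be re-rendered in a new voice by swapping the reference clip, which is what makes the adaptation zero-shot.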







DOI: https://doi.org/10.24071/ijasst.v3i1.2744










Publisher: Faculty of Science and Technology

Society/Institution: Sanata Dharma University


This work is licensed under a Creative Commons Attribution 4.0 International License.