
1. Introduction
Acoustic-to-articulatory inversion is the technique of estimating the vocal tract shape or articulator positions from an input speech signal. It is of more than theoretical interest: it could benefit automatic speech recognition (ASR) [1], speech therapy and language training [2, 3], talking-head animation and lip-syncing [4-6], and low bit-rate speech coding [7].
Acoustic-to-articulatory inversion was first performed with codebook-based methods, in which a codebook was built from acoustic-articulatory parameter pairs generated by synthesizing sounds with an articulatory model while scanning the entire space of its control parameters. Articulation was then inferred by looking up the codebook [8, 9]. However, this approach can yield invalid vocal tract shapes, since the same acoustic features can be generated by different combinations of articulatory parameters, some of which never occur in human speech production. This problem can be mitigated by introducing dynamic programming [10] as a post-processing step or by using human data.
Thanks to the advent of large corpora of synchronized human articulatory-acoustic data, numerous inversion methods have been proposed over the past decades to tackle the acoustic-to-articulatory inversion problem. These include hidden Markov models (HMMs) [11-15], Kalman filtering [16], Gaussian mixture model (GMM) based regression [18], codebooks [19], non-linear regression with multilayer perceptrons (MLPs) [20], mixture density networks (MDNs) [21], deep neural networks [22, 23], and trajectory MDNs (TMDNs) [24]. In addition, several studies have incorporated visual features to obtain an audiovisual-to-articulatory mapping [15, 25].
Most of the methods mentioned above apply either a maximum likelihood or a least-square-error criterion to train the inversion model, where the coordinates of each coil are treated with equal importance. Nevertheless, different articulators have different importance in different articulations. The positions of some articulators show consistent patterns across different contexts, such as the lower lip for bilabials, the tongue tip for alveolars, and the tongue dorsum for velars; these are called "critical articulators" [20]. Hence, the question arises whether this criticality information can be incorporated into the conventional cost function to achieve better acoustic-to-articulatory inversion performance.
In this study, we apply a DNN with batch normalization to the task of acoustic-to-articulatory inversion. The cost function takes the form of a weighted least square error, where the weighting coefficient of each articulatory channel is determined by an exponential function of its velocity profile.
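As an illustration only, the weighted cost described above can be sketched as follows. The exact normalization of the velocity profile and the scaling constant `alpha` are assumptions for this sketch, not details taken from the paper; here the velocity is approximated by a first-order frame difference of the measured trajectories.

```python
import numpy as np

def velocity_weights(y, alpha=1.0):
    """Per-frame, per-channel weights from the exponential of the
    articulatory velocity profile. `alpha` is an assumed scaling
    constant; the paper's exact form is not specified here."""
    # First-order difference approximates each channel's velocity;
    # prepend the first frame so the output keeps the input shape.
    vel = np.abs(np.diff(y, axis=0, prepend=y[:1]))
    return np.exp(alpha * vel)

def weighted_lse(y_true, y_pred, alpha=1.0):
    """Weighted least-square-error cost over frames x channels:
    fast-moving (likely critical) channels receive larger weights."""
    w = velocity_weights(y_true, alpha)
    return np.mean(w * (y_true - y_pred) ** 2)
```

In this sketch, channels whose coils move quickly in a given frame contribute more to the cost, which is one simple way to emphasize the articulators that are active for the current sound.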