Bio: Hervé Bourlard is Director of the Idiap Research Institute, Full Professor at the Swiss Federal Institute of Technology Lausanne (EPFL), and Founding Director of the Swiss NSF National Centre of Competence in Research on “Interactive Multimodal Information Management (IM2)” (2001-2013). He is also an External Fellow of the International Computer Science Institute (ICSI), Berkeley, CA.
His research interests include statistical pattern classification, signal processing, multi-channel processing, artificial neural networks, and applied mathematics, with applications to a wide range of Information and Communication Technologies, including spoken language processing, speech and speaker recognition, language modeling, multimodal interaction, and augmented multi-party interaction.
H. Bourlard is the author/coauthor/editor of 8 books and over 330 reviewed papers (including one IEEE paper award). He is a Fellow of IEEE and ISCA, a Senior Member of ACM, and a member of its European Council. He is the recipient of several scientific and entrepreneurship awards.
Abstract: Over the last few years, artificial neural networks, now often referred to as deep learning or Deep Neural Networks (DNNs), have significantly reshaped research and development across a variety of signal and information processing tasks, while further pushing the state of the art in Automatic Speech Recognition (ASR).
In this talk, starting with a historical account of DNNs, we will provide an overview of deep learning methodology applied to ASR and recall/revisit key links with statistical inference, linear algebra, and more recent trends towards novel approaches such as sparse recovery modeling.
This overview will discuss the main properties of feed-forward, convolutional, and recurrent DNNs when used as very efficient discriminant classifiers, as strong posterior probability estimators (of the output classes conditioned on the input vectors in temporal context), or as feature extractors. We will then discuss the impact of those properties on current and future ASR technology.
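The "posterior estimator" view can be sketched in a few lines. The following is a toy illustration with random weights and invented dimensions, not any actual ASR model: a softmax output layer yields class posteriors for an input frame with temporal context, and the classic hybrid DNN/HMM recipe divides these posteriors by the class priors to obtain scaled likelihoods for HMM decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy feed-forward net: input = one acoustic frame with temporal context
# (e.g. 9 stacked 13-dim feature frames = 117 dims), output = 3 phone classes.
# Weights are random here, purely for illustration.
W1, b1 = rng.standard_normal((117, 32)), np.zeros(32)
W2, b2 = rng.standard_normal((32, 3)), np.zeros(3)

x = rng.standard_normal(117)          # one context window
h = np.maximum(0.0, x @ W1 + b1)      # ReLU hidden layer
posteriors = softmax(h @ W2 + b2)     # P(class | input), sums to 1

# Hybrid DNN/HMM trick: scaled likelihoods p(x | class) ∝ P(class | x) / P(class)
priors = np.array([0.5, 0.3, 0.2])    # assumed class priors (invented)
scaled_likelihoods = posteriors / priors

print(posteriors, posteriors.sum())
```

The same posterior outputs can equally be taken one layer earlier and used as features, which is the "feature extractor" role mentioned above.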
Bio: Niko Brummer received B.Eng (1986), M.Eng (1988) and Ph.D. (2010) degrees, all in electronic engineering, from Stellenbosch University. He worked as a researcher at DataFusion (later called Spescom DataVoice) and at AGNITIO, and is currently with Nuance Communications.
Most of his research over the last 25 years has been applied to automatic speaker and language recognition, and he has participated in most of the NIST SRE and LRE evaluations of these technologies, from the year 2000 to the present. He has contributed to the Odyssey Workshop series since 2001 and was the organizer of Odyssey 2008 in Stellenbosch. His FoCal and Bosaris Toolkits are widely used for fusion and calibration in speaker and language recognition research.
His research interests include the development of new algorithms for speaker and language recognition, as well as evaluation methodologies for these technologies. In both cases, his emphasis is on probabilistic modelling. He has worked with both generative (eigenchannel, JFA, i-vector PLDA) and discriminative (system fusion, discriminative JFA and PLDA) recognizers. In evaluation, his focus is on judging the goodness of classifiers that produce probabilistic outputs in the form of well-calibrated class likelihoods.
Abstract: Embeddings in machine learning are low-dimensional representations of complex input patterns with the property that simple geometric operations like Euclidean distances and dot products can be used for classification and comparison tasks. In speaker recognition, the i-vector, extracted with the help of a Gaussian mixture model, is a good example of an embedding. Recently, more general embeddings extracted with deep neural nets, known as x-vectors, have been disrupting the long reign of the i-vector.
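The "simple geometric operations" point can be made concrete with a minimal sketch. Random vectors stand in for real i-vectors or x-vectors here, and the dimension and noise level are invented:

```python
import numpy as np

def cosine_score(a, b):
    """Dot product of length-normalized embeddings, a standard trial score."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

rng = np.random.default_rng(1)
speaker = rng.standard_normal(400)                # hypothetical 400-dim embedding
same = speaker + 0.1 * rng.standard_normal(400)   # another recording, same speaker
other = rng.standard_normal(400)                  # an unrelated speaker

# The same-speaker pair scores much higher than the different-speaker pair.
print(cosine_score(speaker, same), cosine_score(speaker, other))
```

In practice the comparison backend is usually PLDA rather than a raw cosine, but the point stands: a fixed-length vector per recording makes trials cheap geometric operations.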
Although i-vectors and x-vectors are powerful representations of speaker information, they form a bottleneck that fails to quantify the uncertainty about the speaker that is inherent in low-quality inputs, such as short or noisy recordings. We propose meta-embeddings, a more powerful representation designed to allow this uncertainty to be propagated. This ultimately allows more accurate speaker recognition, especially in cases where the quality of the input is highly variable. Meta-embeddings live in Euclidean space, can be compared using dot products, and can be interpreted as distributed embeddings; but they are also points in a Hilbert space of functions, such that inner products in this space can be used for comparisons in the form of likelihood-ratio scores. This talk introduces the general theory, a first practical implementation and some encouraging experimental results.
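The uncertainty-propagation idea can be illustrated with a toy one-dimensional sketch (made-up numbers, not the implementation presented in the talk): each recording is represented not by a point but by a likelihood function over a latent speaker variable, trials are scored by a likelihood ratio, and a low-precision (noisy) recording then automatically yields a less extreme score.

```python
import numpy as np

# Discretized latent speaker variable z with a standard-normal prior.
z = np.linspace(-10.0, 10.0, 20001)
prior = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def integrate(f):
    """Riemann-sum integral over the z grid."""
    return float(np.sum(f) * (z[1] - z[0]))

def meta_embedding(mu, precision):
    """Likelihood function f(z) for one recording: higher precision = more
    confident about the speaker (e.g. a long, clean recording)."""
    return np.exp(-0.5 * precision * (z - mu) ** 2)

def llr(f1, f2):
    """Log likelihood-ratio: same speaker vs. independent speakers."""
    num = integrate(f1 * f2 * prior)
    den = integrate(f1 * prior) * integrate(f2 * prior)
    return float(np.log(num / den))

clean = meta_embedding(1.0, precision=10.0)   # long, clean recording: confident
noisy = meta_embedding(1.0, precision=0.5)    # short, noisy recording: uncertain
other = meta_embedding(-1.5, precision=10.0)  # a different speaker

test = meta_embedding(1.2, precision=10.0)    # enrollment vs. test trial
print(llr(clean, test), llr(noisy, test), llr(clean, other))
```

Both trials against `test` involve the same underlying speaker, but the noisy recording produces a weaker positive score: the representation itself carries the uncertainty, which is exactly what a point embedding's bottleneck discards.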