Advanced Seminar on Human-Machine Communication

The talks take place on Thursdays, 14:00-15:00 (45 min talk plus discussion), in the chair library, room N0116 (deviations are noted individually).

Summer Semester 2018

Thursday, 21.06.2018, 17:00, N0116 (chair library)

Dipl.-Ing. Clara Hollomey, research associate at the Professorship of Audio Signal Processing
Towards Understanding Musical Timbre

Abstract: This presentation aims at clarifying which signal parameters do and do not have an influence on musical timbre perception. Besides pitch, loudness and duration, musical timbre is one of the four defining properties of a musical sound. Timbre provides information on the kind of musical instrument playing the note, how the musician played it and the context within which the note was played. Therefore, understanding musical timbre perception corresponds to understanding how we make sense of musical sounds.

Unlike for pitch, loudness and duration, there is no generally accepted definition of musical timbre. Thus, it is not clear which signal parameters are salient to its perception. Previous research focuses on deriving musical timbre from the perceived similarities within musical sounds. Such approaches yield a considerable range of signal correlates that might contribute to musical timbre, but the findings depend on the stimuli used, and they are not coherent or generally extensible.

Musical timbre perception is considered as a source identification process targeted at the specific subclass of sinusoidal combinations that occur in the acoustic signals emitted by musical instruments. It is postulated that the auditory system at first splits the incoming sound into auditory sources, and only then, once it "knows" what a source is, draws information from it. In that sense, the decision on what constitutes a source limits the amount of information, or the timbre percept, that can be drawn from that source.

A series of listening tests was conducted, targeting the effects of static and dynamic signal properties as well as the effects of the temporal envelope on auditory source separation. The outcomes suggest that there are indeed signal parameters limiting auditory source perception and thus musical timbre perception. Based on the findings, a conceptually and perceptually meaningful definition of musical timbre is derived, which is hoped to provide a valid starting point for further research on human and machine listening alike.

Monday, 02.07.2018, 10:00, N0116 (chair library)

Prof. Dr. rer. nat. Dr. med. Birger Kollmeier, Medical Physics & Cluster of Excellence Hearing4All - Carl von Ossietzky Universität Oldenburg
Don't believe the elders: Modelling speech recognition with and without hearing aids using machine learning

Abstract: An overview of some of the work in our Cluster of Excellence Hearing4all (Oldenburg/Hannover) is given with a focus on recognizing speech in noise - the classical "Cocktail Party Problem", which becomes more acute with increasing hearing loss and age. While some years ago we assumed that automatic speech recognizers (ASR) need hearing aids to perform as well as human listeners in challenging acoustic situations, recent advances in ASR backends (such as DNN) and ASR frontends (such as Gabor features) have helped to close the gap between human and machine speech recognition performance. Hence, understanding Human Speech Recognition (HSR) is no longer instrumental in improving ASR. Conversely, ASR may help to better model and understand HSR, especially for hearing-impaired listeners and for predicting the effect of a hearing aid for the individual user.

Here the Framework for Auditory Discrimination Experiments (FADE, Schädler et al., JASA 2016) is employed for predicting patient performance employing the German Matrix sentence test (available for 20 major languages, see Kollmeier et al., Int. J. Audiol. 2015). It is compared with a DNN-based ASR system utilizing an open-set sentence recognition test. FADE can well predict the average individual performance with different (binaural) noise reduction algorithms using a cafeteria noise in comparison to individual empirical data from Völker et al. (2015) with R² of about 0.9. Using a simple approach to include suprathreshold performance deficits in the model (Kollmeier et al., 2017), a high precision of predicting the benefit from a hearing device can be achieved.

The analysis shows that several of the "classical" auditory-model-based assumptions and theories about hearing must be modified: For example, the negative effect of increased auditory filter widths in hearing-impaired listeners is much less pronounced than currently believed. Also, a very simple combination of a filterbank frontend with some amplitude modulation filter properties can already explain the observed speech recognition thresholds in strongly fluctuating noise much better than current standard models (SII, ESII, STOI, mr-sEPSM, Schubotz et al., 2016, Spille et al., 2018). Hence, there is still much to learn about our ears - and machines definitely help.

Kollmeier, B., Warzybok, A., Hochmuth, S., Zokoll, M. A., Uslar, V., Brand, T., & Wagener, K. C. (2015). The multilingual matrix test: Principles, applications, and comparison across languages: A review. International Journal of Audiology, 54(sup2), 3-16.

Kollmeier, B., Schädler, M. R., Warzybok, A., Meyer, B. T., & Brand, T. (2016). Sentence recognition prediction for hearing-impaired listeners in stationary and fluctuation noise with FADE: Empowering the Attenuation and Distortion concept by Plomp with a quantitative processing model. Trends in hearing, 20, 2331216516655795.

Schaedler, M. R., Warzybok, A., Ewert, S. D., & Kollmeier, B. (2016). A simulation framework for auditory discrimination experiments: Revealing the importance of across-frequency processing in speech perception. The Journal of the Acoustical Society of America, 139(5), 2708–2722.

Schubotz, W., Brand, T., Kollmeier, B., & Ewert, S. D. (2016). Monaural speech intelligibility and detection in maskers with varying amounts of spectro-temporal speech features. The Journal of the Acoustical Society of America, 140, 524–540.

Spille, C., Kollmeier, B., & Meyer, B. T. (2018). Predicting speech intelligibility with deep neural networks. Computer Speech and Language, 48, 51-66. doi:10.1016/j.csl.2017.10.004

Völker, C., Warzybok, A., & Ernst, S. M. (2015). Comparing binaural pre-processing strategies III: Speech intelligibility of normal-hearing and hearing-impaired listeners. Trends in hearing, 19, 2331216515618609.


Winter Semester 2017/2018

Thursday, 30.11.2017, 15:00, N0116 (chair library)

Gaetano Andreisek, Former Doctoral Candidate '13-'17 at Audio Information Processing
Acoustic defect detection in composite wind turbine blades using the tap test

Abstract: Wind turbine blades experience some of the highest stresses among modern civil structures. To ensure a lifetime of fifteen to twenty years, their structural integrity must be tested regularly. Advanced non-destructive testing (NDT) methods such as ultrasonic or thermographic testing can be used, but they are difficult to apply in the field, so more practical and cost-effective methods are sought. The tap test is routinely used in everyday field inspections of wind turbine blades: experienced engineers periodically tap on the blade shell and analyze the resulting impact sounds. This NDT method is unmatched in ease of operation, requires no bulky equipment, and covers large areas in a short time. However, it is often criticized for its subjective nature and vague defect detection mechanisms.

For this work, extensive tap test measurements were conducted on a set of wind turbine blades with different levels of defectiveness. With these data at hand, the presentation will touch on the approach used to demonstrate the full detection potential of the tap test in terms of defect depth and defect size based on acoustic features. We will cover the stages of acoustic feature extraction and reduction leading to an automated classification of the material condition based on well-defined features. As part of the analysis process, a new measure of defectiveness was developed to quantify the severity of critical voids in the glass-fiber composite material of wind turbine blades. Moreover, a comprehensive feature extraction toolbox was implemented in Matlab that facilitates a flexible definition and efficient extraction of several hundred features from acoustic impulse responses. The findings can be used to advance the tap test towards an automated testing procedure featuring the same capabilities as human inspectors while offering the opportunity to be more cost-effective and less subjective.


Summer Semester 2017

Thursday, 06.07.2017, 15:00, N0116 (chair library)

Mikhail Startsev, research associate in the junior research group VESPA
360-aware Saliency Prediction with Ensemble of Deep Networks 

Abstract: We aim at predicting saliency maps for equirectangular images of panoramic 360-image stimulus. To do so, we explore the applicability of regular 2D saliency predictors when combined with various ways of interpreting the input data, which incorporate the additional information provided by the recording scenario, as well as counteract projection distortions.

Thursday, 04.05.2017, 14:00, N0116 (chair library)

Simon Schenk, M.Sc., research associate at the Chair of MMK
GazeEverywhere: Enabling Gaze-only User Interaction on an Unmodified Desktop PC in Everyday Scenarios  

Abstract: Eye tracking is becoming more and more affordable, and thus gaze has the potential to become a viable input modality for human-computer interaction. We present the GazeEverywhere solution that can replace the mouse with gaze control by adding a transparent layer on top of the system GUI. It comprises three parts: i) the SPOCK interaction method that is based on smooth pursuit eye movements and does not suffer from the Midas touch problem; ii) an online recalibration algorithm that continuously improves gaze-tracking accuracy using the SPOCK target projections as reference points; and iii) an optional hardware setup utilizing head-up display technology to project superimposed dynamic stimuli onto the PC screen where a software modification of the system is not feasible. In validation experiments, we show that GazeEverywhere's throughput according to ISO 9241-9 was improved over dwell time based interaction methods and nearly reached trackpad level. Online recalibration reduced interaction target ('button') size by about 25%. Finally, a case study showed that users were able to browse the internet and successfully run Wikirace using gaze only, without any plug-ins or other modifications.


Winter Semester 2016/2017

Thursday, 17.11.2016, 15:00, N0116 (chair library)

Ayako Noguchi, Student of our Guest Professor Shinji Sako
Segmental conditional random fields audio-to-score alignment distinguishing percussion sounds from other instruments

Abstract: Audio-to-score alignment is a useful technique because it can be applied to many practical applications in musical performance. However, it is still an open problem due to the complexity of audio signals, especially in polyphonic music. Additionally, real-time performance is important in practical situations. In this study, we propose a new alignment method based on segmental conditional random fields (SCRFs). The attractive feature of this method is that it distinguishes percussion sounds from the other instruments. In general, percussion sounds play a role in coordinating the whole piece. Moreover, performers can pick out the percussion sounds from the others when hearing the whole sound, thanks to the unique features of these sounds. In the field of score alignment, hidden Markov models (HMMs) or CRFs were used in previous studies, including our own. In summary, these methods were formulated as a matching problem between the state sequence of mixed notes and the audio feature sequence. In this study, we extend our previous method by adding a state that represents percussion sounds. Furthermore, we introduce a balancing factor to control the importance of the classifying feature functions. We confirmed the effectiveness of our method in experiments using the RWC music database.

Shohei Awata, Student of our Guest Professor Shinji Sako
Vowel duration dependent hidden Markov model for automatic lyrics recognition

Abstract: Recently, due to the spread of music distribution services, a large amount of music has become available on the Internet. Accordingly, the demand for music information retrieval (MIR) is increasing. In MIR research, several studies extract meaningful information from music audio signals. However, automatic lyrics recognition is still a challenging problem because the variation of the singing voice is much larger than that of the speaking voice and a large database of singing voices is not available. In a related study, lyrics recognition was performed by extending the speech recognition framework based on hidden Markov models (HMMs). However, the accuracy was not sufficient. To recognize the singing voice precisely, one promising approach is to utilize musical features. This study considers the task of recognizing syllables from a cappella singing voice. To handle the variation in phoneme length, we construct a duration-dependent HMM. A large database of singing voices is essential for training the acoustic model. We use singing voices synthesized by an HMM-based singing voice synthesis system to compensate for the lack of a database of a cappella singing voices. We confirmed the effectiveness of our method.


Summer Semester 2016

Thursday, 21.07.2016, 14:00, N0116 (chair library)

Dr. Marco F.H. Schmidt, head of the international junior research group "Developmental Origins of Human Normativity" at LMU Munich
On the Ontogeny of Normativity  

The "DeNo" research group ("Developmental Origins of Human Normativity"), led by Dr. Marco F. H. Schmidt, investigates the ontogenetic basis of human normativity and cooperation. Their research focuses on the question of how children develop an understanding of norms and rules and which social-cognitive and motivational capabilities contribute to the ontogeny of human normativity.

Thursday, 14.07.2016, 14:00, N0116 (chair library)

Guido Maiello, UCL Institute of Ophthalmology, University College London, London, UK
Naturalistic Depth Perception and Binocular Vision  

How is binocular visual information processed, integrated and exploited by our perceptual and motor systems?

How does the visual system employ binocular input to maintain the accuracy of oculomotor control processes across the lifespan?

How does degraded binocular input affect the perceptual and motor systems?

Most eye movements in the real world redirect the foveae to objects at a new depth and thus require the co-ordination of saccadic and vergence eye movements. This talk will cover a mix of studies regarding perception, misperception and motor control in binocular vision. Using a gaze-contingent display, we first examine how the time course of binocular fusion depends on depth cues from blur and stereoscopic disparity in natural images. Our findings demonstrate that disparity and peripheral blur interact to modify eye movement behavior and facilitate binocular fusion. We then examine oculomotor plasticity when error signals are independently manipulated in each eye, which can occur naturally owing to aging, or in oculomotor dysfunctions. We find that both rapid saccades and slow vergence eye movements are continuously recalibrated independently of one another and corrections can occur in opposite directions in each eye. Lastly, since it is well known that the motor systems controlling the eyes and the hands are closely linked when executing tasks in peripersonal space, we examine how eye-hand coordination is affected by binocular and asymmetric monocular simulated visual impairment. We observe a critical impairment level up to which pursuit and vergence eye movements maintain fronto-parallel and sagittal tracking accuracy. Our results confirm that the motor control signals that guide hand movements are utilized by the visual system to plan pursuit and vergence eye movements in depth. The methods, results, and data I will present have implications for basic understanding of oculomotor control and 3D perception, as well as applied clinical implications.


Winter Semester 2015/2016

Thursday, 03.03.2016, 14:00, N0116 (chair library)

Aiming Wang, Ph.D., director of Advanced Space System Division, R&D Center, ISSE
Introduction of CAST and ISSE’s Space Activities  

China Academy of Space Technology (CAST), subordinate to China Aerospace Science and Technology Corporation (CASC), is the main development base for space technology and products in China. It is mainly engaged in the development and manufacturing of spacecraft, external exchange and cooperation in space technology, satellite applications, etc. Since 1970, CAST has successfully developed and launched over 180 spacecraft of various kinds, such as Dongfanghong, Shenzhou, Chang'e, Ziyuan and Beidou. The Institute of Spacecraft System Engineering (ISSE) is an institute of CAST. ISSE is the first and so far the largest institute engaged in spacecraft system design and spacecraft specialized technology in China. ISSE is mainly involved in system design, integrated assembly and the development of structure, thermal control, TT&C, data management and space environment subsystems. ISSE has successfully developed and launched about 100 spacecraft.

Aiming Wang received the Ph.D. degree in signal processing from the Institute of Electronics, Chinese Academy of Sciences, in 2002. From 2006 to 2011 he was a vice-professor, and since 2011 he has been a professor with the Institute of Spacecraft System Engineering (ISSE), China Academy of Space Technology (CAST). Currently, he is the director of the Advanced Space System Division, R&D Center, ISSE. His research interests include advanced space system design and SAR signal processing.

Summer Semester 2015

Tuesday, 30.06.2015, 14:00, N0116 (chair library)

Dipl.-Ing. Andreas Haslbeck, Chair of Ergonomics, Technical University of Munich
Gaze Sequence Analysis in Airline Pilots

Abstract: During manual flight, pilots must regularly check several important displays. The Chair of Ergonomics is currently investigating whether pilots scan these displays in random order or according to specific patterns. A further question is whether such patterns are adapted to the situation at hand. This talk presents the approach and the associated results of the study for discussion.

Thursday, 28.05.2015, 14:30, N0116 (chair library)

Leana Copeland, Research School of Computer Science, Australian National University
Analysis of Eye Movements and Reading Behaviours in eLearning Environments

Abstract: Online learning extends teaching and learning from the classroom to a wide and varied audience that has different needs, backgrounds, and motivations. Particularly in tertiary education, eLearning technologies are becoming increasingly ubiquitous. As a result, designing effective eLearning environments is of growing importance. The focus of this investigation is on the use of eye tracking technology to analyse reading behaviour in eLearning environments. The purpose is to examine how eye gaze can be used to increase reading comprehension and reduce distraction during reading within eLearning. To do this, three aspects are investigated. First, participants' eye gaze is recorded to analyse the effects of tutorial presentation on eye movements and reading behaviour. Secondly, the use of eye gaze to predict comprehension is explored. Finally, the use of eye tracking to mitigate distraction during reading in eLearning environments is explored.


Winter Semester 2014/2015

Friday, 16.01.2015, 15:00, N0116 (chair library)

D. M. Gavrila, Daimler R&D and Univ. of Amsterdam

Abstract: Human-Aware Intelligent Systems that use sensors to interact naturally with a human-inhabited environment represent the next frontier of information technology. They play an important role in applications such as intelligent vehicles, smart surveillance, social robotics, and entertainment.

The talk covers my research in this domain at Daimler R&D and the Univ. of Amsterdam focusing on visual perception and machine learning. I discuss pattern classification methods for learning visual appearance, probabilistic temporal models for pose estimation and activity recognition, and mixture models for activity discovery.

Summer Semester 2014

Wednesday, 06.08.2014, 14:00, N0116 (chair library)

Dr. Amr Ibrahim El-Desoky Mousa, Member of MISP-Group of MMK
Sub-Word Based Language Modeling of Morphologically Rich Languages for LVCSR

Abstract: Morphologically rich languages are challenging for efficient language modeling in large vocabulary continuous speech recognition. Complex morphology causes data sparsity and high out-of-vocabulary rates leading to poor language model probabilities. Traditional word m-gram models are usually characterized by high perplexities and suffer from the inability to model unseen words. In this work, we investigate alternative language modeling approaches for morphologically rich languages. We use hybrid sub-word based language models comprising multiple types of units such as morphemes, syllables, and joint character and phoneme sequence units. In addition, morphology-based classes derived on sub-word level are incorporated into the modeling process to support sparse m-grams. Models such as stream-based, class-based, and factored language models are tested and compared. The above approaches are combined with state-of-the-art models, like hierarchical Pitman-Yor language models, and feed-forward deep neural network language models. These approaches have been found to cope with data sparsity and improve the performance of Arabic, German, and Polish speech recognition systems. 

Mª Inmaculada Mohíno Herranz, University of Alcalá, Madrid/Spain:
Emotion recognition through analysis of biological signals

Abstract: First, I'll introduce myself and then I will present my previous and current work on emotion recognition systems. I worked on emotion recognition using speech and proposed a new feature. In addition, as one of the collaborators in the ATREC project (described below), I focused on emotion detection using biosignals (ECG, GSR, respiration) to detect the stress level of combatants. In the presentation I will discuss the recorded dataset, the methods used and the results that I achieved. Finally, I would welcome your comments regarding collaboration. ATREC project: The Spanish Ministry of Defense, through its Future Combatant program, has sought to develop technology aids with the aim of extending combatants' operational capabilities. This project combines multiple disciplines and fields, including wearable instrumentation, textile technology, signal processing, pattern recognition and psychological analysis of the obtained information.

Felix Weninger, Member of MISP-Group at MMK:
Supervised Training in Single-Channel Source Separation

Abstract: I will present my work on supervised training of non-negative matrix factorization (NMF) and deep neural network models for single-channel source separation and automatic speech recognition. I will give an accessible introduction to NMF and then present my pioneering work on discriminative NMF and deep neural networks for source separation, in particular long short-term memory deep recurrent neural networks (LSTM-DRNNs). My models obtain world-leading separation results on a very challenging task where speech is mixed with highly non-stationary noise. My talk will appeal to a general audience as well as to those interested in the latest developments in single-channel source separation.

Tuesday, 01.07.2014, 14:00, N0116 (chair library)

Dr.-Ing. Marc Al-Hames, managing director of Cliqz in Munich
Learning the Language of the Internet

The major purpose of this talk is to present an answer to the question: What really is the language of the Internet, and how can we learn (or classify) it, with a focus on data and models? Furthermore, it is explained why this is a key project at the company Cliqz and how it should result in promising applications and business for the company in the future. It will also be demonstrated that this turns out to be a real "Big Data" problem, which calls for methods from this novel and popular research area. The talk will be presented in English.

Dr.-Ing. Marc Al-Hames enjoys working at the intersection of technology and business: He received a PhD in Electrical Engineering and Information Technology at TU Munich with his thesis about “Graphical Models for Pattern Recognition”. He then worked at McKinsey & Company as engagement manager. In 2011 he joined Hubert Burda Media as Head of Strategy and M&A for the publicly listed Internet holding TOMORROW FOCUS AG. In 2013 he became the managing director of the early stage Internet company Cliqz in Munich.

Thursday, 12.06.2014, 17:00, N0116 (chair library)

Aswin Wijetillake, research associate, Audio Signal Processing group, TUM
Perceptual signal segregation in bilateral cochlear implant users

Thursday, 15.05.2014, 17:00, N0116 (chair library)

Dr. Eleftheria Georganti, ENT department of University Hospital Zurich
Evaluation schemes and methods towards the improvement of hearing devices


Winter Semester 2013/2014

Thursday, 12.12.2013, 15:00, N0116 (chair library)

Prof. Haifeng Li, Harbin Institute of Technology (HIT)
Brain Signal Processing for a Brain Controlled News Browser

A brain-computer interface (BCI) is a direct communication pathway between the brain and an external device. The key point in a BCI is the recognition of the user's intention through the analysis of their brain signal, the electroencephalogram (EEG). After giving a brief introduction to EEG processing technology, this talk will analyse some typical BCI mechanisms such as SSVEP, P300 and eye blink, and present how they are applied in the browser. Steady State Visually Evoked Potentials (SSVEP) are signals that are natural responses to visual stimulation at specific frequencies. In the browser, the control buttons were designed to blink at different frequencies, and the recognition of an SSVEP acts as a click on the corresponding button to explore the news list or to return to an upper menu. The P300 wave is an event-related potential (ERP) component elicited in the process of decision making. Thus, in our brain-controlled news browser, the P300 is recognized to find out which news items attract the user. Moreover, eye blinks are also detected and serve to confirm or abort a choice. Such mechanisms do not require any training of the users, thus making the browser easily applicable to any person. Our work re-emphasises the goals of BCIs, which are directed at assisting, augmenting, or repairing human cognitive or sensory-motor functions.
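
The SSVEP-based button selection can be illustrated with a minimal sketch. It is not the speaker's implementation: the button names, flicker frequencies, sampling rate and the synthetic single-channel "EEG" signal are all assumptions made for illustration, and a real system would need multi-channel recordings, filtering and a more robust detector. The sketch simply measures the signal power at each button's flicker frequency and selects the button with the highest power.

```python
import math

def band_power(signal, fs, freq):
    """Power of `signal` at `freq` (Hz), by projecting onto a cosine and a sine."""
    n = len(signal)
    re = sum(x * math.cos(2 * math.pi * freq * i / fs) for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq * i / fs) for i, x in enumerate(signal))
    return (re * re + im * im) / (n * n)

def detect_button(signal, fs, button_freqs):
    """Return the name of the button whose flicker frequency carries the most power."""
    return max(button_freqs, key=lambda name: band_power(signal, fs, button_freqs[name]))

# Hypothetical button layout: each button blinks at its own frequency (Hz).
BUTTONS = {"next": 8.0, "previous": 10.0, "back": 12.0}

def synthetic_eeg(freq, fs=250, seconds=2.0):
    """Toy signal: a response at the attended flicker frequency plus a 3 Hz background rhythm."""
    n = int(fs * seconds)
    return [
        math.sin(2 * math.pi * freq * i / fs) + 0.5 * math.sin(2 * math.pi * 3.0 * i / fs)
        for i in range(n)
    ]

if __name__ == "__main__":
    # The simulated user looks at the button flickering at 10 Hz.
    print(detect_button(synthetic_eeg(10.0), 250, BUTTONS))
```

In this toy setting the detected button plays the role of the "click" described in the abstract.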

Haifeng Li is a professor at the Institute of Intelligent Human-Computer Interaction, School of Computer Science and Technology, Harbin Institute of Technology (HIT). He received his Ph.D. in Electro-Magnetical Measuring Technique & Instrumentation from HIT in 1997 and his Ph.D. in Computer, Communication and Electronic Science from University Paris 6, France, in 2002. He started his teaching career in 1994 at HIT and was promoted to lecturer in 1995, professor in 2003 and doctoral supervisor in 2004. His research fields include Intelligent Information Processing Technology, Brain Machine Interaction Technology, Artificial Intelligence, and Cognitive Computing Science for application fields such as natural human-machine interaction, digital media processing and complex scientific process modeling. He has earned two 1st class research and education awards and three 2nd class awards from the provincial government. He is carrying out two projects of the National Natural Science Foundation, a project of the National High-Tech Foundation and several projects of the Provincial and Ministry Science Foundation. He has published two books and over 80 papers in journals and conferences at national and international level.

Thursday, 21.11.2013, 15:00, N0116 (chair library)

Marko Takanen, Aalto University
Functional Modeling of the Auditory Pathway for the Assessment of Spatial Sound Reproduction


Summer Semester 2013

Friday, 28.06.2013, 14:00, N6507 (AIP seminar room)

Dr. Fritz Menzer, EPFL
Binaural Audio Signal Processing Using Interaural Coherence Matching

Winter Semester 2012/2013

Thursday, 14.02.2013, 14:00, N0116 (chair library)

Dr. Verena Rieser, Lecturer at Heriot-Watt University, Edinburgh/UK
Natural Language Generation as Planning under Uncertainty for Statistical Interactive Systems

In this talk I present a novel approach to Natural Language Generation (NLG) in statistical Spoken Dialogue Systems, using a data-driven statistical optimisation framework for incremental Information Presentation (IP), where there is a trade-off to be solved between presenting “enough” information to the user and keeping the utterances short and understandable. In a case study on recommending restaurants, we show that an optimised IP strategy outperforms a baseline mimicking human behaviour in terms of total reward gained, in simulation. The policy is then also tested with real users, and improves on a conventional hand-coded IP strategy with up to a 9.7% increase in task success. This methodology provides new insights into the nature of the NLG problem, which has previously been treated as a module following dialogue management with limited access to context features. This type of model is now widely used in research applications, which I will briefly discuss. Examples include optimising information presentation in recommender systems, hierarchical NLG, personalisation, and efficient incremental search and trading agents.

Verena Rieser is a lecturer in Computer Science at Heriot-Watt University, Edinburgh. Before that, she undertook post-doctoral research at the Schools of Informatics and GeoSciences at the University of Edinburgh. She holds a PhD (summa cum laude) from Saarland University (2008) and an MSc with distinction from the University of Edinburgh (2004). Her PhD also received the Dr. Eduard-Martin award for distinguished doctoral dissertations. Her research is at the intersection of Machine Learning and Natural Language Processing, with applications ranging from Multi-Modal Interaction and Spoken Dialogue Systems to Computational Sustainability. She is currently serving as secretary for the ACL Special Interest Group in Natural Language Generation (SIGGEN) and she has recently published a book on "Reinforcement Learning for Spoken Dialogue Systems" (Springer, 2011). Since starting her research career in 2005, she has co-authored over 40 original research papers, which have been cited over 400 times (H-Index: 13 according to Google Scholar). Verena has strong ties to industry, where she is a visiting researcher at Nuance's Research Lab, Sunnyvale in 2013.

Thursday, 08.11.2012, 15:00, room N3815

Nicholas Evans, Assistant Professor at Eurecom
Biometric spoofing: countermeasures for speaker recognition

After a brief introduction to Eurecom and our wider interests in Speech and Audio Processing research, this talk will describe European efforts within the context of the FP7 Tabula Rasa project to develop new countermeasures to protect state-of-the-art biometric authentication systems from spoofing. After introducing the problem of biometric spoofing, I will introduce the Tabula Rasa consortium, the different biometric modalities that we are addressing and our work to develop software-based and hardware-based approaches to liveness detection, challenge-response countermeasures and new recognition methods which are intrinsically robust against attack. The talk will then concentrate on Eurecom’s role in developing speaker recognition spoofing countermeasures and some work in speech quality assessment, higher-level supra-segmental features, spectral texture analysis and fused approaches to spoofing detection. The talk concludes with a brief description of our contribution to the development of multimodal biometric countermeasures, where 2D face and voice modalities are combined.

Nicholas Evans is an Assistant Professor at Eurecom within its Multimedia Signal Processing Research Group and directs research in Speech and Audio Processing. His current interests include speaker diarization, modelling and recognition, multimodal biometrics and speech signal processing. He leads work in speaker recognition in the EU FP7 ICT Tabula Rasa project, has previously participated in the internationally competitive NIST Speaker Recognition Evaluations (SREs) and the NIST Rich Transcription Evaluations (RTEs) and is the lead author of a chapter on speaker recognition spoofing countermeasures for the Handbook of Biometric Anti-Spoofing to be published by Springer next year.