Matthew Perez

My name is Matthew Perez and I am a Computer Science Ph.D. candidate at the University of Michigan. I work with Dr. Emily Mower Provost at the Computational Human Artificial Intelligence (CHAI) Lab. My interests lie broadly in using speech analysis and machine learning to create intelligent systems for understanding human health and behavior. Currently, my focus is on applying deep learning techniques to improve speech recognition and speech characterization for low-resource applications such as disordered speech.

I received my M.S. in Computer Science and Engineering at the University of Michigan and my B.S. in Computer Science at the University of Notre Dame. I have been fortunate to receive the GEM Fellowship ('19) and the NSF Graduate Research Fellowship ('20).


Email | Resume | Google Scholar | GitHub

PronScribe: Highly Accurate Multimodal Phonemic Transcription From Speech and Text
Yang Yu, Matthew Perez, Ankur Bapna, Fadi Haik, Siamak Tazari, Yu Zhang
INTERSPEECH, 2023
Paper

We present PronScribe, a novel method for phonemic transcription from speech and text input based on careful finetuning and adaptation of a massive, multilingual, multimodal speech-text pretrained model. We show that our model is capable of phonemically transcribing pronunciations of full utterances with accurate word boundaries in a variety of languages covering diverse phonological phenomena, achieving phoneme error rates in the vicinity of 1-2%, which is comparable to human transcribers.

Episodic Memory For Domain-Adaptable, Robust Speech Emotion Recognition
James Tavernor, Matthew Perez, Emily Mower Provost
INTERSPEECH, 2023
Paper

In this paper, we investigate how a model can be adapted to unseen environments without forgetting previously learned environments. We show that memory-based methods maintain performance on previously seen environments while still being able to adapt to new environments. These methods enable continual training of speech emotion recognition models following deployment while retaining previous knowledge, working towards a more general, adaptable acoustic model.
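
As a rough illustration of the memory-based idea, the sketch below adapts a speech emotion recognition model to a new environment while replaying a few stored examples from earlier ones. It is a minimal sketch assuming a generic PyTorch classifier; the buffer, sampling scheme, and training loop (adapt_with_replay, replay_k) are illustrative and not the exact method from the paper.

import random
import torch
import torch.nn.functional as F

def adapt_with_replay(model, optimizer, new_domain_batches, memory, replay_k=8):
    """Fine-tune on a new environment while replaying examples stored from
    previously seen environments, so earlier domains are not forgotten."""
    model.train()
    for feats, labels in new_domain_batches:
        if memory:  # mix a few remembered examples into the current batch
            old = random.sample(memory, min(replay_k, len(memory)))
            feats = torch.cat([feats, torch.stack([f for f, _ in old])])
            labels = torch.cat([labels, torch.stack([y for _, y in old])])
        optimizer.zero_grad()
        loss = F.cross_entropy(model(feats), labels)
        loss.backward()
        optimizer.step()
        # keep a couple of new-environment examples for future replay
        for f, y in zip(feats[:2], labels[:2]):
            memory.append((f.detach(), y.detach()))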

Mind the gap: On the value of silence representations to lexical-based speech emotion recognition
Matthew Perez, Mimansa Jaiswal, Minxue Niu, Cristina Gorrostieta, Matthew Roddy, Kye Taylor, Reza Lotfian, John Kane, Emily Mower Provost
INTERSPEECH, 2022
Paper

We utilize non-speech frames (i.e., silence) in a BERT framework to improve speech emotion recognition. We find that silence has a significant impact on predicting valence, and our token analysis suggests that the presence of and proximity to silence are important factors in the latent text features extracted from BERT.
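
One way to surface silence to a lexical model is to insert marker tokens at aligned pauses before running BERT. The sketch below assumes word-level timestamps from some forced aligner; the [SIL] token, 0.5 s gap threshold, and mean pooling are illustrative assumptions rather than the paper's exact setup.

from transformers import BertTokenizerFast, BertModel
import torch

# Hypothetical word-level timestamps from a forced aligner: (word, start_s, end_s)
aligned = [("i", 0.00, 0.20), ("feel", 0.35, 0.70), ("fine", 1.90, 2.30)]

def insert_silence_tokens(aligned_words, min_gap=0.5, sil_token="[SIL]"):
    """Insert a marker token wherever the gap between words exceeds min_gap."""
    tokens, prev_end = [], 0.0
    for word, start, end in aligned_words:
        if start - prev_end >= min_gap:
            tokens.append(sil_token)
        tokens.append(word)
        prev_end = end
    return " ".join(tokens)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[SIL]"]})
model = BertModel.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))

text = insert_silence_tokens(aligned)  # -> "i feel [SIL] fine"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    utterance_embedding = model(**inputs).last_hidden_state.mean(dim=1)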

Enabling Off-the-Shelf Disfluency Detection and Categorization for Pathological Speech
Amrit Romana, Minxue Niu, Matthew Perez, Angela Roberts, Emily Mower Provost
INTERSPEECH, 2022
Paper

This work investigates the use of BERT for disfluency detection and categorization. We propose finetuning BERT with an additional triplet loss function in order to specifically focus on repetitions and revisions (categories which underperform with a baseline BERT model). We show that the added triplet loss leads to improved BERT performance for both revisions and repetitions while preserving performance on other categories.
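
The sketch below shows the general shape of such an objective: BERT token classification trained with cross-entropy plus a triplet term over pooled utterance embeddings. The pooling, margin, and the way anchor/positive/negative examples are formed are illustrative assumptions, not the paper's exact recipe.

import torch
import torch.nn as nn
from transformers import BertModel

class DisfluencyTagger(nn.Module):
    """BERT encoder with a token classification head; pooled anchor/positive/
    negative embeddings additionally feed a triplet loss."""
    def __init__(self, num_labels, margin=1.0):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        self.triplet = nn.TripletMarginLoss(margin=margin)
        self.ce = nn.CrossEntropyLoss()

    def embed(self, input_ids, attention_mask):
        return self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

    def forward(self, anchor, positive, negative, labels):
        h_a = self.embed(**anchor)
        logits = self.classifier(h_a)  # per-token disfluency logits
        ce_loss = self.ce(logits.view(-1, logits.size(-1)), labels.view(-1))
        pool = lambda h: h.mean(dim=1)  # utterance-level embedding
        tri_loss = self.triplet(pool(h_a), pool(self.embed(**positive)), pool(self.embed(**negative)))
        return ce_loss + tri_loss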

Articulatory Coordination for Speech Motor Tracking in Huntington Disease
Matthew Perez, Amrit Romana, Angela Roberts, Noelle Carlozzi, Jennifer Ann Miner, Praveen Dayalu, Emily Mower Provost
INTERSPEECH, 2021
Paper | Code

Acoustic biomarkers which capture articulatory coordination are particularly promising for characterizing motor symptom progression in people affected by Huntington Disease. In this paper, we utilize Vocal Tract Coordination (VTC) features extracted from read speech to estimate a motor severity score and show these features outperform other common baselines.
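
A rough sketch of the flavor of such features is below: channel-delay correlation matrices built from MFCC trajectories, summarized by their singular values and fed to a simple regressor. The delay grid, MFCC channels, and ridge regressor are illustrative stand-ins, not the exact VTC recipe used in the paper.

import numpy as np
import librosa
from sklearn.linear_model import Ridge

def coordination_features(wav_path, n_mfcc=13, delays=range(0, 50, 10)):
    """Singular-value spectrum of a channel-delay correlation matrix built from
    MFCC trajectories, as a coarse summary of articulatory coordination."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    blocks = []
    for d in delays:  # correlate channels against time-delayed copies
        shifted = np.roll(mfcc, -d, axis=1)
        blocks.append(np.corrcoef(mfcc, shifted)[:n_mfcc, n_mfcc:])
    corr = np.hstack(blocks)  # (n_mfcc, n_mfcc * n_delays)
    return np.linalg.svd(corr, compute_uv=False)

# Hypothetical usage: regress a motor severity score on the features
# X = np.stack([coordination_features(p) for p in wav_paths])
# severity_model = Ridge().fit(X, motor_severity_scores)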

Automatically Detecting Errors and Disfluencies in Read Speech to Predict Cognitive Impairment in People with Parkinson’s Disease
Amrit Romana, John Bandon, Matthew Perez, Stephanie Gutierrez, Richard Richter, Angela Roberts, Emily Mower Provost
INTERSPEECH, 2021
Paper | Code

This work investigates the use of speech errors and disfluencies in people with Parkinson's Disease as a means of analyzing cognitive impairment. In this study, we focus on read speech, which offers a controlled template from which we can detect errors and disfluencies, and we analyze how errors and disfluencies vary with cognitive impairment.
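
To make the "controlled template" idea concrete, the toy sketch below aligns an ASR transcript against the known reading passage and flags non-matching spans as candidate errors or disfluencies. The use of difflib and the example strings are purely illustrative, not the paper's detection pipeline.

import difflib

def flag_reading_deviations(reference_text, asr_transcript):
    """Align the spoken transcript against the read-speech template and flag
    insertions, deletions, and substitutions as candidate errors/disfluencies."""
    ref = reference_text.lower().split()
    hyp = asr_transcript.lower().split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp)
    deviations = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            deviations.append((op, " ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return deviations

# A repetition ("the the") shows up as an insertion, a misread word as a replace
print(flag_reading_deviations("the quick brown fox", "the the quick brown box"))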

Learning Paralinguistic Attributes from Audiobooks with Voice Conversion
Zakaria Aldeneh, Matthew Perez, Emily Mower Provost
NAACL, 2021
Paper

Paralinguistic tasks, specifically speech emotion recognition, have limited access to large datasets with accurate labels, which makes it difficult to train models that capture paralinguistic attributes via supervised learning. In this work, we propose the Expressive Voice Conversion Autoencoder (EVoCA), a framework for capturing paralinguistic (e.g., emotion) attributes from large-scale (i.e., 200 hours of) audio-textual data without requiring manual emotion annotations. The proposed network utilizes the conversion between synthesized (neutral) speech and real (expressive) speech in order to learn what makes speech expressive in an unsupervised manner.
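
A toy sketch in the spirit of that framework appears below: an autoencoder converts neutral synthesized frames toward the paired expressive recording through a small style bottleneck, so the bottleneck must carry whatever makes the speech expressive. The GRU layers, feature dimensions, and loss are illustrative assumptions, not EVoCA's actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressiveConversionAE(nn.Module):
    """Convert neutral (synthesized) features toward the matching expressive
    (real) recording; the small style vector acts as a paralinguistic embedding."""
    def __init__(self, feat_dim=80, style_dim=16):
        super().__init__()
        self.style_encoder = nn.GRU(feat_dim, style_dim, batch_first=True)
        self.decoder = nn.GRU(feat_dim + style_dim, feat_dim, batch_first=True)

    def forward(self, neutral, expressive):
        _, style = self.style_encoder(expressive)  # (1, batch, style_dim)
        style = style.transpose(0, 1).expand(-1, neutral.size(1), -1)
        recon, _ = self.decoder(torch.cat([neutral, style], dim=-1))
        return recon, style

model = ExpressiveConversionAE()
neutral = torch.randn(4, 100, 80)     # synthesized (neutral) mel frames
expressive = torch.randn(4, 100, 80)  # real (expressive) mel frames
recon, style = model(neutral, expressive)
loss = F.mse_loss(recon, expressive)  # reconstruction drives the style bottleneck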

Aphasic Speech Recognition using a Mixture of Speech Intelligibility Experts
Matthew Perez, Zakaria Aldeneh, Emily Mower Provost
INTERSPEECH, 2020
Paper

Automatic speech recognition (ASR) is a key component of automatic aphasic speech analysis. However, the current approach of using a standard, one-size-fits-all ASR model may be sub-optimal due to the wide range of speech intelligibility that exists both within and between speakers. This work investigates how speech intelligibility can be estimated using a neural network and how intelligibility variability can be addressed within an acoustic model architecture using a mixture of experts. Our results show that this style of modeling leads to significant phone recognition improvements compared to a traditional, one-size-fits-all model.
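
A minimal sketch of the mixture-of-experts idea is below: a gating network (standing in for an intelligibility estimator) weights the phone posteriors of several expert acoustic models. The two experts, utterance-level gating, and layer sizes are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class IntelligibilityMoE(nn.Module):
    """Each expert covers a band of speech intelligibility; a gating network
    weights their per-frame phone logits for the final prediction."""
    def __init__(self, feat_dim=40, num_phones=42, num_experts=2, hidden=256):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, num_phones))
            for _ in range(num_experts)])
        self.gate = nn.Sequential(nn.Linear(feat_dim, num_experts), nn.Softmax(dim=-1))

    def forward(self, feats):                   # feats: (batch, time, feat_dim)
        weights = self.gate(feats.mean(dim=1))  # utterance-level expert weights
        posts = torch.stack([e(feats) for e in self.experts], dim=-1)
        return (posts * weights[:, None, None, :]).sum(dim=-1)

model = IntelligibilityMoE()
phone_logits = model(torch.randn(8, 200, 40))   # -> (8, 200, 42)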

Classification of Huntington Disease using Acoustic and Lexical Features
Matthew Perez, Wenyu Jin, Duc Le, Noelle Carlozzi, Praveen Dayalu, Angela Roberts, Emily Mower Provost
INTERSPEECH, 2018
Paper

This work presents a pipeline for an automatic, end-to-end classification system using speech as the primary input for predicting Huntington Disease. We explore using transcript-based features to capture speech characteristics of interest and use methods such as k-Nearest Neighbors (with Euclidean and dynamic time warping distances) as well as more modern neural network approaches for classification.
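
As a small illustration of the k-Nearest Neighbors side of that comparison, the sketch below classifies an utterance by majority vote over its DTW-nearest training sequences. The pure-NumPy DTW, Euclidean frame distance, and k value are illustrative choices, not the paper's exact configuration.

import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences of shape (T, D)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def knn_predict(query, train_seqs, train_labels, k=3):
    """Label a held-out utterance by majority vote over its k DTW-nearest neighbors."""
    dists = [dtw_distance(query, seq) for seq in train_seqs]
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)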

Portable mTBI Assessment Using Temporal and Frequency Analysis of Speech
Louis Daudet, Nikhil Yadav, Matthew Perez, Christian Poellabauer, Sandra Schneider, Alan Huebner
IEEE Journal of Biomedical and Health Informatics, 2016
Paper

This work investigates the use of mobile devices to extract and analyze various acoustic features for detecting mild traumatic brain injury (mTBI). Our results suggest a strong correlation between certain temporal and frequency features and the likelihood of a concussion.
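
The sketch below gives a feel for the kind of temporal and frequency measures involved: a pause ratio and pitch variability extracted with librosa, correlated against an outcome score with SciPy. The specific features, thresholds, and Pearson test are illustrative assumptions, not the paper's feature set.

import numpy as np
import librosa
from scipy.stats import pearsonr

def temporal_frequency_features(wav_path):
    """Two illustrative measures: pause ratio (temporal) and F0 variability (frequency)."""
    y, sr = librosa.load(wav_path, sr=16000)
    intervals = librosa.effects.split(y, top_db=30)  # active (non-silent) regions
    speech_dur = sum(end - start for start, end in intervals) / sr
    pause_ratio = 1.0 - speech_dur / (len(y) / sr)
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
    return pause_ratio, np.nanstd(f0)

# Hypothetical usage: correlate each feature with a concussion score across recordings
# feats = np.array([temporal_frequency_features(p) for p in wav_paths])
# r, p = pearsonr(feats[:, 0], concussion_scores)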
