Matthew Perez

My name is Matthew Perez and I am a Computer Science Ph.D. candidate at the University of Michigan. I work with Dr. Emily Mower Provost at the Computational Human Artificial Intelligence (CHAI) Lab. My interests lie broadly in using speech analysis and machine learning to create intelligent systems for understanding human health and behavior. Currently, my focus is on applying deep learning techniques to improve speech recognition and speech characterization for low-resource applications such as disordered speech.

I received my M.S. in Computer Science and Engineering at the University of Michigan and my B.S. in Computer Science at the University of Notre Dame. I have been fortunate to receive the GEM Fellowship ('19) and the NSF Graduate Research Fellowship ('20).


Email | Resume | Google Scholar | GitHub

PronScribe: Highly Accurate Multimodal Phonemic Transcription From Speech and Text
Yang Yu, Matthew Perez, Ankur Bapna, Fadi Haik, Siamak Tazari, Yu Zhang
INTERSPEECH, 2023
Paper

We present PronScribe, a novel method for phonemic transcription from speech and text input based on careful finetuning and adaptation of a massive, multilingual, multimodal speech-text pretrained model. We show that our model is capable of phonemically transcribing pronunciations of full utterances with accurate word boundaries in a variety of languages covering diverse phonological phenomena, achieving phoneme error rates in the vicinity of 1-2%, which is comparable to human transcribers.

Episodic Memory For Domain-Adaptable, Robust Speech Emotion Recognition
James Tavernor, Matthew Perez, Emily Mower Provost
INTERSPEECH, 2023
Paper

In this paper, we investigate how a model can be adapted to unseen environments without forgetting previously learned environments. We show that memory-based methods maintain performance on previously seen environments while still being able to adapt to new environments. These methods enable continual training of speech emotion recognition models following deployment while retaining previous knowledge, working towards a more general, adaptable acoustic model.
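
As a rough illustration of the memory-based idea, the sketch below adapts a speech emotion recognition model to a new environment while replaying a few stored examples from earlier ones. It is a minimal sketch assuming a generic PyTorch classifier; the buffer, sampling scheme, and training loop (adapt_with_replay, replay_k) are illustrative and not the exact method from the paper.

import random
import torch
import torch.nn.functional as F

def adapt_with_replay(model, optimizer, new_domain_batches, memory, replay_k=8):
    """Fine-tune on a new environment while replaying examples stored from
    previously seen environments, so earlier domains are not forgotten."""
    model.train()
    for feats, labels in new_domain_batches:
        if memory:  # mix a few remembered examples into the current batch
            old = random.sample(memory, min(replay_k, len(memory)))
            feats = torch.cat([feats, torch.stack([f for f, _ in old])])
            labels = torch.cat([labels, torch.stack([y for _, y in old])])
        optimizer.zero_grad()
        loss = F.cross_entropy(model(feats), labels)
        loss.backward()
        optimizer.step()
        # keep a couple of new-environment examples for future replay
        for f, y in zip(feats[:2], labels[:2]):
            memory.append((f.detach(), y.detach()))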

Mind the gap: On the value of silence representations to lexical-based speech emotion recognition
Matthew Perez, Mimansa Jaiswal, Minxue Niu, Cristina Gorrostieta, Matthew Roddy, Kye Taylor, Reza Lotfian, John Kane, Emily Mower Provost
INTERSPEECH, 2022
Paper

We utilize non-speech frames (i.e., silence) in a BERT framework to improve speech emotion recognition. We find that silence has a significant impact on predicting valence, and our token analysis suggests that the presence of and proximity to silence are important factors in the latent text features extracted from BERT.
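
One way to surface silence to a lexical model is to insert marker tokens at aligned pauses before running BERT. The sketch below assumes word-level timestamps from some forced aligner; the [SIL] token, 0.5 s gap threshold, and mean pooling are illustrative assumptions rather than the paper's exact setup.

from transformers import BertTokenizerFast, BertModel
import torch

# Hypothetical word-level timestamps from a forced aligner: (word, start_s, end_s)
aligned = [("i", 0.00, 0.20), ("feel", 0.35, 0.70), ("fine", 1.90, 2.30)]

def insert_silence_tokens(aligned_words, min_gap=0.5, sil_token="[SIL]"):
    """Insert a marker token wherever the gap between words exceeds min_gap."""
    tokens, prev_end = [], 0.0
    for word, start, end in aligned_words:
        if start - prev_end >= min_gap:
            tokens.append(sil_token)
        tokens.append(word)
        prev_end = end
    return " ".join(tokens)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[SIL]"]})
model = BertModel.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))

text = insert_silence_tokens(aligned)  # -> "i feel [SIL] fine"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    utterance_embedding = model(**inputs).last_hidden_state.mean(dim=1)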

Enabling Off-the-Shelf Disfluency Detection and Categorization for Pathological Speech
Amrit Romana, Minxue Niu, Matthew Perez, Angela Roberts, Emily Mower Provost
INTERSPEECH, 2022
Paper

This work investigates the use of BERT for disfluency detection and categorization. We propose finetuning BERT with an additional triplet loss function in order to specifically focus on repetitions and revisions (categories which underperform with a baseline BERT model). We show that the added triplet loss leads to improved BERT performance for both revisions and repetitions while preserving performance on other categories.
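
The sketch below shows the general shape of such an objective: BERT token classification trained with cross-entropy plus a triplet term over pooled utterance embeddings. The pooling, margin, and the way anchor/positive/negative examples are formed are illustrative assumptions, not the paper's exact recipe.

import torch
import torch.nn as nn
from transformers import BertModel

class DisfluencyTagger(nn.Module):
    """BERT encoder with a token classification head; pooled anchor/positive/
    negative embeddings additionally feed a triplet loss."""
    def __init__(self, num_labels, margin=1.0):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        self.triplet = nn.TripletMarginLoss(margin=margin)
        self.ce = nn.CrossEntropyLoss()

    def embed(self, input_ids, attention_mask):
        return self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

    def forward(self, anchor, positive, negative, labels):
        h_a = self.embed(**anchor)
        logits = self.classifier(h_a)  # per-token disfluency logits
        ce_loss = self.ce(logits.view(-1, logits.size(-1)), labels.view(-1))
        pool = lambda h: h.mean(dim=1)  # utterance-level embedding
        tri_loss = self.triplet(pool(h_a), pool(self.embed(**positive)), pool(self.embed(**negative)))
        return ce_loss + tri_loss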

Articulatory Coordination for Speech Motor Tracking in Huntington Disease
Matthew Perez, Amrit Romana, Angela Roberts, Noelle Carlozzi, Jennifer Ann Miner, Praveen Dayalu, Emily Mower Provost
INTERSPEECH, 2021
Paper | Code

Acoustic biomarkers which capture articulatory coordination are particularly promising for characterizing motor symptom progression in people affected by Huntington Disease. In this paper, we utilize Vocal Tract Coordination (VTC) features extracted from read speech to estimate a motor severity score and show these features outperform other common baselines.
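
A rough sketch of the flavor of such features is below: channel-delay correlation matrices built from MFCC trajectories, summarized by their singular values and fed to a simple regressor. The delay grid, MFCC channels, and ridge regressor are illustrative stand-ins, not the exact VTC recipe used in the paper.

import numpy as np
import librosa
from sklearn.linear_model import Ridge

def coordination_features(wav_path, n_mfcc=13, delays=range(0, 50, 10)):
    """Singular-value spectrum of a channel-delay correlation matrix built from
    MFCC trajectories, as a coarse summary of articulatory coordination."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    blocks = []
    for d in delays:  # correlate channels against time-delayed copies
        shifted = np.roll(mfcc, -d, axis=1)
        blocks.append(np.corrcoef(mfcc, shifted)[:n_mfcc, n_mfcc:])
    corr = np.hstack(blocks)  # (n_mfcc, n_mfcc * n_delays)
    return np.linalg.svd(corr, compute_uv=False)

# Hypothetical usage: regress a motor severity score on the features
# X = np.stack([coordination_features(p) for p in wav_paths])
# severity_model = Ridge().fit(X, motor_severity_scores)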

Automatically Detecting Errors and Disfluencies in Read Speech to Predict Cognitive Impairment in People with Parkinson’s Disease
Amrit Romana, John Bandon, Matthew Perez, Stephanie Gutierrez, Richard Richter, Angela Roberts, Emily Mower Provost
INTERSPEECH, 2021
Paper | Code

This work investigates the use of speech errors and disfluencies in people with Parkinson's Disease as a means of analyzing cognitive impairment. In this study, we focus on read speech, which offers a controlled template from which we can detect errors and disfluencies, and we analyze how errors and disfluencies vary with cognitive impairment.
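
To make the "controlled template" idea concrete, the toy sketch below aligns an ASR transcript against the known reading passage and flags non-matching spans as candidate errors or disfluencies. The use of difflib and the example strings are purely illustrative, not the paper's detection pipeline.

import difflib

def flag_reading_deviations(reference_text, asr_transcript):
    """Align the spoken transcript against the read-speech template and flag
    insertions, deletions, and substitutions as candidate errors/disfluencies."""
    ref = reference_text.lower().split()
    hyp = asr_transcript.lower().split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp)
    deviations = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            deviations.append((op, " ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return deviations

# A repetition ("the the") shows up as an insertion, a misread word as a replace
print(flag_reading_deviations("the quick brown fox", "the the quick brown box"))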

Learning Paralinguistic Attributes from Audiobooks with Voice Conversion
Zakaria Aldeneh, Matthew Perez, Emily Mower Provost
NAACL, 2021
Paper

Paralinguistic tasks, specifically speech emotion recognition, have limited access to large datasets with accurate labels, which makes it difficult to train models that capture paralinguistic attributes via supervised learning. In this work, we propose the Expressive Voice Conversion Autoencoder (EVoCA), a framework for capturing paralinguistic (e.g., emotion) attributes from large-scale (i.e., 200 hours of) audio-textual data without requiring manual emotion annotations. The proposed network utilizes the conversion between synthesized (neutral) speech and real (expressive) speech in order to learn what makes speech expressive in an unsupervised manner.
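
A toy sketch in the spirit of that framework appears below: an autoencoder converts neutral synthesized frames toward the paired expressive recording through a small style bottleneck, so the bottleneck must carry whatever makes the speech expressive. The GRU layers, feature dimensions, and loss are illustrative assumptions, not EVoCA's actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressiveConversionAE(nn.Module):
    """Convert neutral (synthesized) features toward the matching expressive
    (real) recording; the small style vector acts as a paralinguistic embedding."""
    def __init__(self, feat_dim=80, style_dim=16):
        super().__init__()
        self.style_encoder = nn.GRU(feat_dim, style_dim, batch_first=True)
        self.decoder = nn.GRU(feat_dim + style_dim, feat_dim, batch_first=True)

    def forward(self, neutral, expressive):
        _, style = self.style_encoder(expressive)  # (1, batch, style_dim)
        style = style.transpose(0, 1).expand(-1, neutral.size(1), -1)
        recon, _ = self.decoder(torch.cat([neutral, style], dim=-1))
        return recon, style

model = ExpressiveConversionAE()
neutral = torch.randn(4, 100, 80)     # synthesized (neutral) mel frames
expressive = torch.randn(4, 100, 80)  # real (expressive) mel frames
recon, style = model(neutral, expressive)
loss = F.mse_loss(recon, expressive)  # reconstruction drives the style bottleneck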

Aphasic Speech Recognition using a Mixture of Speech Intelligibility Experts
Matthew Perez, Zakaria Aldeneh, Emily Mower Provost
INTERSPEECH, 2020
Paper

Automatic speech recognition (ASR) is a key component of automatic aphasic speech analysis. However, the current approach of using a standard, one-size-fits-all ASR model may be sub-optimal due to the wide range of speech intelligibility that exists both within and between speakers. This work investigates how speech intelligibility can be estimated using a neural network and how intelligibility variability can be addressed within an acoustic model architecture using a mixture of experts. Our results show that this style of modeling leads to significant phone recognition improvements compared to a traditional, one-size-fits-all model.
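
A minimal sketch of the mixture-of-experts idea is below: a gating network (standing in for an intelligibility estimator) weights the phone posteriors of several expert acoustic models. The two experts, utterance-level gating, and layer sizes are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class IntelligibilityMoE(nn.Module):
    """Each expert covers a band of speech intelligibility; a gating network
    weights their per-frame phone logits for the final prediction."""
    def __init__(self, feat_dim=40, num_phones=42, num_experts=2, hidden=256):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, num_phones))
            for _ in range(num_experts)])
        self.gate = nn.Sequential(nn.Linear(feat_dim, num_experts), nn.Softmax(dim=-1))

    def forward(self, feats):                   # feats: (batch, time, feat_dim)
        weights = self.gate(feats.mean(dim=1))  # utterance-level expert weights
        posts = torch.stack([e(feats) for e in self.experts], dim=-1)
        return (posts * weights[:, None, None, :]).sum(dim=-1)

model = IntelligibilityMoE()
phone_logits = model(torch.randn(8, 200, 40))   # -> (8, 200, 42)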

Classification of Huntington Disease using Acoustic and Lexical Features
Matthew Perez, Wenyu Jin, Duc Le, Noelle Carlozzi, Praveen Dayalu, Angela Roberts, Emily Mower Provost
INTERSPEECH, 2018
Paper

This work presents a pipeline for an automatic, end-to-end classification system using speech as the primary input for predicting Huntington Disease. We explore using transcript-based features to capture speech characteristics of interest and use methods such as k-Nearest Neighbors (with Euclidean and dynamic time warping distances) as well as more modern neural network approaches for classification.
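
As a small illustration of the k-Nearest Neighbors side of that comparison, the sketch below classifies an utterance by majority vote over its DTW-nearest training sequences. The pure-NumPy DTW, Euclidean frame distance, and k value are illustrative choices, not the paper's exact configuration.

import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences of shape (T, D)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def knn_predict(query, train_seqs, train_labels, k=3):
    """Label a held-out utterance by majority vote over its k DTW-nearest neighbors."""
    dists = [dtw_distance(query, seq) for seq in train_seqs]
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)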

Portable mTBI Assessment Using Temporal and Frequency Analysis of Speech
Louis Daudet, Nikhil Yadav, Matthew Perez, Christian Poellabauer, Sandra Schneider, Alan Huebner
IEEE Journal of Biomedical and Health Informatics, 2016
Paper

This work investigates the use of mobile devices to extract and analyze various acoustic features for detecting mild traumatic brain injury (mTBI). Our results suggest a strong correlation between certain temporal and frequency features and the likelihood of a concussion.
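
The sketch below gives a feel for the kind of temporal and frequency measures involved: a pause ratio and pitch variability extracted with librosa, correlated against an outcome score with SciPy. The specific features, thresholds, and Pearson test are illustrative assumptions, not the paper's feature set.

import numpy as np
import librosa
from scipy.stats import pearsonr

def temporal_frequency_features(wav_path):
    """Two illustrative measures: pause ratio (temporal) and F0 variability (frequency)."""
    y, sr = librosa.load(wav_path, sr=16000)
    intervals = librosa.effects.split(y, top_db=30)  # active (non-silent) regions
    speech_dur = sum(end - start for start, end in intervals) / sr
    pause_ratio = 1.0 - speech_dur / (len(y) / sr)
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
    return pause_ratio, np.nanstd(f0)

# Hypothetical usage: correlate each feature with a concussion score across recordings
# feats = np.array([temporal_frequency_features(p) for p in wav_paths])
# r, p = pearsonr(feats[:, 0], concussion_scores)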
