Predicting Emotional Valence from Audio with Deep Learning - 09/2024
Project Overview
This project explored the application of deep learning techniques to the task of predicting emotional valence from audio signals. Emotional valence represents the positive or negative quality of an emotional state and is commonly used in affective computing and emotion-aware systems. Our goal was to design a neural network capable of estimating continuous valence scores from audio inputs represented as Mel spectrograms, a time-frequency representation commonly used in audio analysis.
Developed as part of an advanced deep learning course, the project emphasized both model design and hyperparameter optimization, allowing us to integrate practical experience with theoretical concepts. The result was a regression model that generalized well to unseen data and demonstrated the potential for real-world applications in emotion recognition from speech.
Problem Statement
Emotion recognition from speech plays a vital role in areas such as human-computer interaction, mental health monitoring, and adaptive multimedia systems. In this project, we focused on building a model that could predict a continuous valence score from audio features extracted from a provided dataset.
The dataset, supplied in .pkl format, contained serialized audio recordings and associated emotional valence labels. Our task involved converting the raw audio into a learnable feature representation, developing a predictive model, and optimizing it for low error and strong generalization.
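Loading a serialized dataset of this kind is straightforward with Python's pickle module. The sketch below is illustrative only; the file name and the field names (audio, valence) are assumptions, since the exact structure of the course-provided file is not recorded here.

    import pickle

    # Load the serialized dataset; file name and field layout are
    # illustrative assumptions about the course-provided data.
    with open("valence_dataset.pkl", "rb") as f:
        data = pickle.load(f)

    audio_clips = data["audio"]       # e.g. list of 1-D waveform arrays
    valence_labels = data["valence"]  # continuous valence score per clip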
Data Preparation
The preprocessing pipeline began with loading the dataset locally and converting each audio file into a Mel spectrogram. These spectrograms represent audio in terms of frequency and time, capturing rich acoustic features suitable for machine learning models.
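Mel spectrogram extraction of this kind is commonly done with librosa. The sketch below assumes 1-D waveform arrays at a known sampling rate and applies decibel scaling, a typical preprocessing choice; the actual parameters used in the project are not recorded here.

    import librosa
    import numpy as np

    def to_mel_spectrogram(waveform, sr=22050, n_mels=128):
        """Convert a 1-D waveform into a log-scaled Mel spectrogram."""
        mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
        # Decibel scaling compresses the dynamic range, which tends to
        # help neural networks train on audio features.
        return librosa.power_to_db(mel, ref=np.max)

    features = [to_mel_spectrogram(clip) for clip in audio_clips]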
Each spectrogram was then normalized to ensure scale invariance and facilitate stable convergence during training. The full dataset was split into training (95%) and validation (5%) subsets using train_test_split. Corresponding valence labels were paired with each feature vector, forming the basis for supervised regression modeling.
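A minimal sketch of the normalization and split, using scikit-learn's train_test_split with the 95/5 ratio described above. The flattening step and the global standardization are assumptions: a fully connected network needs fixed-length input vectors, and the report does not specify the exact normalization scheme.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Flatten each spectrogram to a fixed-length vector for the
    # fully connected network (assumes equal-length clips).
    X = np.stack([f.flatten() for f in features])
    y = np.asarray(valence_labels, dtype=np.float32)

    # Standardize features so training converges stably.
    X = (X - X.mean()) / (X.std() + 1e-8)

    # 95% / 5% train/validation split, as described above.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.05, random_state=42
    )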
This structured preparation ensured that the model received consistently scaled inputs and could be reliably evaluated on the held-out validation set.
Model Architecture
The model architecture was a fully connected feedforward neural network designed to balance expressive power with regularization. It consisted of three hidden layers followed by a single-unit output layer for continuous score prediction.
Hidden Layer 1: 256 units, ReLU activation, followed by batch normalization and dropout
Hidden Layer 2: 128 units, ReLU activation, batch normalization, dropout
Hidden Layer 3: 64 units, ReLU activation, batch normalization, dropout
Output Layer: 1 unit with linear activation for regression
This design was chosen to ensure the model could capture non-linear patterns while reducing the risk of overfitting. ReLU activations enabled the model to learn complex relationships, batch normalization accelerated training and stabilized learning, and dropout improved generalization by randomly deactivating neurons during training.
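The report does not state which framework was used; as one plausible realization, a Keras version of the architecture described above might look like the following. The dropout rate is an assumed value, not taken from the report.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_model(input_dim, dropout_rate=0.3):
        """Three regularized hidden layers plus a linear output unit."""
        return tf.keras.Sequential([
            layers.Input(shape=(input_dim,)),
            layers.Dense(256, activation="relu"),
            layers.BatchNormalization(),
            layers.Dropout(dropout_rate),
            layers.Dense(128, activation="relu"),
            layers.BatchNormalization(),
            layers.Dropout(dropout_rate),
            layers.Dense(64, activation="relu"),
            layers.BatchNormalization(),
            layers.Dropout(dropout_rate),
            layers.Dense(1, activation="linear"),  # continuous valence score
        ])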
Hyperparameter Optimization
To tune the model, we implemented Bayesian optimization, a sample-efficient method for hyperparameter search. The objective was to minimize the mean squared error (MSE) on the validation set by adjusting two key hyperparameters:
Learning rate: Explored between 0.0001 and 0.005
Batch size: Explored between 16 and 128
The optimizer began with two random initial configurations and performed five additional search iterations. In each iteration, the model was trained and evaluated on the validation set, and the resulting MSE was used as the objective function to be minimized.
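This procedure maps naturally onto the bayesian-optimization Python package, where two random initial configurations and five guided iterations correspond to init_points=2 and n_iter=5. The sketch below is one way to reproduce it under that assumption; the package maximizes its objective, so the validation MSE is negated, and the epoch count is illustrative.

    from bayes_opt import BayesianOptimization

    def objective(learning_rate, batch_size):
        """Train once and return the negated validation MSE."""
        model = build_model(input_dim=X_train.shape[1])
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate),
            loss="mse",
        )
        model.fit(X_train, y_train,
                  batch_size=int(round(batch_size)),
                  epochs=30, verbose=0)
        val_mse = model.evaluate(X_val, y_val, verbose=0)
        return -val_mse  # the library maximizes, so negate the error

    search = BayesianOptimization(
        f=objective,
        pbounds={"learning_rate": (0.0001, 0.005),
                 "batch_size": (16, 128)},
        random_state=42,
    )
    # Two random initial configurations, then five guided iterations.
    search.maximize(init_points=2, n_iter=5)
    print(search.max)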
The best configuration discovered was:
Batch Size: 128
Learning Rate: 0.0015
This setup yielded the most promising performance in terms of both convergence and validation error.
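With those values fixed, the final model can be retrained on the full training split. A minimal sketch, continuing the Keras assumption above (the epoch count is again assumed):

    # Retrain with the best configuration found by the search.
    best = build_model(input_dim=X_train.shape[1])
    best.compile(optimizer=tf.keras.optimizers.Adam(0.0015), loss="mse")
    history = best.fit(X_train, y_train,
                       validation_data=(X_val, y_val),
                       batch_size=128, epochs=50, verbose=1)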
Results and Evaluation
With the optimized hyperparameters, the model achieved a training MSE of 0.8713 and a validation MSE of 0.5068. The relatively low validation error indicated strong generalization capability and minimal overfitting; a validation loss below the training loss is expected here, since dropout and batch normalization perturb the network during training but are disabled at evaluation time. Epoch-wise loss curves showed consistent convergence, with no major divergence between training and validation loss, supporting the effectiveness of our regularization strategies.
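The epoch-wise curves referred to above can be reproduced from the Keras History object returned by the training call sketched earlier:

    import matplotlib.pyplot as plt

    # Compare training and validation loss across epochs.
    plt.plot(history.history["loss"], label="training MSE")
    plt.plot(history.history["val_loss"], label="validation MSE")
    plt.xlabel("epoch")
    plt.ylabel("mean squared error")
    plt.legend()
    plt.show()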
These results confirmed that the chosen architecture and optimization approach were suitable for the regression task, successfully capturing the relationship between acoustic features and emotional valence.
Conclusion
This project demonstrated the effectiveness of a regularized deep neural network for predicting emotional valence from Mel spectrograms. The combination of thoughtful architecture design and Bayesian hyperparameter tuning resulted in a model that generalized well and minimized prediction error on unseen data.
Key takeaways include:
Deep learning models can effectively model complex, non-linear relationships in audio-based emotion data.
Bayesian optimization is a powerful tool for tuning model hyperparameters in a structured and efficient way.
Regularization techniques such as batch normalization and dropout are essential in preventing overfitting, especially with limited data.