This project applies SimCLR (A Simple Framework for Contrastive Learning of Visual Representations) to urban sound classification on the UrbanSound8K dataset. The goal is to accurately classify urban sounds such as sirens and car horns using self-supervised deep learning.
The dataset used for this project is UrbanSound8K, which contains 8,732 labeled audio excerpts (each up to 4 seconds long) drawn from 10 urban sound classes. For this project, each clip is converted into a Mel-spectrogram image, a visual representation of the audio that the SimCLR model can process.
The conversion of audio to Mel-spectrogram images is performed using code available in this GitHub repository: UrbanSound8k-MelSpectrogram. This step is crucial for preparing the dataset in a format that our model can process.
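The exact settings used by that repository are not documented here; the following is a minimal sketch of such a conversion using librosa, where `n_mels`, the sample rate, and the output image size are illustrative assumptions:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def wav_to_melspectrogram_image(wav_path, out_path, n_mels=128):
    """Convert one UrbanSound8K clip to a Mel-spectrogram image.

    n_mels, the sample rate, and the figure size are illustrative
    choices, not the settings of the UrbanSound8k-MelSpectrogram repo.
    """
    y, sr = librosa.load(wav_path, sr=22050)       # resample to a fixed rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale for visibility
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)  # ~224x224 px
    librosa.display.specshow(mel_db, sr=sr, ax=ax)
    ax.set_axis_off()                              # image only, no axes
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```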
- SimCLR Framework: A self-supervised model that learns representations from the spectrogram images without using labels.
- Classifier: A neural network that classifies sounds based on the representations learned by SimCLR (a model sketch follows this list).
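As a rough illustration of this two-part setup, here is a minimal PyTorch sketch. The ResNet-18 backbone, the projection-head width, and the linear classifier shape are assumptions chosen for the example, not details confirmed by the project:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SimCLRModel(nn.Module):
    """Encoder + projection head, following the SimCLR recipe."""
    def __init__(self, proj_dim=128):
        super().__init__()
        backbone = resnet18(weights=None)    # backbone choice is assumed
        feat_dim = backbone.fc.in_features   # 512 for ResNet-18
        backbone.fc = nn.Identity()          # keep only the feature extractor
        self.encoder = backbone
        self.projector = nn.Sequential(      # MLP projection head
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.encoder(x)    # representation used later for classification
        z = self.projector(h)  # embedding fed to the contrastive loss
        return h, z

# Downstream classifier trained on the encoder's representations;
# UrbanSound8K has 10 classes.
classifier = nn.Linear(512, 10)
```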
- Epochs: 15
- Number of Folds: 10 (cross-validation)
- Batch Size: 32
- Learning Rate: 0.001
- Weight Decay: 1e-6
- Optimizer: Adam
- Loss Function: NTXentLoss (contrastive loss; see the training-step sketch after this list)
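To make this configuration concrete, below is a minimal PyTorch sketch of one contrastive training step using the hyperparameters listed above. The hand-rolled NT-Xent implementation, the temperature value, and the placeholder data loader are assumptions for illustration; the project may instead use a library class such as `NTXentLoss` from pytorch-metric-learning:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent over a batch of positive pairs (z1[i], z2[i]).
    The temperature value is an assumed default."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2n, d), unit norm
    sim = z @ z.t() / temperature                       # scaled cosine sims
    sim.fill_diagonal_(float("-inf"))                   # exclude self-pairs
    # Row i's positive is the other view of the same clip, n rows away.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

model = SimCLRModel()  # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)

# Placeholder loader: in practice it yields two augmented views of each
# spectrogram; random tensors stand in here so the sketch runs as-is.
train_loader = [(torch.randn(32, 3, 224, 224), torch.randn(32, 3, 224, 224))]

for epoch in range(15):
    for x1, x2 in train_loader:
        _, z1 = model(x1)
        _, z2 = model(x2)
        loss = nt_xent_loss(z1, z2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```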
The model was trained across multiple folds, showing consistent improvement in accuracy. Here are some highlights:
- Validation Accuracy: ranges from roughly 65% to 81%, depending on the epoch and fold.
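For reference, a 10-fold evaluation over UrbanSound8K's predefined folds typically follows the pattern sketched below. The `UrbanSound8K.csv` metadata file and its `fold` column ship with the dataset, while `train_on` and `evaluate_on` are hypothetical placeholders for the SimCLR pretraining and classifier pipeline described above:

```python
import pandas as pd

meta = pd.read_csv("UrbanSound8K.csv")  # per-clip metadata, including fold

fold_accuracies = []
for fold in range(1, 11):               # the dataset ships with folds 1-10
    val_files = meta[meta["fold"] == fold]
    train_files = meta[meta["fold"] != fold]
    # Hypothetical helpers standing in for the full training pipeline.
    model = train_on(train_files)
    fold_accuracies.append(evaluate_on(model, val_files))

print(sum(fold_accuracies) / len(fold_accuracies))  # mean cross-fold accuracy
```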
This implementation demonstrates that SimCLR transfers effectively to a non-traditional domain such as urban sound classification. The model achieves promising results, illustrating the potential of self-supervised learning for audio processing.