FS1-EcoAcousticAlarmDetection is a few-shot learning model that classifies ecological audio recordings into three categories: alarm, non-alarm, and background. The model first converts MP3 or WAV files into Mel spectrograms and, for each episode, randomly splits the samples into a support set (5 samples per class), a query set (6 samples per class), and a test set (30 samples per class). An episodic batch sampler generates 100 training episodes. A CNN encoder with four convolutional blocks extracts embeddings from the spectrograms and is trained with the Adam optimizer and cross-entropy loss. A Prototypical Network uses these embeddings to compute class prototypes from the support set and compares them with query embeddings by Euclidean distance, converting the distances into log-probabilities for classification. A Relation Network made of fully connected layers (256 -> 128 -> 64 -> 1) takes the concatenated embeddings of each query-prototype pair and computes a similarity score; it is trained with mean squared error (MSE) loss. During evaluation, the model processes the test set over 100 episodes, extracts embeddings, and produces final predictions from a weighted combination of prototypical probabilities (60%) and relation similarities (40%).
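The prototype step and the 60/40 score combination described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the model's actual code: prototypes are taken as the mean of the support embeddings (the usual Prototypical Network choice), the relation scores are assumed to lie in [0, 1], and all function names and the toy 2-D embeddings are hypothetical.

```python
import math

def class_prototypes(support):
    """Mean embedding per class (assumed); `support` maps label -> list of vectors."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return protos

def proto_log_probs(query, protos):
    """Log-softmax over negative Euclidean distances from the query to each prototype."""
    logits = {label: -math.dist(query, p) for label, p in protos.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {label: v - log_z for label, v in logits.items()}

def combine(log_probs, relation_scores, w_proto=0.6, w_rel=0.4):
    """Weighted sum of prototypical probabilities and relation similarities."""
    return {
        label: w_proto * math.exp(lp) + w_rel * relation_scores[label]
        for label, lp in log_probs.items()
    }

def predict(scores):
    """Return the label with the highest combined score."""
    return max(scores, key=scores.get)

# Toy 2-D embeddings standing in for CNN encoder outputs (hypothetical values).
support = {
    "alarm": [[1.0, 0.0], [3.0, 0.0]],
    "non-alarm": [[0.0, 5.0], [0.0, 7.0]],
    "background": [[-4.0, -4.0], [-6.0, -6.0]],
}
# Stand-in for Relation Network outputs (hypothetical values in [0, 1]).
relation_scores = {"alarm": 0.9, "non-alarm": 0.2, "background": 0.1}

protos = class_prototypes(support)
log_probs = proto_log_probs([2.0, 1.0], protos)
print(predict(combine(log_probs, relation_scores)))
```

In a real pipeline the embeddings would come from the CNN encoder and the relation scores from the Relation Network; only the combination weights (60% and 40%) are taken from the description.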
The model achieves 95% accuracy on a test set of 30 samples per class, evaluated over 100 episodes.
Compared to FSL2, this model uses Euclidean distance rather than cosine distance with temperature scaling to compare query embeddings with class prototypes. It flattens both the frequency and time dimensions, unlike FSL2, which preserves temporal structure by applying a pooling layer that compresses the frequency dimension into four representative bins while keeping the time axis. It does not use an attention mechanism.
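The structural difference can be illustrated with a small sketch that contrasts flattening a feature map with pooling only its frequency axis into four bins. The pooling type is not stated in the description, so average pooling is assumed here; the function names and the toy 8 x 3 feature map are hypothetical.

```python
def pool_frequency(feature_map, n_bins=4):
    """Average-pool (assumed) the frequency axis of a [freq][time] map into n_bins rows.

    The time axis is left untouched, so the result has shape [n_bins][time].
    """
    n_freq = len(feature_map)
    n_time = len(feature_map[0])
    pooled = []
    for b in range(n_bins):
        start = (b * n_freq) // n_bins
        end = max(((b + 1) * n_freq) // n_bins, start + 1)  # at least one row per bin
        rows = feature_map[start:end]
        pooled.append([sum(r[t] for r in rows) / len(rows) for t in range(n_time)])
    return pooled

def flatten(feature_map):
    """Collapse frequency and time into a single vector, discarding the 2-D layout."""
    return [value for row in feature_map for value in row]

# Toy feature map: 8 frequency rows x 3 time steps (hypothetical values).
feature_map = [[float(f * 10 + t) for t in range(3)] for f in range(8)]

pooled = pool_frequency(feature_map)   # 4 frequency bins x 3 time steps
flat = flatten(feature_map)            # one vector of 24 values
print(len(pooled), len(pooled[0]), len(flat))
```

Flattening yields one fixed vector per recording, while the pooled map still has one column per time step that a later layer could process in order.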
Compared to FSL3, this model uses Euclidean distance rather than cosine distance with temperature scaling (linear decay) to compare query embeddings with class prototypes. It flattens both the frequency and time dimensions, unlike FSL3, which preserves temporal structure by applying a pooling layer that compresses the frequency dimension into four representative bins while keeping the time axis. It does not use an attention mechanism.
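For contrast with the Euclidean comparison used in this model, the sketch below shows one plausible form of cosine distance with a linearly decayed temperature. The description does not give the schedule, so the start and end temperatures, the per-episode decay, and dividing the distance by the temperature are all assumptions, and every name is hypothetical.

```python
import math

def linear_temperature(episode, total_episodes, t_start=1.0, t_end=0.1):
    """Linearly interpolate the temperature from t_start to t_end (assumed values)."""
    if total_episodes <= 1:
        return t_end
    frac = min(max(episode / (total_episodes - 1), 0.0), 1.0)
    return t_start + frac * (t_end - t_start)

def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity; eps guards against zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(norm, eps)

def cosine_log_probs(query, protos, temperature):
    """Log-softmax over negative cosine distances divided by the temperature."""
    logits = {label: -cosine_distance(query, p) / temperature for label, p in protos.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {label: v - log_z for label, v in logits.items()}

# Toy prototypes and query (hypothetical values).
protos = {"alarm": [1.0, 0.0], "background": [0.0, 1.0]}
query = [1.0, 0.2]

for episode in (0, 99):
    temp = linear_temperature(episode, total_episodes=100)
    probs = {k: math.exp(v) for k, v in cosine_log_probs(query, protos, temp).items()}
    print(episode, round(temp, 3), probs)
```

A lower temperature sharpens the distribution over classes, so under this assumed schedule predictions become more confident as the temperature decays.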