Abstract (EN):
Acoustic monitoring of road traffic events is an indispensable element of Intelligent Transport Systems, increasing their effectiveness. It aims to detect the temporal activity of sound events in road traffic auditory scenes and classify their occurrences. Current state-of-the-art algorithms are limited in capturing long-range dependencies between different audio features, which hinders robust performance. Additionally, these models are sensitive to external noise and variations in audio intensity. This study therefore proposes a spectrogram-specific transformer model that combines a multi-head attention mechanism, using softmax-based scaled dot-product attention, with Temporal Convolutional Networks to overcome these difficulties with increased accuracy and robustness. It also proposes a unique preprocessing step and a Deep Linear Projection method that reduces the dimensionality of the features before they are passed to a learnable Positional Encoding layer. Rather than monophonic audio samples, stereophonic Mel-spectrogram features are fed into the model, improving its robustness to noise. State-of-the-art one-dimensional Convolutional Neural Network and Long Short-Term Memory models were used as baselines to compare the proposed model's performance on two well-known datasets. The results demonstrated its superior performance, achieving an improvement in accuracy of 1.51 to 3.55% over the studied baselines.
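The abstract names softmax-based scaled dot-product attention as the core of the model's multi-head attention mechanism. A minimal sketch of that standard mechanism (the textbook formulation, not the authors' implementation; the function name and shapes are illustrative assumptions) could look like:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Softmax-based scaled dot-product attention (single head).

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    Returns the attended values, shape (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    # Similarity scores between queries and keys, scaled by sqrt(d_k)
    # to keep softmax inputs in a numerically stable range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weight-averaged combination of the value rows.
    return weights @ V
```

In a multi-head setting, this operation is applied in parallel to several learned linear projections of the input (e.g., of the Mel-spectrogram feature sequence), and the per-head outputs are concatenated.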
Language:
English
Type (Professor's evaluation):
Scientific
No. of pages:
14