Sound effects and visualization
Artikulate is an application that focuses on improving the user's pronunciation skills: the learner listens to a native speaker recording, records their own attempt, and compares the two. By repeating these trials, a learner can continuously improve their language skills. But when a user records their own voice through a microphone, the recorded audio invariably contains noise, which makes it harder to analyze how well the user is doing.
So we would like to have a noise filter implemented within Artikulate, and also a way of visually representing how two recorded audio waves differ.
For noise removal, we need a profile of the background noise before we can filter. This is easy to support in the application interface: ask the user to wait a couple of seconds after pressing the record button before speaking. The first couple of seconds of the recording then contain only background noise, and the noise profile can be taken from them.
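As a minimal sketch of that idea, the recording can be split into the silent lead-in and the rest, and the lead-in used to estimate a noise floor. The function names, the 2-second default, and the use of an RMS level as the "profile" are assumptions made for illustration; a real noise profile would more likely be a spectrum than a single number.

```python
import math

def split_lead_in(samples, sample_rate, lead_in_seconds=2.0):
    """Split a recording into (noise-only lead-in, remaining audio).

    lead_in_seconds is an assumed value; the text only says "a couple of seconds".
    """
    count = min(len(samples), int(lead_in_seconds * sample_rate))
    return samples[:count], samples[count:]

def rms(samples):
    """Root-mean-square level of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# Toy recording at 4 samples per second: 2 s of faint noise, then louder "speech".
recording = [0.01, -0.01] * 4 + [0.5, -0.5] * 4
noise_part, speech_part = split_lead_in(recording, sample_rate=4)
noise_floor = rms(noise_part)
```

The resulting noise_floor could, for example, serve as a threshold below which the signal is treated as non-speech.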
Noise removal is a widely researched field of computer science, and there are numerous works on cleaning background noise from speech signals alone. The following are very brief summaries of two such algorithms.
Suppression of noise using spectral subtraction
Spectral subtraction offers a computationally efficient, processor-independent approach to effective digital speech analysis. The method, requiring about the same computation as high-speed convolution, suppresses stationary noise from speech by subtracting the spectral noise bias calculated during non-speech activity. Secondary procedures are then applied to attenuate the residual noise left after subtraction. A noise-suppressed spectral estimator is used. The estimator is obtained by subtracting an estimate of the noise spectrum from the noisy speech spectrum. Spectral information required to describe the noise spectrum is obtained from the signal measured during non-speech activity. After developing the spectral estimator, the spectral error is computed.
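The core subtraction step can be sketched for a single frame as follows. This is an illustrative sketch only: it uses a small pure-Python radix-2 FFT (frame lengths must be powers of two), keeps the noisy phase, and clamps negative magnitudes to zero. The secondary residual-noise procedures mentioned above are not included, and all function names are hypothetical.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("frame length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(spectrum):
    """Inverse FFT via the conjugate trick."""
    n = len(spectrum)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spectrum])]

def average_noise_magnitude(noise_frames):
    """Average magnitude spectrum over frames recorded during non-speech activity."""
    acc = [0.0] * len(noise_frames[0])
    for frame in noise_frames:
        for k, value in enumerate(fft(frame)):
            acc[k] += abs(value)
    return [a / len(noise_frames) for a in acc]

def spectral_subtract(frame, noise_mag):
    """Subtract the noise magnitude per bin, keeping the phase of the noisy frame."""
    cleaned = []
    for value, noise in zip(fft(frame), noise_mag):
        mag = abs(value)
        new_mag = max(mag - noise, 0.0)  # clamp: magnitudes cannot be negative
        cleaned.append(value * (new_mag / mag) if mag > 0 else 0j)
    return [v.real for v in ifft(cleaned)]

# Toy example: a "speech" tone in bin 1 plus a stationary "noise" tone in bin 3.
N = 16
speech = [math.sin(2 * math.pi * 1 * i / N) for i in range(N)]
noise = [0.1 * math.sin(2 * math.pi * 3 * i / N) for i in range(N)]
noisy = [s + v for s, v in zip(speech, noise)]
noise_mag = average_noise_magnitude([noise])
restored = spectral_subtract(noisy, noise_mag)
```

In this toy case the noise is perfectly stationary, so the restored frame matches the clean tone almost exactly; with real recordings the noise estimate is only an average, which is why residual noise remains and the secondary procedures are needed.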
Adaptive noise cancelling
The least mean-square (LMS) adaptive filtering approach aims to remove the deleterious effects of additive noise on the speech signal. It takes advantage of the quasi-periodic nature of speech to form an estimate of the clean speech signal at time t from the value of the signal at time t minus the estimated pitch period. For additive white noise distortion, preliminary tests indicate that the method improves the perceived speech quality and increases the signal-to-noise ratio (SNR) by 7 dB in a 0 dB environment. The method has also been shown to partially remove the perceived granularity of CVSD coded speech signals.
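A rough sketch of this idea is shown below: an LMS filter predicts the current sample from samples one pitch period earlier, and the prediction is taken as the clean-speech estimate. The pitch period is assumed to be known and constant here (estimating it is a separate problem), and the tap count, step size mu, and function name are illustrative choices, not values from the literature.

```python
import math

def lms_pitch_enhance(noisy, pitch_period, taps=4, mu=0.05):
    """Estimate clean speech by LMS prediction from pitch-delayed samples.

    pitch_period is in samples and is assumed known and constant.
    """
    weights = [0.0] * taps
    output = []
    for n in range(len(noisy)):
        # Reference vector: the samples located one pitch period in the past.
        ref = []
        for k in range(taps):
            idx = n - pitch_period - k
            ref.append(noisy[idx] if idx >= 0 else 0.0)
        estimate = sum(w * r for w, r in zip(weights, ref))
        error = noisy[n] - estimate
        # Standard LMS weight update, w <- w + 2 * mu * error * ref.
        weights = [w + 2.0 * mu * error * r for w, r in zip(weights, ref)]
        output.append(estimate)
    return output

# Noise-free sanity check: a perfectly periodic signal should become predictable.
period = 20
clean = [math.sin(2 * math.pi * i / period) for i in range(2000)]
estimate = lms_pitch_enhance(clean, pitch_period=period)
```

Because additive white noise is uncorrelated from one pitch period to the next while voiced speech is not, the filter can only learn to predict the speech component. The step size has to be small relative to the signal power for the update to remain stable.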
Several other algorithms exist that address this problem or improve on these solutions. The ideal approach would be to implement the candidate algorithms as separate modules, test them, and then use whichever gives the best output.
Utilize PulseAudio Noise Filtering
There is (experimental) work in PulseAudio to support noise filtering, e.g. echo cancellation. It could be possible to build on this work and specifically work on better PulseAudio integration.
- research paper: Echo Reduction in PulseAudio
(To use it: in pacmd, run "load-module module-echo-cancel"; the new input and output devices then appear in pavucontrol. Assign the playback streams to "<hw device name> (echo cancelled with <hw device name>)", and do the equivalent for the recording stream.)
Sound processing library
The Synthesis Toolkit (STK) is an open source API for real-time audio synthesis, with an emphasis on classes that facilitate the development of physical modelling synthesizers. It is written in C++ and maintained by Perry Cook at Princeton University and Gary Scavone at CCRMA. It contains both low-level synthesis and signal processing classes (oscillators, filters, etc.) and higher-level instrument classes, which contain examples of most of the physical modelling algorithms in use today.
One way to visualize sound is to plot its spectrum. The plot can be computed with the Fast Fourier Transform (FFT), which yields a value for each narrow band of frequencies that represents how much of those frequencies is present. The values are then interpolated to create a graph of frequency (horizontal scale, in Hz) against amplitude (vertical scale, in dB).
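The data behind such a plot can be sketched as a list of (frequency in Hz, amplitude in dB) points for one frame of audio. The sketch below uses a small pure-Python radix-2 FFT, so the frame length must be a power of two; the function names, the amplitude normalization, and the eps floor that avoids taking the logarithm of zero are assumptions for illustration. Drawing and interpolating the points is left to the plotting layer.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("frame length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def spectrum_db(frame, sample_rate, eps=1e-12):
    """Return (frequency_hz, amplitude_db) pairs for bins 0 .. n/2."""
    n = len(frame)
    spectrum = fft(frame)
    points = []
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        # Scale so that a sine of amplitude 1 centred on a bin reads 0 dB.
        scale = 1.0 if k in (0, n // 2) else 2.0
        amplitude = abs(spectrum[k]) * scale / n
        points.append((freq, 20.0 * math.log10(max(amplitude, eps))))
    return points

# Toy frame: a 1000 Hz tone sampled at 8000 Hz, 64 samples long.
rate, size = 8000, 64
tone = [math.sin(2 * math.pi * 1000 * i / rate) for i in range(size)]
points = spectrum_db(tone, rate)
peak_freq, peak_db = max(points, key=lambda p: p[1])
```

Computing the same points for the native speaker recording and for the learner's attempt would give two curves that can be drawn on the same axes for comparison.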
The major open question is which values represent differences in pronunciation well (in the way that pitch, for example, is a value that can be used to judge singing quality). Possibly related literature: