[..]
Is there a more robust and more efficient way to find how much time
separates 2 (or more) audio samples recorded at the same place, but with
different devices?
Note: to get a better idea of the spectrograms I'm working with, see:
http://tuxee.net/tmp/audiosync
The thing is that visually I can easily match them, but I do not see how
to translate that numerically, avoiding noise and other perturbations.
What you do is pretty robust although computationally heavy. Just a
couple of suggestions:
1. Compare not the spectrograms, but the time derivatives of the
spectrograms. That will cancel the static frequency skew.
Are you suggesting something like:
    spectogram = spectogram(2:end, :) - spectogram(1:end-1, :);
(using Matlab/Octave syntax) assuming the first dimension is the time
axis and the second dimension is the frequency axis? I've tested it, but
the result is less accurate. See:
http://tuxee.net/tmp/spect-deriv.png
(Here, the expected answer is around 1855.) I've slightly scaled one of
the graphs to match the other one. The red graph clearly indicates the
expected answer, but the green one does not. Maybe you were talking
about a different computation?
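To make suggestion 1 concrete, here is a minimal sketch in plain Python of one possible reading of it: difference the log spectrograms along the time axis, then search for the frame shift with the smallest mean absolute difference. The helper names (`time_derivative`, `best_offset`), the synthetic data, and the mean-absolute-difference cost are assumptions for illustration, not the method used in the thread. It only shows that a static per-frequency offset, which is what a fixed frequency skew looks like on a log scale, drops out of the differenced data. Differencing also emphasizes frame-to-frame noise, which may be one reason the result can look worse on real recordings.

```python
import random

def time_derivative(spec):
    """Frame-to-frame difference along the time axis (first dimension)."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(spec, spec[1:])]

def best_offset(ref, other, max_shift):
    """Return the shift s (in frames) that minimizes the mean absolute
    difference between ref[t + s] and other[t]."""
    best, best_cost = None, float("inf")
    for s in range(max_shift + 1):
        n = min(len(ref) - s, len(other))
        if n <= 0:
            break
        cost = sum(abs(x - y)
                   for t in range(n)
                   for x, y in zip(ref[t + s], other[t])) / n
        if cost < best_cost:
            best, best_cost = s, cost
    return best

# Synthetic log spectrograms (hypothetical data): 'other' is a 100-frame
# excerpt of 'ref' starting at frame 37, plus a constant offset per
# frequency bin that stands in for a static frequency skew.
random.seed(0)
frames, bins, true_shift = 200, 8, 37
ref = [[random.gauss(0, 1) for _ in range(bins)] for _ in range(frames)]
skew = [random.uniform(-5, 5) for _ in range(bins)]
other = [[v + k for v, k in zip(row, skew)]
         for row in ref[true_shift:true_shift + 100]]

# The per-bin constants cancel in the time difference, so the differenced
# spectrograms match (up to rounding) at the true shift.
found = best_offset(time_derivative(ref), time_derivative(other), 80)
print(found)
```

The search is brute force over every candidate shift, so its cost grows with the product of the number of shifts, frames, and bins, which matches the "computationally heavy" remark above.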
2. Normalize the spectrograms wrt the power of the signals. That will
make the similarity measure independent of the volume.
How do I compute the power of the signals? Actually, while both sounds
are recorded close together, one of the recorders can be more sensitive
to wind, or one can record sounds not heard by the other one (but note
that both recorders are less than 1 meter apart), so it might be hard to
find how to normalize them together.
For the spectrograms, I'm actually working in the log scale for the
amplitudes, so when comparing them by subtracting the components from
each other, this should cancel possible volume differences a bit, I
guess. (Better than if I kept the FFT output without applying log when
computing the spectrogram.)
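As a rough illustration of the power question, the mean power of a sampled signal is the average of its squared samples, and dividing by the square root of that value (the RMS) gives a unit-power signal. The sketch below assumes a single global gain between the two recordings; the names `power` and `normalize_power` and the test tone are hypothetical. It does not address sounds picked up by only one recorder, such as wind.

```python
import math

def power(samples):
    """Mean power: the average of the squared samples."""
    return sum(s * s for s in samples) / len(samples)

def normalize_power(samples):
    """Scale the samples so that their mean power is 1."""
    rms = math.sqrt(power(samples))
    return [s / rms for s in samples]

# Hypothetical example: the same tone recorded at two different gains.
quiet = [0.1 * math.sin(0.3 * n) for n in range(1000)]
loud = [4.0 * s for s in quiet]          # same sound, 4x the amplitude

a = normalize_power(quiet)
b = normalize_power(loud)
max_diff = max(abs(x - y) for x, y in zip(a, b))   # rounding error only

# On a log scale the same gain is a constant offset, here 20*log10(4) dB,
# which is why log spectrograms already absorb part of a volume difference.
offset_db = 10 * math.log10(power(loud) / power(quiet))
```

After normalization the two versions agree up to rounding. In the log domain the gain shows up as one constant added to every bin, so subtracting the mean of each log spectrogram would have a similar effect.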
It could be possible to make an adaptive filter to minimize the
difference between the audio streams. After the filter has converged,
it is simple enough to derive the time shift from the coefficients.
While usually the sounds will be a few seconds apart, I don't know if an
adaptive filter would work with a larger time difference. I'm not
familiar enough with such filters, though.
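For readers unfamiliar with adaptive filters, here is a minimal sketch of the idea using a normalized LMS (NLMS) update, which is one common choice and not necessarily the one meant above. The function name `nlms_delay`, the step size `mu`, the tap count, and the synthetic white-noise signals are all assumptions for illustration. The filter learns coefficients that turn one stream into the other, and the position of the largest coefficient is read as the delay in samples.

```python
import random

def nlms_delay(x, d, taps, mu=0.5, eps=1e-6):
    """Adapt an FIR filter w so that sum_k w[k] * x[n - k] tracks d[n],
    then return the index of the largest coefficient as the delay."""
    w = [0.0] * taps
    for n in range(taps - 1, len(x)):
        window = [x[n - k] for k in range(taps)]        # x[n], x[n-1], ...
        y = sum(wk * xk for wk, xk in zip(w, window))   # filter output
        e = d[n] - y                                    # prediction error
        norm = eps + sum(xk * xk for xk in window)      # input energy
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, window)]
    return max(range(taps), key=lambda k: abs(w[k]))

# Hypothetical test signals: d is x delayed by 5 samples, attenuated,
# with a little independent noise added.
random.seed(1)
true_delay = 5
x = [random.gauss(0, 1) for _ in range(3000)]
d = [0.0] * true_delay + [0.8 * v for v in x[:-true_delay]]
d = [v + 0.05 * random.gauss(0, 1) for v in d]

estimate = nlms_delay(x, d, taps=16)
print(estimate)
```

The filter can only represent delays shorter than its number of taps. A shift of 3 seconds at 44.1 kHz is 132300 samples, so a direct application to streams that are a few seconds apart would need a very long filter or a coarse alignment first.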
That would be much more accurate and less computationally demanding
than the spectrograms. However, it works only if the streams are
sufficiently correlated for the adaptive algorithm to work. In my
experience, the correlation between two different microphones is
rather low, especially if there is a lossy compression on the way.
I'm not using compression (both sounds are PCM; one is recorded at
44.1 kHz, mono, 16 bits, and the other one at 96 kHz, stereo, 24 bits,
but downsampled to the format of the first one before processing it),
but as you guessed, unfortunately one mic can be more sensitive to some
sounds (like wind, for example) and this can complicate the computation.
(See my first link at the top.)
Thanks for your help!
--
Frédéric Jolliton