
Syncing multiple related audio tracks...

Author Message
Frédéric Jolliton...
Posted: Sun Oct 25, 2009 12:59 am
Guest
Hi,

I'm looking for a method to automatically synchronize various audio
tracks, recorded at the same place, with different devices. This is
intended to work at post-processing time (not in realtime.)

Basically, I'm taking two audio tracks: one recorded by a camcorder,
with poor mic quality, and an extra one recorded at the same time with a
dedicated sound recorder, recording the same thing.

My naive approach is as follows: I compute a spectrogram for each sound
(using FFT), which gives me a 2D array per spectrogram, then I try
various shifts to find the best match between them.

To compare two spectrograms with a given shift, I take the overlapping
parts once shifted, then I compute the mean value of the absolute
difference between them, which I divide by the width of the overlap.
(Hope that makes sense. I'm lacking adequate terminology here.)

Then I keep the best (smallest) value found while trying the various shifts.
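
For readers who want to try this, here is a minimal NumPy sketch of the shift search described above. It is only an illustration under assumptions, not the poster's actual program: the function names, the frame size (1024), the hop (512) and the Hann window are all choices made for the example.

```python
import numpy as np

def log_spectrogram(signal, frame=1024, hop=512):
    """Log-magnitude spectrogram, shape (n_frames, frame // 2 + 1).
    Frame size, hop and window are assumptions for this sketch."""
    window = np.hanning(frame)
    n_frames = 1 + (len(signal) - frame) // hop
    frames = np.stack([signal[i * hop : i * hop + frame] * window
                       for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-9)

def best_shift(spec_a, spec_b, max_shift):
    """Return (shift, cost) where shift s minimizes the mean absolute
    difference between spec_a[t + s] and spec_b[t] over their overlap."""
    best, best_cost = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        a = spec_a[max(s, 0):]
        b = spec_b[max(-s, 0):]
        n = min(len(a), len(b))          # width of the overlap, in frames
        if n == 0:
            continue
        cost = np.mean(np.abs(a[:n] - b[:n]))
        if cost < best_cost:
            best, best_cost = s, cost
    return best, best_cost
```

The shift is expressed in frames, so the time offset is roughly `shift * hop / sample_rate` seconds; a smaller hop gives a finer answer at a higher cost.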

This method seems to work well for the few tests I made. When plotting the
value against the shift, I see a sharp peak toward 0 where the audio tracks match.

Is there a more robust and more efficient way to find how much time
separates 2 (or more) audio samples recorded at the same place, but with
different devices?

--
Frédéric Jolliton
 
Vladimir Vassilevsky...
Posted: Sun Oct 25, 2009 1:35 am
Guest
Frédéric Jolliton wrote:

Quote:
Is there a more robust and more efficient way to find how much time
separates 2 (or more) audio samples recorded at the same place, but with
different devices?

What you do is pretty robust although computationally heavy. Just a
couple of suggestions:

1. Compare not the spectrograms, but the time derivatives of the
spectrograms. That will cancel the static frequency skew.

2. Normalize the spectrograms wrt the power of the signals. That will
make the similarity measure independent of the volume.
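
One possible reading of the two suggestions above, as a NumPy sketch. The function names and the choice of "mean power over the whole spectrogram" as the normalizer are assumptions for illustration; the time axis is taken to be axis 0.

```python
import numpy as np

def time_derivative(spec):
    """Frame-to-frame difference along the time axis (axis 0).
    On a log-magnitude spectrogram, a fixed per-frequency gain is an
    additive constant in each column, so it cancels here."""
    return np.diff(spec, axis=0)

def normalize_power(mag_spec, eps=1e-12):
    """Scale a linear-magnitude spectrogram to unit mean power,
    so that an overall volume change no longer affects comparisons."""
    power = np.mean(mag_spec ** 2)
    return mag_spec / np.sqrt(power + eps)
```

The derivative also amplifies frame-to-frame noise, so whether it helps probably depends on the recordings.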


It could be possible to make an adaptive filter to minimize the
difference between the audio streams. After the filter has converged, it
is simple enough to derive the time shift from the coefficients. That
would be much more accurate and less computationally demanding than the
spectrograms. However, it works only if the streams are sufficiently
correlated for the adaptive algorithm to work. In my experience, the
correlation between two different microphones is rather low, especially
if there is lossy compression on the way.
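
A rough sketch of the adaptive-filter idea, using a normalized LMS (NLMS) update as one common choice; the post does not say which adaptive algorithm is meant, so NLMS, the tap count and the step size `mu` are assumptions. The delay is read off as the index of the largest coefficient.

```python
import numpy as np

def nlms_delay(x, d, n_taps=32, mu=0.5, eps=1e-8):
    """Adapt an FIR filter w so that the filtered x approximates d,
    then return (index of the largest tap, w). The index is the
    estimated delay of d relative to x, in samples."""
    w = np.zeros(n_taps)
    for n in range(n_taps - 1, len(x)):
        u = x[n - n_taps + 1 : n + 1][::-1]   # u[k] = x[n - k]
        e = d[n] - w @ u                       # prediction error
        w += mu * e * u / (u @ u + eps)        # normalized LMS update
    return int(np.argmax(np.abs(w))), w
```

The filter can only represent delays shorter than `n_taps` samples, so for tracks that are seconds apart a coarse alignment would be needed first.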


Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
 
Mark...
Posted: Mon Oct 26, 2009 5:43 pm
Guest
On Oct 25, 7:42 am, Frédéric Jolliton <comp.... at (no spam) frederic.jolliton.com>
wrote:
Quote:
[..]

Is there a more robust and more efficient way to find how much time
separates 2 (or more) audio samples recorded at the same place, but with
different devices?

Note: to get a better idea of the spectrograms I'm working with, see:

 http://tuxee.net/tmp/audiosync

The thing is that visually I can easily match them, but I do not see how
to translate that numerically, avoiding noise and other perturbations.

What you do is pretty robust although computationally heavy. Just a
couple of suggestions:

1. Compare not the spectrograms, but the time derivatives of the
spectrograms. That will cancel the static frequency skew.

Are you suggesting something like:

  spectogram = spectogram(2:end,:) - spectogram(1:end-1,:);

(using Matlab/Octave syntax) assuming first dimension is the time axis
and the second dimension is the frequency axis? I've tested it, but the
result is less accurate. See:

 http://tuxee.net/tmp/spect-deriv.png

(Here, the expected answer is around 1855.) I've slightly scaled one of
the graphs to match the other one. The red graph clearly indicates the
expected answer, but the green one does not. Maybe you were talking
about a different computation?

2. Normalize the spectrograms wrt the power of the signals. That will
make the similarity measure independent of the volume.

How do I compute the power of the signals? Actually, while both sounds are
close together, one of the recordings can be more sensitive to wind, or one
can record sound not heard by the other one (but note that both
recorders are less than 1 meter apart), so it might be hard to find how
to normalize them together.

For the spectrograms, I'm actually working in the log scale for the
amplitudes, so when comparing them by subtracting components from each other,
this should cancel possible volume differences a bit, I guess. (Better
than if I were keeping the FFT output without applying log when computing the
spectrogram.)

It could be possible to make an adaptive filter to minimize the
difference between the audio streams. After the filter has converged,
it is simple enough to derive the time shift from the coefficients.

While the sounds will usually be a few seconds apart, I don't know if an
adaptive filter would work with a larger time difference. I'm not
familiar enough with such filters, though.

That would be much more accurate and less computationally demanding
than the spectrograms. However, it works only if the streams are
sufficiently correlated for the adaptive algorithm to work. In my
experience, the correlation between two different microphones is
rather low, especially if there is lossy compression on the way.

I'm not using compression (both sounds are PCM; one is recorded at
44.1 kHz, mono, 16 bits, and the other one at 96 kHz, stereo, 24 bits, but
downsampled to the format of the first one before processing), but as
you guessed, unfortunately one mic can be more sensitive to some
sounds (like wind, for example) and this can complicate the computation.
(See my first link at the top.)

Thanks for your help!

--
Frédéric Jolliton

Frederic,
be aware that there will not only be a time offset between these two
recordings but there MAY also be a SPEED offset. Consider that the
two recording devices each have a tolerance on their internal clocks
relative to the playback machine. If you achieved perfect time
alignment at the start of the recordings, and the recordings are, say, 1
hour long, you may find that after 1 hour they are no longer in time
alignment. If the clocks were off by 100 ppm, after 1 hour the time
error could be a significant fraction of a second. Depending upon
your purpose, lip sync or stereo image phasing, this will be
significant. If this is critical to you, it would be best if the
algorithm could be made continuously adaptive so that it would track
any speed differences. Fortunately, with modern gear, the speed
differences should be very small.

Also remember that if the two recordings were made with different
microphones, there will be about 1 ms per foot offset in time due to
the speed of sound. So it depends what your goal is in combining
these two recordings.
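
To put numbers on the two effects above, here is a small worked calculation. The function names are made up for the example, and 343 m/s is the usual figure for the speed of sound in air at about 20 °C.

```python
def clock_drift_seconds(ppm, duration_s):
    """Accumulated timing error for a clock mismatch given in ppm."""
    return ppm * 1e-6 * duration_s

def acoustic_delay_ms(distance_m, speed_of_sound=343.0):
    """Extra travel time of sound over a given distance, in ms."""
    return 1000.0 * distance_m / speed_of_sound

# 100 ppm over one hour (3600 s)
print(f"{clock_drift_seconds(100, 3600):.2f} s")    # prints "0.36 s"
# one foot (0.3048 m) of extra distance to a microphone
print(f"{acoustic_delay_ms(0.3048):.2f} ms")        # prints "0.89 ms"
```

So 100 ppm over an hour is roughly a third of a second, which is easily visible for lip sync, while microphone spacing of a few feet matters mainly for stereo phasing.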

This technique is a common practice in audio production and you may
get more insight asking at rec.audio.pro.

Mark


 
Frédéric Jolliton...
Posted: Tue Oct 27, 2009 8:09 pm
Guest
Quote:
be aware that there will not only be a time offset between these two
recordings but there MAY also be a SPEED offset. [..] If the clocks
were off by 100 ppm, after 1 hour the time error could be a
significant fraction of a second.

Indeed, that's actually what my (rough) measurements show (0.1% drift from
the expected sampling rate). However, I'm processing only short (a few
minutes at most) footage, so I don't worry too much about that.

Quote:
Depending upon your purpose, lip sync or stereo image phasing, this
will be significant.

Actually, I'm shooting video footage while recording the sound
with an external device, then I try to merge them using the original
sound track (with poor quality) as a means to properly lip sync the new
audio track.

Quote:
If this is critical to you, it would be best if the algorithm could be
made continuously adaptive so that it would track any speed
differences.

Maybe this sort of thing could be handled by searching for various chunks
(say 1 minute's worth of samples) from one audio track in the other
one, and deducing from the resulting positions how the clock drifted
between the two.
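
A sketch of how the drift could be deduced from such chunk matches, assuming each chunk yields a (position, measured offset) pair and that the drift is constant, so the offsets fall on a straight line. The function and argument names are hypothetical.

```python
def fit_drift(positions_s, offsets_s):
    """Least-squares line: offset = intercept + slope * position.
    Returns (intercept in seconds, slope in ppm)."""
    n = len(positions_s)
    mean_p = sum(positions_s) / n
    mean_o = sum(offsets_s) / n
    sxx = sum((p - mean_p) ** 2 for p in positions_s)
    sxy = sum((p - mean_p) * (o - mean_o)
              for p, o in zip(positions_s, offsets_s))
    slope = sxy / sxx
    return mean_o - slope * mean_p, slope * 1e6

# Chunks taken every 60 s; offsets growing by 0.1% (1000 ppm)
intercept, ppm = fit_drift([0, 60, 120, 180], [1.50, 1.56, 1.62, 1.68])
```

The intercept is the offset at the start of the track and the slope is the relative clock error; at least two chunks are needed, and more chunks average out bad matches.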

Quote:
This technique is a common practice in audio production and you may
get more insight asking at rec.audio.pro.

Ok, I will check there. Thanks!

--
Frédéric Jolliton
 
Frédéric Jolliton...
Posted: Thu Oct 29, 2009 11:23 pm
Guest
[Lip sync]
Quote:
So I don't understand why you need an "algorithm" to do this, it is
usually simply done by hand and eye manually in an editor...

The problem is that I can have tens of video clips per day to sync.
This is why I'm trying to automate the process. Video with poor audio
quality comes from a camera, while better sound comes from a separate
dedicated device. I want to pass all these files to my program, which
pairs them by their creation date (since both devices record the time &
date) and then deduces by how much the tracks are out of
sync, so that I can create new video footage with the better sound. Then
I can use the resulting video in a video editor as usual.

By hand, it would be too laborious.

Quote:
as well as rec.audio.pro, you can also try
rec.audio.movies.production.sound (It's RAMPS)

Ok.

--
Frédéric Jolliton
 
Les Cargill...
Posted: Sun Nov 01, 2009 4:49 am
Guest
Frédéric Jolliton wrote:
Quote:
Is there a more robust and more efficient way to find how much time
separates 2 (or more) audio samples recorded at the same place, but with
different devices?


If this can be done at all in an analytic manner, it can be done by
deconvolving a pair of (short) samples from the two audio streams against
each other (meaning deconv(a,b) and then deconv(b,a)). There is a
complication with that, which I'll address in a later paragraph.

Once you get the deconvolution signature, pop it up in a wave
editor. The first "tallest spike" is the best guess for a good
place to offset the two.

OK, the complication: deconvolution requires the "a" signal to be
shorter than the "b" signal, so you have to clip accordingly. And it may
be that the resulting deconvolution signature is indeterminate.

Search for "Voxengo deconvolver" for free software that does
deconvolution for you, unless you have MATLAB or Octave.

With MATLAB/Octave, you might be able to automagically do the whole
thing.
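
As a rough illustration in NumPy, here is one way to get a "deconvolution signature" and its tallest spike. Note that this is not the same computation as `deconv(a,b)` in MATLAB/Octave, which does polynomial long division; the sketch swaps in a regularized division in the frequency domain, which tends to cope better with noise. The function name and the `reg` parameter are assumptions.

```python
import numpy as np

def deconv_delay(a, b, reg=1e-3):
    """Estimate where the short excerpt `a` occurs inside the longer
    `b` by regularized frequency-domain deconvolution of b by a.
    Returns (sample index of the tallest spike, impulse response)."""
    n = len(b)
    A = np.fft.rfft(a, n)                 # zero-pads a to len(b)
    B = np.fft.rfft(b, n)
    power = np.abs(A) ** 2
    # reg keeps the division stable where a has little energy
    H = B * np.conj(A) / (power + reg * power.max())
    h = np.fft.irfft(H, n)                # estimated impulse response
    return int(np.argmax(np.abs(h))), h
```

The FFT makes the result circular, so the spike position is only meaningful when the excerpt lies entirely inside `b`. A flat response with no clear spike would correspond to the "indeterminate" case.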
--
Les Cargill

 
Frédéric Jolliton...
Posted: Sun Nov 01, 2009 11:49 pm
Guest
[..]
Quote:
If this can be done at all in an analytic manner, it can be done by
deconvolving a pair of (short) samples from the two audio streams against
each other (meaning deconv(a,b) and then deconv(b,a)). There is a
complication with that, which I'll address in a later paragraph.

Once you get the deconvolution signature, pop it up in a wave
editor. The first "tallest spike" is the best guess for a good
place to offset the two.

OK, the complication: deconvolution requires the "a" signal to be
shorter than the "b" signal, so you have to clip accordingly. And it may
be that the resulting deconvolution signature is indeterminate.

Thanks for pointing me in a new direction.

Actually, I've found a solution using spectrogram comparisons, as I
tried to describe in my other posts, but now restricting the search
to a few seconds of shift between them. It works nicely so far for all
the pairings of video and audio I tested. But I feel that it is not
optimal and could fail for certain sounds (if there is not enough
similarity in the spectrograms, for example).

So, I will look into deconvolution. I'm not familiar with these
transformations, though.

The idea is that one sound can be seen as a convolution of the other
one? But can this handle imperfect recordings? For example, one
recorder produces a lot of noise. On the other hand, the other recorder is
more sensitive to wind. And of course, they have different
frequency responses, and even different recording levels.

What do you suggest as sizes for the arguments 'a' and 'b' of deconv?

My first attempt used the following data (44.1 kHz sounds):

octave:1> u = wavread("sound1.wav")(:,1);
octave:2> v = wavread("sound2.wav")(:,1);
octave:3> size(u), size(v)
ans =
5500495 1
ans =
5263002 1

But then trying plot(deconv()) with various excerpts from 'u' and
'v' hasn't produced an obvious answer (even when testing excerpts from the
same sound as both arguments). Would you mind providing more details?

[..]
Quote:
With MATLAB/Octave, you might be able to automagically do the whole
thing.

I'm using Octave and NumPy for my tests (mostly the latter, because I'm
used to Python, but I use Octave for simpler computation.)

--
Frédéric Jolliton
 
 
All times are GMT