OT: voice/audio-processing...
cr88192
Posted: Mon Mar 10, 2008 9:41 am
Guest
well, I am contrary, aren't I?...
maybe some people here might be at least vaguely interested.

namely, for so long, I ended up using LPC for nearly everything, when
alternatively MDCT is the preferred option...

then, I get to formant synthesis, which is something normally done with LPC,
and decide instead to use MDCT. actually, it is weirder, I am using DCT-II
for the input signal processing, but IMDCT for the output signal generation
(the reason for this is that I am doing a good deal of processing, and
DCT-II is a little easier to work with, but MDCT allows generating much
smoother output, and for my uses is mathematically "close enough" to the
normal DCT to allow generating reasonable output).
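For reference, the analysis side described above can be sketched as a plain DCT-II over fixed-size blocks. This is only a minimal illustration, not the code used in the post: the function names `dct2` and `to_blocks`, the unnormalized scaling, and the use of non-overlapping blocks are all assumptions.

```python
import math

def dct2(block):
    """Naive O(N^2), unnormalized DCT-II of one block of samples."""
    n = len(block)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(block))
        for k in range(n)
    ]

def to_blocks(samples, size):
    """Split samples into consecutive non-overlapping blocks (remainder dropped)."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def analyze(samples, size):
    """Convert a sample list into a list of DCT-II coefficient blocks."""
    return [dct2(b) for b in to_blocks(samples, size)]
```

A constant input block puts all of its energy into coefficient 0, which is a quick sanity check on the transform.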

DCT is mostly used because it makes it fairly easy to work with the signal
and synthesize output, and also because at present I have little real idea
how to effectively tune LPC-IIR filters (for audio compression, I use FIR
filters, and some fairly generic tuning/guesstimation algos, but these can't
tune to a signal in the way needed for audio synthesis...).

or, technically, I have no idea what I am doing, but have just fiddled my
way through things thus far...


so, what was my task here:
attempt to more highly abstract an audio signal (making it much easier to work
with), and then resynthesize it from these abstractions (faithful is better,
but I was not able to pull this off exactly).

the reason is partly that, the higher the level of abstraction on which
something can be operated, the more that can be done with it. in the case of
audio, this would mean, for example, being able to change the voice, or
regenerate new speech from what we do have (I have a custom text-to-speech
engine).

in my case, the audio I have is a big mass of diphones that I had formerly
been using for diphone synthesis.


so, how I process the diphones:
they are all processed, and converted into bunches of DCT blocks;
all of the DCT blocks are processed in an attempt to generate a set of
orthogonal vectors representing the individual formants (in my case, this
consists of averaging all the DCT blocks and splitting up and normalizing
the components);
the DCT-blocks are then processed, and converted to vectors of respective
formant energies.
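The steps above could look roughly like the sketch below. The post only says the blocks are averaged, split up, and normalized, so the details here are assumptions: the average is taken over coefficient magnitudes, the "splitting up" is done as equal-width contiguous bands, and the names `formant_basis` and `formant_energies` are hypothetical. Bands with disjoint support are orthogonal by construction.

```python
import math

def formant_basis(dct_blocks, n_bands):
    """Build n_bands unit-length, mutually orthogonal vectors from averaged blocks."""
    size = len(dct_blocks[0])
    # Assumption: average coefficient magnitudes so that signs do not cancel.
    avg = [sum(abs(b[k]) for b in dct_blocks) / len(dct_blocks) for k in range(size)]
    basis = []
    for j in range(n_bands):
        lo = j * size // n_bands
        hi = (j + 1) * size // n_bands
        # Keep only this band's slice of the average; zero elsewhere.
        vec = [avg[k] if lo <= k < hi else 0.0 for k in range(size)]
        norm = math.sqrt(sum(v * v for v in vec))
        basis.append([v / norm for v in vec] if norm > 0 else vec)
    return basis

def formant_energies(block, basis):
    """Project one DCT block onto each basis vector (one energy per formant vector)."""
    return [sum(c * v for c, v in zip(block, vec)) for vec in basis]
```

Each DCT block then shrinks from `size` coefficients to `n_bands` numbers, which is where the loss of detail comes from.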

at this point, it is possible to resynthesize the output from these vectors
by using each vector-component to scale each of the formant vectors, the
results being accumulated into a DCT block, which can then be run through
the IMDCT to produce output samples.
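A minimal sketch of this resynthesis path, under stated assumptions: the function names are hypothetical, the IMDCT is the naive unnormalized textbook form, and the sine window with 50% overlap-add is a common choice rather than something the post specifies. Note that the DCT-II and MDCT basis functions differ (the MDCT uses `k + 0.5` and a phase offset), so feeding a DCT-II-domain block to the IMDCT is an approximation, as the post says.

```python
import math

def resynth_block(energies, basis):
    """Scale each formant vector by its energy and accumulate into one block."""
    size = len(basis[0])
    block = [0.0] * size
    for e, vec in zip(energies, basis):
        for k in range(size):
            block[k] += e * vec[k]
    return block

def imdct(coeffs):
    """Naive unnormalized IMDCT: N coefficients in, 2N time samples out."""
    n = len(coeffs)
    return [
        sum(c * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
            for k, c in enumerate(coeffs))
        for i in range(2 * n)
    ]

def overlap_add(blocks):
    """Window each IMDCT output (sine window) and overlap-add with a hop of N samples."""
    n = len(blocks[0])
    window = [math.sin(math.pi * (i + 0.5) / (2 * n)) for i in range(2 * n)]
    out = [0.0] * (n * (len(blocks) + 1))
    for b, coeffs in enumerate(blocks):
        for i, s in enumerate(imdct(coeffs)):
            out[b * n + i] += window[i] * s
    return out
```

The overlapping windows are what smooth the block boundaries, which fits the observation that the IMDCT output sounds smoother than a plain inverse DCT.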

it is also presumably possible at this point to change the voice by using a
different set of formant vectors.


processing continues, however:
these vectors are processed per-diphone, calculating an average for the
vector (representing the midpoint), the average of the left-half of the
diphone, and the average of the right half;
from these, I use linear prediction (good old '2*x-y') to estimate the
vectors for the left and right phones, and also, from the diphone names I
know which phones are represented;
I then calculate averages over all of the examples of the various phones.

so, now, we have all of the phones represented as particular points in a
large vector space.
so, 'ah' and 'eh' are points in space, as is 's', 'j', 'ch', 't', ...
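One way to read the per-diphone processing above is sketched below. It is an interpretation, not the original code: the '2*x-y' step is taken to mean extrapolating from the midpoint average through each half average, diphone names are assumed to be `(left, right)` tuples, and every diphone is assumed to have at least two frames.

```python
def mean_vec(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def extrapolate(x, y):
    """The '2*x-y' step: continue the line from y through x by the same distance."""
    return [2 * a - b for a, b in zip(x, y)]

def estimate_phones(diphones):
    """diphones maps (left_phone, right_phone) to a list of formant-energy vectors."""
    samples = {}
    for (left, right), frames in diphones.items():
        half = len(frames) // 2
        mid = mean_vec(frames)                # average of the whole diphone
        left_avg = mean_vec(frames[:half])    # average of the left half
        right_avg = mean_vec(frames[half:])   # average of the right half
        samples.setdefault(left, []).append(extrapolate(left_avg, mid))
        samples.setdefault(right, []).append(extrapolate(right_avg, mid))
    # Average every estimate collected for each phone into a single point.
    return {phone: mean_vec(vs) for phone, vs in samples.items()}
```

For a diphone whose single component ramps 0, 1, 2, 3, this gives -0.5 for the left phone and 3.5 for the right one, i.e. the two ends of the ramp.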

at this point, it is also possible to regenerate diphones via the use of
linear interpolation (and in fact, in my tests, this is what I have done).
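As a rough sketch of that interpolation (the function name and the choice to sample at frame centres are assumptions), a diphone can be regenerated as a sequence of vectors moving from the left phone's point to the right phone's point:

```python
def interpolate_diphone(left, right, n_frames):
    """Linearly interpolate n_frames vectors between two phone points."""
    frames = []
    for i in range(n_frames):
        t = (i + 0.5) / n_frames  # position of this frame's centre in [0, 1]
        frames.append([(1 - t) * a + t * b for a, b in zip(left, right)])
    return frames
```

Each resulting vector can then be turned back into a DCT block and run through the IMDCT in the same way as the analyzed vectors.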


this was mostly just fiddling, and there is a lot of room for improvement in
terms of quality within this general design (actually, the main issue AFAIK
is in terms of finding a good set of voice vectors, as simply averaging all
the input blocks and dividing up the output, does not seem to be a good way
to do this...).

note that this "could" be regarded as (somewhat lossy) compression, since,
after all, a set of vectors representing the voice, and smaller vectors
representing the various phones, would take far less space to store than a
mass of prerecorded voice fragments...


an example being fairly short and clear enough (the audio being produced in
the general way described in this post):
http://cr88192.dyndns.org:8080/audio/example1.wav

here is another example using pure diphone synthesis (these diphones serving
as the input for all the algos described before...):
http://cr88192.dyndns.org:8080/audio/example2.wav


well, maybe relevant or interesting...

any comments?...
Thomas Richter
Posted: Mon Mar 10, 2008 10:37 am
Guest
cr88192 wrote:

Quote:
so, what was my task here:
attempt more highly abstract an audio signal (making it much easier to
work with), and then resynthesize it from these abstractions (faithful
is better, but I was not able to pull this off exactly).

This is called "parametric compression", and it is actually one of the
many options available for MPEG-4 audio coding. One application is to
compress speech (as you said), one can also consider this for music
compression (as in "midi is a parametric audio compression"). IIRC,
there's also an open source speech audio codec available, I just forgot
the name - sorry.

Greetings,
Thomas
Thomas Richter
Posted: Mon Mar 10, 2008 2:42 pm
Guest
Thomas Richter wrote:

Quote:
This is called "parametric compression", and it is actually one of the
many options available for MPEG-4 audio coding. One application is to
compress speech (as you said), one can also consider this for music
compression (as in "midi is a parametric audio compression"). IIRC,
there's also an open source speech audio codec available, I just forgot
the name - sorry.

libspeex is the name.

So long,
Thomas
cr88192
Posted: Mon Mar 10, 2008 7:09 pm
Guest
"Thomas Richter" <thor@math.tu-berlin.de> wrote in message
news:fr42p7$7ra$1@infosun2.rus.uni-stuttgart.de...
Quote:
Thomas Richter wrote:

This is called "parametric compression", and it is actually one of the
many options available for MPEG-4 audio coding. One application is to
compress speech (as you said), one can also consider this for music
compression (as in "midi is a parametric audio compression"). IIRC,
there's also an open source speech audio codec available, I just forgot
the name - sorry.

libspeex is the name.


oh well, I guess something was missed here:
my point was actually for processing the audio data, and using it in a
text-to-speech engine, rather than compressing the audio for sake of
compressing the audio (as would usually be expected, say, for compressing
voice recordings or music).

that is why the message was marked as 'OT'...
(not like I really have any groups where this is less OT though...).


some of the technology overlaps with that of compression (some of what I was
doing with the DCT and MDCT transforms), but the point is not really
compression...


reducing a voice to a set of formant vectors, however, could
theoretically allow a good level of compression, so that is a possible use:
for example, reducing the voice to a vector stream would allow 16 or 32kbps
to be used for sending audio, even if each coefficient were sent as
a 32-bit value;
linearly differencing, quantizing, and entropy-coding the values would allow
much lower, say, 4kbps;
reducing to a set of phones and timings, could get the stream down to, say,
120bps (assuming around 8 bits for each phone, and 4 bits for the timing, or
90bps with 6 bit phones, and 3 bit timings, or smaller with entropy or
dictionary coding).
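The arithmetic behind these figures can be checked with a small sketch. The rates are not stated in the post and are inferred here: 120bps at 12 bits per phone implies about 10 phones per second, and 16 or 32kbps at 32 bits per coefficient implies 500 or 1000 coefficients per second. The function name is hypothetical.

```python
def stream_bps(units_per_second, bits_per_unit):
    """Bit rate of a stream sending a fixed number of fixed-size units per second."""
    return units_per_second * bits_per_unit

PHONES_PER_SECOND = 10  # assumption inferred from 120bps / (8 + 4) bits

phone_rate_12bit = stream_bps(PHONES_PER_SECOND, 8 + 4)  # 8-bit phone, 4-bit timing
phone_rate_9bit = stream_bps(PHONES_PER_SECOND, 6 + 3)   # 6-bit phone, 3-bit timing
vector_rate_low = stream_bps(500, 32)    # 500 coefficients/s at 32 bits each
vector_rate_high = stream_bps(1000, 32)  # 1000 coefficients/s at 32 bits each
```

These work out to 120, 90, 16000, and 32000 bits per second respectively, matching the numbers quoted above.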

or, also:
we compress the speech by running it through a very specialized
speech-recognition engine (rather than stopping with quantized DCT blocks or
entropy-coded LPC differences);
at the same time, a model is trained allowing the voice to be approximately
re-synthesized, for example, from an almost textual representation (or, we
can get by simply sending the raw phones, or text, if we don't really care
if the same voice comes out the other end...);
textual information is still plenty compressible as well.


of course, we can also note, that with all this the generality goes away.
for example, we can fairly easily reduce a large monologue to a chunk of
text, but not a song...

so, for example, DCT and LPC are likely about an upper limit if one wants to
retain generality (and regain similar-sounding output...).

all of this being analogous to if a video codec could recognize objects (for
example, from its large dictionary of known object types, or it could
dynamically build props from what it "sees"), and later resynthesize the
video using generic props (possibly in a different form, for example, live
action comes out looking like anime or whatever...). no more need for all
these frames and block-based motion estimation and similar...

at this point, neither is likely practical though...


so, the point is to raise the level of abstraction to such a level, as to
allow resynthesis with very different properties (such as generating yelling
rather than speaking, or swapping out for different voices), without having
to record and edit large numbers of diphones to make these kinds of changes.

at the level of phones, we can then connect both decomposition and synthesis
to a phonetic textual representation, such as IPA or SAMPA;
this is, in turn, linked through a dictionary to the main TTS engine
(written words to speech), or, heuristically, to a speech recognition
engine.


for example, both example fragments given were the result of TTS, not
recorded speech.

beyond this, are textual-level processing tasks (good old NLP and friends).
one could go further, linking this to a simulated semantic model, ...

and so on...



Thomas Richter
Posted: Tue Mar 11, 2008 2:45 am
Guest
cr88192 wrote:

Quote:
oh well, I guess something was missed here:
my point was actually for processing the audio data, and using it in a
text-to-speech engine, rather than compressing the audio for sake of
compressing the audio (as would usually be expected, say, for
compressing voice recordings or music).

That's what "parametric compression" is about. I'm not sure Speex
implements it. As said, *there are* codecs for this around. It doesn't
compress the data, but tries to extract parameters to describe the
signal on a higher level, suitable for a specific purpose (i.e. make the
language understandable).

So long,
Thomas
cr88192
Posted: Tue Mar 11, 2008 6:00 am
Guest
"Thomas Richter" <thor@math.tu-berlin.de> wrote in message
news:fr5dab$r7t$1@infosun2.rus.uni-stuttgart.de...
Quote:
cr88192 wrote:

oh well, I guess something was missed here:
my point was actually for processing the audio data, and using it in a
text-to-speech engine, rather than compressing the audio for sake of
compressing the audio (as would usually be expected, say, for
compressing voice recordings or music).

That's what "parametric compression" is about. I'm not sure Speex
implements it. As said, *there are* codecs for this around. It doesn't
compress the data, but tries to extract parameters to describe the
signal on a higher level, suitable for a specific purpose (i.e. make the
language understandable).


yes, ok...


groan, it seems in my world I am lacking much that is interesting to write about
anymore...


All times are GMT - 5 Hours