Page 1 of 1    

Interesting Idea. Will it Hold?...

Author Message
The Randomizer...
Posted: Thu Jun 11, 2009 7:52 pm
Guest
Quote:
BUT .... if resources (cost, bandwidth, cpu, memory, complexity) are the
key issue, it would be simpler, cheaper, more reliable to *just type the
words*

I actually thought the same thing when I first got the idea, but I
was thinking of the Pakistani environment, where only a minority of
people know how to read and write English, and even fewer know how to
type and use a computer. This would make it possible for those people
to communicate easily with friends and relatives abroad without
needing to place an international call.

One of the problems I envision is that, since the language spoken by
the majority of people in Pakistan is Urdu, it would be a very
daunting task to build speech recognition software for it. I was
thinking of a system in which, rather than recognizing the words
themselves, the pronunciation of each word would be found and then
used to pick suitably similar-sounding characters to represent the
word. Is that possible, i.e. finding the characters from the
pronunciation of a word?

Ted Dunning ==> Yes, I accept that the output speech would not be in
the speaker's voice, but I guess that is a compromise one would have
to make to ensure that speech is delivered in a country where most
Internet connections run at around 1-2 KB per second...
 
Mok-Kong Shen...
Posted: Fri Jun 19, 2009 1:34 am
Guest
The Randomizer wrote:

Quote:
This got me thinking, would it be possible to develop such a system?
One that would recognize words spoken into a micro-phone and assign
them a number based on their location in a dictionary. These numbers
would then be transmitted to the other end, where they would be again
matched against the dictionary and 'spoken' by the computer. [snip]

In dictionaries a noun is given in singular and a verb is given as
infinitive. So you would need some additional coding. Earlier I
suggested elsewhere that, if one limits one's vocabulary to a certain
number of dictionary words, serially number these words and append
some (fixed number of) bits to provide the needed morphological
information, then such a coding would be far more economical than
the ASCII coding.
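
A minimal sketch of the coding described above, under stated assumptions: the tiny vocabulary, the 3-bit morphology field and the morphology codes are all hypothetical and chosen only to make the arithmetic visible. Each word becomes one integer: its serial number in the agreed word list, shifted left, with the morphology code in the low bits.

```python
# Hypothetical toy vocabulary; a real system would share a much larger list.
VOCAB = ["child", "go", "house", "see", "walk"]
INDEX = {word: i for i, word in enumerate(VOCAB)}

MORPH_BITS = 3  # assumed fixed width of the morphology field
# Hypothetical morphology codes, for illustration only.
MORPH = {"base": 0, "plural": 1, "past": 2, "3sg": 3, "gerund": 4}
MORPH_NAMES = {code: name for name, code in MORPH.items()}


def encode(word, form="base"):
    """Pack the word's serial number and its morphology code into one integer."""
    return (INDEX[word] << MORPH_BITS) | MORPH[form]


def decode(code):
    """Recover (dictionary word, morphology name) from a packed integer."""
    mask = (1 << MORPH_BITS) - 1
    return VOCAB[code >> MORPH_BITS], MORPH_NAMES[code & mask]


def bits_per_word(vocab_size):
    """Bits needed per word: enough for the serial number plus the morphology field."""
    index_bits = max(1, (vocab_size - 1).bit_length())
    return index_bits + MORPH_BITS


print(encode("house", "plural"))  # 17
print(decode(17))                 # ('house', 'plural')
print(bits_per_word(65536))       # 19
```

With a 65536-word vocabulary this comes to 19 bits per word, whereas a six-letter word plus a space in 8-bit ASCII takes 56 bits.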

M. K. Shen
 
Ian Parker...
Posted: Fri Jun 19, 2009 10:22 am
Guest
On 18 June, 22:34, Mok-Kong Shen <mok-kong.s... at (no spam) t-online.de> wrote:
Quote:
The Randomizer wrote:
This got me thinking, would it be possible to develop such a system?
One that would recognize words spoken into a micro-phone and assign
them a number based on their location in a dictionary. These numbers
would then be transmitted to the other end, where they would be again
matched against the dictionary and 'spoken' by the computer. [snip]

In dictionaries a noun is given in singular and a verb is given as
infinitive. So you would need some additional coding. Earlier I
suggested elsewhere that, if one limits one's vocabulary to a certain
number of dictionary words, serially number these words and append
some (fixed number of) bits to provide the needed morphological
information, then such a coding would be far more economical than
the ASCII coding.

In Arabic, which I have actually written a program for, the words are
given as a stem. This applies both to nouns and verbs. From the stem
you can derive all the parts of speech.

http://groups.google.co.uk/group/google-translate-general/browse_frm/thread/11ac35a28bd3bdcd?hl=en

"dar" means house, which you can inflect to make "my house". In
Buckwalter there is a morphology dictionary. Morphology types are
similar to conjugations in Latin and describe how a stem can be
modified to give singular, plural and dual, together with various
possessive forms (my house).

My program uses the Buckwalter tables and finds all the possible
meanings of a word.

bsm (Buckwalter strict transliteration) can be besm (in the name of):
bsm Allh means "in the name of God (Allah)". bsm can also be b_sm,
bi_sama (by poison). You have to tell by context which is which.
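
The ambiguity above can be sketched as a search over every prefix/stem split of a word. The two tables below are toy data built only from the glosses in this post; they are not real Buckwalter tables, and the function name is hypothetical.

```python
# Toy tables for illustration only; NOT real Buckwalter data.
PREFIXES = {"": "", "b": "by"}
STEMS = {"bsm": "in the name of", "sm": "poison"}


def analyses(word):
    """Return every (prefix, stem, gloss) split that both tables accept."""
    results = []
    for i in range(len(word)):
        prefix, stem = word[:i], word[i:]
        if prefix in PREFIXES and stem in STEMS:
            gloss = (PREFIXES[prefix] + " " + STEMS[stem]).strip()
            results.append((prefix, stem, gloss))
    return results


for analysis in analyses("bsm"):
    print(analysis)
# ('', 'bsm', 'in the name of')
# ('b', 'sm', 'by poison')
```

Both readings come back; choosing between them is the context problem and is outside this sketch.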

http://www.aclweb.org/anthology/W/W08/W08-0511.pdf

This has been done systematically by Buckwalter for Arabic. Something
similar (and a lot simpler) can be done for other languages. Instead
of quoting donner, finir and vendre, a Buckwalter-type system would
quote donn (v MT=1), fin (v MT=2), vend (v MT=3). fin (n) would be a
separate stem. In Arabic the stems (on the whole) can stand alone.
French verb stems cannot. The stand-alone case, though, is defined as
a morphology which may or may not exist.
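
As a rough illustration of the stem-plus-morphology-type idea for French, here is a sketch that covers only the regular present tense of the three groups. The `Entry` type and the table layout are assumptions made for this example, not Buckwalter's actual format.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    stem: str
    pos: str  # part of speech, e.g. "v"
    mt: int   # morphology type, as in "donn (v MT=1)"


# Endings for regular verbs only, indexed by morphology type.
INFINITIVE = {1: "er", 2: "ir", 3: "re"}
PRESENT = {
    1: ["e", "es", "e", "ons", "ez", "ent"],
    2: ["is", "is", "it", "issons", "issez", "issent"],
    3: ["s", "s", "", "ons", "ez", "ent"],  # "" = the bare stem is itself a form
}


def infinitive(entry):
    """Rebuild the dictionary (infinitive) form from stem and type."""
    return entry.stem + INFINITIVE[entry.mt]


def present(entry):
    """Six present-tense forms: je, tu, il, nous, vous, ils."""
    return [entry.stem + ending for ending in PRESENT[entry.mt]]


for e in (Entry("donn", "v", 1), Entry("fin", "v", 2), Entry("vend", "v", 3)):
    print(infinitive(e), present(e))
```

Irregular verbs would need their own types or explicit exception tables, which this sketch leaves out.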

In Arabic there are prefixes of a kind that do not exist in European
languages. A computerised dictionary works on hashing. You do NOT
denote words serially; you multiply all the letters together modulo a
prime. This is HASHING. It is fairly standard and is in fact the
method used in Mathematica for sparse matrices. Mathematica
multiplies rows and columns together. The algorithm is fast: you can
exhaustively search all the combinations of a word very quickly.
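
For readers unfamiliar with hashing, here is a small sketch of a dictionary lookup keyed on a hash modulo a prime. It uses a standard polynomial hash (multiply the running value by a base, add the next letter, reduce modulo a prime), which is a common variant and not necessarily the exact scheme meant above or the one Mathematica uses. The constants and function names are assumptions.

```python
P = 10007  # a prime; also the number of buckets (assumed value)
B = 257    # base multiplier (assumed value)


def word_hash(word):
    """Polynomial hash: h = (h * B + letter code) mod P for each letter."""
    h = 0
    for ch in word:
        h = (h * B + ord(ch)) % P
    return h


def build_table(words):
    """Place each word in the bucket selected by its hash."""
    table = [[] for _ in range(P)]
    for w in words:
        table[word_hash(w)].append(w)
    return table


def contains(table, word):
    """Look in a single bucket only, instead of scanning the whole list."""
    return word in table[word_hash(word)]


table = build_table(["bsm", "sm", "dar"])
print(contains(table, "sm"))   # True
print(contains(table, "bs"))   # False
```

Because each lookup touches one bucket, trying every prefix/stem split of a word costs only a handful of lookups.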

In the quoted thread I have been quite trenchant in my criticism of
Google Translate. I think they could do a lot better if challenged.


- Ian Parker
 
Mok-Kong Shen...
Posted: Sat Jun 20, 2009 9:05 pm
Guest
Mok-Kong Shen wrote:

Quote:
In dictionaries a noun is given in singular and a verb is given as
infinitive. So you would need some additional coding. Earlier I
suggested elsewhere that, if one limits one's vocabulary to a certain
number of dictionary words, serially number these words and append
some (fixed number of) bits to provide the needed morphological
information, then such a coding would be far more economical than
the ASCII coding.

I would like to add that there is at least one language, namely
Chinese, where no additional coding bits are needed. Each commonly
used Chinese ideogram is given in Unicode as 2 bytes (in UTF-16).
Unicode covers the Chinese ideograms very comprehensively. (There
was/is a Chinese telegraphic code that covers 10000 ideograms. For
everyday use 4000 is fairly sufficient.)
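
The byte counts can be checked directly. This sketch compares UTF-16 and UTF-8 sizes for two common ideograms and for one rarer ideogram outside the Basic Multilingual Plane; the sample characters are simply examples picked for this illustration.

```python
def encoded_size(text, encoding):
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


common = "中文"           # two everyday ideograms
rare = "\U00020000"       # an ideogram from CJK Extension B

print(encoded_size(common, "utf-16-be"))  # 4 (2 bytes each)
print(encoded_size(common, "utf-8"))      # 6 (3 bytes each)
print(encoded_size(rare, "utf-16-be"))    # 4 (needs a surrogate pair)
```

So the 2-bytes-per-ideogram figure holds for everyday characters in UTF-16, while UTF-8 spends 3 bytes on each and the rarer extension characters take 4 bytes in either encoding.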

M. K. Shen
 
 
All times are GMT