Eric Blossom <eb@srlr14.sr.hp.com> says:
I have seen estimates that a straightforward implementation requires about 13.5 million multiply-accumulates per second. Most of the time is burned up in a brute-force search for the best excitation vector to use. There is a fixed code book with 512 entries, and a dynamic code book with 256 entries (it may be 128). Each code book entry is an excitation vector 60 samples long. Therefore, to evaluate each one, you have to run a 60-element vector through a 10-pole filter to get the predicted output, then compute some measure of error. This requires an additional difference operation that is implemented as some kind of "perception weighting filter" (I don't remember the details).
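For concreteness, here is a minimal C sketch of the brute-force search described above. It assumes a direct-form all-pole synthesis filter and substitutes a plain squared error for the perceptual weighting filter Eric mentions; all names are illustrative, only the 60-sample and 10-pole figures come from the post.

    #define VEC_LEN 60   /* samples per excitation vector */
    #define NPOLES  10   /* order of the all-pole predictor */

    /* Filter one excitation vector through the 10-pole synthesis
       filter and return the squared error against the target. */
    static double vector_error(const float *exc, const float *target,
                               const float a[NPOLES])
    {
        float mem[NPOLES] = {0};  /* filter state, cleared per vector */
        double err = 0.0;

        for (int n = 0; n < VEC_LEN; n++) {
            float y = exc[n];
            for (int k = 0; k < NPOLES; k++)     /* 10 MACs per sample */
                y -= a[k] * mem[k];
            for (int k = NPOLES - 1; k > 0; k--) /* shift filter state */
                mem[k] = mem[k - 1];
            mem[0] = y;
            double d = y - target[n];
            err += d * d;
        }
        return err;
    }

    /* Exhaustive search: evaluate every code book entry, keep the
       best.  For 512 + 256 entries this is where the cycles go. */
    static int search_codebook(const float cb[][VEC_LEN], int ncode,
                               const float *target, const float a[NPOLES])
    {
        int best = 0;
        double best_err = vector_error(cb[0], target, a);

        for (int i = 1; i < ncode; i++) {
            double e = vector_error(cb[i], target, a);
            if (e < best_err) { best_err = e; best = i; }
        }
        return best;
    }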
I have been reading the PowerPC 601 manual (the MPC601, due in the Macs of early 1994). It is dangerous to believe performance figures: they give you the world in one chapter and then take it back here and there in bits and pieces. Here is what I see, however.

Simple single-precision floating-point operations can issue one per cycle. The book mentions several floating-point ops that take more than one clock in a pipeline stage, but it doesn't mention floating multiply-add, so I think one can issue each clock. Integer-unit instructions can issue in the same clock as floating-point ops. If you do the blocking trick used to multiply matrices, then only one load is required per multiply-add (see the sketch after this message). All this leads to the optimistic estimate that a 50MHz chip can sustain nearly 50 fmadds per microsecond. Inner products are much like matrix multiply, a benchmark on which the RS/6000 (the MPC601's father) achieved nearly one fmadd per clock, and that was double precision!

128 excitation vectors of 60 single-precision floats each fit in the on-chip cache, but it is tight (128 x 60 x 4 bytes is about 30K against the 601's 32K cache). There may be enough margin here for it to work with no special DSP.

I'll be in Yosemite for a few days, so I won't be able to respond immediately to comments.
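As an illustration of the blocking trick referred to above, here is a hypothetical C inner loop that accumulates four correlations at once: each target sample is loaded into a register once and reused, so the loads approach one per multiply-add. The block size of four and all names are assumptions, not anything from the MPC601 manual.

    #define VEC_LEN 60
    #define BLOCK    4   /* illustrative block size */

    /* Correlate four code book vectors against one target at once. */
    static void correlate_block(const float cb[BLOCK][VEC_LEN],
                                const float *target, float out[BLOCK])
    {
        float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;

        for (int n = 0; n < VEC_LEN; n++) {
            float t = target[n];    /* one load, reused four times */
            acc0 += cb[0][n] * t;   /* one load plus one fmadd each */
            acc1 += cb[1][n] * t;
            acc2 += cb[2][n] * t;
            acc3 += cb[3][n] * t;
        }
        out[0] = acc0; out[1] = acc1;
        out[2] = acc2; out[3] = acc3;
    }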
norm@netcom.com (Norman Hardy) writes:
128 excitation vectors of 60 single-precision floats each fit in the on-chip cache, but it is tight.
The codebooks are overlapped. The whole thing (program + data) should fit in 32K. Reduced-complexity CELP can be done in less than 10 million operations per second, including everything. Of course, a multiply-accumulate is considered one operation.

Sincerely,

Miron Cuperman, Software Consulting
TCP/IP, UNIX, C++, DSP
Voice: (604) 987 1719
Fax:   (604) 986 8139
Email: miron@extropia.wimsey.com
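To see why overlapping shrinks the footprint so dramatically, here is a hedged sketch of an overlapped code book in the style used by Federal Standard 1016 CELP, where adjacent entries are windows into one shared array shifted by two samples; the shift value and all names here are assumptions for illustration.

    #define VEC_LEN 60
    #define NCODE  512
    #define SHIFT    2   /* samples between adjacent entries (assumed) */

    /* One flat array holds the entire code book:
       60 + (512 - 1) * 2 = 1082 samples, about 4.2K as floats,
       versus 512 * 60 * 4 = 120K with each entry stored separately. */
    static float cb_store[VEC_LEN + (NCODE - 1) * SHIFT];

    /* Entry i is simply a 60-sample window into the shared store. */
    static const float *codebook_entry(int i)
    {
        return &cb_store[i * SHIFT];
    }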
My proposal is that we get some software working that produces poor-quality speech in real time on fast hardware that most people don't have. Then improved search algorithms will bring higher quality,* and the natural evolution of faster hardware will make it available to all.

I think that we as cypherpunks have been thrown off a bit by the policy issues and the publicity we've received. It's time to get back into active development. Remember, architecture *is* policy!

John

* The way these algorithms work is that the sender goes through a laborious process to find the best "encoding" (literally, out of a code book) that matches the sounds it is trying to communicate. Typically the quality depends on how much time it has to do this; spending more time looking at more possibilities makes it more likely that you find one with a very small difference between the real signal and the encoded signal. We can start off with stupid algorithms that just give up and use the best-so-far when they run out of time (a sketch of such a search follows this message), and gradually improve them to be more intelligent about the *order* in which they search. This requires no change to receivers; it's backward compatible.
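A minimal sketch of the give-up-and-use-the-best-so-far idea, assuming a wall-clock budget measured with clock(); the error measure is a bare squared difference standing in for the filtered comparison, and every name here is illustrative.

    #include <time.h>

    #define VEC_LEN 60

    /* Stand-in error measure: squared difference between excitation
       and target.  A real encoder would run the excitation through
       the synthesis and weighting filters first. */
    static double vector_error(const float *exc, const float *target)
    {
        double err = 0.0;
        for (int n = 0; n < VEC_LEN; n++) {
            double d = exc[n] - target[n];
            err += d * d;
        }
        return err;
    }

    /* Scan the code book until the time budget expires and return
       the best entry found so far.  The receiver decodes whatever
       index it gets, so stopping early degrades quality only, never
       compatibility. */
    static int search_with_budget(const float cb[][VEC_LEN], int ncode,
                                  const float *target, double budget_sec)
    {
        clock_t start = clock();
        int best = 0;
        double best_err = vector_error(cb[0], target);

        for (int i = 1; i < ncode; i++) {
            double e = vector_error(cb[i], target);
            if (e < best_err) { best_err = e; best = i; }
            if ((double)(clock() - start) / CLOCKS_PER_SEC > budget_sec)
                break;
        }
        return best;
    }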
participants (3)
- gnu
- miron@extropia.wimsey.com
- norm@netcom.com