ASM vs portable code [WAS: Re: Java DES breaker?]

Ray Arachelian <sunder@brainlink.com> wrote:
On Wed, 11 Dec 1996, Dr.Dimitri Vulis KOTM wrote:
I happen to have a Sparc 20 box and a Linux box and a SCO box, and ActiveX won't work on any of those. I also work with a bunch of other equipment that's much faster than a PC, but doesn't run browsers. (Most of it is not connected to the 'net for security reasons, but that's beside the point.)
Right, and ActiveX, if those machines were on the web, would not be supported.
If Bill's client is sure to run the platforms that MS IE runs on, then this is not a consideration.
Correct, however there is one thing you have forgotten... (next paragraph)
Interpreted FORTH bytestream (which is what Java is) may be "doing quite well" when drawing GUI gizmos and widgets, but it can't get anywhere near the performance of hand-optimized assembler that you can stick into ActiveX.
While ActiveX does support hand-optimized assembler, there are Java Just-In-Time compilers which take JVM bytecodes and turn 'em into raw assembler. They aren't hand optimized, they are natively compiled code, but they are native code nonetheless. A good optimizing compiler may not be 100% as cool and as fast as hand-optimized code, BUT it'll be almost as fast. And Java will run on just about EVERY platform out there. That is a bigger, more important point than a 10%-25% increase in power over non-optimized code.
Besides, I'm not arguing AGAINST an ActiveX client; there's no reason why there can't be both Java and ActiveX clients out there, since ActiveX brings a speed increase but also a compatibility issue.
While I'm reluctant to ever find myself in the same corner as Vulis, he has a point. As one of the few folks on this list who actually writes code, I say that hand-optimized assembler will beat machine-generated code every time, and I have figures to back me up.

As some of you know, I've been working on a DES key recovery tool. I have both portable C and x86 assembler versions. They are currently identical, except that the guts of the DES round is written in C in one, and hand-optimized Pentium assembler in the other. For this test, I modified the code to cut out the delays associated with incrementing the key schedule, leaving most of the crunching in the DES decryption. Both versions were compiled under Visual C++ 4.0, with optimizations set to 'Maximize speed' and inlining set to 'any suitable', and run on a 90 MHz Digital Celebris 590 under WinNT 3.51. Averaging several runs:

  C:   102,300 crypts/sec
  ASM: 238,000 crypts/sec

With Java, it's possible to add native code methods to the interpreter, though this requires extra work by the user - it's harder than 'click on this link to run my reely kool applet'. This violates the Java sandbox, and requires the user to make trust decisions about the methods they are adding. ActiveX lets you add and run native code with a click, but again involves trust decisions.

My philosophy is 'the more the merrier'. I'd like to see people work on DES key recovery on a large number of platforms - we just need to standardize on the input and output formats.

Peter Trei
Senior Software Engineer
Purveyor Development Team
Process Software Corporation
http://www.process.com
trei@process.com
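For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a crypts/sec timing harness in portable C. The des_set_key/des_decrypt names and their stub bodies are hypothetical stand-ins so the sketch compiles on its own; Peter's actual tool, its key-schedule handling, and the assembler inner loop are not shown - in a real test you would link in either the C or the assembler DES routines.

#include <stdio.h>
#include <time.h>

/* Hypothetical DES primitives.  These stubs exist only so the sketch runs;
 * swap in the real portable C or hand-optimized assembler routines. */
static void des_set_key(const unsigned char key[8]) { (void)key; }
static void des_decrypt(const unsigned char in[8], unsigned char out[8])
{
    int i;
    for (i = 0; i < 8; i++) out[i] = in[i];   /* placeholder, not DES */
}

int main(void)
{
    unsigned char key[8] = {0}, ct[8] = {0}, pt[8];
    const long trials = 1000000L;
    clock_t start, stop;
    double secs;
    long i;

    des_set_key(key);
    start = clock();
    for (i = 0; i < trials; i++)
        des_decrypt(ct, pt);                  /* the loop being timed */
    stop = clock();

    secs = (double)(stop - start) / CLOCKS_PER_SEC;
    if (secs > 0.0)
        printf("%.0f crypts/sec\n", trials / secs);
    else
        printf("too fast to measure; raise 'trials'\n");
    return 0;
}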

At 9:28 PM -0800 12/14/96, Dale Thorn wrote:
I remember sitting down with some ASM programmers in the mid 1980's (using x86 PCs), and at that time, looking at the Codeview tracings, it occurred to me that ASM would nearly always run 2x faster than 'C', something that is inherent in the processes.
Modern compiler peephole optimizers are quite good, and there is not much to be gained by trying to beat them. The real gains come from being able to make more restrictive assumptions than a compiler can, based on your superior knowledge of the program.

For example, most operating system kernels have a global pointer to the current process. Assembly language kernels normally dedicate a register to hold that pointer. In C, each separately compiled routine must re-load it from its memory location, because the routines cannot coordinate register usage. Parameter passing is another place this kind of global register assignment can improve assembly programs.

Another place where this global view of a program helps is in re-loads after calling externally compiled routines. The compiler must assume that the external routine has changed the variable, while a smart programmer can know better and save the re-load. Even if the data is in the level 1 cache, most architectures can do at most one memory reference instruction per cycle, and memory accesses seem to be the critical path for OS kernels.

These optimizations work better with register-rich architectures such as the R4000, Sparc, PowerPC, etc. than they do on the popular Intel architecture, because there are more registers to use.

BTW - My experience with assembler over C is more like 4:1 than 2:1. YMMV!

-------------------------------------------------------------------------
Bill Frantz       | I still read when I should   | Periwinkle -- Consulting
(408)356-8506     | be doing something else.     | 16345 Englewood Ave.
frantz@netcom.com | It's a vice. - R. Heinlein   | Los Gatos, CA 95032, USA
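A small C illustration of the re-load point Bill makes. The names (current, external_routine) are made up for the example; the point is only that a compiler seeing the routines separately must assume the worst about the global, while the programmer may know the external call never touches it.

#include <stdio.h>

static int value = 7;
int *current = &value;      /* global pointer, like a kernel's current-process pointer */

/* Imagine this lives in a separately compiled module.  It never touches
 * 'current', but a compiler translating the callers below cannot know that. */
void external_routine(void) { }

int sum_compiler_view(int n)
{
    int total = 0;
    int i;
    for (i = 0; i < n; i++) {
        external_routine();
        total += *current;  /* 'current' must be re-loaded from memory each
                               pass: the call might have changed it */
    }
    return total;
}

int sum_programmer_view(int n)
{
    int *p = current;       /* programmer knows the call leaves 'current'
                               alone, so it can stay cached in a register */
    int total = 0;
    int i;
    for (i = 0; i < n; i++) {
        external_routine();
        total += *p;
    }
    return total;
}

int main(void)
{
    printf("%d %d\n", sum_compiler_view(10), sum_programmer_view(10));
    return 0;
}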

Peter Trei wrote:
Ray Arachelian <sunder@brainlink.com> wrote:
On Wed, 11 Dec 1996, Dr.Dimitri Vulis KOTM wrote: [snip] For this test, I modified the code to cut out the delays associated with incrementing the key schedule, leaving most of the crunching in the DES decryption. Both versions were compiled under Visual C++ 4.0, with optimizations set to 'Maximize speed' and inlining set to 'any suitable', and run on a 90 MHz Digital Celebris 590 under WinNT 3.51. Averaging several runs:

  C:   102,300 crypts/sec
  ASM: 238,000 crypts/sec
I remember sitting down with some ASM programmers in the mid 1980's (using x86 PCs), and at that time, looking at the Codeview tracings, it occurred to me that ASM would nearly always run 2x faster than 'C', something that is inherent in the processes.

Someone on this list should know whether it is possible to maximize speed in a typical 'C' routine - using register variables (particularly for loops), inlining everything possible, etc. - to get the executable code much closer than that factor-of-2x difference. Can it be done on a PC, and how hard would it be to explain, covering a representative variety of techniques?

[snip remainder]
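By way of illustration (not a benchmark), here is the sort of source-level tuning Dale is asking about: register hints, a round function small enough to be inlined, and a hand-unrolled loop. The toy_round operation is made up for this sketch and is not DES; how much of the 2x gap such tuning closes depends entirely on the compiler - under settings like Visual C++'s 'Maximize speed' with inlining 'any suitable', the small static function below would normally be inlined.

#include <stdio.h>

/* A stand-in "round" function, deliberately tiny so the optimizer can
 * inline it.  It just gives the loop something to chew on. */
static unsigned long toy_round(unsigned long l, unsigned long r)
{
    return l ^ (((r << 1) | (r >> 31)) & 0xffffffffUL);   /* 32-bit rotate-ish mix */
}

unsigned long crunch(unsigned long l, unsigned long r, long iterations)
{
    register unsigned long a = l, b = r;   /* register hints; modern compilers
                                              largely make this call themselves */
    register long i;

    for (i = 0; i < iterations; i += 4) {  /* modest unrolling by hand */
        a = toy_round(a, b);
        b = toy_round(b, a);
        a = toy_round(a, b);
        b = toy_round(b, a);
    }
    return a ^ b;
}

int main(void)
{
    printf("%lx\n", crunch(0x01234567UL, 0x89abcdefUL, 1000000L));
    return 0;
}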

Bill Frantz wrote:
In C, each separately compiled routine must re-load it from its memory location because they can not coordinate register usage.
There are such things as global optimizers that are quite capable of locating heavily-used global variables.
Another place where this global view of a program helps is in re-loads after calling externally compiled routines. The compiler must assume that the external routine has changed the variable
No, it's not true that it "must" do that. There are optimizer systems that defer decisions until link time (the MIPS compilers, for example).

That said, it's probably the case that a hand-written DES routine could better a good optimizer; the size of the problem is pretty small. On the other hand, I suspect a specially-tuned optimizer that used (maybe; I'm making this up off the top of my head) some sort of genetic techniques could find faster code sequences than a human coder would.

Mike McNally -- Egregiously Pointy -- Tivoli Systems, "IBM" -- Austin
mailto:m5@tivoli.com  mailto:m101@io.com  http://www.io.com/~m101
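Mike's closing speculation is essentially the superoptimizer idea: let a machine search for short code sequences instead of writing them by hand. Here is a toy sketch in C using brute-force enumeration plus random testing rather than the genetic techniques he muses about; the operation table and the abs() target are invented for the example, and a real tool would search actual machine instructions, then verify candidates exhaustively rather than trusting random tests.

#include <stdio.h>
#include <stdlib.h>

typedef long (*op_fn)(long, long);

static long op_add(long a, long b) { return a + b; }
static long op_sub(long a, long b) { return a - b; }
static long op_xor(long a, long b) { return a ^ b; }
static long op_sar(long a, long b) { (void)b; return a >> 31; }  /* arithmetic shift assumed */

static op_fn ops[]        = { op_add, op_sub, op_xor, op_sar };
static const char *names[] = { "add",  "sub",  "xor",  "sar31" };

/* Target we want a branch-free sequence for. */
static long target_abs(long x) { return x < 0 ? -x : x; }

int main(void)
{
    int i, j, t, nops = sizeof(ops) / sizeof(ops[0]);

    /* Try every two-step sequence of the form
     *     r = op_i(x, m);  result = op_j(r, m);
     * where m = x >> 31 (all ones when x is negative).  Random testing only
     * screens candidates; it finds the classic (x ^ m) - m and (x + m) ^ m. */
    for (i = 0; i < nops; i++) {
        for (j = 0; j < nops; j++) {
            int ok = 1;
            for (t = 0; t < 1000 && ok; t++) {
                long x = (long)(rand() - RAND_MAX / 2);
                long m = x >> 31;
                if (ops[j](ops[i](x, m), m) != target_abs(x))
                    ok = 0;
            }
            if (ok)
                printf("candidate: %s then %s\n", names[i], names[j]);
        }
    }
    return 0;
}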
participants (4)
- Bill Frantz
- Dale Thorn
- Mike McNally
- Peter Trei