Thought people might find this abstract of a talk being given here at Penn of some interest. Please let me know if I'm wrong <grin>. (And, no, I won't be attending; almost all of it would be over my head. What is in this abstract is probably as much of it as I could understand without considerable preparation <grin>.)

------------------------------------------------------------------------

In article <119753@netnews.upenn.edu>, holland@central.cis.upenn.edu (Billie Holland) writes:
Statistical Techniques for Language Recognition: An Introduction and Empirical Study for Cryptanalysts

Alan T. Sherman
Computer Science Department
University of Maryland Baltimore County
In cryptanalysis, how can a computer program recognize when it has discovered all or part of the secret message? For example, how can a program recognize character strings such as "Attack at dawn!", "DES@RT ST&RM", or "?tta????t d?wn" as fragments of intelligible messages? In the early days of cryptology a human would perform these language-recognition tasks manually. In this talk I will explain how to recognize language automatically with statistical techniques.
Statistical techniques provide powerful tools for solving several language-recognition problems that arise in cryptanalysis and other domains. Language recognition is important in cryptanalysis because, among other applications, an exhaustive key search of any cryptosystem from ciphertext alone requires a test that recognizes valid plaintext. Although I will focus on cryptanalysis, this talk should be relevant to anyone interested in statistical inference on Markov chains or applied language recognition.
Modeling language as a finite stationary Markov process, I will adapt a statistical model of pattern recognition to language recognition. Within this framework I will consider four well-defined language-recognition problems: 1) recognizing a known language, 2) distinguishing a known language from uniform noise, 3) distinguishing unknown 0th-order noise from unknown 1st-order language, and 4) detecting non-uniform unknown language. For the second problem I will give a most powerful test based on the Neyman-Pearson Lemma. For the other problems, which typically have no uniformly most powerful tests, I will give likelihood ratio tests. I will also discuss the chi-squared test statistic X^2 and the Index of Coincidence IC.
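[Forwarder's note: for anyone who hasn't met the Index of Coincidence, here is a minimal Python sketch of the standard definition; this is my own illustration, not anything taken from the talk.]

```python
from collections import Counter

def index_of_coincidence(text):
    """IC = sum over letters of f*(f-1), divided by N*(N-1),
    where f is each letter's frequency and N the letter count:
    the probability that two letters drawn at random from the
    text (without replacement) are equal."""
    letters = [c for c in text.lower() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    freqs = Counter(letters)
    return sum(f * (f - 1) for f in freqs.values()) / (n * (n - 1))
```

English plaintext typically scores near 0.066, while uniform random letters score about 1/26 (roughly 0.038), which is what makes IC useful for telling language from noise.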
In addition, I will present the results of computer experiments that characterize the distributions of five test statistics when applied to strings of various lengths drawn from nine types of real and simulated English.
This is joint work with Ravi Ganesan. Most of this work was carried out while Sherman was a member of the Institute for Advanced Computer Studies, University of Maryland College Park.
Thursday, 15 April 93
Towne Building, Room 337
3:00 - 4:30
--
david
david@staff.udc.upenn.edu
> Language recognition is important in cryptanalysis because, among other
> applications, an exhaustive key search of any cryptosystem from
> ciphertext alone requires a test that recognizes valid plaintext.
For an exhaustive key search on any reasonably good symmetric cipher (like DES), some simple entropy measure over n-bit-grams should suffice to distinguish random from non-random. The other approaches in this talk seem like overkill in this context. But then again, maybe we're trying to break Enigma. :-)
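A sketch of the kind of entropy test Eric has in mind (my own construction, not from the talk): compute the empirical Shannon entropy of the bytes in a candidate decryption. Well-enciphered or random data sits near the maximum of 8 bits per byte; English text sits well below it.

```python
import math
from collections import Counter

def byte_entropy(data):
    """Empirical Shannon entropy of a byte string, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_random(data, threshold=7.5):
    """Crude plaintext filter for exhaustive key search: reject
    candidate decryptions whose byte entropy is near-maximal.
    The threshold is an illustrative choice, not a tuned value."""
    return byte_entropy(data) > threshold
```

In practice you would feed this several blocks of candidate plaintext at once: a single 8-byte DES block gives far too few samples for the frequency estimates to mean anything.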
> Modeling language as a finite stationary Markov process,
A finite stationary Markov process is fancy math-speak for what a travesty generator does. "Finite" means that the total number of states is finite, so you get to use matrices instead of kernel integrals, which means that your averagely educated scientist can follow this. "Stationary" means that the transition matrix is not a function of time, that is, it's a constant matrix; this means that time appears only in an exponent. A "Markov process" is a sequence of probabilistic transitions from one state to another. (Approximately. All these definitions are meant to explain, not to define.)

The talk looks interesting, to be sure, but it looks more significant for making a better /etc/magic for file(1) than it does for cryptanalysis.

Eric
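For the curious, a travesty generator really is just that: estimate a constant (stationary) transition table over a finite set of states from sample text, then walk it. A minimal 1st-order sketch, purely illustrative and not from the talk:

```python
import random
from collections import defaultdict, Counter

def train(text):
    """Estimate 1st-order transition counts: the state is simply
    the previous character."""
    trans = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        trans[a][b] += 1
    return trans

def travesty(trans, start, length, rng=random):
    """Walk the chain. The transition table never changes
    (stationary) and the state set is finite, so this is exactly
    a finite stationary Markov process."""
    out = [start]
    state = start
    for _ in range(length - 1):
        counter = trans.get(state)
        if not counter:
            break  # dead end: no observed successor for this state
        chars = list(counter)
        weights = [counter[c] for c in chars]
        state = rng.choices(chars, weights=weights)[0]
        out.append(state)
    return "".join(out)
```

Training on per-character transitions gives babble with English-like letter statistics; training on word-to-word transitions gives the classic travesty output.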