Thought people might find this abstract of a talk being given here at Penn of some interest. Please let me know if I'm wrong <grin>. (And, no, I won't be attending; almost all of it would be over my head. What is in this abstract is probably as much of it as I could understand without considerable preparation <grin>). ------------------------------------------------------------------------ In article <119753@netnews.upenn.edu>, holland@central.cis.upenn.edu (Billie Holland) writes:
Statistical Techniques for Language Recognition: An Introduction and Empirical Study for Cryptanalysts
Alan T. Sherman Computer Science Department University of Maryland Baltimore County
In cryptanalysis, how can a computer program recognize when it has discovered all or part of the secret message? For example, how can a program recognize character strings such as ``Attack at dawn!'', ``DES@RT ST\&RM'', or ``?tta????t d?wn'' as fragments of intelligible messages? In the early days of cryptology a human would perform these language-recognition tasks manually. In this talk I will explain how to recognize language automatically with statistical techniques.
Statistical techniques provide powerful tools for solving several language-recognition problems that arise in cryptanalysis and other domains. Language recognition is important in cryptanalysis because, among other applications, an exhaustive key search of any cryptosystem from ciphertext alone requires a test that recognizes valid plaintext. Although I will focus on cryptanalysis, this talk should be relevant to anyone interested in statistical inference on Markov chains or applied language recognition.
Modeling language as a finite stationary Markov process, I will adapt a statistical model of pattern recognition to language recognition. Within this framework I will consider four well-defined language-recognition problems: 1) recognizing a known language, 2) distinguishing a known language from uniform noise, 3) distinguishing unknown 0th-order noise from unknown 1st-order language, and 4) detecting non-uniform unknown language. For the second problem I will give a most powerful test based on the Neyman-Pearson Lemma. For the other problems, which typically have no uniformly most powerful tests, I will give likelihood ratio tests. I will also discuss the chi-squared test statistic $X^2$ and the Index of Coincidence $IC$.
In addition, I will present the results of computer experiments that characterize the distributions of five test statistics when applied to strings of various lengths drawn from nine types of real and simulated English.
This is joint work with Ravi Ganesan. Most of this work was carried out while Sherman was a member of the Institute for Advanced Computer Studies, University of Maryland College Park.
Thursday, 15 April 93 TOWNE BUILDING - 337 3:00 - 4:30
-- david david@staff.udc.upenn.edu