A secure cryptosystem with a 40-bit key?
I've been reading a bit recently on constructed languages like Esperanto. I came across one that developed out of something called "LOGLAN" that was published in Scientific American in the early sixties. The current active project is called "Lojban". It has one really curious property that gave me an idea for an interesting symmetric-key cryptosystem.
All "native" Lojban words are of entirely predictable forms. "Root" words are all five characters, containing three consonants and two vowels in one of two patterns (CCVCV and CVCCV). "Structure" words have four forms (VV, CV, CVV, and CV'V). "Combining forms" have two forms (CVC and CV'C). All other words are not "native" words (being either proper names or borrowed words). The upshot of this is that there is a fixed limit on the size of the Lojban dictionary of 249500 words (given 17 consonants and five vowels). The grammar of the language is *so* regularized that they are able to give a YACC description for it.
A message written entirely using native Lojban words can be encrypted in a codebook fashion where the particular codebook to be used is a permutation of the dictionary represented by a 40-bit number (18 bits to permute the "root word" list, 10 bits for the "structure word" list, and 12 bits for the "combining form" list). This system has the interesting property that *any* plaintext with the same grammatical structure is a potential decryption of a given cyphertext. This is similar to some more conventional cryptosystems that operate at the lexical level and are designed to create this effect, but it has the curious side effect that it is *very* easy to determine a false key which makes the transmitted message say nearly anything you want, thus making mandatory key escrow systems irrelevant.
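The dictionary size and the bit widths quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: it counts every string that fits the stated C/V patterns with 17 consonants and five vowels, and the names `FORMS` and `count_forms` are made up for the illustration. The bit widths computed here are the number of bits needed to name one position in each list.

```python
from math import ceil, log2

CONSONANTS = 17  # as stated in the post
VOWELS = 5

# Word shapes from the post; the apostrophe is a fixed character.
FORMS = {
    "root": ["CCVCV", "CVCCV"],
    "structure": ["VV", "CV", "CVV", "CV'V"],
    "combining": ["CVC", "CV'C"],
}

def count_forms(patterns):
    """Count all strings matching any of the given C/V patterns."""
    total = 0
    for pattern in patterns:
        n = 1
        for ch in pattern:
            if ch == "C":
                n *= CONSONANTS
            elif ch == "V":
                n *= VOWELS
            # any other character (the apostrophe) contributes a factor of 1
        total += n
    return total

counts = {name: count_forms(patterns) for name, patterns in FORMS.items()}
bits = {name: ceil(log2(n)) for name, n in counts.items()}

print(counts)                # {'root': 245650, 'structure': 960, 'combining': 2890}
print(sum(counts.values()))  # 249500
print(bits)                  # {'root': 18, 'structure': 10, 'combining': 12}
```

The three widths add up to the 40 bits in the subject line.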
When you want to send the message "attack at dawn", you devise a decoy message, "party 'til you puke" (which is grammatically identical in Lojban). You then generate a random key, as well as a false key representing a similar permutation, but with "attack" and "party" exchanged, "puke" and "morning" exchanged, and so forth. Transmit the message with the false key in the LEAF field (or report it to your government-approved escrow agency) and government eavesdroppers get the wrong message. Other eavesdroppers get a grammatically correct, but apparently nonsensical message ("drink by brick").
There's still the problem of borrowed words and proper names, which remain a problem in any codebook approach, but they represent a small portion of the language. The words which represent individual letters are part of the "structure words" category, so such words could be sent spelled out.
This works well in Lojban because it never changes word forms based on grammatical usage. Most natural language declensions and conjugations would make the encrypted message ungrammatical, and make it *much* more difficult to determine a false key for the LEAF field. The irregularity of word forms makes the dictionary much more complicated, too. Comments?
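The false-key trick can be sketched in Python under some loud assumptions. The codebook here is an explicit lookup table built from a seeded shuffle, not the 40-bit key encoding described in the post; English tokens stand in for Lojban root words; and `make_codebook`, `false_codebook` and the word list are all hypothetical names invented for the illustration.

```python
import random

def make_codebook(words, seed):
    """Map each plaintext word to a codeword via a seeded shuffle."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    return dict(zip(words, shuffled))

def encrypt(message, codebook):
    return [codebook[word] for word in message]

def decrypt(cipher, codebook):
    inverse = {code: plain for plain, code in codebook.items()}
    return [inverse[code] for code in cipher]

def false_codebook(codebook, real, decoy):
    """Exchange the entries for each real/decoy word pair.

    Assumes no word is used on both the real and the decoy side.
    """
    fake = dict(codebook)
    for r, d in zip(real, decoy):
        fake[r], fake[d] = codebook[d], codebook[r]
    return fake

# Toy stand-ins for the content words of the two messages.
WORDS = ["attack", "dawn", "party", "puke", "drink", "brick", "dog", "red"]
real = ["attack", "dawn"]
decoy = ["party", "puke"]

true_book = make_codebook(WORDS, seed=1995)
fake_book = false_codebook(true_book, real, decoy)
cipher = encrypt(real, true_book)

print(decrypt(cipher, true_book))  # ['attack', 'dawn']
print(decrypt(cipher, fake_book))  # ['party', 'puke']
```

The swapped table is still a one-to-one mapping, so it looks like an ordinary codebook to whoever holds it.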
-----BEGIN PGP SIGNED MESSAGE----- Hello cypherpunks@toad.com and Scott Brickner <sjb@universe.digex.net> S.B. wrote: ...[lojban]... Well, I don't know lojban, but I've had a glance at it... ...
All "native" Lojban words are of entirely predictable forms. "Root" words are all five characters containing three consonants and two vowels in one of two patterns (CCVCV and CVCCV). "Structure" words have four forms (VV, CV, CVV, and CV'V). "Combining forms" have two forms (CVC and CV'C). All other words are not "native" words (being either ...
.u'u I believe that you were wrong when you expressed a symbol for the list of forms of structure words. There are also words of the form V'V. On the other hand, however, not all letter combinations are permitted; e.g. there are only fourteen diphthongs (of which only four are "normal").
The grammar of the language is *so* regularized that they are able to give a YACC description for it.
Yeah, and a huge beast it is. If you can make a YACCable language with one page of rules (say 16 :-) ), *then* I'll be impressed. Would you believe two grammars for mathematical expressions? No? Good! There are three (infix, prefix and postfix).
A message written entirely using native Lojban words can be encrypted in a codebook fashion where the particular codebook to be used is a ...
You have to be careful here - the structure words (cmavo) are divided into groups (selma'o) which have different grammatical functions. You can't mix up members of different selma'o (in general), so you'd have to permute within each separately. Some of these selma'o have very few members (even just one) and/or may of themselves reveal information.
This system has the interesting property that *any* plaintext with the same grammatical structure is a potential decryption of a given cyphertext. ...
Yes, but the grammatical structure itself may reveal heaps. (Except for trivial statements.) ...
There's still the problem of borrowed words and proper names, which ... could be sent spelled-out.
Yup, including font changes, if memory serves!
This works well in Lojban because it never changes word forms based on grammatical usage. Most natural language declensions and conjugations would make the encrypted message ungrammatical, and make it *much* more ...
Not really; you just need to make sure that you conjugate the coded words. (I.e. substitute nouns for nouns, verbs for verbs, etc.) In Esperanto, the normal word roots (those that need an ending) would be easy enough to permute... For the rest, you'd have to be careful about structure words like "cxu" (which turns the sentence into a question), conjunctions etc. which go at specific places in the sentence. There are plenty of prepositions to permute :-) You can probably make do with 1000-2000 words in Esperanto, making the codebook somewhat more manageable than in other languages. On the other hand, you would probably have to be careful to delineate the boundaries, as confusion could result (the breaking up of an E-a word into the component roots is not necessarily unique, leading to puns ranging from beautiful (diamanto) through the weird (amoro) to the horrible). Perhaps that would be a feature, though? Adiaux! Jiri - -- If you want an answer, please mail to <jirib@cs.monash.edu.au>. On sweeney, I may delete without reading! PGP 463A14D5 (but it's at home so it'll take a day or two) PGP EF0607F9 (but it's at uni so don't rely on it too much) -----BEGIN PGP SIGNATURE----- Version: 2.6.2i iQCVAwUBMI3fzSxV6mvvBgf5AQE2VAQAxVwmHaku0rwpGswl8RBZa8q4Xm/yv5wh uMNPl1b4FXPeJplsGGRqBnwgOL0+zcAowKIvkVJBeg2zB95ZGFcQW5IKVRg7tnR8 vX8khTwnRG3y0NcvMdFjPwn38gu4j8gyvMRHk5/x9sM1228zqQ/+0FrMD063geVw Q1476RGREq4= =YhgP -----END PGP SIGNATURE-----
Jiri Baum writes:
.u'u I believe that you were wrong when you expressed a symbol for the list of forms of structure words. There are also words of the form V'V.
On the other hand, however, not all letter combinations are permitted; e.g. there are only fourteen diphthongs (of which only four are "normal").
Well, I was definitely oversimplifying things.
The grammar of the language is *so* regularized that they are able to give a YACC description for it.
Yeah, and a huge beast it is. If you can make a YACCable language with one page of rules (say 16 :-) ), *then* I'll be impressed.
I'm sure that one could be done on one page, but I doubt it would have the expressive power of a natural language *and* the lack of ambiguity of Lojban.
Would you believe two grammars for mathematical expressions? No? Good! There are three (infix, prefix and postfix).
And there's feedback between MEX and the non-MEX grammar, since there are cmavo which convert MEX into sumti and selbri and vice versa.
A message written entirely using native Lojban words can be encrypted in a codebook fashion where the particular codebook to be used is a ...
You have to be careful here - the structure words (cmavo) are divided into groups (selma'o) which have different grammatical functions. You can't mix up members of different selma'o (in general), so you'd have to permute within each separately.
Some of these selma'o have very few members (even just one) and/or may of themselves reveal information.
To achieve the goal of the cryptosystem it may not be necessary to encode the cmavo, since they have no real meaning on their own, just the gismu and rafsi. The goal is to hide the *meaning*, not the structure. The selma'o that only have one member are especially meaning-free, as they're typically elidable terminators and such.
This system has the interesting property that *any* plaintext with the same grammatical structure is a potential decryption of a given cyphertext. ...
Yes, but the grammatical structure itself may reveal heaps. (Except for trivial statements.)
In a natural language this might be true, but in Lojban the grammar's regularity eliminates much of this information. In English it's "strange" to say "the red big dog", while "the big red dog" is fine. Lojban doesn't have these features. Lojban bridi are essentially the same as function calls in a programming language, from a grammatical perspective. The only distinguishing feature of a selbri is the number of sumti that it takes, and it's unusual for all of them to be specified, and extra ones may be added using the BAI selma'o.
This works well in Lojban because it never changes word forms based on grammatical usage. Most natural language declensions and conjugations would make the encrypted message ungrammatical, and make it *much* more ...
Not really; you just need to make sure that you conjugate the coded words. (I.e. substitute nouns for nouns, verbs for verbs, etc.)
Irregularities make this nearly impossible for computers, though. There are also problems due to ambiguity. The even bigger inconvenience with natural languages comes in defining the codebook. The limited forms of Lojban gismu and rafsi make the whole dictionary a well-defined list, permitting the codebook to be specified as a single number that anyone could use, even without prior exchange of the wordlist.
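To illustrate the "single number, no exchanged wordlist" point, here is a rough Python sketch. Both ends regenerate the same root-word list from the two shapes and then apply the key. Two things are assumptions of the sketch and not part of the proposal: the key is used as a simple rotation of the list (standing in for whatever permutation the key would really select), and the list contains every string of the right shape, including ones Lojban's letter-combination rules would reject. `enumerate_forms`, `encode_root` and `decode_root` are hypothetical names.

```python
from itertools import product

CONSONANTS = "bcdfgjklmnprstvxz"  # the 17 Lojban consonants
VOWELS = "aeiou"

def enumerate_forms(patterns):
    """List every string matching the C/V patterns, in a fixed order."""
    words = []
    for pattern in patterns:
        pools = [
            CONSONANTS if ch == "C" else VOWELS if ch == "V" else ch
            for ch in pattern
        ]
        words.extend("".join(letters) for letters in product(*pools))
    return words

# Both parties can rebuild this list independently; nothing is exchanged.
ROOTS = enumerate_forms(["CCVCV", "CVCCV"])
ROOT_INDEX = {word: i for i, word in enumerate(ROOTS)}

def encode_root(word, key):
    """Rotate the word's position in the list by the key (an assumption)."""
    return ROOTS[(ROOT_INDEX[word] + key) % len(ROOTS)]

def decode_root(word, key):
    return ROOTS[(ROOT_INDEX[word] - key) % len(ROOTS)]

key = 123456  # any value that fits in 18 bits
coded = encode_root("klama", key)
print(len(ROOTS))               # 245650
print(decode_root(coded, key))  # klama
```

The same approach would apply to the structure-word and combining-form lists with their own key fields.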
-----BEGIN PGP SIGNED MESSAGE----- Hello cypherpunks@toad.com and Scott Brickner <sjb@universe.digex.net> S.B. writes:
Jiri Baum writes:
.u'u I believe that you were wrong when you expressed a symbol for the ... Well, I was definitely oversimplifying things.
No problem. You just have to be careful when you are generating your wordlist, that's all. ...
Yeah, and a huge beast it is. If you can make a YACCable language with one page of rules (say 16 :-) ), *then* I'll be impressed.
I'm sure that one could be done on one page, but I doubt it would have ...
One would need a great deal of inspiration to make it work. However, I do think that it is possible. ...
A message written entirely using native Lojban words can be encrypted in a codebook fashion where the particular codebook to be used is a ... You have to be careful here - the structure words (cmavo) are divided ... To achieve the goal of the cryptosystem it may not be necessary to encode the cmavo, since they have no real meaning on their own, just ...
How about the numerical cmavo? You'd want to encode numbers. And you don't want people to know they are numbers, because they could count the digits (to get order of magnitude). Same for spelling cmavo. How about the tense system? You'd want to encode that because it could give important hints to locations (and times). Then again you could probably avoid using the "a little to the north and a long way east" tense altogether... How about the attitudinals?
The selma'o that only have one member are especially meaning-free, as they're typically elidable terminators and such.
Like I said, I only glanced at it, but how about NAI and GAI? ...
Yes, but the grammatical structure itself may reveal heaps. (Except for trivial statements.)
In a natural language this might be true, but in Lojban the grammar's regularity eliminates much of this information. ...
I'm not sure I'd agree here. I suspect you are overestimating the regularity of lojban grammar (then again maybe I'm underestimating...). ...
Not really; you just need to make sure that you conjugate the coded words. (I.e. substitute nouns for nouns, verbs for verbs, etc.)
Irregularities make this nearly impossible for computers, though.
Yes (though I'd feel quite confident doing it for Esperanto).
There are also problems due to ambiguity.
Yup. If it's really a problem - ambiguity in language has been with us for a long time and nobody minds much. But I guess you wouldn't want arbitrary ambiguity in your text (you could have an interactive coder which immediately alerts you to all alternative meanings). Or you could put marks into your text to separate the word parts (like some beginner Esperanto books do) thus eliminating the problem.
The even bigger inconvenience with natural languages comes in defining the codebook. ...
I'm sure you could easily find wordlists giving the "first X" words of Esperanto - you could just standardize on one of them. Mi esperas ke tio cxi sencas... Adiaux - Jiri - -- <jirib@cs.monash.edu.au> <jiri@melb.dialix.oz.au> PGP 463A14D5 -----BEGIN PGP SIGNATURE----- Version: 2.6.2i iQCVAwUBMI7wlCxV6mvvBgf5AQG2nAQA66Xej6FaC0cRfQXXDgr2fP4B/xLgd8J0 orN0/H6yOkyyFYaIFE47PI0/4MbfWD8Myoh9J9JtY/kU6Qji3tBpnS6Mo+gDuCQb Th2uwECCi0xEEookESI1+bNJXRiEO62YyCIZVLKm0v9DYndSR9FIIr9yytZ7zBO5 WR9SdebT8N8= =oEqt -----END PGP SIGNATURE-----
Jiri Baum writes:
How about the numerical cmavo? You'd want to encode numbers. And you don't want people to know they are numbers, because they could count the digits (to get order of magnitude). Same for spelling cmavo.
I agree. There are some cmavo with enough meaning to warrant some encoding. OK, for numbers we take our original 40-bit (or whatever) key and, by convention, run it through MD5 to produce the key used for the next digit. By writing all numbers with 20 digits (padding zeros on the front or back as desired), magnitude (and/or precision) is hidden.
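A minimal Python sketch of that digit scheme, with several assumptions spelled out: decimal characters stand in for the Lojban digit cmavo, each digit is shifted modulo 10 by a value derived from the current chained key, the key is hashed once before the first digit as well, and padding is always on the front. `encode_number` and `decode_number` are hypothetical names.

```python
import hashlib

WIDTH = 20  # every number is written with 20 digits, as in the post

def _shifts(key: int):
    """Yield one shift per digit, chaining the key through MD5 each time."""
    state = key.to_bytes(5, "big")  # the 40-bit key as 5 bytes
    for _ in range(WIDTH):
        state = hashlib.md5(state).digest()  # key for the next digit
        yield int.from_bytes(state, "big") % 10

def encode_number(n: int, key: int) -> str:
    if n < 0 or n >= 10 ** WIDTH:
        raise ValueError("number does not fit in 20 digits")
    digits = str(n).zfill(WIDTH)  # front-pad with zeros to hide magnitude
    return "".join(
        str((int(d) + s) % 10) for d, s in zip(digits, _shifts(key))
    )

def decode_number(text: str, key: int) -> int:
    if len(text) != WIDTH:
        raise ValueError("expected exactly 20 digits")
    digits = "".join(
        str((int(c) - s) % 10) for c, s in zip(text, _shifts(key))
    )
    return int(digits)

key = 0x1234567890  # an arbitrary 40-bit example key
coded = encode_number(42, key)
print(len(coded))                 # 20
print(decode_number(coded, key))  # 42
```

Because every encoded number is 20 digits long, counting digits no longer reveals the order of magnitude.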
How about the tense system? You'd want to encode that because it could give important hints to locations (and times). Then again you could probably avoid using the "a little to the north and a long way east" tense altogether...
I'm not sure it gives anything away. The cryptanalyst would only know that there's a location, not anything important about the location, since it could be padded with "null" direction and temporal operators. And the tense selma'o could probably be treated as a single group for encoding. They're only semantically different, not syntactically. You'd end up with more confusing cyphertext, but that's no real problem unless you're trying to hide the fact that you're encoding --- for which you could stego the cyphertext in a rant generator.
How about the attitudinals?
There are enough that encoding them as a group hides their meaning.
The selma'o that only have one member are especially meaning-free, as they're typically elidable terminators and such.
Like I said, I only glanced at it, but how about NAI and GAI?
...
Yes, but the grammatical structure itself may reveal heaps. (Except for trivial statements.)
In a natural language this might be true, but in Lojban the grammar's regularity eliminates much of this information. ...
I'm not sure I'd agree here. I suspect you are overestimating the regularity of lojban grammar (then again maybe I'm underestimating...).
I think you're underestimating because you're discarding the effect of having a cypher which can arbitrarily substitute gismu and rafsi. With my original examples of "Attack at dawn" vs "Party 'til you puke", there's nothing to relate the items. A long sequence of similarly simple statements wouldn't add anything. Increasing the complexity of the statements makes it more difficult to find a false key, but the regularity of Lojban should give you enough leeway to do it. You'd probably want to avoid really complicated bridi, but these sorts of things tend to appear more in literary works than in ordinary communication.
There are also problems due to ambiguity.
Yup. If it's really a problem - ambiguity in language has been with us for a long time and nobody minds much. But I guess you wouldn't want arbitrary ambiguity in your text (you could have an interactive coder which immediately alerts you to all alternative meanings). Or you could put marks into your text to separate the word parts (like some beginner Esperanto books do) thus eliminating the problem.
I had in mind the sort of ambiguity that comes from "Time flies like an arrow", in which any of the first three words could be the verb. A computer translator would have to know which to conjugate and which to decline.
The even bigger inconvenience with natural languages comes in defining the codebook. ...
I'm sure you could easily find wordlists giving the "first X" words of Esperanto - you could just standardize on one of them.
Yep, but it would have to be part of the cryptosystem's definition, as opposed to the language's.
Mi esperas ke tio cxi sencas...
.o'anai mi na cusku fi la .esperantos.