I've done some further thinking on the text tagging problem, spurred by a question on sci.crypt about tagging pictures (under the subject line "Permanent signatures for pictures"). Here's a summary. ---- Let's say Dow Jones wants to sell newswire subscriptions to individuals, but someone is anonymously forwarding their articles to a newsgroup. Can they succeed in tagging the text to detect the thief? The idea is to make some small twiddle to each subscriber's copy of the text, so that the stolen copy can be matched with some subscriber and their subscription cancelled. Short answer: the thieves win. At first, I thought the answer was the opposite. ---- There are two issues which must be addressed in order to show that the tagger wins: 1. The taggee must not be able to "smooth away" all of the tag bits. 2. The taggee must not be able to cross-correlate multiple copies of the data in question in order to produce a "clean" version. Regarding issue #1, the basic techique is to alter a few features of your data which are important enough that your opponent can't afford to randomize ALL such bits. In the case of text, small changes in word choice are a good candidate. Two criteria are: A. The changes must be "important" enough that the thief can't smooth them all away. B. The changes shouldn't be "important" enough that the newswire becomes worthless! The tagger has an advantage in this case, though. He can change, say, 1 in 1000 of these "important, non-smoothable-away" candidate bits. If the thief wants to cancel them out and only has a single copy of the picture, he must somehow canonicalize _all_ of the candidate tag bits, or some very large proportion of them. So if your tagging process does a little bit of "damage" to your data, like in the map-maker case of adding an extra small street here and there, then the opponent must either try to detect exactly where your damage is, or must make wholesale changes to the data (such as removing all small roads altogether). The thief, in trying to cover up your damage, must make a thousand times as much damage. Choose your damage level appropriately so that your level of damage isn't too much but the thief's is. ---- Issue #2, thieves cross-correlating between multiple copies of the data, is a bit more subtle. Here's the scenario: Dow Jones has 10,000 customers, 64 of whom are in a conspiracy to steal and re-sell the newswire. Dow Jones tries various tagging strategies, altering whitespace and word choice individually for each subscriber. The thieves try to cross-correlate between their copies of the text in order to "cancel out" the tags from the copy which they wish to re-sell. Can Dow Jones detect the thieves and cancel their subscriptions? In the discussion below, when Dow Jones "twiddles a bit" of their newswire, they do so by substituting a word's synonym at a chosen location, using a separate (possibly biased) coin flip for each subscriber. Here are the strategies I've considered. Dow Jones strategy: twiddle some bits with probability 0.5. If the thieves use majority vote, each thief will have a reasonably high correlation with the output bits. (In fact, the probability of a match will exceed 50% by approximately the chance of a tie vote among the thieves, which is about 0.8/sqrt(n) where n is the number of thieves. This computation is a bit hairy.) Thief countermeasure: reliably detect which bits are being twiddled (by cross-checking between, say, 64 different subscriptions) and flip a fair coin to determine the output. There's a chance of only 2 in 2^64 that the thieves fail to detect the twiddle. Dow Jones strategy: twiddle some bits with low probability (e.g. p=0.01). Reasonably often, the bit values will be the same for all thieves. If the thieves use the flip-a-coin strategy, we can determine which tag bits they've failed to detect, and identify them that way. Thief countermeasure: use a majority vote. Dow Jones strategy: hybrid of the two. Thief countermeasure: hybrid of the two. Flip a coin if the vote is fairly even, go with the majority if the vote is uneven. For example, get 64 subscriptions, go with the majority vote if fewer than 16 dissenters, flip a fair coin otherwise. This last strategy for the thieves is the one I can't beat. Theoretical help, anyone? -- Marc Ringuette (mnr@cs.cmu.edu)