<http://zlab.commerce.net/wiki/index.php/MapReduce_for_Decentralized_Computation>

MapReduce for Decentralized Computation

I was reading Dean and Ghemawat's MapReduce paper this morning. It describes a way to write large-grain parallel programs, in a simple functional style in C++, for execution on Google's large clusters of computers. It occurred to me that the system as described might be well suited to computational cooperation between mutually untrusting entities, computation in what we at CommerceNet refer to as a "decentralized" system.

Decentralized computational cooperation

There are five major problems with regard to computational cooperation, by which I mean outsourcing some amount of your computation to servers run by people you don't trust completely:

Cracking: The server trusts the client not to break the server.
Storage: The client trusts the server to store the client's data safely.
Correctness: The client trusts the server to execute the client's code accurately and produce results quickly.
Confidentiality: The client trusts the server not to disclose the client's data against the client's wishes.
Payment: The server usually trusts the client to pay for the service provided.

Cracking can be dealt with by known methods: metered resource usage and well-known isolation techniques. Payment can be dealt with in any number of ways; note that it is not entirely separable from cracking. Storage can be dealt with by replicating or erasure-coding data across a number of administratively independent servers. This leaves the problems of correctness and confidentiality; I think the MapReduce approach can help with correctness.

Correctness

The "map" function converts a record from an input file into a set of records in an intermediate file; typically, each input record is replicated in several places on the cluster, and the "map" function is deterministic. The "reduce" function converts the set of all intermediate records sharing the same key (gathered from the many map-stage output files) into a set of output records, which are written to an output file; typically the "reduce" function is also deterministic.

In the usual case, where these functions are deterministic, they can be executed on two administratively independent servers, and the results (which, in the Google case, are merely files) can be compared. If they differ, the same computation can be repeated on further administratively independent servers to see which results were correct. (It may be worthwhile to compare Merkle tree hashes of the output files, rather than the output files themselves, since moving the output files over the network may entail significant expense.) This prevents any single broken or dishonest machine or administrative entity from affecting the correctness of the overall computation. Higher levels of redundancy can defend against stronger attackers at relatively modest cost.

Some threats can be defeated by even weaker means. Recomputing each function for a randomly selected 1% of input or intermediate records, then comparing the results, costs almost nothing and may still provide an acceptable probability of catching faults or compromises, provided they affect a significant proportion of the output.
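
As a rough sketch of these two checks (in Python rather than the C++ of Google's implementation, and with invented helper names rather than any real MapReduce API), the fragment below runs the same deterministic map task through two independently operated workers, compares digests of their outputs, and spot-checks a random sample of input records:

    import hashlib
    import random

    def map_task(records, map_fn):
        """Apply a deterministic, user-supplied map function to every input record."""
        out = []
        for key, value in records:
            out.extend(map_fn(key, value))
        # Canonical ordering, so two honest workers produce byte-identical output.
        return sorted(out)

    def output_digest(intermediate):
        """Digest of a task's output.  A Merkle tree root would additionally let
        the client localize a mismatch without shipping whole output files."""
        h = hashlib.sha256()
        for pair in intermediate:
            h.update(repr(pair).encode())
        return h.hexdigest()

    def agree(records, map_fn, worker_a, worker_b):
        """Give the same task to two administratively independent workers and
        compare digests; on disagreement, further workers would arbitrate."""
        return (output_digest(worker_a(records, map_fn))
                == output_digest(worker_b(records, map_fn)))

    def spot_check(records, claimed_output, map_fn, rate=0.01):
        """Recompute a random sample of input records locally and confirm the
        expected intermediate pairs appear in the worker's claimed output."""
        claimed = set(claimed_output)
        sample = [rec for rec in records if random.random() < rate]
        return all(pair in claimed
                   for key, value in sample
                   for pair in map_fn(key, value))

    # Example: a deterministic "map" that emits (word, 1) pairs for a word count.
    def word_count_map(doc_id, text):
        return [(word, 1) for word in text.split()]

    docs = [("d1", "to be or not to be"), ("d2", "to see or not to see")]
    honest_worker = lambda recs, fn: map_task(recs, fn)
    assert agree(docs, word_count_map, honest_worker, honest_worker)
    assert spot_check(docs, map_task(docs, word_count_map), word_count_map, rate=1.0)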

In the ten-billion-record performance tests mentioned in the Google paper, a corrupt "map" function would have to affect fewer than 69 of the input records to have less than a 50% chance of being detected by the 1% sample (about 100 million randomly selected records); with an independent 1% sample, a corruption touching k records escapes detection with probability 0.99^k. A corrupt "map" function that affected 500 input records, only one out of every twenty million, would have just a 0.7% chance of escaping detection. This probably deals adequately with machine failures and gross attacks, but a careful attack might corrupt the output for only a single input record, and such an attack would have only a 1% chance of being caught. Even that may be enough of a deterrent if the corruption is the result of a deliberate attack and the attacker is vulnerable to sufficiently severe penalties.

Confidentiality

Confidentiality is a more difficult problem: computing with confidential data on hardware that belongs to someone you don't trust requires computing with encrypted data, and in the general case that is a very difficult problem. (Perhaps this could be written with a real example.)
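
One narrow, well-understood special case of computing with encrypted data is a homomorphic encryption scheme. For instance, textbook RSA is multiplicatively homomorphic: an untrusted server can multiply two encrypted values without learning either of them. A toy sketch in Python, with parameters far too small to be secure and chosen purely to illustrate the idea:

    # Textbook RSA with tiny, insecure parameters, purely for illustration.
    p, q = 61, 53
    n = p * q                  # modulus, 3233
    e, d = 17, 2753            # e * d == 1 (mod phi(n)), phi(n) = 3120

    def encrypt(m):
        return pow(m, e, n)

    def decrypt(c):
        return pow(c, d, n)

    a, b = 7, 6
    ca, cb = encrypt(a), encrypt(b)

    # The untrusted server multiplies the ciphertexts without seeing a or b ...
    c_product = (ca * cb) % n

    # ... and the client decrypts the true product (modulo n).
    assert decrypt(c_product) == (a * b) % n    # 42

Performing arbitrary computation this way, rather than a single multiplication, is far harder, which is why the general case remains a very difficult problem.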