[p2p-hackers] good-bye, Mnet, and good luck. I'm going commercial! plus my last design doc

Zooko O'Whielacronx zooko at zooko.com
Tue Mar 8 13:04:46 PST 2005


contents of this message: "good-bye, Mnet", "The Fully Connected
Topology", "Rate-Limited Structured Flood", "A Nice Slow Network", "Did
you say 'BROADCAST SEARCH'?", "Persistent Tit-for-Tat == Bilateral
Accounting"


Dear mnet-devel and p2p-hackers:

I am about to accept an exciting job that will preclude me from
contributing to open source projects in the distributed file-system
space.

I will miss the Mnet project!  Good luck without me!

I'm writing the following as a record of the most advanced design that
I have thought of for Mnet.  (See Acknowledgements section below.)
Most or all of the design written below has previously been published
in different web pages, e-mail messages, and IRC transcripts, and a
brief presentation I made at Privacy Enhancing Technologies Workshop
2004.

The design described below is almost but not quite what is currently
implemented, by myself and others, in Mnet v0.7, available at [1].

 *** Design of Mnet v0.7+:

I.  Network connectivity -- the Fully Connected Topology

I.A.  Local peer database.  Each node remembers the nodeId of each
node that it has ever heard of or received a message from.  There is a
maximum number of nodeIds, in deference to memory and computation costs
(your local memory and your local computation).  I don't know what that
maximum number should be.  If you have more than the maximum number of
nodeIds in your database, you can flush some of the
least-recently-alive ones.
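
Here is a minimal sketch of such a peer database in Python (the 2^16
maximum and the flush fraction are placeholders of mine, not
recommendations):

    import time

    MAX_PEERS = 2 ** 16   # placeholder; I don't know the right maximum

    class PeerDB:
        # Remembers every nodeId we have heard of, plus the last time we
        # knew the peer to be alive, so that the least-recently-alive
        # entries can be flushed when the db exceeds its maximum size.
        def __init__(self, max_peers=MAX_PEERS):
            self.max_peers = max_peers
            self.last_alive = {}          # nodeId -> timestamp

        def heard_from(self, node_id):
            self.last_alive[node_id] = time.time()
            if len(self.last_alive) > self.max_peers:
                self._flush()

        def _flush(self):
            # drop the tenth of the peers we have gone longest without
            # hearing from
            doomed = sorted(self.last_alive, key=self.last_alive.get)
            for node_id in doomed[:max(1, len(doomed) // 10)]:
                del self.last_alive[node_id]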

I.B.  Exponential Backoff.  With every peer in your db is stored a
"deadness level".  When the deadness level is equal to 0, that means
that the most recent thing that happened is that you received a message
from that peer -- whether it was a response to a request of yours or a
request from him to you.  We say that the peers with deadness
level 0 are "the live peers".  If a peer has deadness level 1, then
that means that the most recent time that you sent a request to that
peer, he didn't write back.  Now, whenever you want to choose from a
set of peers in order to send a request to one of them, the set you
choose from is all of the deadness level 0 peers, plus with 50%
probability all of the deadness level 1 peers, plus with 25%
probability all of the deadness level 2 peers, and so on.  Deadness
level 2 means that the peer was in deadness level 1, and you chose to
query him (with 50% probability), and he didn't write back again.
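
Here is a minimal sketch, in Python, of that selection rule (the
function name is mine, and this is not the actual Mnet v0.7 code):

    import random

    def eligible_peers(deadness):
        # deadness maps nodeId -> deadness level.  Level-0 peers are
        # always eligible; the whole group of level-k peers is included
        # together with probability 2**-k.
        included = {level for level in set(deadness.values())
                    if level == 0 or random.random() < 0.5 ** level}
        return [nid for nid, level in deadness.items() if level in included]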

I.C.  Lookup of Peer Contact Info.  When you want to find the current
contact info (i.e. current IP address+port number, or current Relay
Server) for a peer, you send a query to a certain number of other peers
(called "MetaTrackers" when they are serving this purpose) -- a "lookup
contact info" query.  How many peers?  Approximately log(N) where N is
the number of peers in your local peer database.  Which peers?  You use
the Chord distribution -- you query the peer closest to your target,
the one closest to the point halfway around the circle from your
target, the one closest to the point a quarter of the way around the
circle, et cetera.

Obviously, you need to publish your current contact info to *your*
MetaTrackers whenever you join the network or whenever your contact
info changes.  Your MetaTrackers are the peer in your db which is
closest to you, the peer in your db which is closest to the point
halfway around the circle, etc.
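
Here is a sketch of how the MetaTrackers for a given target could be
chosen.  I measure "closest" as absolute distance around the id circle;
whether the real code measures it clockwise is an implementation
detail:

    import math

    def metatrackers(target_id, known_ids, id_bits=160):
        # Query points: the target itself, the point halfway around the
        # id circle from it, the point a quarter of the way around, and
        # so on -- roughly log2(N) points for N known peers.  For each
        # point, pick the known nodeId closest to it.  Assumes known_ids
        # is non-empty.
        modulus = 2 ** id_bits
        n_points = max(1, int(math.log2(len(known_ids))) + 1)
        points = [(target_id + modulus // 2 ** i) % modulus
                  for i in range(n_points)]

        def distance(a, b):
            d = abs(a - b)
            return min(d, modulus - d)

        chosen = []
        for point in points:
            nearest = min(known_ids, key=lambda nid: distance(nid, point))
            if nearest not in chosen:
                chosen.append(nearest)
        return chosen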

I.D.  Discovery of New Peers.  Rate-Limited Structured Flood.  Every 60
minutes, you send an update to each of your MetaTrackers.  That update
contains the list of the peerIds of every new peer.  A "new peer" for
this purpose is defined as follows: you've never before announced this
peer to MetaTrackers, and this peer currently has deadness level 0 in
your local db.  If the complete (compressed) message containing the
information about the new peers would exceed 256 KB, then select a
random subset of the new peers to announce in this announcement so that
the message doesn't exceed 256 KB.  This is a rate-limited structured
flood.  The flooding nature of it means that eventually you will find
out about the arrival of every new peer.  The structured nature of it
means that it will take only log(N) time intervals before you find out,
and you will receive only (;-)) log(N) separate notifications of the
arrival of a new peer.  (And you will send log(N) separate
announcements of new peers -- one announcement to each of your
MetaTrackers.)  The rate-limited nature of it means that if new peers
are arriving so fast that these notifications would take more than
log(N)*256 KB per hour of bandwidth, then instead they take up only
log(N)*256 KB per hour and it takes longer for you to find out about
the arrival of every new peer.
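
Here is a sketch of building one such hourly announcement; serialize()
stands in for whatever wire encoding is actually used:

    import random
    import zlib

    MAX_ANNOUNCEMENT = 256 * 1024

    def build_announcement(new_peers, serialize):
        # new_peers: contact records of peers we have never announced
        # before and which are at deadness level 0 right now.  If the
        # compressed message would exceed 256 KB, announce only a random
        # subset that fits.
        candidates = list(new_peers)
        random.shuffle(candidates)
        while candidates:
            payload = zlib.compress(serialize(candidates))
            if len(payload) <= MAX_ANNOUNCEMENT:
                return payload, candidates
            # too big: drop a random tenth and try again
            del candidates[:max(1, len(candidates) // 10)]
        return zlib.compress(serialize([])), []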

I.E.  Relay.  You choose some peer which you can make a TCP connection
to and appoint it to be your RelayServer.  How you choose it, and
dynamically update your choice, is complicated -- see the
implementation in RelayClient.py.  Whenever someone wants to send a
message to you and they find it impossible to open a TCP connection to
you (which is all the time when you are behind a NAT or firewall that
prevents incoming TCP connections) then they send the message to your
RelayServer instead.  See also [2, 3, 4].
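
The fallback itself is simple even though the choice of RelayServer is
not.  Here is a sketch using raw sockets with no real EGTP framing --
just the try-direct-then-relay logic:

    import socket

    def send_message(peer_addr, relay_addr, message, timeout=10):
        # peer_addr and relay_addr are (host, port) tuples.  Try a
        # direct TCP connection first; if that fails (e.g. the peer is
        # behind a NAT or firewall), send via the RelayServer instead.
        for target in (peer_addr, relay_addr):
            try:
                with socket.create_connection(target, timeout=timeout) as sock:
                    sock.sendall(message)
                    return target
            except OSError:
                continue
        raise ConnectionError("peer unreachable directly and via relay")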

II.  Filesystem.

II.A.  Encoding of a file.  This is already described fully and
succinctly in [5].  Here is a capsule summary:  1.  Erasure code the
file, 2.  Encrypt the blocks, 3.  Put the list of the Ids of the
encrypted blocks into a new file named the inode, 4.  Encrypt the
inode, 5.  The Id of the inode, combined with the secret key used for
encryption, forms the mnetURI of the file.
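
Here is a sketch of those five steps; erasure_encode() and encrypt()
stand in for the actual codec and cipher, and SHA-1 stands in for the
real Id scheme:

    import hashlib
    import json

    def block_id(data):
        return hashlib.sha1(data).hexdigest()

    def encode_file(plaintext, erasure_encode, encrypt, key):
        # 1. erasure code the file
        shares = erasure_encode(plaintext)
        # 2. encrypt the blocks
        blocks = [encrypt(key, share) for share in shares]
        # 3. put the Ids of the encrypted blocks into a new "inode" file
        inode = json.dumps([block_id(b) for b in blocks]).encode()
        # 4. encrypt the inode
        enc_inode = encrypt(key, inode)
        # 5. the inode's Id plus the secret key form the mnetURI
        mnet_uri = (block_id(enc_inode), key)
        return mnet_uri, enc_inode, blocks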

II.B.  Push each block to the BlockServer which has a nodeId closest
to the blockId (in the Chord metric).

II.C.  In order to download a file, given its mnetURI, you first have
to download the inode, and then you have to download a sufficient
number of the erasure coded blocks.  For each block that you want to
download, you query the BlockServer whose Id is closest to that block's
Id.  If he doesn't have it, then you ask the BlockServer whose Id is
the next-closest.  Etc.  If the file is not actually reconstructable
because there are not enough blocks present anywhere on the network,
then this search algorithm devolves to a broadcast search.
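
Here is a sketch of the server selection in II.B and II.C; fetch()
stands in for the actual block request, and I measure closeness as
clockwise distance on the id circle, which may differ in detail from
the real code:

    def download_block(wanted_id, block_servers, fetch, id_bits=160):
        # Query BlockServers (given as nodeIds) in order of closeness of
        # their nodeId to wanted_id; stop at the first one that has the
        # block.  If no server has it, we have in effect performed a
        # broadcast search.
        modulus = 2 ** id_bits

        def clockwise(node_id):
            return (node_id - wanted_id) % modulus

        for server_id in sorted(block_servers, key=clockwise):
            data = fetch(server_id, wanted_id)
            if data is not None:
                return data
        return None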

This concludes the basic description of the design of Mnet v0.7+.
There are other aspects of the design that are not included here,
including a metadata search facility invented by Myers Carpenter.  In
addition, there are many specifics or added features in the core
network+filestore that are excluded from this description for brevity.

 *** Discussion:

III.  A Nice Slow Network.  The deliberate pace of announcement of new
peers is a feature, not a bug.  For starters it is convenient to
implement, because for various reasons it is easier to efficiently
manage 256 KB every hour than 72 bytes every second even though they
amount to the same bandwidth over the long term.

However, beyond it being convenient, it is also useful for
discriminating among our peers, because we don't want to start using
new peers until they've been around for a while anyway.  Fresh peers
are more likely to disappear than older ones.  In earlier versions of
this system, we heard about new peers more quickly, and then we had to
add logic to avoid using them until time had passed.

However, beyond it being convenient and useful, it is also desirable to
me, because I wanted Mnet to be more amenable to deliberate and
long-term use than to urgent and impulsive use.  Brad Templeton argued
to me in the year 2002 that if a distributed filesystem were primarily
for acquiring novelty videos, pop songs, and porn, then instant
gratification would be important, but if it were primarily for
acquiring classic works of literature, political and historical
documents, or backup copies of your own data, then instant
gratification isn't so important.  His argument made an impression on
me.

IV.  Scalability Issues and other problems.

IV.A.  Size of the local Peer DB.  One limitation on scale is the size
of the local peer db.  For a modernish desktop machine, the local peer
db could probably be on the order of 2^20 entries with no noticeable
problem.  The limit could probably be made much higher by optimizing
the db.  The current db is trivial -- it consists of a Python dict in
memory which gets serialized by Python pickle and then written to a
file every hour or so.

IV.A.i.  In practice, almost all peers which fall into high deadness
levels will never revive.  If you purge a peer from your local db and
then that peer does revive and reconnect to the network, and if he
sends you a request, then you will re-add him to your db just as if he
were a brand new peer.  So if your local peer db is too big, you could
purge peers, starting with the most dead ones.
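
Here is a sketch of that trivial arrangement, together with the
purge-from-most-dead idea; the file name and purge policy are
placeholders of mine:

    import pickle

    def save_peer_db(deadness, path="peerdb.pickle"):
        # deadness maps nodeId -> deadness level; dumped wholesale every
        # hour or so, as in the current trivial implementation.
        with open(path, "wb") as f:
            pickle.dump(deadness, f)

    def purge_most_dead(deadness, max_entries):
        # Drop the most-dead peers until the db fits in max_entries.
        excess = len(deadness) - max_entries
        if excess <= 0:
            return
        for node_id in sorted(deadness, key=deadness.get, reverse=True)[:excess]:
            del deadness[node_id]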

IV.B.  Announcement of New Peers.  Each newly arrived peer will be
announced to each current peer.  If the rate of arrivals of new peers
is sufficiently small, then it will take log(N) time intervals for all
peers to learn about the new peer.  My estimate is that "sufficiently
small" is about 2000 new peers per hour, with the current
implementation.  It could be doubled or quadrupled by better compression
of the announcement messages.  If the rate of arrival of new peers
exceeds this, then the time before all peers have heard about the new
peer increases proportionally to the rate.
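
As a back-of-the-envelope check on the 2000-per-hour figure (the ~130
bytes per compressed entry is an assumption of mine, not a
measurement):

    BYTES_PER_ENTRY = 130            # assumed compressed size of one entry
    BUDGET = 256 * 1024              # one announcement per hour
    print(BUDGET // BYTES_PER_ENTRY) # ~2016 new peers announced per hour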

IV.C.  Block store churn.  There is currently no way for a block server
to indicate that its store of blocks is full and that it refuses to
accept new blocks being pushed into it, so you should give them to
someone else.  Absent this "back pressure", the stores of nodes with
small storage capacity are likely to get flushed out by new
publications even while the stores of nodes with large capacity are
underutilized.  This inefficient distribution reduces the half-life of
files by some unknown constant.

IV.D.  Did you say "BROADCAST SEARCH"?  Well, consider what would
happen if you stopped searching for a block after your first query.
Then we would have a similar likelihood of finding a block as in e.g. the
Chord File System, with minimal message complexity -- a single query.
The problem is that maybe the block isn't on that node but is on a
nearby node (either because that node joined the network after the
block was published, or because the publisher of the block wasn't yet
aware of that node when he published the block).  This problem can be
solved in one of three ways:

1.  When a new node arrives, it requests copies of all of the relevant
blocks from the node(s) that it shadows.  This is a bad idea for this
system.
2.  As nodes arrive and leave, they keep track of which other nodes
they shadow, then they forward requests which might be satisfied by
other nodes.  Interesting possibility, but I couldn't figure out how to
do it in a way that wouldn't be very complicated and still end up
falling back, in the worst case, to the third approach:
3.  When a downloader doesn't find the desired block on the server
closest to the blockId, it chooses whether to broaden the search and
query other servers that are further from the blockId.

The interesting thing about 3 is that the decision about how broad to
make the search is made by the agent who has the most information about
its importance (such as whether other erasure coded blocks have been
found that can replace the missing block, or whether the user considers
downloading this file important enough to impose additional burden on
the network).  This same agent that has the most information is also
the one that has the incentive to want the download to succeed, so it
is this agent who should choose whether or not to impose the additional
burden on the network of a broader search.  See also section V --
"Persistent Tit-for-Tat == Bilateral Accounting".

V.  Persistent Tit-for-Tat == Bilateral Accounting.  Mnet v0.7+ lacks
something -- an incentive mechanism to limit excessive costs imposed on
the network and simultaneously to motivate users to contribute
resources to the network.  I was always deeply impressed with the
simplicity and robustness of Bram Cohen's "Tit-for-tat" incentive
mechanism for BitTorrent.  Suppose we wanted a similar "tit-for-tat"
mechanism for Mnet v0.7++, but we wanted the peer relationships to
extend through time and across multiple files and multiple user
operations.  Then we would probably invent a bilateral accounting
scheme for each node to keep track of how much goodness each of its
peers has done for it, and to reward helpful peers.  This would then
turn out to be more or less identical to the "bilateral accounting"
scheme that was originally invented by Jim McCoy and Doug Barnes in
Mojo Nation [footnote *].
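
Here is a sketch of what the per-peer bookkeeping might look like, in
abstract "goodness" units -- not Mojo Nation's actual scheme:

    from collections import defaultdict

    class BilateralLedger:
        # Each node tracks, for each peer, the goodness that peer has
        # done for it minus the goodness it has done for that peer, and
        # serves the peers with the best balance first.
        def __init__(self):
            self.credit = defaultdict(float)    # peerId -> net goodness

        def they_helped_us(self, peer_id, amount):
            self.credit[peer_id] += amount

        def we_helped_them(self, peer_id, amount):
            self.credit[peer_id] -= amount

        def serve_order(self, requests):
            # requests: list of (peer_id, request) pairs
            return sorted(requests,
                          key=lambda r: self.credit[r[0]],
                          reverse=True)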


 *** Acknowledgements

I know that I am doomed before I start to accidentally exclude
important and deserving people from this section.  Sorry.  This is in
roughly chronological order of their earliest contributions to this
design, as far as I can remember.

Obviously this design owes a great debt to the original designers of
its direct ancestor Mojo Nation: Jim McCoy and Doug Barnes, as well as
to the Evil Geniuses -- especially Greg Smith, Bram Cohen, and Drue
Lowenstern.  Also: Raph Levien, Sergei Osokine, the Freenet folks --
Ian Clarke, Oskar Sandberg, and Adam Langley among others -- Justin
Chapweske of Swarmcast, Brandon Wiley, Martin Peck, the Chord folks,
the Pastry folks, the CAN ("Content-Addressable Network") folks, the
OceanStore folks, The Mnet Hackers [6] -- Hauke Johannknecht, Jukka
Santala, Myers Carpenter, Oscar Haegar, Arno Waschk, Luke Nelson --
Mark S. Miller, Bram Cohen again (and Drue Lowenstern again) with
BitTorrent, the Kademlia folks, numerous contributors to the
p2p-hackers mailing list and the #p2p-hackers IRC channel.  Like I said
-- sorry about those omissions.

[footnote *]  Once upon a time there was digital cash, as pioneered by
David Chaum.  When Jim McCoy and Doug Barnes invented Mojo Nation, they
used digital cash, and added bilateral accounting so that a pair of
peers wouldn't require a transaction with a central token server in
order to incentivize each other.  The first three employees they hired
to implement Mojo Nation in 1999 were Greg Smith, Bram Cohen, and
myself.  (Greg might have started in 1998 -- I'm not sure.)  Several of
the ideas in BitTorrent -- which Bram started writing in 2001 -- can be
understood as radical simplifications of ideas in Mojo Nation.  One
such perspective is to think of BitTorrent's tit-for-tat incentives as
being time-limited, file-specific, and non-transferrable bilateral
accounting.  This is not condemnation of BitTorrent's ideas, but praise
of them -- they demonstrate the virtue of radical simplification.

[1] http://mnetproject.org/
[2] http://mnetproject.org/repos/mnet/doc/network_overview.html
[3] http://mnetproject.org/repos/mnet/doc/EGTPv1_Architecture.txt
[4] http://mnetproject.org/repos/mnet/doc/messages_overview.html
[5] http://mnetproject.org/repos/mnet/doc/new_filesystem.html
[6] http://mnetproject.org/repos/mnet/CREDITS
