cypherpunks-legacy archives 1992-2013

Greg Newby gbnewby at pglaf.org
Tue Dec 10 13:46:37 PST 2019


Back in July there was discussion about preserving the available archives. I have now made the archives from 1992-2013 browsable via Mailman, and placed copies of the mbox files there as well.

Visit them here:
 https://lists.cpunks.org/pipermail/cypherpunks-legacy/

The main info page, which links to the sources and a description of what was involved in getting them into Mailman, is here:
 https://lists.cpunks.org/mailman/listinfo/cypherpunks-legacy

Finally, here is the README-processing file that describes what I did, and why I decided not to try to merge those older archives into the current list archives at https://lists.cpunks.org/pipermail/cypherpunks/. Suggestions welcome.

--------
Processing of Cypherpunks Archives

Available archives of the Cypherpunks email list are incomplete, and
in fact there is evidence they have been tampered with and/or redacted
over the years. 

This project was to do some basic clean-up of available archives,
which are in mbox format, so that they could be ingested and be viewed
within Mailman.

The archives contain many poorly formed messages, and Mailman defaults
to the current date (December 2019, at the time of this writing) when
it encounters a problem with the date. So, a separate
'cypherpunks-legacy' list was deployed to make the archives available
without overlapping with the current active 'cypherpunks' list, which
goes back to July 20 2013. Otherwise, the legacy archive messages
would have been peppered into the current archives, in ways that would
be difficult to predict or undo.

There were especially many anomalies in the larger source, spanning
1999-2015. In addition to many poorly formed messages (i.e., messages
that, in one way or another, could not be cleanly ingested with the
Mailman 'arch' command), there were invalid dates, and lines that had
an errant "From " at the start.

To ready the sources for ingestion into Mailman, two automated tools
were used, followed by some ad hoc edits and changes:

1. 'sortmbox.py' uses a Python library to put messages in by-date
order. This proved to be less confusing to Mailman (i.e., fewer messages
were inserted into the current month).

2. 'cleanarch' (/var/lib/mailman/bin/cleanarch) is part of the Mailman
package. It fixes errant "From " entries at the start of lines.

3. I also used 'sed' to replace invalid dates with valid ones that
were in the same ballpark day+time as the message. I found these either
when 'arch' complained (such as for dates before the Unix epoch), or
when Mailman was showing messages in the future:

's/ 0101 / 1999 /'
's/ 0102 / 1999 /'
's/Thu Dec 31 22:40:39 1903/Thu Jul  5 22:40:39 2018/g'
's/Jan 1904/Jul 2018/'
's/Date: Sun, 1 Apr 2029 03:07:16 +0200/Date: Sat, 31 Mar 2001 15:59:46 -0800/'
's/Date: Fri, 3457 Jan 4 61400:2064:61300 +0200/Wed May 29 15:00:02 2013/'

There might have been a few other small edits made within the files,
which I didn't record, simply to help Mailman to do a better job of
creating browsable archives.

4. I then concatenated all the mbox files (one each for 1992-1998, plus
one larger file for 1999-2015), and reran sortmbox.py and cleanarch.

5. I edited the resulting file to remove everything after the new list
archives were set up, on July 20 2013.
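
The by-date sorting in step 1 can be sketched roughly as follows. This
is a minimal illustration using Python's standard 'mailbox' module, not
the actual sortmbox.py, which is not reproduced here. The function
names, and the choice to sort undated or unparseable messages to the
front, are assumptions made for the example.

```python
import email.utils
import mailbox
from datetime import datetime, timezone

# Messages with a missing or unusable Date header sort to this point
# (an assumption for this sketch; another policy may suit better).
FALLBACK = datetime.fromtimestamp(0, tz=timezone.utc)


def message_date(msg):
    """Return a timezone-aware datetime for a message's Date header."""
    raw = msg.get("Date")
    if raw is None:
        return FALLBACK
    try:
        parsed = email.utils.parsedate_to_datetime(str(raw))
    except (TypeError, ValueError, OverflowError):
        return FALLBACK
    if parsed is None:
        return FALLBACK
    if parsed.tzinfo is None:
        # Treat naive dates as UTC so all values are comparable.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def sort_mbox(src_path, dst_path):
    """Write the messages of src_path to dst_path in by-date order."""
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    try:
        # Loads every message into memory; sorted() is stable, so
        # messages with equal dates keep their original order.
        messages = sorted(src, key=message_date)
        for msg in messages:
            dst.add(msg)
        dst.flush()
    finally:
        dst.close()
        src.close()
    return len(messages)
```

Because this sketch holds the whole mbox in memory, a file of several
hundred megabytes would need a machine with RAM to spare, or a
two-pass approach that sorts message offsets instead.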

To do the work above, I created a temporary Mailman list, and repeatedly
used the 'arch' command to ingest the archives and fix problems. This
was an iterative process.
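
One way to find suspect dates ahead of each 'arch' run is to scan the
Date headers and flag anything outside a plausible window. The sketch
below is an illustration only, not a tool that was used for this
project. The window boundaries (1992 through July 2013) are taken from
the description above, and the names 'classify_date' and 'audit_mbox'
are hypothetical.

```python
import email.utils
import mailbox
from datetime import datetime, timezone

# Plausible window for this archive, assumed from the text above.
EARLIEST = datetime(1992, 1, 1, tzinfo=timezone.utc)
LATEST = datetime(2013, 7, 21, tzinfo=timezone.utc)


def classify_date(raw):
    """Label a raw Date header value as ok, missing, unparseable, etc."""
    if raw is None:
        return "missing"
    try:
        parsed = email.utils.parsedate_to_datetime(str(raw))
    except (TypeError, ValueError, OverflowError):
        return "unparseable"
    if parsed is None:
        return "unparseable"
    if parsed.tzinfo is None:
        # Treat naive dates as UTC so the comparisons below are valid.
        parsed = parsed.replace(tzinfo=timezone.utc)
    if parsed < EARLIEST:
        return "too early"
    if parsed > LATEST:
        return "too late"
    return "ok"


def audit_mbox(path):
    """Yield (index, raw Date header, verdict) for each suspect message."""
    box = mailbox.mbox(path, create=False)
    try:
        for index, msg in enumerate(box):
            raw = msg.get("Date")
            verdict = classify_date(raw)
            if verdict != "ok":
                yield index, raw, verdict
    finally:
        box.close()
```

Each flagged header can then be inspected by hand and corrected with a
targeted 'sed' expression like the ones listed in step 3.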


Once 'arch' was giving sane output, I created a new Mailman list,
'cypherpunks-legacy'. I put the single unified + fixed mbox file where
Mailman tools would find it:
  /var/lib/mailman/archives/public/cypherpunks-legacy.mbox/cypherpunks-legacy.mbox

At this point, I could use Mailman to slurp the mbox file in, and
create the browsable structure.

The 'arch' command:
  /var/lib/mailman/bin/arch --wipe cypherpunks-legacy

This served to populate the list archives, which are browsable here:
  https://lists.cpunks.org/pipermail/cypherpunks-legacy

The single large mbox file resulting from the steps above is linked at
the top of the Archives page. Here is a direct link:
  https://lists.cpunks.org/pipermail/cypherpunks-legacy.mbox/cypherpunks-legacy.mbox (615MB, containing approximately 180149 messages)

Please be aware that Mailman's placement of messages by author, date,
subject and thread - including the correct by-month folder - is not
perfect. Messages sometimes end up in the wrong place, and sometimes
threads are split across different years.

Also note that these mbox files did not include attachments. The
current Mailman archive does include attachments, but this legacy
archive does not. It seems they were not included with the archive
input sources (though it's possible some messages have MIME-encoded
attachments within them).

Anyone interested in doing a serious dig into the archive should also
consult the original mbox files. These can be ingested into any
capable email client program, and viewed as separate messages. They
may be sorted and searched, just like any other email folder.

The same sorts of issues as those described above will likely be
evident in any email client, and different clients may even show
different total message counts. The types of sorting, editing, and
displaying described above would have somewhat different results if a
different toolset were used.
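
To illustrate why message totals can differ between tools, the sketch
below counts message boundaries in an mbox file under two rules: a
loose rule (any line starting with "From ") and a stricter rule (a
"From " line that follows a blank line and looks like an envelope
line). Both rules, the regular expression, and the function name are
assumptions for the example; real clients each apply their own
variant.

```python
import re

# Assumed envelope shape:
#   From <sender> <weekday> <month> <day> <hh:mm[:ss]> <year>
# Envelope lines in the wild vary (some carry a timezone, for
# example), so this pattern is deliberately illustrative.
STRICT_FROM = re.compile(
    rb"^From \S+ +[A-Z][a-z]{2} +[A-Z][a-z]{2} +\d{1,2} "
    rb"+\d{1,2}:\d{2}(:\d{2})? +\d{4}\s*$"
)


def count_messages(path):
    """Return (loose_count, strict_count) of message boundaries."""
    loose = 0
    strict = 0
    previous_blank = True  # the first line of the file may start a message
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b"From "):
                loose += 1
                if previous_blank and STRICT_FROM.match(line):
                    strict += 1
            previous_blank = line.strip() == b""
    return loose, strict
```

A body line such as "From the beginning, ..." that was never escaped
counts as a new message under the loose rule but not under the strict
one, which is one source of the differing totals.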

These archives are freely available, and the effort to make them
available via Mailman is freely given. 

 - gbn

--------

The earlier thread on this concluded as follows:
 Subject: Re: newsflash! cypherpunks mailing list is behind cloudflare-NSA

On Fri, Jul 12, 2019 at 06:34:07PM -0400, grarpamp wrote:
> On 7/12/19, Greg Newby <gbnewby at pglaf.org> wrote:
> > Newsflash! This happened in April, and was announced here:
> >   https://lists.cpunks.org/pipermail/cypherpunks/2019-April/045250.html
> > We have been on Cloudflare's DNS since then for the email lists.
> 
> Use of CF or any other CDN was not mentioned in the announcement,
> whether for DNS, or HTTPS. The entire internet is NSA anyway.
> 
> If CDN for HTTPS, consider multihoming on I2P or Tor
> so users can still access when CDN javascript captcha
> or otherwise arbitrarily blocks them or goes down.
> 
> As to caching bandwidth and archives...
> 
> You really should fork that 335MiB mbox file off now
> or no later than year end, and compress it, and
> then once yearly thereafter, and sign them all.
> People will eventually seed them into IPFS, etc.
> 
> Try using a modern unix compression tool like zstd,
> they are faster, smaller, available for all systems...
> 
> https://github.com/facebook/zstd
> https://facebook.github.io/zstd/
> https://code.fb.com/core-data/zstandard/
> https://en.wikipedia.org/wiki/Zstandard

