USG pulls 'sensitive' info off net

Bill Stewart bill.stewart at pobox.com
Mon Oct 8 08:56:00 PDT 2001


>> > > Must've never heard of caching...
>> > > http://www.latimes.com/news/nationworld/nation/la-100301safe.story
>>
>> > Inevitable next step: Enterprising cypherpunk registers
>> > censoredfedinfo.org, hunts through google's cache, posts everything
>> > there, etc.
>>
>>Note that there are a relatively small number of Googles on the Net.
>
>The trouble with Google and most other spiders is that they cannot access 
>the DBs behind the sites.  Various industry estimates place the amount of 
>data not accessible to crawlers at up to 500x the HTML content.  What's
>needed are open access data mining sites using more sophisticated crawlers 
>like http://telegraph.cs.berkeley.edu/
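
To make that concrete: a plain link-following spider never generates
the form submissions that DB-backed sites require, so reaching that
content means synthesizing the queries yourself.  Here's a rough
Python sketch of the idea; the endpoint and field names are made-up
placeholders, not a real site:

import urllib.parse
import urllib.request

# Hypothetical form target standing in for any DB-backed site;
# an ordinary spider only sees whatever static pages link here.
BASE = "http://records.example.gov/search"

def query_db_backed_site(term):
    """Fetch one 'hidden web' results page by simulating the form
    submission a human would make from the search box."""
    params = urllib.parse.urlencode({"q": term, "page": 1})
    with urllib.request.urlopen(BASE + "?" + params) as resp:
        return resp.read()

# A deep-web crawler iterates over plausible query terms and walks
# the paginated results; none of those pages have static URLs that
# a link-following spider would ever discover.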

More to the point, most spiders cache only text, not images,
so much of the interesting content isn't cached.
I'm not sure how many of them cache PDFs; some PDFs are
searchable and indexable, while others are just bitmaps.
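
If you wanted to sort the searchable PDFs from the bitmap ones in
bulk, a crude byte-level check gets you most of the way.  Sketch
below; compressed content streams can hide the markers, so treat
it as a heuristic, not a verdict:

import sys

def pdf_kind(path):
    """Guess whether a PDF has extractable text or is just scanned
    page images, by grepping the raw bytes for telltale markers."""
    data = open(path, "rb").read()
    has_fonts = b"/Font" in data                # font resources imply real text
    has_images = (b"/Subtype /Image" in data or
                  b"/Subtype/Image" in data)    # raster XObjects imply scans
    if has_fonts:
        return "searchable text"
    if has_images:
        return "probably just bitmaps"
    return "can't tell (streams may be compressed)"

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(name, "->", pdf_kind(name))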

On the other hand, the Feds generally don't have as much
fancy-graphics-design-for-inaccessibility, so more of their text
may be cacheable than at typical business sites.
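
One way to check that hunch is to compare how much visible text a
page carries against how many images it embeds, since a text-only
spider caches roughly the former and drops the latter.  Quick
sketch (the URL below is just an example, pick your own .gov page):

from html.parser import HTMLParser
import urllib.request

class TextRatio(HTMLParser):
    """Tallies visible text and <img> tags as a crude measure of
    how much of a page a text-only spider can actually cache."""
    def __init__(self):
        super().__init__()
        self.text_bytes = 0
        self.images = 0
    def handle_data(self, data):
        self.text_bytes += len(data.strip())
    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1

def cacheable_estimate(url):
    html = urllib.request.urlopen(url).read().decode("latin-1")
    parser = TextRatio()
    parser.feed(html)
    return parser.text_bytes, parser.images

# e.g. compare a government page against a graphics-heavy
# commercial site:
#   print(cacheable_estimate("http://www.epa.gov/"))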

Shutting down web sites with data that terrorists could use
has been going on for a few years; apparently many of the
haz-mat sites are no longer accessible to the public,
including one of the Bay Area sites that was shut down a few
weeks before we had a major refinery fire.  Yes, there are
potential threats to public safety if terrorists can use this
data, but there are more serious threats if the public can't
use it to determine what's near them, and far more serious
threats if fire departments can't access the data conveniently.
