... I was under the impression that the only documents that most web crawlers will search are documents that are link-accessible. Are you saying that this isn't true? Are you saying that Alta-Vista will search EVERYTHING that's publicly accessible, whether by anonymous FTP or web?
Don't archie servers already pick up anonymous ftp fairly well? Also, aside from the no-robots conventions, you can build a CGI program for access to files, which may be more effective at blocking searches while still preserving access. And it wouldn't be hard for a web crawler to follow ftp links, as long as the root of an anon-ftp site is pointed to by a URL somewhere.

#--
#               Thanks; Bill
# Bill Stewart, stewarts@ix.netcom.com, Pager/Voicemail 1-408-787-1281
#
# "Eternal vigilance is the price of liberty" used to mean us watching
# the government, not the other way around....
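The "no-robots convention" Bill refers to is the robots exclusion file, a plain-text file served from a site's root that cooperating crawlers fetch before indexing. A minimal sketch, with hypothetical directory names chosen purely for illustration:

```
# /robots.txt, served from the root of the site.
# User-agent names which crawler the rules apply to; * means all of them.
# Disallow lists path prefixes a cooperating crawler should not fetch.
User-agent: *
Disallow: /private/
Disallow: /list-archive/
```

Note that this is purely advisory: a crawler that ignores the file can still fetch everything linked, which is why gating access behind a CGI program, as suggested above, is the stronger measure.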
... I was under the impression that the only documents that most web crawlers will search are documents that are link-accessible. Are you saying that this isn't true? Are you saying that Alta-Vista will search EVERYTHING that's publicly accessible, whether by anonymous FTP or web?
I'm not sure about Alta-Vista, but most spiders just follow the Web using some sort of graph search algorithm: pages are nodes and links are directed edges. If a page is not linked from anywhere, I don't see how a spider could find it. But you might be surprised at how quickly links to your pages can be made, in unexpected ways.

Before Alta-Vista went online, I set up an archive of a private mailing list for a class, put it on the web, and figured obscurity would keep it safe. Within six hours of putting the page online and emailing my class about it, the Alta-Vista spider had found it. Maybe that six hours was just random chance, but I was pretty impressed. I still don't know how the spider found it - my guess is someone had made a Netscape bookmark to my page and put their bookmark file online.

All the spiders and Usenet search engines mean is that the haystack is becoming easier to search for needles. The Web and Usenet are fundamentally public media - a spider has as much right to index your pages as JoeBob has to make a bookmark to them. The good thing is that these spiders are fundamentally useful critters; Alta-Vista is about to replace Yahoo as my preferred way to find things.

See http://www.santafe.edu/~nelson/hugeweb.html for a little thought I had one evening.
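The graph-search behavior Nelson describes can be sketched in a few lines. This is a toy model, not any real spider: the page names are made up, and an in-memory dictionary stands in for the Web so the example stays self-contained. The point is that a page nothing links to is simply unreachable by link-following.

```python
from collections import deque

# Toy link graph: pages are nodes, links are directed edges.
# All page names here are hypothetical, purely for illustration.
LINKS = {
    "home.html":   ["about.html", "papers.html"],
    "about.html":  ["home.html"],
    "papers.html": ["home.html", "hugeweb.html"],
    "hugeweb.html": [],
    # secret.html links OUT to other pages, but nothing links TO it.
    "secret.html": ["home.html"],
}

def crawl(seed):
    """Breadth-first traversal of the link graph from a seed page."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

indexed = crawl("home.html")
# secret.html is never reached: a spider that only follows links
# cannot discover a page with no inbound links.
```

This also illustrates Nelson's warning: the moment anyone publishes a link to the "hidden" page (say, an online bookmark file), it joins the reachable graph and gets indexed on the next sweep.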
Bill mentions 'archie'; it's interesting to note that the problem of material that wasn't supposed to be public turning up in archie listings dates back to at least 1991. Among the problems were hosts of the form ftp.<foo>.com which ran anonymous ftp but weren't supposed to be public, and files put up on such sites but not announced, usually by support people transferring a file to some customer, which then got picked up in the sweep.

Then of course there was the time in 1993 when someone left a world-writable directory on the X Consortium web site, into which someone uploaded 300Mb of pornographic jpegs. This happened over the weekend, so they had a nice long chance to sit there while all the mirror sites happily duplicated them. If it was your turn to be archied whilst those files were there, you were in the database until your next sweep. All those horny net geeks who found the directories empty would then send plaintive messages asking where the files were, and how to join the gif club.

Simon

(defun modexpt (x y n)
  "computes (x^y) mod n"
  (cond ((= y 0) 1)
        ((= y 1) (mod x n))
        ((evenp y) (mod (expt (modexpt x (/ y 2) n) 2) n))
        (t (mod (* x (modexpt x (1- y) n)) n))))
participants (3)
-
Bill Stewart -
Nelson Minar -
Simon Spero