
Re: Spiderspace



>... I was under the impression that the only documents that most web crawlers
>will search are documents that are link-accessible.  Are you saying that this
>isn't true?  Are you saying that Alta-Vista will search EVERYTHING that's
>publicly accessible, whether by anonymous FTP or web?

I'm not sure about alta-vista, but most spiders just follow the Web
doing some sort of graph search algorithm: pages are nodes and links
are directed edges. If a page is not linked anywhere, I don't see how
a spider could find it.
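
To make the graph-search picture concrete, here is a minimal sketch in
Python. It assumes a toy in-memory web: the page names, the toy_web
table, and the crawl function are all made up for illustration, and
this is not a claim about how alta-vista or any real spider is built.

```python
from collections import deque

def crawl(links, seeds):
    """Breadth-first walk over a toy link graph.

    links maps each page to the pages it links to (hypothetical data);
    seeds are the pages the spider already knows about.
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        # Follow every outgoing link (directed edge) from this page.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical pages: "orphan" links out, but nothing links to it.
toy_web = {
    "home": ["about", "papers"],
    "about": ["home"],
    "papers": ["home", "bookmarks"],
    "bookmarks": ["class-archive"],
    "class-archive": [],
    "orphan": ["home"],
}

found = crawl(toy_web, ["home"])
```

Starting from "home", the walk never reaches "orphan", because no
page links to it. It does reach "class-archive", but only because the
"bookmarks" page happens to point there.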

But you might be surprised at how quickly links to your pages can be
made, in unexpected ways. Before alta-vista went online, I set up an
archive of a private mailing list for a class, put it on the web, and
figured obscurity would keep it safe. Within six hours of putting this
page online and emailing my class about it, the alta-vista spider
had found it. Now maybe that six hours was just random chance, but I
was pretty impressed. I still don't know how the spider found it - my
guess is someone had made a Netscape bookmark to my page and had put
their bookmark file online.

All the spiders and Usenet search engines imply is that the haystack
is becoming easier to search for needles. The Web and the Usenet are
fundamentally public media - a spider has as much right to index your
pages as JoeBob has to bookmark them. The good thing
is these spiders are fundamentally useful critters. alta-vista is
about to replace Yahoo as my preferred way to find things. See
  http://www.santafe.edu/~nelson/hugeweb.html
for a little thought I had one evening.