The hidden, invisible, and private web

Everyone knows that Google and the other search engines between them crawl, spider, and slurp up the whole internet, right? Wrong! The millions of websites that are obviously available on the internet are readily searchable; Google, Bing, Yahoo, and their ilk have seen to that. We can usually find documents, pages, digital images, videos, music, and public scientific datasets at low cost, rapidly and accurately. But that’s just the surface. There are countless resources that are simply inaccessible to search engine bots, not least emails, FTP sites, IRC, and IM.

Then there is the Invisible Web, something about which I first wrote way back in the mid-1990s. The Invisible Web is the term used to describe the contents of publicly accessible databases that are revealed on demand, on a per-user basis, and that are mostly off-limits to search engines.

Definitely off-limits to all public search engines and all members of the public for that matter are private databases, corporate and institutional sites that are locked behind firewalls, passwords, and protective scripts.

However, some owners of chunks of the private web might be amenable to letting trusted users access their private parts. It’s just that the users don’t know the private data is there and the owners don’t know who to trust. Now, Peter Mork and colleagues at the Mitre Corporation in McLean, Virginia, have come up with a way to bring the two parties together. They have developed a way to publicize the existence of private web resources that draws on various summarization strategies to create a database summary, which they call a digest, that then becomes part of the announcement. They have also looked at the trade-off between the data owners’ desire to minimize disclosure of sensitive information and the searchers’ desire to maximize the accuracy of their searches.

As an example of the kind of private web Mork and colleagues are alluding to, imagine a specialist in the spread of flu during an epidemic hoping to trawl medical records to figure out how many people might become infected. These records are strictly off-limits to the general public and to most researchers for that matter. Or, what about an economist hoping to spot trends in stock market dealings to help warn of another credit crunch well before it happens? Again, private deals are…private, so the economist will have no access to that information. On the other hand, anonymized data might be available to help the specialist find data sources relevant to current research. Similarly, summarized data can point the economist to data relevant to the inquiry. But these data can only be utilized if they can be found.

Mork and colleagues’ digest approach allows data owners to publish less sensitive versions of their data so that searchers can determine with which data owners they should negotiate access. In this way, the private web maintains its privacy, while becoming a little more searchable, thereby allowing researchers to spend more time doing research and less time struggling to find data.
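
To make the idea concrete, here is a minimal Python sketch of what publishing and searching a digest might look like. It is not the method from the paper: the function names, the count-based summary, and the `min_count` suppression threshold are all assumptions made purely for illustration. Each hypothetical owner publishes only the values that occur often enough in its private records, and a searcher ranks owners by how well their digests match a query.

```python
from collections import Counter


def build_digest(records, field, min_count=2):
    """Summarize one field of a private dataset as value counts.

    Values seen fewer than min_count times are left out of the digest,
    an assumed and very crude way of limiting disclosure.
    """
    counts = Counter(record[field] for record in records)
    return {value: n for value, n in counts.items() if n >= min_count}


def rank_owners(digests, query_terms):
    """Return owners whose digests match the query, best match first."""
    scores = {
        owner: sum(digest.get(term, 0) for term in query_terms)
        for owner, digest in digests.items()
    }
    matching = [owner for owner, score in scores.items() if score > 0]
    return sorted(matching, key=lambda owner: scores[owner], reverse=True)


# Hypothetical private datasets held by two owners.
clinic_a = [
    {"diagnosis": "influenza"},
    {"diagnosis": "influenza"},
    {"diagnosis": "asthma"},
]
clinic_b = [
    {"diagnosis": "fracture"},
    {"diagnosis": "fracture"},
    {"diagnosis": "influenza"},
]

# Only the digests are published; the records themselves stay private.
digests = {
    "clinic_a": build_digest(clinic_a, "diagnosis"),
    "clinic_b": build_digest(clinic_b, "diagnosis"),
}

# The searcher learns which owners are worth approaching for access.
ranked = rank_owners(digests, ["influenza"])
print(ranked)
```

In this toy example only clinic_a is returned, even though clinic_b also holds a flu record, because that single record was suppressed from clinic_b's digest. Lowering `min_count` to 1 would surface both owners at the cost of revealing more, which is the disclosure-versus-accuracy trade-off in miniature.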

Peter Mork, Ken Smith, Barbara Blaustein, Christopher Wolf, Ken Samuel, Keri Sarver, & Irina Vayndiner (2010). Facilitating discovery on the private web using dataset digests. International Journal of Metadata, Semantics and Ontologies, 5 (3), 170-183.