This report is also available as an Acrobat file.
Contents
Approaches to Wide Area Indexing
Martijn Koster
There are a number of approaches which can be taken to indexing. These
include:
- manual indexing
- robot assisted indexing
- manual distributed indexing
- automated distributed indexing
Manual Indexing
Manual indexing includes both personal hotlists and public hotlists. Both
have major problems. Personal hotlists tend to be neither comprehensive
nor up to date, and they accumulate references which have become
outdated. Public hotlists have a low signal to noise ratio, and
permission needs to be sought to update or remove information.
Robot Assisted Indexing
An example of this is Lycos. On the positive side, these tools provide
automatic indexing. However, they do have problems. They tend to
overload the network and/or the host. They can give the person searching
for information the wrong impression of the resources. They can also
return too much information. In addition, the indexing is centralised.
Manual Distributed Indexing
An example of this is ALIWEB which is described as:
"ALIWEB is a system that automatically combines distributed
WWW server descriptions into a single searchable database.
ALIWEB basically does for the WWW what veronica does for
gopher or Archie does for anonymous FTP. Because the original
server descriptions are maintained by server administrators, the
information is likely to be correct and up-to-date. It also uses a
special format that makes the results look very concise."
ALIWEB is a public service run by NEXOR. See
http://web.nexor.co.uk/public/aliweb/aliweb.html
ALIWEB has a number of advantages. It is simple and cheap. It has high
quality summarising. There are likely to be fewer stale references. On the
negative side, it still needs manual effort and it relies on centralised or
mirrored searching.
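The idea of combining server descriptions into one searchable database can
be sketched in a few lines of Python. This is only an illustration under
stated assumptions: the record fields (title, uri, description), the
example.* hosts and the function names are hypothetical and are not
ALIWEB's actual file format or code, and the descriptions are held in
memory rather than fetched from each server.

```python
from collections import defaultdict

# Hypothetical per-server descriptions, as each server administrator might
# maintain them. Field names are illustrative, not ALIWEB's real format.
SERVER_DESCRIPTIONS = {
    "http://server-a.example/": [
        {"title": "Campus Map", "uri": "/map.html",
         "description": "Map of the campus buildings"},
    ],
    "http://server-b.example/": [
        {"title": "Library Catalogue", "uri": "/catalogue.html",
         "description": "Search the library catalogue"},
        {"title": "Visitor Guide", "uri": "/visit.html",
         "description": "Campus visitor guide and map"},
    ],
}


def build_index(descriptions):
    """Combine every server's records into one word -> records index."""
    index = defaultdict(list)
    for server, records in descriptions.items():
        for record in records:
            entry = {
                "title": record["title"],
                "url": server.rstrip("/") + record["uri"],
                "description": record["description"],
            }
            text = record["title"] + " " + record["description"]
            # Use a set so a record is listed once per distinct word.
            for word in set(text.lower().split()):
                index[word].append(entry)
    return index


def search(index, term):
    """Return the sorted titles of records that contain the term."""
    return sorted(entry["title"] for entry in index.get(term.lower(), []))


index = build_index(SERVER_DESCRIPTIONS)
print(search(index, "map"))
```

Because the records are written by the people who run each server, the
quality of the index depends on their effort, which is the manual cost
noted above. The single combined index is also the centralised part of the
design.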
Automated Distributed Indexing
An example of this is Harvest.
See http://harvest.cs.colorado.edu/
Harvest is an integrated set of tools to gather, extract, organise, search,
cache and replicate information across the Internet. It is therefore designed
to help users find information as well as to help manage it.
Harvest has a number of advantages, not the least of which is that it is
available. It is automatic, extensible and scalable. It has the potential
to offer a general search interface, automated summarising and distributed
searching. On the negative side, it is complex.
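As a rough illustration of what automated summarising and distributed
searching could mean in practice, the Python sketch below lets each site
summarise its own documents and answers a query by asking every site in
turn and merging the results. It is a toy model under stated assumptions:
the site names, documents and function names are hypothetical, and nothing
here reflects Harvest's real components or interfaces.

```python
# Hypothetical documents held at two sites; the names and URLs are made up.
SITE_DOCUMENTS = {
    "site-a": {
        "http://a.example/net.html":
            "Notes on network indexing and how robots load servers",
    },
    "site-b": {
        "http://b.example/faq.html":
            "Indexing questions answered for server administrators",
        "http://b.example/news.html":
            "Site news and announcements for this month",
    },
}


def summarise(text, max_words=4):
    """Naive automated summary: keep only the first max_words words."""
    return " ".join(text.split()[:max_words])


def build_site_index(documents):
    """Each site summarises its own documents locally (url -> summary)."""
    return {url: summarise(text) for url, text in documents.items()}


def search_site(site_index, term):
    """Search one site's summaries for an exact word match."""
    wanted = term.lower()
    return [url for url, summary in site_index.items()
            if wanted in summary.lower().split()]


def distributed_search(site_indexes, term):
    """Send the query to every site and merge the (site, url) answers."""
    hits = []
    for site, site_index in site_indexes.items():
        hits.extend((site, url) for url in search_site(site_index, term))
    return sorted(hits)


site_indexes = {site: build_site_index(docs)
                for site, docs in SITE_DOCUMENTS.items()}
print(distributed_search(site_indexes, "indexing"))
```

In this arrangement no single host has to fetch and index every document.
The price is extra machinery for sending queries out and merging the
answers, which hints at why such a system is more complex.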
In Summary
We need indexing tools which are automated, and the solutions need to be
distributed ones. We need to adjust people's expectations so that they
understand the reality of the problems and the available solutions.
Harvest may offer a solution. We also need to accompany the use of tools
with high level manual resources which complement what we can achieve
automatically.