This report is also available as an Acrobat file.

Contents

Discussion Sessions

Technical Issues

Caching

HENSA cache is not as widely known as it might be (http://unix.hensa.ac.uk/). Need to know how to communicate with computer services people to let them know. What mailing lists to use? Is the Web- support list suitable?

Should we use local caches as well as HENSA? Yes, in short term. Longer term national caches and local ones if low bandwidth to site.

How do you decide what to cache? Automatic, depends on expiry times but can be manually affected. CERN server can be configured to check every time - good if pages are frequently updated or can explicitly not cache it. Good documentation with CERN server. With some browsers once you have specified use Proxy, it can be hard to get round. Can manually disable for specific documents. A check box would be better. System manager can use standard UNIX tools to remove cache if short on space.

If there are 5 caches between you, and the document you want is not in any, have to go though all 5 first before going direct. Hit rate is dropping as number of docs on WWW increases, and slowing access time down may cause people to not to use caches. Need national infrastructure with big disks for caching. Need measurements on this.

It might be good if JISC funded a caching strategy - both local and national. Local servers are do-able now, but central servers of chains of servers need investigation. HENSA cache is 2Gig used to have 60% hit rate, but dropping, probable less than 50% now. Disks are relatively cheap - go for 9 Gig disk (Ł3,000) is sensible for local cache with relatively small machine.

Other scenario have multicast - ask does anyone have this file in cache. Problem if all caches answer together - problem with any multicast protocol. Something for the future.

Now - put in place physical and organisational infrastructure. Makes it easier to switch to new strategy.

Brian Kelly's handbook good starting point for new web managers, then look at CERN documentation. Does documentation exists about why its a good thing to cache, why at your site etc.?

Role for funding on going research?

A cache is not going to be able to cache all transactions, especially CGI scripts which are being increasingly used, eg web crawlers which generate large numbers of hits. As well as general caches, identify the services we want to replicate in the UK - useful, good for bandwidth etc.

Which services to replicate? Lycos and HTML checker spring to mind. Find out about use in general academic community, this group not good sample. Need some kind of monitoring. Can't collect logs of individual machines, but if everyone uses caches can analysis cache log. Useful to have some kind of pattern tracking. Lycos - resource discovery - a necessary resource better candidate for replication. Some stats sampling on backbone traffic to get idea of network hotspots. This is already happening in Kent (JISC project checking reponse time - not actually where traffic is going).

But this data will change - may show one thing being used, we mirror this, but 3 months later something else is being used instead. Needs to be on going thing, continually reviewing top-10-20 services. Could have questionnaire/CGI script what services do you use - but tend to get low response to this kind of thing, and the people who do reply may not represent the majority of the users. Anything involving human effort is probably not going to work well, may be better to analysis cache logs. Caching does respond to what is being most used now. Works for plain text files, images etc. Mirroring better for services, but problem keeping them up to date. Given choice caching better than mirroring, but active things, eg search engines, can't be cached therefore have to be mirrored. Could get arrangement to have search software mirrored eg. getting agreement with Lycos to mirror their engine.

Before web, in the UK we tended to use sites like HENSA and src.doc to get docs and files. These services are proven to be useful, and similar resources in the Web is what we need to support now. This type of data doesn't change too often, and lends itself to both mirroring and caching, but mirrors/caches may not be used if other sites in US have better search engines for example.

Hardware

Imperial said they will limit number of web servers, does this mean everyone needs access to the server machine? No, have home pages on separate file store. Best leave this issues to operational staff, a local issue. Vendors generate lots of hype trying to sell powerful machines. What you need depends on your usage. Universities may want figures to say what kind of machine you need, but can't predict this. What we can do is provide guidelines on how to setup machines so that it will be easier to migrate to more powerful machines. Potentially a useful appendix to Brian Kelly's document.

If we have an infrastructure where caches and web masters are talking to each other then will get lots of support from within this. Want guidelines on server hardware and software from experienced web masters. Abstract concepts rather than how to install version N of software X.

Indexing

W30 goal is to have server registry. UK HE community should do this. This registry would be useful. If we do have this what servers do your include? Every server or have some quality control? Have 2 registries one with all servers, one 'quality controlled'. Have list of every department in every university with server, and list with all the other servers also.

From user perspective may want to know eg the biology server of a particular university. This may not be possible, for example the biology server may be part of a main university server. What they may want is information based on subject areas. Problem with this is who will administer it? Library community keen on this, subject librarians. Is there a requirement for both approaches (institutional and subject)? Institutional level may already exist in active maps that are available.

This is a valuable resource. May want to make this an official resource.

Why are people looking for this info. Do they want info on an institution or a particular subject. Both. eg sixth former would want to look at subject area first broken down by institution. Lots of ways of looking at it. Need search engine, but maps are also very useful.

Aliweb, gets description of site from someone at site. Good in theory, but may get inappropriate description. Need this info and need someone managing it.

Network managers of official university servers should complete meta data about server and pass it to infrastructure.

Indexing at lower levels. Harvest for example has some facilities for this. This indexing is something librarians and archivists do (but not on the network). Volume problem as well, what to do with all the info if you get it. It is believed they force their users to index - every server has to be registered and part of registration is agreement to index, but how to do this hasn't been decided.

Whois ++ similar template to aliweb has centroids - kind of text index - possible to generate centroids for web servers, then harvest those and use those as key into HE services. This is just research at the moment. Too early for any recommendations.

Danger with recommendations in indexing, that people may use them, and when new development comes along we will need another short term solution. i.e this is an ongoing situation.

What will the user want to do when searching for info. What do current search tools do eg. may want to search for server name, file name, title, something in header limited by domain etc. Now spider type engines do that sort of thing eg., WWW Worm - doable now and useful. Robots have some fundamental problems, but are liked. Worm, Web Crawler, Lycos etc., are private bits of software. Harvest is more public and freely available and can be configured as a robot. Has the advantage that you could give it a database eg., UK database and let it soft through it.

Would be more acceptable to a robot visit your site that was being funded to provide information to the community.

Definite need for indexing.

One tool will never do everything, but can recommend a current tool to provide a service. Need somebody funded to monitor what is required and available. Need a single interface that is flexible enough to allow you to use it with different tools/search engines. What to be able to direct user to a particular centrally funded site that will let them do the search they need - a general search engine that acts on the parameters the user provides, eg., want to be able to say I want to search n this term in these fields, and this then sends it off to the appropriate engine , rather than having to say I want to do this so I will have to use this particular engine.

Users want to search and get results with good signal/noise ratio. Don't want to build up peoples expectations unrealistically. May use librarians to help with searches, very used to refining searches on bibliographic databases. Need to teach people to use the tools correctly and they have realistic expectations.

Should be able to allow effective searching with the UK. Need to ask people to add bibliographic data to their document, but can't force them. Publications are now being accepted on the network, say that bibliographic (cataloguing etc.) data must be included to get the papers to be accepted.

But will still get lots of documents without cataloguing info. included. Even if do get all this info what do you do with it? Do you leave it distributed or try and collate it all. Technical problems do exist with this. May need to compartmentalise data eg., in library wouldn't search for something in all books, only in a specific area. May be best to organise web searches in same way. Say 'this web server contain the following categories of info'.

Can any current search mechanisms address non-HTML documents? Harvest can do this to some extent. With images may need to provide some text to go with them.

Should information providers be encouraged to put info into their documents? Depends how stable format is - must be general and stable enough. May be too soon to ask for this - wait for URNs etc. to stabilise, then ask for these.

URNs are Internet drafted and are beginning to be implemented. URC is not stable. URNs are an abstract name, so can move around where object actually is. URN is key into URC which is a collection of attributes and values, but what the attributes and values are has not been established. URCs would be cached, so would probably include some kind of expiry date.

CERN daemon can handle URNs and will cache URCs. (Not standard version of CERN server).

Recommendations

Now

Education about taking advantage of current caching facilities available and getting funding for this
Need recognition and support for current caches
Need document explaining how to setup local caches
Operational expertise documentation

6 months

Setup national cache infrastructure
Creation of Support Group - a group running and coordinating specific services for the whole UKHE community to use. Web equivalents of services like mailbase, HENSA Unix/micros archives etc.
UK searching services / registration of UK academic servers - coordinated by support group

Later

Chaining caches
Replication of search engines (HENSA already keep top 100 of things accessed)

Contents

Graphics Multimedia Virtual Environments Visualisation Contents