Document Standards

SGML and HyTime

Paul Ellison and Michael Popham

The SGML Project is a three-year initiative funded by JISC, due to end in Autumn 1994. The Project has been dedicated to raising awareness and use of the Standard Generalized Markup Language (and related International Standards) within the UK Higher Education and research communities.

During the Project's lifetime, uptake of SGML has spread dramatically throughout the non-academic community. It has been adopted by industries as diverse as STM and legal publishing, commercial aviation, military aviation, international pharmaceutical agencies, the European Commission, the European Patent Office, HMSO and many others. Moreover, a significant number of commercial multimedia products such as the "Cinemania", "Encarta", and "Grolier Multimedia Encyclopedia" CD-ROMs have been produced from databases of SGML-tagged files.

Within the academic community, SGML is having a profound effect on the development of HTML (the markup language which underpins the presentation of information on the World-Wide Web). The production of the Text Encoding Initiative's "Guidelines for the Encoding and Interchange of Machine-Readable Texts" (founded on an SGML markup scheme) is also certain to have a fundamental influence on the work of scholars and information users/providers for decades to come.

Conventional opinion holds that as far as possible, the fact that a text has been marked-up with SGML should be concealed from the end-user. Yet however much SGML is concealed, we cannot afford to ignore it. SGML-aware tools make it possible to access, exchange and reuse information in ways that have previously been either too impractical, too expensive, or only possible within the restricting confines of proprietary, single-manufacturer environments.

HyTime, the Hypermedia/Time-based Structuring Language (ISO 10744), is an application of SGML. It is currently the only available International Standard for structuring and linking files of multimedia, hypermedia, or other forms of time-based information.

HyTime relies on SGML, and just as an SGML document can hold information written in other notations (e.g. TeX, JPEG images, QuickTime movies), HyTime hub documents can be used to structure and synchronize webs of hypermedia documents (e.g. PostScript texts, MPEG movies, SGML documents, HyperODA files, QuickTime 2 movies, holographic animations). Like SGML, HyTime places no constraints upon the types of document content, only on the information that indicates how different content types interrelate.

Using HyTime does not preclude the use of other de facto or de jure standards which exist now or may emerge in the future; it simply offers a standard way in which other types of information can be combined and made to interoperate. SGML documents can be readily processed by a "HyTime Engine" (software which understands HyTime markup), and any existing SGML document can be easily extended to become a fully conformant HyTime document.

SGML is fast becoming the major International Standard for information representation. HyTime is likely to become equally fundamental in applications concerned with processing hypermedia and time-based information. The UK's H.E. community must remain fully aware of these Standards and their implications, not least when considering what tools to buy and which long-term information strategies to implement. SGML and HyTime will be essential to the exchange and reuse of conventional and hypermedia information both within the H.E. community, and between the H.E. community and the wider, non-academic world.

Text Encoding Initiative Guidelines

The TEI was mentioned on a number of occasions at the workshop. The guidelines have now been published and the text below from Lou Burnard gives details.

On May 16, the Text Encoding Initiative (TEI) publishes its "Guidelines for Electronic Text Encoding and Interchange."

This report is the product of several years' work by over a hundred experts in fields ranging from computational linguistics to Ancient Greek literature. The Guidelines define a format in which electronic text materials can be stored on, or transmitted between, any kind of computer from a personal microcomputer to a university mainframe. The format is independent of the proprietary formats used by commercial software packages.

The TEI came into being as the result of the proliferation of mostly incompatible encoding formats, which was hampering cooperation and reuse of data amongst researchers and teachers. Creating good electronic texts is an expensive and time-consuming business. The object of the TEI was to ensure that such texts, once created, could continue to be useful even after the systems on which they were created had become obsolete. This requirement is a particularly important one in today's rapidly evolving computer industry.

To make them "future-proof", the TEI Guidelines use an international standard for text encoding known as SGML, the Standard Generalized Markup Language. SGML was originally developed by the publishing industry as a way of reducing the costs of typesetting and reuse of electronic manuscripts but has since become widely used by software developers, publishers, and government agencies. It is one of the enabling technologies which will help the new Digital Libraries take shape.

The TEI Guidelines go beyond many other SGML applications currently in use. Because they aim to serve the needs of researchers as well as teachers and students, they have a particularly ambitious set of goals. They must be both easily extensible and easily simplified. And their aim is to specify methods capable of dealing with all kinds of texts, in all languages and writing systems, from any period in history.

Consequently, the TEI Guidelines provide recommendations not only for the encoding of prose texts, but also for verse, drama and other performance texts, transcripts of spoken material for linguistic research, dictionaries, and terminological data banks.

The Guidelines provide detailed specifications for the documentation of electronic materials, their sources, and their encoding. These specifications will enable future librarians to catalogue electronic texts as efficiently and reliably as they currently catalogue printed texts.

The TEI Guidelines also provide optional facilities which can be added to the set of basic recommendations. These include methods for encoding hypertext links, transcribing primary sources (especially manuscripts), representing text-critical apparatus, analyzing names and dates, representing figures, formulae, tables, and graphics, and categorizing texts for corpus-linguistic study. The Guidelines also define methods of providing linguistic, literary, or historical analysis and commentary on a text and documenting areas of uncertainty or ambiguity.

The TEI Guidelines have been prepared over a six-year period with grant support from the U.S. National Endowment for the Humanities, Directorate General XIII of the Commission of the European Union, the Andrew W. Mellon Foundation, and the Social Science and Humanities Research Council of Canada. The effort is largely the product of the volunteer work of over a hundred researchers who donated time to share their experience in using computers and to work out the specific recommendations in the Guidelines.

The project is sponsored by three professional societies active in the area of computer applications to text-based research: the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics, which have a combined membership of thousands of scholars and researchers worldwide.

Many projects in North America and Europe have already declared their intention of applying the TEI Guidelines in the creation of the large scale electronic textual resources which are increasingly dominating the world of humanities scholarship.

The Guidelines are available in paper form or electronic form over the Internet. For more information contact the TEI editors by e-mail at tei@uic.edu or lou@vax.ox.ac.uk.

The SURFdoc project: Storing, Accessing and Processing Electronic Documents

Roel Rexwinkel

SURFnet bv offers electronic information and communication services to higher education and research institutions in the Netherlands. In addition to these services, SURFnet carries out several development projects commissioned by the SURF foundation.

One of these development or innovative projects is the SURFdoc project. Results and findings of this project may contribute to the topic of this workshop.

The SURFdoc project is divided into three project components, each component with its own objective:

1. A project to stimulate and establish the cooperation between university libraries and computer centres in the university;
2. A project that deals with the storage and distribution of electronic documents on a document server, carried out by university libraries in close cooperation with the computer centres. It covers offering electronic documents to end-users in three academic institutes on the basis of cooperation between libraries and computer centres, and research into user response;
3. A project that deals with the processing of images in end-user environments, based on available hardware and software. It covers the technical realization of facilities for the reception, processing and sending of scanned material (images) from the user's workplace.

Sub-projects 2 and 3 deal with issues of interest for this workshop and are discussed in more detail in the following paragraphs. For the sake of simplicity the two sub-projects are designated SURFdoc/Server and SURFdoc/Images respectively.

SURFdoc/Server Project

At four universities in the Netherlands, libraries and computer centres are working closely together to set up servers for electronic text documents. At present, tests are being carried out with several text formats (PostScript, PDF (Adobe Acrobat), SGML and ASCII) and with several ways of finding and retrieving the text on the server (WAIS, Gopher, WWW).

SURFdoc/Images Project

In this project five institutes are creating environments in which end-users can receive scanned documents (images) and either view and print them or process them further by editing or OCR (Optical Character Recognition). The scanned documents can be black-and-white text images or full-colour images (slides, pictures, photographs etc.). Special attention is paid to image file formats (TIFF (uncompressed), TIFF (FAX IV), JPEG, GIF) and the exchange of these formats in a multi-vendor environment (Macintosh, MSDOS/Windows and UNIX).

Special technical support is given by two laboratories which were set up at the beginning of the project. The laboratories have selected and tested various user platforms (Macintosh, MSDOS/Windows and UNIX) for working with scanned material (images). The platforms ranged from a low-budget platform (for Macintosh and MSDOS/Windows only) with public-domain software to a more professional platform with commercial software. The software ranged from simple software to view an image to software packages that were able to manipulate and process the image.

The interchange of images between the different platforms was a major subject of investigation, leading to a set of recommendations for image file formats. TIFF (uncompressed), GIF, JPEG and PostScript were recommended, each with its own pros and cons.

During the last year several developments in different disciplines (libraries, archives) pointed to the use of TIFF (FAX IV) as the image file format for black-and-white text images. These developments made us decide to pay extra attention to this file format. At present the two laboratories are testing TIFF (FAX IV) for its interchangeability and for how well it can be processed on different platforms.

Discussion

Roel presented some results. SURFnet are using SGML, PDF, ASCII and PostScript for their work. They are also experimenting with access using WWW, WAIS and Gopher. They have used TIFF with Group 4 compression for black and white documents. This cannot be used in a multivendor environment as there are no viewers for the Mac. SGML seems to be the way forward.

Steve Price had made some comments in a written input:

Features of SGML, ODA etc.:

SGML: supports string processability (there is no formatting information at all), publishing databases and input to intelligent layout processes;
ODA (processable form): supports object oriented processability with layout directives to enable sensible layout to take place;
ODA (formatted/processable form): layout with integrity and object oriented processing;
ODA (formatted form): layout conveyed with integrity, also basis of raster image file formats (e.g. AIIM MS-53, Gp 4 fax);
Acrobat: page formatted form, layout conveyed with integrity;
PostScript: page image form.

Alan Francis made the following points in his paper:

For document exchange he perceives problems with all of the existing formats. SGML is not widely used and the SGML specification alone is not sufficient to fully specify the document format. Extra information has to be provided in the form of the Document Type Definition (DTD) and this has to include the format(s) for any graphics. Until recently SGML was only being used within large corporate environments and so the software was expensive. However lower cost software is now becoming available making use by the HE community more feasible.
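
The point that an SGML document cannot be interpreted without its Document Type Definition can be illustrated with a small sketch. Everything here is an assumption made for illustration: the memo tag set is hypothetical, the dictionary is a deliberately simplified stand-in for a DTD (real DTDs also express order, occurrence, attributes and notations for graphics), and well-formed markup parsed with Python's xml.etree.ElementTree stands in for an SGML instance.

```python
import xml.etree.ElementTree as ET

# Toy "document type" (hypothetical): for each element, the child
# elements it may contain. A real DTD says far more than this.
ALLOWED_CHILDREN = {
    "memo": {"to", "body"},
    "to": set(),
    "body": {"para", "figure"},
    "para": set(),
    "figure": set(),
}


def violations(element, path=""):
    """List every element that the toy document type does not permit."""
    here = path + "/" + element.tag
    if element.tag not in ALLOWED_CHILDREN:
        return [here + ": undeclared element"]
    problems = []
    for child in element:
        if child.tag not in ALLOWED_CHILDREN[element.tag]:
            problems.append(here + "/" + child.tag + ": not allowed here")
        else:
            problems.extend(violations(child, here))
    return problems


# The same tags are acceptable or not depending on the declared type.
good = ET.fromstring(
    "<memo><to>HE community</to><body><para>Hello.</para></body></memo>"
)
bad = ET.fromstring("<memo><body><to>Misplaced</to></body><table/></memo>")

good_report = violations(good)
bad_report = violations(bad)
print(good_report)
print(bad_report)
```

Without the table of permitted elements, the receiver has no way of knowing whether the second memo is an error or simply a different kind of document, which is why the DTD has to travel with, or be agreed ahead of, the documents being exchanged.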

ODA, like SGML an international standard, has its supporters but ODA products do not seem to be widely available. CGM, SGML and ODA all "suffer" from the fact that they are not the "product" of one company and therefore do not have large advertising budgets promoting their use.

Acrobat, on the other hand, is a product of Adobe Systems and is being heavily promoted as a document interchange format. However it has a number of drawbacks. Firstly, it is the property of one company and so can be changed at will by that company. Secondly, while it does allow finished documents to be exchanged, the transferred document cannot be edited at the receiving end; only annotations can be added. Also the document structure is not maintained. Adobe claim that solutions to these latter two points will be forthcoming but not in the near future.

Fred Cole made the following points in his paper:

SGML, ODA, Acrobat

There is no clear winner among these formats until the needs of the users are clearly specified.

Acrobat is still rather experimental and currently seems to have little to recommend it, but I have access to a test version so I will find out more about it before the workshop.

ODA would be useful if we insist on using a variety of common word processors and can all afford commercially available converters between our own favourite word processor and ODA. Its two main strengths are: (i) it allows users to insert idiosyncratic layout into transmittable documents while still allowing the documents to be edited by the receiver, and (ii) little or no retraining is needed by those already expert in the use of their word processor. SGML is an elegant general solution that allows a document to be formatted and re-formatted in different styles without further editing. There is a variety of public domain software to support SGML. I have a particular axe to grind here: I believe that we should move towards the use of structured documents because they are more flexible and re-usable. Although the underlying architecture of the ODA standard could support structured documents, it is not actually used in that way, whereas structure is natural in SGML.
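
The claim that a structured document can be formatted and re-formatted in different styles without further editing can be illustrated with a minimal sketch. It assumes a small, hypothetical tag set (report, title, para, emph) and uses well-formed markup that Python's xml.etree.ElementTree can parse as a stand-in for a real SGML instance; the function names are illustrative only and are not taken from any SGML tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document: the tags say what each part *is*,
# not how it should look.
SOURCE = """<report>
<title>Document Standards</title>
<para>Structure makes documents <emph>re-usable</emph>.</para>
</report>"""


def inline_text(element, emph_open, emph_close):
    """Flatten an element's text, wrapping <emph> children in the given markers."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "emph":
            parts.append(emph_open + (child.text or "") + emph_close)
        else:
            parts.append(child.text or "")
        parts.append(child.tail or "")
    return "".join(parts)


def render_plain(root):
    """Style 1: plain text, title in capitals, emphasis shown with asterisks."""
    lines = []
    for node in root:
        if node.tag == "title":
            lines.append(inline_text(node, "*", "*").upper())
        elif node.tag == "para":
            lines.append(inline_text(node, "*", "*"))
    return "\n\n".join(lines)


def render_troff(root):
    """Style 2: troff-like requests (.SH for a heading, .PP for a paragraph)."""
    lines = []
    for node in root:
        if node.tag == "title":
            lines.append(".SH " + inline_text(node, "\\fI", "\\fP"))
        elif node.tag == "para":
            lines.append(".PP")
            lines.append(inline_text(node, "\\fI", "\\fP"))
    return "\n".join(lines)


root = ET.fromstring(SOURCE)
plain = render_plain(root)   # one source ...
troff = render_troff(root)   # ... two presentations, no re-editing
print(plain)
print(troff)
```

The source text is never touched: adding a third style means writing a third rendering function, not re-keying or re-tagging the document.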

HyTime, HyperODA and MHEG

It is probably too early to make choices between these, but in any case they also have different target uses. The characteristics of HyTime and HyperODA are roughly the same as SGML and ODA respectively. MHEG, not mentioned in the workshop notice, should also be considered here, but it is still at a very early stage.

HTML

HTML was not mentioned in the workshop notice, but it should be considered. It is in some ways rather primitive (although an improved version is on the way) but it is up and running and has enormous popular support. It may be that we do not need to consider it, however, because it seems that it has already arrived and is a de facto standard.

Mono-Media Files

There is no real need to make any rules about using particular formats for mono-media files such as JPEG etc. because converters and display tools are freely available.

Further Discussion:

If we want useful documents we need to put structure into them. The presentation can then be flexible.

Standards are needed to allow future-proofing. How do we tell people who do not want to know about the value of non-proprietary formats for long term storage and presentation?

We need to consider storing information, or maybe archiving and delivering information, using different formats. We may wish to provide an unalterable version to ensure integrity when delivering, while archiving an editable version. Security is important; note that ODA has security enhancements.

It was noted that there are now a lot of SGML tools around, both viewers and editors, some of them WYSIWYG. We can expect to see considerable improvement in the tools over the next few years. DTD tools are also becoming available.

There are a lot of concerns re conversion and the storage of legacy data.

If the information is important over a long period we need to bite the bullet and go for storing information in structured ways.

Links are emerging between the word processor and the SGML worlds.

It was noted that publishers all seem to have their own DTDs; it gives them a competitive advantage.

Document Exchange and the Print Disabled

Tom Wesley

The development of Multimedia Applications within the UK Higher Education sector seems to make the implicit assumption that the users of such applications have normal vision. There is however a significant number of people with print disabilities who will be unable to use these applications, or at best, use them with difficulty.

Considerable research has already been done on increasing the access to information by people with print disabilities. This position paper briefly summarises the issues; makes it clear that the needs of people with print disabilities will have to be considered in the design stage of multimedia applications; points out the crucial importance of standardised structured electronic document formats and suggests some practical ways forward.

People With Print Disabilities

People with print disabilities include the blind, the deaf-blind, the partially sighted, the dyslexic and those with motor impairments which make it difficult to physically control paper documents.

One of the significant limiting factors for the print disabled is the difficulty they face in accessing the predominant form of information provision, which is almost entirely oriented to printed and other visual forms [1] [2]. The proportion of information easily accessible to the print disabled is very small. There is a growing understanding that a vital factor needed to improve access is to develop methods in which the provision of information for the print disabled is, as far as possible, an automatic supplementary process related to the normal information creation processes. Technologically, this can be achieved through application of the developments of standardised structured electronic documents.

Electronic documents are the key to linking into the commercial information production processes, as increasingly these processes are electronically based; and to the transformations required to make the information accessible to the print disabled.

The importance of structure in electronic documents for the print disabled becomes clear when one considers how much information the normally sighted reader obtains from the layout of a document: titles in bold, bulleted indents, emphasised sections in italics. These cues are crucial when browsing through a large document. To make this information available to help the print disabled user browse or navigate within a document, the structure needs to be defined explicitly within the electronic document.
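
A minimal sketch of why explicit structure matters: once headings are tagged rather than merely printed in bold, a program can build an outline that a speech or braille interface could read out and navigate. The tag names (doc, section, head, para) are hypothetical, well-formed markup parsed with Python's xml.etree.ElementTree stands in for SGML, and this is an illustration only, not the CAPS or ICADD mechanism.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document with nested, explicitly tagged sections.
SOURCE = """<doc>
<section><head>People With Print Disabilities</head>
<para>Who is affected.</para>
<section><head>Current Research</head><para>CAPS and ICADD.</para></section>
</section>
<section><head>Implications</head><para>Ways forward.</para></section>
</doc>"""


def outline(element, depth=0):
    """Return (depth, heading) pairs for every section, in document order."""
    entries = []
    for child in element:
        if child.tag == "section":
            head = child.find("head")
            if head is not None and head.text:
                title = head.text.strip()
            else:
                title = "(untitled)"
            entries.append((depth, title))
            entries.extend(outline(child, depth + 1))
    return entries


def spoken_contents(entries):
    """Turn the outline into short phrases suitable for a speech synthesiser."""
    return ["Level %d: %s" % (depth + 1, title) for depth, title in entries]


toc = outline(ET.fromstring(SOURCE))
for phrase in spoken_contents(toc):
    print(phrase)
```

None of this is possible if a heading is recorded only as "bold, 14 point": the program would have to guess which bold lines are headings and how they nest.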

Current Research

The main research into methods of improving access to information for the print disabled is being carried out by the CAPS Project and the ICADD Committee. The author of this position paper is connected to both of these, being a Partner in CAPS and Vice-President and member of the Board of Directors of ICADD. Also of relevance in the present context is work proceeding on making Graphical User Interfaces accessible to people with print disabilities.

The CAPS Project

The CAPS Consortium (Communication and Access to Information for People with Special Needs) is a European Union funded project in the Technology Initiative for Disabled and Elderly People (TIDE) Programme [3]. United Kingdom partners are the University of Bradford and the Royal National Institute for the Blind.

A major part of the CAPS Project is devoted to developing methods whereby SGML is used at the heart of a generic model for dramatically improving the access to information for the print disabled. Within the current phase of the Project, which will complete at the end of September, 1994, a Pilot Electronic Library is being set up. Access will be provided interactively using synthetic speech on both an adapted work station and also through the home telephone using a voice response system with high quality real time text to speech.

ICADD

In developing the concept of Associated Specifications to allow SGML documents to be accessible to people with print disabilities, the CAPS project has worked closely with ICADD, the International Committee on Accessible Document Design [4]. This Committee, a non-profit organisation incorporated in the State of New Hampshire, has the aim of developing techniques and raising awareness to enable documents to be made available to persons with print disabilities at the same time as, and at no greater cost than, the print enabled community enjoys. Through this cooperation CAPS has gained advantage from the legislative push being provided by ADA, the Americans with Disabilities Act [5].

In a remarkable development, ICADD has managed to have its mechanisms for accessibility incorporated into a new ISO Standard DTD for Electronic Manuscript Preparation and Markup [6]. This is, as far as is known, the first time that disability issues have been directly incorporated into a standard for commercial use. If, as seems likely, publishers start using the standard, accessibility will be automatically built in to any document instances that are produced.

GUI Access

The growth of the graphical user interface (for example, Macintosh, Windows) has created an enormous problem for people with print disabilities, particularly the blind. Both in Europe and the USA there is considerable research and development being devoted to providing solutions [7]. It is already clear however that long term solutions will need the human computer interface design to take these special needs into account.

SGML and HyTime

There is a growing conviction that the Standard Generalized Markup Language, SGML, can play an important role as an enabling technology to increase access to information for blind and partially sighted people [8] [9]. By an obvious extension, it seems likely that for hypermedia applications, HyTime will be the key enabling technology.

Implications

The growth of information technology has enabled many people with print disabilities to take a more active role in society. This development is now threatened paradoxically by the further developments of information technology itself, in that increasingly information systems are conceived as having predominantly visual interfaces.

There are no easy solutions to this problem. However the following approaches seem fruitful:

the use of International Standards for multimedia documents, such as SGML and HyTime;
development of generic solutions to the human computer interface, which will allow predominantly non-visual access. A potential model for this is being developed in the MIPS ESPRIT Project [10];
the provision within the application of alternate representations of visual material, such as textual summaries;
the ability to interact with mathematical equations through an audio interface, being developed in the MATHS TIDE Project [11].

References

[1] J. Engelen and J. Baldewijns, Digital Information Distribution for the Reading Impaired: from Daily Newspaper to Whole Libraries. The 3rd International Conference on Computers for Handicapped Persons, Vienna, 1992, pp. 144-149.
[2] J. Engelen and B. Bauwens, Large Scale Text Distribution Services for the Print Disabled: The Harmonisation and Standardisation Efforts of the TIDE-CAPS Consortium, in Rehabilitation Technology, Strategies for the European Union, Proceedings of the 1st TIDE Congress, Brussels, April 1993, edited by E. Ballabio et al., IOS Press, ISBN 90 5199 131 2, pp. 24-29.
[3] Full details of the CAPS Consortium can be obtained from the Coordinator, Professor Jan Engelen at the Katholieke Universiteit, Leuven, Belgium. The Consortium maintains an ftp site, gate.esat.kuleuven.ac.be, in the directory /pub/CAPS and its sub-directories, which provides access to its latest public documents.
[4] The latest information about the International Committee on Accessible Document Design (ICADD) can be obtained from the President, Michael G. Paciello, 110 Spit Brook Road, Nashua, NH 03062, USA, phone: +1 603 881 1831, Email: Paciello@Shane.Enet.Dec.Com.
[5] S. Carruthers, A. Humphreys and J. Sandhu, Rehabilitation Technology in the United States and its Impact upon the Creation of a Viable European Industry Sector, in Rehabilitation Technology, Strategies for the European Union, Proceedings of the 1st TIDE Congress, Brussels, April 1993, edited by E. Ballabio et al., IOS Press, ISBN 90 5199 131 2, pp. 189-193.
[6] ISO 12083:1993, Information processing - Text and Office systems - Electronic Manuscript Preparation and Markup. International Organisation for Standardisation.
[7] J. Gill, Access to Graphical User Interfaces by Blind People, Royal National Institute for the Blind, London, ISBN 1 85878 004 7.
[8] B. Bauwens, J. Engelen, F. Evenepoel, C. Tobin and T. Wesley, SGML: An Enabling Technology for the Reading Impaired, SGML Europe '94, Montreux, Switzerland. Database Publishing Ltd, Swindon, UK, pp. 101-107.
[9] B. Bauwens, J. Engelen, F. Evenepoel, C. Tobin and T. Wesley, Structuring Documents: The Key to Increasing Access to Information for the Print Disabled, paper submitted to the 4th International Conference on Computers for Handicapped Persons, Vienna, September 1994.
[10] A. Bruffaerts, Encoding hypermedia application with HyTime, The MIPS approach, SGML Europe '94, Montreux, Switzerland. Database Publishing Ltd, Swindon, UK, pp. 231-242.
[11] Full details of the MATHS Consortium can be obtained from the Coordinator, Mark Elsom-Cook, Electric Brain Company, 13 Queen Square, Leeds, West Yorkshire, LS2 8AJ, United Kingdom, phone: +44 532 428696.
