During the Project's lifetime, uptake of SGML has spread dramatically throughout the non-academic community. It has been adopted by industries as diverse as STM and legal publishing, commercial aviation, military aviation, international pharmaceutical agencies, the European Commission, the European Patent Office, HMSO and many others. Moreover, a significant number of commercial multimedia products such as the "Cinemania", "Encarta", and "Grolier Multimedia Encyclopedia" CD-ROMs have been produced from databases of SGML tagged files.
Within the academic community, SGML is having a profound effect on the development of HTML (the markup language which underpins the presentation of information on the World-Wide Web). The production of the Text Encoding Initiative's "Guidelines for the Encoding and Interchange of Machine-Readable Texts" (founded on an SGML markup scheme) is also certain to have a fundamental influence on the work of scholars and information users/providers for decades to come.
Conventional opinion holds that, as far as possible, the fact that a text has been marked up with SGML should be concealed from the end-user. Yet however much SGML is concealed, we cannot afford to ignore it. SGML-aware tools make it possible to access, exchange and reuse information in ways that would previously have been impractical, too expensive, or only possible within the restricting confines of proprietary, single-manufacturer environments.
HyTime, the Hypermedia/Time-based Structuring Language (ISO 10744), is an application of SGML. It is currently the only available International Standard for structuring and linking files of multimedia, hypermedia, or other forms of time-based information.
HyTime relies on SGML. Just as an SGML document can hold information written in other notations (e.g. TeX, JPEG images, QuickTime movies), HyTime hub documents can be used to structure and synchronize webs of hypermedia documents (e.g. PostScript texts, MPEG movies, SGML documents, HyperODA files, QuickTime 2 movies, holographic animations). Like SGML, HyTime places no constraints upon the types of document content, only on the information that indicates how different content types interrelate.
Using HyTime does not preclude the use of other de facto or de jure standards which exist now or may emerge in the future; it simply offers a standard way in which other types of information can be combined and made to interoperate. SGML documents can be readily processed by a "HyTime Engine" (software which understands HyTime markup), and any existing SGML document can be easily extended to become a fully conformant HyTime document.
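As a rough sketch of what HyTime markup in an SGML document looks like: the element names below are invented for illustration, but the HyTime attribute values (nameloc, nmlist, clink) are genuine HyTime architectural forms, which is how an SGML element declares the HyTime semantics it carries.

```sgml
<!-- Hypothetical fragment of a HyTime hub document. The element names
     are invented; the HyTime attribute identifies the architectural form. -->
<hub>
  <nameloc id="intro" HyTime="nameloc">
    <!-- Names an external object (here, an entity for an MPEG movie) -->
    <nmlist HyTime="nmlist" nametype="entity">intro-mpeg</nmlist>
  </nameloc>
  <!-- A contextual link (clink) whose target is the location above -->
  <clink HyTime="clink" linkend="intro">Play the introductory movie</clink>
</hub>
```

The point of the indirection through nameloc is that the hub document itself never needs to contain, or even understand, the MPEG data; it only records where the content lives and how the pieces interrelate.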
SGML is fast becoming the major International Standard for information representation. HyTime is likely to become equally fundamental in applications concerned with processing hypermedia and time-based information. The UK's H.E. community must remain fully aware of these Standards and their implications, not least when considering what tools to buy and which long-term information strategies to implement. SGML and HyTime will be essential to the exchange and reuse of conventional and hypermedia information both within the H.E. community, and between the H.E. community and the wider, non-academic world.
On May 16, the Text Encoding Initiative (TEI) publishes its "Guidelines for Electronic Text Encoding and Interchange."
This report is the product of several years' work by over a hundred experts in fields ranging from computational linguistics to Ancient Greek literature. The Guidelines define a format in which electronic text materials can be stored on, or transmitted between, any kind of computer, from a personal microcomputer to a university mainframe. The format is independent of the proprietary formats used by commercial software packages.
The TEI came into being as the result of the proliferation of mostly incompatible encoding formats, which was hampering cooperation and reuse of data amongst researchers and teachers. Creating good electronic texts is an expensive and time-consuming business. The object of the TEI was to ensure that such texts, once created, could continue to be useful even after the systems on which they were created had become obsolete. This requirement is a particularly important one in today's rapidly evolving computer industry.
To make them "future-proof", the TEI Guidelines use an international standard for text encoding known as SGML, the Standard Generalized Markup Language. SGML was originally developed by the publishing industry as a way of reducing the costs of typesetting and reuse of electronic manuscripts but has since become widely used by software developers, publishers, and government agencies. It is one of the enabling technologies which will help the new Digital Libraries take shape.
The TEI Guidelines go beyond many other SGML applications currently in use. Because they aim to serve the needs of researchers as well as teachers and students, they have a particularly ambitious set of goals. They must be both easily extensible and easily simplified. And their aim is to specify methods capable of dealing with all kinds of texts, in all languages and writing systems, from any period in history.
Consequently, the TEI Guidelines provide recommendations not only for the encoding of prose texts, but also for verse, drama and other performance texts, transcripts of spoken material for linguistic research, dictionaries, and terminological data banks.
The Guidelines provide detailed specifications for the documentation of electronic materials, their sources, and their encoding. These specifications will enable future librarians to catalogue electronic texts as efficiently and reliably as they currently catalogue printed texts.
The TEI Guidelines also provide optional facilities which can be added to the set of basic recommendations. These include methods for encoding hypertext links, transcribing primary sources (especially manuscripts), representing text-critical apparatus, analyzing names and dates, representing figures, formulae, tables, and graphics, and categorizing texts for corpus-linguistic study. The Guidelines also define methods of providing linguistic, literary, or historical analysis and commentary on a text and documenting areas of uncertainty or ambiguity.
The TEI Guidelines have been prepared over a six-year period with grant support from the U.S. National Endowment for the Humanities, Directorate General XIII of the Commission of the European Union, the Andrew W. Mellon Foundation, and the Social Science and Humanities Research Council of Canada. The effort is largely the product of the volunteer work of over a hundred researchers who donated time to share their experience in using computers and to work out the specific recommendations in the Guidelines.
The project is sponsored by three professional societies active in the area of computer applications to text-based research: the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics, which have a combined membership of thousands of scholars and researchers worldwide.
Many projects in North America and Europe have already declared their intention of applying the TEI Guidelines in the creation of the large scale electronic textual resources which are increasingly dominating the world of humanities scholarship.
The Guidelines are available in paper form or electronic form over the Internet. For more information contact the TEI editors by e-mail at email@example.com or firstname.lastname@example.org.
One of these development and innovation projects is the SURFdoc project. Results and findings of this project may contribute to the topic of this workshop.
The SURFdoc project is divided into three project components, each component with its own objective:
Sub-projects 2 and 3 deal with issues of interest to this workshop and will be discussed in more detail in the following paragraphs. For the sake of simplicity the two sub-projects will be designated SURFdoc/server and SURFdoc/Images respectively.
Special technical support is given by two laboratories which were set up at the beginning of the project. The laboratories have selected and tested various user platforms (Macintosh, MSDOS/Windows and UNIX) for working with scanned material (images). The platforms ranged from a low-budget platform (for Macintosh and MSDOS/Windows only) with public-domain software to a more professional platform with commercial software. The software ranged from simple software to view an image to software packages that were able to manipulate and process the image.
The interchange of images between the different platforms was a major subject of investigation, leading towards a set of recommendations for image file-formats. TIFF (uncompressed), GIF, JPEG and PostScript were recommended, each format with its own pros and cons.
During the last year several developments in different disciplines (libraries, archives) pointed to the usage of TIFF (FAX IV) as the image file-format for b/w text-images. These developments made us decide to pay extra attention to this file-format. At this moment the two laboratories are testing and investigating TIFF (FAX IV) for its interchangeability and suitability for processing on different platforms.
Steve Price made some comments in a written submission:
Features of SGML, ODA etc.:
For document exchange he perceives problems with all of the existing formats. SGML is not widely used, and the SGML specification alone is not sufficient to fully specify the document format. Extra information has to be provided in the form of the Document Type Definition (DTD), and this has to include the format(s) for any graphics. Until recently SGML was only being used within large corporate environments and so the software was expensive. However, lower-cost software is now becoming available, making use by the HE community more feasible.
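To make the point about the DTD concrete: the fragment below is an invented illustration, not any real DTD, but it shows how it is the DTD, rather than SGML itself, that defines a document's structure and declares the notation used for graphics.

```sgml
<!-- Illustrative DTD fragment (element names invented). SGML only
     supplies the syntax; these declarations supply the actual format. -->
<!NOTATION tiff   SYSTEM "TIFF 6.0">
<!ELEMENT report  - - (title, section+)>
<!ELEMENT section - - (title, (para | figure)+)>
<!-- Figures point to external graphic entities in a declared notation -->
<!ELEMENT figure  - O EMPTY>
<!ATTLIST figure  file ENTITY #REQUIRED>
<!ELEMENT (title|para) - - (#PCDATA)>
```

Two sites can only interchange documents reliably if they share such a DTD and can both handle the graphic notations it declares.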
ODA, like SGML, is an international standard and has its supporters, but ODA products do not seem to be widely available. CGM, SGML and ODA all "suffer" from the fact that they are not the "product" of one company and therefore do not have large advertising budgets promoting their use.
Acrobat, on the other hand, is a product of Adobe Systems and is being heavily promoted as a document interchange format. However, it has a number of drawbacks. Firstly, it is the property of one company and so can be changed at will by that company. Secondly, while it does allow finished documents to be exchanged, the transferred document cannot be edited at the receiving end; only annotations can be added. Also, the document structure is not maintained. Adobe claim that solutions to these latter two points will be forthcoming, but not in the near future.
Fred Cole made the following points in his paper:
There is no clear winner among these formats until the needs of the users are clearly specified.
Acrobat is still rather experimental and currently seems to have little to recommend it, but I have access to a test version, so I will find out more about it before the workshop.
ODA would be useful if we insist on using a variety of common word processors and can all afford commercially available converters between our own favourite word processor and ODA. Its two main strengths are: (i) it allows users to insert idiosyncratic layout into transmittable documents and yet still allow the documents to be editable by the receiver, and (ii) there is little or no retraining needed by those already expert in the use of their word processor.

SGML is an elegant general solution that allows a document to be formatted and re-formatted in different styles without further editing. There is a variety of public domain software to support SGML. I have a particular axe to grind here: I believe that we should move towards the use of structured documents because they are more flexible and are re-usable. Although the underlying architecture of the ODA standard could support structured documents, it is not actually used in that way, whereas structure is natural in SGML.
If we want useful documents we need to put structure into them. The presentation can then be flexible.
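A minimal sketch of what this separation looks like in practice (the tag names are invented for illustration): the instance records only structure, so a separate style specification can render the same text in any house style without re-editing.

```sgml
<!-- A structured instance: nothing here says "bold" or "14pt".
     Presentation is applied later, by whatever style is wanted. -->
<memo>
  <title>Workshop report</title>
  <para>Structured documents separate <term>content</term> from
  presentation, so they can be re-formatted in different styles
  without further editing.</para>
</memo>
```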
Standards are needed to allow future-proofing. How do we convince people who do not want to know of the value of non-proprietary formats for long-term storage and presentation?
We need to consider storing information, or maybe archiving and delivering information, using different formats. We may wish to provide an unalterable version to ensure integrity when delivering, while archiving an editable version. Security is important; note that ODA has security enhancements.
It was noted that there are now a lot of SGML tools around, some of them WYSIWYG, including both viewers and editors. We can expect to see considerable improvement in the tools over the next few years. DTD tools are also becoming available.
There are many concerns regarding conversion and the storage of legacy data.
If the information is important over a long period we need to bite the bullet and go for storing information in structured ways.
Links are emerging between the word processor and the SGML worlds.
It was noted that publishers all seem to have their own DTDs; this gives them a competitive advantage.
Considerable research has already been done on increasing the access to information by people with print disabilities. This position paper briefly summarises the issues; makes it clear that the needs of people with print disabilities will have to be considered in the design stage of multimedia applications; points out the crucial importance of standardised structured electronic document formats and suggests some practical ways forward.
One of the significant limiting factors for the print disabled is the difficulty they face in accessing the predominant form of information provision, which is almost entirely oriented to printed and other visual forms. The proportion of information easily accessible to the print disabled is very small. There is a growing understanding that a vital factor needed to improve access is to develop methods whereby the provision of information for the print disabled is, as far as possible, an automatic supplementary process related to the normal information creation processes. Technologically, this can be achieved through application of the developments of standardised structured electronic documents.
Electronic documents are the key to linking into the commercial information production processes, as increasingly these processes are electronically based; and to the transformations required to make the information accessible to the print disabled.
The importance for the print disabled of structure in electronic documents can be realised when it is considered how the normally sighted reader obtains a significant amount of information from the layout of a document: titles in bold, bulleted indents, emphasised sections in italics. These are crucial when browsing through a large document. To make this information available to aid the print disabled user to browse or navigate within a document, the structure needs to be defined explicitly within the electronic document.
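The contrast can be illustrated with a small invented example: purely visual markup carries the cue only for sighted readers, whereas explicit structure makes the same cue machine-readable.

```sgml
<!-- Purely visual markup: a speech-output system cannot distinguish
     a heading from any other emphasised text -->
<b>1. Introduction</b>

<!-- Explicit structure: a reading system can announce "section title"
     and let the user jump between sections -->
<section>
  <title>Introduction</title>
  ...
</section>
```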
A major part of the CAPS Project is devoted to developing methods whereby SGML is used at the heart of a generic model for dramatically improving access to information for the print disabled. Within the current phase of the Project, which will be completed at the end of September 1994, a Pilot Electronic Library is being set up. Access will be provided interactively using synthetic speech, both on an adapted work station and also through the home telephone using a voice response system with high-quality, real-time text-to-speech.
In a remarkable development, ICADD has managed to have its mechanisms for accessibility incorporated into a new ISO Standard DTD for Electronic Manuscript Preparation and Markup. This is, as far as is known, the first time that disability issues have been directly incorporated into a standard for commercial use. If, as seems likely, publishers start using the standard, accessibility will be automatically built into any document instances that are produced.
There are no easy solutions to this problem. However, the following approaches seem fruitful: