Conversion Issues and Tools

Jon Knight's paper presented a useful case study of the need for file conversion.

Electronic Publishing over the World Wide Web

Jon Knight

Publishers typically hold documents in an electronic format at some point prior to the production of paper based journals and books. Often this is in the form of author supplied electronic documents. These can be in any number of formats, usually dictated by what the publisher is prepared to handle. Some publishers accept files from some of the "heavyweight" professional word processing systems such as Microsoft Word or WordPerfect. Others, typically in the more scientific and technical fields, prefer the TeX and/or LaTeX typesetting language. PostScript files are also sometimes taken, although these are usually only used for review purposes with the data being re-entered when a document is accepted for publication. A very few publishers accept SGML directly from authors, probably due to the relative lack of widely deployed SGML authoring tools.

If the publisher wishes to make his documents available via the World Wide Web, it is to his advantage to use one of the widely deployed document formats, so that the widest possible user base will be able to access his information with their existing tools. The three most common document formats on the Web are plain ASCII text, HTML marked-up documents and PostScript files. Each has advantages and disadvantages:

ASCII text is the easiest to create for simple documents and can be viewed by all currently deployed browsers. It is also very space efficient, but it cannot convey graphical or typographical information, nor allow hyperdocuments to be linked out from it (it can of course be linked to),
HTML allows documents to be linked together using hyperlinks which allows the readers to explore the structure of documents in a new and powerful way. It also allows citations and related objects (such as video or audio clips) to be linked to existing documents, and makes the incorporation of graphics relatively easy. However, it gives much of the presentation control to the user, which until now has been the preserve of the publisher. It also has limited presentation oriented tags, and no support for tables or mathematics (although that is coming soon with HTML 2.0),
PostScript allows text, mathematics, tables and figures to be presented in exactly the same format as the printed page. However, it is display oriented and is thus more difficult than the other two formats to index and search. PostScript files are also larger, cannot be previewed on some systems and, like plain ASCII, can only be used as leaf nodes in a hyperdocument structure.

In Project ELVYN, a research project funded by the British Library Research and Development Department, the Institute of Physics Publishing agreed to allow electronic versions of an existing paper journal to be delivered to a number of sites in the UK and Europe. Each site was free to choose between TeX, SGML and PostScript for the document format delivered by the publisher and how this was delivered to its patrons. At Loughborough University of Technology we opted for the SGML format and devised a conversion process to generate HTML documents which could be viewed using normal WWW browsers such as NCSA Mosaic or MacWeb.

This project has demonstrated some of the shortcomings of the current HTML markup language and the underlying HTTP transfer mechanism. Specifically, as the journal was very technical, the lack of mathematics and table generating constructs in HTML led to the use of a large quantity of inlined bitmaps generated from TeX codes embedded in the publisher's SGML. This has resulted in unacceptably long downloading times for each page, even if each journal article is split into a number of smaller hyperlinked documents based on the logical sections in the paper. It has shown the need for multiple objects to be retrieved with one HTTP transaction using the MIME multipart response, as much of the overhead is connection setup (especially as we have to have identity information logged for usage profiling in the project).
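The idea of returning a page and its inlined bitmaps in one multipart body can be sketched roughly as follows. This is only an illustration using the Python standard library email package, not the mechanism used in the project; the function names, the file names and the use of a Content-Location header to label each part are assumptions made for the example.

```python
from email import message_from_bytes
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def bundle_page(html, bitmaps):
    """Pack one HTML page and its inlined bitmaps into a single multipart body.

    `bitmaps` maps a (hypothetical) file name to the raw bytes of a bitmap.
    """
    bundle = MIMEMultipart("mixed")
    bundle.attach(MIMEText(html, "html"))
    for name, data in bitmaps.items():
        # The subtype is given explicitly so no guessing from the bytes is needed.
        part = MIMEImage(data, _subtype="x-xbitmap")
        # Assumption: label each part so a client could match it to an IMG SRC.
        part.add_header("Content-Location", name)
        bundle.attach(part)
    return bundle.as_bytes()


def unpack_page(payload):
    """Return (content type, label) for every part of a bundled page."""
    message = message_from_bytes(payload)
    return [(part.get_content_type(), part.get("Content-Location"))
            for part in message.get_payload()]
```

A client receiving such a body would need only one connection for the page and all of its bitmaps, instead of one connection per inlined image.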

It would also be desirable for a standard vector drawing package to be embedded in popular WWW browsers in much the same way as GIF and X bitmap rendering engines are. Vector graphics can deal with a whole class of figures (such as graphs), can be scaled and printed accurately and may not take up as much bandwidth to transmit as equivalent bitmaps. However, HTML does seem to be a good compromise between the simplicity of plain ASCII and the full scale presentation abilities of PostScript. The use of SGML as the publisher's own format also made conversion relatively straightforward, as much of the markup was structural in nature and mapped easily into the available elements in HTML's DTD.

In his presentation Jon also made the following points:

The project used the Copenhagen SGML Tool (CoST)

CoST is a public domain SGML processing tool developed by Klaus Harbo at the Danish Euromaths Centre in Copenhagen.
CoST is based on the sgmls parser, Tcl and [incr Tcl].
CoST uses object oriented [incr Tcl] scripts to process the data in the ESIS data stream from the sgmls parser.
A 1200 line CoST script is used to convert documents conforming to the publisher's own DTD into HTML documents.
Although CoST is rather resource intensive, markup conversion only needs to be done once for each journal issue.
The entire retrieval, conversion and mounting procedure can be done automatically.
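To give a feel for what processing the ESIS stream involves, the sketch below reads sgmls-style output lines, where a leading "(" marks the start of an element, ")" marks its end and "-" introduces character data, and maps them onto HTML tags. It is written in Python rather than [incr Tcl] and is not the project's CoST script; the element names in TAG_MAP are hypothetical, since the publisher's DTD is not described here, and attribute and other line types are simply ignored.

```python
import html
import re

# Hypothetical mapping from publisher element names to HTML tags.
# None means the element is dropped but its content is kept.
TAG_MAP = {"ARTICLE": "BODY", "TITLE": "H1", "SECTION": None,
           "HEADING": "H2", "PARA": "P"}

# In sgmls output a record end is written as backslash-n and a
# literal backslash as two backslashes.
ESCAPE = re.compile(r"\\(n|\\)")


def unescape(data):
    """Undo the two escapes handled by this sketch."""
    return ESCAPE.sub(lambda m: "\n" if m.group(1) == "n" else "\\", data)


def esis_to_html(lines):
    """Translate an iterable of ESIS lines into an HTML string."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "(" and TAG_MAP.get(rest):
            out.append("<%s>" % TAG_MAP[rest])
        elif code == ")" and TAG_MAP.get(rest):
            out.append("</%s>" % TAG_MAP[rest])
        elif code == "-":
            # Escape characters that are special in HTML.
            out.append(html.escape(unescape(rest), quote=False))
        # Attribute lines ("A..."), the final "C" and others are skipped.
    return "".join(out)
```

Because the mapping lives in one table, a different DTD would in this sketch need a new table rather than new parsing code, although a real conversion script has far more cases to handle.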

"Classic" HTML has no markup for maths and tables (this is coming in HTML+). However, neither did the publisher's own DTD! The maths and tables in the publisher's SGML source files appear as embedded TeX codes. The CoST processing script strips the TeX codes for maths and tables out into separate files, processes them with TeX and then converts the resulting DVI files into X bitmaps for inlining in the HTML documents.
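The first step of that pipeline, pulling the TeX fragments out and leaving an inline image reference in their place, might look something like the following Python sketch. The <TEX> element name and the file naming scheme are assumptions made for illustration, as the way the codes are delimited in the publisher's SGML is not stated; running TeX and converting the DVI output to X bitmaps is left out entirely.

```python
import re

# Hypothetical delimiter for the embedded TeX codes.
TEX_ELEMENT = re.compile(r"<TEX>(.*?)</TEX>", re.DOTALL)


def extract_tex(sgml, stem):
    """Replace each embedded TeX fragment with an inline image reference.

    Returns the rewritten markup and a dict {TeX file name: TeX source}.
    The caller would write the files out and process each one with TeX.
    """
    fragments = {}

    def replace(match):
        # Number the fragments in document order: stem-eq1, stem-eq2, ...
        base = "%s-eq%d" % (stem, len(fragments) + 1)
        fragments[base + ".tex"] = match.group(1).strip()
        return '<IMG SRC="%s.xbm">' % base

    return TEX_ELEMENT.sub(replace, sgml), fragments
```

Each returned fragment corresponds to exactly one bitmap name in the rewritten markup, which is what lets the later steps be automated per article.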

The figures are supplied as TIFF files. The filenames are included in the publisher's source SGML documents and are used to generate hyperlinks to the external figures. These may soon also appear as thumbnails inlined in the documents.

Points Raised in this Project:

Content oriented document markup is easier to convert to other file formats than presentation oriented formats.
CoST is an excellent tool for converting SGML documents between DTDs and to other formats.
Publishers will need to either standardise on one SGML DTD or be prepared to aid libraries and users in processing their documents as each DTD requires a new processing script to be written.
The WWW with HTML provides a very flexible document delivery mechanism but it will require a number of enhancements before it is able to deliver complex technical documents efficiently.
HTTP server and client authors should be encouraged to support MIME multipart documents, as the multiple connection setups and tear downs are a significant overhead when a document contains many inlined images.

Conversion Tools

Chris Osland

In his presentation Chris discussed a number of conversion tools. Many of these are public domain and are best found using network tools such as Archie. The choices when doing conversion are:

Use convertors built in to applications
Use separate tools

RALCGM

This is held at:
UMXFE.CC.RL.AC.UK:/PUB/GRAPHICS/RALCGM
It allows conversion from CGM in all 3 encodings to CGM in a different encoding or to PostScript, EPS, HPGL and X.

MPEG

Examples can be found in:

SRC.DOC.IC.AC.UK:/WEATHER/MET.ED.AC.UK/ANIMATIONS/SRC/
MPEG_PLAY-2.0.TAR.Z
FTP WARWICK.AC.UK
utilities
FTP.DEMON.CO.UK
pub/ibmpc/mpeg
NIC.SWITCH.CH
MAC

JPEG

SRC.DOC.IC.AC.UK:
- /COMPUTING/DOCUMENT/FORMATTING/TEX/UK-TEX/TOOLS/GOPHER/MACINTOSH_TURBOGOPHER/HELPER-APPLICATIONS/
  JPEGVIEW33.sit.hqx

Conversion

We need to consider:

What is source?
What is target?
What content has to be preserved?

Tools

The available tools include:

Utah Raster Toolkit (URT)
PBMPLUS
San Diego Supercomputer Center tools (SDSC)
JPEG Consortium Toolkit
MPEG Converters
PC Applications
RALCGM

URT is available via:
SRC.DOC.IC.AC.UK:

/COMPUTING/OPERATING-SYSTEMS/LINUX/SUNSITE.UNC-MIRROR/APPS/GRAPHICS/
URT-3.1B-BIN.TAR.Z

and

IACRS1.UNIBE.CH:/PUB/
URT-3.1A.TAR.Z
