Ross MacIntyre, Manchester Computing, University of Manchester
SuperJournal Technical Report SJMC100
1. Purpose of the Report
3. Data Analysis
4. Data Conversion
5. Data Handling Process Summary
6. Problems and Issues
Appendix A SuperJournal Data Handling Reports
Before designing the data handling processes, we needed to answer some basic questions about the data itself: what files could the publishers supply, in what formats, and how consistently?
We therefore embarked on a data analysis exercise to understand the files that could be supplied, so that scalable processes could be designed around them.
The publishers' role was content creation, and MC's role was to deliver an application with maximum functionality, within the constraints of the content they could supply.
Each of the publishers participating in SuperJournal publishes journals in printed form, and the electronic files they could supply were generally a by-product of the print production process. The particular files available from each publisher depended on the nature of their production process, the suppliers they used, and whether they were already developing electronic journal products to accompany print.
The eLib Programme offered some general recommendations on file formats in their eLib Standards Guidelines, but these could not be forced on publishers. We needed to find out what files they could supply, evaluate the processing requirements, and assess the functionality that could be delivered in the SuperJournal application by using them. Then we could decide on submission formats.
In February 1996 a questionnaire was sent to each publisher to find out what files they could submit. Sample files were sent to MC for testing and evaluation. Once the DTD analysis outlined in Section 3.2 was completed, we agreed on the following submission formats:
From the start, virtually all publishers were working with SGML. Most could supply "header" information in SGML format, where the header contains information about the article, e.g. author, title, volume, issue, date, abstract. The situation has moved on considerably, but in general the DTDs in use were either "bespoke" or variants of Majour, the Elsevier DTD, and latterly SSSH. Moves towards ISO 12083 standard DTDs are underway.
A few publishers could supply one or more journals with the full articles in SGML format, and several indicated plans to do so. In the case of full article SGML, each publisher developed their own DTD or was having one created. The use of SGML is explored further in the next section.
Most of the publishers could supply the journal articles in PDF format, typically generated by their print suppliers from PostScript files. The article can be viewed with the Adobe Acrobat Reader and looks exactly like the printed version it is derived from. It is cheap to produce and protects the "look" of their journals, something they have invested serious money in designing.
A few publishers could supply the articles of individual journals as HTML, typically generated when the publisher offers a Web version of the journal.
Encapsulated PostScript (EPS) & Tagged Image File Format (TIFF) are used widely in publishing, and most publishers can provide them. However, SuperJournal is a Web application, so we needed graphic files that could be used with Web browsers. As more publishers developed Web applications, we found they could supply graphics in Graphic Interchange Format (GIF), often multiple files of different sizes and quality. Joint Photographic Experts Group (JPEG) format files were few and far between initially, though use has grown during the project.
Virtually none of the publishers generate sound, video, or other multimedia files for their journals as a matter of course. A notable exception was a music journal which included a yearly CD-ROM. This area was changing, and we sought to encourage and assist in experimentation.
Files were also received for testing in PostScript, tagged ASCII, and TeX/LaTeX format. In cases where the publisher could only supply articles as PostScript or TeX/LaTeX, this was converted to PDF. Where TeX/LaTeX was used for equations and formulae within an article, this was accommodated by creating GIFs and displaying them in-line.
In May 1996 the project decided to accept SGML headers and PDF articles for the first journal cluster. It should be stressed that this was not a decision advocating one format over another. A "preferred format" was a possible outcome of the project, but the preference would have to be demonstrated after providing users with different choices.
The initial decision to accept SGML headers and PDF articles for the first cluster, Communication and Cultural Studies (CCS), was largely based on what most publishers could provide. One year later, in May 1997, the second journal cluster, Molecular Genetics and Proteins (MGP), included several journals with articles submitted as full-text SGML. The third cluster, Political Science, was similar to CCS, in that all journals were supplied as SGML headers and PDF articles, with one exception which was provided in PostScript. The final cluster, Materials Chemistry, featured a mix of full-text SGML article files and SGML header/PDF article files. Over the life of the project new file formats were included as the publishers, or their suppliers, developed their capabilities to supply them.
Full details of the types of files submitted are contained within a Technical Appendix to SJMC130 Production Process, Section 15.2 Journal Data Supply.
To assist planning, we sought an expert view on the use being made of SGML by our contributing publishers: Was the use of a project standard DTD a forlorn hope? Alden Electronic Products performed a formal DTD analysis on the six DTDs then in use by the publishers (one applied to the header data only).
They concluded that:
Headers: The DTDs varied considerably, but they contained basically the same data elements. It should be feasible to develop a single DTD to accommodate the variations.
Full Text: The DTDs were similar in many respects and could be harmonised, but only up to a point. Where the DTDs differ, the relative importance of each difference must be determined. It would be possible to totally harmonise the DTDs, but the consequence would be loss of document structure.
So the DTDs were compatible at a high level, but not at the detailed level. This did mean that a generic header DTD was possible, which was good news. However, it would be naive to expect the publishers to adopt a DTD of our choice. We would have to accommodate the mark-up that each individual publisher used, and design data handling methods to combine the data and process it.
Data mapping is a familiar activity in any data migration project. However, this project was unusual in that the source data descriptions were defined, but the target was not. One form of documentation used was a spreadsheet indicating the mapping of DTD elements to each other. The target DTD evolved from this process. The rules for derivation were also noted, and these became the starting point for the conversion code.
We started the activity with the six DTDs covered in Alden's analysis, but extended this to include SSSH, as this was recent, authoritative, and a move towards a standard DTD for publishers. The same process was repeated to establish a target full-text DTD. The mapping was conducted for 17 different DTDs, including Dublin Core (see SJMC121 Data Analysis, and SJMC122 Full Article DTD). Different versions of each DTD were also mapped during the project. Of about 170 different tagged items from the header DTDs, we carried about 70 forward into SuperJournal. When expanded to full text, about 100 were carried forward from around 300 tagged items.
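As a sketch of the kind of mapping the spreadsheet captured (the element names below are illustrative, not the actual publisher DTD tags), each source element was paired with a target SJ-DTD element and a derivation rule:

```python
# Illustrative sketch of the header-element mapping exercise. Element names
# are hypothetical; the real spreadsheet mapped ~170 tagged items from the
# publisher header DTDs down to the ~70 carried forward into SuperJournal.

HEADER_MAP = {
    # (source DTD, source element) -> (SJ-DTD element, derivation rule)
    ("majour", "ATL"):   ("TITLE",   "copy"),
    ("majour", "AUG"):   ("AUTHORS", "split on author sub-elements"),
    ("sssh",   "title"): ("TITLE",   "copy"),
    ("sssh",   "issn"):  ("ISSN",    "strip hyphens"),
}

def map_element(dtd, element):
    """Return the target SJ-DTD element for a source element, or None
    if the item was not carried forward into SuperJournal."""
    entry = HEADER_MAP.get((dtd, element))
    return entry[0] if entry else None
```

A lookup table like this makes the "carried forward or dropped" decision explicit for every tagged item, which is essentially what the spreadsheet documented.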
The objective of the data conversion was to process the data into a consistent format for indexing and loading into the SuperJournal application, not simply to generate HTML versions of the data for display on the Web. See SJMC140 Data Conversion Process, for a detailed description of the data conversion process.
Although we had accepted that we would have to deal with multiple input types, we sought to minimise any conversion code duplication. This led us to treat header and full-text data in compatible but slightly different ways. The approach adopted is outlined below.
An SGML DTD was an appropriate language to describe what, in data modelling terms, could be called the synthetic data architecture: a synthesis of real world data, but remaining artificial. A bonus was that parsing offered syntax QA "for free". The principle of treating conversion in a staged fashion was well established, see Alschuler and the Rainbow DTD. The DTDs can be viewed as "halfway houses".
Microstar's Near & Far was used to document the DTD development, albeit retrospectively: it was not used during DTD creation, but for documentation and subsequent maintenance.
The project added Dublin Core metadata to the HTML pages generated during the conversion process, which is documented in SJMC141 Metadata Specification. As the prime use of metadata was for Web information discovery, and SuperJournal articles were only readable via the application, the generation was for experimental purposes rather than for practical use.
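The usual way to embed Dublin Core metadata in HTML is via `<META NAME="DC.*">` tags in the page head; a minimal sketch of generating such tags (the field selection and function name here are illustrative, not taken from SJMC141) might look like:

```python
# Sketch of embedding Dublin Core metadata in generated HTML pages.
# The DC.* NAME convention is standard Dublin Core practice; the
# particular fields and this helper are illustrative assumptions.
from html import escape

def dc_meta_tags(header):
    """Render a dict of Dublin Core fields as <META> tags for the HTML <HEAD>."""
    tags = []
    for dc_name, value in header.items():
        tags.append('<META NAME="DC.%s" CONTENT="%s">'
                    % (dc_name, escape(value, quote=True)))
    return "\n".join(tags)

print(dc_meta_tags({"title": "An Example Article", "creator": "A. N. Author"}))
```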
The project did not explore the use of DSSSL, which became ISO/IEC 10179 in April 1996, and may be suited to this kind of activity. This may be examined in retrospect, together with JADE, its associated (public domain) engine.
First we developed a "generic" header DTD for SuperJournal which contained elements from all the various publisher DTDs. This was the input file definition for the conversion process, and all incoming SGML header files were parsed against it using the "sgmls" parser.
Next we developed a "SuperJournal header DTD" (SJ-DTD) which defined the data elements that were extracted for the electronic journal application. This could be viewed as the conversion output file definition, and the data definition was the basis for the objectbase and database design. The conversion process resulted in a subset of the header data provided by the publisher, which was then parsed, and produced files for input to the subsequent data load processes.
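The sgmls parser emits its results as an ESIS event stream, one event per line ("(GI" opens an element, ")GI" closes it, "-text" is character data). As a sketch of the extraction step, hedged in that the element names and the event subset handled here are illustrative (real sgmls output also carries attribute and entity events), the stream can be walked to pull out header fields:

```python
# Minimal sketch of walking the ESIS output emitted by the "sgmls" parser.
# Only open/close/data events are handled; element names are illustrative,
# not the actual generic-DTD tags used in SuperJournal.

def extract_fields(esis_lines, wanted):
    """Collect the text content of the wanted elements from ESIS events."""
    fields, stack = {}, []
    for line in esis_lines:
        if line.startswith("("):        # element open: push its name
            stack.append(line[1:])
        elif line.startswith(")"):      # element close: pop
            stack.pop()
        elif line.startswith("-") and stack and stack[-1] in wanted:
            fields.setdefault(stack[-1], "")
            fields[stack[-1]] += line[1:]
    return fields

esis = ["(ARTICLE", "(TITLE", "-A Sample Title", ")TITLE",
        "(AUTHOR", "-A. N. Other", ")AUTHOR", ")ARTICLE"]
print(extract_fields(esis, {"TITLE", "AUTHOR"}))
```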
An advantage of this approach was that header data from a new source could be added easily: the "generic DTD" was extended to accept input from other DTDs, e.g. SSSH. Having one SuperJournal header DTD meant that only one output filter need be written, so, to produce output for Dublin Core, for example, no new filters were needed. It should be noted that the header data is not converted to or stored as HTML.
However, the resultant size of the "generic" DTD meant that it became increasingly difficult to maintain. This is one reason why DTDs supplied later in the project were not included. (See SJMC120 DTD Definition, for fuller discussion).
It was not feasible to define a "generic DTD" for full text: the differences in data description could not be accommodated, given the SGML mark-up then used by the publishers. A "target" full-text SuperJournal DTD (FSJ-DTD) was defined, but conversion requires more in the way of pre-processing and re-formatting. Therefore, for each DTD, a separate conversion module was created to convert the data to conform to FSJ-DTD, which is then used to convert to HTML, though other formats may be included, e.g. XML.
The use of FSJ-DTD also means data can be supplied in a consistent format to SGML-capable datastores, indexers, etc. The header elements of FSJ-DTD were consistent with SJ-DTD, supporting the processing of header data for articles submitted in full-text SGML format.
A number of options were considered for the conversion coding, and OmniMark, an SGML-aware 4th generation hypertext programming language, was selected. However, due to time constraints, the header conversion coding actually commenced with a UNIX utility (Lex) and C++. The conversion code ran to approximately 5,000 lines, with roughly the same again for the DTD definitions. It coped with nine DTDs, including SJ-DTD. SJ-DTD was used by two publishers as a stop-gap measure for header mark-up while their own DTDs were being written.
Full-text conversion is performed using OmniMark scripts, which consist of approximately 200 lines of code per DTD. The parsing of full-text SGML is done using the parser included with OmniMark. We have found its error messages more "developer friendly" than those from sgmls.
As mentioned earlier, no header information is actually stored as HTML; it is converted for display "on-the-fly". Only full articles are converted and stored in HTML, for performance reasons.
The articles produced include:
Internal links, e.g. from reference to bibliography, and from text to footnotes, figures, equations, and tables.
Intra-SuperJournal links, created where an article was found to cite an article already available via SuperJournal, allowing the user to go immediately to the full text of the referenced article.
External links were implemented for MEDLINE, though others could have been supported. Where references were tagged, the required data was extracted and sent via email to MEDLINE, which returned, for matched citations, the unique identifier that provided subsequent access. The link was then tagged as a MEDLINE reference within FSJ-DTD and converted to an embedded URL (actually a call to a logging script; see SJMC260 Usage Statistics) within the HTML file.
Forward Chaining was implemented by creating a list of articles that had cited an article already loaded in SuperJournal, found during the intra-linking mentioned above. These were termed "cited-by" links.
Illustrations were converted to an appropriate format for display. Small GIF thumbnails were the default display within HTML, linked to the full-sized GIF image, which was displayed in a separate window.
Hyperlinks. Where URLs were either explicitly tagged, or identifiable by string-matching, they were tagged in FSJ-DTD as a URL, allowing a hyperlink to be created when converted to HTML.
Mini-Contents. We have also created what might be termed "graphical contents pages". We picked out "like objects", e.g. figures, which could be viewed separately from the article, allowing the user to scan through an issue's articles quickly. Each image could be expanded, or used to display its reference point within the article.
Bibliography. For those journals available in full-text SGML, HTML, or where their reference section had been extracted from the PDF (see below), the bibliography was made available essentially as extra header data.
Special Characters. Greek letters were replaced with in-line GIF images, while other special characters were displayed in square brackets, e.g. [pi]. An alternative approach for the Greek characters was to exploit the font handling extensions supported in later versions of Web browsers, e.g. <FONT FACE="Symbol">a</FONT> would be rendered as α. However, the GIF approach was more universal and did not depend upon a particular version of a browser being used.
Illustrative screen shots are contained in SJMC230 SuperJournal Application: Special Features.
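The intra-linking and "cited-by" steps above amount to matching each tagged reference against the articles already loaded. A minimal sketch, assuming a normalised (journal, volume, first page) citation key (the key choice and all names here are illustrative, not the project's actual matching rule):

```python
# Sketch of intra-SuperJournal link matching: a reference is matched to a
# loaded article on a normalised (journal, volume, first page) key; each
# match yields both a forward link and a "cited-by" entry for the target.

def cite_key(journal, volume, first_page):
    return (journal.strip().lower(), str(volume), str(first_page))

def link_references(article_id, refs, loaded):
    """Return (forward links, cited-by updates) for one article's references.

    loaded maps citation keys to the ids of articles already in the system.
    """
    links, cited_by = [], []
    for ref in refs:
        key = cite_key(ref["journal"], ref["volume"], ref["page"])
        target = loaded.get(key)
        if target:
            links.append((ref, target))            # reference -> full text
            cited_by.append((target, article_id))  # target now "cited by" us
    return links, cited_by
```

Accumulating the second list as articles are loaded is what produces the forward-chaining "cited-by" links described above.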
Any HTML data files received were processed to conform to the requirements of the application, and enhanced where additional functionality could be added in a scalable fashion, as detailed above.
The PDF files were referenced as external data files, so were essentially "untouched". There were exceptions, however, most notably where an article bibliography was extracted from certain PDF files as text. Once parsed, it was displayed and processed in a similar way to those from a full-text SGML file, including the insertion of weblinks into the PDF article.
The handling of multimedia data could not be considered scalable, and no repeatable processing requirements were established. Our experience with one publisher's data is detailed in SJMC160 Multimedia.
A schematic provides an overview of the conversion process.
Full details of the entire production process are available in SJMC130 Production Process; in outline, the end-to-end process ran from receipt of the publisher files, through parsing and conversion, to link creation and loading into the application.
There are lower-level problems and issues listed in the various reports covering the Data Handling. A good many are technical, but the overriding problem throughout the project was the submission of poor and invalid SGML. It must be acknowledged that bug-free data was received throughout from some publishers, and that substantial improvements were achieved by others, but by no means all.
Appendix A. SuperJournal Data Handling Reports
Last modified: July 07, 1999