SuperJournal Data Conversion Process



Ann Apps, Manchester Computing, University of Manchester

SuperJournal Technical Report SJMC140

Contents:
1.  Overall Approach
2.  Tools Used
3.  SGML Pre-Processing
4.  SGML Header Conversion
5.  Full Text SGML Conversion
6.  Usage Logging
7.  Error Logging
8.  Problems and Issues
9.  Technical Appendices

1. Overall Approach

The SuperJournal project takes electronic journal data from several publishers, most of whom use different DTDs. Some journals are delivered as full articles in PDF, and sometimes HTML, with metadata (header) information as SGML. Others are delivered as articles in full-text SGML, mostly with accompanying PDF.

It was realised very early in the project that asking all the publishers supplying data to use the same DTD was not feasible. So two SuperJournal DTDs were developed: a metadata/header DTD (SJ.dtd) and a full-text DTD (FSJ.dtd) (see [SJMC120, SJMC121]). The publishers' supplied SGML is translated into SuperJournal SGML during the data conversion process.

The article header information is used for loading the article metadata into the SuperJournal application, and then for display of the headers and tables of contents. The publishers' SGML is translated into SuperJournal Header SGML via an all-encompassing generic DTD. During this generation of SuperJournal Header SGML, each article is assigned a unique SuperJournal identifier (SJAID).

Full text SGML supplied by the publishers is translated into Full SuperJournal article SGML, which is then translated into HTML for display to the end user. This SGML is also processed to provide extra functionality to the end user, e.g. display of article references, and to extract the article metadata where this is not provided separately.

2. Tools Used

2.1 SGML Processing Tools

2.1.1 sgmls

sgmls is a freely available, validating SGML parser. It outputs a simple, easily parsed, line-oriented, ASCII representation of an SGML document. SuperJournal downloaded sgmls from ftp.ex.ac.uk (directory: /pub/SGML). It is available for many platforms, with C source code provided. sgmls was chosen for SGML parsing within the SuperJournal Header data conversion process because of its availability early in the project. The sgmls output produced by parsing the generated SuperJournal SGML article header is also used as the input format for the SuperJournal application data load.

2.1.2 OmniMark

OmniMark is an SGML-aware, fourth-generation programming language which includes powerful pattern analysis and manipulation facilities. It is used for the full-text SGML data conversion process within SuperJournal, both for conversion and for SGML parsing. It is also used within the SuperJournal application for dynamic displays of the article header information (see Section 4.6) and article references (see Section 5.8.4).

OmniMark is a rules-based language, which is SGML-aware through its in-built SGML parser. For SGML processing, a rule may be written describing an action to be taken for each SGML element within a document. General text documents may be processed, using OmniMark's pattern matching capability, by writing rules with actions to be taken when particular patterns of text are encountered. This pattern matching may also be employed when processing the content of an SGML element.

OmniMark is available for many platforms. The programming language itself is rather esoteric, but was not too difficult to learn. Once the language was learnt, rapid program development was possible. SuperJournal has found one OmniMark licence to be adequate for both the data conversion work (largely performed by one person) and the dynamic application displays, all of which are performed on a single machine. SuperJournal was able to obtain an academic licence for OmniMark but commercial licence rates may be prohibitive for some companies, especially where they require many licences for people working in different locations.

2.2 Programming Languages

2.2.1 Unix Shell Scripts

Because SuperJournal is on a Unix platform, all the controlling data conversion scripts are Unix shell scripts. These scripts generally loop through every file of a particular type in a directory (i.e. a journal issue) and call the relevant program with any requisite arguments. These Unix shell scripts would probably not be portable to another platform.
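
For illustration, a minimal controlling script of this kind might look like the following sketch, in which the program name sjconvert and the log file name are invented:

#!/bin/sh
# Illustrative controlling loop: process every SGML header in an
# issue directory; "sjconvert" stands in for the real conversion program.
for file in *.sgml
do
    sjconvert "$file" >> convert.log 2>&1
done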

2.2.2 Lex

Lex is the Unix lexical analyser generator. It provides powerful string pattern matching facilities. When used in conjunction with C or C++, the generated lexical analyser can be used to process specific portions of text within a document.

Lex scripts are used for many of the pre-processing actions which are required before the main SGML header data conversion. A lex-generated script is used in conjunction with a yacc-generated parser (see Section 4.3) to process the input for the main SGML header data conversion program.

Being a specifically Unix tool, Lex is probably not portable to other platforms. It was chosen because of existing expertise in its use. Like many Unix tools its syntax is cryptic, but it provides a powerful facility once learnt.

Other choices for this lexical analysis would have been Perl or TCL, but there was no initial skill with these. OmniMark would also have been a good choice, but it was not available early in the project.

The problem with using any pattern matching or text recognition software occurs when text input has been typed differently from expected. It has been necessary to update these Lex scripts many times during the SuperJournal project to allow for different typing styles. However, the quality of the SGML provided by the publishers has improved during the project, making some of the actions in the pre-processing scripts redundant.

2.2.3 Yacc

Yacc ("yet another compiler-compiler") is a Unix tool which is used to produce a document parser from a grammatical specification of the language used in the document. In SuperJournal it is used in conjunction with a Lex lexical analyser to parse, and thus control the processing of, the input for the main SGML header conversion program.

Like Lex it is a specific Unix tool with a cryptic syntax. Again, it was used because of existing expertise and availability. OmniMark would probably have been a better choice of tool, but it was not available early in the project.

There were one or two instances when it was necessary to make the syntax allowed by the yacc-generated parser more restrictive than that allowed by the Generic SuperJournal DTD, because yacc seemed unable to cope with too many conflicting options, but generally it was a good tool for the job.

2.2.4 C++

The SuperJournal Header data conversion program is written in C++ (with front-end input via lex/yacc). C++ was chosen because of availability and existing expertise. OmniMark was not available early in the project when the Header data conversion was written. C++, being an object-oriented language, allows programs to be written in a modular fashion, thus improving maintainability.

2.3 PDF Tools

2.3.1 Adobe Acrobat Exchange

Adobe Acrobat Exchange is an essential tool where any PDF manipulation is required. It has been used during the SuperJournal data conversion process for: repairing damaged PDF files; removing security settings from PDF files; assembling or splitting PDF files; adding web links to PDF files.

Unfortunately Acrobat Exchange, which has a manually operated graphical interface, does not lend itself to batch processing of PDF files, and so is not a good tool for scaleable automatic processing of large amounts of data. It would be possible to write batch processing programs for PDF files using the libraries provided by the Adobe Developers' Association. There are also several commercially available tools for batch processing PDF files. Performance of the Unix version of Acrobat Exchange was not good, with a slow response to mouse clicks.

2.3.2 Adobe Acrobat Distiller

Acrobat Distiller was used to generate PDF from PostScript for two journals whose articles were supplied in this format. Acrobat Distiller can be used to batch process an issue of PostScript files. It is capable of generating a single PDF file from a set or a directory of PostScript files, a facility which was used where a journal issue was supplied as single PostScript pages.

2.3.3 pdftotext

pdftotext extracts the text from a PDF file. It is supplied as part of a package, xpdf, (http://www.aimnet.com/~derekn/xpdf) written by Derek B. Noonburg (derekn@ece.cmu.edu). It is freely available for academic use; for commercial use contact Derek. It runs on several platforms.

pdftotext is used during the SuperJournal data conversion process for PDF validation. It reports damaged and secured files. It is unable to extract the text from secured PDF files, whether or not a password has been used, because of US encryption export laws.

The extracted text is not perfect, but is generally good. There was a possibility that the extracted text could be used:

2.3.4 Adobe Developers' Association

Joining the Adobe Developers' Association, Adobe Acrobat Programme:

http://www.adobe.com/supportservice/devrelations/main.html

email: euroADA@adobe.com

provides the Adobe Acrobat Plug-ins SDK and the Adobe Acrobat ToolKit library, plus support for writing PDF processing applications. Using these libraries, it would be possible to develop batch processing tools, for example to:

SuperJournal has not, in fact, developed PDF tools but the Adobe Acrobat ToolKit library was required for the Reference and Thumbnail Extraction tools (described below).

2.3.5 Reference Extraction and Linking Tool

The reference extraction tools were developed for SuperJournal by the PDF Research Group in the Computer Science Department at Nottingham University, under Prof. David Brailsford. There are two tools, refnum and refpos: the first works on numerically listed references within the PDF file, the second on PDF files where the positioning delimits references. Both tools extract the references from the PDF article as text. If these references are then processed to find their Medline references, URL links to Medline may be added into the PDF file using the same tools. Use of these tools requires a little initial setting-up for each journal, to determine where the references lie on the pages of the PDF article and to exclude page headers and footers. To run these tools requires the Adobe Acrobat ToolKit library provided with membership of the Adobe Developers' Association, Adobe Acrobat Programme.

The reference extraction and linking tool has been used to generate article reference linking within some of the journals in the Molecular Genetics and Proteins cluster which are not supplied in full text SGML format.

2.3.6 Thumbnail Extraction Tool

The thumbnail extraction tool, exthumbs, was developed for SuperJournal by the PDF Research Group in the Computer Science Department at Nottingham University, under Prof. David Brailsford. It extracts the page index thumbnails from a PDF file as GIF images (the thumbnails must have previously been generated in the PDF file). To run this tool requires the Adobe Acrobat ToolKit library provided with membership of the Adobe Developers' Association, Adobe Acrobat Programme.

It was intended to use these thumbnail images, similar to the GIF thumbnail images provided with full text SGML articles, within a "flip index" tool or a "minicontents" index, but only experimental versions were produced.

2.4 Graphics Manipulation Tools

2.4.1 ImageMagick

ImageMagick (http://www.wizards.dupont.com/cristy/ImageMagick.html) is freely available by FTP from ftp://ftp.wizards.dupont.com/pub/ImageMagick/ (the only requirement being to send a picture postcard of the area where you live to the author John Cristy: cristy@dupont.com). ImageMagick provides tools for display of many graphical types, and conversion between types and sizes. It runs on several platforms where X11 is available.

ImageMagick is used within SuperJournal to convert images to GIF where they are supplied in different formats, and to create thumbnail images where they are not supplied. It is also used for investigative image display if there are problems during data conversion processing.

Because ImageMagick has a command line interface for image conversion, it was simple to use for batch conversions. It was found to be costly in machine time to process the figures for a whole journal issue, though this mattered less when run as an overnight job. Because the generated GIF images were not optimised, the file sizes after conversion were rather large.

2.5 Referencing Tools

2.5.1 Medline Citation Matcher

Medline Citation Matcher is used to determine Medline identifiers for references within article bibliographies. It is accessed via email at:

citation_matcher@ncbi.nlm.nih.gov

and returns a list of identifiers by email. An email sent to this address with the word "help" in either the subject or the body of the message will return a "help" document detailing Medline formats.

The URL for accessing a Medline abstract using a returned identifier is:

http://www4.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&uid=<Med_UID>&Dopt=r

where <Med_UID> is the Medline identifier.

2.6 Search Engine Tools

2.6.1 BRS Tools

BRS tools are part of the BRS Search Engine which underlies the NetAnswer search tool included in the SuperJournal application. The SuperJournal SGML Header data conversion program generates files for loading into the appropriate BRS database (there is one for each cluster). This BRS data load occurs as part of the SuperJournal data conversion process.

2.6.1.1 BRS Verification

The generated BRS data load file is verified by calling brsvfy prior to load. This call to brsvfy occurs within the main SuperJournal SGML Header data conversion program.

2.6.1.2 BRS Load

Articles are added to the relevant BRS database by calling brsload with the -add option.

2.6.1.3 BRS Modification

If modifications are required to an article already in BRS, this is done by calling brsload with the -modify option. But for modification the BRS document number must be included at the start of the BRS data load file. This is ascertained by using brsmate (see below). Because of the requirement for inclusion of the BRS document number in the file, this is a manual, unscaleable operation.

2.6.1.4 Simple Search Interface

BRS provides a simple search interface, brsmate. This is used within the data conversion process to determine BRS document numbers where a modification is required.

3. SGML Pre-Processing

The SGML and the filenames supplied by some publishers require some pre-processing before they are suitable for input to the main SuperJournal data conversion programs. In general, the pre-processing is performed by Lex-generated scripts, which act on specific portions of text in the SGML header, called for each SGML header in the issue via a Unix shell script.

3.1 General Pre-Processing

The SGML from many (but not all) publishers contains unnecessary control characters, such as "carriage return", probably introduced by word processing software. These are removed before any other processing occurs.
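
For instance, stripping carriage returns can be done with a standard Unix filter; a one-line sketch (filenames illustrative):

tr -d '\r' < header.sgml > header.tmp && mv header.tmp header.sgml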

SuperJournal data conversion requires the SGML files to have the file extension .sgml (or .hd for a header where both full-text SGML and header SGML exist), .pdf for PDF articles, and .html for articles supplied directly in HTML. Because SuperJournal runs on a Unix platform, case is significant. Unix shell scripts were written to change file extensions as required: each takes the existing extension as an argument, then loops through all files of that type in the directory and renames them with the expected extension. Separate scripts were provided to change the file extension to .pdf, .ps, .sgml and .hd. Similar scripts, but without an argument, add a file extension (.ps or .sgml) where files are supplied with none. For SGML files, these scripts are run only when no other pre-processing is required. There is also a script to make all filenames in a directory lower-case, which may be necessary to make figure references and filenames consistent.
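
A minimal sketch of such a renaming script, taking the existing extension as its argument (the details are illustrative):

#!/bin/sh
# Rename every *.$1 file in the current directory to *.sgml.
for file in *."$1"
do
    base=`basename "$file" ."$1"`
    mv "$file" "$base.sgml"
done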

After any journal specific pre-processing (described below) has been performed, a further Lex generated program is run for SGML headers. The call to this program is included in the main header data conversion script. It does not edit the base SGML files, but makes various changes to the SGML document which is input to the header conversion program. Specifically this pre-process stage:

A similar pre-process script is run before full text SGML data conversion for some journals.

An awk script (awk being another Unix utility) is also run within the header data conversion script; it attempts to shorten lines which are too long, e.g. within an abstract, because over-long lines would overflow the input buffers of some of the data conversion software.
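
The effect is roughly that of the following awk sketch, which folds lines at the last space before a maximum length; the 200-character threshold here is an assumption, the real script's limit being unstated:

awk 'BEGIN { max = 200 }
{
    line = $0
    while (length(line) > max) {
        n = max
        for (i = max; i > 1; i--)
            if (substr(line, i, 1) == " ") { n = i; break }
        print substr(line, 1, n)
        line = substr(line, n + 1)
    }
    print line
}' header.tmp > header.folded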

3.2 Packed SGML Headers

The SGML headers for journals CCS1, CCS3, CCS8, PS3, MGP6, MGP11, MGP15, MC11 are supplied packed into one file per issue. This file has to be unpacked into separate SGML headers with filenames consistent with the full article files.

For all these journals except MC11, this is performed by a Lex generated program. Articles are delimited by <art>...</art> tags. Header SGML files are named vvpppp.sgml, except for CCS8 where the names are iipppp.sgml, where: vv is volume number; ii is issue number; pppp is start page. These names correspond to the names of the full article files supplied by the publisher.

For MC11, the SGML headers are packed into one file delimited by <arthead>...</arthead> with article header elements defined according to the publisher's DTD. The approach here was to process this file with an OmniMark script which, as well as splitting up the file into its constituent article headers, converts the SGML to SuperJournal Header conformant SGML. It is these SuperJournal SGML headers which are then processed through the main header data conversion program. The reason for using this route for this journal was that it was included at a late stage in the project when OmniMark was available, and it uses a DTD which is substantially different from other publishers' header DTDs. To include this new DTD in the Generic SuperJournal DTD and the yacc generic-SGML parser would have involved more work than writing the OmniMark script.

This OmniMark script identifies the SGML elements which correspond to SuperJournal SGML elements for each article, as it parses the SGML against the publisher's DTD, and then outputs the article in SuperJournal-SGML format. Each article file name is of the form vvpppp.sgml. The OmniMark program also outputs a Unix shell script containing a list of commands to rename the PDF articles according to the same convention, which is run after all the article headers have been converted.

3.3 Journal Specific Pre-Processing

Apart from any manual pre-processing, journal specific pre-processing is performed by Lex-generated scripts, called for each file in an issue with any required arguments.

This pre-processing may include:

The pre-processing required for each journal is detailed in the table in Appendix 9.2. This reflects the pre-processing currently performed in the SuperJournal data conversion process. Earlier in the project it was necessary to perform more pre-processing to include missing fields in the SGML such as cover date. The reduction in pre-processing reflects an improvement in the quality of the SGML supplied.

3.4 Publisher Supplied HTML

Journal MGP6 is supplied with SGML headers plus full articles in both PDF and HTML. Some changes are required to this supplied HTML for display in SuperJournal. These are made by a Lex-generated script, with the volume and issue numbers included as typed-in arguments. Changes made are:

4. SGML Header Conversion

All journal headers are processed through the SGML Header Conversion program to generate the application load files. Journals whose supplied SGML is processed by this program are: all CCS journals except CCS7, all PS journals except PS8 and PS17, MGP5, MGP6, MGP10, MGP11, MGP12, MGP13, MGP15, MC2, MC3, MC5, MC7, MC9. In fact, some of these journals are supplied in SuperJournal SGML format. For all other journals, SuperJournal SGML headers are first generated from the supplied, generally full-text, SGML. These generated SuperJournal SGML headers are then processed through the SGML Header Conversion Program.

Supplied SGML headers for journals CCS7, PS8 and PS17 were previously processed directly by the SGML Header Conversion program but, after significant changes to the publisher's DTD at a late stage in the project, they are now pre-processed by an OmniMark program which transforms them to SuperJournal Header SGML first. It was decided that writing this OmniMark script would be simpler and faster than trying to integrate this new DTD into the SuperJournal Generic Header DTD and generic processing program.

SuperJournal Header Conversion is controlled by a Unix shell script. The files generated during this conversion are listed in the tables in Appendix 9.1, which also indicates the base files input to the process. This shell script leaves the issue directory in a clean state by removing any temporary files used.

4.1 Header Conversion Script

The header conversion script, which is run within the issue directory, performs the following actions (a skeleton shell sketch of this loop follows the list):

  1. For every SGML header file
    1. Perform general pre-processing, as detailed above, resulting in a .tmp file.
    2. Parse .tmp file against Generic SuperJournal DTD using sgmls, resulting in a .gen file in the format described below.
    3. Generate SuperJournal SGML header in .sj file, also producing .brs file for BRS data load (see Section 4.3).
    4. If .sj file created successfully
      1. Parse .sj file against SuperJournal DTD, to validate SGML, resulting in .odb file, in the format described below, which is used for SuperJournal application data load.
      2. Make corrections to character entities in .odb and .sj file (see Section 5.9.1)
      3. Precede any double-quote characters in the .odb file with a back-slash so that they will be read correctly by the SuperJournal application data load program.
      4. If .brs file created successfully
        1. Make any corrections to character entities (see Section 5.9.1), including adding images for Greek characters, and convert any `\n' back to newlines in .brs file.
        2. Verify .brs file against the relevant BRS database.
      5. Extract text from PDF file by calling pdftotext.
  2. Make catalogue file for SuperJournal application data load, and journal catalogue entry for later usage-data processing (see Section 4.4.2 and Section 4.4.3).
  3. Create and update Tables of Contents files (see Section 4.4.1).
  4. Make necessary files world-readable for web access.
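
As an illustration, the per-file loop has roughly the following shell skeleton; sgmls, sjsgml and pdftotext are the real programs, but "preprocess", the file handling and the DTD resolution are simplified for this sketch:

#!/bin/sh
for hd in *.sgml
do
    base=`basename "$hd" .sgml`
    preprocess "$hd" > "$base.tmp"     # step 1.1: general pre-processing
    sgmls "$base.tmp" > "$base.gen"    # step 1.2: validate, produce .gen
    sjsgml "$base.gen"                 # step 1.3: emit .sj and .brs files
    if [ -f "$base.sj" ]
    then
        sgmls "$base.sj" > "$base.odb" # step 1.4.1: validate, produce .odb
        pdftotext "$base.pdf"          # step 1.4.5: extract text from PDF
    fi
done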

There is another similar script which is used for validating the supplied SGML before the full conversion above is run. This "validation" script omits the PDF text extraction, and the catalogue and tables of contents creation, allowing a quicker turn-round time during the error correction phase of the data conversion process.

4.2 Parsing Using sgmls

As well as validating the SGML, parsing using sgmls generates output in a format suitable for further processing by simple lexical analysis. This output differs from the input SGML in the following ways:

For example:

<cd year="1998" month="7" day="1">July 1998</cd>

becomes after sgmls parsing:

AYEAR TOKEN 1998
AMONTH TOKEN 7
ADAY TOKEN 1
(CD
-July 1998
)CD

4.3 SuperJournal SGML Generation

SuperJournal Header SGML is generated by a program sjsgml which is written in C++ with a Lex/yacc generated front-end.

At the start of the program an SJSgml object is created. The Lex-generated lexical analyser recognises syntactic tokens in the input SGML (in sgmls output format) which it passes to the yacc-generated parser. This parser checks the syntax of the input and then takes appropriate action. This generally consists of passing information to the SJSgml object, by calling one of its public operations.

When all input has been parsed, the SJSgml object outputs its accumulated information in SuperJournal SGML format, and also in BRS data load format.

The program sjsgml consists of the following C++ classes, whose public operations are listed in Appendix 9.3:

4.4 Tables of Contents and Catalogues

4.4.1 Tables of Contents

After a journal issue has been successfully converted, an SGML table of contents file is created within the issue directory, named

<jid>V<v>I<i>.toc

where <jid> is the SuperJournal journal identifier; <v> is the volume number; <i> is the issue number. A similar journal SGML table of contents file, which lists available issues, is updated. This file is called

<jid>.toc

within the journal directory. Also a main SuperJournal contents file

SJ.toc

in the main Journals directory, which lists journals and latest issues, is updated if the converted issue is now the latest issue.

These tables of contents files effectively catalogue all the articles loaded within SuperJournal. They are used during the generation of reference links within SuperJournal (see Section 5.8.4). Potentially they could also be used to create a different SuperJournal application which does not make use of the ODB-II object database. There may be an excess of information in these tables of contents files, but it was included partly for completeness and partly for possible future use.

All the tables of contents files generated are headed by Dublin Core Metadata, defined using <META> tags. This metadata is described in report [SJMC141].

4.4.1.1 Journal Issue Table of Contents

The journal issue table of contents is created by an OmniMark script which processes all the SuperJournal SGML header, .sj, files in the issue, and extracts the required information from them. This OmniMark program parses the SGML against a DTD (sjarts.dtd) which allows for a list of articles, i.e. it is a copy of the SJ-DTD but with an encompassing new root element:

<!ELEMENT sjarts o o (sj+)>

The created table of contents file is validated by parsing against the SuperJournal TOC DTD (sjtoc.dtd) using the OmniMark SGML parser.

The journal issue table of contents contains:

4.4.1.2 Journal Issue List

The journal table of contents, which is a list of available issues for the journal, is an SGML file in the journal directory. It conforms to the SuperJournal Issues DTD (sjissue.dtd).

A temporary issue list file is created in the issue directory by the same OmniMark program which creates the issue table of contents. This file includes information about the newly converted issue only. If this is the first issue of a particular journal to be loaded into SuperJournal, this temporary issue list becomes the journal issue list. Otherwise this new issue is merged into the existing journal issue list. This is done by an OmniMark program which reads both the new issue file and the existing issue list file, and inserts the new issue information in the appropriate position. This program parses its input data against a DTD which allows for multiple issue lists. It is a copy of sjissue.dtd but with an encompassing new root element:

<!ELEMENT sjisslist o o (jid, sjissue+)>

The journal issue list contains the following information:

4.4.1.3 SuperJournal Contents List

The main SuperJournal contents list is an SGML file, SJ.toc, in the main Journals directory. It conforms to the SuperJournal Journals DTD (sjjnls.dtd). Any new journals added to SuperJournal require a skeleton entry to be added manually to this contents list file.

The SuperJournal contents list is updated by an OmniMark program which reads the temporary issue information file as well as the existing contents list file, parsing against a DTD which allows for the two input files. It is called only when the newly converted issue will become the journal's current issue, which is determined during updating of the journal issue list file.

The SuperJournal contents list contains:

4.4.2 SuperJournal Application Data Load Catalogue

The SuperJournal application data load program, which loads converted data, requires a catalogue file to provide correlation between the SJAID and the data load, .odb, file for each article. The created file consists of an entry for each article in the issue of the form:

SJAID, Full file path of .odb file

This catalogue file is created by calls to a Lex-generated script which extracts the required information from each SuperJournal SGML header, .sj, file in the issue directory. This catalogue file is also used to resolve links when a user follows article references within SuperJournal (see Section 5.8.4).
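
Given this format, resolving an SJAID to its load file can in principle be a one-line match; an illustrative sketch, with the catalogue filename assumed:

grep "^$SJAID," issue.catalogue | sed 's/^[^,]*, *//'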

4.4.3 SuperJournal Usage Statistics Catalogue

Information about journal articles available within SuperJournal is required to produce the SuperJournal usage statistics (described in SJMC260). A journal catalogue is maintained for all journal issues which have been successfully converted and loaded. This consists of a file for each issue named:

<jid>VvvvIiii

where <jid> is the SuperJournal journal identifier; vvv is the volume number as 3 digits; iii is the issue number as 3 digits.

The first line of the file is:

SJLoad Load date (yy.mm.dd)

Following this is a line for each article of the form:

SJArt cluster volume issue year <jid> Article number Base filename

where `cluster' is the SuperJournal cluster, i.e. CCS, MGP, PS or MC.

This journal catalogue file is created by calls to a Lex-generated script which extracts the required information from each SuperJournal SGML header, .sj, file in the issue directory.

4.5 Multimedia

It was originally intended within the SuperJournal project to explore the possibilities of including multimedia elements along with the articles. In fact only one example of multimedia has been included. This consisted of links from the PDF article to an HTML page showing several people's photographs. Each of these was linked to a recording of their speech. Because this was a one-off example, all the links were created manually.

4.6 SGML Header Display

When a user selects "View Abstract" (i.e. view article header/metadata) in the SuperJournal application, the displayed HTML file is generated "on the fly" by an OmniMark script which processes the SuperJournal SGML .sj header file. This same script is also used to display an article's header when a user follows reference and "cited by" links (see Section 5.8.4). It identifies the required SGML elements and outputs them as HTML elements. Output is on the "standard output", so it can be included within other HTML generated by the calling program.

Several links are provided in the generated HTML, where appropriate, for the user to:

For all these links, the URL includes a call to a cgi script which logs the user's action before performing it (see Section 6). This script writes any diagnostics to a defined log file, which is checked and cleared occasionally.

5. Full Text SGML Conversion

The journals which are supplied as full-text SGML are converted to a common Full SuperJournal SGML format, conformant to FSJ-DTD. There is a separate OmniMark script to convert each publisher's SGML to Full SuperJournal SGML (generally publishers use the same DTD across their journals). After this initial conversion, all further Full Text SGML processing is performed on the Full SuperJournal SGML in a .fsj file.

The OmniMark scripts used during the full text SGML conversion are called for each relevant file in the issue directory from controlling Unix shell programs. These shell programs leave the issue directory in a clean state by removing any temporary files used, and set appropriate file permissions on the created files.

Journals supplied as full-text SGML with accompanying graphics are: MGP1, MGP2, MGP13, MC1, MC4. These journals' articles are displayed to the end user in HTML format. In most cases, the publisher also supplies articles in PDF, which is offered to the user as an alternative display format.

Journals MGP3 and MGP7 are supplied in full-text SGML (though previously supplied as SGML headers), but without accompanying graphics. They are not displayed to the end user in HTML format. The generated .fsj file contains just the header information and the fine-grain tagged references.

5.1 Full Text SuperJournal SGML Generation

Each OmniMark script which generates Full SuperJournal SGML identifies the required elements in the publisher's input SGML as it parses the SGML against the publisher's DTD, and outputs them in Full SuperJournal SGML format. Any missing information which is implicit, such as the journal title or ISSN, is added. Missing cover dates are added as typed-in arguments to the Unix shell program which calls the OmniMark script.

This conversion to full SuperJournal SGML includes:

5.2 Marking URLs

Following generation of each full SuperJournal SGML, .fsj, file, any URLs, i.e. text beginning "http://", which are not already tagged as <url> are identified and enclosed in <url> tags. This is performed by a Lex-generated script, which is called by the Unix shell script controlling this conversion phase. This ensures that these URLs will be "hot-linked" when the article is translated into HTML.
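
A crude sed approximation of the idea is shown below; unlike the real Lex script, this simplified version does not check whether a URL is already enclosed in <url> tags:

sed 's|http://[^ <)]*|<url>&</url>|g' article.fsj > article.tagged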

5.3 Full Article HTML Generation

All HTML articles generated by SuperJournal from SGML are produced by a single OmniMark script which processes the already-generated full SuperJournal SGML .fsj file, parsing against FSJ-DTD. This script outputs the relevant HTML mark-up for the SGML elements within the article.

5.4 SuperJournal SGML Header Extraction

For all the journals which are processed through the full text SGML conversion route, the SGML headers are extracted from the Full SuperJournal SGML, .fsj, file. These SGML headers are then processed through the main SGML Header Data Conversion process to create the SuperJournal application data load files. The only exception is MGP13 where the SGML headers are sent separately.

This SGML header extraction is performed by an OmniMark script which parses the .fsj file against the FSJ-DTD. Translation from Full SuperJournal SGML to SuperJournal Header SGML is straightforward.

5.5 Graphics

SuperJournal requires figures to be supplied as a full-size GIF or JPEG image with an accompanying thumbnail GIF image. Where figures are supplied in a different format, they are converted to GIF images of the correct size using ImageMagick. A Unix shell program which performs these conversions is generated during the Full SuperJournal SGML generation. For some journals, tables are also supplied as GIF images. These are treated similarly to figures. Some figures are supplied as "schemes".

During generation of Full SuperJournal SGML, publisher's references to figure filenames are converted to standard SuperJournal figure file references. A Unix shell program which renames the files is generated during Full SuperJournal SGML generation. The SuperJournal figure naming convention is:

Standard size figure <art>f<n>.gif
Thumbnail <art>f<n>th.gif
Larger size figure, if supplied <art>f<n>big.gif

where: <art> is the article's base file name; <n> is the figure number, which may have a letter suffix (e.g. 1b). Naming of tables and equations is similar but with `t' or `e' respectively instead of `f'. (Equations do not have a thumbnail).
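
For example, figure 2 of a hypothetical article with base file name a0123 would be named:

a0123f2.gif      (standard size)
a0123f2th.gif    (thumbnail)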

Where the position of a figure is not defined in the SGML, it is included at the end of the paragraph in which it is first referenced.

5.6 Equations

Mathematical formulae marked up in SGML cause problems when attempting to generate full article HTML because the web browsers display only a subset of the defined SGML character entities. Equations became a problem when the "Materials Chemistry" cluster was included. Within the other clusters any formulae are usually simple.

Some publishers supply equations as separate GIF files. When these are provided they are displayed within the HTML as inline graphics. Occasionally it has been necessary to manually edit the publisher's supplied SGML to include these images rather than the SGML marked up formulae.

Where it is necessary to translate formulae into HTML a "best attempt" approach has been used, e.g. using `/' for division. Where possible, Greek and other symbols are displayed as inline GIF images (see Section 5.9). It does not seem possible to improve on this method for dealing with equations until the web browsers provide more support for mathematical display. A future solution could involve MathML (the Mathematical Markup Language), but this was thought to be too immature during the timescale of the SuperJournal project, and it is not yet supported by the web browsers. Future web browser rendering of MathML will probably involve Java applets. Another possible solution could utilise style sheets and dynamic fonts.

5.7 MiniContents

For journals supplied as full SGML and displayed through the SuperJournal application as HTML, a MiniContents HTML file is created for each article. A MiniContents file provides a quick index into the HTML article file. As well as some article header information, including the article abstract, it includes links to:

Because the MiniContents file includes links to the "next" and "previous" MiniContents within the journal issue, it provides the user with the ability to flick quickly across the articles.

The MiniContents HTML file is generated by an OmniMark script which processes the full SuperJournal SGML, .fsj, file, and extracts the required information. The calling Unix shell script passes to this OmniMark program the base filenames of the "next" and "previous" articles for creation of the inter-MiniContents links.

All the links in the MiniContents file enable logging of usage (see Section 6). The HTML file generated is headed by Dublin Core Metadata, defined using HTML <META> tags. This metadata, described in report [SJMC141], is created by a separate OmniMark program.

5.8 Bibliography Display and Linking

5.8.1 Bibliography Display

For some journals, SuperJournal provides the end user with the option to view an article's bibliography in a separate web browser window. These are journals which are supplied as full text SGML, with the articles' references tagged to a fine-grain, so that individual elements of the references, e.g. authors, title, volume, etc., can be identified. Links to access these bibliography pages are provided with the article's header information (on `View Abstract' or following reference links), in the HTML article, and in the MiniContents for an article. It has been possible to include this functionality in more journals during the time of the project, as more have been supplied as full-text SGML, and as publishers have updated their DTDs to include fine-grain reference tagging.

An article's bibliography is displayed dynamically by an OmniMark program which parses the Full SuperJournal SGML, .fsj, file, and generates an HTML page as it identifies the appropriate SGML elements. Any references which could not be tagged in a fine-grain way within the Full SuperJournal SGML file, e.g. because they were incomplete, are displayed as their original text. The references displayed contain any "hot" links to Medline or to articles in SuperJournal (described below). There is an option for the user to download the list of references in a bibliographic format, and an internal link from the end of the bibliography back to the `Top'.

5.8.2 PDF References Extraction

Using the PDF Reference Extraction Tool (see Section 2.3.5), it is possible to extract the bibliography from a PDF file. If this bibliography is then parsed to identify the separate elements of each reference, generally by position and punctuation, the references may then be displayed and processed in a similar way to those extracted from a full text SGML file.

So far this PDF references extraction has been performed for only one journal. Before the references can be extracted from a PDF article, some positioning information, i.e. co-ordinates within the page, must be ascertained so that any page headers and footers are ignored. The tool is not perfect: sometimes not all extraneous information is ignored, particularly where the bibliography starts part way down a page; sometimes characters are lost on page boundaries; and a reference which crosses a page boundary appears in two incomplete parts. But generally, a large proportion of the references are extracted successfully. References are extracted one per line. For further processing using refpos, i.e. to add Medline links into the PDF file (see below), it is important to maintain the order and number of these lines exactly.

For each PDF article, the references are extracted, using refpos, into a temporary file. They are assembled, along with the article header information from the .sj file, into a temporary full SuperJournal SGML file, with each reference as a text string demarcated by simple SGML bibliography tags, using Lex-generated scripts. At the same time any non-keyword characters from the PDF file are altered. An OmniMark program is then used to parse the references to generate a full SuperJournal SGML .fsj file which contains just the article header information and the bibliography with fine-grain tagging where possible.

5.8.3 Medline Links

Within the "Medical Genetics and Proteins" (MGP) cluster, links to Medline abstracts are added, where the articles are known to Medline. These links may be displayed for:

A list of all the articles' references for an issue is generated, in the format understood by Medline, using an OmniMark program which identifies the relevant SGML elements in the articles' full SuperJournal SGML files. Medline format is:

JournalTitle|Year|Volume|FirstPage|AuthorName|YourKey|

For `AuthorName' just the first author (surname and initials) is included.

Medline returns the list in the same format with Medline identifiers appended to each line. The key used by SuperJournal is:

<filename>R<n>

where <filename> is the base filename of the article; <n> is a unique reference number within the article, allocated sequentially. A dummy line in Medline format is generated even where the full reference cannot be determined, in order to preserve the line ordering for the PDF reference extraction tool. Medline simply returns NOT_FOUND for these dummy lines.
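
As a hypothetical illustration, the request line for the fourth reference of an article with base filename a0123 might be:

J Mol Biol|1998|275|123|Smith AB|a0123R4|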

The reference list returned from Medline is split into sets of

SuperJournal key Medline Identifier

in a temporary file for each article, using a Lex-generated script. A further Lex-generated script inserts the Medline identifiers in the full SuperJournal SGML, .fsj, file, at the end of each bibliographic reference, enclosed in <medline> tags. When HTML is generated, either for the full article, or for the dynamic bibliography page, this identifier is included in a Medline URL.

Where the references have been extracted from a PDF file, the Medline identifiers and URL are added as links in the PDF file, by a further call to refpos for each article, the information returned from Medline being first split into a separate temporary file for each article by a C++ program.

In a similar way, a list of Medline references is also generated and processed for the actual articles in a journal issue, the OmniMark program operating on the SuperJournal Header SGML, .sj, file. Links to Medline for the article are made available from:

Processing of these Medline identifiers for the actual articles in an issue has been problematical because it has been found that often Medline has not yet recorded the latest issues being loaded by SuperJournal. This has meant repeated attempts over time to ascertain these Medline identifiers, which conflicts with SuperJournal's aim to use scaleable data conversion processes.

5.8.4 SuperJournal Internal References

5.8.4.1 Adding Reference Links within SuperJournal

Where referenced articles are available in SuperJournal, links are included with the references, both in the dynamically displayed article references and in the HTML article, if any. This link takes the user to the header information for the referenced article, which in turn contains a link to the full article.

Referenced articles available in SuperJournal are identified by processing the same list of references in Medline format as that sent to Medline, using a C++ program, sjrefaids. The classes and public operations of this program are listed in Appendix 9.4. For each reference, journal name (expanded from any abbreviated form), volume and page are checked against a hard-coded table to determine whether the referenced article could be in SuperJournal. Then table of contents files, for the journal and for the issue which covers the given page number, are read to find the SuperJournal identifier of the referenced article.

During this process any references which cannot be found are recorded in a SuperJournal global directory of "back references". This is to cover a potential problem with late-loading of some journals. If the referenced issue is later loaded into SuperJournal, any journal issues which reference it should be re-linked. In reality, it has been found that most of the unfound references recorded are caused by invalid references (usually bad typing) rather than late loading of journals.

A "Medline-like" file of SuperJournal references and identifiers is accumulated, which is then split and the identifiers inserted into the full SuperJournal SGML, .fsj, file, enclosed in <sjaid> tags, by Lex-generated scripts, in a similar way to the addition of <medline> tags. The <sjaid> element content is:

<publisher_directory>/<SJAID>

where <publisher_directory> is the publisher directory name within SuperJournal; <SJAID> is the SuperJournal article identifier. The publisher directory name is included here for later resolution of the link.

5.8.4.2 Displaying Reference Links within SuperJournal

When an article's bibliography is translated into HTML, either statically for the full article, or dynamically for the separate bibliography display, internal references within SuperJournal, denoted by <sjaid> tags in the .fsj file, become a URL which when activated cause the referenced article's header to be displayed. This action is implemented by a Lex-generated cgi program which resolves the reference (given by a `publisher directory / SJAID' pair), by reading the relevant journal issue SuperJournal Data Load Catalogue file (see Section 4.4.2) to find the file path name of the .sj header file of the referenced article. The actual header display is performed by the same OmniMark script which displays all article headers (see Section 4.6).

5.8.4.3 `Cited By' Links within SuperJournal

Where an article has been referenced by other articles in SuperJournal, a list of the citing articles may be displayed in a separate web browser window, via a link from the article header information. This list includes links to the headers of the citing articles, and hence to the articles themselves. A `cited by' list for an article is an SGML file, with extension .cit, conformant to sjcitby.dtd. It contains:

`Cited by' files are created and updated during the process of finding reference links within SuperJournal described above.

The list of `cited by' links is displayed, via a link on the header display page, by two OmniMark programs. The first one processes this cited article's .sj header file to generate the HTML page headings showing this cited article's title and authors. The second generates the `cited by' list in HTML from the .cit file. Links to citing articles' headers are enabled using the .sj file paths in the .cit file.

This forward chaining of references will include links from an article to any "Erratum" only if the publisher included a backward reference to the corrected article within the "Erratum" article. In reality, publishers do not include these backward correction references. SuperJournal has taken no action to add them, because this would be a manual process and identification of "correction" articles would probably be impossible to define reliably.

5.9 Special Characters

5.9.1 SGML Character Entities

During SGML parsing, using either sgmls or OmniMark, substitutions are made for SGML character entities as defined in the declared entity files. Because different substitutions are required in different situations, it was found necessary to maintain several sets of entity files, and hence several copies of the DTDs each declaring the relevant entity set.

5.9.1.1 sgmls Parsing

There are two entity sets for sgmls parsing:

Special Character Conversion Using sgmls

Characters whose replacement value is defined as surrounded by square brackets (i.e. [abc]), are further surrounded by `\|' (i.e. \|[abc]\|) by the sgmls parser. After each sgmls parse a Lex-generated script is run to remove these surrounding `\|'.

Characters which are displayable in HTML by their character code (e.g. `&#163;' or `&pound;' for a pound sign) have a replacement value in the SGML entity file for the parse against SJ-DTD of the form `[[[163]]]'. These character entities are corrected after all the sgmls parsing is complete, by a Lex-generated script. The correction is made in the files: .sj, .odb, .brs. This implementation method was found necessary to achieve the required character coding in the final output following all parses.
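
These corrections amount to simple textual substitutions; an illustrative sed equivalent of the two Lex scripts might be:

# Remove the \| surrounds added by sgmls (run after each parse):
sed 's/\\|\(\[[^]]*\]\)\\|/\1/g' article.tmp
# Turn the [[[nnn]]] placeholders back into HTML character codes:
sed 's/\[\[\[\([0-9][0-9]*\)\]\]\]/\&#\1;/g' article.sj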

A further character conversion is run on the .brs file. This is an OmniMark script which processes the .brs file against a BRS-DTD (a simple DTD which reflects the format of the BRS data load file). This step has the effect of including any HTML inline images, e.g. for Greek characters (see below), in the .brs file, so that they will display to the end user.

5.9.1.2 OmniMark Parsing

There are three entity sets used for OmniMark parsing:

<img src="http://www.superjournal.ac.uk/sj/GreekGifs/alpha.lc.gif" alt="[alpha]">.

5.9.1.3 Greek and Mathematical Characters

The GIF images included in the HTML for Greek and some mathematical characters were downloaded from:

http://www.anachem.umu.se/graphics/symbols/symbols.html

and are free to educational and non-commercial organisations.

The disadvantage to the SuperJournal approach of using inline GIF images for character entities, by entity file substitution, is that no account is taken of font size. The image is the same size whether included in a bold heading or in plain text. It was felt that the improvement to the displayed page by using these images outweighed this font-size problem, provided reasonable font sizes were used for headings.

6. Usage Logging

All the links provided to the end user within files generated during the data conversion, including those created dynamically via calls from the application, are calls to cgi-bin programs which log the user's action before continuing to display either the requested file or a dynamically created HTML page. These cgi-bin programs are Unix shell scripts which write to a specific log file the following information:

The log file actions are specified in [SJMC261], and the processing of this log file in [SJMC260]. The user identifier is unavailable for logging most of these actions, but logged actions are tied to users, by IP address and time, during SuperJournal statistics generation.
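
A minimal sketch of such a logging wrapper follows; the log file path, field layout and parameter parsing are all assumptions based on the description above:

#!/bin/sh
# Log the user's action, then deliver the requested file.
LOG=/usr/local/sj/logs/usage.log
echo "`date '+%y.%m.%d %H:%M:%S'` $REMOTE_ADDR $QUERY_STRING" >> $LOG
file=`echo "$QUERY_STRING" | sed 's/^.*file=//'`   # crude illustrative parse
echo "Content-type: application/pdf"
echo ""
cat "$file"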

Usage logging occurs for:

7. Error Logging

All the data conversion programs write error and diagnostic information to log files. Between each separate stage of the data conversion process these log files are checked. Any indicated errors are corrected, and the program re-run before continuing to the next stage in the process.

The OmniMark scripts which create dynamic displays of article headers and reference information write any errors to predefined log files. These log files require occasional checking and emptying.

8. Problems and Issues

8.1 Multiple DTDs and Evolution

The SuperJournal Data Conversion process has to deal with many different publishers' DTDs, and should have the potential to include more. It was decided very early in the project that requesting all SGML headers to be supplied according to a single DTD was not feasible. Publishers would not be able to accommodate production of SuperJournal-specific SGML in their busy schedules, but would expect to supply SuperJournal with their existing electronic data. The development of a journal electronic publishing production system would obviously be much simpler if only one DTD were used. It was decided that a common DTD was a necessity for loading the header data into the application database, so conversion processes were set up to transform the publishers' supplied SGML into SGML conformant to the SuperJournal Header DTD.

During the three years of the SuperJournal project many publishers have updated their DTDs, some several times. Some are now supplied as full-text SGML, where they were originally supplied as SGML headers, and probably this would be the case for more journals in the future, were SuperJournal to continue. The SuperJournal Data Conversion process had to be capable of evolution to include these changes.

With hindsight, the original design of the data handling process, using a generic Header DTD and a generic transformation program to SuperJournal Header SGML, was rather ambitious. In theory this was a good strategy, because it involved maintenance of only one data conversion program. In practice, the generic DTD, and more particularly the syntax specified for the yacc-generated parser front-end to the main header conversion program, became unwieldy and increasingly difficult to maintain, change, or extend to new DTDs. It was also found that more journal specific pre-processing was required than had been envisaged.

If OmniMark had been available at the beginning of the project, it could have been used to write a possibly more maintainable generic data conversion program. But a better choice might have been to write a separate OmniMark program to convert each publisher's SGML, the conversion route used for journals supplied as full text SGML. In fact, to accommodate some of the later journal inclusions and DTD changes, new OmniMark conversion scripts which generate SuperJournal SGML have been written, because it was thought this would achieve a more rapid inclusion of the new journal or new SGML than attempting to change the existing header conversion program. Because the header conversion program was written to accommodate SuperJournal SGML, originally for completeness, this new route was possible with no further changes.

8.2 Scaleable Processes

One aim of the SuperJournal project was to produce scaleable data handling processes, because it was necessary to convert and load very large numbers of journal issues and articles with a minimum number of staff. But, at the same time, it was necessary to maintain some level of quality control on the data displayed to the end user.

To a large extent this has been successful. Scaleable processes were essential because of the volume of data handled, some scientific journals being weekly. The data handling process becomes unscaleable where manual input is necessary. The initial transfer of the data into a defined directory structure had to remain a manual process. Apart from that, the main reason for manual input was poor data quality, and journal issues supplied with missing information (e.g. cover date). The programs run during the data conversion process attempt to perform some data quality checking. This means that log files must be checked after each phase of the data conversion process and corrections made where necessary. Experience suggests that this checking has been necessary throughout the duration of the project, and would likely always be necessary. Despite significant improvements in the quality of the data supplied during the project, this quality checking still shows up errors. A single program which takes supplied electronic journal data as input and automatically loads it into an application, with no intermediate control and checking, does not seem feasible.

8.3 Reference Linking

Those journals which are supplied in full text SGML with the articles' bibliographies tagged to a fine enough grain to identify the individual elements of each reference have lent themselves to improved end-user functionality. It has been possible to identify where articles within SuperJournal are cited and include `hot' links to these articles, as well as links to Medline abstracts for journals in the "Molecular Genetics and Proteins" cluster. Both backward and forward chaining have been implemented within SuperJournal.

Towards the end of the project it was found possible to include this linking also for articles supplied as PDF but, because of time and workload constraints, this has been implemented for only one journal. As would be expected, the identification of references is less reliable than from full text, but it is accurate enough to appear worthwhile. However references are identified, data quality becomes an issue. Many internal SuperJournal references have been identified within those journals which have a three year span. It has been noticed that many citations are to previous articles within the same journal.

8.3.1 Errata

Generally, forward chaining has not included links to "Errata". Implementation of these links would either

8.4 Equations and Special Characters

SuperJournal did not find a totally acceptable solution to the problem of displaying equations and special characters, particularly Greek letters, which were marked up in SGML. Greek characters and some mathematical characters are displayed using inline GIF images, but this solution is not ideal, especially where different font sizes are required, e.g. in headings. Other special characters are displayed within square brackets, which is far from perfect but does maintain the character in the article. It does not seem possible to improve on these methods until the web browsers provide more support for mathematical display. A future solution could involve MathML (the Mathematical Markup Language), but this was thought to be too immature during the timescale of the SuperJournal project, and it is not yet supported by the web browsers. Future web browser rendering of MathML will probably involve Java applets. Another possible solution could utilise style sheets and dynamic fonts.

8.5 PDF Manipulation

Generally PDF files were supplied by publishers and displayed unchanged within SuperJournal. But occasionally it was necessary to manipulate PDF files, e.g. to remove security settings; to repair damaged files; or to assemble articles from separate pages. If more multimedia elements had been included attached to journal articles it would have been necessary to add information like a `Base URL' to the PDF file.

The PDF files could possibly have been improved by the addition of metadata within their `Document Information' fields. This was included by some publishers. Potentially SuperJournal could have ascertained this metadata from the articles' SGML headers.

The problem found with manipulation of PDF files was a lack of any command line or batch processing interface. Acrobat Exchange has a manually operated graphical interface which is not suitable for the automatic processing of large numbers of files. It would have been possible to provide PDF batch processing programs using the Adobe Developers' Association Acrobat ToolKit, if time and workload had permitted.

9.  Technical Appendices

