Ann Apps, Manchester Computing, University of Manchester
SuperJournal Technical Report SJMC120
1. Overall Approach
2. Tools Used
3. DTD Analysis
4. Generic SuperJournal Header DTD
5. Generic Full Article DTD
6. SuperJournal Header DTD
7. SuperJournal Full Article DTD
8. SuperJournal Table of Contents DTDs
9. BRS Data Load DTD
10. SGML Character Entity Sets
11. Problems and Issues
12. Technical Appendix: The SuperJournal DTDs
1. Overall Approach
The SuperJournal project takes electronic journal data from several publishers, most of whom use different DTDs. Some journals are delivered as full articles in PDF, with metadata (header) information as SGML. Others are delivered as articles in full text SGML, mostly with accompanying PDF.
SuperJournal required a common article header information format for data capture by the SuperJournal application. Because article headers were generally to be supplied in SGML, SGML seemed the right choice of data format for this data capture. This necessitated a common DTD, the SuperJournal Header DTD. All publisher-supplied header SGML is converted into SGML conformant with the SuperJournal Header DTD during the SuperJournal Data Conversion Process described in [SJMC140]. SuperJournal also developed an all-encompassing Generic SuperJournal Header DTD, against which all publishers' header data will parse.
In addition a SuperJournal Full Article DTD was developed to provide a common format for the full article SGML processing (described in [SJMC140]). This full article common format allowed all journals supplied as full text SGML to be processed by the same data conversion programs following the initial conversion to Full SuperJournal SGML.
Some further SuperJournal DTDs were developed, which define tables of contents at various levels within the journals hierarchy.
This report accompanies reports [SJMC121] "SuperJournal Data Analysis" and [SJMC122] "SuperJournal Full Article Data Analysis", which provide:
2. Tools Used
Few software tools were used during the DTD analysis and definition. SGML parsers (sgmls and OmniMark) were employed to validate the DTDs. "Near & Far Designer" was used for DTD documentation and also provided some validation. SuperJournal did not investigate any SGML authoring tools because all SGML data was supplied by the publishers.
SGML parsing tools were used to validate the developed DTDs. The same SGML parsers, i.e. sgmls and OmniMark, were used as for the SuperJournal Data Conversion Process. They are described in [SJMC140].
Near & Far Designer, from Microstar Software Ltd. (http://www.microstar.com), provides a graphical "tree" representation of a DTD. This tool could potentially be used for DTD design or update, if a graphical tool is the preferred development method. Within SuperJournal it was used for DTD documentation, some of the resulting diagrams and reports being in [SJMC121, SJMC122]. It also provided additional DTD validation. Comments about the DTD elements, such as a title in "words" rather than a probably cryptic name, and a description, could be added, either through the graphical interface, or by their inclusion in the actual DTD in the required format. Near & Far Designer also captures information, and generates various reports, about element attributes. Unfortunately it was found that descriptive comments associated with DTD elements lost their formatting, including newlines, within reports, making them potentially less useful.
3. DTD Analysis
The initial phase of SuperJournal DTD definition involved analysing the DTDs used by the publishers involved in SuperJournal. To a large extent these were the DTDs used for the supplied journals, but one or two DTDs were included in the original analysis which have not been included in the data conversion process, because these publishers' journals were not subsequently included in SuperJournal. During the project, when new journals with different DTDs have been included or DTDs have been changed or updated, these DTDs have been added to the analysis.
Most publishers use their own DTD, though some use Majour or SSSH for their article headers. In fact, many of the DTDs are customisations of Majour/SSSH or Elsevier, which assisted the process of DTD analysis. But some have developed their own very specific DTDs. Some DTDs are very verbose and have been written to include every possible different case, which makes the SGML much more difficult to process. Some publishers include print formatting information, which theoretically should not be included in the SGML (which should define document structure, not print formats). Some DTDs, which appear to have been developed by typesetters who create SGML for several publishers, allow several different means of capturing the same information. At the other extreme, SuperJournal has been supplied with very simple DTDs: one specifically for SuperJournal article header supply, and one very minimal full text DTD, though with special enhancements for SuperJournal. Some publishers chose to use the SuperJournal Header DTD, either temporarily or for the duration of the project, for article headers supplied to SuperJournal.
During the SuperJournal project, new journals with different DTDs have been included. Also, the DTDs for existing journals have been changed or updated, in some cases changing from article header SGML to full article SGML.
The detailed results of this DTD analysis are given in spreadsheet form in the "SuperJournal Header Data Mapping" in [SJMC121] and "SuperJournal Full Article Data Mapping" in [SJMC122]. Dublin Core, whose primary use is for metadata specification (see [SJMC141]), was initially included in the header data mapping, because it was used for article header data specification for a software application which could possibly have been included in SuperJournal. For completeness, the SuperJournal DTDs are also included on these "Data Mapping" spreadsheets. These spreadsheets attempt to indicate where elements are optional or mandatory, single or repeatable. But they do not always indicate levels of structure within the DTD.
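The optional/mandatory and single/repeatable distinctions recorded on the spreadsheets correspond to the occurrence indicators used in SGML content models. As a minimal sketch of that correspondence, the Python below classifies each element of a content model. The function name and the sample element names are hypothetical, and only a flat, comma-separated content model is handled, not nested groups.

```python
# Sketch: classify each token of a flat SGML content model by its
# occurrence indicator. Element names used here are illustrative only.

OCCURRENCE = {
    "": ("mandatory", "single"),       # no indicator: exactly once
    "?": ("optional", "single"),       # zero or one
    "+": ("mandatory", "repeatable"),  # one or more
    "*": ("optional", "repeatable"),   # zero or more
}


def classify(content_model: str) -> dict:
    """Map element name -> (mandatory/optional, single/repeatable)."""
    result = {}
    for token in content_model.strip("() ").split(","):
        token = token.strip()
        indicator = token[-1] if token and token[-1] in "?+*" else ""
        name = token[:-1] if indicator else token
        result[name] = OCCURRENCE[indicator]
    return result


print(classify("(title, author+, abstract?, keyword*)"))
```

A table of this kind records occurrence only; as noted above, it does not capture the levels of structure within a DTD.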
It was found that, despite differences in tag naming and DTD structures, there was a great deal of commonality amongst the information included in the DTDs. Where the information diverged was generally where publishers had included elements for their own use, particularly during their publishing production processes or for print formatting, information which was of little interest to SuperJournal. There were differences in the levels of structure within the header DTDs, ranging from largely flat structures to much deeper ones.
It was discovered that in some DTDs significant information necessary to SuperJournal, such as publisher name, ISSN, or cover date, was missing or optional. Most of this information is deducible or, in the case of cover date, can be included during the data conversion process, but it would seem preferable for all article SGML, both header and full text, to be self-identifying. In some cases, although the information was missing from the DTD, it was included in the actual data supplied, but there were some instances where full article SGML was supplied without page numbers, which are not derivable. Generally, the DTDs where this type of header data was missing or optional were the "Elsevier-like" DTDs where this information is supplied as attributes on the main "article" tag, rather than in separate tags. SuperJournal had a particular problem, which required manual intervention, with DTDs which allowed author names to be given as a simple text string rather than separately tagged authors with separated name elements, when the publisher chose to supply the author names in this way.
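The difficulty with untagged author strings can be illustrated with a naive splitter. This is only a sketch of why automatic processing is unreliable: the report does not describe any such program, and the function name and sample names are invented for illustration.

```python
import re


def split_authors(author_string: str) -> list:
    """Naively split a free-text author list on commas, semicolons, '&' and 'and'."""
    parts = re.split(r"\s*(?:,|;|\band\b|&)\s*", author_string)
    return [p for p in parts if p]


# Works when each name is written "Initials Surname" ...
print(split_authors("A. Smith, B. Jones and C. Brown"))

# ... but an inverted "Surname, Initials" style is cut in the wrong places,
# because the comma is both a name separator and a part of each name.
print(split_authors("Smith, A., Jones, B."))
```

Without knowing which convention a given string follows, a program cannot tell the two cases apart, which is consistent with the need for manual intervention described above.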
If SuperJournal were to make recommendations on header DTD design, following this DTD analysis, they would include:
4. Generic SuperJournal Header DTD
The Generic SuperJournal Header DTD encompasses most of the SuperJournal publishers' DTDs. All the supplied header SGML which conforms to the DTDs included should parse against this DTD. This DTD was written to form part of the SuperJournal Data Conversion Process [SJMC140]. Parsing against this generic DTD validates the supplied SGML and produces a data format suitable for entry to the SuperJournal header SGML generation program. Those DTDs which are included in this generic header DTD are indicated on the "SuperJournal Header Data Mapping" spreadsheet in [SJMC121]. Also indicated is whether this generic parsing is in fact the data conversion route used. For journals whose SGML is not processed by this route, conversion to SuperJournal SGML occurs first, the SuperJournal Header DTD being included in the generic header DTD, originally for completeness.
Defining the generic DTD was not inherently difficult, but it was a lengthy task because of the large number of different possibilities catered for. It was simply a matter of analysing the content of each DTD, and then reflecting its tag names and structure within the generic DTD. The only slight problem occurred where the same name was used by two DTDs for two completely different constructs, e.g. <issue> is used in the Majour DTD to introduce a structure of journal issue information, whereas in another DTD <issue> is simply #PCDATA containing the issue number. Majour and SSSH DTDs were included in the generic DTD, even though initially no journal data was supplied in these specific formats.
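A name clash of this kind can be found mechanically by comparing element declarations across DTDs. The Python below is a minimal sketch under stated assumptions: the two DTD fragments are invented stand-ins (not the real Majour declarations), the function names are hypothetical, and the regular expression copes only with simple element declarations, without parameter entities or name groups.

```python
import re
from collections import defaultdict

# Matches simple declarations such as: <!ELEMENT issue - - (#PCDATA)>
# The optional "- -" / "- O" pair is the SGML tag minimisation field.
ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+(\w+)\s+(?:[-oO]\s+[-oO]\s+)?(.+?)\s*>", re.S
)


def content_models(dtd_text: str) -> dict:
    """Return {element name: content model} for one DTD."""
    return {name.lower(): " ".join(model.split())
            for name, model in ELEMENT_DECL.findall(dtd_text)}


def find_clashes(dtds: dict) -> dict:
    """Return elements declared with different content models in different DTDs."""
    seen = defaultdict(dict)
    for dtd_name, text in dtds.items():
        for element, model in content_models(text).items():
            seen[element][dtd_name] = model
    return {el: by_dtd for el, by_dtd in seen.items()
            if len(set(by_dtd.values())) > 1}


# Hypothetical fragments, for illustration only.
dtds = {
    "structured": "<!ELEMENT issue - - (volume, number, date)>",
    "flat": "<!ELEMENT issue - - (#PCDATA)>",
}
print(find_clashes(dtds))
```

Each clash reported still has to be resolved by hand when the declarations are merged into a single generic DTD.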
Because of the size of the resultant Generic SuperJournal Header DTD, it is not easy to maintain or change. This is one reason why DTDs supplied later in the project were not included, the other reason being the change in the data conversion route for later supplied and full text SGML journals when OmniMark became available (see [SJMC140]).
The Generic SuperJournal Header DTD is listed in [SJMC121].
5. Generic Full Article DTD
SuperJournal did not attempt to develop a generic full article DTD. It was not required for the full article SGML data conversion route, which uses a separate OmniMark data conversion for each publisher's DTD. Although there is much overlap between the content of the various publishers' full article DTDs, which means that a generic full article DTD would be possible, it would have produced a much larger and more unwieldy DTD than the generic header DTD.
6. SuperJournal Header DTD
The SuperJournal Header DTD was developed to provide a common format for article header data capture by the SuperJournal application. It was not intended to be an exemplary serials header DTD, although some publishers chose to supply their article headers to SuperJournal in this format.
From the header DTD analysis, the significant elements and structures of article headers were identified. In addition, the data content requirements of the SuperJournal application, in order to be able to display the journals' tables of contents and article headers, were considered. Using these elements, the SuperJournal Header DTD was defined. In general, SGML elements which were specific to a particular publisher's production process were not included. The exceptions were some publishers' identifiers and types, which are captured by the SuperJournal application but not displayed to the end user.
The elements included in the SuperJournal Header DTD are shown on the "SuperJournal Header Data Mapping" spreadsheet in [SJMC121]. Which elements are displayed by the various parts of the application are shown in the "SuperJournal Header Display" spreadsheet in [SJMC121]. The SuperJournal Header DTD is listed in [SJMC121].
Particular details to be noted about the SuperJournal Header DTD are:
7. SuperJournal Full Article DTD
The SuperJournal Full Article DTD was developed to provide a common format for subsequent full text SGML processing within SuperJournal. It was not intended to be an exemplary full article DTD.
From the full article DTD analysis, the significant elements and structures of articles were identified. Using these elements, the SuperJournal Full Article DTD was defined. As with the SuperJournal Header DTD, SGML elements which were specific to a particular publisher's production process were not included. The header part of the SuperJournal Full Article DTD is generally identical to the SuperJournal Header DTD, though there are one or two differences introduced to assist the data conversion process. Where appropriate, SGML element names within the Full Article DTD correspond to the equivalent HTML names, to ease subsequent processing.
The elements included in the SuperJournal Full Article DTD are shown on the "SuperJournal Full Article Data Mapping" spreadsheet in [SJMC122]. The SuperJournal Full Article DTD is listed in [SJMC121].
Particular details to be noted about the SuperJournal Full Article DTD:
Where an article's bibliographic references are marked up using "fine-grain" SGML tagging, processing of these references to increase end-user functionality becomes possible. If references are tagged in a "coarse-grain" way, as just a text string, it is more difficult to parse this text by an automatic program to identify its constituent elements. The bibliographic processing of article references within the SuperJournal data conversion process is described in [SJMC140].
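The contrast between the two styles of tagging can be sketched in Python. Everything here is an assumption made for illustration: the tag names are not those of the SuperJournal Full Article DTD, the sample reference is invented, and the text pattern is just one of many layouts a real program would have to recognise.

```python
import re

# Fine-grain: each part is tagged, so extraction is a simple lookup.
# Tag names are hypothetical, not those of any SuperJournal DTD.
fine = ("<ref><au>Smith A</au><yr>1997</yr><jt>Example J.</jt>"
        "<vol>53</vol><pg>12</pg></ref>")


def fields_from_tags(ref: str) -> dict:
    """Collect leaf elements of a tagged reference as {tag: text}."""
    return dict(re.findall(r"<(\w+)>([^<]*)</\1>", ref))


# Coarse-grain: only a text string, so the program must guess the layout.
coarse = "Smith A (1997) Example J. 53, 12"

TEXT_PATTERN = re.compile(
    r"(?P<au>.+?) \((?P<yr>\d{4})\) (?P<jt>.+?) (?P<vol>\d+), (?P<pg>\d+)$"
)


def fields_from_text(ref: str) -> dict:
    """Return the fields if the string fits the one assumed layout, else {}."""
    m = TEXT_PATTERN.match(ref)
    return m.groupdict() if m else {}


print(fields_from_tags(fine))
print(fields_from_text(coarse))
# A different but equally common punctuation style is not recognised at all.
print(fields_from_text("Smith A. Example J. 1997; 53: 12"))
```

The tagged form yields its fields regardless of how the publisher punctuates references, whereas the text form succeeds only for layouts the program has been written to anticipate.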
Most of the full article DTDs analysed by SuperJournal included some fine-grain reference tagging. In many cases this tagging is optional and is not always reflected in the supplied SGML data, though this situation improved during the time of the project.
From the DTD analysis, the significant elements of a bibliographic reference were identified, and composed into a bibliography section within the SuperJournal Full Article DTD definition. A "text string" option is included to allow for cases where mapping of references in supplied article data is not possible, but this is expected to be a "fall-back" option rather than the norm. The ordering of the bibliographic reference elements is not defined, and they are all optional and repeatable. Generally the element names are different from the names of other similar elements in the DTD to distinguish them as bibliographic reference elements both for readability and to aid processing. They consist of:
8. SuperJournal Table of Contents DTDs
Further DTDs have been defined by SuperJournal. These DTDs specify tables of contents files at each level of the "SuperJournal/journal/issue" hierarchy. There is also a DTD which specifies an article's "Cited By" list. Details of the content of these DTDs are given below, and the DTDs themselves are listed in Appendix 12. A description of the creation of, and possible use for, the SGML files corresponding to these DTDs is given in [SJMC140].
There may be an excess of information in the tables of contents SGML conformant to these DTDs, but it was included partly for completeness and partly for possible future use. In addition to the particular elements required for capturing tables of contents, these DTDs contain a <head> element, with sub-elements <title> and <meta>, to allow the inclusion of Dublin Core metadata within the SGML files (described in [SJMC141]).
It was decided to maintain this information in SGML format, and thus to define DTDs, because it is derived from information already held in SGML. SGML seemed the obvious format for capturing these tables of contents in a rigorous way, and it would simplify subsequent use of these files, especially if processing programs were written in OmniMark or the files were displayed via XML.
There is an additional set of DTDs which are used solely for the data conversion processing which generates the tables of contents SGML files. These DTDs include a new encompassing root element so that they allow multiple instances of: articles (sjarts.dtd); issues (sjisslist.dtd); journals (sjjnllist.dtd); "cited by" lists (sjcitbylist.dtd).
The Journal Issue Table of Contents DTD (sjtoc.dtd) captures the following information about a journal issue:
The Journal Issue List DTD (sjissue.dtd) captures the following information about a journal's available issues within SuperJournal:
The SuperJournal Contents List DTD (sjjnls.dtd) captures the following information about the journals and current issues available in SuperJournal:
The "Cited By" DTD (sjcitby.dtd) captures the following information about articles in SuperJournal which cite a particular article:
9. BRS Data Load DTD
A DTD was defined which specifies the format of the data load file for the BRS search engine, which underlies the NetAnswer search tool within the SuperJournal application. This was defined in order to introduce special HTML displayable characters into the article metadata captured by BRS via SGML parsing (see below). This BRS Data Load DTD (sjbrs.dtd) defines the simple specification of the BRS data load file.
10. SGML Character Entity Sets
During SGML parsing, substitutions are made for SGML character entities as defined in the specified entity files which are declared in the DTD. Because different substitutions are required in different situations, it was found necessary to maintain several sets of entity files and hence several copies of the DTDs, each declaring the relevant entity set. More information about the display of special characters using SGML character entities is given in [SJMC140].
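The effect of parsing the same data against different entity sets can be sketched as a table-driven substitution. This Python is illustrative only: the set names "display" and "search", the replacement values, and the function name are assumptions, and the actual contents of the SuperJournal entity files are not reproduced here.

```python
import re

# Hypothetical entity sets; each maps an entity name to its replacement text.
ENTITY_SETS = {
    "display": {"eacute": "&#233;", "alpha": "&#945;"},  # numeric references for HTML
    "search": {"eacute": "e", "alpha": "alpha"},         # plain ASCII for indexing
}


def substitute(text: str, entity_set: str) -> str:
    """Replace &name; references using the chosen entity set."""
    table = ENTITY_SETS[entity_set]
    # Entities missing from the table are left untouched rather than guessed at.
    return re.sub(r"&(\w+);",
                  lambda m: table.get(m.group(1), m.group(0)),
                  text)


sample = "caf&eacute; &alpha;-decay &unknown;"
print(substitute(sample, "display"))
print(substitute(sample, "search"))
```

In the SGML setting the table is chosen by the entity declarations in the DTD rather than by a function argument, which is why each required substitution led to a further copy of the DTD.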
There are two entity sets for SuperJournal SGML header generation:
There are three entity sets used for SuperJournal header and article display:
11. Problems and Issues
The SuperJournal Header DTD was defined primarily to provide a common format for article metadata capture by the SuperJournal application, and for article header display. Similarly the SuperJournal Full Article DTD was defined to furnish a common format for full text SGML article manipulation. They were born out of necessity because of the multiple DTDs used for the data supplied to SuperJournal. They were not intended to be exemplary header and full article DTDs, nor were they envisaged as competition for emerging standards such as SSSH for serial headers or ISO-12083 for full articles. However both of these DTDs are adequate for serial header and article SGML definition and some interest has been shown in their use. The SuperJournal Header DTD was used for data supply by some SuperJournal publishers either temporarily or for the duration of the project. The bibliographic reference section of the SuperJournal Full Article DTD has been utilised by another SuperJournal publisher. The Full Article DTD was supplied to a European University Library publisher who expressed an interest in using it.
Because of inexperience in working with SGML or in publishing early in the project, mistakes were probably made in the DTD definition. With hindsight one or two elements might have been defined differently (e.g. the capture of keywords noted in Section 6), or would not have been included (e.g. external identifiers like "snoopy"). But the DTDs have proved satisfactory for their use in defining the data format for the SuperJournal data conversion process and the SuperJournal application data capture.
With hindsight, the original design of the SuperJournal data handling process, using a generic Header DTD and a generic transformation program to the SuperJournal Header DTD, was rather ambitious, though it appeared a good theoretical strategy. In practice, the generic Header DTD became unwieldy and increasingly difficult to maintain and change to incorporate new DTDs. For this reason, not all publishers' DTDs which specify data supplied to SuperJournal were included in the generic Header DTD. For a similar reason, a full article generic DTD was not defined.
The SuperJournal DTDs have continuously evolved during the project. Version change history has been maintained in comments at the head of the DTDs. During the project the SuperJournal Header and Generic Header DTDs were baselined twice to version 2.0 and the SuperJournal Full Article DTD was baselined once to version 1.0 (now version 1.1) for production of the reports [SJMC121, SJMC122]. Changes to the DTDs became necessary when:
The DTD analysis performed before definition of the SuperJournal Header DTD indicated that there was a minimum requirement for article metadata to self-identify an SGML article header file (see Section 3). Some of the analysed DTDs did not fulfil this minimum requirement, or allowed some of the required items to be optional.
Implementation issues necessitated the use of several different SGML entity sets, which in turn meant that several copies of the same DTD (both Header and Full Article) were required. This causes maintenance problems, because any change to a DTD must be propagated across all the copies.
Last modified: July 07, 1999