SuperJournal Production Process


Ann Apps, Manchester Computing, University of Manchester

SuperJournal Technical Report SJMC130

Contents:
1.  Overall Approach
2.  Journal Data Transfer
3.  Journal Data Receipt
4.  Verification
5.  Archive
6.  Journal Data Pre-Processing
7.  Data Conversion
8.  Work File Tidy
9.  Application Data Load
10. Index New Data
11. Error Logging
12. Tables of Contents
13. New Journals
14. Background Tasks
15. Technical Appendices

1. Overall Approach

This document details the actions required to process the journal data received from the publishers, from receipt through to loading into the SuperJournal application. It is a "user guide" defining the actions to perform during the Production process, rather than a description of the Data Handling and Conversion process, which may be found in [SJMC140].

Within this document the journal names, and hence some conversion program names, have been masked. A "Manchester Computing internal" version of this document also exists [SJMC131] which uses real journal and program names, thus providing the "real" Data Handling User Guide used for day-to-day SuperJournal production.

For a diagrammatic representation of the Production Process see Appendix 15.1.

1.1 Data Handling Environment

All data handling operations should be run on the CS6400 as the user supjinfo, except where stated otherwise, so that all file ownership is consistent. All executable programs are accessible from the "PATH" of supjinfo. The SuperJournal journals directory structure is defined in Appendix 15.4.

1.2 SuperJournal Data Spreadsheet

A SuperJournal Data Spreadsheet is maintained which records the data handling actions performed from receipt to data load. Details of this spreadsheet are given in Appendix 15.2. Events which should be recorded are indicated throughout this User Guide.

2.  Journal Data Transfer

Journal data is generally transferred from the Publisher to SuperJournal by FTP. Data consists of:

The actual format of supplied data may be:

Details of how particular journals are supplied are given in Appendix 15.3.

Publishers should send an email notification of journal data transfer, but a check for data arrival should be made on the FTP site regularly by:

du /superj1/Publishers | more
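
As a complement to the du listing, the following is a minimal sketch that lists only recently modified files, which can make new arrivals easier to spot. The function name recent_arrivals and the default of one day are assumptions for illustration, not part of the SuperJournal scripts.

```shell
# Hypothetical helper: list files under a directory that were modified
# within the last <days> days (default 1), to spot newly transferred data.
# usage: recent_arrivals <directory> [days]
recent_arrivals() {
    find "$1" -type f -mtime "-${2:-1}" -print
}

# Example (not run here):
#   recent_arrivals /superj1/Publishers 2 | more
```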

3.  Journal Data Receipt

On receipt of a new journal issue a new directory should be created within the SuperJournal standard journals directory structure, with a symbolic link from the main journals directory where necessary. The SuperJournal journals directory naming convention is defined in Appendix 15.4. Thus the directory will be:

/superj<n>/Journals/<Publisher>/<JournalIdentifier>/V<Volume>I<Issue>

Copy the data to this new issue directory. Unzip and untar where necessary.
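
The receipt steps above can be sketched as a pair of shell functions. This is an illustration under stated assumptions rather than the project's own procedure: the function names issue_dir_path and unpack_supplied are hypothetical, the base journals directory is passed in as an argument, and only tar, gzipped tar and zip archives are recognised (by file name).

```shell
# Hypothetical helpers; names and arguments are illustrative assumptions.

# Print the issue directory path below a given base journals directory.
# usage: issue_dir_path <base> <publisher> <journal-id> <volume> <issue>
issue_dir_path() {
    printf '%s/%s/%s/V%sI%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Unpack one supplied archive into the issue directory, choosing the
# tool from the file name. Files with other names are copied unchanged.
# usage: unpack_supplied <archive> <issue-dir>
unpack_supplied() {
    archive=$1
    dest=$2
    mkdir -p "$dest" || return 1
    case "$archive" in
        *.tar.gz|*.tgz) gzip -dc "$archive" | (cd "$dest" && tar xf -) ;;
        *.tar)          (cd "$dest" && tar xf -) < "$archive" ;;
        *.zip)          unzip -q "$archive" -d "$dest" ;;
        *)              cp "$archive" "$dest"/ ;;
    esac
}
```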

It is sometimes necessary to flatten publisher supplied directory structures. All supplied files should be in the main issue directory, except for any graphics files which should be in a single sub-directory, which will eventually be called graphics. Exceptions to this are:

3.1 Extraneous Files

Copy any extraneous files which may possibly be required in the future to another directory. Delete any unwanted extraneous files, e.g. blank PDF pages; files in unnecessary formats. Details of dealing with extraneous files are given in Appendix 15.3.4.

3.2 Unpack SGML Headers

For journals where the SGML headers are supplied packed into one file, this file must be unpacked.

4.  Verification

Problems with the data transfer, and requests for resupply, should be sent to the publisher at this stage, although problems with the data itself may only emerge later.

If the data has arrived and unzipped OK:

5.  Archive

5.1 Long Term Archive

Data which has been supplied tarred and gzipped, or zipped, is archived using the Legato Networker software attached to the CS6400.

5.2 Temporary Archive

To protect against any possible destructive errors in the data conversion process, copy the files within the new issue directory to a temporary directory:

and if there are graphics files:

[Note that more detailed temporary archiving may be needed for journals with more directory structure.]

These files should be removed when the data conversion is complete.
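
A minimal sketch of such a temporary copy follows. It is not the command used in production: the function name temp_archive is hypothetical, the temporary directory is whatever the data handler chooses, and the graphics sub-directory name is an optional argument because the supplied name may differ.

```shell
# Hypothetical helper: copy the top-level files of an issue directory,
# and its graphics sub-directory if present, to a temporary directory.
# The graphics sub-directory name is an optional third argument
# (default "graphics").
# usage: temp_archive <issue-dir> <temp-dir> [graphics-subdir]
temp_archive() {
    src=$1
    dst=$2
    gfx=${3:-graphics}
    mkdir -p "$dst" || return 1
    for f in "$src"/*; do
        # Plain files only; the graphics sub-directory is copied below.
        if [ -f "$f" ]; then
            cp -p "$f" "$dst"/ || return 1
        fi
    done
    if [ -d "$src/$gfx" ]; then
        cp -pR "$src/$gfx" "$dst"/ || return 1
    fi
}
```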

5.3 SuperJournal FTP Site Tidy

The newly supplied data should now be removed from the SuperJournal FTP site.

6.  Journal Data Pre-Processing

6.1 PDF Articles

6.1.1 PDF File Naming

If the file extension of the supplied PDF is not ".pdf", then this must be corrected by:

where <extn> is the supplied file extension, typically ".PDF".

If the SGML and PDF file names are inconsistent, but the difference is only in case:

to make all characters in filenames lower case.
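
The two renaming steps above can be sketched in portable shell as follows. The function names fix_pdf_extension and lowercase_names are hypothetical stand-ins, not the SuperJournal programs themselves; both are assumed to be run inside the issue directory on a case-sensitive file system.

```shell
# Hypothetical helpers, run inside the issue directory.

# Rename every file ending in <extn> so that it ends in ".pdf".
# usage: fix_pdf_extension <extn>      e.g. fix_pdf_extension .PDF
fix_pdf_extension() {
    for f in *"$1"; do
        [ -f "$f" ] || continue
        new="${f%"$1"}.pdf"
        if [ "$f" != "$new" ]; then
            mv "$f" "$new" || return 1
        fi
    done
}

# Rename every plain file in the current directory to a lower-case name,
# skipping any rename that would overwrite an existing file.
lowercase_names() {
    for f in *; do
        [ -f "$f" ] || continue
        lower=$(printf '%s' "$f" | tr '[:upper:]' '[:lower:]')
        if [ "$f" != "$lower" ] && [ ! -e "$lower" ]; then
            mv "$f" "$lower" || return 1
        fi
    done
}
```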

Any further file renaming necessary should be performed by the SGML pre-processing scripts detailed below, but occasionally manual intervention is necessary to make the PDF and SGML file names consistent.

6.1.2 PDF Security

Indexing by the RetrievalWare search engine requires that the PDF articles have no security settings at all, i.e. no password and no restrictions. PDF articles which are known to have security settings, i.e. MGP5 and maybe PS1 (both of which use passwords), should be processed using Acrobat Exchange to remove the security settings. To use Acrobat Exchange on the CS6400 as user supjinfo type:

Any further PDF articles with security settings will be discovered during the Main Header Conversion phase, and should then be corrected.

6.1.3 PDF Generation from PostScript

Journals CCS5 and PS18 are supplied as PostScript, generally one file per page of the journal issue, which must be distilled into PDF articles.

6.2 SGML Pre-Processing

6.2.1 Control Characters

If the supplied SGML files contain control characters, in particular Control-M (carriage return) characters at the end of lines, remove these characters by:

where <extn> is the file extension for the SGML files. Note that for journals MGP13 and MC4 this program must be run on both the SGML headers and the full text SGML.
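
For illustration only, the effect on Control-M characters can be reproduced with the standard tr utility. The function name strip_control_m is hypothetical and is not the SuperJournal program; this sketch removes carriage returns only and leaves any other control characters untouched.

```shell
# Hypothetical helper: delete Control-M (carriage return) characters from
# every file with the given extension in the current directory. Other
# control characters are left alone in this sketch.
# usage: strip_control_m <extn>      e.g. strip_control_m .sgm
strip_control_m() {
    for f in *"$1"; do
        [ -f "$f" ] || continue
        # Write to a temporary file first, then replace the original.
        tr -d '\r' < "$f" > "$f.tmp.$$" || return 1
        mv "$f.tmp.$$" "$f" || return 1
    done
}
```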

6.2.2  Journal Specific Pre-Processing

Journal specific pre-processing may rename both the SGML and PDF files, as well as modify the SGML. The pre-processing required for each journal is as follows. Note that the file extension shown for the SGML files is the most commonly used one, but actual supplied data may differ.

6.2.3 Error Checking

After running the relevant journal pre-processing script, check the named log file. Any errors should be corrected in the base files and the pre-processing rerun before progressing to the main data conversion. At this point a check should be made that the numbers of SGML and PDF files are still the same as supplied. A difference in numbers caused by file renaming may indicate:

In the first two cases, the base SGML data should be corrected and the pre-processing rerun. In the last case, this must be reported to the publisher and data conversion for this issue put on hold until a correct article is resupplied.
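
The file count check described above can be sketched by recording the counts before pre-processing and comparing them afterwards. The function names count_ext and snapshot_counts are hypothetical, and the PDF extension is assumed to be ".pdf" as required by section 6.1.1.

```shell
# Hypothetical helpers for comparing file counts before and after
# pre-processing.

# Print the number of plain files in <dir> whose names end in <extn>.
# usage: count_ext <dir> <extn>
count_ext() {
    n=0
    for f in "$1"/*"$2"; do
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Print "<sgml-count> <pdf-count>" for an issue directory.
# usage: snapshot_counts <dir> <sgml-extn>
snapshot_counts() {
    echo "$(count_ext "$1" "$2") $(count_ext "$1" .pdf)"
}

# Example (not run here):
#   before=$(snapshot_counts . .sgm)
#   ... run the journal pre-processing script ...
#   after=$(snapshot_counts . .sgm)
#   [ "$before" = "$after" ] || echo "file counts changed: $before -> $after"
```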

7.  Data Conversion

7.1 Full Article SGML Conversion

7.1.1 Journal Specific Conversion

Journals supplied as full text SGML require the following journal specific conversion:

7.1.2 General Full Article SGML Conversion

For all the above journals:

7.1.3 Figure Problems

Figure problems are generally:

7.1.3.1 ImageMagick

ImageMagick is generally a useful tool for investigating figures. A figure may be displayed by:

7.2 PDF Article Reference Extraction

Journals for which PDF reference extraction programs exist require the following conversion:

7.3 Main Header Conversion

7.3.1 Validation

7.3.2 Header Conversion

7.4 Reference Linking

7.5 HTML Generation

HTML articles and MiniContents are generated for journals supplied as Full SGML, i.e. MGP1; MGP2; MGP3; MGP7; MGP13; MC1; MC4:

7.6 Medline Article Links

For Molecular Genetics and Proteins journals, Medline identifiers are determined for the articles themselves:

8.  Work File Tidy

When the data conversion for a new issue is complete:

9.  Application Data Load

10.  Index New Data

The SuperJournal application includes three search engines, requiring the new journal issue data to be indexed by each of them.

10.1 Isite

The new issue article headers are indexed by Isite during the SuperJournal application data load. No further action is required, except an occasional separate re-index when the number of Isite databases becomes too large.

10.2 RetrievalWare

The data is indexed, as both article headers and full articles, by RetrievalWare during the SuperJournal application data load. No further action is required for the article header index. Following the data load it is necessary to check that the full article index was performed correctly by:

10.3 NetAnswer

The new issue article headers are indexed by BRS, the search engine which underlies NetAnswer, as a separate operation during the data handling process. This action must be performed under a user who has BRS load privileges. The data to be indexed is in BRS format in the *.brs files in the issue directory which are created by the main header conversion program. Indexing is run within the new issue directory.

10.3.1 BRS Data Modification

If it becomes necessary to modify the data indexed by BRS:

11.  Error Logging

All the SuperJournal data conversions write error and diagnostic information to log files. These log files should be checked between each separate stage of the data conversion process. Any indicated errors should be corrected and the program re-run before continuing to the next stage of the process. The majority of these errors are caused by incorrect base data, but some may be program bugs which require correction. Error checking has been indicated in the above sections only where errors are likely to occur, but every log file should be checked. Each data conversion program, when run, tells the data handler the name of the relevant log file.

The log files are left in the journal issue directory after the data handling process is complete. The existence of a particular log file indicates that this part of the process was performed.
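
A quick way to triage the log files in an issue directory is sketched below. The log file naming and format are not specified in this document, so the sketch assumes, purely for illustration, that logs are named *.log and that error lines contain the word "error" in any letter case. The function name logs_with_errors is hypothetical.

```shell
# Hypothetical helper: list log files in a directory that mention errors.
# Assumes (not from the source) that logs are named *.log and that error
# lines contain the word "error" in any letter case.
# usage: logs_with_errors <issue-dir>
logs_with_errors() {
    for log in "$1"/*.log; do
        [ -f "$log" ] || continue
        if grep -qi 'error' "$log"; then
            echo "$log"
        fi
    done
}
```

Note that a line such as "0 errors" would also match, so every reported file, and ideally every log file, should still be read by the data handler.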

12.  Tables of Contents

Tables of Contents are created or updated within the SuperJournal/Journal/Issue hierarchy during the data handling process. These files are created or updated automatically during the Main Header Conversion phase. No manual intervention should be required, except where a journal cluster is moved onto a new disk.

12.1 SuperJournal Contents

The top-level Table of Contents file is: /superj1/Journals/SuperJournal.toc. This file lists the journals available within SuperJournal by cluster, and indicates the latest issue. A new "dummy" entry must be made manually in this file if a new journal is added to SuperJournal (see below).

12.2 Journal Contents

The journal contents file is: /superj<n>/Journals/<Publisher>/<JournalIdentifier>/<JournalIdentifier>.toc where <n> is the number of the disk containing the data for the particular journal. It lists the issues available for this journal, latest issue first. There is a symbolic link to this file from /superj1/Journals/<Publisher>/<JournalIdentifier>/<JournalIdentifier>.toc.

If the disk where data for a journal is held is changed (e.g. Molecular Genetics and Proteins data is held on /superj5 from January 1998, but on /superj2 previously):

12.3 Issue Contents

The issue contents file is: /superj<n>/Journals/<Publisher>/<Jid>/V<Volume>I<Issue>/<Jid>V<Volume>I<Issue>.toc where <Jid> is the journal identifier. It lists the articles within the issue and is created automatically.
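
The journal and issue contents file names follow directly from the patterns given in 12.2 and 12.3, and can be built mechanically. The sketch below uses hypothetical function names (journal_toc_path, issue_toc_path), and the publisher directory "PUB" in the example is a placeholder.

```shell
# Hypothetical helpers that build Table of Contents paths from the
# patterns given in sections 12.2 and 12.3. <n> is the disk number.

# usage: journal_toc_path <n> <publisher> <jid>
journal_toc_path() {
    printf '/superj%s/Journals/%s/%s/%s.toc\n' "$1" "$2" "$3" "$3"
}

# usage: issue_toc_path <n> <publisher> <jid> <volume> <issue>
issue_toc_path() {
    printf '/superj%s/Journals/%s/%s/V%sI%s/%sV%sI%s.toc\n' \
        "$1" "$2" "$3" "$4" "$5" "$3" "$4" "$5"
}

# Example (not run here): link the journal contents file held on disk 5
# from its standard location on /superj1.
#   ln -s "$(journal_toc_path 5 PUB MGP1)" "$(journal_toc_path 1 PUB MGP1)"
```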

13.  New Journals

If a new journal is added to SuperJournal:

<sjjnl>
<jnlinfo>
<jid>JournalIdentifier</jid>
<pubdir>PublisherDirectory</pubdir>
<jtl>JournalTitle</jtl>
</jnlinfo>
</sjjnl>

14.  Background Tasks

14.1 Medline Article Links

When a new Molecular Genetics and Proteins journal issue is processed and loaded into SuperJournal, the Medline identifiers for the articles are often not yet available. It is necessary to ascertain these Medline identifiers and add them to the SuperJournal SGML article headers at a later time. The best strategy is to process a batch of these unrecorded Medline identifiers at regular intervals. A note is made on the SuperJournal Data Spreadsheet for journal issues whose article Medline identifiers are not yet known. For some journal issues, several attempts are necessary before Medline has become aware of them.

14.2 Back References

During the generation of intra-SuperJournal links a list of unresolved "back references" is automatically created. This is in a directory:

For each journal where references could not be found there is a file whose name is the SuperJournal Journal Identifier. There is a two line entry in this file for each unresolved reference to the particular journal:

At regular intervals, this list of "back references" should be inspected and the references resolved where possible, the determined references being removed from the list. This list of "back references" was introduced to cover a potential problem of late-loading of some journals. In reality, most unresolved references are caused by invalid references, usually incorrect year, volume or page.

15.  Technical Appendices


This web site is maintained by epub@manchester.ac.uk
Last modified: July 07, 1999