Ross MacIntyre, Manchester Computing, University of Manchester
SuperJournal Technical Report SJMC150
1. Purpose of the Report
3. Data Transfer
4. Data Contents
5. File Naming
6. File Resolution
The principal aim of the SuperJournal Data Handling process was to set up a scaleable process to allow the loading of large amounts of data into the application. Wherever possible, automatic scripts were to be created to process and load the data.
The points where a process is not scaleable include those where manual intervention is required. So scalability issues very often become issues of manual intervention. During the project problems were sometimes overcome by writing specific scripts, to reduce the level of effort required, but this still meant it was necessary to pre-process each publisher's data with a different script before running it through the main data conversion process.
This report contains various recommendations made in the light of experience with the participating publishers. All fall within the realm of "common sense" and are consistent with the well-established data processing concept of "self-identifying data". It should be recognised that the files received by SuperJournal were not necessarily produced for this purpose; they exist to get the journals into print.
The "top tips" can be summarised as follows:
The table below gives an indication of the volumes of data processed, on a part-time basis, by one member of staff at Manchester Computing:
* In May 98 there was an on-going load of back issues for one of the subject clusters.
Recommendation: Files should be transferred via (binary) FTP.
During the project, the availability of FTP increased to the point where all publishers have the capability. The process is easiest if publishers FTP data to SuperJournal; however, publishers' data is increasingly being made available on FTP sites from which authorised intermediaries "pull" the data. If data is to be transferred in this fashion, the publisher should arrange to send an email alert when data is made available. This should form part of the routine procedure, not be done as an afterthought.
When creating the unit of delivery, the files to be transferred should ideally be packed together using the utilities tar and gzip.
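The packing step can be sketched with the standard Python tarfile module; the JNL1_1 names are the hypothetical journal and issue used in the example below, not a real publisher's convention:

```python
import os
import tarfile

def pack_issue(issue_dir, out_path):
    """Pack an issue directory (e.g. JNL1_1/) into a gzipped tar file."""
    with tarfile.open(out_path, "w:gz") as tar:
        # Store paths relative to the parent directory, so the archive
        # untars into a single top-level issue directory.
        tar.add(issue_dir, arcname=os.path.basename(issue_dir))
```

For example, `pack_issue("JNL1_1", "JNL1_1.tar.gz")` produces a single transfer unit that unzips and untars back into the issue directory.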
Alternative methods of electronic file transfer include ISDN, which is available to a number of publishers/typesetters, but is not supported at Manchester Computing, existing as a test facility only. Email is not recommended for the transfer of data. In particular, it is not recommended for PDF transfer because it may cause corruption, and because the file sizes involved may exceed some mail systems' limits.
Regarding physical transfer mechanisms, CD-ROM is the most suitable, as it is resilient, prevents corruption in transit and provides a physical archive of the original data received. Other media received included DAT, Optical, SyQuest, Zip, and 3.5" discs. Zip discs have been reliable and are reusable, but drives for reading them are not as ubiquitous as CD-ROM drives.
Manual intervention is inevitable during the process of data transfer to SuperJournal. It is necessary to move the data into a defined directory structure. Early in the Project, problems were encountered when a publisher sent more than one issue without any directory structure. This necessitated manual inspection of the content of the SGML header files to distinguish between articles in different issues.
The data files transferred should include a file containing a listing of the files that should be received. This is used to verify that all files intended have been successfully received.
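The verification step is straightforward to script. A sketch, assuming the listing file is named "filelist" as in the example below:

```python
import os

def verify_transfer(issue_dir, filelist_name="filelist"):
    """Compare the files actually received against the supplied listing.
    Returns (missing, unexpected) sets of issue-relative file names."""
    listed = set()
    with open(os.path.join(issue_dir, filelist_name)) as f:
        for line in f:
            name = line.strip()
            if name and name != filelist_name:
                listed.add(name)
    received = set()
    for root, _, files in os.walk(issue_dir):
        for name in files:
            if name != filelist_name:
                rel = os.path.relpath(os.path.join(root, name), issue_dir)
                received.add(rel)
    # Files listed but not received, and files received but not listed.
    return listed - received, received - listed
```

Any non-empty result means the transfer must be investigated before loading.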
Preferably the directory structure of the transferred data should be simple with the PDF and SGML files in one directory, and a single subdirectory for graphics files where these are required. There should be no tarred/zipped files included within the main tar/zip file. Also there should be no extraneous files.
An example of a good unit of transfer, for Volume 1 Issue 1 of a journal "JNL" would be a tarred/gzipped file called: JNL1_1.tar.gz. This would unzip into: JNL1_1.tar and untar into:
where 0101001...0101087 are the articles, graphics is a directory containing the gif files, and filelist is a text file listing the files sent.
The directory "graphics" would contain files for example:
where 0101001f1.gif is figure 1 of article 0101001, and 0101001f1th.gif is the thumbnail gif for that figure. Obviously, the figure references within the SGML file for article 0101001 would be consistent with these names.
In general, the project retains publishers' file names, but in some cases it has been necessary to rename files. We require files to sort under UNIX in page number (or Table of Contents) order within an issue, so that they are loaded into the application in the correct sequence. Where this is not the case, scripts have been written to rename files using volume number and start page number.
It is also necessary to rename files where a separate directory has been sent for each article with the actual article SGML file called "main.sgm". File names such as "s1.sgm", etc. do not appear very informative, but we have retained them except where they do not sort in page number order.
Where we have renamed files they are according to the convention where v=volume, i=issue, p=start page:
vviippp.sgml, e.g. 0904025.sgml for volume 9, issue 4, page 25
a0ppppp00.sgml, e.g. a00012300.sgml for page 123
In the second case, the volume and issue are deduced from the directory structure; the trailing zeroes are left spare for extra suffixes, such as graphics file numbers.
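The first convention above is easily scripted. A minimal sketch:

```python
def conventional_name(volume, issue, page):
    """Build a vviippp.sgml name: 2-digit volume, 2-digit issue,
    3-digit start page, e.g. volume 9, issue 4, page 25 -> 0904025.sgml.
    Such names sort under UNIX in page-number order within an issue."""
    return "%02d%02d%03d.sgml" % (volume, issue, page)
```

Because the zero-padded fields sort lexically, a plain `ls` of the issue directory yields the files in Table of Contents order.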
The type of file should be explicitly stated via an appropriate extension. File extensions have included: .txt, .new, .412, .396, .sgc, .ehd and .phd for SGML, and been completely absent for Postscript. (Internally, the project uses .sgml, .pdf, .ps, .gif, .jpg and .html.) As a UNIX platform is in use, all file extensions are converted to lower case.
Filename length has not generally been a problem. Having said that, on one occasion a publisher copied files to a 3.5" disc and all filenames were truncated to 8 characters, losing their significant characters and requiring manual inspection of all files and subsequent renaming.
Recommendation: One-to-one correspondence between headers and article files.
We would expect to be supplied with the same number of SGML headers and full articles. Where discrepancies occur these have to be investigated. (Note that as a UNIX system has been used, the correspondence should include the case of the file names.)
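A first-pass check along these lines flags the discrepancies for investigation; the .sgml/.pdf extensions here are an assumption for illustration, as extensions varied by publisher:

```python
import glob
import os

def check_correspondence(issue_dir):
    """Report headers without article files and article files without
    headers. The comparison is case-sensitive, as on the UNIX
    filesystem in use; assumes headers are *.sgml and articles *.pdf."""
    headers = {os.path.splitext(os.path.basename(p))[0]
               for p in glob.glob(os.path.join(issue_dir, "*.sgml"))}
    articles = {os.path.splitext(os.path.basename(p))[0]
                for p in glob.glob(os.path.join(issue_dir, "*.pdf"))}
    # Headers lacking an article, and articles lacking a header.
    return headers - articles, articles - headers
```

Either set being non-empty triggers the manual investigation described below.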
In general we have found that:
Where there are more headers than article files, one file usually contains several short articles, each of which has a header. In this case we have to create symbolic links to the composite article file corresponding to each header, or the file can be manually split.
Where there are more article files than headers, we have usually been sent extra material such as ToCs and end pages. But sometimes these article files contain material which should be in the issue, but for which publishers do not generally create headers, such as poems or reviews. We cannot load articles into SuperJournal without a header, so in some instances we have manually created SGML headers.
Sometimes there are an excessive number of article files with no headers which should really be just one article, e.g. "Book Reviews" should not be split into separate reviews if headers are not supplied. In this case we have to manually reassemble a Book Review article from the constituent parts. This is more complicated when pages are repeated in the separate files because each review does not start on a new page.
Some publishers' PDF files have security settings, to prevent the files being changed, printed, etc. Whether or not a password is used, any security setting causes the PDF to be stored in encrypted form. These files were not previously a problem, because Acrobat Reader can display them regardless.
Our full text indexing software, RetrievalWare, is unable to index either damaged or encrypted files. It is possible to repair damaged PDF files, and to remove security settings (provided we are supplied with any password), using Acrobat Exchange. But this is a manual operation: each file must be loaded into Acrobat Exchange and saved as a new file. Acrobat Exchange does not have any command line or batch options. (There is the option of developing such a facility in-house, having joined the Acrobat Developers Association.)
Generation of the PDF articles is outside our remit in Manchester, though we do have the facilities and have done so. In these cases we generate PDF from Postscript using Acrobat Distiller, but with no manual intervention or qualitative checking.
It is possible to automatically generate thumbnails during distillation from PostScript, which enhance the usability of the PDF files.
It is also recommended that the metadata (Title, Subject, Author, Keywords) is created within the PDF "Document Info", as this is accessible by certain search engines, improving the display of search hits.
To include multimedia links in a PDF file to be displayed on the Web, the link must be created as a web link. Relative file names are best used for these links, but for display with some versions or set-ups of Acrobat it is necessary to set the Base URL in the PDF document. We also need to set this Base URL so that we can log when users access the multimedia link. Setting the Base URL is a manual operation, which we have to do ourselves in Manchester, and it is error prone because a long URL must be typed in. Again there is no command line or batch option.
PDF files created using Acrobat Distiller version 3.0 are not always readable using Acrobat Reader of an earlier version, e.g. 2.1.
Early in the Project some PDF files sent individually were damaged after FTP transfer. This was apparently caused by the use of ASCII FTP transfer. All PDF files should be FTPd in binary mode. Do not rely on "automatic", as this can result in ASCII being selected for PDF. This has also occurred more recently where "overnight" automated procedures were in place at the publisher's.
As noted above, damaged PDF files can sometimes be repaired using Acrobat Exchange, but this is a manual operation.
No problems have been encountered when an issue's worth of files have been tarred/gzipped and FTPd in binary mode.
The main problem found with SGML is poor data quality. SGML requires strict adherence to its syntax. Errors show up when parsing the SGML against a DTD. These generally appear to be typing errors. In some instances full text SGML articles have several lines of the article missing.
<reference><author><name><surname>Caldwell</surname>, <fname>M.</fname></name></author> (<date>1995</date>).
<title>Trichology and the net.</title> <pubtitle>Am. J. Typo. Errors.</pubtitle> <volno>56</volno>, <pages>193–201</pages>.</reference>
Some SGML is littered with stray punctuation, particularly between author names, and within affiliations. This requires a pre-process script to remove the punctuation to enable parsing.
, <au><fnm><b>François </b></fnm><snm><b>Cooper<sup>3</sup></b></snm></au>
and <au><fnm><b>Gérard </b></fnm><snm><b>Roses<sup>1</sup></b></snm></au>
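A pre-process along these lines removes the stray separators; the `<au>` tagging follows the examples above, and the exact separators handled are an assumption:

```python
import re

def strip_stray_punctuation(sgml):
    """Remove stray separators (commas, 'and') that appear between
    author elements, outside any tag, so that the header will parse."""
    # Drop ", " or " and " immediately preceding an <au> open tag.
    return re.sub(r'(?:,\s*|\s+and\s+)(?=<au>)', ' ', sgml)
```

The lookahead keeps the `<au>` tag itself intact and only discards the text between author elements.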
Identifier and identifier reference attributes should match, or the SGML parser complains. Some data has a "null" identifier included where an identifier is not used or an author reference with no corresponding affiliation identifier. This is not really a good idea. Parsing will fail if there are two of these in one article. And they will show on the display, e.g. in the author names and affiliations, if we have not noticed them.
e.g. <author linkend="aun">
Newlines, or the lack of them, seem to be problematic. Some data has ^M characters, either instead of, or as well as, newlines. A script has been written to remove these because the data conversion process cannot cope with them. Some SGML headers have no newlines at all, which causes problems in some of the software used. We have scripts to insert them following end tags. We find newlines in strange places, such as in the middle of an SGML tag, or within a Journal Name. These may be caused by editing software with a fixed line length set - it is mindlessly inserting a newline after so many characters. It is difficult to process this data automatically and we have to resort to manual editing.
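The two mechanical cleanups described (stripping ^M characters, inserting newlines after end tags) can be sketched as follows; this is an illustration of the approach, not the project's actual script:

```python
import re

def normalise_newlines(sgml):
    """Remove carriage returns and ensure a newline after each end tag,
    so headers supplied as one long line become processable."""
    sgml = sgml.replace('\r', '')  # strip ^M characters
    # Insert a newline after any end tag not already followed by one.
    sgml = re.sub(r'(</[^>]+>)(?!\n)', r'\1\n', sgml)
    return sgml
```

Newlines in the middle of tags or journal names, as noted above, are not amenable to this kind of global fix and still need manual editing.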
The SGML parser cannot cope with characters outside the normal range. These should be entered as SGML character entities, e.g. \325 should be &rsquo;. Note in particular that a literal & should be entered as &amp;.
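Escaping such characters can be scripted. In this sketch the octal \325 to &amp;rsquo; mapping (a Macintosh-encoded right single quote) covers only the case mentioned; a production script would carry a fuller table:

```python
def escape_for_sgml(text):
    """Replace characters outside the normal range with SGML entities.
    Only two illustrative mappings are included here."""
    text = text.replace('&', '&amp;')       # must be done first
    text = text.replace('\325', '&rsquo;')  # octal 325: right single quote
    return text
```

Escaping `&` first matters: doing it afterwards would corrupt the entities just inserted.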
Fields with no "real" content are sometimes supplied with "dummy" values, presumably because the attribute is defined as mandatory in the DTD used.
Note that this will potentially be displayed to the end user. It seems likely that this type of error is symptomatic of increased use of default values in SGML-editor templates. We have also seen spurious affiliation identifiers present.
Other null content includes: <abs><p> </abs> and irrelevant fields, e.g. <bm>. This type of "noise" can be removed with data processing scripts.
e.g. <re>Received: 12 August 1996; revised: 9 December 1996; accepted: 10 December 1996</re>
better would be: <re>12 August 1996<rv>9 December 1996<acc>10 December 1996
better still would be the more well-formed:
<re>12 August 1996</re><rv>9 December 1996</rv><acc>10 December 1996</acc>
Our minimum requirement for data supplied in the headers is:
Although this information may be obvious to a person, it is not obvious to an automatic script. We cannot load an article that has no title.
The following data is preferred, but it is possible to generate them automatically:
The following describes actions taken to create the above fields in particular circumstances.
Some publishers do not supply Cover Dates, or they supply only the year. In these cases, we supply the cover date as an argument to a pre-processing script.
e.g. <art ... cd="96" ...> will be converted to <art ... cd="0596" ...>
There are many variations of date format in cover dates, including seasons, and there are many variations in supplied copyright strings. We have had to include data conversion code to accommodate them all, and to provide some degree of consistency in their display in the SuperJournal application. For example, cover dates have been supplied as: 19971001, January 1997, Winter 1996/97, 1.1.1997, 0197.
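The normalisation of these variants can be sketched as below. The internal mmyy form follows the cd="0596" example above; the season-to-month mapping is an assumption for illustration only:

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ['January', 'February', 'March', 'April', 'May', 'June', 'July',
     'August', 'September', 'October', 'November', 'December'])}
SEASONS = {'Spring': 4, 'Summer': 7, 'Autumn': 10, 'Winter': 1}  # assumed

def normalise_cover_date(cd):
    """Reduce the cover-date variants seen to an internal mmyy form."""
    m = re.match(r'(\d{4})(\d{2})\d{2}$', cd)          # 19971001
    if m:
        return '%s%s' % (m.group(2), m.group(1)[2:])
    m = re.match(r'([A-Za-z]+)\s+(\d{4})$', cd)        # January 1997
    if m and m.group(1) in MONTHS:
        return '%02d%s' % (MONTHS[m.group(1)], m.group(2)[2:])
    m = re.match(r'([A-Za-z]+)\s+\d{4}/(\d{2})$', cd)  # Winter 1996/97
    if m and m.group(1) in SEASONS:
        return '%02d%s' % (SEASONS[m.group(1)], m.group(2))
    m = re.match(r'\d{1,2}\.(\d{1,2})\.(\d{4})$', cd)  # 1.1.1997
    if m:
        return '%02d%s' % (int(m.group(1)), m.group(2)[2:])
    if re.match(r'\d{4}$', cd):                        # 0197: already mmyy
        return cd
    return None  # unrecognised: needs manual attention
```

Each new publisher's format adds a branch; anything unrecognised is flagged for manual handling rather than guessed.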
We found inconsistencies in what different publishers include within the copyright strings. Where they are absent or incomplete, we add them during the data conversion process. Some publishers do not include a year with the copyright, so we have supplied it for them (using the cover date, though there may be an issue here, especially around Dec/Jan). Where publishers do not include a copyright string and have not given us a rigorous definition for it, we copy the notice from the bound paper copy of the journal. Some publishers put the year at the end of the copyright string, whereas the remainder put it at the start; we have left this as the publisher typed it, because it matches the copyright statement in the bound paper copy of the journal.
To provide indexing of author names within the application, we expect the names to be tagged to indicate each author's surname. We have seen SGML that lists all the authors in one field, with no obvious way of splitting them automatically. We manually edit these author fields.
e.g. <au><snm>Benny Goodman and Ann Apps</snm></au>
But not all examples are as straightforward as that:
e.g. <aug><au><snm>SHANJU ZHANG YUBIN ZHENG ZHONGWEN WU MINGWEN TIAN DECAI YANG RYUTOKU YOSOMIYA</snm></au></aug>
Also, affiliations are sometimes run together, but it is usually possible to deduce where to split them.
e.g. <aff><sup>1</sup>INSERM U 249, CNRS UPR 9008,
<sup>2</sup>Service de Génétique, Hôpital Charles Nicolle, <cty>Tunis</cty>, <cny>Tunisia </cny>and
<sup>3</sup>Service Commun no. 7, INSERM, 75005 <cty>Paris</cty>, <cny>France</cny></aff>
These would be split into:
<aff oid="1">INSERM U 249, CNRS UPR 9008</aff>
<aff oid="2">Service de Génétique, Hôpital Charles Nicolle, <cty>Tunis</cty>, <cny>Tunisia</cny></aff>
<aff oid="3">Service Commun no. 7, INSERM, 75005 <cty>Paris</cty>, <cny>France</cny></aff>
The author names and article title in some SGML headers supplied early in the Project had no capital letters. These were changed to all capitals by the data conversion process.
e.g. <atl>cultural studies without guarantees; response to kuan-hsing chen</atl> <aug><au><fnm>ien<snm>ang</au>
<atl>CULTURAL STUDIES WITHOUT GUARANTEES; RESPONSE TO KUAN-HSING CHEN</atl> <aug><au><fnm>IEN<snm>ANG</au>
For one journal, all were supplied in capitals.
Where sub-headings appear inside a tagged field, such as keywords, abstract, or "issue" in a cover date, they are removed during the data conversion process, because they would look strange in the SuperJournal display, where the headings would be duplicated.
e.g. <cd>Issue 1: January 1996</cd>
<re>Revised: 9 July 1997</re>
No Affiliation References
Here the authors are grouped by address. The data conversion pre-process adds affiliation references within each author group.
Some journals are supplied as full article SGML. From these we generate HTML to display the article via an intermediate SGML conformant to a full article DTD defined for the project, FSJ-DTD. We also extract the header if it is not supplied separately. This data conversion code uses SGML-aware software, OmniMark.
If internal references, e.g. from a figure reference to the figure or from a citation to the bibliography, are tagged, the links can be generated in the HTML, making the article more user-friendly. Some publishers do not supply this tagging. It is difficult, particularly for the internal citations, to generate them with a global edit.
e.g. <bbr id="graves95"> and <figr id="fig1"> are preferable to (Graves 1995) and (Fig. 1).
Note: in general, the tagging of bibliographic references is the area most prone to typing errors and mis-tagging.
There are variations in the way bibliographic references are tagged. In some cases each bibliographic reference is just a text string. In other cases there is a fine-grain tagging from which details such as title, author names, journal can be ascertained. Where bibliographic references are tagged in this way, we add links to the HTML article to e.g. MEDLINE abstracts and to full text if already loaded in SuperJournal.
e.g. of a simple text reference:
<bb ID="B1">Morgan, J.E. (1994) Cell and gene therapy in Duchenne muscular dystrophy. <it>Hum. Gene Ther.,</it><b> 5,</b> 165–173.</bb>
It may be possible to parse this to extract the information.
e.g. of a supplied publisher's reference:
<reference><author><name><surname>Graves</surname>, <fname>J.</fname><mname>A.M.</mname></name></author>, <author><name><surname>Watson</surname>, <fname>J.</fname><mname>M.</mname></name></author> (<date>1991</date>).
<title>Mammalian sex chromosomes: evolution of organization and function.</title> <pubtitle>Chromosomian</pubtitle> <volno>101</volno>, <pages>63–68</pages>.</reference>
In FSJ-DTD SGML this would be:
where AuthorName is: surname<space>initials. The MEDLINE format for the above reference is:
In many cases, there is no indication in the SGML where a figure should appear. Some publishers put all the figures in a "bucket" at the end. Some put the figure at its first reference, which could be in the middle of a sentence. If figure positions are not defined, we display the figure at the end of the paragraph in which it is first referenced.
There are variations in the way tables are defined. Some are supplied as graphics, so their display is the same as for figures. Others are tagged in SGML, which necessitates their conversion to HTML. This conversion is non-trivial.
Tables should be marked up so that rows and cells contain the whole of their logical contents. This is not necessarily how the table appears on the printed page, where table cells may be wrapped across several lines.
We have come across some strange tagging of table footnotes and legends. They were tagged within the last row of the table, in a cell that spanned the entire table. The data conversion has to extract these and tag them correctly.
e.g. tagged as: <row><entry spanname="1to5" rowsep="1"><ftn><ftnote>Five animals were transgenic.</ftnote></ftn></entry></row>
and retagged as: <table> ... <fn id="a">Five animals were transgenic</fn> </table>
Generally the same footnote identifiers, e.g. "a", were used in each table, causing parsing errors where there is more than one table in an article. To overcome this we prepend the table number to the identifier.
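The identifier rewrite can be sketched as below; the `<fn id=...>` tagging follows the example above, and any references to the footnotes elsewhere in the table would need the same rewrite:

```python
import re

def qualify_footnote_ids(table_sgml, table_number):
    """Prepend the table number to footnote identifiers so that the
    customary per-table ids ("a", "b", ...) stay unique across all
    tables in one article."""
    return re.sub(r'(<fn id=")', r'\g<1>t%d' % table_number, table_sgml)
```

After this pass, footnote "a" in table 2 becomes "t2a", so an article's tables no longer clash when parsed together.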
Within SGML it is possible to include a large number of character entities. At present HTML browsers display very few of these correctly. In SuperJournal, character entities were where possible converted into keyboard characters. In other cases, particularly Greek letters, the word was initially surrounded by square brackets, e.g. α becomes [alpha]; latterly, the character has been replaced with an image for display.
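The bracketed-word conversion amounts to a lookup table; the entries here are an illustrative fragment, not the project's full table:

```python
# Partial entity-to-display table, for illustration only.
ENTITY_DISPLAY = {
    '&alpha;': '[alpha]',
    '&beta;': '[beta]',
    '&ndash;': '-',
}

def entities_to_display(text):
    """Replace SGML character entities with keyboard-character or
    bracketed-word equivalents for HTML display."""
    for entity, shown in ENTITY_DISPLAY.items():
        text = text.replace(entity, shown)
    return text
```

The later image-based display replaces only the dictionary values; the driving loop stays the same.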
Simple maths encountered so far has been displayed using keyboard characters. e.g. Fractions: x/y. Generally, journals in SGML in the Materials Chemistry cluster, where there are many complicated equations, reference the equations as graphic files. SGML tagged equations are displayed using a mixture of keyboard characters and Greek character images. This is not perfect, especially as the images do not scale, but is a "step-on-the-way".
Some time was spent looking at formats for the graphics files for the figures. Some graphics formats, versions of EPS, were unreadable with any software available at Manchester. This may be down to lack of knowledge, but the same problems were not encountered with TIFF images. Problems were experienced with a sample file composed of text plus EPS files produced from 3 different packages (Aldus Freehand, Quark Express and Advent 3x2), 2 on Macintosh and 1 on Windows. All were transferred to UNIX but would not display with the graphics software xv or ghostview, nor would they display with any software on Windows PCs or a Macintosh.
Additionally, some EPS file problems have been cured by a simple, but manual, editing out of the leading characters in the file. This involved deleting all the leading characters (which were largely unreadable with a text editor) before the first "%%" in the file.
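That manual fix is easy to automate for the cases where it applies; this sketch mirrors the edit described, deleting everything before the first "%%" in the file:

```python
def strip_eps_preamble(path, out_path):
    """Delete the unreadable leading bytes before the first '%%' in an
    EPS file, writing the cleaned file to out_path."""
    with open(path, 'rb') as f:
        data = f.read()
    start = data.find(b'%%')
    if start > 0:          # leave files that already start cleanly,
        data = data[start:]  # or that contain no '%%' at all, untouched
    with open(out_path, 'wb') as f:
        f.write(data)
```

Writing to a separate output path preserves the original file, which can be kept as the archive copy of the data as received.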
We request a thumbnail size GIF file to display within an article and also a larger size GIF file to display when the user clicks on the thumbnail. Some publishers have additionally supplied a larger size of figure. We have not yet made use of these.
Most publishers have supplied these files. Where other formats have been supplied, we have used image file conversion software ImageMagick, with which we can convert between graphics formats and sizes. ImageMagick has a command line interface so we can set up batch conversions where necessary. It is, however, very costly in machine time to convert large numbers of figures required in some journal issues. Also the GIF files produced are not optimised and can be "large".
We have to rename graphics files and their paths from those used by the publishers for consistency within SuperJournal. Problems have been encountered where links to graphics files are missing or wrongly tagged. There have also been cases where figure files were missing.
The report may sound negative, as it highlights the problems encountered when setting up the data conversion and load process. In fact, we have been successful in creating a data handling process which is to a large extent scaleable. Most problems are encountered only once, because code is written to overcome them. The cases where manual editing of the supplied data files is necessary are in the minority.
Overall, those publishers who have devoted time and effort to quality control issues, including acting upon feedback from SuperJournal, shine.
Last modified: May 07, 1999