SuperJournal Application: Configuration, Integration, Performance



Ross MacIntyre, Manchester Computing, University of Manchester

SuperJournal Technical Report SJMC220

Contents:
1. Purpose of the Report
2. Overview
3. The Run-time Environment
4. Integration and Performance
5. Conclusions
Appendix A – Object Model
Appendix B – Screen Shots of Preference Page
Appendix C – Comparative Timing Spreadsheet

1. Purpose of the Report

2. Overview

This report covers the implementation and integration aspects of the project and the performance issues encountered once the run-time environment was established.

The application was built on Fujitsu's object database management system, ODB-II, licensed to the project by ICL. Objects were created to store the data corresponding to the journal article headers, and this data was also stored in the text retrieval system BRS/Search. The object model for the application and data is included in Appendix A.

Three search tools were implemented: Isite, NetAnswer and RetrievalWare, each requiring a different implementation approach.

3. The Run-time Environment

The following section describes some of the constituents of the run-time environment for the application. It is not intended to be detailed production documentation, especially as the software is no longer available or supported, but to give a flavour of the system and its associated control files.

3.1 Creating the Objectbase

The data and database design is documented in SJMC210 – Application Design.

The objectbase was created, and filestore reserved, using the system command mkob:

mkob -n media -w /fstore/NewApp/media

The above command created an objectbase called media under the directory /fstore/NewApp/media. If nothing had been specified by -w, the objectbase would have been created under the current directory. There was also a command line option -s, which specified the size of the objectbase in the range of 256 Kbytes to 4 Gbytes; the size given was rounded up by the system to the next multiple of 4 Kbytes. If -s was not specified, the objectbase was assigned the default size of 512 Kbytes.
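For illustration, the same command with an explicit size might have looked as follows. This is a sketch only: the value is hypothetical, and it assumes -s takes the size in Kbytes (here requesting 1 Gbyte, rounded as described above):

mkob -n media -s 1048576 -w /fstore/NewApp/media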

The objectbase could be expanded or deleted afterwards using the system commands expandob and delob, e.g.:

expandob -dp <port> media 10000

The above command assigned the objectbase media an additional 10 Mbytes disk space. The -dp flag specified the port number the object management system was to run on.

The system command odbctl allowed the development staff to check the database status. In particular, odbctl -dp <port> -o gave detailed information about the currently open objectbases, including the amount of space currently allocated to each objectbase and the amount of space free. An objectbase should not be allowed to become more than 70% full, as performance suffers as a consequence.

The user objectbase files included slot files, header files, and ODQL method files. The slot files (odbUser.slt) stored the saved directory information and the header files (odbUser.hd) stored data. Both were created when the objectbase was created, and deleted using the delob system command when they were no longer needed. In the objectbase directory there were also a number of lib*.so files. These were method library files, which stored the results of compiling user-defined (as opposed to system-defined) methods. Each class was held in a separate file, created automatically when the compileProcedure command was executed and deleted when the class was deleted. They should not be deleted unless the entire ODB-II system needs to be removed.

As with any data storage system, some sizing was required to estimate the amount of disk space needed, including any "temporary work files". The ODB-II run-time environment also used work and transaction objectbases. The system increased the size of these objectbases dynamically, up to a fixed maximum specified by the workSize parameter in the client environment file (see next section). The capacity of the work objectbase could become inadequate if the application accumulated the results of many large queries; in turn, the transaction objectbase could become inadequate if many objects were updated in a transaction. To decrease the likelihood of either, changes were made to the application design to make queries and transactions shorter; even so, the workSize parameter was increased during the life of the Project.

ODB-II had very high limits on the number of classes in the system, and on the number of subclasses for any class. It is unlikely that these limits would be hit (unless the database was used in some "peculiar" way). Having said that, large numbers of subclasses of any one class should in general be avoided; a common method is to create "artificial layers" to reduce the number if this situation occurs. For performance reasons, however, heavily used objects can benefit from being divided into fine-grained subclasses.

To illustrate this point, during the initial design stage of SuperJournal, authors and registered users were grouped into the same class, as all authors were potentially eligible to be users. Later, however, as the number of journal articles grew, the number of authors became large and this affected logon times: the time taken to find the real application users among all the authors became excessive. To redress this, users and authors were separated into two classes, User and Author. Because only a few authors actually used the application, they were required to register, as for other users, in order to become application users. The retrieval time improved dramatically.

3.2 Indexes

With ODB-II, as the number of objects (the database size) increased, retrieval times grew longer, because the database needed to scan through an increased number of instances to fetch the required objects.

Unsurprisingly, a number of indexes were created within the objectbase. These offered (almost) direct retrieval, as opposed to scanning all class instances. Object indexes could be created or deleted at any time but, in common with other database management systems, were better created after the data had been loaded; otherwise the opposite effect can occur, i.e. database operations slow down. Had the database been updated frequently, there would also have been a balance to strike between the benefit of having an index and the cost of maintaining it. However, the pattern of loading with SuperJournal was via scheduled off-peak batch runs.

3.3 Environment Parameter Files

Before doing any operation on ODB-II, two environment files needed to be configured: a server environment file and a client environment file.

3.3.1 The Server Environment File

The server environment file was needed when starting the ODB-II server. It defined the environment in which an ODB-II server operated. The following were the entries in the file:

portNo, systemDir, bufferSize, trace, traceFile, recoverMode, recoverLogFile
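For illustration only, such a file might have looked as follows. This is a sketch, not taken from the project's files: the one-parameter-per-line layout and all of the values are assumptions.

portNo         7320
systemDir      /fstore/NewApp/system
bufferSize     8192
trace          off
traceFile      /fstore/NewApp/logs/server.trc
recoverMode    on
recoverLogFile /fstore/NewApp/logs/recover.log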

3.3.2 The Client Environment File

The client environment file was needed when clients established a communication path with the ODB-II server, e.g. to start the ODQL interpreter or to activate the compiler. The following entries could be included in the client environment file:

bufferSize, CF, checkLevel, hostName, InterruptAntTime, interruptPossibleTime, maxInstanceSize, messageLevel, trLogFile, workDir, workSize.

The system administration menu stated that if the workSize client environment parameter was used, values must be given for both work and transaction objectbases. However, no documentation was supplied which explained how to set separate values for the two, so in the database settings it was assumed that workSize applied to both work and transaction objectbases.

For work objectbase:

Default value: 1024 Kbytes
Maximum value: 4,194,304 Kbytes (4 Gbytes)
Minimum value: 256 Kbytes

For transaction objectbase:

Default value: 2048 Kbytes
Maximum value: 4,194,304 Kbytes (4 Gbytes)
Minimum value: 256 Kbytes
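Again for illustration only, a client environment file setting the workSize parameter might have looked as follows. The layout and values are assumptions, not taken from the product documentation; the hostName is the machine named elsewhere in this report.

hostName   cs6400.mcc.ac.uk
bufferSize 4096
workDir    /fstore/NewApp/work
workSize   20480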

(Detailed explanations of the other client environment parameters are not included in this documentation.)

3.4 ODB-II's Concurrency Control

Like any other database system, ODB-II used locking techniques to preserve the consistency and integrity of updates made by concurrent transactions. An inevitable consequence was that deadlocks could arise, where transaction 1 is waiting for transaction 2 to release a lock at the same time as transaction 2 is waiting for transaction 1.

There were two types of locking, "exclusive" and "shared". "Exclusive" locks were used by processes that wrote to the database, and "shared" locks by processes that read from it.

Most of the transactions in the SuperJournal application read data from the database, but the transactions used in the registration process wrote to the database, requiring exclusive locking. To prevent the database deadlocking, a client-server mechanism was used to administer the database, so that users queued outside the database for a transaction, rather than inside it.

To see the effect, the ODB-II application socket model is briefly explained in the following steps (a minimal C sketch follows the list):

  1. The server process creates a generic socket with socket().
  2. The server binds the socket to an agreed-upon address with bind().
  3. The server notifies the system that it is ready for connections with listen().
  4. The server sits back and waits for the first connection with accept().
  5. The client process creates a generic socket with socket().
  6. The client binds the socket to any system-selected address with bind();
  7. Once bound, the client connects its socket to the server's socket with connect(), using the agreed-upon address. This establishes a connection. The client also sends its request to the socket.
  8. The server is notified of the new connection. Typical servers then fork a server child to handle the specific connection.
  9. The server child reads data from the socket which has been sent by the client. The server child also writes data to the socket, which is then available to be read by the client.
  10. When the server child and client have finished talking, they terminate the connection.
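The following is a minimal sketch in C of the iterative, non-forking server loop used here. It is illustrative only: the port number and buffer size are hypothetical, error handling is omitted, and the real odb_daemon additionally opened an ODB-II session and dispatched each request to the Application object within a transaction (see 3.4.1).

#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    struct sockaddr_in addr;
    char buf[1024];
    ssize_t n;
    int srv, cli;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(7328);                        /* hypothetical agreed-upon port */

    srv = socket(AF_INET, SOCK_STREAM, 0);              /* step 1: generic socket */
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));  /* step 2: agreed-upon address */
    listen(srv, 5);                                     /* step 3: ready for connections */

    for (;;) {
        cli = accept(srv, NULL, NULL);                  /* step 4: wait for a connection */
        if (cli < 0)
            continue;
        n = read(cli, buf, sizeof(buf) - 1);            /* step 9: read the client's request */
        if (n > 0) {
            buf[n] = '\0';
            /* ...start a transaction and invoke the Application method here... */
            write(cli, buf, n);                         /* write the result back to the client */
        }
        close(cli);                                     /* step 10: terminate the connection */
        /* no fork(): further requests queue until the loop comes round again */
    }
}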

As the above description of the client-server process suggests, if the application were very busy, i.e. many requests arrived at once, it would be best for the server to fork child servers to deal with the concurrent requests. But careful planning of the database concurrency control would then be required to prevent the database deadlocking.

Considering that this was a research project with fairly low concurrent use and reasonably short transactions, so that a connection would not be held for very long, the application was designed so that the server did not fork when it received requests from the socket. The advantage of this approach was that deadlocking could not happen. The disadvantage, which became apparent towards the end of the project when the user base had expanded, was that concurrent requests had to wait for the server to finish dealing with the current request before it listened to the socket again.

3.4.1 The Server Program

The server program was a C program with embedded ODQL called odb_daemon.kc. It was compiled with the following:

nelcc -envFile ../nel/envfile.cs6400 -c odb_daemon.kc

which produced the object code odb_daemon.o and also ODB-II reference file odb_daemon.ref. This was linked using:

nelcc -o odb_daemon odb_daemon.o

to get the server program odb_daemon.

The server first created the socket connection and started an ODB-II session to connect to the database. A transaction was then started and closed to create an Application object, which was used to administer all database activities. It then "sat back" in the system, listening to the socket.

When the client sent the data to the socket, the server got the data (which was already in the format of command/parameters), and started another transaction to invoke the appropriate method of the Application object. This transaction was closed as soon as the method invocation was finished. The server then returned to the listening state. The result of the method was sent out to the standard output, in this case, the user's Web browser. The server remained in this listening/acting loop until the server process itself was finished (normally or abnormally). The ODB-II session was closed before the closure of the server program.

3.4.2 The Client Program

The code for the client program was called odbnewapp.c. It was compiled with:

cc -c odbnewapp.c

and the executable created with:

cc -o ../cgi-bin/newappdb/odbnewapp odbnewapp.o -lsocket -lnsl

The client program received the data from the environment variable PATH_INFO or from the form data. It analysed which type of database command the user wanted to process, re-organised the request if necessary, dependent upon the database command and/or the source of the data (PATH_INFO or the form), established the connection, and sent the request off to the socket.
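A corresponding minimal sketch of the client side follows. The server address and port are hypothetical and error handling is omitted; the real odbnewapp.c also parsed form data and re-organised the request before sending it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    const char *req = getenv("PATH_INFO");        /* command/parameters from the Web server */
    struct sockaddr_in addr;
    char reply[4096];
    ssize_t n;
    int fd;

    if (req == NULL)
        return 1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(7328);                  /* hypothetical agreed-upon port */
    addr.sin_addr.s_addr = inet_addr("127.0.0.1");/* hypothetical server address */

    fd = socket(AF_INET, SOCK_STREAM, 0);         /* step 5: generic socket */
    connect(fd, (struct sockaddr *)&addr, sizeof(addr)); /* steps 6-7: connect to the server */
    write(fd, req, strlen(req));                  /* send the request */

    while ((n = read(fd, reply, sizeof(reply))) > 0)  /* relay the server's output */
        fwrite(reply, 1, (size_t)n, stdout);          /* to the user's Web browser */

    close(fd);
    return 0;
}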

4. Integration and Performance

4.1 Logon

Users of the application were required to register, having been provided with registration information by the library at their institution. Subsequent access required the user to enter their email address and optionally a chosen personal identifier. (See SJMC240 for full description of Registration and Login.) The email address identified the individual and was the primary key to their instance of the Person object.

As mentioned earlier in the report, a subclass of Person was defined to separate the User objects from Authors. The class definition is as follows:

defineClass CFmedia::User
super: CFmedia::Person
{
maxInstanceSize: 4;
class:
String shortName;
String unmangleSessionKey(String Mangle, String Oid);
instance:
String username;
String password;
String sessionKey;
String servername;
Integer loginTime;
Integer lastAccessTime;
Integer timeOut
default: 7200;
CFmedia::Media media
default: NIL;
CFmedia::Form form
default: NIL;
CFmedia::Container container
default: NIL;
String libName;
String ipAddress;
CFmedia::Media previousPage;
String searchEngine default:"No Preference";
String preferedCluster default:"SuperJournal Clusters";
String startScreen default:"SuperJournal Clusters";
String homePage default:"SuperJournal Clusters";
String set ccsNotifyNewIssue default:EMPTY;
String set mgpNotifyNewIssue default:EMPTY;
String set psNotifyNewIssue default:EMPTY;
String set ppNotifyNewIssue default:EMPTY;
Article set preferedReading default:EMPTY;
String NetAnswerSesskey;
Boolean stillLoggedIn();
Boolean makeSessionKey();
String mangleSessionKey(String SessKey, String Oid);
Boolean setPassword(String Password);
Boolean setUserName(String username);
String getUserName();
String getOid();
Boolean contains(CFmedia::User user);
Boolean updateAccessTime();
};

This separation of objects led to a significant improvement for logon, of the order of one quarter of the previous time. (See Appendix C – Comparative Timing Spreadsheet.) The login time increased gradually thereafter, but still remained at around half the previous time. This reflected the overhead associated with supporting personal preferences within the application using ODB-II.

An additional point to note, though outside the control of the Project, was the login time experienced by users outside JANET during 1996. Publishers involved in the Project suffered from the then bottleneck between JANET and the commercial Internet. Following the upgrading of the link and the additional connection via Manchester, no further complaints were received.

4.2 Browsing

The flexibility offered by using object technology was exploited during the database design and is documented in SJMC210 – SuperJournal Application Design. In essence, each significant tagged element from the SGML header was stored as an object. The application was designed to assemble the objects that constituted a page. As objects are not physically linked together, the re-assembly stage depended upon each object containing an ordered set of object identifiers for its constituents, or "members". For example, when looking at a page of journals, if a particular journal was selected, the application accessed the relevant Journal object and used the property containing the ordered set of object identifiers for its issues.

The use of object technology as the mainstay of the application architecture led to performance problems when use of the service increased substantially in the third year. The amount of processing performed in order to link the objects (e.g. title, author, and keywords) back together for display was an inhibiting factor.

The application was re-engineered to still convert to HTML "on-the-fly", but to use the original SGML files as input, rather than an assembly of objects. When a user selected "View Abstract" (i.e. view article header/metadata) in the SuperJournal application, the displayed HTML file was generated on-the-fly by an OmniMark script which processed the SuperJournal SGML header file. This same script was also used to display an article's header when a user followed reference and "cited by" links (see SJMC230 – SuperJournal Application Special Features). It identified the required SGML elements and output them as HTML elements. For all these links, the URL included a call to a cgi script which logged the user's action before performing it. This logging added slightly to the length of response times experienced.

After this fundamental change was implemented, response improvements were dramatic and compared favourably with other electronic journal services. See Appendix C – Comparative Timing Spreadsheet.

[Note: It may also have been possible to reduce the time associated with the on-the-fly conversion using OmniMark, as this currently includes a licence manager check. The freely available OmniMark Lite (OMLE) allows up to 200 "actions", which may be adequate for the header conversions.]

The URL displayed as the browser's location was a "one-time" URL, derived from the current session information, including user identifier. This was very flexible, but did entail more processing than just retrieving an HTML file.

4.3 Searching

4.3.1 Isite

The name "Isite" actually refers to "an integrated Internet publishing software package which includes a text indexer/search system, Isearch, and Z39.50 communication tools to access databases." Only the indexer/system was implemented.

The most recent Isearch used was version 1.20.06, the latest version from CNIDR. Compared with the older versions used (versions 1.10.03 and 1.10.05), indexing and searching had improved considerably.

The database index was split into several index files, rather than squeezed into one single file as in the older versions, so that the time needed for merging the database indexes was shorter. However, it was found that Iindex created an index file for each appending job (even if it was only appending one file to the database). It was therefore possible for a huge number of database index files to be created, which would affect the searching speed to some degree. In addition, the size of the database had quite a big influence on search speed and accuracy.

Isite was integrated with the main application as one of the three search engines. It was a freeware search engine and, as the source code was available, offered great flexibility.

Three types of search were supported: simple, boolean and advanced. Simple search allowed up to three fields to be searched, with the "OR" relation operating between them. Boolean search allowed two fields to be searched concurrently; the Boolean operators allowed were "AND", "OR" and "ANDNOT". Advanced search allowed users to type the query into one field as plain text with complicated relation nesting. Only the boolean search page was used by the project.

The search interface was created by simply editing an HTML search form. Links were created to cgi-scripts to support the toolbar functions. The list of field names available in the on-screen pull-down lists was hand-typed, as opposed to being derived from a schema or from the data itself.

Compared with commercial search engines, Isite lacked advanced search facilities such as QBE (query by example), natural language search, etc. It was a simple, straightforward search engine and could not search multiple databases. It had rigid database size limitations, and the size influenced the search speed and accuracy.

It was found that when the number of database files reached a "certain" figure, searching of the database failed with a "too many files opened" error message. Whilst loading new data into the application, the new files were appended to the corresponding existing Isite database and a new database file was created. After a while, the number of database files fell foul of the limitation, which made the database unsearchable. To circumvent this, the entire database was regularly re-indexed, removing the incrementally appended files.

In two of the four clusters (Communication & Cultural Studies and Political Science), PDF was the only format for the full text files. The other two clusters (Molecular Genetics & Proteins and Materials Chemistry) contained both HTML and PDF files. As a consequence, none of the CCS and PS header files contained an <HTMLLOC> tag, which identified the location of the HTML format version.

When Isearch searched the CCS and PS databases, it displayed an error message "<header file path name> can't be opened" on the search result page. The reason was that all five databases used the same Isite cgi-script, inside which conditional "if" statements tested whether the PDF or HTML file existed. To get around the problem, two dummy header files, which had an <HTMLLOC> tag, were indexed.

4.3.2 NetAnswer

NetAnswer is the Web interface for Dataware's BRS/Search, which was one of the three SuperJournal search engines.

The version of NetAnswer used (version 1.1) was stateless, so each search request sent from the Web needed to launch an independent BRS/Search session. It could not "remember" the relationship between calls made by the same user. A stateful NetAnswer protocol would have speeded up searching and allowed search results to be optimised by refining the previous result set.

Separate BRS databases were created for each of the four clusters. BRS/Search supports cross-database searching, so this did not inhibit the functionality supported.

4.3.2.1 NetAnswer Configuration

The NetAnswer configuration file, neta.cfg, contained details for the NetAnswer server, such as the MIME type declaration file, the Internet server cgi executable, process timeouts, error screens, types of highlighting, icons, etc. It also contained database-specific tailoring. Below is the tailoring done for one of the four SuperJournal live databases, SJG0, which contained the header data for the CCS cluster:

[SJG0]
docPtf=SJG0DOC
tocPtf=SJG0TOC
tocPara=ATL,AUS
docPara=all
searchQualify=ALL;ATL;KWDS;ABS;AUS;AFF;JTL;PNM
# Increase document per page limit - really governed by limit on results set
docsPerToc=100000
requestLog=/filename/BRSdbs/Logs/sjg0/log.cfg
#X16384+=set plurals on, X128+=Last in First Out ordering of hits,
#X2048+=allow wildcard at start of search term
tocOptions=X16384+;X128+;X2048+
docOptions=X16384+;X128+;X2048+
noHitsPage=/filename/username/WWW/NetAnswer/SJG0noHits.html

Most of the above are settings for searching, such as auto-pluralisation, i.e. giving a hit for both mouse and mice. The two entries docPtf and tocPtf name the print time format files for document display and hit list display respectively. The values given point to files in another directory, in this case called SJG0DOC.fmt and SJG0TOC.fmt.

The following is an extract from the search hit list display:

<TITLE>SuperJournal G0 (CCS) Database -- Search Results</TITLE>
<H2>Search Results: NetAnswer - Brief Display</H2>
<I>Search Terms: </I>[QUERIES:999 ASCENDING]<BR>
<I>Documents</I>: #BUF30 - #BUF31 of #RSLT
<HR>
<B>[ATL]</B>
<BR><i>[AUS]</i>
<BR>[JTL]|Vol|[VOL]Iss|[INO]|[CD:DE "No Date Avail."]pp[SPG]-[EPG]
<HR>

It retrieves the hit records, puts them in buffers for display, and shows the search terms entered. BUF30 and BUF31 contain the start and end numbers of the hits displayed; RSLT is the total number of hits found. The hits are then displayed, showing the article title [ATL], author(s) [AUS] and journal issue information, made up of journal title [JTL], volume [VOL], issue [INO], cover date [CD] (with a default if no date is available), start page [SPG] and end page [EPG].

Portions of the screen display would be as follows:

[Screen shot: NetAnswer hit list display (image59.gif)]

Each entry on the hit list is hypertext linked to the full header display, the look of which is determined by SJG0DOC.fmt. The following extracts are taken from the live file to illustrate some of the integration decisions:

:VALID S* SELECT-PARAS
<TITLE>SuperJournal G0 (CCS) Database -- Document Display</TITLE>
<H2>Search Results: NetAnswer - Full Record Display</H2>
<I>Document :| </I>| #DOC of #RSLT<BR>
<HR>
:IF ATL
\o1Title: \f1
<B>[ATL]</B>
<BR><BR>
:ENDIF
\~
:IF ART
:SET BUF1=ART:1L
<HR>
<A HREF="http://www.superjournal.ac.uk/cgi-bin/cgiwrap/user/netacgi/go/reftex.pl?/superj1/Journals/#BUF1">
Download in Tagged Format</A>
<HR>
:ENDIF

There is a link from the screen to a Perl script (reftex.pl), stored within the netacgi directory, which takes the content of the buffer and converts it to a text file in a simple tagged format, suitable for downloading and storing in a personal bibliographic database. The formatting is actually done using another print time format file.

\~
:IF ART
\o1Full Article: \f1
:SET BUF1=ART:1L
<A HREF="http://www.superjournal.ac.uk/cgi-bin/cgiwrap/user/logs/sjg0disp//BRSpdf/#BUF1"> [ART] </A>
<BR><BR>
:ENDIF
\~
<HR>

The full article is accessed by following the above link. This takes the contents of the buffer, essentially the journal filename, and reformats it to locate the physical file.

4.3.2.2 Additional Integration

NetAnswer was available from the main SuperJournal application as one of the three search engines. Once it was selected from the main application, users were taken directly to the NetAnswer search form.

Much of the integration can be seen from the work done defining what was to appear on screen, a true WYSIWYG approach. NetAnswer works with a default set of icons, which indicate certain functions within BRS/Search. The defaults were replaced or removed in an effort to provide a common feel to the application. This was not always possible.

The following shows extracts from the icon configuration file. In the main there are four possible settings for each icon: iconImage, iconGrey, iconAlt and iconHelp.

The exception to this is the "Home" icon, which has an additional property, iconRef, meaning it can call a script. In SuperJournal, this calls a script which looks at the IP address in use in order to find the user and their declared preference for the Home icon. This is rather inexact, but all use of BRS/Search via NetAnswer was anonymous, due to its statelessness.

[IMGSJG0]
#### SuperJournal Home Page
iconRef1="http://cs6400.mcc.ac.uk/cgi-bin/cgiwrap/supjinfo/newappdb/odbnewapp/NetAnswer/display/home"
iconImage1=/sj/images/b_home.gif
iconAlt1="[SuperJournal Home Page]"

The search icon could not be redirected to another search page; it would only go to NetAnswer's.

#### Search
iconImage2=/sj/images/b_search.gif
iconAlt2="[Search]"

One trivial but irritating aspect was the inability to remove an item from the help file where the function was not wanted. In the application, these unwanted functions had to have their iconHelp set to ".", but this was still displayed when the help file was accessed.

#### Nullify image3
iconImage3=
iconGrey3=
iconAlt3=
iconHelp3=.

The remaining icons were mapped to their logical equivalents within SuperJournal and used the same images as on the familiar toolbar, e.g.

#### Up
iconImage4=/sj/images/b_up.gif
iconGrey4=/sj/images/G_up.gif
iconAlt4="[Up]"

4.3.3 RetrievalWare

RetrievalWare was a commercial search engine from Excalibur Technologies. During the project it was updated from version 6.0 through versions 6.0.1 and 6.5 to version 6.5.1.

Versions 6.0 and 6.0.1 had a number of common problems.

It took about 18 months for these problems to be resolved by updating to version 6.5. Unfortunately, a new problem arose in version 6.5: when the "preference" tab was clicked, a JavaScript error message was displayed, the search engine stuck on the preference page, and it was impossible to go any further. This was partially solved by applying the RW6.5.1 patch, but error messages were still displayed in older versions of Netscape and MSIE.

4.3.3.1 RetrievalWare System Administration and Configuration

Two RW environments were created, one development and one live, integrated into the development application and live application respectively. The two had a similar structure in terms of file distribution, and only the live version is described here.

The RetrievalWare system could be administered via the Web interface, or Admin Wizard, which offered a range of set-up functions.

Another way of administering the RW system was to do the above setting up offline, from the command line. The situation to avoid at all costs was mixing the two methods: the Wizard would blindly overwrite settings which may have been well thought out and made via the command line.

RetrievalWare had two main configuration files: rware.cfg, containing information about system parameters and libraries; and exec.cfg, containing information about the various servers.

RetrievalWare was acquired to "bolt on" to the existing application. It indexed data in situ, working on units defined as "libraries". These could be gathered together into "groups", if appropriate for the intended user base. Libraries in the same group were located in the same directory, and the Web page would only list the libraries for the user's group. Each group of libraries had to have the two configuration files rware.cfg and exec.cfg.

The following is the directory structure for a group of libraries:

<directory containing a group of libraries>:

config : stores the configuration files
badfiles : stores the filenames of files which failed the document parser
indexes : stores indexes for the libraries; each library has a sub-directory to store its index files
dp : stores document parser files for each library
logs : stores server log information
user_params : stores information about users

After a library was set up in the rware.cfg file, a document parser file had to be defined before the new library was created. This controlled how the documents in a library were both indexed and viewed.

There was a system utility menu (a Unix shell script) which organised all the RetrievalWare command line programs into a series of menus, so that the programs and their arguments did not have to be remembered. The programs took a range of arguments for different needs, however, and sometimes still had to be executed directly on the command line.

To create a new library using the standalone indexer from the command line:

/Excalibur6.5GO/rware/bin/sindex -cfg <path to the configuration file rware.cfg> -library <name of the library> -new -log -file <name/name specification (with wild cards) of the file/files to be indexed> (a prefix @ is needed if the file lists the files to be indexed).

To append files to an existing library:

/Excalibur6.5GO/rware/bin/sindex -cfg <path to the configuration file rware.cfg> -library <name of the library> -log -file <name/name specification (with wild cards) of the file/files to be indexed> (a prefix @ is needed if the file lists the files to be indexed).

The –log flag was used to log the indexing process to a file named indexlog.dat situated in the directory where the library index files were stored. Once a new library was created, queries to the library could be made via the system utility menu or via the Web browser.
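For illustration, creating a hypothetical abstracts library might then have looked as follows; the configuration path, library name and file list are invented for the sketch:

/Excalibur6.5GO/rware/bin/sindex -cfg /Excalibur6.5GO/demos/config/rware.cfg -library sjg0abs -new -log -file @/tmp/sjg0_files.lst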

SuperJournal had four RW libraries of full text files and four libraries of abstracts, one of each for each journal cluster. The SuperJournal libraries were grouped under a directory called "demos", created during software installation, rather than under a separate directory: the PDF highlighting feature "disappeared" in directories other than "demos", for some unknown reason.

4.3.3.2 Searching Libraries via the Web

RW used its own http server (Apache_1.2.0), as the project needed to have two versions of RW running, development and live. (It was viewed as somewhat unusual to have both on the same platform.)

The Web searching was made up of an http server, a CGI interface (cqcgi) and a Front-End (FE) server. The cgi interface also required HTML template documents and a template cross-reference file.

4.3.3.3 RetrievalWare Integration with the Main Application

RetrievalWare was integrated into the SuperJournal application as one of the three search engines. The http server for the live RW system used a predefined port number, 7329. When RW was selected from the application, a URL was called directly, with all the form data required for the logging page packed into the QUERY_STRING environment variable in the following format:

http://www.superjournal.ac.uk:7329/cgi-bin/cqcgi/
@rware.env?CQ_SAVE[CGI]=/cgibin/cqcgi/@rware.cfg&
CQ_PROCESS_LOGIN=YES&CQ_LOGIN=YES&CQ_SAVE[New_Query]=FALSE&
CQ_USER_NAME=<the SuperJournal username>&CQ_PASSWORD=?????.

Under normal operation a user would log in to RW; however, this would have been an irritation within the application, as the user would already have opened a session. The above hid the login screen from view, making the move transparent.

Very little effort was devoted to changing the appearance of RetrievalWare, though the templates provided offered scope for this to be done. More time was spent addressing the various shortcomings of the software and trying repeatedly to get answers from Excalibur. The RW home icon in the RW search pages was configured to call essentially the same script as that used from NetAnswer, allowing the user to have their home preference consistently supported.

4.4 Special Features

4.4.1 Persistent User Preferences

A key element of the project was the provision of choice to the user. This covered a number of areas, and for some features the user could register a preference that would persist. This prevented the user from having to repeatedly make the same choice each time they used the application. Users could change their persistent personalised preferences at any time within the application. See Appendix B for screen shots of the Preference setting page.

Personal preferences included the following:

Personal "Reading List"

A personalised list of "hot" articles (bookmarks). Buttons for adding an article to the hot list were available on the issue table of contents page and on article header pages. The button for removing an article from the hot list was just below the pull-down menu, and the "View" button next to the "Remove" button displayed the selected article from the hot list directly.

The personal reading list was stored in the attribute "preferedReading" in the class User. Its datatype was "Article set".

Search Engine

There were four options for choosing the default search: No Preference (default), Isite, NetAnswer and RetrievalWare. "No Preference" displayed a page with links to all three search engines when the search button in the toolbar was clicked; the other three options went directly to the appropriate search page for the chosen search engine.

The search engine choice was stored in the "searchEngine" attribute in the class User as "String" datatype.

Preferred Cluster

There were five options for the choice of the preferred cluster: SuperJournal Clusters (default), Communication and Cultural Studies, Molecular Genetics and Proteins, Political Science, and Materials Chemistry. In the context of application integration, the Isite search page would automatically launch the search against the database of the preferred cluster.

This preference was stored in the attribute "preferedCluster" in the User class as "String" datatype.

Start Screen for Next Session

The valid options for this property were: SuperJournal Clusters (default), Communication and Cultural Studies, Molecular Genetics and Proteins, Political Science, Materials Chemistry, and Previous Session. This setting decided the first page to be displayed when starting a new session. The "Previous Session" selection would retrieve the last object used; it was not possible to reconstruct hit lists, but it functioned adequately for browsing activity.

This preference was stored in the attribute "startScreen" in the User class as "String" datatype.

Home Page

The personal home page could be chosen from the following five options: SuperJournal Clusters (default), Communication and Cultural Studies, Molecular Genetics and Proteins, Political Science, and Materials Chemistry.

It was stored in the attribute "homePage" in the User class as "String" datatype.

Time Out

If a session had been started but not used for longer than the "time out" number of minutes, the session was abandoned. Effectively, the user's session key was deleted, and the user was given a "Timed Out" message should they subsequently try to access the system by clicking on a page left in the local cache on their terminal. This was a cautionary measure introduced to deal with public access areas: it was essential to try to ensure that the data being logged did match the behaviour of the person whose identifier was being used.

It was stored in the attribute "timeOut" in the User class as "Integer" datatype.
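As a minimal sketch (a hypothetical C helper, not the project's ODQL), the test behind the User object's stillLoggedIn() method amounts to comparing the time elapsed since lastAccessTime with timeOut:

#include <time.h>

/* Returns non-zero while the session is still live; both arguments
   are in the units stored in the User object's attributes. */
int still_logged_in(time_t last_access_time, long time_out)
{
    return (time(NULL) - last_access_time) <= time_out;
}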

Personal Alerting

Personal alerting allowed a user to enter keywords or phrases for a particular cluster. When new data was loaded, searches were launched against the new data, using the terms entered, and hits notified to the user by email.

These were stored in the attribute "xxxSearchword", where xxx = ccs, mgp, ps or mc, as "String set" datatype.

Notify New Issue via Email

When new journal issues were loaded to the application, users could be notified by email about the contents of the issues. This setting allowed the user to select the journal(s) of interest from the four clusters.

These were stored in four corresponding properties for each cluster in the User class, all as "String set" datatype.

Personal Details (Email, Personal Identifier)

A user's email address and personal ID could be changed at any time via the preference page.

The email address and password were stored in the "username" and "password" attributes respectively of the User class, both of "String" datatype.

4.4.2 Technical Implementation of the Personal Preferences

The personal preferences were stored in the ODB-II database in an object of the PreferenceForm class. The form object consisted of a set of fields, one per preference setting.

When the preference form was displayed as an HTML page, each field in the form wrote its own HTML section. The field also fetched its corresponding current value from the user object, displaying it as the default field value. Users were thus able to view their current preference settings on the preference page.

On submission:

  1. The preference form called the method "handleSubmit" of an object of the Application class. The "handleSubmit" method analysed the data passed by the client cgi program and checked which type of "submitButton" it was; in this case it found a PreferenceButton, which was then retrieved.
  2. The PreferenceButton object called one of its methods, "submitCallback", which took the values in the preference form and applied any changes to the appropriate user properties.
  3. On the success of the "submitCallback" method, the "submitSuccessMessage" method was called to briefly display the changed user preferences.

In the end, the "Preference" form contained all personal settings and could be said to appear cluttered.

4.4.3 Personal Alerting

Users declared their keyword(s)/phrase(s) of interest for each cluster via the Preference form. The function was then executed as part of the database loading process.

The personal alerting was the last step in the loading process. By then all articles had been successfully loaded into the ODB-II database and indexed with Isite and RetrievalWare. Temporary Isite databases were created for each cluster into which new articles had been loaded. The personal alerting was done by executing an ODB-II application program, "palerting".

This then carried out the following sequence of actions:

  1. Checked if new articles had been loaded in each cluster,
  2. If yes, got the users who had set up the search keywords/phrases for this cluster,
  3. Got the search keywords/phrases,
  4. Searched for each keyword/phrase in the corresponding temporary Isite database,
  5. If hits were found, wrote them to a result file, creating the file if it did not yet exist and appending to it otherwise.

The search program "Isearch_rst_to_file" was modified from the original Isearch source code. It sent the search results to the file specified on the command line, in a specified mode (write or append).

The perl program "ParesultFilter" was then used to filter the search result file so that each article was unique; the article title, authors and journal issue information were fetched from the header file and output in the desired format. This was written to a temporary file as mail message headers with the search results, and the temporary file was then used to send email messages to the users.

All this was logged.

4.4.4 The Keyword & Author indexes

The purpose of the keyword and author indexes was to help application users browse the application selectively, the selection being driven by either keyword or author name. A "hit" was thereby guaranteed.

When an article was loaded into ODB-II, the keywords in the header file were stored in the "keywords" property ("String set" datatype) of the article object, and similarly for the author names.

The indexes were offered from the Isite and NetAnswer search screens, the user selecting a cluster and one or more letters to match the initial letter of the keyword/author. ODB-II called the "handleSearchKeyword" method of the object of the Application class. The method got all the articles in that cluster, collected all their keywords in a set, and found all the keywords in that set starting with the required letter. The search result page was displayed as a list of hyperlinks, each an Isite search using the keyword as the search term, with the number of occurrences displayed beside it.

As the number of keywords in ODB-II quickly became large and, as can be seen from the description, the process was not very efficient, users had to endure slow response times. It was planned to amend the data processing and load to produce the keyword/author lists as flat files, rather than building them dynamically.

5. Conclusions

Appendices

Appendix A – Object Model
Appendix B – Screen Shots of Preference Page
Appendix C – Comparative Timing Spreadsheet

 

Appendix A. Object Model

[Object model diagram: image35.gif]

Appendix B. Screen Shots of Preference Page

[Screen shot of the preference page, 1 of 2: pref1bw.gif]

[Screen shot of the preference page, 2 of 2: pref2bw.gif]

 

Appendix C. Comparative Timing Spreadsheet

 

 

 

(All timings are in seconds. The first NewApp column records a beta-test run; the "In" row gives the date and time of each of the other timed sessions.)

Action | NewApp | OldApp | ServiceA | NewApp | ServiceB | ServiceC | ServiceA | NewApp | ServiceA | NewApp
In | Beta-test | 07-Jan 15:00 | 07-Jan 15:20 | 26-Jan 09:30 | 26-Jan 09:40 | 26-Jan 10:00 | 26-Jan 10:15 | 27-Jan 15:30 | 27-Jan 15:40 | 28-Jan 15:20
Login | 7 | 52 | 33 (IP check) | 11 | 3 (IP check) | 4 | 53 (IP check) | 12 | 95 (IP check) | 14
Select First Cluster | 7 | 15 | 30+17 | 5 | 6 | 6 | 2+22 | 6 | 10+1 | 13
Go Home | 2 | 16 | 12 | 3 | 3 | 1 | 42 | 5 | 10 | 4
Select Second Cluster | 4 | 15 | 16+29 | 4 | 5 | 5 | 15+2 | 8 | 15+13 | 11
Choose Title | 11 | 23 | 9+26 | 4 | 5 | 5 | 24+24 | 6 | 16+39 | 7
Choose Issue | 9 | 39 | 35 | 6 | 17 | 8+2 | 49 | 9 | 69 | 9
Choose First Abstract | 8 | 39 | 19 | 9 | 19 | 2 | 13 | 6 | 20 | 9
Next | 5 | 32 | 7 | 5 | 15 | n/a | 20 | 8 | 2 | 7
Next | 5 | 31 | 16 | 5 | 6 | n/a | 12 | 5 | 4 | 5
Next | 5 | 37 | 17 | 4 | 42 | n/a | 34 | 5 | 43 | 7
Up to ToC | 8 | 50 | 31 | 5 | >180, >120 | 3 | 23 | 6 | 24 | 9
Go Home | 2 | 15 | 4 | 3 | 18 | 1 | 21 | 5 | t/o, 64 | 5
Select Third Cluster | 8 | 26 | t/o,crash,4+12 | 6 | 13 | 2 | 13+9 | 7 | 13+54 | 8
Choose Title | 3 | 28 | 8+4 | 4 | t/o,14 | crash, 5 | 77+,t/o,44 | 5 | 37+54 | 4
Choose Issue | 7 | 43 | 10 | 6 | >120,14 | 46+t/o,5 | 15 | 5 | >120,122 | 8
Previous | 5 | 33 | 6 | 4 | 36 | 5+2 | 23 | 6 | 13 | 6
Up to Titles | 4 | 14 | 21 | 4 | 30 | 9+4 | 17 | 6 | 62 | 5
Next | 6 | 18 | 21 | 4 | n/a | n/a | 40 | 6 | 69 | 8
Next | 7 | 19 | 55 | 5 | n/a | n/a | 19 | 6 | 25 | 9

t/o = timeout
crash = system fault
> = time greater than, e.g. >120 = over 120 secs

 
