Ann Apps, Manchester Computing, University of Manchester
SuperJournal Technical Report SJMC260
1. Overall Approach
2. Tools Used
3. Usage Statistics Generation
4. RetrievalWare Log Files Conversion
5. Invalid Log File Entries
6. Problems and Issues
Appendix 7.1 C++ Classes in the Usage Statistics Generation Program
Appendix 7.2 Header files
Appendix 7.3 SPSS Data Dictionaries
Appendix 7.4 Mean and Standard Deviation
Appendix 7.5 Code Organisation and Compilation
SuperJournal usage statistics are taken from logging within the SuperJournal application. These statistics are fed into the SuperJournal Evaluation Research undertaken by the HUSAT Research Institute at the University of Loughborough. They are also displayed in the "members only" section of the SuperJournal web site. The statistics are processed on a monthly basis, covering February 1997, when the SuperJournal application was launched, through November 1998.
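The generated usage statistics consist of a set of HTML pages, which display the statistics in tabular form, and an SPSS file listing the logged events.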
This document describes the program used to generate the usage statistics from the SuperJournal application log files. The detailed specification of the SuperJournal application log files, the generated HTML pages and the SPSS file are given in [SJMC261]. Details of usage statistics generation in a "User Guide" form are in [SJMC262].
The usage statistics generation program is written in C++, with a Lex/yacc generated front-end for the lexical analysis and parsing of input. It is called via a Unix shell script. Details of the use of these programming languages within the SuperJournal project are given in [SJMC140].
The SPSS "saved" data file and portable file are created by SPSS commands after verification of the generated SPSS data file. A description of the statistics package SPSS is beyond the scope of this document.
SuperJournal usage statistics are generated each month within a directory created for that month in the "Logs" area (/superj4/Logs). The month's log files from all parts of the SuperJournal application are in a sub-directory of this "month" directory called Logs. Assembling and pre-processing the previous month's User Register and this month's application log files so that they are in a form suitable for input to the usage statistics generation program is detailed in [SJMC262]. Statistics are generated by calling the script dologs:
The statistics generation script dologs is a Unix shell script which:
SuperJournal usage statistics are generated by a program, logs, which is written in C++ with a Lex/yacc generated front-end. The arguments to the program are, in this order:
At the start of the program a Log object is created. Each file specified on the command line is read by calling the Lex-generated lexical analyser to recognise syntactic tokens in the input which it passes to the yacc-generated parser. This parser checks the syntax of the input and then takes appropriate action, which generally consists of passing information to the Log object, by calling one of its public operations. The yacc-generated parser is able to parse input in all the required formats:
When all the input has been parsed, the Log object outputs its accumulated information as a set of HTML files, which display the usage statistics in a tabular form, and an SPSS data file.
The program logs consists of the following C++ classes, whose public operations are listed in Appendix 7.1. Coded values used within the program are detailed in Appendix 7.2. Compilation and source code management are described in Appendix 7.5.
A Log object accumulates all the logging information passed from the parser, generates the statistics, then outputs them in the required format.
A LogEvent object is created for every logged event. Where necessary, information about the event is recorded from data provided by the parser, in some cases after further string parsing. LogEvent holds:
The LogEvent is output:
There is a LogLib object for each library, including an "other" (Z) library to which all users who do not belong to a University library are attached. Usage statistics calculated and output in tabular form in a set of HTML pages are:
Additionally, this library's section of the new User Register is output.
A LogUser object is created for every user registered with SuperJournal, either from information in the User Register or from logged information about new registrations. At the end of reading all the log files, correlations are made between events and users to fill in missing information and to assist in the statistics generation. Statistics calculated and output in tabular form are:
A LogJnl object is created for every journal within SuperJournal. The LogJnl objects are created at the start of the logs program from information held in the header file logs.hxx which includes the journals' masked names. Issue and article information is linked into each LogJnl object as the journal catalogue is read, when volume and issue masking occurs. Issues and events are correlated after reading all the log files. Statistics calculated and output in tabular form are:
A LogIss object is created for every journal issue within SuperJournal as the journal catalogue is read, and added to the journal issue list. Issues and events, and articles and events, are correlated after reading all the log files. Statistics calculated for output in tabular form are:
A LogArt object is created for every article in SuperJournal as the journal catalogue is read, and added to the journal issue article list. Articles and events are correlated after reading all the log files. Article usage information is output in tabular form.
A LogTime object is created for every date and time pair processed. It holds each of the date/time elements as integers: year; month; day; hour; minute; second. For NetAnswer time processing it also holds the same elements for the action "end" time. It provides functionality to compare, subtract and output date/time pairs.
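The compare, subtract and output functionality described above can be illustrated with a minimal sketch. It is not the LogTime class itself: the names TimeStamp, compareTimes, secondsBetween and formatTime are hypothetical, the output format is an assumption, and the code uses C++11 features that were not available to the original program.

```cpp
#include <cstdio>
#include <string>
#include <tuple>

// Hypothetical stand-in for LogTime: each date/time element held as an integer.
struct TimeStamp {
    int year, month, day, hour, minute, second;
};

// Days since 1970-01-01 in the Gregorian calendar (standard civil-date arithmetic).
long long daysFromCivil(int y, int m, int d) {
    y -= m <= 2;  // treat January and February as months of the previous year
    const long long era = (y >= 0 ? y : y - 399) / 400;
    const long long yoe = y - era * 400;                                  // year of era
    const long long doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1; // day of year
    const long long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           // day of era
    return era * 146097 + doe - 719468;
}

// Seconds since 1970-01-01 00:00:00, ignoring time zones and leap seconds.
long long toSeconds(const TimeStamp& t) {
    return daysFromCivil(t.year, t.month, t.day) * 86400LL
         + t.hour * 3600LL + t.minute * 60LL + t.second;
}

// Compare: returns -1, 0 or 1 when a is earlier than, equal to, or later than b.
int compareTimes(const TimeStamp& a, const TimeStamp& b) {
    const auto ka = std::tie(a.year, a.month, a.day, a.hour, a.minute, a.second);
    const auto kb = std::tie(b.year, b.month, b.day, b.hour, b.minute, b.second);
    if (ka < kb) return -1;
    if (kb < ka) return 1;
    return 0;
}

// Subtract: elapsed seconds from `from` to `to` (negative if `to` is earlier).
long long secondsBetween(const TimeStamp& from, const TimeStamp& to) {
    return toSeconds(to) - toSeconds(from);
}

// Output: an assumed "YYYY-MM-DD HH:MM:SS" layout.
std::string formatTime(const TimeStamp& t) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%04d-%02d-%02d %02d:%02d:%02d",
                  t.year, t.month, t.day, t.hour, t.minute, t.second);
    return buf;
}
```

Converting both values to a single count of seconds keeps the subtraction correct across month and year boundaries, which matters when a session spans the end of a month.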
A LogCount object counts the number of actions by a particular user, known by library number and user number. It is an element of a LogCountList with knowledge of the next LogCount on the list.
The LogGen object provides general utility operations to objects of all classes within the program logs. These operations include looking up coded values, output of the common parts of the generated HTML files, and statistics calculations. There is only one instance of this object which is instantiated during construction of the main Log object, with knowledge of it passed to all the objects created which require its operations. It holds the current date and time, and the month and year for which statistics are being processed.
The LogItem provides a simple text object as an element of an exclusive list, LogExclList (see below). It holds a text string, with the possibility of holding an "extra" string.
The logs program includes several list classes, for lists of various object types. They generally provide the same functionality, any differences being indicated below. It was found necessary to have a specific list class for each object type, rather than a generic list, to cater for operations which return a list element.
LogLibList provides a list of LogLib objects.
LogUserList provides a list of LogUser objects within a library.
LogList provides a list of LogEvent objects, in chronological order.
LogJnlList provides a list of LogJnl objects.
LogIssList provides a list of LogIss objects within a journal.
LogArtList provides a list of LogArt objects within a journal issue.
LogCountList provides a list of LogCount objects which are used to count the number of actions by a particular user. This is an exclusive list where each element LogCount refers to a unique user.
LogExclList provides an exclusive list of LogItem objects. The contained text string of every LogItem on the list is unique.
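The idea of an exclusive list, as used by LogCountList, can be sketched as follows. This is an illustration only: CountNode and CountList are hypothetical names, the operations shown are assumptions rather than the public operations listed in Appendix 7.1, and std::unique_ptr is a modern convenience that the original program would not have used.

```cpp
#include <memory>

// Hypothetical stand-in for LogCount: one user's action count plus a link
// to the next element on the list.
struct CountNode {
    int library;                      // library number
    int user;                         // user number within the library
    int actions;                      // number of actions counted so far
    std::unique_ptr<CountNode> next;  // next element, or null at the end
};

// Hypothetical stand-in for LogCountList: an exclusive list, so each
// (library, user) pair appears on it at most once.
class CountList {
public:
    // Record one action: increment the existing element for this user,
    // or append a new element if the user is not yet on the list.
    void record(int library, int user) {
        std::unique_ptr<CountNode>* link = &head_;
        while (*link) {
            CountNode& node = **link;
            if (node.library == library && node.user == user) {
                ++node.actions;
                return;
            }
            link = &node.next;
        }
        link->reset(new CountNode{library, user, 1, nullptr});
        ++size_;
    }

    // Number of actions recorded for a user, or 0 if the user is absent.
    int actionsFor(int library, int user) const {
        for (const CountNode* n = head_.get(); n != nullptr; n = n->next.get()) {
            if (n->library == library && n->user == user) return n->actions;
        }
        return 0;
    }

    // Number of distinct users on the list.
    int distinctUsers() const { return size_; }

private:
    std::unique_ptr<CountNode> head_;
    int size_ = 0;
};
```

Because the list is searched before anything is appended, its length is the number of distinct users and the sum of the counts is the total number of actions.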
The list of logged events is generated in SPSS format to allow for subsequent statistics manipulation in an SPSS saved file "sjlgyymm.sav", and an SPSS portable file "sjlgyymm.por" which is supplied to other project staff. The SPSS data format specification is given in [SJMC261].
Each event is output in SPSS format by the operations of the LogEvent class. This list of events is output to an SPSS data file "sjlgyymm.dat" by the program logs. An SPSS data dictionary file is also output by logs to a file "sjlgyymm.sps".
After completion of the program logs, the controlling script dologs calls the SPSS function spssbat which generates the SPSS saved and portable files from the generated data dictionary, the generated data file, and a static SPSS data dictionary "sjloginc.sps". The SPSS data dictionary files are specified in Appendix 7.3.
The RetrievalWare search engine logs to dated log files, one per day of use, using its own format. These log files are converted into a format consistent with the main SuperJournal log files in a single file for the month by a program, rwarelogs. The format of the RetrievalWare log files and the resulting SuperJournal RetrievalWare log file entries are specified in [SJMC261]. Instructions for this pre-processing of the RetrievalWare files are given in [SJMC262].
RetrievalWare log files are converted to SuperJournal format by calling
where the RetrievalWare log file names begin with cq.
The RetrievalWare log file pre-processing program, rwarelog, is a C++ program which reads "standard input", which will be a RetrievalWare log file, and generates "standard output":
The user's machine name and IP address are not recorded by RetrievalWare. This information is deduced from the SuperJournal user identifier and the time during the main log file processing.
Compilation of this program is described in Appendix 7.5.
Before any log files are processed, invalid log file entries are removed from the main SuperJournal application log files, and listed in a separate file, one file for the month, called invallog. These are invalid registrations and invalid logins caused by such problems as users typing invalid passwords. All these log file entries will include the keyword "Invalid".
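This invalid log entry extraction is performed by a script, rminval, which uses the Unix utility grep to list the entries containing the keyword "Invalid" in invallog and to remove them from the main log files.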
The processing of the SuperJournal log files proved to be a huge overhead both in programming effort to write and maintain the log file processing program, and in the monthly processing of the actual log files because of the manual intervention required to amend the log files produced by the search engines (see below). This manual effort, which was also very tedious, increased "exponentially" in the later months of the project when users seemed to suddenly realise they could search. This log file processing overhead was obviously necessary for the SuperJournal project whose objective was journal usage evaluation. But such extensive log file processing would not be justifiable for a production electronic journal service.
Writing the log file processing program in C++, an object-oriented language, allowed the program to be written in a modular fashion, thus improving maintainability. All the coded values used within the program are declared as enumerated lists and structures within the header file logs.hxx (see Appendix 7.2). Thus when new features were added to the logging, because of enhancements to the SuperJournal application, it was usually necessary only to update these lists, structures and list limits. If recognition of new keywords in the input by the lexical analyser was required, they were added to a list in logs.lxx. New constructs in the log files also had to be added to the parser in logs.yxx.
Some manual processing of the log files was found to be necessary. This manual intervention increased the time required for processing the log files each month, and also the tedium of the task. It is possible that some of this manual editing could have been automated but the programming effort involved would have been non-trivial. Detailed instructions for this manual editing are given in [SJMC262].
It was decided to check by eye the "academic status" entered by the user when the user had selected "other", and to attempt to edit this to one of the defined categories where appropriate. There seemed to be several causes of users selecting "other" status: some users were determined to enter their exact job title; visiting researchers wished us to know that they were visitors; status is named differently from University to University, e.g. Research Associate or PostDoc, Academic, Lecturer or Faculty; some categories were missing from the list. These missing categories included: librarians (though this category was added later in the project); computer staff (generally these were classed with librarians after manual editing); technicians; secretaries and PAs.
Strictly it was not necessary to correct academic status before processing the log files, but it was felt that the statistics generated would be improved with these amendments.
Query Editing
The users' search queries in the NetAnswer log files were edited to make them more readable by removing extraneous data and by changing the NetAnswer field names to more readable ones. This pre-processing of the NetAnswer log files could have been automated using a text-processing script, but writing such a script remained a low priority.
Download Tagged Bibliographic
As described in [SJMC262], the BRS function number was corrected where a user had performed a "Download Search Results in a Tagged Bibliographic Format". These actions, which NetAnswer logs as a repeat search, were identified in the log files by eye, generally at the same time as performing the above editing of search queries. It would have been difficult to have made these corrections automatically.
Abstract SuperJournal Identifiers
When a user selected "View Header/Abstract" after a NetAnswer search, the abstract identifier logged was the document number in the underlying BRS database. To convert these BRS document numbers into the SuperJournal article identifiers essential for the statistics generation it was necessary to look them up using the basic BRS search tool and then determine the SuperJournal identifier from the journal issue catalog file (see [SJMC140]). It may have been possible to automate this correction process but it would have required a further programming effort. When the decision was made to make these corrections manually the amount of NetAnswer searching by users was small. This manual editing became a significant overhead only in the last few months of the project.
When a user retrieves either an abstract or an article following a RetrievalWare search it is necessary to convert the logged RetrievalWare document number into the corresponding SuperJournal article identifier or file path name.
RetrievalWare Document Catalogues
RetrievalWare document catalogues must be created to allow for document number look-up. These catalogues are created by performing sets of simple searches typed into a utility written "in house" which uses the RetrievalWare API. It is not possible to automate this process. To accommodate newly loaded articles these catalogues must be recreated each month. Setting up these catalogues each month involved about an hour's tedious effort.
RetrievalWare Document Number Look Up
Following creation of the RetrievalWare document catalogues, the document numbers were edited manually into the RetrievalWare log files. It would have been possible to write a program to automate this process, but RetrievalWare log processing was introduced at a late stage of the project when it was thought that the programming effort required would not be worthwhile.
To include machine location codes in the statistics, it is necessary to edit machine addresses into a header file of the log file processing program, logs, recompile and reinstall the program, and then re-run the log file processing. Identifying the machine locations was by "educated guesswork", and it was done for only some months. It may have been possible to perform this machine identification by a more automated, or a "one-off", process if machine IP location information had been requested from the librarians. But it is likely that manual inspection would still have been necessary for off-site location identification.
The masking of journal issue numbers as consecutive numbers within a year assumes that journal issues are loaded in date order. If journal issues are loaded in an erratic fashion, the algorithm which deduces the issue number may not give the same number for the same issue from one month to the next. However, this problem has probably been overcome by the fact that all the log files for all months since logging began were reprocessed several times during the later stages of the project by which time the journal catalogue should have contained most missing issues.
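The instability described above can be illustrated with a minimal sketch. It assumes one plausible reading of the masking rule, namely that an issue's masked number is its position among the issues of the same year in the catalogue when the catalogue is read in date order. The names LoadedIssue and maskIssues and the issue labels are hypothetical, and the actual algorithm in logs may differ.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical catalogue entry: the publication year and the real issue label.
struct LoadedIssue {
    int year;
    std::string label;
};

typedef std::pair<int, std::string> IssueKey;

// Assign each issue a masked number: 1, 2, 3, ... within its year, in the
// order in which the issues appear in the catalogue (assumed to be date order).
std::map<IssueKey, int> maskIssues(const std::vector<LoadedIssue>& catalogue) {
    std::map<int, int> countInYear;   // issues seen so far in each year
    std::map<IssueKey, int> masked;   // (year, label) -> masked issue number
    for (const LoadedIssue& issue : catalogue) {
        const IssueKey key(issue.year, issue.label);
        if (masked.count(key) == 0) {
            masked[key] = ++countInYear[issue.year];
        }
    }
    return masked;
}
```

Under this assumption, if the catalogue holds only issues "1" and "3" of a year in one month, issue "3" is masked as 2. Once the missing issue "2" has been loaded, the same issue is masked as 3, which is why reprocessing all months against a complete catalogue removes the inconsistency.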
Some parts of the SuperJournal application, such as the NetAnswer search engine and many low-level browsing actions, do not log the user's SuperJournal identifier because this is unavailable at the point where logging occurs. At the end of reading in all the log files, user identifiers are deduced for events in which they were not logged from the previous event in the chronological list of events with the same IP address. This deduction may not be correct if a user is accessing via a commercial ISP where the IP address is allocated dynamically and may thus change during the user's session, but this is unlikely for academic users from the University libraries involved in the project. This occurrence would be reported as a fault during log file processing. In fact, only a couple of instances of this occurrence were seen, and these were not University library users. Also this algorithm may make incorrect deductions where several users are accessing SuperJournal at the same time from a University which is sending a University-wide cache IP address, but the chance of this occurring is small, and these cache addresses were seen only at later stages of the project.
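The deduction of user identifiers described above can be sketched as a single pass over the chronological event list. This is a simplified illustration, not the code in logs: the Event structure and the function deduceUsers are hypothetical, an empty string stands for an identifier that was not logged, and the fault reporting is reduced to a count of unresolved events.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, much reduced stand-in for a logged event.
struct Event {
    std::string ipAddress;
    std::string userId;  // empty when the identifier was not logged
};

// Walk the events in chronological order. An event without a user identifier
// takes the identifier of the most recent earlier event from the same IP
// address. Returns the number of events that could not be resolved.
int deduceUsers(std::vector<Event>& events) {
    std::map<std::string, std::string> lastUserAtIp;
    int unresolved = 0;
    for (Event& event : events) {
        if (!event.userId.empty()) {
            lastUserAtIp[event.ipAddress] = event.userId;  // remember who was here
            continue;
        }
        const auto found = lastUserAtIp.find(event.ipAddress);
        if (found != lastUserAtIp.end()) {
            event.userId = found->second;  // deduced from the previous event
        } else {
            ++unresolved;  // no earlier event from this address
        }
    }
    return unresolved;
}
```

The sketch also makes the stated limitations visible: two users sharing one cache address would overwrite each other's entry in the map, and a user whose address changes mid-session would leave events unresolved.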
The implementation of the SuperJournal application was such that integration of some parts was rather loose. This meant that, for instance, it would have been possible to return another day to a NetAnswer search screen without relogging into SuperJournal, or to save an HTML article and then follow low-level browsing links at a later time. Any activity of this type would show as an error report during log file processing, because the user did not login before performing the action. However, in reality, the incidence of such logged activity was negligible especially once testing and demonstrating activity by project staff was excluded. It is unlikely that electronic journal users have "computer hacker" mentalities.
Last modified: July 06, 1999