GoldenGATE usability

Summary

  • The functions used to add and edit mark-up manually are generally well implemented, and the most commonly required functions can be invoked easily.
  • Some functions require the user to interact with a list of candidate items for editing. The interface supporting these interactions could be improved. (See Modality, below).
  • The ability to build user defined functions and then to chain functions together in named 'pipelines' is powerful, especially where a user is working with a large number of original documents which use a consistent mark-up convention, allowing reuse. The interface for this functionality is good, though the documentation will need to keep pace with development, and is currently incomplete in some areas.
  • Performance issues affect the loading and display of large documents (> approx. 200kB). These files must be split up into smaller sections in order to work on them. The speed issues may reflect the limitations of the JTextPane component in the Java Swing library.
  • GoldenGATE's 'look and feel' can be intimidating at first sight, and would benefit from some reorganisation and simplification. In some cases there is more than one way to trigger an action, each with its own menu item, and this might benefit from streamlining.
  • The current documentation recommends that all formatting data is removed from the file at or before loading into GoldenGATE. Depending on the presence and quality of any formatting mark-up in the input file, preserving this mark-up may produce better results than the currently recommended procedure, which appears to be based around the processing of unstyled OCR input files.
  • XML tags appear appropriately indented in the editing window, but the text contained within the tags does not. It is easier to work with files if the text content is at the same level of indentation as the parent element.

Specific usability issues have been noted as follows:

User interface modality

The following comments refer to more complex functions which require user interaction after they have been invoked. Examples include the spell-checker, the FAT function, and others.

Ideally, in terms of interaction, the GG user interface would behave more like familiar applications such as Word.

In particular, the use of modal windows for interaction within functions can cause problems. This type of window does not allow the user to interact with the file being worked on until the user input is finished and the dialogue window is closed. This makes it difficult or impossible to check the context for decisions about the text being worked on. In some cases it makes other applications the more sensible choice for certain processes (e.g. spell checking).

One improvement would be to include more surrounding context where tokens are shown in the dialogue window for checking or correction. (The Edit/Slide Annotations function, which displays each instance of a specified annotation type singly and with surrounding text, already uses this approach. I would like to see this extended to address the modality issue.)

Another possibility would be to remove the modal dialogue box completely and work within the main document, moving from example to example as changes are approved.

Another example of modality is the differing behaviours available to users depending on whether the mark-up of an element is shown, or if just the content is highlighted. In highlight mode, elements can be removed, but tokens cannot be excluded from or included in an element, even where there is no ambiguity about which element is being referred to.

Non-obvious addition of annotation

The addition of elements to the text results in element names being added to the right-hand column, where their display can optionally be turned on. For new users, the effects of their actions may appear to be invisible or absent unless they notice the new element names appearing in the list. This may be confusing, and may also prompt multiple applications of functions. I would suggest a user preference option to allow newly added tags to be shown.

Guido tells me that a user preference for this exists, which turns on highlighting of new elements when these are added.

User error - Loss of content and mark-up when saving

On one occasion I saved a document to a file and later found it to consist of only the text content with no mark-up at all. This was probably caused by having the 'save as' output set to 'text'. If this preference is preserved during a session, perhaps the impending loss of mark-up should be brought to the attention of the user? Maybe not, as it is a case of user error, but it might be more friendly.

Update: Guido reports that this is not so much of an issue now, as files are saved with different file extensions depending on the output type selected. However, mistakenly saving as 'text' can still result in lost mark-up if the original file is then worked on with the assumption that the saved-as file contains the full content of the file. This is also true when saving only selected elements, so users need to be aware of this.

Speed issues

Importing a large file (~1MB, 120 pages in Word) into GG takes 2 to 3 minutes, as does each subsequent screen redraw, which occurs after most edits and any window resize. (This is on a PC with a 3.2GHz Xeon processor and 1GB RAM, a fast machine.)

From a user perspective, this is too slow to enable work on long documents (> approx. 200kB), unless a job can be largely or completely automated as a batch.

There may need to be an upper limit placed on the size of the input file, though how to specify this may be problematic, as the processing overhead may be more directly related to the size and complexity of the document tree than to absolute file size per se.

GG has the capability to handle files as 'parts', i.e. split them into multiple files for individual editing and later rejoining.

The user specifies an element to use as the unit for splitting, e.g. section. This goes a significant way to reducing the speed problem. The issue still remains that a suitable set of elements must exist in the file or be added to it if the document is to be split into an appropriate number of parts and in the appropriate locations, which may still require edits to the large file in GG, unless some other application is used.
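The idea of splitting on a chosen element can be illustrated outside GG with a minimal Python sketch. This is not GG's own implementation; the function name split_into_parts and the sample document are hypothetical, and the sketch assumes the chosen elements are not nested inside one another.

```python
import xml.etree.ElementTree as ET

def split_into_parts(xml_text, unit_tag):
    """Return one serialised XML string per <unit_tag> element, in document order.

    Assumes the unit elements are not nested; any content that sits outside
    them is not captured in any part.
    """
    root = ET.fromstring(xml_text)
    return [ET.tostring(el, encoding="unicode") for el in root.iter(unit_tag)]

# Hypothetical sample document with two 'section' elements.
sample = (
    "<document>"
    "<section><p>Part one</p></section>"
    "<section><p>Part two</p></section>"
    "</document>"
)

parts = split_into_parts(sample, "section")
for number, part in enumerate(parts, start=1):
    print(number, part)
```

The sketch also shows the limitation noted above: content which is not inside one of the chosen elements ends up in no part at all, so suitable elements have to exist before the split.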

I have extracted just the main section of the file (about 30 pages in the Word document) and I have exported this section as filtered HTML. This has reduced the 'refresh' lag to 10 seconds or so (on this fast PC), which is still not ideal, but is workable. It would probably be best to break the document down still further into species chunks. I will give this a go in GG, as it allows me to export to document 'parts'.

The use of predefined tools and pipelines reduces the impact of screen redraws for scriptable edits, as multiple edits are carried out automatically in a batch. Human driven edits are still affected, however.

Guido tells me that one option would be to render only the visible part of the document, though there is no immediate plan to implement this. Otherwise, a move to a faster GUI component might be of use. (GG currently uses JTextPane.) The availability of a replacement component is unclear. I will leave this issue to wiser heads than mine.

Increasing the Java heap size by editing the -Xms and -Xmx arguments in GoldenGATE.bat did not resolve the problem. The process appears to be CPU limited, not memory limited.

For now, large documents are probably best worked on by saving them into several chunks in another application, and working on them individually. This may to some extent reduce the accuracy of the automatic detection and mark-up of taxon names using FAT.

Text encoding issues for XML files

Files saved as XML appear to be ANSI encoded, and have no encoding specified. Undeclared encodings are usually treated as UTF-8 in XML reading tools.

Opening UTF-8 encoded XML files in testing caused certain characters to display incorrectly. It is unclear if this is just a display issue, or if characters are handled incorrectly at other levels in the GG processing. This is a particular issue with texts using non-Latin characters, and author and place names containing non-ANSI characters.
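The kind of mismatch described here can be reproduced with a short Python sketch. It assumes that 'ANSI' means Windows-1252 (cp1252), which is the usual meaning on Western European Windows systems but has not been confirmed for GG. The helper declared_encoding and the sample name are hypothetical.

```python
import re

def declared_encoding(raw: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = re.match(rb'\s*<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    return match.group(1).decode("ascii") if match else None

# A name containing a non-ASCII character, saved as UTF-8 with no declaration.
utf8_bytes = "<author>Müller</author>".encode("utf-8")

# Reading those bytes as cp1252 (assumed 'ANSI') garbles the character.
misread = utf8_bytes.decode("cp1252")

print(declared_encoding(utf8_bytes))  # no declaration present
print(misread)
```

A file which declares its encoding in the XML declaration, and is actually written in that encoding, avoids the ambiguity for any tool that honours the declaration.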

More detail.

Function related usability questions

Attribute Taxon Names function

The Find Taxon Names tool in GG may miss some names, and these are then annotated manually. Rather than requiring the user to add the required attributes to the taxonName elements manually, a time consuming and repetitive process, a function has been provided which analyses the document content which has been annotated with <taxonName> tags and attempts to populate the element attributes automatically.

The user is presented with a UI panel listing the content of all of the taxonName elements, and the results of parsing these into attribute values.

This parsing process appears to be significantly less reliable than the results of the FAT tool. For example, FAT usually handles genus name expansion in taxonName attributes properly, so a name string like S. earlei in the Simuliidae file will be marked up as follows:

<taxonomicName _evidence="knownData" genus="Simulium" genus.bestMatchDistance="86" genus.bestMatchVote="2" genus.innerRound="1" genus.outerRound="1" rank="species" species="earlei">
S. earlei
  </taxonomicName>

Note the expanded genus attribute.

In testing, the Attribute Taxon Names function instead offered a genus value of S, which must be corrected using a drop-down selection, once for each element needing correction. This is time consuming and prone to error.
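As an illustration of the behaviour I would expect, here is a minimal Python sketch of abbreviation expansion. It is not how FAT or the Attribute Taxon Names function works internally, which I have not examined. The function expand_genus and the list of known genera are hypothetical, and the sketch simply assumes that a list of full genus names already found in the document is available.

```python
def expand_genus(name, known_genera):
    """Expand an abbreviated genus such as 'S.' using genera already seen.

    Returns the full genus if exactly one known genus matches the
    abbreviation, the first word unchanged if it is not abbreviated,
    and None if the abbreviation is unknown or ambiguous.
    """
    first_word = name.split()[0]
    if not first_word.endswith("."):
        return first_word  # already a full genus name
    prefix = first_word[:-1]
    candidates = [genus for genus in known_genera if genus.startswith(prefix)]
    return candidates[0] if len(candidates) == 1 else None

# 'Simulium' is taken from the example above; the second list adds another
# name beginning with S purely to show the ambiguous case.
print(expand_genus("S. earlei", ["Simulium"]))
print(expand_genus("S. earlei", ["Simulium", "Stegopterna"]))
```

Returning nothing for an ambiguous abbreviation, rather than guessing, would leave only the genuinely uncertain cases for the user to resolve by hand.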

There is an option in the UI to select any of the displayed elements for tag removal, but there is no option to ignore an element. Removing the tags may be the best way to work, as FAT can then be rerun. This combination works, but requires a repeat of the post-FAT manual checking stage which has already been done, and introduces opportunities for error.

My suggestions would be:

  • make the current state of each element clearer, showing text and markup, with the option to merge attributes where these have been wrongly parsed
  • allow elements to be ignored (i.e. exclude them from any changes)
  • allow full editing of attributes and text in the UI
  • allow the option to correct all instances of elements with identical attributes/text content.
  • provide a version of this function which only deals with elements without attributes, avoiding regression in the correctness of previously annotated and corrected elements.
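The last two suggestions can be sketched in Python as follows. This is only an illustration of the intended behaviour, not a description of GG's code: the functions unattributed and group_by_text and the sample document are hypothetical, and the attribute values are the ones from the S. earlei example above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def unattributed(root, tag="taxonName"):
    """Return the elements with the given tag that carry no attributes yet."""
    return [el for el in root.iter(tag) if not el.attrib]

def group_by_text(elements):
    """Group elements by their whitespace-normalised text content."""
    groups = defaultdict(list)
    for el in elements:
        key = " ".join("".join(el.itertext()).split())
        groups[key].append(el)
    return dict(groups)

# Hypothetical document: one name already attributed, two added manually.
sample = """<treatment>
  <taxonName genus="Simulium" species="earlei" rank="species">S. earlei</taxonName>
  <p>Text mentioning <taxonName>S. earlei</taxonName> and, later,
  <taxonName>S. earlei</taxonName> again.</p>
</treatment>"""

root = ET.fromstring(sample)
groups = group_by_text(unattributed(root))

# One correction, entered once, is applied to every identical instance.
for el in groups["S. earlei"]:
    el.attrib.update({"genus": "Simulium", "species": "earlei", "rank": "species"})
```

Because only elements without attributes are selected, the element which was already annotated and checked is never touched.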

LSID addition

In the section of the GG manual 'Workflow to generate a valid TaxonX XML document up to Level 1' (pg7), the Get LSIDs for Taxa custom function is used.

Page numbers

For reference: The TaxonX discussion regarding the handling of page numbers (see the Pages/Page Breaks section) says "Page numbers are part of the minimal information requested to stay with the treatments in traditional publications." A page break element in the form <pb n="373" url="http://foo/bar.html"/> has been added to the TaxonX schema.

Should such an element be required as part of the Web Revisions data, this will need to be borne in mind, particularly as the input document will not necessarily be a scanned/OCRed version of the document as published. In this case, should the published page information be needed, a means of adding the <pb/> tags with their associated attributes will be needed.
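One possible means of adding such tags outside GG is sketched below in Python. The element name pb and the n and url attributes come from the form quoted above. The helper make_pb, the sample document and the choice of insertion point are hypothetical, and knowing where each published page begins would still require information from the printed original.

```python
import xml.etree.ElementTree as ET

def make_pb(page_number, url=None):
    """Build a <pb/> element with the n attribute and, optionally, url."""
    attributes = {"n": str(page_number)}
    if url is not None:
        attributes["url"] = url
    return ET.Element("pb", attributes)

# Hypothetical treatment with two paragraphs; page 373 starts before the second.
root = ET.fromstring("<treatment><p>First</p><p>Second</p></treatment>")
root.insert(1, make_pb(373, "http://foo/bar.html"))

print(ET.tostring(root, encoding="unicode"))
```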

Spell checking UI

Moving the file to my Mac allowed me to run the spellcheck. The UI is awkward: it takes a modal approach (you are either in 'spellcheck-mode' or not, with no ability to edit the file manually or scroll the file to check on something), and there is no highlighting of the text being corrected. This makes it difficult to be sure exactly which instance of a word is being corrected at any particular time, or even to find it at all.

I gave up on GG's spellchecker very quickly. Until this aspect of the UI is improved, it is probably worth carrying out this function in an external text editor, and re-importing the corrected file.

Scratchpads developed and conceived by: Vince Smith, Simon Rycroft & Dave Roberts