Revision of Taxonomic name identification in GoldenGATE from Thu, 2007-04-19 10:50

This page is part of the GoldenGATE evaluation for EDIT WP6.

GG uses an integrated tool, FAT (Find all taxon names), to identify taxonomic names.

FAT uses a combined approach, employing several stages of processing and a variety of pattern recognition, named entity recognition, rule-based, statistical, and list-based techniques. The results appear to be at least as good as, and often better than, those of other available tools, with a few exceptions under certain circumstances (see below).

Confirmed taxon names (and taxon name components) are added to a set of data files, stored locally, providing a learning capability. As a result, the output improves in quality as more files are processed and more names are identified.

Invoking FAT in GG

FAT is most easily called by clicking on the 'Run FAT' button in the Custom Functions section of the UI. It is also available via Analyzers/Run/FAT.analyzer.

FAT then processes the file and asks the user to select queried strings. This input is then used to modify the results, and the relevant strings in the document are annotated.

The user then checks the results and makes the following corrections:

  • remove annotations around incorrectly identified names (using Edit/Select Annotation)
  • add taxonName annotations to any missed names
  • correct any partially marked-up names
  • run Markup Converters/Remove duplicate annotations
  • add the appropriate attributes using the Custom Function Attribute Taxon Names

Taxonomic names in the editor

Once FAT has identified a taxonomic name, GG adds a <taxonomicName> annotation, with a set of attributes containing the extracted information. E.g.:

<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="mexicanum" subGenus="Hemicnetha">
Simulium (Hemicnetha) mexicanum Bellardi
</taxonomicName>

The attributes are self-explanatory, with the exception of rankWeight. I will add an explanation of this attribute when I have found out what it means.
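
As a rough illustration of how these attributes could be read outside GG, the following Python sketch parses the example annotation above with the standard library. It assumes the annotation has been exported as well-formed XML. The helper name read_taxon_attributes is hypothetical and is not part of GoldenGATE.

```python
import xml.etree.ElementTree as ET

# Example annotation copied from the GG editor output shown above.
SAMPLE = """<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="mexicanum" subGenus="Hemicnetha">
Simulium (Hemicnetha) mexicanum Bellardi
</taxonomicName>"""


def read_taxon_attributes(xml_text):
    """Return (attributes, annotated text) for one taxonomicName element.

    Hypothetical helper for illustration only; not a GoldenGATE API.
    """
    element = ET.fromstring(xml_text)
    if element.tag != "taxonomicName":
        raise ValueError("expected a taxonomicName element")
    # element.text holds the annotated string, padded with newlines here.
    return dict(element.attrib), (element.text or "").strip()


attributes, text = read_taxon_attributes(SAMPLE)
print(attributes["genus"], attributes["subGenus"], attributes["species"])
print(text)
```

Note that all attribute values, including rankWeight, are returned as strings.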

Possible errors

False positives

On occasion, author or place names can be falsely identified as Genus names. E.g.:

...In 1914 <taxonomicName genus="Malloch" rank="genus" rankWeight="1">
Malloch
</taxonomicName>
 provided a redescription of the male...

or

 ...would be deposited in the USNM, but there is no record of this and it is presumably lost (Dr F. <taxonomicName _evidence="lexicon" genus="C" rank="subGenus" subGenus="Thompson">
C. Thompson
</taxonomicName>
, personal communication).
</paragraph>

Correction method: Edit/Select Annotation/taxonName gives a panel which lists all of the relevant annotated strings, together with checkboxes allowing the removal of annotations as necessary. Alternatively, individual annotations can be cleared with a right-click/Remove or Remove All.

Making this correction apparently does not remove the incorrect entry from the Genus name database, should it be present there. This entry may need to be removed to avoid the accumulation of errors, unless it is an ambiguous case.

False negatives

Unrecognised taxon names may be caused by a number of factors such as typographic errors (including errors in punctuation), and the absence of the name in the lists used by FAT.

Missed names are annotated manually, and have attributes added via the Attribute Taxon Names analyzer. (Usability notes to be added).

The GG instructions advise the following:

It proves extremely advantageous to have very clean and spell checked OCR-output, since artefacts have an impact on the mark-up process, and later corrections are very tedious. This has especially an effect on the recognition of names.


The above is very important. A thorough check for errors before the start of the process (including, for example, missing full stops before genus name abbreviations) and at strategic points during the process will make a significant difference to the efficiency with which the output can be generated.

Partial errors

Multiple author names

FAT can be inconsistent in the inclusion of multiple author names within the taxonomicName tags. E.g.:

<paragraph>
<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="earlei" subGenus="Hemicnetha">
Simulium (Hemicnetha) earlei Vargas
</taxonomicName>
, Martínez Palacios & Díaz-Nájera
</paragraph>

Correction method: Select the text to be included and also all or part of the relevant closing tag. Right-click and choose 'Include Tokens'. (This can also presumably be automated through the document. To check).

Factors affecting the quality of FAT results

Input file size and content

The FAT paper shows the results of processing individual short files compared with concatenating the same files and processing the resulting larger text. Larger input documents increase the number of 'sure positives', and the larger the number of sure positives, the larger the amount of information available to the rule-based analysis and word-level classification of the input text.

In practice, this will not necessarily result in worse final mark-up quality for short documents, because queried strings are displayed to the user for manual classification. The number of strings presented in this way would be expected to decrease proportionally (though not necessarily in absolute terms) as document length increases. In other words, working on long files or files which are processed as a 'corpus batch' (see below) will usually result in fewer user interventions per name annotated.

Language dependency

The accuracy of the name detection process depends in part on the existence of statistically distinct N-Gram distributions for taxonomic names and the language in which the document is written. FAT currently uses data derived from the analyses of common English to derive its 'base language' N-Gram data. Literature written (or containing quotations) in languages other than English may have a higher rate of false positives.

The FAT paper gives as examples Latin and Italian, which have similar N-Gram characteristics to taxon names. German documents are also more difficult to process reliably, in that their use of capitalisation for nouns makes these words prone to being incorrectly recognised as Genus or Subgenus names.

Increasing the range of languages supported is listed as one of FAT's priorities for further development.
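
To make the N-Gram idea above more concrete, here is a minimal Python sketch of a character trigram comparison. It is not FAT's implementation, which I have not seen. The class NGramModel, the function looks_like_taxon, the add-one smoothing and the averaging per trigram are all assumptions made for illustration, and the two word lists are toy samples taken from this page, far too small for real use.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Yield the character n-grams of a word, padded with boundary marks."""
    padded = "^" * (n - 1) + word.lower() + "$"
    for i in range(len(padded) - n + 1):
        yield padded[i:i + n]


class NGramModel:
    """Add-one smoothed character n-gram model (illustrative only)."""

    def __init__(self, words, n=3):
        self.n = n
        self.counts = Counter(g for w in words for g in char_ngrams(w, n))
        self.total = sum(self.counts.values())
        # Smoothing vocabulary: observed n-grams plus one slot for unseen ones.
        self.vocab = len(self.counts) + 1

    def log_prob(self, word):
        """Average log-probability per n-gram, so word length matters less."""
        grams = list(char_ngrams(word, self.n))
        score = sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in grams
        )
        return score / len(grams)


# Toy training lists drawn from words on this page (an assumption, not FAT data).
TAXON_WORDS = ["Simulium", "Hemicnetha", "mexicanum", "Pheidole",
               "pallidula", "cornutula", "earlei"]
ENGLISH_WORDS = ["provided", "redescription", "record", "deposited",
                 "presumably", "communication", "there", "would"]

taxon_model = NGramModel(TAXON_WORDS)
english_model = NGramModel(ENGLISH_WORDS)


def looks_like_taxon(word):
    """True if the word scores higher under the taxon model than the English one."""
    return taxon_model.log_prob(word) > english_model.log_prob(word)


for word in ["Simulium", "record"]:
    print(word, looks_like_taxon(word))
```

The sketch also shows why the base language matters: if the base language's trigrams resemble those of taxon names, as the FAT paper reports for Latin and Italian, the two scores move closer together and false positives become more likely.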

How FAT handles name completion

I quote from page 8 (nominally page 53) of the FAT paper:
Name Completion
Making use of the scientists’ names, we also extract taxonomic names that lack the genus, e.g., from enumerations, such as Pheidole pallidula, orbula, xantra. In addition, the rules allow genus abbreviations like Ph. for Pheidole in Ph. cornutula. In order to determine the meaning of a taxonomic name, we need to complete the names with their full parts.

If the genus part is missing, we have two options: First, we check if the species part appears elsewhere in the document, together with the genus it belongs to. If this is not the case, we use the last genus that we have extracted before the position of the name to complete. This is useful especially in case of enumerations: If several species of the same genus are enumerated, the genus is often given only with the first one. We then transfer the genus part to the subsequent taxon names.

If the genus is abbreviated, we also have two options: First, we again check if the species part appears elsewhere in the document, together with the full name of the genus it belongs to. If this fails, we check if we have recognized any genus name that starts with the given abbreviation. If there is exactly one such genus name, we insert it. If there is more than one, i.e., the abbreviation is ambiguous, we use the one which appears closest before the abbreviation.
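
The completion rules quoted above can be sketched in Python as follows. This is my reading of the quoted text, not FAT's code. The Name record and the function names are hypothetical. The sketch assumes that an abbreviation is matched against genera recognised anywhere in the document, and that an ambiguous abbreviation with no candidate genus before it is left unresolved, since the paper does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Name:
    """One extracted name, in document order; genus may be None or abbreviated."""
    genus: Optional[str]
    species: str


def is_full(genus):
    """A genus counts as full if it is present and not an abbreviation like 'Ph.'."""
    return genus is not None and not genus.endswith(".")


def genus_for_species(names, species):
    """First option in both cases: a full genus seen with this species anywhere."""
    for name in names:
        if name.species == species and is_full(name.genus):
            return name.genus
    return None


def complete_genus(names: List[Name], index: int) -> Optional[str]:
    """Return the completed genus for names[index], or None if unresolved."""
    target = names[index]
    if is_full(target.genus):
        return target.genus

    found = genus_for_species(names, target.species)
    if found:
        return found

    # Full genera extracted before this position, nearest first.
    preceding = [n.genus for n in reversed(names[:index]) if is_full(n.genus)]

    if target.genus is None:
        # Missing genus: fall back to the last genus extracted before it.
        return preceding[0] if preceding else None

    # Abbreviated genus: match recognised genera against the abbreviation.
    prefix = target.genus.rstrip(".")
    candidates = {n.genus for n in names
                  if is_full(n.genus) and n.genus.startswith(prefix)}
    if len(candidates) == 1:
        return next(iter(candidates))
    # Ambiguous: take the candidate appearing closest before the abbreviation.
    for genus in preceding:
        if genus in candidates:
            return genus
    return None


# The enumeration and abbreviation examples from the quoted passage.
names = [Name("Pheidole", "pallidula"), Name(None, "orbula"),
         Name(None, "xantra"), Name("Ph.", "cornutula")]
for i, name in enumerate(names):
    print(complete_genus(names, i), name.species)
```

With the examples from the quoted passage, every name in the list is completed to Pheidole.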

Thoughts

  • Currently, GG users work in isolation, storing learned taxonomic name data locally. One or more centralised taxon name data stores accessed and updated via the internet would allow rapid global improvement in the performance of FAT for multiple users in multiple locations simultaneously.
    • Guido has indicated that the use of and addition to external lexicons is already available for ant names.
    • The problem of bad data being added to common data files would be an issue, as all users would be affected by a single user's mistake.
    • A way to mitigate this might be to use moderation before items are added to a pool, or perhaps to require some suitable number of submissions from different users before new content is added or sent for moderation.
  • These data pools could become an input into a taxonomic thesaurus for use in other contexts. (This is apparently already being looked at or in progress).
  • As system learning progresses, we can return to previously processed documents and re-analyse them to take advantage of the improved information available. This could catch some false negatives which may have been missed on the first pass. We could even use full text searching on newly added names to identify documents which contain these previously unfound names for repeat processing.
  • Guido and I have collaboratively outlined a preferable approach. A corpusBatch module (a more complex version of the current Batch capabilities of GG) would accept multiple files as input and process them internally as a single file. This would allow the FAT output to benefit from the quality improvements which are associated with increasing document size even for a set of small documents. If implemented, this function may be affected by the interface speed issues which become a problem with documents over about 200KB. Guido is already working on ways of minimising these issues.
  • A couple of more speculative thoughts
    • If we were to include, within the collected taxon name data, GUIDs for the document(s) within which they were found, and ideally also the location(s) within these document(s), an inverted index of taxon names and their locations could be created in a single step. This would effectively provide the basis for a name search engine.
    • Knowing the order, proximity and pattern of names within documents might also be useful in inferring the taxonomic relationship between the names in an automatic or semi-automatic way, particularly if this was combined with full text search functionality which could identify words commonly used to designate taxonomic relationships and document sections.
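
The inverted index suggested in the speculative thoughts above could be sketched in Python as follows. The record layout (name, document identifier, offset) and the function name build_inverted_index are assumptions made for this illustration, and the identifiers and offsets are placeholders rather than real data.

```python
from collections import defaultdict


def build_inverted_index(records):
    """Map each taxon name to the (document id, offset) pairs where it occurs.

    `records` is an iterable of (name, document_id, offset) tuples; this
    layout is an assumption made for the sketch.
    """
    index = defaultdict(list)
    for name, document_id, offset in records:
        index[name].append((document_id, offset))
    # Sort postings so lookups return locations in a stable order.
    return {name: sorted(postings) for name, postings in index.items()}


# Placeholder identifiers and offsets, not real data.
records = [
    ("Simulium mexicanum", "doc-0001", 120),
    ("Simulium earlei", "doc-0001", 450),
    ("Simulium mexicanum", "doc-0002", 37),
]
index = build_inverted_index(records)
print(index["Simulium mexicanum"])
```

Looking up a name then returns every document and position where it was annotated, which is the basic operation a name search engine would need.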