Revision of Taxonomic name identification in GoldenGATE from Tue, 2007-04-17 13:48

Identifying Taxonomic Names

FAT (Find all taxon names)

GoldenGATE (GG) uses an integrated tool, FAT, to identify taxonomic names.

FAT combines several processing stages and a variety of techniques: pattern recognition, named entity recognition, and rule-based, statistical, and list-based methods. The results appear to be at least as good as, and often better than, those of other available tools, with a few exceptions under certain circumstances (see below).

Confirmed taxon names (and taxon name components) are added to a set of data files, stored locally, providing a learning capability. As a result, the output improves in quality as more files are processed and more names are identified.

How FAT handles name completion

I quote from page 8 (nominally page 53) of the FAT paper:

Name Completion
Making use of the scientists’ names, we also extract taxonomic names that lack the genus, e.g., from enumerations, such as Pheidole pallidula, orbula, xantra. In addition, the rules allow genus abbreviations like Ph. for Pheidole in Ph. cornutula. In order to determine the meaning of a taxonomic name, we need to complete the names with their full parts.

If the genus part is missing, we have two options: First, we check if the species part appears elsewhere in the document, together with the genus it belongs to. If this is not the case, we use the last genus that we have extracted before the position of the name to complete. This is useful especially in case of enumerations: If several species of the same genus are enumerated, the genus is often given only with the first one. We then transfer the genus part to the subsequent taxon names.

If the genus is abbreviated, we also have two options: First, we again check if the species part appears elsewhere in the document, together with the full name of the genus it belongs to. If this fails, we check if we have recognized any genus name that starts with the given abbreviation. If there is exactly one such genus name, we insert it. If there is more than one, i.e., the abbreviation is ambiguous, we use the one which appears closest before the abbreviation.
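The completion rules quoted above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not FAT's actual code: the function name complete_names, the (genus, species) tuple representation, and the convention that an abbreviated genus ends in a full stop are all assumptions made for this example. Where the quoted text is silent (for instance, when no candidate genus precedes an ambiguous abbreviation), the sketch simply leaves the name unresolved.

```python
def is_abbreviation(genus):
    """Assumption: an abbreviated genus is written with a trailing full stop."""
    return genus is not None and genus.endswith(".")


def complete_names(mentions):
    """Complete missing or abbreviated genus parts.

    mentions: list of (genus, species) tuples in document order, where genus
    is a full name, an abbreviation such as "Ph.", or None if missing.
    """
    full = [(g, s) for g, s in mentions if g and not is_abbreviation(g)]
    # Species seen anywhere in the document together with a full genus.
    species_to_genus = {}
    for g, s in full:
        species_to_genus.setdefault(s, g)
    all_genera = {g for g, _ in full}

    completed = []
    seen = []  # full genera that appeared before the current position
    for genus, species in mentions:
        if genus and not is_abbreviation(genus):
            seen.append(genus)
            completed.append((genus, species))
            continue
        stem = genus[:-1] if genus else ""
        resolved = None
        known = species_to_genus.get(species)
        if known and known.startswith(stem):
            # Option 1: the species appears elsewhere with its genus.
            resolved = known
        elif genus is None:
            # Missing genus: fall back to the last genus seen before this name.
            resolved = seen[-1] if seen else None
        else:
            # Abbreviated genus: look for recognised genera with this prefix.
            candidates = {g for g in all_genera if g.startswith(stem)}
            if len(candidates) == 1:
                resolved = next(iter(candidates))
            elif candidates:
                # Ambiguous: take the candidate closest before the abbreviation.
                resolved = next((g for g in reversed(seen) if g in candidates), None)
        completed.append((resolved or genus, species))
    return completed


# The enumeration and abbreviation examples from the quoted passage.
mentions = [("Pheidole", "pallidula"), (None, "orbula"),
            (None, "xantra"), ("Ph.", "cornutula")]
print(complete_names(mentions))
# [('Pheidole', 'pallidula'), ('Pheidole', 'orbula'), ('Pheidole', 'xantra'), ('Pheidole', 'cornutula')]
```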

Taxonomic names in the editor

Once FAT has identified a taxonomic name, GG adds a <taxonomicName> annotation, with a set of attributes containing the extracted information. E.g.:

<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="mexicanum" subGenus="Hemicnetha">
Simulium (Hemicnetha) mexicanum Bellardi
</taxonomicName>

The attributes are self-explanatory, with the exception of rankWeight. I will add an explanation of this attribute when I have found out what it means.
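Because the annotation is ordinary XML, its attributes can be read with any XML parser once a document has been saved. The following is a minimal sketch using Python's standard library; it assumes the annotated text is available as well-formed XML, and the helper name read_taxon is invented for this example.

```python
import xml.etree.ElementTree as ET

# The example annotation shown above, as a string.
snippet = """<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="mexicanum" subGenus="Hemicnetha">
Simulium (Hemicnetha) mexicanum Bellardi
</taxonomicName>"""


def read_taxon(xml_text):
    """Return the annotation's attributes plus its whitespace-normalised text."""
    elem = ET.fromstring(xml_text)
    record = dict(elem.attrib)
    record["text"] = " ".join((elem.text or "").split())
    return record


record = read_taxon(snippet)
print(record["genus"], record["subGenus"], record["species"])
# Simulium Hemicnetha mexicanum
```

Attribute values come back as strings, so rankWeight is "3" here rather than the integer 3.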

Possible errors

Multiple author names

When a name has several authors, FAT does not always include all of them within the taxonomicName tags. E.g.:

<paragraph>
<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="earlei" subGenus="Hemicnetha">
Simulium (Hemicnetha) earlei Vargas
</taxonomicName>
, Martínez Palacios & Díaz-Nájera
</paragraph>

Correction method: Select the text to be included, together with all or part of the relevant closing tag. Right-click and choose 'Include Tokens'. (Presumably this can also be automated across the whole document; to be checked.)
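Whether GG itself can automate this is still to be checked, but the idea can be illustrated outside the editor on exported XML. The sketch below is an assumption-laden example, not a GG feature: the function absorb_author_tail and the AUTHOR_TAIL pattern are invented here, and the pattern (a comma or ampersand followed by capitalised words) will also match ordinary prose such as ", In 1914", so its output would need human review. Note that a well-formed XML file stores the ampersand as &amp;amp;.

```python
import re
import xml.etree.ElementTree as ET

# A capitalised word, allowing accented capitals, hyphens and full stops.
CAP = r"[A-ZÀ-ÖØ-Þ][\w.\-]*"
# Text directly after a closing tag that looks like ", Author & Author".
AUTHOR_TAIL = re.compile(
    rf"^\s*((?:,|&)\s*{CAP}(?:\s+{CAP})*(?:\s*(?:,|&)\s*{CAP}(?:\s+{CAP})*)*)"
)

SAMPLE = """<paragraph>
<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="earlei" subGenus="Hemicnetha">
Simulium (Hemicnetha) earlei Vargas
</taxonomicName>
, Martínez Palacios &amp; Díaz-Nájera
</paragraph>"""


def absorb_author_tail(root):
    """Move trailing author names inside each taxonomicName; return the count changed."""
    changed = 0
    for name in root.iter("taxonomicName"):
        tail = name.tail or ""
        match = AUTHOR_TAIL.match(tail)
        if match:
            name.text = (name.text or "").rstrip() + match.group(1) + "\n"
            name.tail = tail[match.end():]
            changed += 1
    return changed


paragraph = ET.fromstring(SAMPLE)
absorb_author_tail(paragraph)
print(paragraph.find("taxonomicName").text.strip())
# Simulium (Hemicnetha) earlei Vargas, Martínez Palacios & Díaz-Nájera
```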

False positives

On occasion, Author or Place names can be falsely identified as Genus names. E.g.:

...In 1914 <taxonomicName genus="Malloch" rank="genus" rankWeight="1">
Malloch
</taxonomicName>
 provided a redescription of the male...

Correction method: Right-click on either tag and choose 'Remove'.
It is unclear whether making this correction also removes the name Malloch from the Genus name database. If not, this data would require cleaning to avoid progressive data poisoning, with spurious names accumulating over time.

Factors affecting the quality of FAT results

Input file size and content

Language dependency

The accuracy of the name detection process depends in part on the existence of statistically distinct N-gram distributions for taxonomic names and for the language in which the document is written. FAT currently derives its 'base language' N-gram data from analyses of common English. Literature written in (or containing quotations in) languages other than English may therefore have a higher rate of false positives.

The FAT paper gives as examples Latin and Italian, which have similar N-gram characteristics to taxon names. German documents are also difficult because their capitalisation of nouns makes these words prone to being incorrectly recognised as Genus or Subgenus names.
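To make the language dependency concrete, here is a generic sketch of comparing character trigram distributions. It is not FAT's actual statistical model, which this page does not describe: the add-one smoothing, the function names, and the tiny word lists are all assumptions chosen to keep the example short. A word is judged "taxon-like" if the taxon-name model gives it a higher probability than the base-language model.

```python
import math
from collections import Counter


def ngrams(word, n=3):
    """Character n-grams of a lower-cased word, padded with ^ and $."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(words, n=3):
    """Return (n-gram counts, total n-gram count) for a word list."""
    counts = Counter(g for w in words for g in ngrams(w, n))
    return counts, sum(counts.values())


def log_prob(word, model, vocab_size, n=3):
    """Add-one smoothed log-probability of a word under a model."""
    counts, total = model
    return sum(math.log((counts[g] + 1) / (total + vocab_size))
               for g in ngrams(word, n))


def looks_like_taxon(word, taxon_model, base_model):
    """True if the taxon model explains the word better than the base language."""
    vocab_size = len(set(taxon_model[0]) | set(base_model[0]))
    return (log_prob(word, taxon_model, vocab_size)
            > log_prob(word, base_model, vocab_size))


# Toy training data; a real system would use large lexicons and corpora.
taxon_model = train(["Simulium", "Pheidole", "mexicanum", "pallidula", "Hemicnetha"])
base_model = train(["the", "provided", "redescription", "male", "names", "document"])

print(looks_like_taxon("Simuliidae", taxon_model, base_model))  # True
print(looks_like_taxon("described", taxon_model, base_model))   # False
```

If the base-language model were trained on Latin or Italian text instead, many of its trigrams would overlap with those of taxon names, and the two scores would be much harder to separate.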

Increasing the range of languages supported is listed as one of FAT's priorities for further development.

Possibilities

  • Currently, GG users work in isolation, storing learned taxonomic name data locally. A possible next step would be to centralise and pool taxon name data via the internet. This could allow rapid global improvement in the performance of FAT for multiple users in multiple locations simultaneously.
    • Guido has indicated that the use of and addition to external lexicons is already available for ant names.
    • Bad data being added to common data files would be an issue, as all users would be affected by a single user's mistake.
    • A way to mitigate this might be to use moderation before items are added to a pool, or perhaps to require a number of submissions from different users before new content is added (or sent for moderation).
  • These data pools could become an input into a taxonomic thesaurus for use in other contexts. (This is apparently already being looked at or in progress).
  • As system learning progresses, we can return to previously processed documents and re-analyse them to take advantage of the improved information available. This could catch some false negatives which may have been missed on the first pass. We could even use full text searching on newly added names to identify documents which contain these previously unfound names for repeat processing.
  • A couple of more speculative thoughts
    • If the collected taxon name data included GUIDs for the document(s) in which each name was found, and ideally also the location(s) within those documents, an inverted index of taxon names and their locations could be created in a single step. This would effectively provide the basis for a name search engine.
    • Knowing the order, proximity and pattern of names within documents might also be useful in inferring the taxonomic relationship between the names in an automatic or semi-automatic way, particularly if this was combined with full text search functionality which could identify words commonly used to designate taxonomic relationships and document sections.
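Two of the ideas in the list above, requiring submissions from several users before a name enters the shared pool and building an inverted index from names to document locations, can be sketched briefly. Everything here is hypothetical: the class SharedPool, its quorum parameter, the function build_inverted_index, and the (name, document GUID, offset) record layout are assumptions for illustration, not part of GG or FAT.

```python
from collections import defaultdict


class SharedPool:
    """Accept a name only after `quorum` different users have submitted it."""

    def __init__(self, quorum=2):
        self.quorum = quorum
        self.pending = defaultdict(set)  # name -> ids of users who submitted it
        self.accepted = set()

    def submit(self, name, user):
        """Record a submission; return True once the name is in the pool."""
        if name in self.accepted:
            return True
        self.pending[name].add(user)  # repeat submissions by one user count once
        if len(self.pending[name]) >= self.quorum:
            self.accepted.add(name)
            del self.pending[name]
            return True
        return False


def build_inverted_index(records):
    """Map each name to the (document GUID, offset) pairs where it was found."""
    index = defaultdict(list)
    for name, doc_guid, offset in records:
        index[name].append((doc_guid, offset))
    return dict(index)


pool = SharedPool(quorum=2)
pool.submit("Malloch", "user-a")   # a single user's mistake stays pending
pool.submit("Simulium", "user-a")
pool.submit("Simulium", "user-b")  # second independent submission: accepted
print(pool.accepted)
# {'Simulium'}

index = build_inverted_index([
    ("Simulium mexicanum", "doc-1", 120),
    ("Simulium earlei", "doc-1", 340),
    ("Simulium mexicanum", "doc-2", 15),
])
print(index["Simulium mexicanum"])
# [('doc-1', 120), ('doc-2', 15)]
```

A quorum does not replace moderation; it only limits the damage a single mistaken submission can do, and names still pending could be queued for a moderator instead of being discarded.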