This page is part of the GoldenGATE evaluation for EDIT WP6.
GG uses an integrated tool, FAT (Find all taxon names), to identify taxonomic names.
FAT uses a combining approach, employing several stages of processing and utilising a variety of pattern recognition, named entity recognition, rule-based, statistical, and list-based techniques. The results appear to be at least as good as, and often better than, those of other available tools, with a few exceptions under certain circumstances (see below).
Confirmed taxon names (and taxon name components) are added to a set of data files, stored locally, providing a learning capability. As a result, the output improves in quality as more files are processed and more names are identified.
FAT is most easily called by clicking on the 'Run FAT' button in the Custom Functions section of the UI. It is also available via Analyzers/Run/FAT.analyzer.
FAT then processes the file and asks the user to classify the queried strings. This input is then used to modify the results, and the relevant strings in the document are annotated.
The user then checks the results and makes corrections where necessary, as described below.
Once FAT has identified a taxonomic name, GG adds a <taxonomicName> annotation, with a set of attributes containing the extracted information. E.g.:
<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="mexicanum" subGenus="Hemicnetha">
Simulium (Hemicnetha) mexicanum Bellardi
</taxonomicName>
The attributes are self-explanatory, with the exception of rankWeight. I will add an explanation of this attribute when I have found out what it means.
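As a rough illustration of how the attributes in the annotation above can be read outside GG, here is a minimal Python sketch using the standard library XML parser. The function name read_taxon_annotation is hypothetical and is not part of GG or FAT; the sample string is the annotation shown above, joined onto one line.

```python
import xml.etree.ElementTree as ET

# The example annotation from this page, joined onto a single line.
SAMPLE = (
    '<taxonomicName genus="Simulium" rank="species" rankWeight="3" '
    'species="mexicanum" subGenus="Hemicnetha">'
    'Simulium (Hemicnetha) mexicanum Bellardi'
    '</taxonomicName>'
)


def read_taxon_annotation(xml_fragment):
    """Return (annotated text, attribute dict) for one taxonomicName element.

    Hypothetical helper for illustration only; not a GG or FAT function.
    """
    element = ET.fromstring(xml_fragment)
    if element.tag != "taxonomicName":
        raise ValueError("expected a taxonomicName element")
    text = (element.text or "").strip()
    return text, dict(element.attrib)


text, attributes = read_taxon_annotation(SAMPLE)
print(text)
for name in sorted(attributes):
    print(f"{name} = {attributes[name]}")
```

All attribute values, including rankWeight, come back as strings, so any numeric comparison would need an explicit conversion.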
On occasion, author or place names can be falsely identified as Genus names. E.g.:
...In 1914 <taxonomicName genus="Malloch" rank="genus" rankWeight="1">
Malloch
</taxonomicName>
provided a redescription of the male...
or
...would be deposited in the USNM, but there is no record of this and it is presumably lost (Dr F. <taxonomicName _evidence="lexicon" genus="C" rank="subGenus" subGenus="Thompson">
C. Thompson
</taxonomicName>
, personal communication).
</paragraph>
Correction method: Edit/Select Annotation/taxonName gives a panel which lists all of the relevant annotated strings, together with checkboxes allowing the removal of annotations as necessary. Alternatively, individual annotations can be cleared with a right-click/Remove or Remove All.
Making this correction apparently does not remove the incorrect entry from the Genus name database, should it be present there. This entry may need to be removed to avoid the accumulation of errors, unless it is an ambiguous case.
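Both false positives above have one thing in common: the annotation carries no species attribute. Assuming an exported document is well-formed XML, a minimal Python sketch along these lines could list such annotations for manual review. This is only a heuristic of my own, not something GG or FAT provides, and genuine genus-only names will be listed too. The function name annotations_to_review is hypothetical, and SAMPLE is a made-up paragraph that reuses two annotations from this page.

```python
import xml.etree.ElementTree as ET

# Made-up test paragraph reusing two annotations shown on this page.
SAMPLE = (
    '<paragraph>In 1914 '
    '<taxonomicName genus="Malloch" rank="genus" rankWeight="1">'
    'Malloch</taxonomicName>'
    ' provided a redescription of the male. '
    '<taxonomicName genus="Simulium" rank="species" rankWeight="3" '
    'species="mexicanum" subGenus="Hemicnetha">'
    'Simulium (Hemicnetha) mexicanum Bellardi</taxonomicName>'
    '</paragraph>'
)


def annotations_to_review(xml_text):
    """List (text, rank) for taxonomicName annotations without a species.

    Heuristic assumption: author or place names mistaken for taxa tend to
    be annotated at genus or subgenus rank only.
    """
    root = ET.fromstring(xml_text)
    flagged = []
    for element in root.iter("taxonomicName"):
        if "species" not in element.attrib:
            text = "".join(element.itertext()).strip()
            flagged.append((text, element.get("rank")))
    return flagged


flagged = annotations_to_review(SAMPLE)
for text, rank in flagged:
    print(f"check: {text!r} (rank={rank})")
```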
Unrecognised taxon names may be caused by a number of factors such as typographic errors (including errors in punctuation), and the absence of the name in the lists used by FAT.
Missed names are annotated manually, and have attributes added via the Attribute Taxon Names analyzer. (Usability notes to be added).
The GG instructions advise the following:
It proves extremely advantageous to have very clean and spell checked OCR-output, since artefacts have an impact on the mark-up process, and later corrections are very tedious. This has especially an effect on the recognition of names.
The above is very important. A thorough check for errors before the start of the process (including, for example, missing full stops before genus name abbreviations) and at strategic points during the process will make a significant difference to the efficiency with which the output can be generated.
FAT can be inconsistent in the inclusion of multiple author names within the taxonomicName tags. E.g.:
<paragraph>
<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="earlei" subGenus="Hemicnetha">
Simulium (Hemicnetha) earlei Vargas
</taxonomicName>
, Martínez Palacios & Díaz-Nájera
</paragraph>
Correction method: Select the text to be included and also all or part of the relevant closing tag. Right-click and choose 'Include Tokens'. (This can also presumably be automated through the document. To check).
The FAT paper shows the results of processing individual short files compared with concatenating the same files and processing the resulting larger text. Larger input documents increase the number of 'sure positives', and the larger the number of sure positives, the larger the amount of information available to the rule-based analysis and word-level classification of the input text.
In practice, this will not necessarily result in worse final mark-up quality for short documents, because queried strings are displayed to the user for manual classification. The number of strings presented in this way would be expected to decrease proportionally (though not necessarily in absolute terms) as document length increases. In other words, working on long files or files which are processed as a 'corpus batch' (see below) will usually result in fewer user interventions per name annotated.
The accuracy of the name detection process depends in part on the existence of statistically distinct N-Gram distributions for taxonomic names and the language in which the document is written. FAT currently uses data derived from the analyses of common English to derive its 'base language' N-Gram data. Literature written (or containing quotations) in languages other than English may have a higher rate of false positives.
The FAT paper gives as examples Latin and Italian, which have similar N-Gram characteristics to taxon names. German documents are also more difficult to process reliably in that their use of capitalisation for nouns makes these words prone to being incorrectly recognised as Genus or Subgenus names.
Increasing the range of languages supported is listed as one of FAT's priorities for further development.
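To make the idea of 'statistically distinct N-Gram distributions' more concrete, here is a toy Python sketch that compares character trigram counts for taxon names and for base-language words. It is not FAT's actual model, which is not described on this page: the scoring method (a log-likelihood ratio with add-one smoothing), the function names, and the tiny word lists are all assumptions chosen for illustration. The word lists are taken from text appearing on this page.

```python
from collections import Counter
import math


def char_ngrams(word, n=3):
    """Character n-grams of a word, with ^ and $ marking start and end."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(words, n=3):
    """Count the n-grams seen in a list of training words."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word, n))
    return counts


def log_likelihood(word, counts, vocabulary_size, n=3):
    """Sum of smoothed log-probabilities of the word's n-grams."""
    total = sum(counts.values())
    score = 0.0
    for gram in char_ngrams(word, n):
        # Add-one smoothing so unseen n-grams do not give log(0).
        score += math.log((counts[gram] + 1) / (total + vocabulary_size))
    return score


def taxon_score(word, taxon_counts, language_counts, n=3):
    """Positive when the word looks more like a taxon name than like
    the base language, negative otherwise (toy measure, assumption)."""
    vocabulary_size = len(set(taxon_counts) | set(language_counts))
    return (log_likelihood(word, taxon_counts, vocabulary_size, n)
            - log_likelihood(word, language_counts, vocabulary_size, n))


# Tiny illustrative training sets; a real system needs far more data.
TAXON_COUNTS = train(["Simulium", "Hemicnetha", "mexicanum", "earlei",
                      "Pheidole", "pallidula", "orbula", "xantra",
                      "cornutula"])
LANGUAGE_COUNTS = train(["provided", "redescription", "record",
                         "presumably", "personal", "communication",
                         "deposited", "would", "there", "this"])

for candidate in ["Simulium", "provided"]:
    score = taxon_score(candidate, TAXON_COUNTS, LANGUAGE_COUNTS)
    print(f"{candidate}: {score:+.2f}")
```

If the base-language counts were trained on Latin or Italian text instead, many of the same trigrams would appear on both sides and the scores would move towards zero, which is one way to picture why such languages are harder to separate from taxon names.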
I quote from page 8 (nominally page 53) of the FAT paper:
Name Completion
Making use of the scientists’ names, we also extract taxonomic names that lack the genus, e.g., from enumerations, such as Pheidole pallidula, orbula, xantra. In addition, the rules allow genus abbreviations like Ph. for Pheidole in Ph. cornutula. In order to determine the meaning of a taxonomic name, we need to complete the names with their full parts.
If the genus part is missing, we have two options: First, we check if the species part appears elsewhere in the document, together with the genus it belongs to. If this is not the case, we use the last genus that we have extracted before the position of the name to complete. This is useful especially in case of enumerations: If several species of the same genus are enumerated, the genus is often given only with the first one. We then transfer the genus part to the subsequent taxon names.
If the genus is abbreviated, we also have two options: First, we again check if the species part appears elsewhere in the document, together with the full name of the genus it belongs to. If this fails, we check if we have recognized any genus name that starts with the given abbreviation. If there is exactly one such genus name, we insert it. If there is more than one, i.e., the abbreviation is ambiguous, we use the one which appears closest before the abbreviation.
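The completion rules quoted above can be sketched in Python as follows. This is my own reading of the quoted text, not FAT's code. The Name class and the complete_genus function are hypothetical. Two details are assumptions that the quotation does not state: an abbreviation is recognised by its trailing full stop, and when the species is found elsewhere with a full genus, that genus must also start with the abbreviation. When an abbreviation is ambiguous and no matching genus occurs before it, the sketch leaves the name unresolved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Name:
    """One extracted name, in document order (hypothetical structure)."""
    genus: Optional[str]  # full genus, abbreviation such as "Ph.", or None
    species: str


def is_full(genus):
    # Assumption: abbreviations always end with a full stop.
    return genus is not None and not genus.endswith(".")


def complete_genus(names: List[Name], index: int) -> Optional[str]:
    """Return the full genus for names[index], or None if unresolved."""
    target = names[index]
    if is_full(target.genus):
        return target.genus
    # An empty prefix means the genus is missing entirely.
    prefix = target.genus[:-1] if target.genus else ""

    # Option 1: the same species appears elsewhere with a full genus.
    for position, other in enumerate(names):
        if (position != index and other.species == target.species
                and is_full(other.genus)
                and other.genus.startswith(prefix)):
            return other.genus

    # Full genera extracted before this position, nearest first.
    earlier = [n.genus for n in reversed(names[:index]) if is_full(n.genus)]

    if not prefix:
        # Missing genus: use the last genus extracted before this name.
        return earlier[0] if earlier else None

    # Abbreviated genus: look for recognised genera with this prefix.
    candidates = {n.genus for n in names
                  if is_full(n.genus) and n.genus.startswith(prefix)}
    if len(candidates) == 1:
        return next(iter(candidates))
    # Ambiguous: take the matching genus closest before the abbreviation.
    for genus in earlier:
        if genus.startswith(prefix):
            return genus
    return None


# The enumeration and abbreviation examples from the quotation.
names = [Name("Pheidole", "pallidula"), Name(None, "orbula"),
         Name(None, "xantra"), Name("Ph.", "cornutula")]
completed = [complete_genus(names, i) for i in range(len(names))]
print(completed)
```

With this input all four names resolve to Pheidole: the two names from the enumeration inherit the preceding genus, and Ph. matches the only recognised genus with that prefix.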