GG uses an integrated tool, FAT (Find All Taxon names), to identify taxonomic names.
FAT combines several stages of processing, drawing on pattern recognition, named entity recognition, and rule-based, statistical, and list-based techniques. Its results appear to be at least as good as, and often better than, those of other available tools, with a few exceptions under certain circumstances (see below).
Confirmed taxon names (and taxon name components) are added to a set of data files, stored locally, providing a learning capability. As a result, the output improves in quality as more files are processed and more names are identified.
FAT is most easily called by clicking on the 'Run FAT' button in the Custom Functions section of the UI. It is also available via Analyzers/Run/FAT.analyzer.
FAT then processes the file and asks the user to select queried strings. This input is then used to modify the results, and the relevant strings in the document are annotated.
The user then checks the results and makes the following corrections:
Once FAT has identified a taxonomic name, GG adds a <taxonomicName> annotation, with a set of attributes containing the extracted information. E.g.:
<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="mexicanum" subGenus="Hemicnetha">
Simulium (Hemicnetha) mexicanum Bellardi
</taxonomicName>
The attributes are self-explanatory, with the exception of rankWeight. I will add an explanation of this attribute when I have found out what it means.
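As an illustration of how the extracted information can be read back out of an annotation, here is a minimal Python sketch using the standard library XML parser. It assumes the annotation is available as a standalone, well-formed XML snippet; the function name read_taxon_annotation is hypothetical and is not part of GG or FAT.

```python
import xml.etree.ElementTree as ET

# The example annotation from above, as a standalone XML snippet.
SNIPPET = """<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="mexicanum" subGenus="Hemicnetha">
Simulium (Hemicnetha) mexicanum Bellardi
</taxonomicName>"""


def read_taxon_annotation(xml_text):
    """Return (attributes, annotated text) for one <taxonomicName> element.

    Hypothetical helper for illustration only; not a GG or FAT function.
    """
    element = ET.fromstring(xml_text)
    if element.tag != "taxonomicName":
        raise ValueError("expected a <taxonomicName> element")
    # Collapse the line breaks and indentation around the annotated text.
    text = " ".join("".join(element.itertext()).split())
    return dict(element.attrib), text


attributes, text = read_taxon_annotation(SNIPPET)
print(attributes["genus"], attributes["subGenus"], attributes["species"])
print(text)
```

Attribute values are returned as strings, so rankWeight comes back as "3" rather than as a number.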
On occasion, author or place names can be falsely identified as Genus names. E.g.:
...In 1914 <taxonomicName genus="Malloch" rank="genus" rankWeight="1">
Malloch
</taxonomicName>
provided a redescription of the male...
Correction method: To review the re
It is unclear whether making this correction would also remove the name Malloch from the Genus name database. If not, the data would need cleaning to avoid progressive data poisoning, with spurious text being added over time.
Unrecognised taxon names may be caused by a number of factors such as typographic errors (including errors in punctuation), and the absence of the name in the names lists used by FAT.
Missed names are annotated manually, and have attributes added via the Attribute Taxon Names analyzer. (Usability notes to be added).
FAT can be inconsistent in the inclusion of multiple author names within the taxonomicName tags. E.g.:
<paragraph>
<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="earlei" subGenus="Hemicnetha">
Simulium (Hemicnetha) earlei Vargas
</taxonomicName>
, Martínez Palacios & Díaz-Nájera
</paragraph>
Correction method: Select the text to be included and also all or part of the relevant closing tag. Right-click and choose 'Include Tokens'. (This can presumably also be automated across the document. To check.)
The accuracy of the name detection process depends in part on the existence of statistically distinct N-Gram distributions for taxonomic names and for the language in which the document is written. FAT currently uses data derived from analyses of common English to derive its 'base language' N-Gram data. Literature written in (or containing quotations in) languages other than English may have a higher rate of false positives.
The FAT paper gives as examples Latin and Italian, which have similar N-Gram characteristics to taxon names. German documents are also difficult in that their use of capitalisation for nouns makes these words prone to being incorrectly recognised as Genus or Subgenus names.
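To make the N-Gram idea concrete, here is a minimal Python sketch of one way to compare character trigram distributions. It is not FAT's actual model: the tiny word lists, the add-one smoothing, and the function names are all assumptions chosen for illustration. A positive score means the word's trigrams look more like the taxon sample than the base language sample.

```python
import math
from collections import Counter

# Toy training samples (assumptions for illustration, not FAT's data files).
TAXON_WORDS = ["Simulium", "Hemicnetha", "mexicanum", "earlei",
               "Pheidole", "pallidula", "cornutula"]
ENGLISH_WORDS = ["provided", "redescription", "male", "the",
                 "results", "document", "names"]


def trigrams(word):
    """Character trigrams of a lower-cased word, padded with start/end markers."""
    padded = "^^" + word.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(words):
    """Count the trigrams that occur in a list of words."""
    counts = Counter()
    for word in words:
        counts.update(trigrams(word))
    return counts


def log_likelihood(word, counts, vocabulary_size):
    """Sum of log trigram probabilities with add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[gram] + 1) / (total + vocabulary_size))
               for gram in trigrams(word))


def taxon_score(word, taxon_counts, language_counts):
    """Log-likelihood ratio: positive means 'more taxon-like'."""
    # Distinct trigrams seen in either sample, plus one slot for unseen trigrams.
    vocabulary_size = len(set(taxon_counts) | set(language_counts)) + 1
    return (log_likelihood(word, taxon_counts, vocabulary_size)
            - log_likelihood(word, language_counts, vocabulary_size))


taxon_counts = train(TAXON_WORDS)
english_counts = train(ENGLISH_WORDS)
for candidate in ["Simulium", "document"]:
    print(candidate, round(taxon_score(candidate, taxon_counts, english_counts), 2))
```

With a base language whose trigram distribution is close to that of taxon names, such as Latin, the two likelihoods move closer together and the score separates the classes less well, which is consistent with the higher false positive rate noted above.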
Increasing the range of languages supported is listed as one of FAT's priorities for further development.
I quote from page 8 (nominally page 53) of the FAT paper:
Name Completion
Making use of the scientists’ names, we also extract taxonomic names that lack the genus, e.g., from enumerations, such as Pheidole pallidula, orbula, xantra. In addition, the rules allow genus abbreviations like Ph. for Pheidole in Ph. cornutula. In order to determine the meaning of a taxonomic name, we need to complete the names with their full parts.
If the genus part is missing, we have two options: First, we check if the species part appears elsewhere in the document, together with the genus it belongs to. If this is not the case, we use the last genus that we have extracted before the position of the name to complete. This is useful especially in case of enumerations: If several species of the same genus are enumerated, the genus is often given only with the first one. We then transfer the genus part to the subsequent taxon names.
If the genus is abbreviated, we also have two options: First, we again check if the species part appears elsewhere in the document, together with the full name of the genus it belongs to. If this fails, we check if we have recognized any genus name that starts with the given abbreviation. If there is exactly one such genus name, we insert it. If there is more than one, i.e., the abbreviation is ambiguous, we use the one which appears closest before the abbreviation.
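The completion rules quoted above can be sketched in Python as follows. This is a reading of the quoted passage, not FAT's implementation: the Name class, the convention that an abbreviated genus ends with a full stop, and the function names are assumptions made for illustration. Names are assumed to be given in document order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Name:
    genus: Optional[str]  # full genus, an abbreviation such as "Ph.", or None
    species: str


def is_full(genus):
    """True for a full genus name (assumed: present and not ending in '.')."""
    return genus is not None and not genus.endswith(".")


def complete_names(names):
    """Fill in missing or abbreviated genus parts, following the quoted rules."""
    # Species seen elsewhere in the document together with a full genus.
    known = {}
    for name in names:
        if is_full(name.genus):
            known.setdefault(name.species, name.genus)
    all_genera = {name.genus for name in names if is_full(name.genus)}

    completed = []
    for name in names:
        genus = name.genus
        if not is_full(genus):
            # Full genera that precede this name, in document order.
            earlier = [c.genus for c in completed if is_full(c.genus)]
            if name.species in known:
                genus = known[name.species]
            elif genus is None:
                # Missing genus: fall back to the last genus extracted before.
                genus = earlier[-1] if earlier else None
            else:
                # Abbreviated genus: look for genera starting with the prefix.
                prefix = genus.rstrip(".")
                candidates = {g for g in all_genera if g.startswith(prefix)}
                if len(candidates) == 1:
                    genus = next(iter(candidates))
                else:
                    # Ambiguous: take the closest candidate before this name;
                    # if there is none, leave the abbreviation unresolved.
                    matching = [g for g in earlier if g in candidates]
                    genus = matching[-1] if matching else genus
        completed.append(Name(genus, name.species))
    return completed


enumeration = [Name("Pheidole", "pallidula"), Name(None, "orbula"),
               Name(None, "xantra"), Name("Ph.", "cornutula")]
for name in complete_names(enumeration):
    print(name.genus, name.species)
```

In this sketch a name that cannot be completed keeps its original genus value (None or the abbreviation), so unresolved cases remain visible for manual checking.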