This page is part of the GoldenGATE evaluation for EDIT WP6.
GG uses an integrated tool, FAT (Find all taxon names), to identify taxonomic names.
FAT uses a combined approach, employing several stages of processing and utilising a variety of pattern recognition, named entity recognition, rule-based, statistical, and list-based techniques. The developers' own testing indicates that the results are at least as good as, and often better than, those of other available tools. Some limitations exist - see below.
Confirmed taxon names (and taxon name components) are added to a set of data files, stored locally, providing a learning capability. As a result, the output improves in quality as more files are processed and more names are identified.
FAT is most easily called by clicking on the 'Run FAT' button in the Custom Functions section of the UI. It is also available via Analyzers/Run/FAT.analyzer.
FAT then processes the file and asks the user to classify the queried strings. This input is then used to modify the results, and the relevant strings in the document are annotated.
The user then checks the results and makes the following corrections:
Once FAT has identified a taxonomic name, GG adds a <taxonomicName> annotation, with a set of attributes containing the extracted information. E.g.:
<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="mexicanum" subGenus="Hemicnetha">
Simulium (Hemicnetha) mexicanum Bellardi
</taxonomicName>
The attributes are self-explanatory, with the exception of rankWeight. I will add an explanation of this attribute when I have found out what it means.
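For anyone wanting to inspect these attributes outside GG, the following is a minimal Python sketch that reads the example annotation above with the standard library XML parser. It assumes the annotation is available as well-formed XML; the helper name read_annotation is my own and is not part of GG or FAT.

```python
import xml.etree.ElementTree as ET

# Annotation copied from the example above.
snippet = (
    '<taxonomicName genus="Simulium" rank="species" rankWeight="3" '
    'species="mexicanum" subGenus="Hemicnetha">'
    'Simulium (Hemicnetha) mexicanum Bellardi'
    '</taxonomicName>'
)

def read_annotation(xml_text):
    """Return (annotated text, attribute dict) for one taxonomicName element."""
    element = ET.fromstring(xml_text)
    return element.text.strip(), dict(element.attrib)

text, attributes = read_annotation(snippet)
print(text)
for name in sorted(attributes):
    print(name, "=", attributes[name])
```

All attribute values, including rankWeight, come back as strings, so any numeric interpretation is left to the caller.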
On occasion, author or place names can be falsely identified as Genus names. E.g.:
...In 1914 <taxonomicName genus="Malloch" rank="genus" rankWeight="1">
Malloch
</taxonomicName>
provided a redescription of the male...
or
...would be deposited in the USNM, but there is no record of this and it is presumably lost (Dr F. <taxonomicName _evidence="lexicon" genus="C" rank="subGenus" subGenus="Thompson">
C. Thompson
</taxonomicName>
, personal communication).
</paragraph>
Correction method: Edit/Select Annotation/taxonName gives a panel which lists all of the relevant annotated strings, together with checkboxes allowing the removal of annotations as necessary. Alternatively, individual annotations can be cleared with a right-click/Remove or Remove All.
Making this correction apparently does not remove the incorrect entry from the Genus name database, should it be present there. This entry may need to be removed to avoid the accumulation of errors, unless it is an ambiguous case.
Unrecognised taxon names may be caused by a number of factors such as typographic errors (including errors in punctuation), and the absence of the name in the lists used by FAT.
Missed names are annotated manually, and have attributes added via the Attribute Taxon Names analyzer. (Usability notes to be added).
The GG instructions advise the following:
It proves extremely advantageous to have very clean and spell checked OCR-output, since artefacts have an impact on the mark-up process, and later corrections are very tedious. This has especially an effect on the recognition of names.
The above is very important. A thorough check for errors before the start of the process (including, for example, missing full stops after genus name abbreviations) and at strategic points during the process will make a significant difference to the efficiency with which the output can be generated.
The importance of this pre-processing check is such that it may well be worth incorporating a tool into GG to help catch issues not easily spotted using a standard spell check, such as possible punctuation errors.
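As an illustration of the kind of check meant here, the Python sketch below flags places where a genus abbreviation may have lost its full stop (e.g. "S jobbinsi" instead of "S. jobbinsi"). It is a simple heuristic of my own, not a GG feature: the pattern and the function name find_missing_stops are assumptions, and every hit still needs a manual look.

```python
import re

# A lone capital letter followed by whitespace and a lowercase word, with no
# full stop after the capital. "A" and "I" are skipped because they are
# ordinary English words.
MISSING_STOP = re.compile(r"\b(?![AI]\b)([A-Z])\s+([a-z]{3,})\b")

def find_missing_stops(text):
    """Return (offset, matched text) pairs worth a manual look."""
    return [(m.start(), m.group(0)) for m in MISSING_STOP.finditer(text)]

sample = "Simulium mexicanum and S jobbinsi were compared with S. horacioi."
for offset, hit in find_missing_stops(sample):
    print(offset, hit)
```

The check is deliberately loose: it will also flag initials in personal names that have lost their full stop, which is usually worth correcting anyway.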
FAT is often inconsistent in the inclusion of multiple author names within the taxonomicName tags. E.g.
<paragraph>
<taxonomicName genus="Simulium" rank="species" rankWeight="3" species="earlei" subGenus="Hemicnetha">
Simulium (Hemicnetha) earlei Vargas
</taxonomicName>
, Martínez Palacios & Díaz-Nájera
</paragraph>
Correction method: Select the text to be included and also all or part of the relevant closing tag. Right-click and choose 'Include Tokens'.
In papers containing many references to author teams, this issue is significant. A more automated approach to fixing the problem post-FAT would probably be practical, but ideally this would be addressed as a development of FAT itself.
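To give an idea of what a post-FAT fix might look like, here is a rough Python sketch that moves the closing taxonomicName tag past an author team that directly follows it. It works on the serialised text rather than through GG, and everything in it (the token pattern, the rule that a team must contain an ampersand followed by a capitalised name, the function name extend_over_authors) is my own assumption. Capitalised place names joined by an ampersand would be caught too, so the output would still need checking.

```python
import re

CLOSE = "</taxonomicName>"
# A token is a comma, an ampersand (raw or XML-escaped), or a word that starts
# with a letter and continues with letters, digits, apostrophes or hyphens.
TOKEN = re.compile(r"\s*(,|&amp;|&|[^\W\d_][\w'-]*)")

def extend_over_authors(text):
    """Move each closing tag past an author team that directly follows it."""
    out, pos = [], 0
    while True:
        close_at = text.find(CLOSE, pos)
        if close_at < 0:
            out.append(text[pos:])
            return "".join(out)
        after = close_at + len(CLOSE)
        scan, end = after, after
        pending_ampersand, is_team = False, False
        while True:
            match = TOKEN.match(text, scan)
            if match is None:
                break
            token = match.group(1)
            if token in ("&", "&amp;"):
                pending_ampersand = True
            elif token != ",":
                if not token[0].isupper():
                    break              # lowercase word: ordinary prose resumes
                is_team = is_team or pending_ampersand
                end = match.end()      # only capitalised words extend the span
            scan = match.end()
        if is_team:
            out.append(text[pos:close_at] + text[after:end] + CLOSE)
            pos = end
        else:
            out.append(text[pos:after])
            pos = after

sample = ('<taxonomicName genus="Simulium">Simulium (Hemicnetha) earlei Vargas'
          '</taxonomicName>, Martínez Palacios & Díaz-Nájera')
print(extend_over_authors(sample))
```

The sketch stops at the first full stop, so author teams written with initials would only be partly included. That errs on the side of leaving text outside the annotation rather than pulling in the start of the next sentence.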
On occasion, lists of names in the form S. alpha, S. beta may be incorrectly interpreted as being one subspecies name, not two species names:
<taxonomicName _evidence="WSS:97" genus="Simulium" genus.bestMatchDistance="1594" genus.bestMatchVote="2" genus.innerRound="1" genus.outerRound="1" rank="subSpecies" species="jobbinsi" subSpecies="horacioi">
S. jobbinsi, S. horacioi
</taxonomicName>
Correction method: Select the second name and preceding comma, and right-click/Split annotation. Exclude the comma and space, and edit attributes in both names as required.
This looks like a bug to me.
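The manual split described above follows a simple enough rule that it could be sketched in code. The Python below takes the annotated text and the genus attribute and returns one species-level record per name, or None if the text does not look like an enumeration of binomials. This is only my illustration of the correction, not how FAT or GG work internally; the function name split_enumeration and the record layout are hypothetical.

```python
import re

# "S. jobbinsi": a genus abbreviation (or full genus) followed by a lowercase epithet.
BINOMIAL = re.compile(r"^([A-Z][a-z]*\.?)\s+([a-z-]+)$")

def split_enumeration(annotated_text, genus):
    """Split 'S. alpha, S. beta' into one species-level record per name.

    Returns None when the text is not a comma-separated list of plain
    binomials in the given genus, so genuine subspecies are left alone.
    """
    parts = [part.strip() for part in annotated_text.split(",")]
    if len(parts) < 2:
        return None
    records = []
    for part in parts:
        match = BINOMIAL.match(part)
        if match is None:
            return None
        abbreviation = match.group(1).rstrip(".")
        if not genus.startswith(abbreviation):
            return None  # abbreviation does not fit the annotated genus
        records.append({"text": part, "genus": genus,
                        "species": match.group(2), "rank": "species"})
    return records

for record in split_enumeration("S. jobbinsi, S. horacioi", "Simulium"):
    print(record)
```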
The FAT paper shows the results of processing individual short files compared with concatenating the same files and processing the resulting larger text. Larger input documents increase the number of 'sure positives', and the larger the number of sure positives, the larger the amount of information available to the rules based analysis and word level classification of the input text.
In practice, this will not necessarily result in worse final mark-up quality for short documents, because queried strings are displayed to the user for manual classification. The number of strings presented in this way would be expected to decrease proportionally (though not necessarily in absolute terms) as document length increases. In other words, working on long files or files which are processed as a 'corpus batch' (see below) will usually result in fewer user interventions per name annotated.
The accuracy of the name detection process depends in part on the existence of statistically distinct N-Gram distributions for taxonomic names and the language in which the document is written. FAT currently uses data derived from the analyses of common English to derive its 'base language' N-Gram data. Literature written (or containing quotations) in languages other than English may have a higher rate of false positives.
The FAT paper gives as examples Latin and Italian, which have similar N-Gram characteristics to taxon names. German documents are also more difficult to process reliably: the capitalisation of nouns makes them prone to being incorrectly recognised as Genus or Subgenus names.
Increasing the range of languages supported is listed as one of FAT's priorities for further development.
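To make the N-Gram idea concrete, the Python sketch below builds character trigram profiles from two word lists and measures how much of a word's trigrams occur in each profile. This is a toy illustration under my own assumptions, not FAT's actual statistical model: the function names, the use of trigrams, and the tiny word lists are all made up for the example.

```python
from collections import Counter

def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a word, lowercased."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def profile(words, n=3):
    """Relative frequency of each n-gram over a list of words."""
    counts = Counter(gram for word in words for gram in char_ngrams(word, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

def overlap(word, reference, n=3):
    """Share of the word's n-grams that also occur in the reference profile."""
    grams = char_ngrams(word, n)
    if not grams:
        return 0.0
    return sum(gram in reference for gram in grams) / len(grams)

# Tiny, hand-picked word lists, for illustration only.
taxon_profile = profile(["Simulium", "Hemicnetha", "Pheidole"])
english_profile = profile(["provided", "redescription", "communication"])

for candidate in ["Simulium", "description"]:
    print(candidate,
          overlap(candidate, taxon_profile),
          overlap(candidate, english_profile))
```

The more the two profiles resemble each other, as the FAT paper reports for Latin and Italian, the less a comparison of this kind can separate names from ordinary words.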
I quote from page 8 (nominally page 53) of the FAT paper:
Name Completion
Making use of the scientists’ names, we also extract taxonomic names that lack the genus, e.g., from enumerations, such as Pheidole pallidula, orbula, xantra. In addition, the rules allow genus abbreviations like Ph. for Pheidole in Ph. cornutula. In order to determine the meaning of a taxonomic name, we need to complete the names with their full parts. If the genus part is missing, we have two options: First, we check if the species part appears elsewhere in the document, together with the genus it belongs to. If this is not the case, we use the last genus that we have extracted before the position of the name to complete. This is useful especially in case of enumerations: If several species of the same genus are enumerated, the genus is often given only with the first one. We then transfer the genus part to the subsequent taxon names.
If the genus is abbreviated, we also have two options: First, we again check if the species part appears elsewhere in the document, together with the full name of the genus it belongs to. If this fails, we check if we have recognized any genus name that starts with the given abbreviation. If there is exactly one such genus name, we insert it. If there is more than one, i.e., the abbreviation is ambiguous, we use the one which appears closest before the abbreviation.
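The rules quoted above can be sketched in a few lines of Python. This is my reading of the quoted text, not FAT's code. The data layout (a list of (genus, species) pairs in document order) and the function name complete_genus are assumptions, as are two details the quote leaves open: the sketch requires a genus found via the species to fit the abbreviation, and it returns None when nothing can be inferred.

```python
def complete_genus(names):
    """Fill in missing or abbreviated genus parts.

    names: (genus, species) pairs in document order; genus is None when
    missing, ends in '.' when abbreviated, and is a full name otherwise.
    Returns the completed genus for each pair (None if nothing fits).
    """
    def is_full(genus):
        return genus is not None and not genus.endswith(".")

    # Species epithet -> first full genus it appears with in the document.
    by_species = {}
    for genus, species in names:
        if is_full(genus):
            by_species.setdefault(species, genus)

    completed = []
    for index, (genus, species) in enumerate(names):
        if is_full(genus):
            completed.append(genus)
            continue
        prefix = "" if genus is None else genus.rstrip(".")
        # First option: the species occurs elsewhere with a full genus.
        candidate = by_species.get(species)
        if candidate is not None and candidate.startswith(prefix):
            completed.append(candidate)
        elif genus is None:
            # Missing genus: use the last genus extracted before this name.
            previous = [g for g in completed if g is not None]
            completed.append(previous[-1] if previous else None)
        else:
            # Abbreviated genus: a unique match wins, else the closest before.
            matching = {g for g, _ in names if is_full(g) and g.startswith(prefix)}
            if len(matching) == 1:
                completed.append(next(iter(matching)))
            else:
                before = [g for g, _ in names[:index]
                          if is_full(g) and g.startswith(prefix)]
                completed.append(before[-1] if before else None)
    return completed

# The enumeration and abbreviation from the quoted passage.
example = [("Pheidole", "pallidula"), (None, "orbula"),
           (None, "xantra"), ("Ph.", "cornutula")]
print(complete_genus(example))
```

Note that the 'last genus' fallback is exactly the step that makes clean input so important: a single falsely recognised genus can be carried forward to every incomplete name that follows it.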
One of FAT's strengths is its ability to learn, both from user input and from decisions made as a result of document content analysis. Learning in this context is the persistence of certain text data which can be used to aid decision making in the future. Though this is a valuable feature, I would suggest that the user interface of GoldenGATE should include features to better manage the learning process, allowing the user to have more knowledge of the decisions GoldenGATE has made, and also allowing the amendment or deletion of learned items.
For example, I would like to be given the option to see a list of all the names detected once FAT has run. I would also like to have changes to one instance of a name (e.g. the addition of mistakenly excluded authors into the annotation) reflected in all other instances of the name where those authors have been similarly excluded. This could presumably be set up as a user defined tool, but it is a common enough issue to warrant being handled formally.
Any corrections, such as the manual removal of false positives after a FAT run, should automatically be reflected in the text store.
Perhaps a report could be created as a file is worked on, so names recognised, rejected and added to the learning store can be easily seen and corrected by the user.
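As a rough indication of what such a list or report might contain, the Python sketch below counts every distinct annotated name in a marked-up document. It assumes the document has been saved as well-formed XML without namespaces; the function name list_taxon_names and the sample document are mine, and nothing here reads or changes FAT's learning store.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def list_taxon_names(xml_text):
    """Count each distinct annotated string in a well-formed XML document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter("taxonomicName"):
        # itertext() also picks up text inside any nested annotations;
        # split/join collapses line breaks and repeated spaces.
        name = " ".join("".join(element.itertext()).split())
        counts[name] += 1
    return counts

# Made-up sample document, for illustration only.
document = """<document><paragraph>
<taxonomicName genus="Simulium" rank="species" species="earlei">Simulium earlei</taxonomicName>
was discussed by
<taxonomicName genus="Malloch" rank="genus">Malloch</taxonomicName>
; see also
<taxonomicName genus="Simulium" rank="species" species="earlei">
Simulium earlei
</taxonomicName>
</paragraph></document>"""

for name, count in sorted(list_taxon_names(document).items()):
    print(count, name)
```

Even a plain list like this makes false positives such as 'Malloch' easy to spot, because they appear out of context alongside the genuine names.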
The 'Attribute Taxon Names' custom function has a very inflexible interface. Context for each name is not shown, and so cannot be used to populate attribute values. The process cannot be cancelled, nor can individual names that the user is unsure of be skipped. It is a modal interface, so the document cannot be examined to inform the editing decisions made by the user. All of these make what could be a very useful function much less usable.