(Part of the GoldenGATE evaluation)
The GG tutorial is found in the Tutorial folder in the GG installation folder. It includes two example files and instructions to guide the user through the process of creating TaxonX output.
This process involves editing the input file into a semantically marked-up intermediate format that is sufficiently detailed and structured to allow conversion into a final output format.
Other than TaxonX, it is unclear which final output formats are currently supported by GG, but it is clear that plug-ins can be written to support other XML schemas should this be required.
The minimum specification for the intermediate format will vary depending on the nature of the target output. The intermediate format specification will need to be documented to help the user produce the marked-up output in GG. Such documentation probably already exists for conversions to TaxonX.
Below are some comments about the various stages of the process, and also about some of the functions used in the creation of the final output from the first tutorial example.
The tutorial example files are OCR output in XML format.
(I will not routinely annotate other tests to this level of detail, but I feel that this simple example shows some useful characteristics that are generally relevant to the conversion process.)
<p>
Ectatomma tuberculatum Olivier.-Port of Spain and Sangre Grande,
</p>
<p>
(R. Thaxter), ; Botanical Garden, Port of Spain, (Wheeler), ?.<BR>
Ectatomma ruidum Roger.-Port of Spain, (R. Thaxter), ; Chagu-
</p>
<p>
anas, (Urich), ; Botanical Garden, Port of Spain, (Wheeler), .<BR>
Ectatomma (Gnamptogenys) concinnum F. Smith.-Caparo, (P. B.
</p>
<p>
Whelpley), 9.
</p>
This file is loaded into GG using File/Load Document from File, with the 'Files of type:' dropdown set to HTML, XML and SGML files (custom reader).
On import, certain processing is carried out automatically.
In the case of the above file, the content is wrapped in a root element <document>, <p> elements are converted into <paragraph> elements, and <BR> elements are converted into literal line breaks.
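The import changes just described can be approximated in a few lines. This is only a sketch of the observed behaviour, not GG's actual reader; the function name `normalize_on_import` is made up for illustration, and it assumes the input uses bare `<p>` and `<BR>` tags as in the tutorial file.

```python
import re

def normalize_on_import(raw: str) -> str:
    """Approximate the import-time changes observed in GG (illustrative only)."""
    # <BR> (plus any whitespace after it) becomes a single literal line break.
    text = re.sub(r"<BR\s*/?>\s*", "\n", raw, flags=re.IGNORECASE)
    # <p> and </p> become <paragraph> and </paragraph>.
    text = re.sub(r"<(/?)p\s*>", r"<\1paragraph>", text, flags=re.IGNORECASE)
    # The whole content is wrapped in a <document> root element.
    return "<document>\n" + text.strip() + "\n</document>"

sample = "<p>\nanas, (Urich), .<BR>\nEctatomma concinnum F. Smith.-Caparo, (P. B.\n</p>"
print(normalize_on_import(sample))
```

A sketch like this is mainly useful for predicting what a given input file will look like once loaded, before any annotation work starts.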
This import processing is relevant because it will be necessary to check how GG handles any existing markup within input files. The GG documentation has the following to say about input formats:
"It is recommended to prepare the input documents in html format without any formatting but the paragraph elements (<p>) and a solid (<hr>) line to mark page breaks. For the processing of the documents no further elements are needed. It proved also extremely advantageous to have very clean and spell checked OCR-output, since artefacts have an impact on the mark-up process, and later corrections are very tedious. This has especially an effect on the recognition of names." (GG Manual pg6).
As part of further testing, I have loaded some files produced by Word 2003's Save As options.
The 'Web Page, Filtered' option produces relatively clean HTML, without the Word-specific tags and namespaces that affect the other output formats.
For this reason it may be that Word documents, if these are in fact supported as a WP6 input format, should be saved as 'Web Page, Filtered' before loading into GG.
A possibility:
Notwithstanding the above, it is possible that Word namespaced elements present in other Save As output file types may in fact be useful as markers of specific content, aiding the creation of the intermediate format file by allowing automation, or at least supplying 'hints' clarifying the document structure for whoever makes the conversion in GG.
It may therefore be worth looking at the creation of a specific Word (or OpenOffice Write) template containing predefined semantic character or paragraph styles which would allow users to mark-up their existing or new documents with styles which either map directly onto the appropriate tags in the intermediate format, or can be used as guides for later automatic or manual conversions in GG.
This would have the benefit of using tools already familiar to many authors of input documents, allowing them to imprint their knowledge of the domain in an unambiguous form within the document, prior to its submission and while the structure of the document (as indicated by the formatting and layout of the text which is lost on import to GG) is most easily understood.
If this path is taken, the template should not attempt to replicate the strengths of GG; it should instead allow users to mark elements in the document as 'hints' to aid the creation of the intermediate format. This would be most useful if it helped with those areas which are hardest to automate, for example the identification of true paragraph boundaries.
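To make the template idea more concrete, the sketch below shows how style names surviving in exported HTML as `class` attributes might be read back as hints. Everything specific here is an assumption: the style names `TaxonName` and `Locality`, the mapping to intermediate-format tags, and the class `StyleHintCollector` are all hypothetical, and nothing like this exists in GG as far as I know.

```python
from html.parser import HTMLParser

# Hypothetical template style names mapped to intermediate-format tags.
STYLE_HINTS = {"TaxonName": "taxonomicName", "Locality": "location"}

# Elements that never have an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}

class StyleHintCollector(HTMLParser):
    """Collect (hint_tag, text) pairs for text inside elements with a known style."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one entry per open element: a hint tag or None
        self.hints = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(STYLE_HINTS.get(dict(attrs).get("class")))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Use the innermost enclosing element that carries a known style.
        hint = next((h for h in reversed(self.stack) if h), None)
        if hint and data.strip():
            self.hints.append((hint, data.strip()))

collector = StyleHintCollector()
collector.feed('<p><span class="TaxonName">Ectatomma ruidum</span> Roger. - '
               '<span class="Locality">Port of Spain</span>, (R. Thaxter)</p>')
print(collector.hints)
```

The hints are collected rather than applied, in keeping with the suggestion that such a template should guide the work done in GG rather than replace it.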
Going back to the tutorial file...
The file is now in the form:
<document>
<paragraph>
Ectatomma tuberculatum Olivier. - Port of Spain and Sangre Grande,
</paragraph>
<paragraph>
(R. Thaxter),; Botanical Garden, Port of Spain, (Wheeler),?.
Ectatomma ruidum Roger. - Port of Spain, (R. Thaxter),; Chagu-
</paragraph>
<paragraph>
anas, (Urich),; Botanical Garden, Port of Spain, (Wheeler),.
Ectatomma (Gnamptogenys) concinnum F. Smith. - Caparo, (P. B.
</paragraph>
<paragraph>
Whelpley), 9.
</paragraph>
</document>
At this point the paragraphs as recognised by the OCR process are incorrectly split. GG depends on the user to redefine the paragraph boundaries correctly, using the Merge Annotations and Split Annotation functions. The line breaks are then fixed using the paragraph structure normalizer to give:
<paragraph>
Ectatomma tuberculatum Olivier. - Port of Spain and Sangre Grande, (R. Thaxter),; Botanical Garden, Port of Spain, (Wheeler),?.
</paragraph>
<paragraph>
Ectatomma ruidum Roger. - Port of Spain, (R. Thaxter),; Chaguanas, (Urich),; Botanical Garden, Port of Spain, (Wheeler),.
</paragraph>
<paragraph>
Ectatomma (Gnamptogenys) concinnum F. Smith. - Caparo, (P. B. Whelpley), 9.
</paragraph>
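The line-break fixing seen above, including the rejoining of "Chagu-" and "anas", can be sketched as follows. This is a naive illustration under my own assumptions, not the logic of GG's paragraph structure normalizer; the function name `join_lines` is hypothetical, and a line that ends in a genuine hyphen or dash would be joined incorrectly.

```python
def join_lines(lines):
    """Join the lines of one paragraph, re-attaching words split by an end-of-line hyphen."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            # "Chagu-" followed by "anas, ..." becomes "Chaguanas, ..."
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text

merged = join_lines([
    "Ectatomma ruidum Roger. - Port of Spain, (R. Thaxter),; Chagu-",
    "anas, (Urich),; Botanical Garden, Port of Spain, (Wheeler),.",
])
print(merged)
```

Note that this only works once the paragraph boundaries have been corrected by hand, since the hyphenated word in the tutorial file originally spans two OCR paragraphs.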
There is not yet enough information for GG to automatically split the content into treatments using the built-in TreatmentSplitter.analyzer. More structural information must be added.
In order to recognise treatments within the text, GG needs to recognise their necessary components.
A treatment consists of three components:
If we run TaxonDetector.analyzer, GG will mark up the taxon names in the text, providing an indicator of the nomenclature component and thus the start location for each treatment.
Manually assigning location annotations to the various location names in the text provides the structural information GG needs to detect the location of the materials examined component of each treatment.
Once these annotations are complete, the TreatmentSplitter.analyzer can be run successfully, producing the content in the required intermediate format (IF):
<document STORAGE_UNIT_ID="0x06132B08027E7EDD38875A446F5096EF" STORAGE_UNIT_TITLE="">
<paragraph>
<treatment STORAGE_UNIT_ID="0xC965A6B7E0FC830C471B83EA54EBD65D" pageNumber="1">
<taxonomicName genus="Ectatomma" rank="species" rankWeight="3" species="tuberculatum">
Ectatomma tuberculatum Olivier
</taxonomicName>
<subSubSection treatmentPart="Materials">
<collection_event>
. -
<location>
Port of Spain
</location>
and Sangre Grande, (R. Thaxter),; Botanical Garden,
</collection_event>
<collection_event>
<location>
Port of Spain
</location>
, (Wheeler),?.
</collection_event>
</subSubSection>
</treatment>
</paragraph>
<paragraph>
<treatment STORAGE_UNIT_ID="0xB40463398B603E096E29295F2764F537" pageNumber="1">
<taxonomicName genus="Ectatomma" rank="species" rankWeight="3" species="ruidum">
Ectatomma ruidum Roger
</taxonomicName>
<subSubSection treatmentPart="Materials">
<collection_event>
. -
<location>
Port of Spain
</location>
, (R. Thaxter),;
</collection_event>
<collection_event>
<location>
Chaguanas
</location>
, (Urich),; Botanical Garden,
</collection_event>
<collection_event>
<location>
Port of Spain
</location>
, (Wheeler),.
</collection_event>
</subSubSection>
</treatment>
</paragraph>
<paragraph>
<treatment STORAGE_UNIT_ID="0x53E6D9CDC59AC654A76BE3FB4182A85D" pageNumber="1">
<taxonomicName genus="Ectatomma" rank="species" rankWeight="3" species="concinnum" subGenus="Gnamptogenys">
Ectatomma (Gnamptogenys) concinnum F. Smith
</taxonomicName>
<subSubSection treatmentPart="Materials">
<collection_event>
. -
<location>
Caparo
</location>
, (P. B. Whelpley), 9.
</collection_event>
</subSubSection>
</treatment>
</paragraph>
</document>
The TaxonxCreator.Analyser is then run, giving:
<taxonx:taxonx xmlns:mods="http://www.loc.gov/mods/v3" xmlns:taxonx="http://research.amnh.org/informatics/taxlit/taxonx/taxonx1">
<paragraph>
<taxonx:taxonxHeader>
<mods:mods>
<mods:titleInfo>
<mods:title>
Automated TaxonX markup created by GoldenGATE
</mods:title>
</mods:titleInfo>
</mods:mods>
</taxonx:taxonxHeader>
</paragraph>
<taxonx:taxonxBody>
<paragraph>
<treatment STORAGE_UNIT_ID="0xC965A6B7E0FC830C471B83EA54EBD65D" pageNumber="1">
<taxonx:treatment>
<taxonx:p>
<taxonomicName genus="Ectatomma" rank="species" rankWeight="3" species="tuberculatum">
<taxonx:nomenclature>
<taxonx:name genus="Ectatomma" rank="species" rankWeight="3" species="tuberculatum">
Ectatomma tuberculatum Olivier
</taxonx:name>
</taxonx:nomenclature>
</taxonomicName>
<subSubSection treatmentPart="Materials">
<collection_event>
. -
<taxonx:seg type="materials_examined">
<location>
<taxonx:collection_event>
<taxonx:locality>
Port of Spain
</taxonx:locality>
</taxonx:collection_event>
</location>
and Sangre Grande, (R. Thaxter),; Botanical Garden,
<collection_event>
<location>
<taxonx:collection_event>
<taxonx:locality>
Port of Spain
</taxonx:locality>
</taxonx:collection_event>
</location>
, (Wheeler),?.
</collection_event>
</taxonx:seg>
</collection_event>
</subSubSection>
</taxonx:p>
</taxonx:treatment>
</treatment>
</paragraph>
<paragraph>
<treatment STORAGE_UNIT_ID="0xB40463398B603E096E29295F2764F537" pageNumber="1">
<taxonx:treatment>
<taxonx:p>
<taxonomicName genus="Ectatomma" rank="species" rankWeight="3" species="ruidum">
<taxonx:nomenclature>
<taxonx:name genus="Ectatomma" rank="species" rankWeight="3" species="ruidum">
Ectatomma ruidum Roger
</taxonx:name>
</taxonx:nomenclature>
</taxonomicName>
<subSubSection treatmentPart="Materials">
<collection_event>
. -
<taxonx:seg type="materials_examined">
<location>
<taxonx:collection_event>
<taxonx:locality>
Port of Spain
</taxonx:locality>
</taxonx:collection_event>
</location>
, (R. Thaxter),;
<collection_event>
<location>
<taxonx:collection_event>
<taxonx:locality>
Chaguanas
</taxonx:locality>
</taxonx:collection_event>
</location>
, (Urich),; Botanical Garden,
</collection_event>
<collection_event>
<location>
<taxonx:collection_event>
<taxonx:locality>
Port of Spain
</taxonx:locality>
</taxonx:collection_event>
</location>
, (Wheeler),.
</collection_event>
</taxonx:seg>
</collection_event>
</subSubSection>
</taxonx:p>
</taxonx:treatment>
</treatment>
</paragraph>
<paragraph>
<treatment STORAGE_UNIT_ID="0x53E6D9CDC59AC654A76BE3FB4182A85D" pageNumber="1">
<taxonx:treatment>
<taxonx:p>
<taxonomicName genus="Ectatomma" rank="species" rankWeight="3" species="concinnum" subGenus="Gnamptogenys">
<taxonx:nomenclature>
<taxonx:name genus="Ectatomma" rank="species" rankWeight="3" species="concinnum" subGenus="Gnamptogenys">
Ectatomma (Gnamptogenys) concinnum F. Smith
</taxonx:name>
</taxonx:nomenclature>
</taxonomicName>
<subSubSection treatmentPart="Materials">
<collection_event>
. -
<taxonx:seg type="materials_examined">
<location>
<taxonx:collection_event>
<taxonx:locality>
Caparo
</taxonx:locality>
</taxonx:collection_event>
</location>
, (P. B. Whelpley), 9.
</taxonx:seg>
</collection_event>
</subSubSection>
</taxonx:p>
</taxonx:treatment>
</treatment>
</paragraph>
</taxonx:taxonxBody>
</taxonx:taxonx>
This output just needs to have the non-TaxonX tags removed in order to produce the final TaxonX output. This is done by (manually) selecting only the taxonx-namespaced elements and saving as XML (selected tags).
The final output is:
<taxonx:taxonx xmlns:mods="http://www.loc.gov/mods/v3"
xmlns:taxonx="http://research.amnh.org/informatics/taxlit/taxonx/taxonx1">
<taxonx:taxonxHeader> Automated TaxonX markup created by GoldenGATE </taxonx:taxonxHeader>
<taxonx:taxonxBody>
<taxonx:treatment>
<taxonx:p>
<taxonomicName genus="Ectatomma" rank="species" rankWeight="3" species="tuberculatum">
<taxonx:nomenclature>
<taxonx:name genus="Ectatomma" rank="species" rankWeight="3" species="tuberculatum"> Ectatomma
tuberculatum Olivier </taxonx:name>
</taxonx:nomenclature>
</taxonomicName> . - <taxonx:seg type="materials_examined">
<taxonx:collection_event>
<taxonx:locality> Port of Spain </taxonx:locality>
</taxonx:collection_event> and <taxonx:collection_event>
<taxonx:locality> Sangre Grande </taxonx:locality>
</taxonx:collection_event> , (R. Thaxter),; Botanical Garden, <taxonx:collection_event>
<taxonx:locality> Port of Spain </taxonx:locality>
</taxonx:collection_event> , (Wheeler),?. </taxonx:seg>
</taxonx:p>
</taxonx:treatment>
<taxonx:treatment>
<taxonx:p>
<taxonomicName genus="Ectatomma" rank="species" rankWeight="3" species="ruidum">
<taxonx:nomenclature>
<taxonx:name genus="Ectatomma" rank="species" rankWeight="3" species="ruidum"> Ectatomma
ruidum Roger </taxonx:name>
</taxonx:nomenclature>
</taxonomicName> . - <taxonx:seg type="materials_examined">
<taxonx:collection_event>
<taxonx:locality> Port of Spain </taxonx:locality>
</taxonx:collection_event> , (R. Thaxter),; <taxonx:collection_event>
<taxonx:locality> Chaguanas </taxonx:locality>
</taxonx:collection_event> , (Urich),; Botanical Garden, <taxonx:collection_event>
<taxonx:locality> Port of Spain </taxonx:locality>
</taxonx:collection_event> , (Wheeler),. </taxonx:seg>
</taxonx:p>
</taxonx:treatment>
<taxonx:treatment>
<taxonx:p>
<taxonomicName genus="Ectatomma" rank="species" rankWeight="3" species="concinnum"
subGenus="Gnamptogenys">
<taxonx:nomenclature>
<taxonx:name genus="Ectatomma" rank="species" rankWeight="3" species="concinnum"
subGenus="Gnamptogenys"> Ectatomma (Gnamptogenys) concinnum F. Smith </taxonx:name>
</taxonx:nomenclature>
</taxonomicName> . - <taxonx:seg type="materials_examined">
<taxonx:collection_event>
<taxonx:locality> Caparo </taxonx:locality>
</taxonx:collection_event> , (P. B. Whelpley), 9. </taxonx:seg>
</taxonx:p>
</taxonx:treatment>
</taxonx:taxonxBody>
</taxonx:taxonx>
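The manual tag selection that produces this output could in principle be scripted outside GG. The sketch below, using Python's standard ElementTree module, keeps the elements accepted by a predicate and unwraps all others so that their text and children are preserved. It is an assumption-laden illustration, not how GG saves selected tags: the helper names are my own, the root element is always kept, and the predicate is left as a parameter because the final output above also retains the un-namespaced taxonomicName elements, so the selection used in the tutorial was evidently not purely by namespace.

```python
import xml.etree.ElementTree as ET

TAXONX_NS = "http://research.amnh.org/informatics/taxlit/taxonx/taxonx1"

def _append_text(parent, text):
    """Append text at the end of parent's content (after its last child, if any)."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text

def _copy_filtered(src, dest, keep):
    """Copy src's content into dest, unwrapping children rejected by keep()."""
    _append_text(dest, src.text)
    for child in src:
        if keep(child):
            new = ET.SubElement(dest, child.tag, dict(child.attrib))
            _copy_filtered(child, new, keep)
        else:
            _copy_filtered(child, dest, keep)  # drop the tag, keep its content
        _append_text(dest, child.tail)

def keep_selected(root, keep):
    """Return a copy of the tree containing the root and the elements accepted by keep()."""
    out = ET.Element(root.tag, dict(root.attrib))
    _copy_filtered(root, out, keep)
    return out

def in_taxonx(elem):
    """Example predicate: accept only elements in the taxonx namespace."""
    return elem.tag.startswith("{" + TAXONX_NS + "}")

doc = ET.fromstring(
    '<t:taxonxBody xmlns:t="%s"><paragraph><location>'
    '<t:locality>Caparo</t:locality></location>, (P. B. Whelpley), 9.'
    '</paragraph></t:taxonxBody>' % TAXONX_NS
)
ET.register_namespace("taxonx", TAXONX_NS)
result = keep_selected(doc, in_taxonx)
print(ET.tostring(result, encoding="unicode"))
```

Passing a different predicate, for example one that also accepts taxonomicName, would reproduce a selection closer to the one shown in the tutorial output.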