The capture and encoding of taxon relationships from documents

This page describes how GoldenGATE's output encodes the relationships between taxa (parent/child/sibling/synonym/'easily confused with' etc.). See also Taxonomic name identification in GoldenGATE.

For reference, example sets of taxonomic and nomenclatural relationship types can be seen in the CATE schema.

Definitions of the relationships handled by the TCS schema are also of interest.

Extracting relationships

Some of the questions relating to the handling of existing taxonomic documents in EDIT WP5/6 are:

  • How,
  • at what point, and
  • to what extent

will taxonomic and nomenclatural relationships which are expressed in documents be extracted and encoded?

Two points in the flow of legacy data into the system appear most suitable for this process:

  1. During marking-up prior to addition to the XML repository
  2. During processing of XML content in the repository by the XML Analyser prior to addition to the Data Warehouse.

Adding relationship metadata at the first mark-up stage requires the following:

  • The intermediate format created in GG, and the XML format created on export from GoldenGATE must support the required set of relationships
  • GoldenGATE must provide tools to efficiently add the required elements/attributes
  • The user must have the knowledge to identify and correctly annotate the relationships

And implies:

  • either the creation of a single metadata element within the file summarising the taxonomic and nomenclatural relationships described, or
  • the creation of an external linked metadata file or database entry which contains the object relationships for the original file, or
  • some convention (e.g. on first mention) determining where such information is added to the mark-up, or
  • potentially a large degree of redundancy (i.e. repetition) of such information in the file

Extracting relationships at the point of XML analysis for addition to the Data Warehouse requires that:

  • the XML format created on export from GoldenGATE must have sufficient structure and enough machine-recognisable content to support the extraction of the required set of relationships by the Analyser
  • the extraction must be fast enough not to become a bottleneck in the data import workflow
  • the results must be highly reliable
  • a mechanism for detecting and correcting errors must exist, presumably some form of user or moderator checking/editing

It seems likely, in fact, that some metadata addition will take place at both points in the data workflow.

Once extracted and added to the Data Warehouse, the collected information can then be used to represent taxonomic relationships not just within individual documents but also for multiple documents, up to and including the entire Data Warehouse, with the aggregated data providing an overall picture (or pictures) of the connections of interest, and highlighting regions of stability and dissent in any particular taxonomic group.

So, how do GoldenGATE and TaxonX handle taxonomic and other relationships?

TaxonX and document content

To quote from the TaxonX website:

Taxonx is a XML schema for encoding taxonomic literature in order to:

  • Create open, stable, persistent, full text digital surrogates of taxonomic treatments
  • Identify taxonomic treatments and their major structural components to enable networked reference and citation
  • Identify lower level textual data such scientific names, localities, morphological characters, and bibliographic citations to facilitate their extraction by, and integration with external applications and resources
  • Study and describe the structure of systematics publications by creating few typical corpora of literature, such as entire journal (eg AMNH Novitates), across taxa (e.g all ant systematics papers post 1995), or faunistic (e.g. all ant systematics paper covering Madagascar ranging from 1758 to 2006)

TaxonX is designed to identify relevant objects at various levels of granularity and locate them within a document structure. Accordingly, the current version of the core format lacks specific constructs to explicitly encode most types of taxonomic relationships.

What relationships can currently be expressed in TaxonX?

The current TaxonX 'core' schema and relationships

Where relationships between annotated items are available for machine use, these are largely implicit through the position of these items within common containing elements, and these containers usually relate to the structure of the original document (but see comments on the use of the <xid> element below).

To a human reader, taxon relationships expressed in the text are obviously explicit, but this text currently occurs as 'plain text' in paragraph elements, outside of the mark-up relating to objects covered by schema elements. These data will not be directly available without further analysis using natural language processing techniques. Such processing will inevitably be prone to some degree of error, given the natural variations in writing styles and conventions (not to mention languages) which exist.

For example, a <treatment> element (which is associated with a taxonomic name and reference details to identify a taxonomic concept) contains a set of divs which identify the types of treatment content present in the file (e.g. nomenclature, materials examined etc.). The content of these elements, whether plain text paragraphs or marked-up, is associated with the relevant taxonomic name by virtue of its presence within the treatment for that name.

There are some exceptions (e.g. synonymy, which is covered as part of the nomenclature component of treatments), but for the most part, relationships are read from document structure. As a result, taxon names in a discussion section will appear as sibling nodes within the parent discussion node, but the mark-up will not attempt to indicate the precise nature of their relationship.
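As a rough illustration of what "read from document structure" means in practice, the Python sketch below collects the names that occur together inside a discussion div. The element names (`treatment`, `div`, `name`) and the sample fragment are simplified assumptions for illustration only; real TaxonX mark-up is namespaced and richer than this.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Simplified, hypothetical mark-up: not a valid TaxonX document.
SAMPLE = """
<treatment>
  <div type="discussion">
    <p>Close to <name>Ponera grandis</name> but distinct from
       <name>Ponera coarctata</name>.</p>
  </div>
</treatment>
"""

def co_occurring_names(xml_text, div_type="discussion"):
    """Return pairs of names that share a <div> of the given type.

    The pairs carry no relationship type: the structure only shows that
    the names appear together, not how the taxa are related.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for div in root.iter("div"):
        if div.get("type") != div_type:
            continue
        # Normalise whitespace inside each name before pairing.
        names = [" ".join(n.text.split()) for n in div.iter("name") if n.text]
        pairs.extend(combinations(names, 2))
    return pairs

pairs = co_occurring_names(SAMPLE)
```

The result is a list of untyped pairs, which is exactly the limitation described above: deciding whether the two names are synonyms, siblings or merely compared would still need the surrounding text.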

So, what options are available to support the explicit mark-up of relationships which are described in text, as opposed to being reliant on document structure?

Expressing taxon relationships explicitly

As discussed above, relationship data can be expressed in several ways:

  • Linking to an external data source containing the relationship data
  • Storing the relationship data within the file in the form of a metadata element containing data marked-up in some suitable (and possibly external) schema
  • Expressing relationships as attributes applied to all related marked-up objects in the file.

The third option, though viable, leads to redundancy of data in the file, and may also make editing the data more complicated, as reciprocal relationships need to be kept in sync.
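The synchronisation problem can be made concrete with a small Python sketch that looks for reciprocal statements which are missing. The relationship vocabulary and the data layout here are hypothetical, chosen only to illustrate the check; they are not taken from TaxonX or GoldenGATE.

```python
# Hypothetical relationship vocabulary; each type maps to its inverse.
INVERSE = {
    "parent_of": "child_of",
    "child_of": "parent_of",
    "sibling_of": "sibling_of",
    "synonym_of": "synonym_of",
}

def missing_reciprocals(annotations):
    """Find reciprocal statements that are absent.

    annotations maps a taxon name to a set of (relation, other_name)
    tuples, mimicking attributes attached to each marked-up name.
    Returns a list of (name, relation, other_name) that should exist.
    """
    missing = []
    for name, rels in annotations.items():
        for relation, other in sorted(rels):
            expected = (INVERSE[relation], name)
            if expected not in annotations.get(other, set()):
                missing.append((other, *expected))
    return missing

annotations = {
    "Ponerinae": {("parent_of", "Ponera")},
    "Ponera": set(),  # the reciprocal 'child_of' was never added
}
problems = missing_reciprocals(annotations)
```

Every edit to one side of a relationship would need a check of this kind, which is the extra editing burden the third option carries.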

Looking at the first two options:

Using the <xid> element to refer to an external data source

The current version of TaxonX includes the xid element. This is a pointer to some external data source (e.g. a file or database query result) identified by URI, which contains the relevant information, in this case the relationship summary for the document. The TaxonX schema currently limits the parent elements within which this can be used, but it would presumably be possible to extend this and allow the inclusion of an xid element at the document level, assuming a single summary of relationships in the document is required.

The use of <xid> gives great flexibility, in that any external data accessible via URI can be used, but this still leaves WP6 with the question of how the external data will be created, and where and how it will be stored.

GoldenGATE already has a function which queries the Hymenoptera Name Server and retrieves an id for each name found in the database. This identifier is added to the mark-up as an attribute of an <xid> element, a child of the nomenclature element for the taxon. This id could be used to query a database to extract taxonomic relationship data.
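A minimal Python sketch of how such identifiers might be pulled out of the mark-up for use in a later database query is given below. The attribute names `source` and `identifier`, the un-namespaced element names and the identifier value are assumptions made for illustration; the actual export should be checked against the TaxonX schema.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment; the identifier value is invented.
SAMPLE = """
<treatment>
  <nomenclature>
    <name>Ponera grandis</name>
    <xid source="HNS" identifier="example-id-1"/>
  </nomenclature>
</treatment>
"""

def external_ids(xml_text, source="HNS"):
    """Return (name, identifier) pairs for <xid> children of <nomenclature>."""
    root = ET.fromstring(xml_text)
    ids = []
    for nomen in root.iter("nomenclature"):
        name = nomen.findtext("name", default="").strip()
        for xid in nomen.findall("xid"):
            # Keep only identifiers issued by the requested source.
            if xid.get("source") == source:
                ids.append((name, xid.get("identifier")))
    return ids

ids = external_ids(SAMPLE)
```

Each returned identifier could then be passed to whichever service holds the relationship data; that lookup is deliberately left out, since no such service interface is described here.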

(This function would be more generally useful if other, more widely applicable data sources could be queried, e.g. the uBio NameBank, IPNI, or an aggregator service accessing a range of similar sources.)

Storing relationship data using mark-up validating against an external schema

TaxonX supports external schemata; for example, it uses MODS to mark up the publication metadata for a paper. One option would be to select a schema which handles the additional metadata we are interested in, such as the Taxon Concept Transfer Schema (TCS), and use it to extend TaxonX.

Terry Catapano (designer of the TaxonX schema) has confirmed via email that there is a definite intention to add support for TCS as an allowed schema within TaxonX's xmlData element.

GoldenGATE and TCS integration

GoldenGATE currently has no tools for explicitly associating arbitrarily related objects in the document. Presumably this could be done via a GUI element, or through the creation of a set of text-based subject-predicate-object triples.
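To show what a text-based triple format could look like, here is a minimal Python sketch that parses one triple per line. The `|` separator, the comment syntax and the predicate names are all hypothetical; nothing like this exists in GoldenGATE at present.

```python
from typing import NamedTuple

class Relationship(NamedTuple):
    subject: str
    predicate: str
    obj: str

def parse_triples(text):
    """Parse 'subject | predicate | object' lines into Relationship tuples.

    Blank lines and lines starting with '#' are skipped; any other line
    that does not have exactly three fields raises ValueError.
    """
    triples = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"line {number}: expected 3 fields, got {line!r}")
        triples.append(Relationship(*parts))
    return triples

SAMPLE = """
# subject | predicate | object
Ponera | child_of | Ponerinae
Ponerinae | child_of | Formicidae
"""
triples = parse_triples(SAMPLE)
```

A plain-text format of this kind would be easy to type and to check by eye, at the cost of needing a controlled list of predicates to keep annotators consistent.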

Once the TaxonX team integrates TCS, I would expect that this functionality would be added to GoldenGATE as a priority.

Multiple relationship representations in documents

Documents containing discussions or comparisons of multiple relationships would need to be considered in the schema design, possibly by using a specific 'weighting' or other attribute in the mark-up to indicate a hierarchy of author preference across the various relationship descriptions. If this were done, it would be possible to query documents and extract only the author's preferred classification, something which would otherwise be difficult to do.
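The query described here could be as simple as the following Python sketch, which keeps only the relationship statements carrying the highest weight. The `weight` field, its numeric scale and the record layout are hypothetical, since no such attribute has been defined in the schema.

```python
def preferred_relationships(relationships):
    """Return the statements with the highest 'weight' value.

    relationships is a list of dicts with hypothetical keys 'subject',
    'predicate', 'object' and 'weight' (higher = more preferred by the
    author). Ties are all returned, in their original order.
    """
    if not relationships:
        return []
    top = max(r["weight"] for r in relationships)
    return [r for r in relationships if r["weight"] == top]

statements = [
    # The author discusses an earlier placement but prefers the second.
    {"subject": "Ponera", "predicate": "child_of", "object": "Formicinae", "weight": 1},
    {"subject": "Ponera", "predicate": "child_of", "object": "Ponerinae", "weight": 2},
]
preferred = preferred_relationships(statements)
```

Without some attribute of this kind, both statements would look equally authoritative to any program reading the mark-up.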

Structural mark-up and relationships - some more details

Synonymy

The method used for representation of synonymy is an item for discussion on the TaxonX wiki. There is a synonymy element, available as a child of the nomenclature element. The content model is apparently being revised, as the nomenclature element is currently specifically intended to represent the nomenclatural information in the heading of a taxonomic treatment. The expression of taxonomic relationships will presumably need (for our purposes, at least) to be more flexibly and broadly implemented (i.e. not be limited to synonymy), and not be tied to the presence of any specific document content item. There is also a status element, which may allow a name in the synonym list to be flagged as the 'preferred' name for that taxon, though I have seen no documentation about the use of this element as yet.

Marking up treatments

Where sufficient structure exists within a section of a document, GG will attempt to automatically mark-up treatment boundaries when the 'Mark-up Treatment Boarders' custom function is run. Where the structure of the text is unclear, GG will display the content of the document in a window which allows the user to specify the start of new treatments, the continuation of the current treatment, or non-treatment content. After running this function and providing any required input, the user will need to check the treatment boundaries and make any required corrections.

Content requirements for GG treatment identification

The GG Manual section 5: Workflow to generate a valid TaxonX XML document up to Level 1 has this:

Make sure that paragraphs before a taxonomic treatment contain only the treated taxon. Higher taxa like subfamily or tribe has to be written in an extra paragraph.

  <paragraph> Formicidae. </paragraph>
  <paragraph> Subfamily PONERINAE. </paragraph>
  <paragraph> Ponera grandis, sp. n. </paragraph>
  <paragraph> [[ worker ]]. Reddish brown, head darker, mandibles, antennae, and legs lighter. Whole body clothed with sparse yellow pubescence, more abundant on gaster. </paragraph>

Though this constraint is commonly met, the above does imply some dependency between a specific content structure and successful mark-up in the application. This may be acceptable, but the desirability of recommending edits to documents in the workflow for WP6 should be assessed. As documents being marked up for WP6 are being used primarily as data sources, this is unlikely to be a problem. If the mark-up is being or might be used as a representation/record of the original file, this may not be the case.

Treatment <div>s

Currently the TaxonX schema uses div elements to divide up treatments into their component parts. These <div> elements have an identifying type attribute, usually taken from the list of suggested values: abstract, acknowledgements, biology_ecology, description, diagnosis, discussion, distribution, etymology, introduction, materials_examined, materials_methods, multiple, synopsis.

When marking up at the phrase level, the <seg> element is used, with the following list of suggested type attribute values: biology_ecology, collection_data, description, diagnosis, discussion, distribution, etymology, key, materials_examined, synopsis.

I assume that <seg> is used to identify phrase-level text which does not have the same content type as its parent <div>, should such a parent <div> exist.
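As a small worked example, the Python sketch below reports <div> type values that fall outside the suggested list given above. The un-namespaced element names and the sample fragment are simplifying assumptions, and since the values are only suggestions, an unlisted value is something to review, not necessarily an error.

```python
import xml.etree.ElementTree as ET

# Suggested <div> type values, as listed in the text above.
DIV_TYPES = {
    "abstract", "acknowledgements", "biology_ecology", "description",
    "diagnosis", "discussion", "distribution", "etymology", "introduction",
    "materials_examined", "materials_methods", "multiple", "synopsis",
}

# Simplified, hypothetical fragment for illustration only.
SAMPLE = """
<treatment>
  <div type="description"><p>Reddish brown, head darker.</p></div>
  <div type="remarks"><p>Known from a single worker.</p></div>
</treatment>
"""

def unlisted_div_types(xml_text):
    """Return the sorted set of <div> type values not in the suggested list.

    A <div> with no type attribute is reported as an empty string.
    """
    root = ET.fromstring(xml_text)
    found = {div.get("type", "") for div in root.iter("div")}
    return sorted(found - DIV_TYPES)

unlisted = unlisted_div_types(SAMPLE)
```

The same approach would work for <seg> by swapping in its own list of suggested values.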
