Revision of The capture and encoding of bibliographic reference information from Tue, 2007-04-24 09:28

Bibliographic content

The Web Revisions system as a whole would need to handle bibliographic references.

Content added via the main data input process

The requirements for this are still to be confirmed.

Content added via parsing of existing documents

The data import system would ideally be able to recognise, parse and annotate bibliographic references within input documents, probably for later extraction by the analyser and addition to the data warehouse.

It seems wise to look for existing components which could carry out this step, either as a pre- or post-GoldenGATE stage in the import process. (Can external components be easily called from within GG?)

There is an existing Perl library for reference extraction; details to follow.

A possibility:

Whilst at Kew, I wrote some Python scripts designed to identify the boundary between the publication name and collation fields in reference data, together with a collation parser used to extract numerical values for volume, page, fig and other collation components. These produced reasonably good results as part of a manually checked process.

If necessary, and if time allows (and with Kew's permission), I will look at how useful the tool might be for all taxonomic publications (not just botanical ones).

(This function may instead be carried out by a module to be written in WP5.)
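
As an illustration only, the kind of boundary detection and collation parsing described above might look something like the following Python sketch. This is not the Kew code: the pattern, the field names (volume, issue, first_page, last_page, figure) and the functions parse_collation and split_reference are all hypothetical, and the sketch assumes a simple "volume(issue): pages, fig. n" layout that real reference data will often not follow.

```python
import re

# Hypothetical pattern for a collation such as "12(3): 45-67, fig. 2".
# Real reference data is far more varied; this covers one common layout only.
COLLATION_RE = re.compile(
    r"""
    (?P<volume>\d+)                               # volume number
    (?:\((?P<issue>\d+)\))?                       # optional issue in parentheses
    \s*:\s*                                       # colon separating volume and pages
    (?P<first_page>\d+)                           # first (or only) page
    (?:\s*[-\u2013]\s*(?P<last_page>\d+))?        # optional last page of a range
    (?:,\s*(?:fig|f)\.\s*(?P<figure>\d+))?        # optional figure number
    """,
    re.VERBOSE,
)


def parse_collation(text):
    """Return the numeric collation components found in text, or None."""
    match = COLLATION_RE.search(text)
    if match is None:
        return None
    # Keep only the components that were actually present.
    return {
        name: int(value)
        for name, value in match.groupdict().items()
        if value is not None
    }


def split_reference(reference):
    """Split a reference into (publication name, collation components).

    The boundary is assumed to be the start of the first collation match.
    If no collation is found, the whole string is treated as the name.
    """
    match = COLLATION_RE.search(reference)
    if match is None:
        return reference.strip(), None
    publication = reference[: match.start()].rstrip(" ,")
    return publication, parse_collation(reference)


if __name__ == "__main__":
    print(split_reference("Bot. J. Linn. Soc. 12(3): 45-67, fig. 2"))
```

Anything the pattern fails to match is returned with no collation, so it can be passed to the manual checking step rather than being guessed at.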

TaxonX reference mark-up

Discussions relating to TaxonX and the treatment of bibliographic content can be found here.
