GoldenGATE imports text files, not binary files, so .doc and .pdf files must be exported to a text file prior to import. The GoldenGATE manual (Section 4) has the following:
It is recommended to prepare the input documents in html format without any formatting but the paragraph elements (<p>) and a solid (<hr>) line to mark page breaks. For the processing of the documents not further elements are needed. It proofed also extremely advantageous to have very clean and spell checked OCR-output, since artefacts have an impact on the mark-up process, and later corrections are very tedious. This has especially an effect on the recognition of names.
Via email, Guido has summarised the supported input formats as follows:
File formats are as plug&play (i.e. extensible via plug-ins. JW) in GoldenGATE as many other things.
Currently, there are plugins for txt, HTML/XML (configurable to rename and filter tags on loading), ARFF, TaxonX XML (XML format specialized to our TaxonX schema, doing some transformation of the generic markup created in the markup process), and GAMTA (a format native to the data model underneath GoldenGATE, derived from XML, and designed to be capable of representing annotations regardless of nesting order).
The ability to configure the processing which is carried out when specific types of file are imported is particularly useful. For example, documents exported to HTML often contain unwanted mark-up elements and attributes. This could potentially be dealt with seamlessly on import into GoldenGATE.
GoldenGATE uses plug-ins called readers to provide the initial automated processing of files when they are imported. Users can view and edit the available import plug-ins by going to Window/Preferences and viewing the "Document Readers" tab. New readers can also be created.
Currently, GoldenGATE does not support attribute removal or conversion on import. Adding this feature would allow troublesome attributes to be removed seamlessly on import. This functionality is already available within the Markup Converter tools within GoldenGATE, so this is not an urgent requirement.
I will be discussing working with the following file types:
Microsoft Word
Plain text
PDF/OCR output
This section assumes that Microsoft Word is available to users. OpenOffice Writer also supports the required functionality.
Microsoft Word has a number of features which might make it useful for the initial clean-up of a document. For example, it would make sense to carry out a spell check in Word prior to exporting it ready for import into GoldenGATE. This is particularly true where a user has added relevant words to their personal dictionary in Word.
(GoldenGATE's built-in spell-checking tool has some usability issues which favour the use of Word or some other tool for pre-import checks. As GoldenGATE development continues, the benefits of using an external tool may decline).
If a document has a broken paragraph structure, for example having paragraph marks introduced at the end of each line by an OCR process or by cutting and pasting from a PDF (as was the case in one of the test documents), Word can be used to fix this using a series of Find and Replace edits.
For example, the problematic paragraph marks in the Crypto1 document (A Zoological Classification System of Cryptomonads) may be removed as follows:
(<para> can be any text string which is not found in the text of the document itself).
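The same placeholder technique can be expressed outside Word. The following is a minimal Python sketch, not the exact Find and Replace sequence used on the Crypto1 document. It assumes that genuine paragraph breaks show up as an empty line (two or more consecutive line breaks) and that every other line break is a layout artefact. The function name `unwrap_hard_breaks` is hypothetical.

```python
import re

PLACEHOLDER = "<para>"  # any string that never occurs in the document text


def unwrap_hard_breaks(text: str, placeholder: str = PLACEHOLDER) -> str:
    """Remove per-line breaks while keeping genuine paragraph breaks.

    Assumption: a genuine paragraph break is an empty line; all other
    line breaks were introduced by OCR or by copying from a PDF.
    """
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text")
    text = text.replace("\r\n", "\n")
    # Step 1: protect genuine paragraph breaks with the placeholder.
    text = re.sub(r"\n{2,}", placeholder, text)
    # Step 2: turn every remaining line break into a single space.
    text = text.replace("\n", " ")
    # Step 3: restore the protected paragraph breaks.
    return text.replace(placeholder, "\n")


sample = "First line\nsecond line\n\nNew paragraph\ncontinues"
result = unwrap_hard_breaks(sample)
```

If the document marks paragraph starts in some other way (for example by indentation only), step 1 would need a different pattern.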
The above could be made into a Word macro for repeated use on documents with similar problems. This can save a significant amount of manual reformatting in long documents with this particular type of structural problem.
More generally, a library of macros could be created which would solve other commonly encountered issues as these are identified. It would be good practice to have some kind of forum and file sharing mechanism allowing the discussion and dissemination of fixes. This would prevent unnecessary duplication of work, and also allow technical guidance where users may be using the wrong tool for the job. For the developers, such a forum would be a good source of requirements for desired GoldenGATE functionality, such as new file import options.
It should be borne in mind that every step in the workflow should be carried out using the most appropriate tool for the job. Ideally, GoldenGATE would clean up such a file on import, through a user-selected option or possibly automatically (after an OK by the user). (Documents with this particular problem are quite easy to recognise programmatically, having a strictly limited maximum line length and an unusually large number of paragraphs not ending in a full stop.) In the absence of an appropriate import or other function, or at least until one is available, the problem may be best solved using Word as a pre-processor.
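The recognition heuristic mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name `looks_hard_wrapped` and both thresholds are assumptions chosen for the example, not values taken from GoldenGATE.

```python
def looks_hard_wrapped(paragraphs, max_len_limit=100, unterminated_ratio=0.5):
    """Guess whether each 'paragraph' is really one line of wrapped text.

    The two signals are the ones named in the text: a strictly limited
    maximum line length, and a large share of paragraphs that do not end
    in a full stop. Both thresholds are illustrative assumptions.
    """
    non_empty = [p.strip() for p in paragraphs if p.strip()]
    if not non_empty:
        return False
    longest = max(len(p) for p in non_empty)
    unterminated = sum(1 for p in non_empty if not p.endswith("."))
    return (longest <= max_len_limit
            and unterminated / len(non_empty) >= unterminated_ratio)


wrapped = [
    "The quick brown fox jumps over",
    "the lazy dog and keeps running",
    "until it reaches the river.",
]
normal = ["A sentence. " * 20]
```

A result of `True` would only be a hint to offer the clean-up to the user, in line with the suggestion that the fix be confirmed before it is applied.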
Once the document has been spell-checked and any corrections made, it needs to be saved into a form which GoldenGATE can open, i.e. as plain text or as HTML.
The standard Save As HTML function in recent versions of Word (Word 2000 and later) embeds large amounts of Office-specific mark-up in the output. Microsoft's goal was to allow movement between native Word and HTML formats without loss of any document metadata or resources (e.g. paragraph and character styles are all saved into the file, regardless of whether they are used or not). As a result, the output is usually horribly bloated with unwanted tags.
Due to pressure from users, Microsoft quickly released a plug-in for Word 2000 on the Windows platform which supported a cleaner Save As output called 'filtered HTML'. This option was made standard in later versions of the application. The filtered output is not fully clean, but it is significantly closer to a state which is suitable for use in GG.
Word 2004 for Macintosh has an ambiguously named option, "Save only display information into HTML", in its Save As HTML options. This appears to produce filtered HTML output.
HTML exported from Word does not include <hr/> (horizontal rule) elements to mark page breaks, as called for in the recommended input file description above. I am unsure whether there is a Word document content element which could be added at the appropriate places prior to export and which would then be converted directly into an <hr/> tag, or whether these tags would need to be added after the HTML had been exported, or as part of the document import process.
Page breaks do appear in the filtered HTML output. For example:
<br clear=all style='page-break-before:always'>
An import function could potentially recognise these elements and replace them with <hr/>.
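As an illustration of what such an import step might do, here is a minimal Python sketch that rewrites Word's page-break elements as <hr/>. It is not part of GoldenGATE. The function name `mark_page_breaks` is hypothetical, and the pattern assumes that page breaks always take the `<br ... page-break-before:always ...>` form shown above.

```python
import re

# Matches a <br> tag whose attributes mention page-break-before:always.
# [^>] also matches newlines, so tags wrapped across lines are covered.
PAGE_BREAK_BR = re.compile(
    r"<br\b[^>]*page-break-before\s*:\s*always[^>]*>",
    re.IGNORECASE,
)


def mark_page_breaks(html: str) -> str:
    """Replace Word's page-break <br> elements with <hr/>."""
    return PAGE_BREAK_BR.sub("<hr/>", html)


sample = "<p>a</p><br clear=all style='page-break-before:always'><p>b</p>"
converted = mark_page_breaks(sample)
```

Ordinary `<br>` elements are left untouched, since they carry no page-break style.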
On Windows, Word exports to HTML using the Windows code page 1252 character set by default. Characters outside of this character set are encoded using either character entity references or numeric character references.
Currently, HTML character entity references are handled correctly, but GoldenGATE's default HTML import process does not translate numeric character references into the appropriate character, instead rendering these as plain text with some extra spaces within them.
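Until numeric character references are handled on import, they could be translated in a pre-processing step. The following Python sketch is one possible approach, not a description of GoldenGATE's behaviour. The function name `decode_numeric_refs` is hypothetical. References to characters that are significant in mark-up (`<`, `>`, `&` and quotes) are deliberately left encoded so that the HTML stays well-formed, and the output file would then need to be saved in a Unicode encoding.

```python
import re

# Decimal (&#945;) and hexadecimal (&#x3B1;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

MARKUP_CHARS = set("<>&\"'")  # left encoded to keep the HTML well-formed


def decode_numeric_refs(html: str) -> str:
    """Replace numeric character references with the characters they name."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            char = chr(code_point)
        except (ValueError, OverflowError):
            return match.group(0)  # not a valid code point: leave as-is
        return match.group(0) if char in MARKUP_CHARS else char

    return NUMERIC_REF.sub(replace, html)


decoded = decode_numeric_refs("&#945;-tubulin &#x3B2; &amp; &#60;")
```

Named entity references such as `&amp;` are not touched, since these are already handled correctly.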
As a result, if Word's HTML output is used, and until numeric character references are supported by GoldenGATE, it is necessary to change Word's default character encoding for HTML export to a suitable Unicode encoding, for example UTF-8.
This can be done by changing a user preference in Word, via:
Tools/Options/General/Web Options/Encoding and selecting "Save this document as Unicode (UTF-8)". Then check "Always save web pages in the default encoding."
A number of third-party tools exist for cleaning Word's HTML output in both the verbose and filtered forms, either as stand-alone tools, web-based tools or functions within applications. An evaluation of these is outside the scope of this project.
It would be entirely possible to manually remove unwanted elements and attributes in GG, but this would be a slow process, especially in large files. The creation of a standard pipeline for this purpose would be very useful as an addition to the application.
For the purposes of this evaluation, I created a simple pipeline to automate the cleaning process, which worked well with the filtered HTML output from the set of documents tested.
The pipeline comprised:
MarkupConverter: cleanupWordHTML.markupConverter (see definition below)
Annotator: PageBorders.annotator
MarkupConverter: layoutArtefactRemover.markupConverter
Analyzer: <Paragraph Structure Normalizer>
Analyzer: <Whitespace Normalizer>
(The PageBorders.annotator was extended to handle some additional page boundary text patterns).
The cleanupWordHTML.markupConverter was defined as follows:
| Mapped Tag | Mapping | Effect |
|---|---|---|
| *.* | #RA | removes all attributes from all elements |
| style | #D | deletes the style element and its content |
| head | #D | deletes the head element and its content |
| div | #R | removes div tags, leaving content intact |
| span | #R | removes span tags, leaving content intact |
| h1 | paragraph | replaces h1 tags with paragraph tags |
| h2 | paragraph | ditto for h2 |
| h3 | paragraph | ditto for h3 |
| h4 | paragraph | ditto for h4 |
| h5 | paragraph | ditto for h5 |
| h6 | paragraph | ditto for h6 |
| i | #R | removes i tags, leaving content intact |
| b | #R | ditto for b |
| html | #R | ditto for html |
Further mappings will need to be added to the above definition to clean documents containing other unwanted mark-up.
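To make the effect of these mappings concrete, the following Python sketch applies equivalent rules outside GoldenGATE, using only the standard library. It is an illustration of the table above, not the Markup Converter's implementation. The class and function names are hypothetical, and comments and doctype declarations are simply dropped, which is an assumption of the sketch rather than a documented behaviour of the tool.

```python
from html.parser import HTMLParser

DELETE = {"style", "head"}                   # '#D': drop element and content
UNWRAP = {"div", "span", "i", "b", "html"}   # '#R': drop tags, keep content
RENAME = {f"h{n}": "paragraph" for n in range(1, 7)}  # h1..h6 -> paragraph


class WordHtmlCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a deleted element

    def handle_starttag(self, tag, attrs):
        if tag in DELETE:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in UNWRAP:
            # '#RA': attributes are never written back out.
            self.out.append(f"<{RENAME.get(tag, tag)}>")

    def handle_endtag(self, tag):
        if tag in DELETE:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in UNWRAP:
            self.out.append(f"</{RENAME.get(tag, tag)}>")

    def handle_startendtag(self, tag, attrs):
        if self.skip_depth == 0 and tag not in DELETE and tag not in UNWRAP:
            self.out.append(f"<{RENAME.get(tag, tag)}/>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")  # keep entity references unchanged

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")  # keep numeric references unchanged


def clean_word_html(html: str) -> str:
    cleaner = WordHtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


sample = ("<html><head><style>p{color:red}</style></head><body>"
          "<div class=Section1><h1 style='x'>Title</h1>"
          "<p class=MsoNormal>Some <i>italic</i> <span lang=EN>text</span>"
          " &amp; more</p></div></body></html>")
cleaned = clean_word_html(sample)
```

Adding a further mapping amounts to adding a tag name to one of the three collections at the top.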
The conversion of all heading tags to <paragraph> as opposed to <p> tags followed the form used by the tutorial file import filter. Converting to <p> is probably equivalent and may in fact be preferable.
One of the papers worked on ('Crypto paper1') was supplied as a Word document with the text broken up into a paragraph for each line of the file. This problem affects files which have been saved as text or copied and pasted from Acrobat Reader. This makes files awkward to work with, as multiline paragraphs have to be merged by hand. This is a lengthy process, especially for the numerous references where only a few lines could be merged at a time.
If Acrobat Reader is not available to the user, a number of utilities exist (e.g. Xpdf and PDF Ripper), both free and commercial, which can be used to extract text or HTML from PDF documents. The text so extracted should have its paragraph structure preserved. Export as HTML may also preserve tables and add links to extracted images. (I am unsure if tables are supported in TaxonX. To be checked).
Interestingly, Adobe offer an on-line PDF conversion service, producing a choice of text or HTML output. The service appears very slow, though.
If the original PDF is not available, and the paragraph structure issue still presents a problem, another approach is required. In order to make working with this type of file easier during testing, I created a small Python script which processed the file using empty lines (and lines with leading tab characters to handle indented paragraphs) as paragraph delimiters, using the original Word document saved as text as input. This was a successful approach. The file's paragraph structure after processing was much closer to the structure as published, reducing the number of paragraphs which needed to be split up. All footnotes were correctly handled (i.e. one footnote per paragraph, including multiline footnotes).
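The script itself is not reproduced here. A minimal sketch of the approach described, in which a new paragraph starts after an empty line or at a line beginning with a tab, might look like the following. The function name `rebuild_paragraphs` is hypothetical, and the sketch makes no attempt to rejoin words hyphenated across line ends.

```python
def rebuild_paragraphs(lines):
    """Merge hard-wrapped lines into paragraphs.

    A new paragraph starts after an empty line, or at a line that begins
    with a tab character (an indented first line).
    """
    paragraphs, current = [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        starts_new = not line.strip() or line.startswith("\t")
        if starts_new and current:
            # Close the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
        if line.strip():
            current.append(line.strip())
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample_lines = [
    "First paragraph starts",
    "and continues here.",
    "",
    "Second paragraph",
    "\tIndented third",
    "wraps onto next line.",
]
rebuilt = rebuild_paragraphs(sample_lines)
```

The function accepts any iterable of lines, so an open text file can be passed to it directly.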
I would suggest that a similar option be provided by GG on importing a text file (or possibly as a supplied tool for use on either the whole file or a selection of the file after loading, assuming that leading tabs or spaces survive the import process).
The current 'Annotate Paragraphs' Analyzer expects content which is effectively one line per paragraph (i.e. no line breaks within a paragraph), which is what would be expected for most input types, e.g. HTML.
In all cases, I would suggest that an original copy of the document as published be available for checking by the user. This would reduce the likelihood of errors in marking up document structure.