Publications for the Web – PDF tool demo and discussion

Presentation by Naupama Rajagopal (Senior Software Developer) and Rebecca Robertson (Web Administrator) of the Department of Building and Housing on 15 December 2009.

The Department of Building and Housing (DBH) faces the same issues as other government agencies when publishing PDF information to the web, such as how to convert a document to HTML and adhere to the New Zealand Government Web Standards.

When the DBH launched its website in 2006 it had hundreds of PDFs with no associated Word documents, which made them difficult to convert to HTML. A need for a process was identified.

Often amendments and changes to the original Word document were sent by email to the vendor, rather than the Word document being updated. The vendor would send through the final PDF version, but there was no longer an associated Word version. DBH looked for a smart way to publish HTML from PDF rather than copying and pasting from the PDF version. The vendor was asked to tag the original for accessibility.

Challenges
Copying and pasting was seen as time consuming: fonts were not correct; letters and/or words were missing; the content often needed to be reformatted; and images were of poorer quality. The example given (Codewords – PDF 470KB) was a 12 page document which took 7 hours to convert.

Solution
Create XML via the vendor’s publication tool, InDesign. XML has broad support and is an open standard. It can be used by a wide variety of applications. The vendor was asked to tag the document at the InDesign document creation stage.

Publishing process

Tagged InDesign layout → XML → XML Converter → Web page (HTML) → published via CMS

XML can be used as a method for tagging text; tagging in InDesign is like formatting a Word document. It does take time, but it is easier to do at the outset than at the finished product stage.

The DBH devised a list of tags that are to be used with the publications. They reuse many templates for their documents, so this solution worked well. The DBH gave the vendor training and written instructions. The vendor currently used by DBH is Scenario, but they may be moving to a new vendor, Pivotal.

The DBH created a document listing the XML tags used. It contains: a list of tags (e.g. <title> = Title of publications); examples of the tags (e.g. <title>This is the title</title>); and notes (e.g. "All XML tags should be properly nested"; "<root> = XML file should start with this tag"; "all attributes inside the XML tags should be enclosed within double quotes"; "Attribute "id" is added to every news item such as <news id="1">").
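The notes in that tag document lend themselves to an automated check. The following is a minimal sketch only, not the DBH's tool: the function name and the sample content are hypothetical, and only <root>, <title> and <news id="1"> come from the notes above. Python's standard XML parser rejects badly nested tags and unquoted attribute values, although it would also accept single-quoted attributes, which the DBH notes do not allow.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that follows the documented notes. The tag names
# <root>, <title> and <news id="..."> are from the notes; the text is invented.
SAMPLE = """<root>
  <title>This is the title</title>
  <news id="1">First news item</news>
  <news id="2">Second news item</news>
</root>"""


def check_tagged_xml(text):
    """Return a list of problems found against the documented tagging notes."""
    try:
        # The parser fails on badly nested tags and unquoted attribute values.
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]

    problems = []
    # Note: "<root> = XML file should start with this tag".
    if root.tag != "root":
        problems.append(f"file should start with <root>, found <{root.tag}>")
    # Note: an "id" attribute is added to every news item.
    for news in root.iter("news"):
        if "id" not in news.attrib:
            problems.append("news item without an id attribute")
    return problems


if __name__ == "__main__":
    print(check_tagged_xml(SAMPLE))
```

An empty list means the file passed these three checks; it says nothing about whether the right tags were applied to the right text.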

The publication "Codewords" (12 pages) cost $500 to be tagged by Scenario. This cost is for tagging, not for producing the publication.

Staff are now able to do value-added work instead of spending a lot of time on PDF conversions. The Information Architecture of an HTML document is what now takes the time, not the HTML conversion.

The InDesign process is to export the document and send it to the DBH on a CD. DBH staff can clean up the XML using Notepad.

XML to HTML Converter Tool
The file is uploaded and the XML is pasted into the text area. To generate the HTML, a user clicks on the Validate button. The Converter Tool uses a common stylesheet (stylesheet.css) across all of the documents converted. A user can preview the converted HTML then save to file.

Currently the DBH validates the document and fixes errors. In the future, the DBH will deliver the Converter Tool to the vendor, who will validate the document and fix errors before resending it for final validation. The vendor sends full-quality images, which the DBH resizes as required. The Tool will find text that hasn't been tagged.
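How the Tool detects untagged text was not described. One plausible approach, sketched here under the assumption that untagged text shows up in the export as loose text sitting directly inside <root> rather than inside a child tag, is to collect that loose text. The function name and the <para> tag are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_untagged_text(xml_text):
    """List text that sits directly under the root element, outside any child tag."""
    root = ET.fromstring(xml_text)
    loose = []
    # Text between the opening root tag and the first child element.
    if root.text and root.text.strip():
        loose.append(root.text.strip())
    for child in root:
        # 'tail' is the text that follows a child's closing tag.
        if child.tail and child.tail.strip():
            loose.append(child.tail.strip())
    return loose


if __name__ == "__main__":
    sample = "<root>Stray intro<title>T</title>stray line<para>P</para></root>"
    print(find_untagged_text(sample))
```

Whitespace between tags is ignored, so a cleanly tagged file returns an empty list.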

It is very easy for the Comms department to use the converted HTML. The tool is flexible and can be customised as required.

XML, tagged InDesign (source file), PDF, and any images are sent to the DBH by the vendor. Complex tables cost more to tag and are not done. At the moment, simple tables are tagged by using the InDesign table tags. With a table, the Converter Tool treats everything in it as a table cell – the content editor must change the table headers, etc, manually. This information has been documented for Comms, and is required for best practice/web standards. Overall, this makes the HTML compliant.
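The manual table fix described above amounts to turning the cells of the header row into header cells. As an illustration only, assuming the Converter Tool emits plain <tr> and <td> markup without attributes and that the first row holds the column headings, the change could be sketched as follows. The function name is hypothetical, and at the DBH this step is done by hand by the content editor.

```python
import re


def promote_first_row_to_headers(table_html):
    """Turn the <td> cells of the first table row into <th scope="col"> cells."""
    # Non-greedy match so that only the first row is picked up.
    match = re.search(r"<tr>.*?</tr>", table_html, flags=re.DOTALL)
    if match is None:
        return table_html  # no rows found, leave the markup untouched
    header_row = (
        match.group(0)
        .replace("<td>", '<th scope="col">')
        .replace("</td>", "</th>")
    )
    return table_html[: match.start()] + header_row + table_html[match.end():]


if __name__ == "__main__":
    table = (
        "<table><tr><td>Fee</td><td>Amount</td></tr>"
        "<tr><td>Consent</td><td>$100</td></tr></table>"
    )
    print(promote_first_row_to_headers(table))
```

Tables with row headers, merged cells or multiple header rows would still need an editor's judgement, which may be why the step is manual.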

The DBH's Statement of Work cost $2,000 for the vendor to tag. This cost is just for the tagging, not for the publication itself. Most medium publications cost $1,000-$2,000. Once tagged in InDesign, the 12 page example "Codewords" took only 3 hours to add to the web and amend as required. Previously it took 7 hours to convert manually.

The XML tags can be designed by the user as they are fully customisable. The Converter Tool swaps XML tags for HTML tags using find and replace, e.g. <heading1> is swapped for the HTML <h1>. The Converter Tool is a custom-made .NET application built by DBH.
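The find-and-replace approach can be illustrated with a short sketch. This is written in Python for brevity and is not the DBH's .NET code; only the <heading1> to <h1> pair comes from the presentation, and the other entries in the mapping are hypothetical.

```python
# Mapping from custom XML tag names to HTML tag names.
TAG_MAP = {
    "heading1": "h1",  # the example given in the presentation
    "heading2": "h2",  # hypothetical
    "para": "p",       # hypothetical
}


def convert_xml_to_html(xml_text, tag_map=TAG_MAP):
    """Swap each custom XML tag for its HTML equivalent by plain find and replace."""
    html = xml_text
    for xml_tag, html_tag in tag_map.items():
        html = html.replace(f"<{xml_tag}>", f"<{html_tag}>")
        html = html.replace(f"</{xml_tag}>", f"</{html_tag}>")
    return html


if __name__ == "__main__":
    print(convert_xml_to_html("<heading1>Codewords</heading1><para>Text</para>"))
```

Because the tag list is just a mapping, an agency could supply its own tags without changing the conversion logic. Plain find and replace does not handle tags that carry attributes, such as <news id="1">; those would need a pattern match or a real XML parser.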

The Converter Tool indicates where alt text is required for images, and Comms editors need to add the actual alt text to the final HTML. The images are removed; Comms resize each image, load it to the CMS, add alt text, and link to the correct path/location in the case of hyperlinks.
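A check of this kind can be sketched with Python's standard HTML parser. This is an assumption about how such a flag might work, not a description of the DBH Tool, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> that has no alt text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            # A missing, valueless or blank alt attribute is flagged for review.
            if not (attributes.get("alt") or "").strip():
                self.missing.append(attributes.get("src", "(no src)"))


def images_missing_alt(html):
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


if __name__ == "__main__":
    page = '<p><img src="a.jpg" alt="A house"><img src="b.jpg"><img src="c.jpg" alt=""/></p>'
    print(images_missing_alt(page))
```

An empty alt attribute is legitimate for purely decorative images, so the result is a list for an editor to review rather than a list of errors.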

Agencies need a vendor who will tag InDesign documents at a reasonable cost. Scenario use InDesign 2 for exporting XML. The latest version is InDesign 4, which should have better tagging capabilities.

DBH will be moving the Converter Tool from .NET to a web-based application so that other government agencies can upload their own stylesheets and use the Tool. DBH are looking at negotiating a licence with Government Technology Services (GTS) so that government agencies can use the Tool at no cost.

Each business unit pays for the InDesign tagging. Comms has a contract with the vendor, and the business has sign-off and pays for this tagging.

Notes