Validate OOXML
Posted 12-11-2020 at 10:27 AM by Michael Uplawski
Updated 12-15-2020 at 12:05 AM by Michael Uplawski (moved one paragraph, cosmetics, threads linked.)
Tags ooxml, schema-validation, word-processor, xmllint
Validate OOXML
Ensure the standard-conformance of OOXML-documents created from scratch.
A styled version of this document: http://www.uplawski.eu/articles/Linu...ate_ooxml.html
See also the two threads concerning this topic:
- https://www.linuxquestions.org/quest...6/#post6194391
- https://www.linuxquestions.org/quest...nt-4175650279/
Contents
- Disclaimer
- Introduction
- Motivation
- XML Schema validation with xmllint
↠ Set-up step by step
↠ Invoking xmllint
Disclaimer
I have invented none of this. I neither discovered nor developed the procedures described on this page. All the knowledge that I merely reproduce here was communicated to me by NevemTeve in a discussion on LinuxQuestions.org.
Introduction
Even though modern wordprocessors offer many functions which facilitate the writing of complex text-documents, some recurring tasks, which are performed often and in the same way within the same document, cannot be completely automated with the commands that the program offers. As documents are produced for specific purposes and the needs of individual users cannot be anticipated in every detail, software-companies integrate scripting interfaces into their office-software.
Where such a scripting interface is not present, you can still automate the generation and manipulation of office-documents.
Modern wordprocessors read from and write to compressed XML-files. The Microsoft® file-format OOXML (e.g. in docx-files) as well as ODF are based on XML. To read and manipulate the content and formatting of such documents, you only need to edit the XML-files which you discover after unzipping an ODT- or DOCX-file:
Code:
user@machine:/tmp/docx$ unzip ../rudi.docx
Archive: ../rudi.docx
inflating: _rels/.rels
inflating: docProps/core.xml
inflating: docProps/app.xml
inflating: word/_rels/document.xml.rels
inflating: word/document.xml
inflating: word/styles.xml
inflating: word/fontTable.xml
inflating: word/settings.xml
inflating: [Content_Types].xml
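If you only want to look at one part of the archive, you do not have to unpack everything. The following is a minimal sketch, not part of the original procedure: the function name docx_part is hypothetical, and it assumes that the unzip program is installed. It relies on the -p option of unzip, which writes a member of the archive to standard output.

```shell
#!/bin/sh
# Hypothetical helper, not from the original article.
# Prints one part of a DOCX (or ODT) archive to standard output.
# Usage: docx_part FILE [PART]   (PART defaults to word/document.xml)
docx_part() {
    archive=${1:-}
    part=${2:-word/document.xml}

    if [ ! -f "$archive" ]; then
        echo "docx_part: no such file: $archive" >&2
        return 1
    fi

    # unzip -p extracts the named member to stdout instead of to disk.
    unzip -p "$archive" "$part"
}
```

For example, docx_part ../rudi.docx | less shows the main document, and docx_part ../rudi.docx word/styles.xml shows the style definitions.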
You can find the meaning of each of the XML-tags, all the possible XML-attributes, as well as the rules for their deployment in specific contexts on specialised web-sites. Here, I want to concentrate on OOXML only: http://officeopenxml.com/index.php
Motivation
Writing OOXML from scratch can be complicated. As long as you only modify text-nodes, little can go wrong. But as soon as you manipulate XML-tags or introduce more tags and more complex tag-structures into your document, you have to be careful to strictly obey the rules of the OOXML standard. Where programmed routines are responsible for those manipulations, they can rapidly and profoundly alter the file-structures together with the actual content.
Even if, after opening the resulting document in your wordprocessor, all looks fine and just as you want it, other programs can run into trouble if your OOXML code is not what they expect. But interoperability, comparability and comprehension are what standards are meant to achieve in the first place. You should, therefore, routinely validate your own OOXML-documents against the OOXML-standard to be sure that the routines which generate or modify OOXML files work reliably in all situations.
This document describes a way to validate OOXML wordprocessor files against the pertinent OOXML Schemas, in order to locate and identify potential errors.
XML Schema validation with xmllint
I prefer to first present the command-line which you will execute to validate a wordprocessor-file, and to explain its components. The objective is then to ensure that the conditions for the successful execution of the command are met (read on below).
One last remark: a surprising number of file-manipulations are needed before you can validate OOXML with the procedure that I chose to present on this page. I consider this unsatisfactory and am still looking for a simplification. But note also that, once the preparations are completed, repeated validations are as easy as launching xmllint with the few arguments that are included in the command shown here:
Code:
xmllint -noout -debugent -schema ooxml_xsd/wml.xsd document.xml
xmllint
xmllint is an XML-parser for many purposes. Consult the xmllint man-page for the complete description of its many options. On a Linux system, xmllint is part of libxml2.
-noout
This option specifies that xmllint shall not produce output other than potential error- and warning-messages.
-debugent
Debugging information will be printed about the entities which are defined in the source-document.
-schema
The location of the initial schema-file, which will be read to compare the source-document to the standard.
document.xml
The XML-document which is validated. document.xml is also the main component of an OOXML wordprocessor file. This is where the textual content and the structure of the enclosing tags are found, like in this (scrollable) example of a file document.xml:
Code:
<?xml version="1.0" encoding="utf-8" standalone="yes"?> <w:document xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" mc:Ignorable="w14 wp14"> <w:body> <w:p> <w:pPr> <w:pStyle w:val="Heading1" /> <w:bidi w:val="0" /> <w:spacing w:before="240" w:after="120" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Validate OOXML</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="TextBody" /> <w:bidi w:val="0" /> <w:spacing w:lineRule="auto" w:line="276" w:before="0" w:after="140" /> <w:jc w:val="left" /> 
<w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Ensure the standard-conformance of OOXML-documents created from scratch.</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="Heading2" /> <w:bidi w:val="0" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Contents</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="Heading2" /> <w:bidi w:val="0" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:bookmarkStart w:id="0" w:name="intro" /> <w:bookmarkEnd w:id="0" /> <w:r> <w:rPr></w:rPr> <w:t>Introduction</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="TextBody" /> <w:bidi w:val="0" /> <w:spacing w:lineRule="auto" w:line="276" w:before="0" w:after="140" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Even if modern wordprocessors are enriched with many functions which facilitate the redaction of complex text-documents, some recurring tasks, which are performed often and in the same way within the same document, cannot be completely automated with the commands that the program offers. As documents are produced for specific purposes and the needs of individual users cannot be anticipated in all detail, software-companies integrate scripting interfaces to their office-software.</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="TextBody" /> <w:bidi w:val="0" /> <w:spacing w:lineRule="auto" w:line="276" w:before="0" w:after="140" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Where such a scripting interface is not present, you can still automate the generation and manipulation of office-documents.</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="TextBody" /> <w:bidi w:val="0" /> <w:spacing w:lineRule="auto" w:line="276" w:before="0" w:after="140" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Modern wordprocessors read from and write to compressed XML-files. The Microsoft® file-format OOXML – e.g in docx-files – as well as ODF base on XML. 
To read and manipulate the content and formatting of such documents you only need to edit the XML-files which you discover after unzipping an ODT- or DOCX-file.</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="TextBody" /> <w:bidi w:val="0" /> <w:spacing w:lineRule="auto" w:line="276" w:before="0" w:after="140" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>You can find the meaning of each of the XML-tags all the possible XML-attributes, as well as the rules for their deployment in specific contexts on specialised web-sites. Here, I want to concentrate on OOXML only: [ OOXML - reference goes here ]</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="Heading2" /> <w:bidi w:val="0" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:bookmarkStart w:id="1" w:name="motivation" /> <w:bookmarkEnd w:id="1" /> <w:r> <w:rPr></w:rPr> <w:t>Motivation</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="TextBody" /> <w:bidi w:val="0" /> <w:spacing w:lineRule="auto" w:line="276" w:before="0" w:after="140" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Writing OOXML from scratch can be complicated. As long as you do only modify text-nodes, nothing can happen. But as soon as you manipulate XML-nodes or introduce more tags and complexer tag-structures to your document, you have to be careful to obey st</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="Normal" /> <w:bidi w:val="0" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> </w:r> </w:p> <w:sectPr> <w:type w:val="nextPage" /> <w:pgSz w:w="12240" w:h="15840" /> <w:pgMar w:left="1134" w:right="1134" w:header="0" w:top="1134" w:footer="0" w:bottom="1134" w:gutter="0" /> <w:pgNumType w:fmt="decimal" /> <w:formProt w:val="false" /> <w:textDirection w:val="lrTb" /> </w:sectPr> </w:body> </w:document>
Before you can validate anything, you must ensure that all the necessary schemas, in the form of *.xsd files, can be accessed by the XML-parser.
I will show you the steps to establish this “validating-environment”.
Set-up step by step
I. Provide the schema catalog
Ensure that the file /usr/local/etc/xml/catalog exists; otherwise create it, as root:
Code:
$ mkdir -p /usr/local/etc/xml
$ cat >/usr/local/etc/xml/catalog <<DONE
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN" "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<uri name="http://www.w3.org/XML/1998/namespace" uri="file:///usr/local/etc/xml/xml_2009_01.xsd"/>
<nextCatalog catalog="file:///etc/xml/catalog"/>
</catalog>
DONE
II. Provide xml.xsd
Ensure that /usr/local/etc/xml/catalog contains the line
Code:
<uri name="http://www.w3.org/XML/1998/namespace" uri="file:///usr/local/etc/xml/xml_2009_01.xsd"/>
If it is missing, insert the line before the tag <nextCatalog>, just as it is shown above.
You must also get the actual file xml.xsd:
Code:
wget -O /usr/local/etc/xml/xml_2009_01.xsd http://www.w3.org/2009/01/xml.xsd
III. Provide the OOXML-xsd files
The schema files can be downloaded from https://repo1.maven.org/maven2/org/apache/poi/ooxml-schemas/1.4/ .
Choose the file ooxml-schemas-1.4.jar and download it.
Unzip the file, e.g. to your temporary directory, and locate the xsd-files in the sub-directory /schemaorg_apache_xmlbeans/src . Move all the xsd-files to a directory that will be accessible later, when calling the XML-parser, e.g. a sub-directory of your working-directory:
Code:
:~/project$ mkdir ooxml_xsd
:~/project$ cd ooxml_xsd
:~/project/ooxml_xsd$ mv /tmp/schemaorg_apache_xmlbeans/src/*.xsd ./
IV. Complete wml.xsd
Open the schema file wml.xsd and find the tag <xsd:import> with the id xml (the fourth at the time of this writing). Complete this line with the schemaLocation attribute or replace it, so that it is identical to the following:
Code:
<xsd:import id="xml" namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/XML/1998/namespace"/>
V. Consolidate duplicated import of the same namespace in dml-wordprocessingDrawing.xsd
Create an xsd-file dml-wordprocessingDrawing_import.xsd with the following content:
Code:
<?xml version="1.0" encoding="utf-8"?>
<xsd:schema targetNamespace="http://schemas.openxmlformats.org/drawingml/2006/main" elementFormDefault="qualified" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:include schemaLocation="dml-graphicalObject.xsd"/>
<xsd:include schemaLocation="dml-documentProperties.xsd"/>
</xsd:schema>
Now open the schema file dml-wordprocessingDrawing.xsd . Replace the two tags <xsd:import> with the schemaLocations dml-graphicalObject.xsd and dml-documentProperties.xsd by one single line which imports only the newly created schema-file:
Code:
<xsd:import schemaLocation="dml-wordprocessingDrawing_import.xsd" namespace="http://schemas.openxmlformats.org/drawingml/2006/main" />
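Because the set-up consists of several manual steps, it can help to verify the result before the first validation. The following is only a sketch under a few assumptions, none of which come from the original procedure: the function name check_ooxml_setup is hypothetical, the file xml_2009_01.xsd is expected in the same directory as the catalog, and the <xsd:import> tag with the id xml is expected to sit on a single line of wml.xsd. The function changes nothing; it only reports what appears to be missing.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of the original procedure.
# Usage: check_ooxml_setup [CATALOG] [XSD_DIR]
check_ooxml_setup() {
    catalog=${1:-/usr/local/etc/xml/catalog}
    xsd_dir=${2:-ooxml_xsd}
    status=0

    # Steps I and II: the catalog must exist and map the XML namespace.
    if [ ! -f "$catalog" ]; then
        echo "missing catalog: $catalog"
        status=1
    elif ! grep -q 'name="http://www.w3.org/XML/1998/namespace"' "$catalog"; then
        echo "catalog lacks the <uri> entry for the XML namespace: $catalog"
        status=1
    fi

    # Step II: the local copy of xml.xsd (assumed to sit next to the catalog).
    xml_xsd=$(dirname "$catalog")/xml_2009_01.xsd
    if [ ! -f "$xml_xsd" ]; then
        echo "missing local copy of xml.xsd: $xml_xsd"
        status=1
    fi

    # Steps III and V: the schema files must be present in XSD_DIR.
    for xsd in wml.xsd dml-wordprocessingDrawing.xsd \
               dml-wordprocessingDrawing_import.xsd; do
        if [ ! -f "$xsd_dir/$xsd" ]; then
            echo "missing schema: $xsd_dir/$xsd"
            status=1
        fi
    done

    # Step IV: the import with the id "xml" needs a schemaLocation
    # (line-based check, assumes the tag is not wrapped over several lines).
    if [ -f "$xsd_dir/wml.xsd" ] &&
       ! grep 'id="xml"' "$xsd_dir/wml.xsd" | grep -q 'schemaLocation='; then
        echo "wml.xsd: no <xsd:import id=\"xml\"> line with a schemaLocation"
        status=1
    fi

    return "$status"
}
```

Called without arguments from the working-directory, check_ooxml_setup uses the paths from the steps above. It prints nothing and returns 0 when everything it looks for is in place.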
Invoking xmllint
The call to xmllint is already shown above, but before executing the command, you must remember to set the environment variable XML_CATALOG_FILES to the location of the schema catalog; otherwise, the standard path /etc/xml/catalog would be read. This is an example of a successful validation with xmllint after having completed the preparatory tasks listed above:
Code:
user@machine:/tmp$ export XML_CATALOG_FILES=/usr/local/etc/xml/catalog
user@machine:/tmp$ xmllint -noout -debugent -schema ~/prog/ooxml_xsd/wml.xsd ./docx/word/document.xml
new input from file: /prog/ooxml_xsd/wml.xsd
new input from file: /prog/ooxml_xsd/shared-customXmlSchemaProperties.xsd
new input from file: /prog/ooxml_xsd/shared-math.xsd
new input from file: /prog/ooxml_xsd/dml-wordprocessingDrawing.xsd
new input from file: /prog/ooxml_xsd/dml-wordprocessingDrawing_import.xsd
new input from file: /prog/ooxml_xsd/dml-graphicalObject.xsd
new input from file: /prog/ooxml_xsd/dml-documentProperties.xsd
new input from file: /prog/ooxml_xsd/dml-baseTypes.xsd
new input from file: /prog/ooxml_xsd/shared-relationshipReference.xsd
new input from file: /prog/ooxml_xsd/dml-shapeGeometry.xsd
new input from file: /prog/xml.xsd
new input from file: docx/word/document.xml
docx/word/document.xml validates
DOCUMENT
No entities in internal subset
No entities in external subset
Now please just believe me: This is cool.
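For repeated validations, the two commands shown above can be wrapped in a small shell function. This is a sketch, not the author's method: the name validate_wml and its default paths are assumptions. It sets XML_CATALOG_FILES for the xmllint call only, uses the same options as the command above, and hands the exit status of xmllint back to the caller, so that it can be used in scripts.

```shell
#!/bin/sh
# Hypothetical wrapper, not from the original article.
# Usage: validate_wml DOCUMENT_XML [WML_XSD] [CATALOG]
validate_wml() {
    document=${1:-}
    schema=${2:-ooxml_xsd/wml.xsd}
    catalog=${3:-/usr/local/etc/xml/catalog}

    if [ -z "$document" ]; then
        echo "usage: validate_wml DOCUMENT_XML [WML_XSD] [CATALOG]" >&2
        return 2
    fi

    # Pass the catalog location to this one xmllint call instead of
    # exporting it for the whole session.
    if XML_CATALOG_FILES=$catalog \
       xmllint -noout -debugent -schema "$schema" "$document"; then
        echo "OK: $document"
    else
        rc=$?
        echo "FAILED (xmllint exit status $rc): $document" >&2
        return "$rc"
    fi
}
```

A call like validate_wml ./docx/word/document.xml ~/prog/ooxml_xsd/wml.xsd corresponds to the example above. In a script, the return value can decide whether a generated document is delivered or rejected.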