Office XML-file hacking anybody?

Michael Uplawski · 03-03-2018, 03:29 AM

Good morning.

As a user of SoftMaker® office (closed-source, commercial) I am quite fond of their support for some of the XML-file formats in use nowadays.

While I seek to automate modifications to these files with shell or Ruby scripts, I have not yet advanced beyond removing unwanted content from the unzipped version of the files, e.g. removing a SoftMaker-specific tag from the definition of “embedded” graphic files or removing all links to additional workbooks in an XLSX file.
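
To illustrate the kind of removal described above, here is a minimal sketch in Python (standard library only) that copies a package and drops every element with a given tag from one XML part. The function name strip_elements and its parameters are made up for this illustration. One caveat I am aware of: ElementTree only writes namespace declarations that are actually used, so a part that refers to otherwise unused prefixes (for example through an mc:Ignorable attribute) may need more careful handling.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def strip_elements(src_path, dst_path, part_name, tag):
    """Copy a zipped office package, dropping every `tag` element from one part.

    `tag` uses ElementTree's "{namespace-uri}localname" notation.
    Returns how many matching elements were found and detached.
    (Hypothetical helper, not part of any office API.)
    """
    removed = 0
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == part_name:
                # Re-register the prefixes found in the part so that the
                # output keeps them instead of generated ns0, ns1, ...
                for _, (prefix, uri) in ET.iterparse(io.BytesIO(data),
                                                     events=("start-ns",)):
                    try:
                        ET.register_namespace(prefix, uri)
                    except ValueError:
                        pass  # reserved prefix such as "ns0": let ET pick one
                root = ET.fromstring(data)
                for parent in list(root.iter()):
                    for child in list(parent):
                        if child.tag == tag:
                            parent.remove(child)
                            removed += 1
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            # Reusing the ZipInfo keeps name, order and compression type.
            dst.writestr(info, data)
    return removed
```

A call could look like strip_elements("in.docx", "out.docx", "word/document.xml", "{some-namespace-uri}sometag"), where the namespace and tag name have to be taken from the actual file.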

In the future, I want to limit all embedded images in a document (docx or tmdx) to a size that corresponds to what is displayed, rather than have the images scaled to fit the document. On the XML level, these images are just linked in and the graphic files are provided in a sub-folder. I hope to make some documents much smaller this way.
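
As a starting point for the image idea, the following sketch reads the displayed size of each picture from a DOCX main part. It assumes the usual WordprocessingML layout, where a wp:inline or wp:anchor element carries a wp:extent with cx/cy in EMU (914400 per inch) and an a:blip that references the image by relationship id. The function name displayed_sizes and the dpi default are made up here; I cannot say how tmdx stores this, and the actual resampling of the image files is left out because it needs an imaging library.

```python
import xml.etree.ElementTree as ET

EMU_PER_INCH = 914400  # English Metric Units per inch

NS = {
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}


def displayed_sizes(document_xml, dpi=150):
    """Return {relationship id: (width_px, height_px)} for embedded pictures.

    `document_xml` is the content of word/document.xml (bytes or str).
    The pixel size is what the displayed extent needs at `dpi` dots per inch.
    (Hypothetical helper; `dpi` is an assumed target resolution.)
    """
    root = ET.fromstring(document_xml)
    sizes = {}
    for kind in ("wp:inline", "wp:anchor"):
        for drawing in root.iterfind(".//" + kind, NS):
            extent = drawing.find("wp:extent", NS)
            blip = drawing.find(".//a:blip", NS)
            if extent is None or blip is None:
                continue
            rel_id = blip.get("{%s}embed" % NS["r"])
            if rel_id is None:
                continue  # linked rather than embedded picture
            width = round(int(extent.get("cx")) * dpi / EMU_PER_INCH)
            height = round(int(extent.get("cy")) * dpi / EMU_PER_INCH)
            # One image file may be shown several times: keep the largest use.
            old = sizes.get(rel_id, (0, 0))
            sizes[rel_id] = (max(old[0], width), max(old[1], height))
    return sizes
```

Any image file whose real pixel size is much larger than the value reported here would be a candidate for downscaling.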

Do you have experience with this kind of manipulation, and do you have examples and best practices to share?

One reason for my endeavor is the fact that SoftMaker do not provide a programming interface in the Linux-version of their office-package. Under Windows, there is a Basic-dialect and the usual OLE/COM-interface.

In the end, I deem hacking the XML-code much better, also because there are so many ways to achieve the same.

ondoho · 03-04-2018, 02:23 AM

XML manipulating software exists.
xmlstarlet might be a good fit.
i use xmllint, but i think it's read-only, and fairly low-level.
still sounds like a lot of work to script it.

Michael Uplawski · 03-04-2018, 12:00 PM

Thanks ondoho.

My XML software is
*) nokogiri, either as a “stand-alone” command-line tool or as a module in my Ruby programs
*) Apache FOP, as an XSL-FO processor, stand-alone or in Java.
*) Firefox to open and view Office XML files.

What I find difficult is the documentation of the OOXML standard and how I should apply the information to my own task at hand. I guess I would need simple examples to get on with my coding work... I do not need to create new office documents for the time being, and a downright conversion between file formats is also not needed.

It is quite clear that everybody has different needs and if one “must” manipulate ODS-, Docx- or XLSX-files, it will be for a specific task. However, as my coding skills are diminishing I hope to get inspiration from the work and words of others...

What I seek in the end is a way to automate these manipulations, like you do with the OLE/COM interface and e.g. Visual Basic or any other language that provides OLE functionality. I am not talking about occasional interventions to modify one specific file. That would be too simple, and I understand your remark in this way, too.

Cheerio.

sundialsvcs · 03-05-2018, 07:30 AM

It's actually an easy thing to do. The technology that you want is called XSLT.

Like all such XML formats, the OpenOffice document formats are standardized and described by a so-called "schema." They can be validated, using appropriate tools, to prove that any document conforms to its schema.

Then, XSLT allows you to write transformation rules – built in XML – to convert one XML structure to another. You can write rules, without writing a single computer program, to "transform away" whatever you don't want, or to turn it into something else. You then re-validate the resulting file to be sure that it still conforms to the schema: this is an automated way to be sure that OO will probably still accept it, even if it doesn't do the right thing with it. (Yes, XSLT transforms can have "bugs" in them.)

You don't have to write a single line of "custom programming" to do any of these things.

(DocBook, which is "where all those technical books with animals on the cover" actually came from, is an astonishing demonstration of what XSLT can do.)

Michael Uplawski · 03-05-2018, 12:05 PM

Quote:

Originally Posted by sundialsvcs

It's actually an easy thing to do. The technology that you want is called XSLT.

XSLT is one technology to do it. I do it. I did it.
But I express myself badly, miserably, it appears. You dwell on the technology. I know the technology.

My problem is with the application and the so-called “standards”. As I seek examples, I may just as well describe one myself: someone wants to modify all inline images in a text-processor file format, e.g. to replace the current images with new ones.
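
For the image-replacement example, a sketch under stated assumptions: in a DOCX package the images are separate parts, and word/_rels/document.xml.rels maps relationship ids to them, so the picture bytes can be swapped without touching the tag hierarchy in document.xml at all. The helper names image_parts and replace_images are made up for this illustration, and the sketch assumes the new image has the same file format as the old one, since otherwise the part name and [Content_Types].xml would need changes too.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_TYPE = ("http://schemas.openxmlformats.org/officeDocument/2006/"
              "relationships/image")


def image_parts(docx_path):
    """Map relationship ids to the image part names inside a DOCX package."""
    with zipfile.ZipFile(docx_path) as pkg:
        rels = ET.fromstring(pkg.read("word/_rels/document.xml.rels"))
    parts = {}
    for rel in rels.iter("{%s}Relationship" % REL_NS):
        if rel.get("Type") != IMAGE_TYPE or rel.get("TargetMode") == "External":
            continue
        # Targets are relative to the word/ folder unless they start with "/".
        name = posixpath.normpath(posixpath.join("word", rel.get("Target")))
        parts[rel.get("Id")] = name.lstrip("/")
    return parts


def replace_images(src_path, dst_path, new_images):
    """Copy the package, swapping the bytes of the parts named in new_images.

    `new_images` maps part names (e.g. "word/media/image1.png") to bytes.
    """
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename in new_images:
                data = new_images[info.filename]
            else:
                data = src.read(info.filename)
            dst.writestr(info, data)
```

The ids returned by image_parts are the same ones that the r:embed attributes in document.xml refer to, which is how a particular picture in the text can be matched to its file.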

I find terrible tag hierarchies where I had expected just one or two tags. Reading up on the meaning of these tags, I get lost and shut down my computer.

In the past, I have created many different reports from different data sources using XSLT, either from a Java program or by employing the Apache FOP “XSL-FO” processor. Some of my most recent tools use nokogiri to convert XML to PDF or to analyze (X)HTML for useful content. All of this is easy compared with the task of finding your way through the labyrinthine OOXML code.

In my own opinion, of course.

But folks, I tend to let it rest for now. I will be playing around with what I have and maybe publish some of my findings (as usual) in my blog, if I deem them worth it.

Thanks anyway for your responses.

ondoho · 03-06-2018, 12:13 AM

if libreoffice uses it, it must be possible to get at the specs.
maybe: https://duckduckgo.com/html?q=ooxml%20specifications

another search query: https://duckduckgo.com/html?q=linux%...20office%20xml
it seems you will have to read a few microsoft documents even if you want to do this on linux :-(