LinuxQuestions.org - [SOLVED] Bash script to read list from xml file for requests with wget

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - Bash script to read list from xml file for requests with wget (https://www.linuxquestions.org/questions/programming-9/bash-script-to-read-list-from-xml-file-for-requests-with-wget-4175712011/)

Bash script to read list from xml file for requests with wget

I have a list of Sentinel-2 files in XML format. I don't really know how to parse XML in code. I want to download the files. I have a wget statement that can do list item at time. I want to code a wget statement that can get each item in the list.

This is a small part of the list:

Code:

<?xml version="1.0" encoding="UTF-8"?><metalink xmlns="urn:ietf:params:xml:ns:metalink">

<file name="S2A_MSIL1C_20151116T161042_N0204_R097_T18TTM_20151116T161042"><hash type="MD5">7c3d80be6658e02f408329cf9f194bce</hash><size>814488077</size><url>https://scihub.copernicus.eu/dhus/odata/v1/Products('80ba113d-b5c5-4f4f-8e4e-3231bd2c2859')/$value</url></file><file name="S2A_MSIL1C_20151116T161042_N0204_R097_T17TPG_20151116T161042"><hash type="MD5">b4bbfe460677864d7e31fa6833981df1</hash><size>653453581</size><url>https://scihub.copernicus.eu/dhus/odata/v1/Products('d0d43133-8089-4ec9-ab35-37d26323af63')/$value</url></file><file name="S2A_MSIL1C_20151116T161042_N0204_R097_T17TQJ_20151116T161042"><hash type="MD5">461F32396E40CD40D431A3BCB0AEB587</hash><size>680393012</size><url>https://scihub.copernicus.eu/dhus/odata/v1/Products('cefbc29e-f567-4410-b0b3-860617cf6a19')/$value</url></file><file name="S2A_MSIL1C_20151116T161042_N0204_R097_T18TUP_20151116T161042"><hash type="MD5">C46D1D72A3D2C45B2D5C86C28AA1B2FE</hash><size>668335329</size><url>https://scihub.copernicus.eu/dhus/odata/v1/Products('63e2415f-249c-4d68-a908-d26efc34bf82')/$value</url></file><file name="S2A_MSIL1C_20151116T161042_N0204_R097_T17TPJ_20151116T161042"><hash type="MD5">64540FA552D95E3C61B9AC8040469DB2</hash><size>273103410</size><url>https://scihub.copernicus.eu/dhus/odata/v1/Products('c7de7226-5769-451d-9348-b8bf85e7b93f')/$value</url></file><file name="S2A_MSIL1C_20151116T161042_N0204_R097_T18TUN_20151116T161042"><hash type="MD5">17C2E9C89EC1DEDBE3D04BB528184028</hash><size>822199792</size><url>https://scihub.copernicus.eu/dhus/odata/v1/Products('13168064-4914-4320-aecc-2482b7f9d005')/$value</url></file>

This is the wget statement. It worked for one file and I want a loop or some other way to feed the whole list to it and make it run over each item.

Code:

wget --content-disposition --continue --user=... --password=... "https://scihub.copernicus.eu/dhus/odata/v1/Products('80ba113d-b5c5-4f4f-8e4e-3231bd2c2859')/\$value"

I'm using Slackware64 14.2, but I don't think that matter so much, as long as the code will work in linux.

You could use a specialist tool like xmlstarlet, although personally I'd use Perl which has module(s) specifically for doing that and you could incorporate the equiv of wget as well.

Yeah, ruby for me and python for others. You can bash it (pardon the pun), but the sed/awk/grep option might be encumbersome.

You need an XML parser to handle XML, but getting XmlStarlet to extract the URLs is not quite as straight forward as it could be...

Code:

$ xmlstarlet select -t -v '/_:metalink/_:file/_:url' -n input.xml

https://scihub.copernicus.eu/dhus/odata/v1/Products('80ba113d-b5c5-4f4f-8e4e-3231bd2c2859')/$value

https://scihub.copernicus.eu/dhus/odata/v1/Products('d0d43133-8089-4ec9-ab35-37d26323af63')/$value

https://scihub.copernicus.eu/dhus/odata/v1/Products('cefbc29e-f567-4410-b0b3-860617cf6a19')/$value

https://scihub.copernicus.eu/dhus/odata/v1/Products('63e2415f-249c-4d68-a908-d26efc34bf82')/$value

https://scihub.copernicus.eu/dhus/odata/v1/Products('c7de7226-5769-451d-9348-b8bf85e7b93f')/$value

https://scihub.copernicus.eu/dhus/odata/v1/Products('13168064-4914-4320-aecc-2482b7f9d005')/$value

Where input.xml is a file containing your code, but with the missing "</metalink>" appended. (You could instead from stdin.)

See "xmlstarlet select --help" (or online user guide) for explanation of the syntax (which does not follow typical command-line conventions).

The "_:" bit in the XPath expression is required due to the xmlns attribute on the root element.

Without a xmlns it would just be "/metalink/file/url" (and trying to use _: there results in an "Undefined namespace prefix" error.)

If there might be other "url" tags outside of "file" tags, you could use:

Code:

$ xmlstarlet select -t -v '//_:url' -n input.xml

The output of both is a newline-delimited list, which can be looped in the usual way and passed to wget/curl/whatever.

The "-n" is needed to output a trailing newline at the end, which some command-line tools are picky about.