[SOLVED] Bash script to read list from xml file for requests with wget

Tsuga · 05-12-2022, 07:42 PM

I have a list of Sentinel-2 files in XML format. I don't really know how to parse XML in code. I want to download the files. I have a wget statement that can do list item at time. I want to code a wget statement that can get each item in the list.

This is a small part of the list:

Code:

<?xml version="1.0" encoding="UTF-8"?><metalink xmlns="urn:ietf:params:xml:ns:metalink">
<file name="S2A_MSIL1C_20151116T161042_N0204_R097_T18TTM_20151116T161042"><hash type="MD5">7c3d80be6658e02f408329cf9f194bce</hash><size>814488077</size><url>https://scihub.copernicus.eu/dhus/odata/v1/Products('80ba113d-b5c5-4f4f-8e4e-3231bd2c2859')/$value</url></file><file name="S2A_MSIL1C_20151116T161042_N0204_R097_T17TPG_20151116T161042"><hash type="MD5">b4bbfe460677864d7e31fa6833981df1</hash><size>653453581</size><url>https://scihub.copernicus.eu/dhus/odata/v1/Products('d0d43133-8089-4ec9-ab35-37d26323af63')/$value</url></file><file name="S2A_MSIL1C_20151116T161042_N0204_R097_T17TQJ_20151116T161042"><hash type="MD5">461F32396E40CD40D431A3BCB0AEB587</hash><size>680393012</size><url>https://scihub.copernicus.eu/dhus/odata/v1/Products('cefbc29e-f567-4410-b0b3-860617cf6a19')/$value</url></file><file name="S2A_MSIL1C_20151116T161042_N0204_R097_T18TUP_20151116T161042"><hash type="MD5">C46D1D72A3D2C45B2D5C86C28AA1B2FE</hash><size>668335329</size><url>https://scihub.copernicus.eu/dhus/odata/v1/Products('63e2415f-249c-4d68-a908-d26efc34bf82')/$value</url></file><file name="S2A_MSIL1C_20151116T161042_N0204_R097_T17TPJ_20151116T161042"><hash type="MD5">64540FA552D95E3C61B9AC8040469DB2</hash><size>273103410</size><url>https://scihub.copernicus.eu/dhus/odata/v1/Products('c7de7226-5769-451d-9348-b8bf85e7b93f')/$value</url></file><file name="S2A_MSIL1C_20151116T161042_N0204_R097_T18TUN_20151116T161042"><hash type="MD5">17C2E9C89EC1DEDBE3D04BB528184028</hash><size>822199792</size><url>https://scihub.copernicus.eu/dhus/odata/v1/Products('13168064-4914-4320-aecc-2482b7f9d005')/$value</url></file>

This is the wget statement. It worked for one file and I want a loop or some other way to feed the whole list to it and make it run over each item.

Code:

wget --content-disposition --continue --user=... --password=... "https://scihub.copernicus.eu/dhus/odata/v1/Products('80ba113d-b5c5-4f4f-8e4e-3231bd2c2859')/\$value"

I'm using Slackware64 14.2, but I don't think that matter so much, as long as the code will work in linux.

chrism01 · 05-12-2022, 11:39 PM

You could use a specialist tool like xmlstarlet, although personally I'd use Perl which has module(s) specifically for doing that and you could incorporate the equiv of wget as well.

grail · 05-13-2022, 12:07 AM

Yeah, ruby for me and python for others. You can bash it (pardon the pun), but the sed/awk/grep option might be encumbersome.

boughtonp · 05-13-2022, 07:58 AM

You need an XML parser to handle XML, but getting XmlStarlet to extract the URLs is not quite as straight forward as it could be...

Code:

$ xmlstarlet select -t -v '/_:metalink/_:file/_:url' -n input.xml
https://scihub.copernicus.eu/dhus/odata/v1/Products('80ba113d-b5c5-4f4f-8e4e-3231bd2c2859')/$value
https://scihub.copernicus.eu/dhus/odata/v1/Products('d0d43133-8089-4ec9-ab35-37d26323af63')/$value
https://scihub.copernicus.eu/dhus/odata/v1/Products('cefbc29e-f567-4410-b0b3-860617cf6a19')/$value
https://scihub.copernicus.eu/dhus/odata/v1/Products('63e2415f-249c-4d68-a908-d26efc34bf82')/$value
https://scihub.copernicus.eu/dhus/odata/v1/Products('c7de7226-5769-451d-9348-b8bf85e7b93f')/$value
https://scihub.copernicus.eu/dhus/odata/v1/Products('13168064-4914-4320-aecc-2482b7f9d005')/$value

Where input.xml is a file containing your code, but with the missing "</metalink>" appended. (You could instead from stdin.)

See "xmlstarlet select --help" (or online user guide) for explanation of the syntax (which does not follow typical command-line conventions).

The "_:" bit in the XPath expression is required due to the xmlns attribute on the root element.

Without a xmlns it would just be "/metalink/file/url" (and trying to use _: there results in an "Undefined namespace prefix" error.)

If there might be other "url" tags outside of "file" tags, you could use:

Code:

$ xmlstarlet select -t -v '//_:url' -n input.xml

The output of both is a newline-delimited list, which can be looped in the usual way and passed to wget/curl/whatever.

The "-n" is needed to output a trailing newline at the end, which some command-line tools are picky about.