[SOLVED] Loop through list of URLs in txt file, parse out parameters, pass to wget in bash.
Linux - Newbie: This Linux forum is for members that are new to Linux.
2. a script that extracts parameters from a text file of URLs (example below)
3. a script that downloads a file with wget (example below)
I want to create a loop that:
1. takes a text file of URLs
2. parses $host and $host_and_domain from each URL
3. sends $host and $host_and_domain to the wget script
4. creates a file name by appending $host with time/date (i.e. mm:dd:yy:hh:mm:ss)
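The four steps above can be sketched as a single loop. Everything here is an assumption pieced together from the scripts quoted below: the file name test-urls.txt, URLs shaped like http://host.com/pages-go-here, and echo standing in for the real wget call so the sketch runs without a network.

```shell
#!/bin/bash
# Sketch of steps 1-4: read URLs from a file, parse out the host,
# build a timestamped name, and hand both to the download step.
parse_host() {
    # step 2: http://host.com/pages-go-here -> host.com
    h=${1#*://}      # drop the protocol
    h=${h%%/*}       # drop everything after the first slash
    printf '%s\n' "$h"
}

while read -r full_url; do                    # step 1: one URL per line
    host=$(parse_host "$full_url")
    host_and_path=${full_url#*://}
    stamp=$(date +%m:%d:%y:%H:%M:%S)          # step 4: mm:dd:yy:hh:mm:ss
    # step 3 (placeholder for the real wget call):
    echo "wget -O ${host}-${stamp}.png url=$host_and_path"
done <<'EOF'
http://example.com/pages-go-here
EOF
```

The here-document at the end is just sample input; in the real script it would be `done < test-urls.txt`.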
Feel free to let me know if I could clarify anything. Also open to code examples to play with instead of outright answers.
Thanks!
Example of URL parsing script:
Code:
#!/bin/sh
# test-domain-parse.sh
# (Can't remember where I found this, but I didn't write it)
# ask for URL. note: want to pull in URLs from txt file (instead of printf)
# and then pass $host and $host_and_path to wget script
printf "Paste the URL you would like to normalize: -> "
read full_url
# extract the protocol
proto="$(echo "$full_url" | grep :// | sed -e 's,^\(.*://\).*,\1,g')"
# remove the protocol
url="${full_url/$proto/}"
# extract the user (if any)
user="$(echo "$url" | grep @ | cut -d@ -f1)"
# extract the host
host="$(echo "${url/$user@/}" | cut -d/ -f1)"
# extract the path (if any)
path="$(echo "$url" | grep / | cut -d/ -f2-)"
host_and_path="${url/$user@/}"
echo " host: $host"
echo " host_and_path: $host_and_path"
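For comparison, the same fields can be pulled out with parameter expansion alone, with no echo | grep | sed pipelines at all. This is a sketch under the same assumptions as the script above (URLs like proto://user@host/path), not a fully general URL parser; the sample URL is made up.

```shell
#!/bin/bash
# Same extractions as the script above, using only parameter expansion.
parse_url() {
    full_url=$1
    proto=${full_url%%://*}://                    # e.g. "http://"
    url=${full_url#"$proto"}                      # remove the protocol
    user=
    case $url in *@*) user=${url%%@*} ;; esac     # user (if any)
    host_and_path=${url#"$user"@}                 # strip "user@" prefix
    host=${host_and_path%%/*}
    path=
    case $host_and_path in */*) path=${host_and_path#*/} ;; esac
}

parse_url "http://me@example.com/some/page"
echo " host: $host"
echo " host_and_path: $host_and_path"
```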
Example of wget script:
Code:
#!/bin/sh
# wget-url-test.sh
# note: I would like to pass URL from the parsing script and NOT use printf
printf "What URL would you like to PDF? ->"
read URL
# echo $URL
# note: I would like to pass NORMALIZED_URL from parsing script ($host)
# and to append with yy:mm:dd:hh:mm:ss instead of naming file with printf
printf "What would you like to name the file? ->"
read NORMALIZED_URL
wget -O "$NORMALIZED_URL.png" --referer="http://www.google.com" --user-agent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.19) Gecko/20110707 Firefox/3.6.19" "pdfmyurl.com?url=$URL&--png&--page-size=A1&--redirect-delay=500"
Unfortunately, I can't get to that link. Anyway, what you need is a loop like
Code:
# assumes no spaces in urls
for full_url in $(cat <yourfilehere> )
do
#insert first script here, skipping printf & read cmds
# append 2nd script here, skipping printf, read cmds. (Not sure what a 'normalized_url' is.)
# if it is $host and you want to append time, then
host=${host}${timestamp}
# but you'll have to get the timestamp from somewhere
done
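For the "timestamp from somewhere" part, date(1) can produce the yy:mm:dd:hh:mm:ss form directly; the format string below is my guess at the convention the OP described, and the host value is just a sample.

```shell
# Build the timestamp the sketch above leaves open, then append it to $host.
host=example.com                       # sample value; normally set by the parser
timestamp=$(date +%y:%m:%d:%H:%M:%S)   # yy:mm:dd:hh:mm:ss
host=${host}${timestamp}
echo "$host"
```

Note that colons are legal in Linux file names but can confuse other tools (and other filesystems), so a dash-separated format like `%y-%m-%d_%H-%M-%S` may be safer in practice.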
Thanks Chris. That definitely helped. I used "export date=$(date +%s)" to append to "$host" for file naming convention.
Strangely, instead of iterating through the list and processing each line, the script only processes the last line of the text file. "test-urls.txt" is an 11-line file with no spaces. It contains URLs in this form: http://[host.com]/[pages-go-here]
I'll look into this but if you have any suggestions in the meantime feel free to share.
Here's the updated code:
Code:
#!/bin/sh
# wget-url-test.sh
for full_url in $(cat test-urls.txt)
do
# extract the protocol
proto="$(echo "$full_url" | grep :// | sed -e 's,^\(.*://\).*,\1,g')"
# remove the protocol
url="${full_url/$proto/}"
# extract the user (if any)
user="$(echo "$url" | grep @ | cut -d@ -f1)"
# extract the host
host="$(echo "${url/$user@/}" | cut -d/ -f1)"
# extract the path (if any)
path="$(echo "$url" | grep / | cut -d/ -f2-)"
host_and_path="${url/$user@/}"
export date=$(date +%s)
wget -O "${host}${date}.png" --referer="http://www.google.com" \
--user-agent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.19) Gecko/20110707 Firefox/3.6.19" \
"pdfmyurl.com?url=$host_and_path&--png&--page-size=A1&--redirect-delay=500"
done
Just out of curiosity, are you aware that the greps you are using are doing nothing?
If we assume the format you provided is correct for each line of the file (http://[host.com]/[pages-go-here]), then
something like:
Code:
proto="$(echo $full_url | grep :// | sed -e's,^\(.*://\).*,\1,g')"
Here full_url is only one line, so grepping a single line serves no purpose, at least not without any switches
to reduce what has been passed in.
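grail's point can be demonstrated directly: for a single line that contains ://, dropping the grep changes nothing, because grep on a one-line matching input just passes the line through. The sample URL is made up.

```shell
# With one matching input line, the grep is a pass-through:
# the sed alone produces the same protocol string.
line='http://example.com/pages-go-here'
with_grep=$(echo "$line" | grep :// | sed -e 's,^\(.*://\).*,\1,g')
without_grep=$(echo "$line" | sed -e 's,^\(.*://\).*,\1,g')
echo "$with_grep / $without_grep"
```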
Also, I am not exactly sure what details are in the url lines in the file but are you aware that wget can read url information directly from a file? (just a thought)
grail, no, I wasn't aware of that. Übernoob with this stuff.
re: your thought, would this be wget's "-i" option? If so, the reason I didn't use it was because I also want to parse out the host of each URL and use the value of host to name the files I'm downloading. But if you meant something else, I could look into it. Thanks
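For reference, that is indeed wget's -i/--input-file option, which reads one URL per line; it offers no per-URL control of the -O output name, which is exactly the part that needs a loop. A sketch of just the naming step, with a made-up sample line:

```shell
# Naming a download after its host: this is what wget -i cannot do,
# since -O applies a single name to everything it fetches.
url='http://example.com/pages-go-here'    # sample line from test-urls.txt
host=${url#*://}                          # drop the protocol
host=${host%%/*}                          # drop the path
name="${host}-$(date +%s).png"
echo "$name"
```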
In case it helps, here're sample lines from the text file:
Do you have to use sh? Can you use bash? That is, could the first line be #!/bin/bash?
Reason for asking is that sh may effectively be several different shells depending on the distro and, even if it is linked to bash, bash when called as sh has a subset of its full functionality.
Regarding only getting the last line, and evolving the code to work with URLs including spaces, the outer loop could be changed to
Code:
while read -r full_url
do
...
done < test-urls.txt
So it appears the special characters in the URLs could be preventing the script from working as intended.
To troubleshoot, I substituted the URLs with random strings without any special characters and could echo each line just fine. However, even using the -r option in the script below doesn't produce any output when I reinsert the URLs into the text file.
Code:
#!/bin/bash
# test-echo-urls.sh
while read -r full_url
do
echo "$full_url"
done < test-urls.txt
Or (without quotes around $full_url)
Code:
#!/bin/bash
# test-echo-urls.sh
while read -r full_url
do
echo $full_url
done < test-urls.txt
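When read -r prints nothing at all, one common cause (an assumption here, since the actual file isn't shown) is non-Unix line endings in test-urls.txt. With CR-only endings there are no newlines, so read hits end-of-file on its first call and the loop body never runs. A quick check and fix, done on a throwaway file so nothing real is overwritten:

```shell
# Simulate a CR-only file, inspect it, then normalize the line endings.
printf 'http://example.com/a\rhttp://example.com/b\r' > /tmp/urls-cr.txt
cat -A /tmp/urls-cr.txt                   # carriage returns show up as ^M
tr '\r' '\n' < /tmp/urls-cr.txt > /tmp/urls-unix.txt
while read -r u; do echo "got: $u"; done < /tmp/urls-unix.txt
```

For CRLF files, `tr -d '\r'` is the usual fix instead.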
May I also ask if the content of the text file with URLs you posted in #8 is incomplete? I ask as it obviously has no user and/or host details anywhere in it (this may be confidential),
so the other lines for setting user and host seem to have nothing to work on.
catkin, yep, will happily paste in the new code, assuming I'm able to later today. Thanks
grail, the script I adapted for processing the URLs was written by someone else, and I didn't (still don't, to a certain extent) understand all the code. There was actually no user information in the original file, I just kept that line in there because I wasn't quite ready to mess with that part. Before posting the finished script, however, I plan to strip out all the superfluous code so you'll be able to see how I'm using it then. Thanks