[SOLVED] Loop through list of URLs in txt file, parse out parameters, pass to wget in bash.
Linux - Newbie: This Linux forum is for members that are new to Linux.
2. a script that extracts parameters from a text file of URLs (example below)
3. a script that downloads a file with wget (example below)
I want to create a loop that:
1. takes a text file of URLs
2. parses $host and $host_and_domain from each URL
3. sends $host and $host_and_domain to the wget script
4. creates a file name by appending $host with time/date (i.e. mm:dd:yy:hh:mm:ss)
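The four steps above can be sketched as a single loop. Everything here is an assumption pieced together from the scripts quoted below: the file name test-urls.txt, URLs shaped like http://host.com/pages-go-here, and echo standing in for the real wget call so the sketch runs without a network.

```shell
#!/bin/bash
# Sketch of steps 1-4: read URLs from a file, parse out the host,
# build a timestamped name, and hand both to the download step.
parse_host() {
    # step 2: http://host.com/pages-go-here -> host.com
    h=${1#*://}      # drop the protocol
    h=${h%%/*}       # drop everything after the first slash
    printf '%s\n' "$h"
}

while read -r full_url; do                    # step 1: one URL per line
    host=$(parse_host "$full_url")
    host_and_path=${full_url#*://}
    stamp=$(date +%m:%d:%y:%H:%M:%S)          # step 4: mm:dd:yy:hh:mm:ss
    # step 3 (placeholder for the real wget call):
    echo "wget -O ${host}-${stamp}.png url=$host_and_path"
done <<'EOF'
http://example.com/pages-go-here
EOF
```

The here-document at the end is just sample input; in the real script it would be `done < test-urls.txt`.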
Feel free to let me know if I could clarify anything. Also open to code examples to play with instead of outright answers.
Thanks!
Example of URL parsing script:
Code:
#!/bin/sh
# test-domain-parse.sh
# (Can't remember where I found this, but I didn't write it)
# ask for URL. note: want to pull in URLs from txt file (instead of printf)
# and then pass $host and $host_and_path to wget script
printf "Paste the URL you would like to normalize: -> "
read full_url
# extract the protocol
proto="$(echo "$full_url" | grep :// | sed -e 's,^\(.*://\).*,\1,g')"
# remove the protocol
url="${full_url/$proto/}"
# extract the user (if any)
user="$(echo "$url" | grep @ | cut -d@ -f1)"
# extract the host
host="$(echo "${url/$user@/}" | cut -d/ -f1)"
# extract the path (if any)
path="$(echo "$url" | grep / | cut -d/ -f2-)"
host_and_path="${url/$user@/}"
echo " host: $host"
echo " host_and_path: $host_and_path"
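For comparison, the same fields can be pulled out with parameter expansion alone, with no echo | grep | sed pipelines at all. This is a sketch under the same assumptions as the script above (URLs like proto://user@host/path), not a fully general URL parser; the sample URL is made up.

```shell
#!/bin/bash
# Same extractions as the script above, using only parameter expansion.
parse_url() {
    full_url=$1
    proto=${full_url%%://*}://                    # e.g. "http://"
    url=${full_url#"$proto"}                      # remove the protocol
    user=
    case $url in *@*) user=${url%%@*} ;; esac     # user (if any)
    host_and_path=${url#"$user"@}                 # strip "user@" prefix
    host=${host_and_path%%/*}
    path=
    case $host_and_path in */*) path=${host_and_path#*/} ;; esac
}

parse_url "http://me@example.com/some/page"
echo " host: $host"
echo " host_and_path: $host_and_path"
```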
Example of wget script:
Code:
#!/bin/sh
# wget-url-test.sh
# note: I would like to pass URL from the parsing script and NOT use printf
printf "What URL would you like to PDF? ->"
read URL
# echo $URL
# note: I would like to pass NORMALIZED_URL from parsing script ($host)
# and to append with yy:mm:dd:hh:mm:ss instead of naming file with printf
printf "What would you like to name the file? ->"
read NORMALIZED_URL
wget -O "$NORMALIZED_URL.png" --referer="http://www.google.com" --user-agent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.19) Gecko/20110707 Firefox/3.6.19" "pdfmyurl.com?url=$URL&--png&--page-size=A1&--redirect-delay=500"
Unfortunately, I can't get to that link. Anyway, what you need is a loop like
Code:
# assumes no spaces in urls
for full_url in $(cat <yourfilehere> )
do
#insert first script here, skipping printf & read cmds
# append 2nd script here, skipping printf, read cmds. (Not sure what a 'normalized_url' is.)
# if it is $host and you want to append time, then
host=${host}${timestamp}
# but you'll have to get the timestamp from somewhere
done
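For the "timestamp from somewhere" part, date(1) can produce the yy:mm:dd:hh:mm:ss form directly; the format string below is my guess at the convention the OP described, and the host value is just a sample.

```shell
# Build the timestamp the sketch above leaves open, then append it to $host.
host=example.com                       # sample value; normally set by the parser
timestamp=$(date +%y:%m:%d:%H:%M:%S)   # yy:mm:dd:hh:mm:ss
host=${host}${timestamp}
echo "$host"
```

Note that colons are legal in Linux file names but can confuse other tools (and other filesystems), so a dash-separated format like `%y-%m-%d_%H-%M-%S` may be safer in practice.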
Thanks Chris. That definitely helped. I used "export date=$(date +%s)" to append to "$host" for file naming convention.
Strangely, instead of iterating through the list and processing each line, the script only processes the last line of the text file. "test-urls.txt" is an 11-line file with no spaces. It contains URLs in this form: http://[host.com]/[pages-go-here]
I'll look into this but if you have any suggestions in the meantime feel free to share.
Here's the updated code:
Code:
#!/bin/sh
# wget-url-test.sh
for full_url in $(cat test-urls.txt)
do
# extract the protocol
proto="$(echo "$full_url" | grep :// | sed -e 's,^\(.*://\).*,\1,g')"
# remove the protocol
url="${full_url/$proto/}"
# extract the user (if any)
user="$(echo "$url" | grep @ | cut -d@ -f1)"
# extract the host
host="$(echo "${url/$user@/}" | cut -d/ -f1)"
# extract the path (if any)
path="$(echo "$url" | grep / | cut -d/ -f2-)"
host_and_path="${url/$user@/}"
export date=$(date +%s)
wget -O "${host}${date}.png" --referer="http://www.google.com" \
--user-agent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.19) Gecko/20110707 Firefox/3.6.19" \
"pdfmyurl.com?url=$host_and_path&--png&--page-size=A1&--redirect-delay=500"
done
Just out of curiosity, are you aware that the greps you are using are doing nothing?
If we assume the format you provided is correct for each line of the file (http://[host.com]/[pages-go-here]), then
something like:
Code:
proto="$(echo $full_url | grep :// | sed -e's,^\(.*://\).*,\1,g')"
Here full_url is only one line, so grepping a single line serves no purpose, at least not without any switches
to reduce what has been passed in.
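grail's point can be demonstrated directly: for a single line that contains ://, dropping the grep changes nothing, because grep on a one-line matching input just passes the line through. The sample URL is made up.

```shell
# With one matching input line, the grep is a pass-through:
# the sed alone produces the same protocol string.
line='http://example.com/pages-go-here'
with_grep=$(echo "$line" | grep :// | sed -e 's,^\(.*://\).*,\1,g')
without_grep=$(echo "$line" | sed -e 's,^\(.*://\).*,\1,g')
echo "$with_grep / $without_grep"
```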
Also, I am not exactly sure what details are in the url lines in the file but are you aware that wget can read url information directly from a file? (just a thought)
grail, no, I wasn't aware of that. Übernoob with this stuff.
re: your thought, would this be wget's "-i" option? If so, the reason I didn't use it was because I also want to parse out the host of each URL and use the value of host to name the files I'm downloading. But if you meant something else, I could look into it. Thanks
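For reference, that is indeed wget's -i/--input-file option, which reads one URL per line; it offers no per-URL control of the -O output name, which is exactly the part that needs a loop. A sketch of just the naming step, with a made-up sample line:

```shell
# Naming a download after its host: this is what wget -i cannot do,
# since -O applies a single name to everything it fetches.
url='http://example.com/pages-go-here'    # sample line from test-urls.txt
host=${url#*://}                          # drop the protocol
host=${host%%/*}                          # drop the path
name="${host}-$(date +%s).png"
echo "$name"
```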
In case it helps, here're sample lines from the text file:
Do you have to use sh? Can you use bash? That is, could the first line be #!/bin/bash?
Reason for asking is that sh may effectively be several different shells depending on the distro and, even if it is linked to bash, bash when called as sh has a subset of its full functionality.
Regarding only getting the last line, and evolving the code to work with URLs including spaces, the outer loop could be changed to
Code:
while read -r full_url
do
...
done < test-urls.txt
So it appears the special characters in the URLs could be preventing the script from working as intended.
To troubleshoot, I substituted the URLs with random strings without any special characters and could echo each line just fine. However, even using the -r option in the script below doesn't produce any output when I reinsert the URLs into the text file.
Code:
#!/bin/bash
# test-echo-urls.sh
while read -r full_url
do
echo "$full_url"
done < test-urls.txt
Or (without quotes around $full_url)
Code:
#!/bin/bash
# test-echo-urls.sh
while read -r full_url
do
echo $full_url
done < test-urls.txt
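When read -r prints nothing at all, one common cause (an assumption here, since the actual file isn't shown) is non-Unix line endings in test-urls.txt. With CR-only endings there are no newlines, so read hits end-of-file on its first call and the loop body never runs. A quick check and fix, done on a throwaway file so nothing real is overwritten:

```shell
# Simulate a CR-only file, inspect it, then normalize the line endings.
printf 'http://example.com/a\rhttp://example.com/b\r' > /tmp/urls-cr.txt
cat -A /tmp/urls-cr.txt                   # carriage returns show up as ^M
tr '\r' '\n' < /tmp/urls-cr.txt > /tmp/urls-unix.txt
while read -r u; do echo "got: $u"; done < /tmp/urls-unix.txt
```

For CRLF files, `tr -d '\r'` is the usual fix instead.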
May I also ask if the content of the text file with URLs you posted in #8 is incomplete? I ask as it obviously has no user and/or host details anywhere in it (this may be confidential),
so the other lines for setting user and host seem to have nothing to work on.
catkin, yep, will happily paste in the new code, assuming I'm able to later today. Thanks
grail, the script I adapted for processing the URLs was written by someone else, and I didn't (still don't, to a certain extent) understand all the code. There was actually no user information in the original file, I just kept that line in there because I wasn't quite ready to mess with that part. Before posting the finished script, however, I plan to strip out all the superfluous code so you'll be able to see how I'm using it then. Thanks