BASH scripting: problem with file formatting.

suse_nerd · 12-03-2009, 04:28 AM

I am trying to write a script which depends on a loop like this:

Code:

cat file | while read i;
do
....
.... 
done

I do some processing on the HTML output. An example source file (which would be $s in the line of code below) can be found here:
http://www.dodgybloke.co.uk/11191S2E

Code:

cat $s | egrep 'RB|RT' |  sed '1,2d' |  sed -e :a -e 's/<[^>]*>//g;/</N;//ba' |  sed 's/^[ \t]*//' |  tr ',' '\n' >> $s.trackingfound

The file produced looks like this (when a simple command such as

Code:

 cat *.trackingfound >> broken

is performed to get all the data into one file):

Code:

george@linux-z40o:~> cat broken

RB116413492HK

RB116413492HK
RT040029841HK
RT040029461HK
RT040029841HK
RT040029461HK

However, closer examination reveals that this is how read sees it:

Code:

george@linux-z40o:~/> cat broken | while read i; do  echo "*S*" $i "*E*"; done
 *E*
 *E*RB116413492HK
 *E*
 *E*RB116413492HK
*S* RT040029841HK *E*
 *E*RT040029461HK
*S* RT040029841HK *E*
 *E*RT040029461HK

You can see the file is, as the name suggests, completely broken: only two of the lines (the RT040029841HK ones) are read correctly into my script. I would like to know how to fix it, or how to get each "word" into a variable using another method.

sed and awk commands to remove blank lines have been fruitless.
Perhaps I need to put everything back onto a single line, then re-separate at each RB or RT, or after every 13th character. In that case, some of the commands described above can probably be changed.

As you can see, I am trying to parse the "tracking numbers" from the html.
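
For reference, output where the trailing "*E*" lands at the start of the line is what you would see if most lines end in a carriage return (DOS-style line endings): the terminal jumps back to column one before printing the rest. That is an assumption here, since the raw bytes of the file are not shown (od -c on the file would confirm it). A minimal sketch of stripping them while reading; the function name strip_cr_lines and the sample data are made up for illustration:

```shell
#!/usr/bin/env bash
# strip_cr_lines (hypothetical helper): read stdin, drop one trailing
# carriage return from each line, skip lines that end up empty, print the rest.
strip_cr_lines() {
    local line
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line%$'\r'}            # remove a trailing CR if present
        [ -n "$line" ] || continue    # skip blank lines
        printf '%s\n' "$line"
    done
}

# Made-up sample with CRLF endings and a blank line (assumed shape of "broken").
printf 'RB116413492HK\r\n\r\nRT040029841HK\r\n' | strip_cr_lines |
while IFS= read -r i; do
    echo "*S* $i *E*"
done
```

If carriage returns really are the cause, each tracking number should then show up between the *S* and *E* markers on its own line.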

David the H. · 12-03-2009, 07:11 AM

Let me see if I understand correctly. You want to extract only the tracking numbers from the source code of the html documents like the example page you gave, right? And the numbers you want always start with RB or RT?

I tried the extraction string you posted on the page you gave, and it gave me the following output:

Code:

testpage=$(wget -O- http://www.dodgybloke.co.uk/11191S2E)

echo "$testpage" | egrep 'RB|RT' |  sed '1,2d' |  sed -e :a -e 's/<[^>]*>//g;/</N;//ba' |  sed 's/^[ \t]*//' |  tr ',' '\n'

size="2">RB116413492HK
href="http://app3.hongkongpost.com/CGI/mt/genresult.jsp?tracknbr=RB116413492HK" target=_blank>
RB116413492HK

Something tells me you don't want all that extra garbage. Besides, I think you're making it much more complicated than it needs to be. I can get the tracking number with just the following command:

Code:

$ sed -rn '0,/tracknbr/ s/^.*=((RB|RT)[^"]+).*/\1/p' <<<"$testpage"

RB116413492HK

"0,/tracknbr/" says to only search the file up to the first line that has "tracknbr" in it, then it uses the s/// expression to extract the actual number. You may have to modify it a little if the input can vary significantly.

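To make the effect of the "0,/tracknbr/" address range concrete, here is a small sketch run against a made-up three-line sample rather than the real page (the sample HTML and the variable names are assumptions for illustration). It needs GNU sed, since both -r and the 0,/re/ address form are GNU extensions:

```shell
#!/usr/bin/env bash
# Hypothetical sample input, loosely modelled on the output shown above.
sample='<td><font size="2">RB116413492HK</font></td>
<a href="http://app3.hongkongpost.com/CGI/mt/genresult.jsp?tracknbr=RB116413492HK" target=_blank>
<a href="http://app3.hongkongpost.com/CGI/mt/genresult.jsp?tracknbr=RT040029841HK" target=_blank>'

# With the range: the substitution is only tried up to and including the
# first line that contains "tracknbr", so only one number is printed.
first=$(sed -rn '0,/tracknbr/ s/^.*=((RB|RT)[^"]+).*/\1/p' <<<"$sample")

# Without the range: every line containing tracknbr=RB... or tracknbr=RT...
# contributes a number.
all=$(sed -rn 's/^.*tracknbr=((RB|RT)[^"]+).*/\1/p' <<<"$sample")

echo "first match: $first"
echo "all matches:"
echo "$all"
```

Dropping the range is one way to adapt the command if a page can list more than one tracking number.
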
Finally, it's better to avoid pipes and external commands like cat whenever possible, for efficiency. Pipes also run subsequent commands in subshells, which can cause confusing behavior with variables: anything the loop assigns is lost once the loop ends. So your while loop is better written this way:

Code:

while read i;
do
....
.... 
done <file
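
The subshell point is easy to see with a counter. A minimal sketch, assuming bash with default options (no lastpipe); the temporary file and variable names are made up for illustration:

```shell
#!/usr/bin/env bash
# Throwaway input file with two lines.
tmp=$(mktemp)
printf 'RB116413492HK\nRT040029841HK\n' > "$tmp"

# Piped form: the loop body runs in a subshell, so the increments
# never reach the parent shell's copy of count.
count=0
cat "$tmp" | while read -r i; do count=$((count + 1)); done
piped_count=$count

# Redirected form: the loop runs in the current shell, so count survives.
count=0
while read -r i; do count=$((count + 1)); done < "$tmp"
redirected_count=$count

echo "piped: $piped_count, redirected: $redirected_count"
rm -f "$tmp"
```

With bash defaults this should report 0 for the piped loop and 2 for the redirected one.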

suse_nerd · 12-07-2009, 06:32 AM

Many thanks for the reply. It has fixed it. I thought I would upload my entire script. It all works fine, but I expect there are better ways of doing it. I tried changing the commands to what you suggested; that didn't work, though that could have been because of other problems.

http://www.dodgybloke.co.uk/trackingscript.sh

ghostdog74 · 12-07-2009, 07:08 AM

no need complicated regex

Code:

# wget -q -O- http://www.dodgybloke.co.uk/11191S2E | awk -F"tracknbr=" '/tracknbr=/{sub(/\".*/,"",$2);print $2}'
RB116413492HK
RB116413492HK

suse_nerd · 12-07-2009, 04:31 PM

Quote:

Originally Posted by ghostdog74

no need complicated regex

Hi ghostdog, which line are you saying I could replace? It is not as simple as just getting the file from the above site; that was provided as an example only. The lynx script logs in to dealextreme.com and gets the file.

The complicated regex gets rid of duplicate lines like the ones above, as the next part of the script checks each tracking number against the Hongkong Post website and would otherwise check the same tracking number twice.

Code:

 sed '$!N; /^\(.*\)\n\1$/!P; D'

Deletes duplicate consecutive lines (like uniq)

Code:

 sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'

Deletes duplicate non-consecutive lines
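
If the only goal is to check each tracking number once, a shorter alternative to the sed one-liners is the common awk idiom below, which keeps the first occurrence of every line whether or not the duplicates are adjacent. This is a sketch, not what the posted script does; the function name dedupe_lines and the sample data are made up for illustration:

```shell
#!/usr/bin/env bash
# dedupe_lines (hypothetical helper): print each distinct input line once,
# in first-seen order. seen[$0]++ is 0 (false) the first time a line appears,
# so !seen[$0]++ is true only for first occurrences.
dedupe_lines() {
    awk '!seen[$0]++'
}

# Sample list with both adjacent and non-adjacent duplicates.
printf 'RB116413492HK\nRB116413492HK\nRT040029841HK\nRT040029461HK\nRT040029841HK\nRT040029461HK\n' |
dedupe_lines |
while IFS= read -r nbr; do
    echo "would check: $nbr"
done
```

If the original order does not matter, sort -u gives the same set of lines.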