I am trying to write a script which depends on a loop like this:
Code:
cat file | while read i;
do
....
....
done
I do some scripting on the HTML output. An example source file (which would be $s in the line of code below) can be found here:
http://www.dodgybloke.co.uk/11191S2E
Code:
cat $s | egrep 'RB|RT' | sed '1,2d' | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | sed 's/^[ \t]*//' | tr ',' '\n' >> $s.trackingfound
When all the data is gathered into one file with a simple command such as
Code:
cat *.trackingfound >> broken
the resulting file looks like this:
Code:
george@linux-z40o:~> cat broken
RB116413492HK
RB116413492HK
RT040029841HK
RT040029461HK
RT040029841HK
RT040029461HK
However, closer examination reveals that this is how read is seeing it:
Code:
george@linux-z40o:~/> cat broken | while read i; do echo "*S*" $i "*E*"; done
*E*
*E*RB116413492HK
*E*
*E*RB116413492HK
*S* RT040029841HK *E*
*E*RT040029461HK
*S* RT040029841HK *E*
*E*RT040029461HK
You can see that the file is, as the name suggests, completely broken: only two of the lines (the RT040029841HK ones) are read correctly into my script. I would like to know how to fix it, or how to get each "word" into a variable using another method.
I have tried sed and awk commands to remove blank lines, but they have been fruitless.
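The symptom shown above, where the closing marker lands on top of the start of the line, is what a terminal typically displays when a line ends in a carriage return (as in files with DOS-style CRLF line endings). That is only a guess about this particular file and is not confirmed here; if it is right, the apparently blank lines are not actually empty, which would explain why removing blank lines does nothing. A minimal sketch, using a small made-up sample in place of the real broken file, of how the hidden characters could be inspected and stripped before the read loop:

```shell
# Build a small sample with DOS-style line endings (hypothetical data that
# stands in for the real "broken" file).
sample=$(mktemp)
printf 'RB116413492HK\r\nRT040029841HK\r\n' > "$sample"

# od -c makes invisible characters visible; a carriage return shows up as \r.
od -c "$sample"

# Strip carriage returns into a cleaned copy of the file.
cleaned=$(mktemp)
tr -d '\r' < "$sample" > "$cleaned"

# Read the cleaned file line by line. Redirecting the file into the loop
# (rather than piping from cat) keeps the loop in the current shell, so
# variables set inside it are still available afterwards.
result=""
while IFS= read -r i; do
    echo "*S* $i *E*"
    result="$result[$i]"
done < "$cleaned"

rm -f "$sample" "$cleaned"
```

If od shows no \r in the real file, the cause is something else and the tr step will not help.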
Perhaps I need to put everything back onto a single line, then re-separate at each RB or RT, or after every 13th character. In that case, some of the commands described above can probably be changed.
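The idea of re-separating at RB or RT can be sketched without first joining the lines, by asking grep to print only the matching parts. This assumes every tracking number has the shape seen in the samples above (RB or RT, nine digits, then HK, 13 characters in all), which is a guess from a handful of examples rather than a known rule. The input text below is made up for illustration, and the -o option is an extension found in GNU grep rather than something every grep is guaranteed to have:

```shell
# Hypothetical input: tracking numbers mixed with commas, markup remnants,
# leading spaces and carriage returns.
input=$(printf '<td>RB116413492HK,RT040029841HK</td>\r\n  RT040029461HK\r\n')

# grep -o prints each match on its own line; -E enables extended regular
# expressions so {9} means "exactly nine of the preceding item".
numbers=$(printf '%s\n' "$input" | grep -oE 'R[BT][0-9]{9}HK')

# Each number can then be handled on its own. Leaving $numbers unquoted is
# deliberate here: the matches contain no spaces or wildcard characters,
# so word splitting yields exactly one word per tracking number.
count=0
for n in $numbers; do
    echo "*S* $n *E*"
    count=$((count + 1))
done
```

Because only the matched text is printed, anything else on the line (tags, commas, stray whitespace or control characters) is left behind, so the separate sed and tr clean-up steps might no longer be needed.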
As you can see, I am trying to parse the "tracking numbers" from the HTML.