wget failed to download an HTML page
Good day,
I am using wget to download a web page, but the page I save on my hard drive is not the same as the one I see in my browser (Firefox). In particular, the page downloaded with wget is incomplete and is an old version of the page. The location of the page is: http://www.discoverygc.com/server/players_online.html I have used different combinations of wget options, but up until now I have never been able to download the latest version of the page. I also note that lynx has the same problem: it cannot completely load the page (and it is an old version of the page). If anyone can help I would be pleased. Thanks for your attention, moeb. PS: here is the result of the debug: Code:
moebus>wget http://www.discoverygc.com/server/players_ |
1 Attachment(s)
No problems here with
wget http://www.discoverygc.com/server/players_online.html The file "players_online.html", 23.2 kB, is downloaded. The file is attached as players_online.html.txt (.txt is the allowed suffix for attaching this type of file). |
If you have a good version of Firefox, its Download Manager will work better than KGet and GWget on difficult files. The problem is most likely the server or the network.
|
Hi, thanks for your replies.
Knudfl, the file you attached is incomplete and is not the latest version. Some data are missing: the ping, loss, and lag columns are not present in the second table, and at the end there are some character strings that should be inside the table. The last update is not supposed to be 31/01/2012 06:15:47 [UTC], since the file is updated every 10 s. If I check now with Firefox, the last update is 31/01/2012 13:58:12 [UTC], while if I download it now with wget it is still 31/01/2012 06:15:47 [UTC]. KbStockpiler, I plan to download the file every minute using a script, and I am not sure I can use Firefox that way. |
wget works fine and retrieves the latest version of the page. Maybe there is a caching issue specific to your installation/firewall? |
2 Attachment(s)
Hi again,
I'm surprised: the file is supposed to be 46k, not 23k. I will check my cache and try it on another internet connection. (Up until now I have tried on two different computers with different OSes, but connected to the same router.) Thanks for your comment. |
I'm no web expert, but two things to check--both of which depend on the web server:
1. Does the web server serve different pages based on the user-agent string supplied in the request? Firefox and wget certainly send different user-agents by default, and wget can be told to supply a different user-agent on the command line. 2. Does the web server serve different pages to (ro)bots? As far as I understand it, wget honors a site's robots restrictions, and as far as I know that adherence is fixed, not configurable via the command line. EDIT: I just tried running the following command: Code:
wget --user-agent="Mozilla/5.0 (Windows NT 6.2; rv:9.0.1) Gecko/20100101 Firefox/9.0.1" http://www.discoverygc.com/server/players_online.html Code:
wget -c --user-agent="Mozilla/5.0 (Windows NT 6.2; rv:9.0.1) Gecko/20100101 Firefox/9.0.1" http://www.discoverygc.com/server/players_online.html That would seem to say the problem is on the web server side. Perhaps Firefox has some error-recovery code that wget lacks. EDIT2: The same problem occurs with curl as well. EDIT3: I just tried the following command: Code:
curl --compressed http://www.discoverygc.com/server/players_online.html > players_online.html I do not know if wget has a similar option. |
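(Editor's note: newer releases of GNU wget, 1.19.2 and later, did gain a comparable option, `--compression`; older versions can request gzip via an explicit header and decompress by hand. A hedged sketch, assuming one of those two wget versions is installed:)

```shell
# Assumption: GNU wget >= 1.19.2, which added the --compression option.
wget --compression=auto http://www.discoverygc.com/server/players_online.html

# Fallback for older wget: request the gzip variant explicitly,
# save it under a .gz name, then decompress it ourselves.
wget --header="Accept-Encoding: gzip" \
     -O players_online.html.gz \
     http://www.discoverygc.com/server/players_online.html
gunzip -f players_online.html.gz
```

Either way the server is asked for the compressed variant, which in this thread is the one that is actually up to date.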
I don't think this is something like a break in the transfer, given the old time displayed in "last update". Maybe it is a bug in the nginx server and it serves old cached data: some clients request compressed data and some request plain, and it can only cache one variant.
When you clear "network.http.accept-encoding" in "about:config" in Firefox, disabling compression, you also get the corrupted page (and the old time). |
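(Editor's note: one way to check this theory without a browser is to compare the headers the server returns for a plain request versus one advertising gzip support. A diagnostic sketch; which headers this particular nginx instance emits is an assumption, so treat the grep patterns as a guess:)

```shell
# Plain request: no Accept-Encoding header, so the server should serve
# the uncompressed variant (the one that appears to be stale).
curl -sI http://www.discoverygc.com/server/players_online.html \
  | grep -i -e '^last-modified' -e '^content-length'

# Same request, but advertising gzip support, as a browser would.
curl -sI -H 'Accept-Encoding: gzip' \
  http://www.discoverygc.com/server/players_online.html \
  | grep -i -e '^last-modified' -e '^content-encoding'
```

If the two `Last-Modified` values differ, the server really is caching the two variants independently, as suggested above.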
Good point about the old update time. I'd forgotten about that and focused on the amount of data transferred.
|
wow : )
Thank you very much. I've just tried curl --compressed http://www.discoverygc.com/server/players_online.html > players_online.html and it works here too. So I will use curl in my script instead of wget (since I did not find any compression-related wget option), and that will be fine for me. Should I edit the title of the topic and add [SOLVED] to it? |
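(Editor's note: since the plan is to fetch the page once a minute, here is a minimal polling sketch built around the `curl --compressed` call from this thread. The timestamped filenames and the infinite loop are this editor's own choices, not anything the poster specified:)

```shell
#!/bin/sh
URL="http://www.discoverygc.com/server/players_online.html"

fetch_once() {
    # Timestamped output file, so successive snapshots are all kept.
    out="players_online_$(date -u +%Y%m%d_%H%M%S).html"
    # --compressed asks for gzip and transparently decompresses it,
    # sidestepping the stale uncompressed variant discussed above.
    curl -s --compressed "$URL" -o "$out"
}

while :; do
    fetch_once
    sleep 60   # one-minute interval, as planned in the thread
done
```

A cron entry running the same `curl` command every minute would do the job as well, without keeping a long-running process around.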
thanks again.
|