LinuxQuestions.org > Linux - General
(https://www.linuxquestions.org/questions/linux-general-1/wget-failed-to-download-a-html-page-926634/)

moebus 01-30-2012 10:07 PM

wget failed to download an HTML page
 
Good day,

I am using wget to download a web page, but the page I save on my hard drive is not the same as the one I see in my browser (Firefox). In particular, the page downloaded with wget is incomplete and is an old version of the page.

The location of the page is:
http://www.discoverygc.com/server/players_online.html

I have tried different combinations of wget options, but up until now I have never been able to download the latest version of the page.
I also note that lynx has the same problem: it cannot completely load the page either (and it, too, gets an old version).

If anyone can help I would be grateful,
thanks for your attention.

moeb

PS: here is the debug output:

Code:

moebus>wget http://www.discoverygc.com/server/players_online.html -d
DEBUG output created by Wget 1.11.4 on Windows-MSVC.

--2012-01-30 23:02:32--  http://www.discoverygc.com/server/players_online.html
Resolving www.discoverygc.com... seconds 0.00, 78.46.88.89
Caching www.discoverygc.com => 78.46.88.89
Connecting to www.discoverygc.com|78.46.88.89|:80... seconds 0.00, connected.
Created socket 1892.
Releasing 0x00982880 (new refcount 1).

---request begin---
GET /server/players_online.html HTTP/1.0
User-Agent: Wget/1.11.4
Accept: */*
Host: www.discoverygc.com
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Server: nginx/1.0.11
Date: Tue, 31 Jan 2012 04:02:40 GMT
Content-Type: text/html
Content-Length: 22375
Last-Modified: Tue, 31 Jan 2012 04:02:31 GMT
Connection: keep-alive
Accept-Ranges: bytes

---response end---
200 OK
Registered socket 1892 for persistent reuse.
Length: 22375 (22K) [text/html]
Saving to: `players_online.html'

100%[======================================>] 22,375      107K/s  in 0.2s

2012-01-30 23:02:32 (107 KB/s) - `players_online.html' saved [22375/22375]


knudfl 01-31-2012 02:43 AM

1 Attachment(s)
No problems here with

wget http://www.discoverygc.com/server/players_online.html

The file "players_online.html", 23.2 kB is downloaded.

The file is attached as players_online.html.txt
( .txt is the allowed suffix for attaching this type of file.)

theKbStockpiler 01-31-2012 07:26 AM

If you have a recent version of Firefox, its Download Manager will work better than KGet and GWget on difficult files. The problem is most likely the server or the network.

moebus 01-31-2012 07:58 AM

Hi, thanks for your replies.

Knudfl, the file you attached is incomplete and is not the latest version.
Some data are missing from the file: the ping, loss, and lag columns are not present in the second table, and at the end there are strings of characters that should be inside the table.
The last update is not supposed to be 31/01/2012 06:15:47 [UTC], since the file is updated every 10 seconds.
If I check now with Firefox, the last update is 31/01/2012 13:58:12 [UTC], while if I download it now with wget it is 31/01/2012 06:15:47 [UTC].

KbStockpiler, I plan to download the file every minute using a script. I am not sure I can use Firefox in such a script.

qlue 01-31-2012 11:57 AM

Quote:

Originally Posted by moebus (Post 4589369)
Hi, thanks for your replies.

Knudfl, the file you attached is incomplete and is not the latest version.
Some data are missing from the file: the ping, loss, and lag columns are not present in the second table, and at the end there are strings of characters that should be inside the table.
The last update is not supposed to be 31/01/2012 06:15:47 [UTC], since the file is updated every 10 seconds.
If I check now with Firefox, the last update is 31/01/2012 13:58:12 [UTC], while if I download it now with wget it is 31/01/2012 06:15:47 [UTC].

KbStockpiler, I plan to download the file every minute using a script. I am not sure I can use Firefox in such a script.

I get the same results as knudfl.
wget works fine here and retrieves the latest version of the page. Maybe there is a caching issue specific to your installation or firewall?
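
If a cache is sitting between you and the server, one quick test (a sketch; if I remember right, wget's --no-cache makes it send a "Pragma: no-cache" header so intermediate caches revalidate) would be:

Code:

wget --no-cache http://www.discoverygc.com/server/players_online.html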

moebus 01-31-2012 12:33 PM

2 Attachment(s)
Hi again,
I'm surprised; the file is supposed to be 46 kB, not 23 kB.
I will check my cache and try it on another internet connection.
(Up until now I've tried on two different computers with different OSes, but connected to the same router.)
Thanks for your comment.

Dark_Helmet 01-31-2012 12:50 PM

I'm no web expert, but here are two things to check, both of which depend on the web server:

1. Does the web server serve different pages based on the user-agent string supplied in the request? Firefox and wget certainly send different user-agent strings by default. I believe wget can be told to supply a different user-agent on the command line.

2. Does the web server serve different pages to (ro)bots? As far as I understand it, wget honors a site's robots.txt during recursive retrievals; that adherence is actually configurable (see the sketch just below), though it should not matter for a plain single-file download.
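
For what it's worth, the robots behaviour can be toggled like this (a sketch; -e robots=off only has an effect on recursive, -r, retrievals, so it should not matter for a single file):

Code:

wget -e robots=off -r -l 1 http://www.discoverygc.com/server/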

EDIT:
I just tried running the following command:
Code:

wget --user-agent="Mozilla/5.0 (Windows NT 6.2; rv:9.0.1) Gecko/20100101 Firefox/9.0.1" http://www.discoverygc.com/server/players_online.html
The transfer halted after about 1 or 2 KB. I then ran:
Code:

wget -c --user-agent="Mozilla/5.0 (Windows NT 6.2; rv:9.0.1) Gecko/20100101 Firefox/9.0.1" http://www.discoverygc.com/server/players_online.html
The wget output indicated that the second command did indeed continue the earlier transfer. However, looking at the file afterwards, it was still incomplete.

That would seem to suggest the problem is on the web server's side. Perhaps Firefox has some error-recovery code that wget does not.

EDIT2:
The same problem occurs for curl as well.
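
(The curl attempt was a plain fetch with no special options, something like the following, and it truncated the same way:)

Code:

curl http://www.discoverygc.com/server/players_online.html > players_online.html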

EDIT3:
I just tried the following command:
Code:

curl --compressed http://www.discoverygc.com/server/players_online.html > players_online.html
I have received the whole, correct page twice in a row now. I can't say that it's not a fluke, but perhaps asking the web server to compress the page forces it to collect all the information at once before sending, possibly preventing a mid-stream break in the transfer.

I do not know if wget has a similar option.
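
To see exactly what --compressed changes on the wire, you can compare the request headers (a sketch; in verbose mode curl prefixes outgoing request lines with ">"):

Code:

# Plain request: no Accept-Encoding header is sent
curl -v -o /dev/null http://www.discoverygc.com/server/players_online.html 2>&1 | grep '^>'

# With --compressed, curl also sends an Accept-Encoding header (gzip, deflate, ...)
curl -v --compressed -o /dev/null http://www.discoverygc.com/server/players_online.html 2>&1 | grep '^>'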

eSelix 01-31-2012 03:10 PM

I don't think this is an ordinary "break in transfer", given that an old time is displayed in "last update". Maybe it is a bug in the nginx server and it serves stale cached data: some clients request compressed data and some plain, and perhaps it can only cache one variant.

If you clear "network.http.accept-encoding" in "about:config" in Firefox, disabling compression, you also get the corrupted page (and the old time).
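
That theory is easy to test from the shell (a sketch; it assumes the literal string "last update" appears in the page, as described above):

Code:

# Plain variant:
curl -s http://www.discoverygc.com/server/players_online.html | grep -i 'last update'

# Compressed variant:
curl -s --compressed http://www.discoverygc.com/server/players_online.html | grep -i 'last update'

If the two timestamps differ, the server is serving a stale cached copy for one encoding but not the other.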

Dark_Helmet 01-31-2012 03:30 PM

Good point about the old update time. I'd forgotten about that and focused on the amount of data transferred.

moebus 01-31-2012 04:01 PM

wow :)
Thank you very much.
I've just tried curl --compressed http://www.discoverygc.com/server/players_online.html > players_online.html and it works here too.
So I will use curl in my script instead of wget (since I did not find any wget option related to compression), and that will be fine for me.
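
In case it is useful to anyone, the script can stay very simple; something like this (a sketch; the output file name and the 60-second interval are just my choices):

Code:

#!/bin/sh
# Fetch the player list once a minute; --compressed requests a
# gzip-encoded response, which works around the truncated/stale copy.
while true; do
    curl -s --compressed -o players_online.html \
        http://www.discoverygc.com/server/players_online.html
    sleep 60
done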

Should I edit the title of the topic and add a [SOLVED] in it?

Dark_Helmet 01-31-2012 04:27 PM

Quote:

Originally Posted by moebus
Should I edit the title of the topic and add a [SOLVED] in it?

You don't need to edit the thread title. There should be a checkbox or option of some kind to "mark thread as solved." If memory serves, there's a "Thread tools" menu somewhere near the top of the thread. Check that menu for something appropriate.

moebus 01-31-2012 09:58 PM

thanks again.

