LinuxQuestions.org > Linux - General
(https://www.linuxquestions.org/questions/linux-general-1/wget-failed-to-download-a-html-page-926634/)

moebus 01-30-2012 10:07 PM

wget failed to download an HTML page
 
Good day,

I am using wget to download a web page, but the page I save on my hard drive is not the same as the one I see in my browser (Firefox). In particular, the page downloaded with wget is incomplete and is an old version of the page.

The location of the page is:
http://www.discoverygc.com/server/players_online.html

I have tried different combinations of wget options, but up until now I have never been able to download the latest version of the page.
I also note that lynx has the same problem: it cannot completely load the page either (and it, too, gets an old version).

If anyone can help I would be grateful,
thanks for your attention.

moeb

PS: here is the debug output:

Code:

moebus>wget http://www.discoverygc.com/server/players_online.html -d
DEBUG output created by Wget 1.11.4 on Windows-MSVC.

--2012-01-30 23:02:32--  http://www.discoverygc.com/server/players_online.html
Resolving www.discoverygc.com... seconds 0.00, 78.46.88.89
Caching www.discoverygc.com => 78.46.88.89
Connecting to www.discoverygc.com|78.46.88.89|:80... seconds 0.00, connected.
Created socket 1892.
Releasing 0x00982880 (new refcount 1).

---request begin---
GET /server/players_online.html HTTP/1.0
User-Agent: Wget/1.11.4
Accept: */*
Host: www.discoverygc.com
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Server: nginx/1.0.11
Date: Tue, 31 Jan 2012 04:02:40 GMT
Content-Type: text/html
Content-Length: 22375
Last-Modified: Tue, 31 Jan 2012 04:02:31 GMT
Connection: keep-alive
Accept-Ranges: bytes

---response end---
200 OK
Registered socket 1892 for persistent reuse.
Length: 22375 (22K) [text/html]
Saving to: `players_online.html'

100%[======================================>] 22,375      107K/s  in 0.2s

2012-01-30 23:02:32 (107 KB/s) - `players_online.html' saved [22375/22375]


knudfl 01-31-2012 02:43 AM

1 Attachment(s)
No problems here with

wget http://www.discoverygc.com/server/players_online.html

The file "players_online.html", 23.2 kB is downloaded.

The file is attached as players_online.html.txt
( .txt is the allowed suffix for attaching this type of file.)

theKbStockpiler 01-31-2012 07:26 AM

If you have a recent version of Firefox, its Download Manager will work better than KGet and GWget on difficult files. The problem is most likely the server or the network.

moebus 01-31-2012 07:58 AM

Hi, thanks for your replies.

Knudfl, the file you attached is incomplete and is not the latest version.
Some data are missing from the file: the ping, loss, and lag columns are not present in the second table, and at the end there are strings of characters that should be inside the table.
The last update is not supposed to be 31/01/2012 06:15:47 [UTC], since the file is updated every 10 seconds.
If I check now with Firefox, the last update is 31/01/2012 13:58:12 [UTC], while if I download it now with wget it is 31/01/2012 06:15:47 [UTC].

KbStockpiler, I plan to download the file every minute using a script. I am not sure I can use Firefox in such a script.

qlue 01-31-2012 11:57 AM

Quote:

Originally Posted by moebus (Post 4589369)
Hi, thanks for your replies.

Knudfl, the file you attached is incomplete and is not the latest version.
Some data are missing from the file: the ping, loss, and lag columns are not present in the second table, and at the end there are strings of characters that should be inside the table.
The last update is not supposed to be 31/01/2012 06:15:47 [UTC], since the file is updated every 10 seconds.
If I check now with Firefox, the last update is 31/01/2012 13:58:12 [UTC], while if I download it now with wget it is 31/01/2012 06:15:47 [UTC].

KbStockpiler, I plan to download the file every minute using a script. I am not sure I can use Firefox in such a script.

I get the same results as knudfl.
wget works fine here and retrieves the latest version of the page. Maybe there is a caching issue specific to your installation or firewall?
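
If a cache is sitting between you and the server, one quick test (a sketch; if I remember right, wget's --no-cache makes it send a "Pragma: no-cache" header so intermediate caches revalidate) would be:

Code:

wget --no-cache http://www.discoverygc.com/server/players_online.html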

moebus 01-31-2012 12:33 PM

2 Attachment(s)
Hi again,
I'm surprised; the file is supposed to be 46 kB, not 23 kB.
I will check my cache and try it on another internet connection.
(Up until now I've tried on two different computers with different OSes, but connected to the same router.)
Thanks for your comment.

Dark_Helmet 01-31-2012 12:50 PM

I'm no web expert, but here are two things to check, both of which depend on the web server:

1. Does the web server serve different pages based on the user-agent string supplied in the request? Firefox and wget certainly send different user-agent strings by default. I believe wget can be told to supply a different user-agent on the command line.

2. Does the web server serve different pages to (ro)bots? As far as I understand it, wget honors a site's robots.txt during recursive retrievals; that adherence is actually configurable (see the sketch just below), though it should not matter for a plain single-file download.
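
For what it's worth, the robots behaviour can be toggled like this (a sketch; -e robots=off only has an effect on recursive, -r, retrievals, so it should not matter for a single file):

Code:

wget -e robots=off -r -l 1 http://www.discoverygc.com/server/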

EDIT:
I just tried running the following command:
Code:

wget --user-agent="Mozilla/5.0 (Windows NT 6.2; rv:9.0.1) Gecko/20100101 Firefox/9.0.1" http://www.discoverygc.com/server/players_online.html
The transfer halted after about 1 or 2 KB. I then ran:
Code:

wget -c --user-agent="Mozilla/5.0 (Windows NT 6.2; rv:9.0.1) Gecko/20100101 Firefox/9.0.1" http://www.discoverygc.com/server/players_online.html
The wget output indicated that the second command did indeed continue the earlier transfer. However, looking at the file afterwards, it was still incomplete.

That would seem to suggest the problem is on the web server's side. Perhaps Firefox has some error-recovery code that wget does not.

EDIT2:
The same problem occurs for curl as well.
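
(The curl attempt was a plain fetch with no special options, something like the following, and it truncated the same way:)

Code:

curl http://www.discoverygc.com/server/players_online.html > players_online.html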

EDIT3:
I just tried the following command:
Code:

curl --compressed http://www.discoverygc.com/server/players_online.html > players_online.html
I have received the whole, correct page twice in a row now. I can't say that it's not a fluke, but perhaps asking the web server to compress the page forces it to collect all the information at once before sending, possibly preventing a mid-stream break in the transfer.

I do not know if wget has a similar option.
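
To see exactly what --compressed changes on the wire, you can compare the request headers (a sketch; in verbose mode curl prefixes outgoing request lines with ">"):

Code:

# Plain request: no Accept-Encoding header is sent
curl -v -o /dev/null http://www.discoverygc.com/server/players_online.html 2>&1 | grep '^>'

# With --compressed, curl also sends an Accept-Encoding header (gzip, deflate, ...)
curl -v --compressed -o /dev/null http://www.discoverygc.com/server/players_online.html 2>&1 | grep '^>'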

eSelix 01-31-2012 03:10 PM

I don't think this is an ordinary "break in transfer", given that an old time is displayed in "last update". Maybe it is a bug in the nginx server and it serves stale cached data: some clients request compressed data and some plain, and perhaps it can only cache one variant.

If you clear "network.http.accept-encoding" in "about:config" in Firefox, disabling compression, you also get the corrupted page (and the old time).
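
That theory is easy to test from the shell (a sketch; it assumes the literal string "last update" appears in the page, as described above):

Code:

# Plain variant:
curl -s http://www.discoverygc.com/server/players_online.html | grep -i 'last update'

# Compressed variant:
curl -s --compressed http://www.discoverygc.com/server/players_online.html | grep -i 'last update'

If the two timestamps differ, the server is serving a stale cached copy for one encoding but not the other.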

Dark_Helmet 01-31-2012 03:30 PM

Good point about the old update time. I'd forgotten about that and focused on the amount of data transferred.

moebus 01-31-2012 04:01 PM

wow :)
Thank you very much.
I've just tried curl --compressed http://www.discoverygc.com/server/players_online.html > players_online.html and it works here too.
So I will use curl in my script instead of wget (since I did not find any wget option related to compression), and that will be fine for me.
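
In case it is useful to anyone, the script can stay very simple; something like this (a sketch; the output file name and the 60-second interval are just my choices):

Code:

#!/bin/sh
# Fetch the player list once a minute; --compressed requests a
# gzip-encoded response, which works around the truncated/stale copy.
while true; do
    curl -s --compressed -o players_online.html \
        http://www.discoverygc.com/server/players_online.html
    sleep 60
done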

Should I edit the title of the topic and add a [SOLVED] in it?

Dark_Helmet 01-31-2012 04:27 PM

Quote:

Originally Posted by moebus
Should I edit the title of the topic and add a [SOLVED] in it?

You don't need to edit the thread title. There should be a checkbox or option of some kind to "mark thread as solved." If memory serves, there's a "Thread tools" menu somewhere near the top of the thread. Check that menu for something appropriate.

moebus 01-31-2012 09:58 PM

thanks again.

