Old 01-30-2012, 10:07 PM   #1
moebus
LQ Newbie
 
Registered: Jan 2012
Posts: 5

Rep: Reputation: Disabled
wget failed to download a html page


Good day,

I am using wget to download a web page, but the page I save on my hard drive is not the same as the one I see with my browser (Firefox). In particular, the page downloaded with wget is incomplete and is an old version of the webpage.

The location of the page is:
http://www.discoverygc.com/server/players_online.html

I have tried different combinations of wget options, but so far I have never been able to download the latest version of the webpage.
I also note that lynx has the same problem: it cannot completely load the page (and what it loads is an old version of the page).

If anyone can help I would be pleased.
Thanks for your attention,

moeb

PS: here is the debug output:

Code:
moebus>wget http://www.discoverygc.com/server/players_online.html -d
DEBUG output created by Wget 1.11.4 on Windows-MSVC.

--2012-01-30 23:02:32--  http://www.discoverygc.com/server/players_online.html
Resolving www.discoverygc.com... seconds 0.00, 78.46.88.89
Caching www.discoverygc.com => 78.46.88.89
Connecting to www.discoverygc.com|78.46.88.89|:80... seconds 0.00, connected.
Created socket 1892.
Releasing 0x00982880 (new refcount 1).

---request begin---
GET /server/players_online.html HTTP/1.0
User-Agent: Wget/1.11.4
Accept: */*
Host: www.discoverygc.com
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Server: nginx/1.0.11
Date: Tue, 31 Jan 2012 04:02:40 GMT
Content-Type: text/html
Content-Length: 22375
Last-Modified: Tue, 31 Jan 2012 04:02:31 GMT
Connection: keep-alive
Accept-Ranges: bytes

---response end---
200 OK
Registered socket 1892 for persistent reuse.
Length: 22375 (22K) [text/html]
Saving to: `players_online.html'

100%[======================================>] 22,375       107K/s   in 0.2s

2012-01-30 23:02:32 (107 KB/s) - `players_online.html' saved [22375/22375]
 
Old 01-31-2012, 02:43 AM   #2
knudfl
LQ 5k Club
 
Registered: Jan 2008
Location: Copenhagen DK
Distribution: PCLinuxOS2023 Fedora38 + 50+ other Linux OS, for test only.
Posts: 17,516

Rep: Reputation: 3641
No problems here with

wget http://www.discoverygc.com/server/players_online.html

The file "players_online.html", 23.2 kB is downloaded.

The file is attached as players_online.html.txt
( .txt is the allowed suffix for attaching this type of file.)
..
Attached Files
File Type: txt players_online.html.txt (23.2 KB, 20 views)
 
Old 01-31-2012, 07:26 AM   #3
theKbStockpiler
Member
 
Registered: Sep 2009
Location: Central New York
Distribution: RPM Distros,Mostly Mandrake Forks;Drake Tools/Utilities all the way!GO MAGEIA!!!
Posts: 986

Rep: Reputation: 53
If you have a good version of Firefox, its Download Manager will work better than KGet and Gwget on difficult files. The problem is most likely the server or the network.
 
Old 01-31-2012, 07:58 AM   #4
moebus
LQ Newbie
 
Registered: Jan 2012
Posts: 5

Original Poster
Rep: Reputation: Disabled
Hi, thanks for your replies.

Knudfl, the file you attached is incomplete and is not the latest version.
Some data in this file are missing: the ping, loss, and lag columns are not present in the second table, and at the end there are some character strings that should be inside the table.
The last update is not supposed to be 31/01/2012 06:15:47 [UTC], since the file is updated every 10 seconds.
If I check now with Firefox, the last update is 31/01/2012 13:58:12 [UTC], while if I download it now with wget it is still 31/01/2012 06:15:47 [UTC].

theKbStockpiler, I plan to download the file every minute using a script. I am not sure I can use Firefox in such a script.
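For what it's worth, here is roughly what I have in mind, as a cron entry (just a rough sketch under my own assumptions: the target directory is made up, and wget's -N timestamping only helps if the server reports correct Last-Modified times):
Code:
# run once a minute; -N re-downloads only when the server reports a newer file
* * * * * wget -q -N -P /home/moebus/players http://www.discoverygc.com/server/players_online.html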

Last edited by moebus; 01-31-2012 at 08:01 AM.
 
Old 01-31-2012, 11:57 AM   #5
qlue
Member
 
Registered: Aug 2009
Location: Umzinto, South Africa
Distribution: Crunchbangified Debian 8 (Jessie)
Posts: 747
Blog Entries: 1

Rep: Reputation: 172
Quote:
Originally Posted by moebus View Post
Hi, thanks for your replies.

Knudfl, the file you attached is incomplete and is not the latest version.
Some data in this file are missing: the ping, loss, and lag columns are not present in the second table, and at the end there are some character strings that should be inside the table.
The last update is not supposed to be 31/01/2012 06:15:47 [UTC], since the file is updated every 10 seconds.
If I check now with Firefox, the last update is 31/01/2012 13:58:12 [UTC], while if I download it now with wget it is still 31/01/2012 06:15:47 [UTC].

theKbStockpiler, I plan to download the file every minute using a script. I am not sure I can use Firefox in such a script.
I get the same results as Knudfl.
wget works fine and retrieves the latest version of the page. Maybe there is a caching issue specific to your installation or firewall?
 
Old 01-31-2012, 12:33 PM   #6
moebus
LQ Newbie
 
Registered: Jan 2012
Posts: 5

Original Poster
Rep: Reputation: Disabled
Hi again,
I'm surprised; the file is supposed to be 46 kB, not 23 kB.
I will check my cache and try it on another internet connection.
(Up until now I've tried on two different computers with different OSes, but connected to the same router.)
Thanks for your comments.
Attached Files
File Type: txt players_online(wget).html.txt (23.2 KB, 13 views)
File Type: txt players_online(firefox).html.txt (46.5 KB, 15 views)

Last edited by moebus; 01-31-2012 at 12:36 PM. Reason: add attachement
 
Old 01-31-2012, 12:50 PM   #7
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 374
I'm no web expert, but two things to check--both of which depend on the web server:

1. Does the web server offer different pages based on the user-agent string supplied by the request? Firefox and wget certainly provide different user-agents by default. I believe wget can be told to supply a different user-agent on the command line.

2. Does the web server offer different pages for (ro)bots? As far as I understand it, wget honors site restrictions for robots. And as far as I know, that adherence is fixed, not configurable on the command line.

EDIT:
I just tried running the following command:
Code:
wget --user-agent="Mozilla/5.0 (Windows NT 6.2; rv:9.0.1) Gecko/20100101 Firefox/9.0.1" http://www.discoverygc.com/server/players_online.html
The transfer halted after about 1 or 2 KB. I then ran:
Code:
wget -c --user-agent="Mozilla/5.0 (Windows NT 6.2; rv:9.0.1) Gecko/20100101 Firefox/9.0.1" http://www.discoverygc.com/server/players_online.html
The wget output indicated that the second command did, indeed, continue the earlier transfer. However, when I looked at the file, it was still incomplete.

That would seem to say the problem is on the web server side. Perhaps Firefox has some error-recovery code that wget does not.

EDIT2:
The same problem occurs for curl as well.

EDIT3:
I just tried the following command:
Code:
curl --compressed http://www.discoverygc.com/server/players_online.html > players_online.html
I received the whole, correct page twice in a row now. I can't say that it's not a fluke, but perhaps the request for the web server to compress the page forces the web server to collect all the information at once before sending it--possibly preventing a mid-stream break in the transfer.

I do not know if wget has a similar option.
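One untested idea (treat it as a sketch, not something I've verified against this server): you can at least ask the server for a compressed response with a manual header, but wget 1.11 will not decompress it for you, so the saved file would need to be run through gunzip:
Code:
# request gzip; wget saves the compressed bytes as-is, so decompress afterwards
wget --header="Accept-Encoding: gzip" -O players_online.html.gz http://www.discoverygc.com/server/players_online.html
gunzip -f players_online.html.gz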

Last edited by Dark_Helmet; 01-31-2012 at 01:15 PM.
 
1 member found this post helpful.
Old 01-31-2012, 03:10 PM   #8
eSelix
Senior Member
 
Registered: Oct 2009
Location: Wroclaw, Poland
Distribution: Arch, Kubuntu
Posts: 1,281

Rep: Reputation: 320
I don't think this has anything to do with a "break in transfer", given that an old time is shown in "last update". Maybe it is a bug in the nginx server and it serves old cached data: some clients request compressed data and some plain, and perhaps it can only cache one of the two.

When you clear "network.http.accept-encoding" in "about:config" in Firefox, disabling compression, you also get the corrupted page (and the old time).
 
1 member found this post helpful.
Old 01-31-2012, 03:30 PM   #9
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 374
Good point about the old update time. I'd forgotten about that and focused on the amount of data transferred.
 
Old 01-31-2012, 04:01 PM   #10
moebus
LQ Newbie
 
Registered: Jan 2012
Posts: 5

Original Poster
Rep: Reputation: Disabled
Thumbs up

Wow :)
Thank you very much!
I've just tried curl --compressed http://www.discoverygc.com/server/players_online.html > players_online.html and it is working here too.
So I will use curl in my script instead of wget (since I did not find any wget option related to compression), and that will work for me.
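Here is roughly the script I intend to use, in case it helps anyone else (a minimal sketch; the output file name and the one-minute interval are just my own choices):
Code:
#!/bin/sh
# poll the page once a minute with curl, asking for a compressed response
while true; do
    curl -s --compressed -o players_online.html http://www.discoverygc.com/server/players_online.html
    sleep 60
done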

Should I edit the title of the topic and add [SOLVED] to it?
 
Old 01-31-2012, 04:27 PM   #11
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 374
Quote:
Originally Posted by moebus
Should I edit the title of the topic and add a [SOLVED] in it?
You don't need to edit the thread title. There should be a checkbox or option of some kind to "mark thread as solved." If memory serves, there's a "Thread tools" menu somewhere near the top of the thread. Check that menu for something appropriate.

Last edited by Dark_Helmet; 01-31-2012 at 04:30 PM.
 
1 member found this post helpful.
Old 01-31-2012, 09:58 PM   #12
moebus
LQ Newbie
 
Registered: Jan 2012
Posts: 5

Original Poster
Rep: Reputation: Disabled
Thanks again.
 
  


