[SOLVED] Make a database of websites on the internet

ac_kumar · 08-19-2012, 02:49 PM

Hi, I want to make a script which searches for all the websites on the internet and then adds them to a MySQL database with their meta tags as keywords.
Any help is appreciated.
Thanks in advance.

TobiSGD · 08-19-2012, 03:22 PM

According to estimates, the web currently contains about 7 billion webpages (http://www.worldwidewebsize.com/). If you assume an average data volume of 200 bytes per page (address + meta tags), you would need about 1.4 terabytes of disk space for the raw records alone, before indexes, redundancy and the machines needed to crawl all of it. So the first thing you should do is budget for plenty of hard disks and servers to host your database.
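
The arithmetic is easy to check: 7 billion pages at 200 bytes each is about 1.4 × 10^12 bytes. A rough sketch in Python, using only those two figures as inputs (the function name is made up for illustration):

```python
def raw_storage_bytes(pages, bytes_per_page):
    """Raw size of one record per page, ignoring indexes and overhead."""
    return pages * bytes_per_page


total = raw_storage_bytes(7_000_000_000, 200)
print(f"{total / 10**12:.1f} TB")  # decimal terabytes
print(f"{total / 10**15:.4f} PB")  # decimal petabytes
```

That only covers the address and meta tags. Storing page text, indexes or more than one copy would multiply the figure.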

devnull10 · 08-19-2012, 05:10 PM

Also be prepared to wait a little while...

Code:

 tmp $ time wget -O - www.google.co.uk > /dev/null
--2012-08-19 23:11:01--  http://www.google.co.uk/
Resolving www.google.co.uk (www.google.co.uk)... 173.194.67.94, 2a00:1450:4007:803::101f
Connecting to www.google.co.uk (www.google.co.uk)|173.194.67.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "STDOUT"

    [ <=>                                                       ] 13,143      --.-K/s   in 0.03s   

2012-08-19 23:11:01 (504 KB/s) - written to stdout [13143]


real    0m0.116s
user    0m0.005s
sys     0m0.000s

and that's with a pretty fast site and a relatively small page. Add to that the time taken to parse the data, write to disk, wait for MySQL to release its resources, etc... Even with multiple threads, you're looking at quite a number of YEARS of processing.
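
To put a rough number on it, here is a back-of-the-envelope sketch in Python. It assumes every page takes the 0.116 s measured above and borrows the estimate of roughly 7 billion pages from earlier in the thread. The thread counts are made up for illustration, perfect parallel scaling is assumed, and parsing and database time are ignored.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def crawl_years(pages, seconds_per_page, threads=1):
    """Years needed to fetch every page once, assuming perfect parallel scaling."""
    return pages * seconds_per_page / threads / SECONDS_PER_YEAR


# Thread counts are illustrative assumptions, not measurements.
for threads in (1, 4, 8):
    years = crawl_years(7_000_000_000, 0.116, threads)
    print(f"{threads} thread(s): {years:.1f} years")
```

On a single thread that works out to roughly 26 years for one pass over the web, and most pages will respond more slowly than Google's front page.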

TB0ne · 08-19-2012, 05:12 PM

Quote:

Originally Posted by ac_kumar

Hi, I want to make a script which searches for all the websites on the internet and then adds them to a MySQL database with their meta tags as keywords.
Any help is appreciated.
Thanks in advance.

Ahh...that's already been done. It's called "Google".

earthnet · 08-20-2012, 10:30 AM

Quote:

Originally Posted by ac_kumar

Any help is appreciated.

Help with what? You didn't ask any questions.

ac_kumar · 08-20-2012, 11:46 AM

Quote:

Originally Posted by TB0ne

Ahh...that's already been done. It's called "Google".

Do you think I don't know about Google?
If you were advising Linus Torvalds, you would have asked why make Linux when DOS had already been invented.
I tell you one thing: re-inventing has made technology go further, e.g. light bulbs to LED bulbs, stone wheels to rubber wheels.

ac_kumar · 08-20-2012, 12:00 PM

Quote:

Originally Posted by devnull10

Also be prepared to wait a little while...

Code:

 tmp $ time wget -O - www.google.co.uk > /dev/null
--2012-08-19 23:11:01--  http://www.google.co.uk/
Resolving www.google.co.uk (www.google.co.uk)... 173.194.67.94, 2a00:1450:4007:803::101f
Connecting to www.google.co.uk (www.google.co.uk)|173.194.67.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "STDOUT"

    [ <=>                                                       ] 13,143      --.-K/s   in 0.03s   

2012-08-19 23:11:01 (504 KB/s) - written to stdout [13143]


real    0m0.116s
user    0m0.005s
sys     0m0.000s

and that's with a pretty fast site and a relatively small page. Add to that the time taken to parse the data, write to disk, wait for MySQL to release its resources, etc... Even with multiple threads, you're looking at quite a number of YEARS of processing.

Could you please explain what this command is doing? As far as I know, wget gets pages from the internet.

TB0ne · 08-20-2012, 01:42 PM

Quote:

Originally Posted by ac_kumar

Could you please explain what this command is doing? As far as I know, wget gets pages from the internet.

Did you read/look at the man pages for time and wget? They will explain what the commands do, and what the options are doing. The time command lets you see how much time the given command takes. In this case, wget with -O - writes the downloaded page to standard output instead of a file, and the > /dev/null redirect throws it away.
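
If it helps, the same measurement can be sketched in Python: fetch the page, throw the body away, and report how long it took. This is only a rough equivalent of the shell command, and `fetch_and_discard` and `time_call` are names made up for this example.

```python
import time
import urllib.request


def fetch_and_discard(url):
    """Download url and drop the body, roughly like `wget -O - url > /dev/null`."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return len(response.read())  # keep only the byte count


def time_call(func, *args):
    """Run func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Usage (needs network access, so it is left commented out):
# size, seconds = time_call(fetch_and_discard, "http://www.google.co.uk/")
# print(f"{size} bytes in {seconds:.3f}s")
```

The elapsed value corresponds to the "real" line in the output of time, not to the user or sys lines.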

Quote:

Originally Posted by ac_kumar

Do you think I don't know about Google?

Apparently not, since you're asking how to re-create it.

Quote:

If you were advising Linus Torvalds, you would have asked why make Linux when DOS had already been invented.

No, since the only way Linux is like DOS is that they both have a command line. There is no duplication of functionality.

Quote:

I tell you one thing: re-inventing has made technology go further, e.g. light bulbs to LED bulbs, stone wheels to rubber wheels.

Yes, each is better in some way than what has come before. What, exactly, is going to be different and better about what you're doing? There are MANY web-crawlers you can find easily, written in pretty much every programming language. What language do you want to write this in, and what problems are you having now? You've essentially asked a very open-ended question, with MANY different answers.

guyonearth · 08-20-2012, 03:00 PM

The technical problems of doing what you asked about have been explained. In short, your idea makes no sense given that it has already been done many times by search engines like Google. Given the nature of your question it wouldn't appear you are going to be inventing a better mousetrap any time soon.

devnull10 · 08-20-2012, 03:49 PM

Quote:

Originally Posted by ac_kumar

Could you please explain what this command is doing? As far as I know, wget gets pages from the internet.

Yes, I was merely illustrating the time it takes for a moderately powered PC on a fairly fast internet connection to return a single small/fast webpage from the internet. Scale that up and account for slower responses and you're looking at years and years of processing. Sure, you can have a "vision" but what we are trying to tell you is that in all reality, it's pretty much asking the impossible - Google does a good job of it - but not perfect by any means.
How do you intend to traverse sites? By developing a robot which recursively parses links on each site it finds? Then you have to check whether you have already visited every link else you'll end up with cycles (which could be massively long and wasteful).
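
A minimal sketch of that visited check in Python, assuming a breadth-first traversal. The `get_links` callable and the three-page `toy_web` are hypothetical stand-ins for real fetching and link parsing, so nothing here touches the network.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=1000):
    """Breadth-first traversal that never visits the same URL twice."""
    visited = set()
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # already seen: this check is what breaks cycles
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order


# Hypothetical site: a links to b and c, b links to c, c links back to a.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(crawl("a", lambda url: toy_web.get(url, [])))
```

Without the visited set, the link from c back to a would send the robot round the same three pages until max_pages runs out. At web scale that set no longer fits in memory, which is a problem of its own.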

This is a serious comment - if you have got several hundred thousand pounds to simply start this project off, never mind fund it, and are able to fund a full-time team of analysts, developers etc., then you might get a small way into doing it.

ac_kumar · 08-21-2012, 02:00 PM

Thank you all for the very helpful answers.
See, I am very fascinated by how Google works, and yes, sometimes I don't find it useful.
So I was thinking that I could make a stripped-down web search engine to experiment with.
As for the storage problem, I can manage: I'll add a few websites to the database and then work on it further.
I just want to do this project for fun.

TobiSGD · 08-21-2012, 02:38 PM

If you are doing it just for fun, have a look here: http://www.udacity.com/overview/Course/cs101
It is a series of video tutorials in which they build a search engine using Python to teach the basics of computer science.
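
For a small experiment with a handful of sites, pulling the meta keywords out of a page and storing them takes very little code. The sketch below is not taken from that course. It uses Python's standard `html.parser`, and swaps in `sqlite3` for MySQL so that it runs without a server. The table layout, function names, sample page and URL are all made up for illustration.

```python
import sqlite3
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the content of <meta name="keywords" ...> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]


def extract_keywords(html):
    """Return the list of meta keywords found in an HTML string."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


def store_page(conn, url, keywords):
    """Insert or update one row per URL (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, keywords TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, ",".join(keywords))
    )
    conn.commit()


# Made-up page and URL, stored in an in-memory database.
sample_html = '<html><head><meta name="keywords" content="linux, forum, help"></head></html>'
conn = sqlite3.connect(":memory:")
store_page(conn, "http://example.com/", extract_keywords(sample_html))
print(conn.execute("SELECT url, keywords FROM pages").fetchall())
```

Moving to MySQL later would mostly mean changing the connection and the upsert statement. The parsing part stays the same.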