Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Hi,
after years of bash scripting I just completed my first scraper in python.
This scraper is meant to be the first of several for an app that automatically feeds an e-commerce platform with new products from my suppliers (one scraper per supplier).
Now I'm a bit stuck moving forward, as I can't figure out the best way to integrate the scraper into a larger app.
Should the scraper itself be responsible for writing data to the DB? Or should it just return the scraped data and let the main application write to the DB?
Same doubts about the querying part: should the scraper contain a function for querying the DB? Or should the main app do this instead?
That all depends on you. From my side I would prefer a modularized solution, but you need to write classes for that: one for DB access (read/write), another one for processing data, ...
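For example, something like this (the class and method names are made up here, just to show the separation):

Code:
class ProductStore:
    """All DB access (read/write) goes through this class."""
    def __init__(self, connection):
        self.conn = connection

    def upsert(self, product: dict) -> None:
        ...  # INSERT or UPDATE, depending on your DB

class SupplierScraper:
    """Scrapes one supplier and returns plain dicts; knows nothing about the DB."""
    def scrape(self) -> list[dict]:
        ...

# The main app wires the two together:
store = ProductStore(connection=...)
for product in SupplierScraper().scrape():
    store.upsert(product)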
Thanks for your reply.
The scraper is modularized already. What is not clear to me is whether the DB access class should be a module of the main app, a module of the scraper package, or a separate package.
I would probably have each scraper output appropriate (validated/normalized/etc) data to a single combined queue (whether as files in a directory, or items in a data store, or whatever), and be unaware of the ultimate target database to reduce dependencies.
Then I would have a separate process with the sole purpose of reading unprocessed items from the queue and inserting into the main database.
When there are issues, you can then examine the queue to help isolate where those issues are occurring, as well as manually re-processing items as needed.
How that gets split into packages/modules/classes... well, examine how the main application is structured and pick something which is consistent with that.
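A minimal sketch of the file-based variant, assuming JSON files in a spool directory (the paths and the insert hook are placeholders, not from any real setup):

Code:
import json
import os
import uuid

QUEUE_DIR = "/var/spool/scrapers/queue"   # unprocessed items
DONE_DIR = "/var/spool/scrapers/done"     # kept for auditing/re-processing

os.makedirs(QUEUE_DIR, exist_ok=True)
os.makedirs(DONE_DIR, exist_ok=True)

def enqueue(item: dict) -> None:
    """Called by a scraper: write one validated item to the queue."""
    path = os.path.join(QUEUE_DIR, f"{uuid.uuid4()}.json")
    with open(path, "w") as f:
        json.dump(item, f)

def process_queue(insert_into_db) -> None:
    """Separate process: read unprocessed items and insert into the main DB."""
    for name in sorted(os.listdir(QUEUE_DIR)):
        path = os.path.join(QUEUE_DIR, name)
        with open(path) as f:
            item = json.load(f)
        insert_into_db(item)  # your real DB layer goes here
        os.rename(path, os.path.join(DONE_DIR, name))  # keep for later examination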
Before you go any further with this, I encourage you to look at sites such as GitHub and SourceForge.
Because ...
"Actum Ne Agas: Do Not Do A Thing Already Done."
Full Disclosure: I have not researched your particular requirement, nor do I intend to. But it "smells" to me like the sort of thing that many other people have confronted before, and have since elegantly solved and freely shared. Before you go further, check it out.
(And: "report back here." Was I right or wrong? If you did find a solution, what was it?) Someday, someone else will "stumble upon" this thread, and they will thank you.)
(And: "report back here." Was I right or wrong? If you did find a solution, what was it?)
I'm afraid you were wrong...
I'm a compulsive GitHub user, but still I was not able to find anything useful. There are plenty of scraper examples out there, but they're very simple apps, far from the requirements of my application. In particular, these are the hardest ones for me to satisfy:
Keep a DB which includes discontinued products removed from a supplier's website
Track changes in scraped product specs
Merge the specs of different products from different suppliers
I'm trying to solve this problem right now: how to deal with changes in the scraped data structure?
Sample code (a made-up example, but my scraper returns something along these lines):
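Code:
def scrape_product(prod_id):
    # ... fetch and parse the supplier's product page ...
    return {
        "prod_id": prod_id,
        "name": "Widget X",
        "specs": {
            "color": "red",
            "weight": "1.2 kg",
        },
    }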
Now, let's say that next month that supplier changes its website layout. I would have to fix the scraper, but I may also find that the data structure has changed and the same prod_id returns a different dict, say something like this:
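Code:
def scrape_product(prod_id):
    # ... fixed for the new layout ...
    return {
        "prod_id": prod_id,
        "name": "Widget X",
        "brand": "Acme",  # hypothetical new field
        "specs": [        # was a dict, now a list of key/value pairs
            {"key": "color", "value": "red"},
            {"key": "weight", "value": "1.2 kg"},
        ],
    }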
In this case I would have to perform some migrations on the products DB, then change the product ORM classes, and so on...
To deal with such cases I was thinking (thanks to some suggestions in the previous answers) about something like this:
Scrapers simply scrape raw data, no matter whether the data structure changes over time.
A 'queue_manager' package checks each scraped raw_prod against the 'raw_data' MongoDB. If the raw_prod is new or changed, it's added to the 'raw_data_queue' MongoDB.
A 'normalizer' package tries to normalize each raw_prod in the queue. If it succeeds, the normalized object is added or updated in the 'normalized_data' DB and the raw_prod is moved from the 'raw_data_queue' MongoDB to the 'raw_data' MongoDB. If it fails, it just raises an error.
With such a setup, scrapers should only need to be fixed when the layout of the scraped pages changes.
If something changes in the raw_prod structure, instead, I would only have to fix the 'normalizer' package, and migrate the 'normalized_data' DB only if the new data structure contains relevant new information.
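For the 'queue_manager' part I imagine something like this (a rough, untested sketch using pymongo, with a hash of the raw dict to detect changes; the DB and collection names are just the ones from my description above):

Code:
import hashlib
import json

from pymongo import MongoClient

db = MongoClient()["products"]

def raw_hash(raw_prod: dict) -> str:
    """Stable fingerprint of a raw dict, used to detect changes."""
    return hashlib.sha256(
        json.dumps(raw_prod, sort_keys=True).encode()
    ).hexdigest()

def enqueue_if_new_or_changed(raw_prod: dict) -> None:
    """queue_manager: compare against 'raw_data', enqueue on mismatch."""
    h = raw_hash(raw_prod)
    known = db.raw_data.find_one({"prod_id": raw_prod["prod_id"]})
    if known is None or known.get("hash") != h:
        db.raw_data_queue.replace_one(
            {"prod_id": raw_prod["prod_id"]},
            {"prod_id": raw_prod["prod_id"], "hash": h, "raw": raw_prod},
            upsert=True,
        )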
Do you think this could be a good approach to the problem?
Thanks!
There is no general solution to this problem. The approach looks OK from my side. Nowadays websites usually have a [REST] API, and you can communicate with them much better using that API (but it obviously depends on the website).
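For example (the endpoint and field names here are invented, but this is the general idea with the requests library):

Code:
import requests

resp = requests.get(
    "https://api.example-supplier.com/v1/products",  # hypothetical endpoint
    params={"page": 1},
    timeout=30,
)
resp.raise_for_status()
for product in resp.json()["products"]:
    print(product["id"], product["name"])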
Scraping APIs is always my first choice, but even when they're available I have no control over when they change.
I hope my setup will allow me to manage those changes as well.