Old 01-31-2022, 04:25 AM   #1
masavini
Member
 
Registered: Jun 2008
Posts: 285

Rep: Reputation: 6
Python app design


Hi,
after years of bash scripting I just completed my first scraper in Python.

This scraper should be the first piece of an app for automatically feeding an e-commerce platform with new products from my suppliers (one scraper for each supplier).

Now I'm a bit stuck moving forward, as I can't figure out the best way to integrate the scraper into a larger app.

Should the scraper itself be responsible for writing data to the DB? Or should it just return the scraped data and let the main application write to the DB?

I have the same doubts about the querying part: should the scraper contain a function for querying the DB, or should the main app do this instead?

Thanks for your help!
 
Old 01-31-2022, 04:54 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,096

Rep: Reputation: 7365
That all depends on you. From my side, I would prefer a modularized solution, but you need to write classes for that: one for DB access (read/write), another for processing data, and so on.
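For illustration, a minimal sketch of that kind of split using sqlite3 (class, table, and field names are purely hypothetical examples):
Code:
import sqlite3

class ProductStore:
    # DB access only: knows nothing about scraping or business rules.
    def __init__(self, path='products.db'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS products '
            '(id TEXT PRIMARY KEY, name TEXT, price REAL)')

    def upsert(self, prod):
        self.conn.execute(
            'INSERT OR REPLACE INTO products (id, name, price) VALUES (?, ?, ?)',
            (prod['id'], prod['name'], prod['price']))
        self.conn.commit()

class ProductProcessor:
    # Data processing only: turns raw scraped dicts into clean ones.
    def clean(self, raw):
        return {'id': raw['id'],
                'name': raw['name'].strip(),
                'price': float(raw['price'])}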
 
Old 01-31-2022, 05:00 AM   #3
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
Thanks for your reply.
The scraper is modularized already. What is not clear to me is whether the DB access class should be a module of the main app, a module of the scraper package, or a separate package.
 
Old 01-31-2022, 06:04 AM   #4
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,096

Rep: Reputation: 7365
Again, it depends on you. All solutions/variants can be used. From my side I would write a separate package.
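Purely as a hypothetical example, that layout could be something like:
Code:
ecommerce_feeder/        # main app
    main.py
    scrapers/            # one module per supplier
        supplier_a.py
        supplier_b.py
prodstore/               # separate package: the only code touching the DB
    __init__.py
    store.py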
 
Old 01-31-2022, 06:59 AM   #5
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,643

Rep: Reputation: 2561

I would probably have each scraper output appropriate (validated/normalized/etc) data to a single combined queue (whether as files in a directory, or items in a data store, or whatever), and be unaware of the ultimate target database to reduce dependencies.

Then I would have a separate process with the sole purpose of reading unprocessed items from the queue and inserting into the main database.

When there are issues, you can then examine the queue to help isolate where those issues are occurring, as well as manually re-processing items as needed.

How that gets split into packages/modules/classes... well, examine how the main application is structured and pick something which is consistent with that.
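A rough sketch of that shape, using one JSON file per queued item (directory names and the insert function are hypothetical):
Code:
import json
import uuid
from pathlib import Path

QUEUE_DIR = Path('queue/pending')
DONE_DIR = Path('queue/processed')

def enqueue(item):
    # Scraper side: write one validated item per file; no DB knowledge here.
    QUEUE_DIR.mkdir(parents=True, exist_ok=True)
    (QUEUE_DIR / (uuid.uuid4().hex + '.json')).write_text(json.dumps(item))

def drain(insert_into_db):
    # Loader side: a separate process whose sole job is feeding the main DB.
    DONE_DIR.mkdir(parents=True, exist_ok=True)
    for path in sorted(QUEUE_DIR.glob('*.json')):
        insert_into_db(json.loads(path.read_text()))
        path.rename(DONE_DIR / path.name)  # keep for auditing/re-processing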

 
Old 02-09-2022, 11:33 AM   #6
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,706
Blog Entries: 4

Rep: Reputation: 3949
Before you go any farther with this, I encourage you to look at sites such as github and sourceforge.

Because ...

"Actum Ne Agas: Do Not Do A Thing Already Done."

Full Disclosure: I have not researched your particular requirement, nor do I intend to. But it "smells" to me like the sort of thing that many other people have confronted before, and have since elegantly solved and freely shared. Before you go further, check it out.

(And: "report back here." Was I right or wrong? If you did find a solution, what was it?) Someday, someone else will "stumble upon" this thread, and they will thank you.)
 
Old 03-11-2022, 05:18 AM   #7
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
Quote:
Originally Posted by sundialsvcs View Post
(And: "report back here." Was I right or wrong? If you did find a solution, what was it?)
I'm afraid you were wrong...
I'm a compulsive GitHub user, but still I was not able to find anything useful. There are plenty of scraper examples, but they're very simple apps, far from the requirements of my application. In particular, these are the hardest for me to satisfy:
  • keep a DB which includes discontinued products removed from a supplier's website
  • track changes in scraped product specs
  • merge specs of products from different suppliers

I'm trying to solve this problem right now: how to deal with changes in the scraped data structure?
Sample code:
Code:
def scrape_prod_specs(prod_id):
    """Return a dict like this:

    {
        'name': 'product 1',
        'price': 30,
        'image_path': '/prod_images/product_1.jpg'
    }
    """
    prod_url = get_prod_url(prod_id)
    return get_prod_specs(prod_url)
Now, let's say that next month the supplier changes its website layout. I would have to fix the scraper, but I may also find that the data structure has changed and the same prod_id returns a different dict:
Code:
{
    'name': 'product 1',
    'prices': {
        'dealer': 20,
        'end_user': 30
    },
    'images': ['/prod_images/product_1.jpg', '/prod_images/product_2.jpg']
}
In this case I would have to perform some migrations in the products DB, then change the product ORM classes, and so on...

To deal with such cases I was thinking (thanks to some suggestions in the previous answers) about something like this (rough sketch after the list):
  1. Scrapers simply scrape raw data, no matter if the data structure changes over time.
  2. A 'queue_manager' package checks each scraped raw_prod against the 'raw_data' MongoDB collection. If the raw_prod is new or changed, it's added to the 'raw_data_queue' collection.
  3. A 'normalizer' package tries to normalize each raw_prod in the queue. If it succeeds, the normalized object is added or updated in the 'normalized_data' DB and the raw_prod is moved from 'raw_data_queue' to 'raw_data'. If it fails, it just raises an error.
With such a setup, scrapers should only need to be fixed when the layout of the scraped pages changes.
If something changes in the raw_prod structure, instead, I would only have to fix the 'normalizer' package, and migrate the 'normalized_data' DB only when the new structure contains new relevant information.
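For instance, steps 2 and 3 could look roughly like this with pymongo (collection names, document fields, and the normalize function are placeholders for your own):
Code:
from pymongo import MongoClient

client = MongoClient()
db = client.scraping

def enqueue_if_changed(raw_prod):
    # Step 2: queue the raw product only if it is new or has changed.
    known = db.raw_data.find_one({'prod_id': raw_prod['prod_id']})
    if known is None or known['specs'] != raw_prod['specs']:
        db.raw_data_queue.replace_one(
            {'prod_id': raw_prod['prod_id']}, raw_prod, upsert=True)

def process_queue(normalize):
    # Step 3: normalize queued items; on success, move raw_prod to raw_data.
    for queued in list(db.raw_data_queue.find()):
        queue_id = queued.pop('_id')      # keep for deletion, drop from the doc
        normalized = normalize(queued)    # raises if the structure is unknown
        db.normalized_data.replace_one(
            {'prod_id': queued['prod_id']}, normalized, upsert=True)
        db.raw_data.replace_one(
            {'prod_id': queued['prod_id']}, queued, upsert=True)
        db.raw_data_queue.delete_one({'_id': queue_id})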

Do you think this could be a good approach to the problem?
Thanks!

Last edited by masavini; 03-11-2022 at 05:20 AM.
 
Old 03-11-2022, 06:00 AM   #8
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,096

Rep: Reputation: 7365
There is no general solution to this problem. The approach is OK (from my side). Nowadays websites usually have a REST API, and you can communicate with them much better using that API (but it obviously depends on those websites).
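For what it's worth, a minimal example with the requests library (the endpoint is made up; real supplier APIs vary):
Code:
import requests

resp = requests.get('https://api.example-supplier.com/products/123', timeout=10)
resp.raise_for_status()
prod = resp.json()  # structured data, no HTML parsing needed
print(prod['name'], prod['price'])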
 
Old 03-11-2022, 09:33 AM   #9
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
Quote:
Originally Posted by pan64 View Post
There is no general solution to this problem. The approach is OK (from my side). Nowadays websites usually have a REST API, and you can communicate with them much better using that API (but it obviously depends on those websites).
Scraping APIs is always my first choice, but even when they're available I have no control over when they change.
I hope my setup will allow me to manage those changes as well.
 
  

