Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Hi,
after years of bash scripting I just completed my first scraper in python.
This scraper is meant to be the first of several for an app that automatically feeds an e-commerce platform with new products from my suppliers (one scraper per supplier).
Now I'm a bit stuck moving forward, as I can't figure out the best way to integrate the scraper into a larger app.
Should the scraper itself be responsible for writing data to the DB? Or should it just return the scraped data and let the main application write to the DB?
Same doubts about the querying part: should the scraper contain a function for querying the DB? Or should the main app do this instead?
That all depends on you. From my side I would prefer a modularized solution, but you need to write classes for that: one for DB access (read/write), another one for processing data, ...
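For example, something like this (the class and method names are made up here, just to show the separation):

Code:
class ProductStore:
    """All DB access (read/write) goes through this class."""
    def __init__(self, connection):
        self.conn = connection

    def upsert(self, product: dict) -> None:
        ...  # INSERT or UPDATE, depending on your DB

class SupplierScraper:
    """Scrapes one supplier and returns plain dicts; knows nothing about the DB."""
    def scrape(self) -> list[dict]:
        ...

# The main app wires the two together:
store = ProductStore(connection=...)
for product in SupplierScraper().scrape():
    store.upsert(product)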
Thanks for your reply.
The scraper is modularized already. What is not clear to me is whether the DB access class should be a module of the main app, a module of the scraper package, or a separate package.
I would probably have each scraper output appropriate (validated/normalized/etc) data to a single combined queue (whether as files in a directory, or items in a data store, or whatever), and be unaware of the ultimate target database to reduce dependencies.
Then I would have a separate process with the sole purpose of reading unprocessed items from the queue and inserting into the main database.
When there are issues, you can then examine the queue to help isolate where those issues are occurring, as well as manually re-processing items as needed.
How that gets split into packages/modules/classes... well, examine how the main application is structured and pick something which is consistent with that.
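A minimal sketch of the file-based variant, assuming JSON files in a spool directory (the paths and the insert hook are placeholders, not from any real setup):

Code:
import json
import os
import uuid

QUEUE_DIR = "/var/spool/scrapers/queue"   # unprocessed items
DONE_DIR = "/var/spool/scrapers/done"     # kept for auditing/re-processing

os.makedirs(QUEUE_DIR, exist_ok=True)
os.makedirs(DONE_DIR, exist_ok=True)

def enqueue(item: dict) -> None:
    """Called by a scraper: write one validated item to the queue."""
    path = os.path.join(QUEUE_DIR, f"{uuid.uuid4()}.json")
    with open(path, "w") as f:
        json.dump(item, f)

def process_queue(insert_into_db) -> None:
    """Separate process: read unprocessed items and insert into the main DB."""
    for name in sorted(os.listdir(QUEUE_DIR)):
        path = os.path.join(QUEUE_DIR, name)
        with open(path) as f:
            item = json.load(f)
        insert_into_db(item)  # your real DB layer goes here
        os.rename(path, os.path.join(DONE_DIR, name))  # keep for later examination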
Before you go any further with this, I encourage you to look at sites such as GitHub and SourceForge.
Because ...
"Actum Ne Agas: Do Not Do A Thing Already Done."
Full Disclosure: I have not researched your particular requirement, nor do I intend to. But it "smells" to me like the sort of thing that many other people have confronted before, and have since elegantly solved and freely shared. Before you go further, check it out.
(And: "report back here." Was I right or wrong? If you did find a solution, what was it?) Someday, someone else will "stumble upon" this thread, and they will thank you.)
(And: "report back here." Was I right or wrong? If you did find a solution, what was it?)
I'm afraid you were wrong...
I'm a compulsive GitHub user, but still I was not able to find anything useful. There are plenty of scraper examples out there, but they're very simple apps, far from the requirements of my application. In particular, these are the hardest ones for me to satisfy:
Keep a DB which includes discontinued products removed from a supplier's website
Track changes in scraped product specs
Merge the specs of different products from different suppliers
I'm trying to solve this problem right now: how to deal with changes in the scraped data structure?
Sample code (a made-up example, but my scraper returns something along these lines):
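Code:
def scrape_product(prod_id):
    # ... fetch and parse the supplier's product page ...
    return {
        "prod_id": prod_id,
        "name": "Widget X",
        "specs": {
            "color": "red",
            "weight": "1.2 kg",
        },
    }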
Now, let's say that next month that supplier changes its website layout. I would have to fix the scraper, but I may also find that the data structure has changed and the same prod_id returns a different dict, say something like this:
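Code:
def scrape_product(prod_id):
    # ... fixed for the new layout ...
    return {
        "prod_id": prod_id,
        "name": "Widget X",
        "brand": "Acme",  # hypothetical new field
        "specs": [        # was a dict, now a list of key/value pairs
            {"key": "color", "value": "red"},
            {"key": "weight", "value": "1.2 kg"},
        ],
    }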
In this case I would have to perform some migrations on the products DB, then change the product ORM classes, and so on...
To deal with such cases I was thinking (thanks to some suggestions in the previous answers) about something like this:
Scrapers simply scrape raw data, no matter whether the data structure changes over time.
A 'queue_manager' package checks each scraped raw_prod against the 'raw_data' MongoDB. If the raw_prod is new or changed, it's added to the 'raw_data_queue' MongoDB.
A 'normalizer' package tries to normalize each raw_prod in the queue. If it succeeds, the normalized object is added or updated in the 'normalized_data' DB and the raw_prod is moved from the 'raw_data_queue' MongoDB to the 'raw_data' MongoDB. If it fails, it just raises an error.
With such a setup, scrapers should only need to be fixed when the layout of the scraped pages changes.
If something changes in the raw_prod structure, instead, I would only have to fix the 'normalizer' package, and migrate the 'normalized_data' DB only if the new data structure contains relevant new information.
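For the 'queue_manager' part I imagine something like this (a rough, untested sketch using pymongo, with a hash of the raw dict to detect changes; the DB and collection names are just the ones from my description above):

Code:
import hashlib
import json

from pymongo import MongoClient

db = MongoClient()["products"]

def raw_hash(raw_prod: dict) -> str:
    """Stable fingerprint of a raw dict, used to detect changes."""
    return hashlib.sha256(
        json.dumps(raw_prod, sort_keys=True).encode()
    ).hexdigest()

def enqueue_if_new_or_changed(raw_prod: dict) -> None:
    """queue_manager: compare against 'raw_data', enqueue on mismatch."""
    h = raw_hash(raw_prod)
    known = db.raw_data.find_one({"prod_id": raw_prod["prod_id"]})
    if known is None or known.get("hash") != h:
        db.raw_data_queue.replace_one(
            {"prod_id": raw_prod["prod_id"]},
            {"prod_id": raw_prod["prod_id"], "hash": h, "raw": raw_prod},
            upsert=True,
        )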
Do you think this could be a good approach to the problem?
Thanks!
There is no general solution to this problem. The approach looks OK from my side. Nowadays websites usually have a [REST] API, and you can communicate with them much better using that API (but it obviously depends on the website).
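For example (the endpoint and field names here are invented, but this is the general idea with the requests library):

Code:
import requests

resp = requests.get(
    "https://api.example-supplier.com/v1/products",  # hypothetical endpoint
    params={"page": 1},
    timeout=30,
)
resp.raise_for_status()
for product in resp.json()["products"]:
    print(product["id"], product["name"])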
Scraping APIs is always my first choice, but even when they're available I have no control over when they change.
I hope my setup will allow me to manage those changes as well.