Pandas with SQLAlchemy (and Postgres)

June 02, 2017

Context

More mental contortions today to understand how SQLAlchemy, pandas, and database sessions fit together. I finally got the data into postgres, but without cleaning it. But still, phew, it works.

Module Organization

In the sps package, I have:

__init__.py
database.py
models.py
stoneland.py
utils.py

__init__.py I moved all the functions into utils.py, so here it's just initializing logging and importing all the relevant modules. I will need to clean this up to make a nice API for the package. It's very messy right now.
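Concretely, it's not much more than this right now (a sketch; the exact logging setup is a stand-in):

# sps/__init__.py (sketch; logging config is a stand-in)
import logging

logging.basicConfig(level=logging.INFO)

from sps import database, models, utils
from sps.stoneland import Stoneland  # exposed so run.py can do sps.Stoneland()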

database.py This has two functions - connect and session_scope.

connect: Returns an SQLAlchemy engine object, with configuration loaded from app_config.yml.
session_scope: Used to create a session with the database and complete a transaction cleanly.
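Roughly, the two look like this. The app_config.yml key name is an assumption on my part, and session_scope follows the usual contextmanager recipe from the SQLAlchemy docs:

# sps/database.py (sketch; assumes app_config.yml holds a single "db_url" entry)
from contextlib import contextmanager

import yaml
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker


def connect():
    """Build an engine from the connection URL in app_config.yml."""
    with open("app_config.yml") as f:
        config = yaml.safe_load(f)
    return create_engine(config["db_url"])


@contextmanager
def session_scope(engine):
    """Open a session, commit on success, roll back on error, always close."""
    session = sessionmaker(bind=engine)()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()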

models.py This has three models right now - Item, Color, Location. These translate to tables in postgres.

Item: Contains one row per product. This is not being used right now, since I set my table as "inventory".
Color: Contains color names. Will be used in the Item table to 'tag' each product. Not being used right now.
Location: Contains location names, including fabricators where we keep slabs for display.
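In code, the declared models are along these lines (the columns shown are illustrative, not the final schema):

# sps/models.py (sketch; columns are illustrative)
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base

base = declarative_base()


class Item(base):
    __tablename__ = "item"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    color_id = Column(Integer, ForeignKey("color.id"))  # the 'tag' mentioned above
    location_id = Column(Integer, ForeignKey("location.id"))


class Color(base):
    __tablename__ = "color"
    id = Column(Integer, primary_key=True)
    name = Column(String)


class Location(base):
    __tablename__ = "location"
    id = Column(Integer, primary_key=True)
    name = Column(String)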

Inventory: This table was created by pandas when I used its 'to_sql' method. It is not defined in models.py, and the data in it is not clean.

stoneland.py Added a method report_to_db, which is basically a wrapper around the 'to_sql' pandas method. It takes a database_name (although now that I'm thinking about it, it's really used as a table name), an SQLAlchemy engine object, and the pandas dataframe that needs to be inserted into the postgres db. I also have 'if_exists' set to replace, so each time I run it, it drops the old table completely.
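So the wrapper boils down to something like this (I've renamed the first parameter to table_name here, since that's what it actually is):

# in sps/stoneland.py (sketch of the wrapper)
def report_to_db(self, table_name, engine, dataframe):
    """Dump the dataframe into postgres; 'replace' drops any existing table first."""
    dataframe.to_sql(table_name, engine, if_exists="replace")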

utils.py Moved random utility methods that had cropped up into this module.

Current Situation

I created a run.py in the root of the directory and filled it with the content below. It seems like an ugly mix of objects and methods, not at all intuitive. Will need to clean this up.

I’ve also mapped out how the pandas dataframe will map to the postgres database, but haven’t implemented it yet. I suspect it would be easiest to make changes to dataframe and then insert it. Don’t want to manipulate the data once it’s in db.

import sps

# Create a Stoneland object
stl = sps.Stoneland()

# Get the inventory as a pandas dataframe
inventory = stl.inventory_report()

# Connect and create the declared tables in postgres
base = sps.models.base
dbengine = sps.database.connect()
base.metadata.create_all(dbengine)

# to_sql works off the engine, so the session isn't actually used yet
with sps.database.session_scope(dbengine) as sesh:
    stl.report_to_db("inventory", dbengine, inventory)

Next Steps
  • Need to investigate how to enter pandas data into an existing postgres schema (see the sketch after this list).
  • Need to clean up the pandas data to keep only the relevant columns with correct headers.
  • Need to clean up the API for cleaner object and method access. This could involve wrapping existing methods into a class, or perhaps wrapping them up in neater functions.
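For the first two items, something like this is what I have in mind (the column names are made up, and 'item' refers to the model sketched above):

# Sketch: clean the dataframe first, then append into the existing schema
def clean_inventory(df):
    """Keep only the relevant columns and give them proper headers."""
    df = df.rename(columns={"Product Name": "name", "Slab Color": "color"})  # made-up headers
    return df[["name", "color"]]

def append_inventory(df, engine):
    """Insert into the existing item table instead of dropping it."""
    df.to_sql("item", engine, if_exists="append", index=False)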
