Pipelining our scraped data into our Postgres database.
We’ve setup our spider to crawl and parse the HTML, and we’ve set up our database to take the parse data. Now we have to connect the two together through a pipeline.
Create a file called
my_scraper/scraper_app/ directory. In
pipelines.py, we will define a session (the actual connecting to the database), as well as the feeding/writing of data to the database.
We’ll need to import SQLAlchemy’s
sessionmaker function to bind to the database (create that connection), as well as import our models.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
from sqlalchemy.orm import sessionmaker from models import Deals, db_connect, create_deals_table class LivingSocialPipeline(object): """Livingsocial pipeline for storing scraped items in the database""" def __init__(self): """ Initializes database connection and sessionmaker. Creates deals table. """ engine = db_connect() create_deals_table(engine) self.Session = sessionmaker(bind=engine) def process_item(self, item, spider): """Save deals in the database. This method is called for every item pipeline component. """ session = self.Session() deal = Deals(**item) try: session.add(deal) session.commit() except: session.rollback() raise finally: session.close() return item
Here, we create a class,
LivingSocialPipeline(). We have a constructor function,
def __init__(self) to initialize the class by defining the engine, the deals table, and binding/connecting to the database with the defined engine.
We then define
process_item() that takes the parameters,
spider. We establish a session with the database, then unpack an item (the data of one scraped deal) within our
Deal() model. We then add the
deal to our database by calling
session.add() – on this step, it’s not saved into the database - it’s still on SQLAlchemy level. Then, by calling
session.commit(), it will put the into the database and the transaction will be committed.
However, if something went wrong during the
commit() portion of the database transaction, we will need to “undo” or
rollback() the data that was committed since we do not want partial data in our database. The
rollback() pair is meant to ensure that we have made all the changes or none if there was any sort of failure during the transaction.
In Python, we do this with a
except clause. A
except clause allows us to “catch” any errors if our desired operation fails.
We use the
finally keyword to close the session – this basically means that whether or not we were successful in committing data, close the session/connection with the database.
Nearly there – we need to add a variable to
settings.py that tells scrapy where to find our pipeline when processing data.
settings.py, add another variable,
ITEM_PIPELINES = ['scraper_app.pipelines.LivingSocialPipeline']
This is the directory/module path to the pipeline we just defined.
Now to put the project all together!