Part 5: Running our Scraper

Putting all the pieces together to scrape our data.

scrapy.cfg file

Scrapy needs a configuration file that tells it where our project lives. The config file must be in the project root directory. Here's the directory structure for our application:

.
├── my_scraper
│   ├── scraper_app
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── models.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       └── livingsocial_spider.py
│   └── scrapy.cfg  # we must create this file now
├── README.md
└── requirements.txt

Create a file called scrapy.cfg within the /my_scraper/ directory. This file should contain the name of the Python module that defines the project settings:

[settings]
default = scraper_app.settings

Manually run the scraper

Within your terminal, with your ScrapeProj virtualenv activated:

(ScrapeProj) $ cd my_scraper/scraper_app
(ScrapeProj) $ scrapy crawl livingsocial

Here, livingsocial is our spider's name attribute, which in this project also matches the BOT_NAME defined in settings.py; scrapy crawl looks spiders up by that name.
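For reference, the relevant settings look roughly like this. This is a minimal sketch: only BOT_NAME is confirmed above, while SPIDER_MODULES and NEWSPIDER_MODULE are standard Scrapy settings inferred from the directory layout shown earlier:

# scraper_app/settings.py (sketch)
BOT_NAME = 'livingsocial'

# Standard Scrapy settings that tell the framework where to find spiders;
# the module paths follow the directory structure shown earlier.
SPIDER_MODULES = ['scraper_app.spiders']
NEWSPIDER_MODULE = 'scraper_app.spiders'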

Next, to see the data that's been saved to the database, start up Postgres and enter these commands:

$ psql -h localhost
psql (9.1.4, server 9.1.3)
Type "help" for help.

lynnroot=# \connect scrape
psql (9.1.4, server 9.1.3)
You are now connected to database "scrape" as user "lynnroot".
scrape=# select * from deals limit 5;
 id |                       title                       |                                               link                                               |     location      | original_price | price |      end_date
----+---------------------------------------------------+--------------------------------------------------------------------------------------------------+-------------------+----------------+-------+---------------------
  1 | Three-Course Prix-Fixe Contemporary American Meal | https://www.livingsocial.com/deals/1365378-three-course-prix-fixe-contemporary-american-meal     | Calistoga, CA     |  132           | 66    | 2015-12-24 08:00:00
  2 | Thrilling Live-Action Escape Game for Six         | https://www.livingsocial.com/deals/1370546-thrilling-live-action-escape-game-for-six             | San Francisco, CA |  169           | 99    | 2015-02-08 08:00:00
  3 | $30 to Spend on Food and Drink                    | https://www.livingsocial.com/cities/15-san-francisco/deals/1234440-30-to-spend-on-food-and-drink | San Francisco, CA |  30            | 15    | 2015-07-29 11:59:00
  4 | $50 Toward Asian-Inspired Seafood, Steaks & More  | https://www.livingsocial.com/deals/1333444-50-toward-asian-inspired-seafood-steaks-more          | San Francisco, CA |  50            | 25    | 2015-02-19 12:59:00
  5 | Spa Packages with Jacuzzi and Pool Access         | https://www.livingsocial.com/deals/1278806-spa-packages-with-jacuzzi-and-pool-access             | San Francisco, CA |  120           | 80    | 2015-09-17 11:59:59
(5 rows)

Try a few of these select queries:

scrape=# select title from deals limit 30;
scrape=# select link from deals where price < 50;
scrape=# select title from deals where end_date < '2015-02-08';
scrape=# select * from deals where title like '%Yoga%';

Notice how the pattern we're searching for contains % – this is a wildcard character that matches any sequence of characters. The query essentially says “find deals whose title contains ‘Yoga’”. Learn more about querying Postgres.
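The placement of the wildcard controls what matches; a few variations using the same LIKE operator:

scrape=# select title from deals where title like '%Yoga%'; -- 'Yoga' anywhere in the title
scrape=# select title from deals where title like 'Yoga%';  -- titles that start with 'Yoga'
scrape=# select title from deals where title like '%Yoga';  -- titles that end with 'Yoga'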

Hook up to a cron job

It’d be pretty annoying if we had to manually run this script regularly. This is where cron jobs come in.

We’ll first create a bash script that simulates what we would do if we ran scrapy manually. There is a sample one in the new-coder/scrape/living_social/ directory called scrape.sh. Edit the bash script so that it points to your (ScrapeProj) virtualenv and to your scraper root directory (where the scrapy.cfg file lies), as in the sketch below.
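If you'd like to write the script yourself, here's a minimal sketch of what it could look like; the virtualenv and project paths below are placeholders, so substitute your own:

#!/bin/sh
# Activate the ScrapeProj virtualenv, then run the spider from the
# scraper root directory (the one that contains scrapy.cfg).
# Both paths are examples; adjust them to match your machine.
. "$HOME/Envs/ScrapeProj/bin/activate"
cd "$HOME/Projects/new-coder/scrape/living_social/my_scraper"
scrapy crawl livingsocial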

Next, within your terminal, type:

$ crontab -e

to edit your crontab file. This opens your default editor with your crontab. Add the following line:

0 13 * * * sh ~/Projects/new-coder/scrape/living_social/scrape.sh

This says that every day at hour 13 (1pm, relative to your local machine time), run the scrape.sh script. To schedule your cron job at a different time, check out Wikipedia's overview of the cron syntax.
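For reference, the five fields before the command are minute, hour, day of month, month, and day of week, in that order. For example, to run the script at 6:30am every Monday instead, the crontab line would be:

30 6 * * 1 sh ~/Projects/new-coder/scrape/living_social/scrape.sh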

NOTE

The cron job will run automatically on whatever schedule you set (in this example, daily at 1pm).

But! It will only run while your computer is on (not asleep, hibernating, or powered off) and, for this script in particular, connected to the internet.

To run the cron job regardless of your computer's state, you would host the scraper code, the bash script, and the cron job on a separate server (either your own or a company's) that is always 'on'. You can host your own cron jobs on OpenShift.