This part of the tutorial covers setting up the scraper.
Create the following file/directory hierarchy within your
scrape_workspace folder, mimicking that of the
scrape directory of the repo:
```
└── my_scraper/
    └── scraper_app/
        └── spiders/
```
You can do so with the following commands:
(ScrapeProj) $ mkdir -p my_scraper/scraper_app/spiders
Next, create an items.py file in the my_scraper/scraper_app/ directory. In our items.py file, Scrapy needs us to define containers for the data that we plan to scrape. If you have worked through the Django tutorial at one point, you'll see that items.py is similar to models.py in Django.
First, we import Item and Field from Scrapy's item module:

```python
from scrapy.item import Item, Field
```
Simple enough. Now we'll create a class, and name it after the kind of data that we'll scrape:

```python
class LivingSocialDeal(Item):
    """Livingsocial container (dictionary-like object) for scraped data"""
```
For our LivingSocialDeal class, we inherit from Item, which gives us the pre-defined structure and behavior that Scrapy has already built for us.
Think of it this way: we could define a class Human(object) that has some human attributes, like methods for running, bathing, eating, etc. Then we can inherit from the Human class to make a new class, class Superwoman(Human). Because we inherit from Human, we can still access the running, eating, and bathing methods. But perhaps Superwoman runs faster than the average human, so we can redefine the running method; this overrides Human's running method.
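A quick sketch of that override (the method bodies and speeds here are just illustrative):

```python
class Human(object):
    def running(self):
        return "running at 6 mph"


class Superwoman(Human):
    # Redefining running() here overrides Human's version entirely
    def running(self):
        return "running at 600 mph"
```

Calling `Superwoman().running()` now uses the faster version, while plain `Human` instances keep the original.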
Or maybe we want to add to the eating() method, say an intake of 1000 more calories (being a Superwoman requires a lot of energy!). We can define an eating() method within our Superwoman class, then call super() on the method.
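A minimal sketch of what that super() call might look like (the Human class and calorie numbers are just illustrative):

```python
class Human(object):
    def __init__(self):
        self.calories = 0

    def eating(self):
        # Baseline human intake (illustrative number)
        self.calories += 2000


class Superwoman(Human):
    def eating(self):
        # Run Human's eating() first, then add the extra intake
        super(Superwoman, self).eating()
        self.calories += 1000
```

With this, `Superwoman().eating()` gets both the base 2000 calories and the extra 1000.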
Perhaps Superwomen should also fly. We can define a separate Flying(object) class. Now, when we define our Superwoman class, we can inherit from both Human and Flying, which is called multiple inheritance: class Superwoman(Human, Flying).
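As a sketch, multiple inheritance might look like this (the method bodies are just placeholders):

```python
class Human(object):
    def running(self):
        return "jogging along"


class Flying(object):
    def fly(self):
        return "up, up, and away!"


# Superwoman picks up methods from both parent classes
class Superwoman(Human, Flying):
    pass
```

A `Superwoman` instance can now call both `running()` (from Human) and `fly()` (from Flying).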
Let's add the items that we actually want to collect. We assign each one to Field(), because that is how we specify metadata to Scrapy:
```python
class LivingSocialDeal(Item):
    """Livingsocial container (dictionary-like object) for scraped data"""
    title = Field()
    link = Field()
    location = Field()
    original_price = Field()
    price = Field()
    end_date = Field()
```
Nothing too hard - that was it. Unlike Django, Scrapy has no other field types, so we're sort of stuck with Field().
Let's play around with this in the Python terminal. Make sure your ScrapeProj virtualenv is activated, and start the interpreter from the my_scraper/scraper_app/ directory so that items.py is importable.
```python
>>> from scrapy.item import Item, Field
>>> from items import LivingSocialDeal
>>> deal = LivingSocialDeal(title="$20 off yoga classes", price="50")
>>> print deal
LivingSocialDeal(title='$20 off yoga classes', price='50')
>>> deal['title']
'$20 off yoga classes'
>>> deal.get('title')
'$20 off yoga classes'
>>> deal['price']
'50'
>>> deal['location'] = "New York"
>>> deal['location']
'New York'
```
The Item class behaves very similarly to a Python dictionary, with the ability to get keys and values.
Now that it's all set up, let's write a spider!