This section covers the setup portion of building the scraper.
Create the following file/directory hierarchy within your scrape_workspace folder, mimicking that of the scrape directory of the repo:

```
└── my_scraper/
    └── scraper_app/
        └── spiders/
```
You can do so with the following commands:

```bash
(ScrapeProj) $ mkdir -p my_scraper/scraper_app/spiders
```
Create an items.py file in the my_scraper/scraper_app/ directory. In our items.py file, scrapy needs us to define containers for the data that we plan to scrape. If you have worked through the Django tutorial at one point, you'll see that items.py is similar to models.py in Django.
First, using scrapy's item module, we import Item and Field:

```python
from scrapy.item import Item, Field
```
Simple enough. Now we'll create a class, and name it after the kind of data that we'll scrape, LivingSocialDeal:

```python
class LivingSocialDeal(Item):
    """Livingsocial container (dictionary-like object) for scraped data"""
```
For our LivingSocialDeal class, we inherit from Item - which basically gives us the pre-defined behavior that scrapy has already built for us. To see how inheritance works, imagine a class, class Human(object), that has some human attributes - like methods for running, bathing, eating, etc. We can then inherit from the Human class to make a new class, class Superwoman(Human). Because we inherit from Human, we can still access the running, eating, and bathing methods. But - perhaps Superwoman runs faster than the average human, so we can redefine the running method. This overrides Human's running method.
Or maybe we want to add to Human's eating() method by adding an intake of 1000 more calories (being a Superwoman requires a lot of energy!). We can define an eating() method within our Superwoman class, then call super() on the method so that Human's eating() still runs before we add the extra calories.
Perhaps Superwoman should also fly. We can define a separate Flying(object) class. Now, when we define our Superwoman() class, we can inherit both from Human and Flying - called multiple inheritance: class Superwoman(Human, Flying).
Let’s add some items that we actually want to collect. We assign them to Field() because that is how we specify metadata to scrapy:

```python
class LivingSocialDeal(Item):
    """Livingsocial container (dictionary-like object) for scraped data"""
    title = Field()
    link = Field()
    location = Field()
    original_price = Field()
    price = Field()
    end_date = Field()
```
Nothing too hard - that was it. In scrapy, there are no other field types, unlike Django, so we're sort of stuck with Field().
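There is less to Field() than it may look: it is essentially a dictionary that holds optional per-field metadata, which is why a single field type suffices. A rough, self-contained sketch of the idea (simplified; the serializer key below is just an illustrative piece of metadata, not something this tutorial uses):

```python
# Simplified stand-in for scrapy's Field: a plain dict whose keyword
# arguments become metadata about the field.
class Field(dict):
    """Container of field metadata."""


title = Field()                # no metadata at all
price = Field(serializer=str)  # one illustrative metadata entry

print('serializer' in price)   # True
print('serializer' in title)   # False
```

Because every field is the same type, it is the metadata (or the absence of it), not the field class, that distinguishes one field from another.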
Let’s play around with this in the Python terminal. Make sure your ScrapeProj virtualenv is activated, and start the interpreter from the my_scraper/scraper_app/ directory so that items.py is importable.

```python
>>> from scrapy.item import Item, Field
>>> from items import LivingSocialDeal
>>> deal = LivingSocialDeal(title="$20 off yoga classes", price="50")
>>> print(deal)
LivingSocialDeal(title='$20 off yoga classes', price='50')
>>> deal['title']
'$20 off yoga classes'
>>> deal.get('title')
'$20 off yoga classes'
>>> deal['price']
'50'
>>> deal['location'] = "New York"
>>> deal['location']
'New York'
```
The scrapy Item class behaves very similarly to Python's dictionaries, with the ability to get keys and values. Now that it's all set up, let's write the spider!