Part 2: Writing our Spider

Writing the spider portion of our scraper.

Defining our spider

Create a file called livingsocial_spider.py in the my_scraper/scraper_app/spiders/ directory. This is where the magic happens – this is where we tell Scrapy how to find the exact data we’re looking for. As you can imagine, a spider is specific to a particular web page; this one won’t work on Groupon or any other website.
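
Just to orient ourselves, here is roughly where the new file sits, assuming the project layout from the earlier parts of this tutorial (your tree may contain a few more files):

my_scraper/
    scraper_app/
        __init__.py
        items.py                    <-- defined in an earlier part
        ...                         <-- other files from the earlier parts
        spiders/
            __init__.py
            livingsocial_spider.py  <-- the file we're creating now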

In the livingsocial_spider.py file, we will define one class, LivingSocialSpider, with common attributes like name and start_urls. We’ll also define one function within our LivingSocialSpider class.

First we’ll set up our LivingSocialSpider class with attributes (variables that are defined within a class, also referred to as fields). We’ll inherit from Scrapy’s BaseSpider:

from scrapy.spider import BaseSpider

from scraper_app.items import LivingSocialDeal

class LivingSocialSpider(BaseSpider):
    """Spider for regularly updated livingsocial.com site, San Francisco Page"""
    name = "livingsocial"
    allowed_domains = ["livingsocial.com"]
    start_urls = ["http://www.livingsocial.com/cities/15-san-francisco"]

    deals_list_xpath = '//ul[@class="unstyled cities-items"]/li[@dealid]'
    item_fields = {
        'title': './/span[@itemscope]/meta[@itemprop="name"]/@content',
        'link': './/a/@href',
        'location': './/a/div[@class="deal-details"]/p[@class="location"]/text()',
        'original_price': './/a/div[@class="deal-prices"]/div[@class="deal-strikethrough-price"]/div[@class="strikethrough-wrapper"]/text()',
        'price': './/a/div[@class="deal-prices"]/div[@class="deal-price"]/text()',
        'end_date': './/span[@itemscope]/meta[@itemprop="availabilityEnds"]/@content'
    }

I’ve chosen not to build out the scaffolding with comments, but to throw this at you instead. Let’s walk through it.

The first few variables are self-explanatory: name defines the name of the spider, allowed_domains lists the base URLs of the domains the spider is allowed to crawl, and start_urls is a list of URLs for the spider to start crawling from. All subsequent URLs the spider follows will come from the data that the spider downloads from the start_urls.

Next, Scrapy uses XPath selectors to extract data from a website – they select certain parts of the HTML document specified by a given XPath expression. As said in the documentation, “XPath is a language for selecting nodes in XML documents, which can also be used with HTML.” You may read more about XPath selectors in the docs.

We basically tell Scrapy where to start looking for information based on a defined XPath. Let’s navigate to our LivingSocial site and right-click to “View Source”:

View Source of LivingSocial

I mean – look at that mess. We need to give the spider a little guidance.

You see that deals_list_xpath = '//ul[@class="unstyled cities-items"]/li[@dealid]' sort of looks like the HTML we see in the source. You can read about how to construct an XPath and how to work with relative XPaths in the docs. But essentially, '//ul[@class="unstyled cities-items"]/li[@dealid]' says: find every <ul> element whose class attribute is “unstyled cities-items”, then within that <ul> element, find the <li> elements that have an attribute called dealid.

Try it out: within your “View Source” page of the LivingSocial website, search the source itself (by pressing CMD+F or CTRL+F within the page) for "unstyled cities-items" – you will see:

screenshot

(with the searched text highlighted). Scroll a few lines down to see something like <li dealid="123456">. BAM! Those are exactly where our deals are located on the website.

NOTE: When scraping your own sites and trying to figure out XPaths, Chrome’s Dev Tools offers the ability to inspect HTML elements, allowing you to copy the XPath of any element you want. It also gives you the ability to test XPaths right in the JavaScript console by using $x, for example $x("//img"). While not explored when writing this tutorial, Firefox has an add-on, FirePath, that can edit, inspect, and generate XPaths as well.

Next – the item_fields. This should look familiar – it’s a dictionary of the items we defined in items.py earlier (and imported above), with each field’s XPath, relative to deals_list_xpath, as the associated value. The .// at the start of each path means it is relative to deals_list_xpath: the spider only grabs data from those paths within the elements that deals_list_xpath matches.
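
By the way, if you’d like to experiment with these XPaths against the live page before writing any more spider code, Scrapy ships with an interactive shell. In the older Scrapy versions this tutorial is written against, the shell hands you a ready-made selector named hxs built from the downloaded response (newer releases name this object differently), so you can try things like:

scrapy shell "http://www.livingsocial.com/cities/15-san-francisco"

>>> hxs.select('//ul[@class="unstyled cities-items"]/li[@dealid]')
>>> hxs.select('//ul[@class="unstyled cities-items"]/li[@dealid]//a/@href').extract()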

Okay – next is the actual parse() function. We have to add a few more import statements from Scrapy to make use of selectors, loaders, and processors. Our import statements, including the new ones, are now:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose

from scraper_app.items import LivingSocialDeal

We’re using the HtmlXPathSelector – it will handle the response we get back when we request a web page, and give us the ability to select certain parts of that response, like the elements defined by our deals_list_xpath field. To understand how Scrapy handles responses, read about what happens under the hood.

We’re also using XPathItemLoader to load the data we extract into the fields of our LivingSocialDeal item.

Lastly, we import Join and MapCompose for processing our data. MapCompose() handles the input processing of our data, and will be used to help clean up the data that we extract. Join() handles the output processing of our data, and will join together the elements that we process. A better explanation of these two processors can be found in the documentation.
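
To get a feel for what these two processors do on their own, here is a quick standalone sketch (the sample strings are made up):

from scrapy.contrib.loader.processor import Join, MapCompose

# input processing: apply unicode.strip to every extracted value,
# e.g. [u'  $25  ', u' Sushi Dinner '] becomes [u'$25', u'Sushi Dinner']
clean = MapCompose(unicode.strip)
print clean([u'  $25  ', u' Sushi Dinner '])

# output processing: join the cleaned values with a space (the default separator),
# e.g. [u'$25', u'Sushi Dinner'] becomes u'$25 Sushi Dinner'
join = Join()
print join([u'$25', u'Sushi Dinner'])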

Here’s what our parse() function looks like:

class LivingSocialSpider(BaseSpider):
    """Spider for regularly updated livingsocial.com site, San Francisco page"""

    # <--snip-->

    def parse(self, response):
        """
        Default callback used by Scrapy to process downloaded responses

        Testing contracts:
        @url http://www.livingsocial.com/cities/15-san-francisco
        @returns items 1
        @scrapes title link

        """
        selector = HtmlXPathSelector(response)

        # iterate over deals
        for deal in selector.select(self.deals_list_xpath):
            loader = XPathItemLoader(LivingSocialDeal(), selector=deal)

            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()

            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            yield loader.load_item()
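
The docstring above also contains Scrapy “contracts” – the @url, @returns items 1, and @scrapes title link lines. If your version of Scrapy supports spider contracts, you can ask Scrapy to download that URL and verify that the spider yields at least one item with its title and link fields filled in by running:

scrapy check livingsocial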

We’ll go through this line-by-line again. As an aside, the parse() function is actually referred to as a method, as it is a method of the LivingSocialSpider class.

The parse() method takes in one parameter: response. Hey, wait – what about this self thing – looks like two parameters!

Each instance method (in this case, parse() is an instance method) receives a reference to the instance it is called on as its first argument. It’s conventionally called self. The explicit self allows the coder to do some fun things – for example:

>>> class C:
...   def f(self, s):
...     print s
...
>>> c = C()
>>> C.f(c, "Hello!")
Hello!

The response parameter is what the spider gets back in return after making a request to the Living Social site. We are parsing that response with our XPaths.

First, we instantiate HtmlXPathSelector() by giving it the response parameter, and assign it to the variable selector. We’ll then be able to access HtmlXPathSelector()’s select() method to grab the exact data we want using the XPaths we defined before.
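
If you’d like to see select() in action outside of a running spider, here is a small standalone sketch. The HTML snippet is made up (it only loosely mimics LivingSocial’s deal markup), and it mirrors the loop we’re about to walk through:

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# a made-up snippet that loosely mimics LivingSocial's deal markup
body = """
<ul class="unstyled cities-items">
  <li dealid="111"><a href="/deals/111-sushi-dinner">
    <div class="deal-details"><p class="location">Mission District</p></div>
  </a></li>
  <li dealid="222"><a href="/deals/222-bike-tour">
    <div class="deal-details"><p class="location">Golden Gate Park</p></div>
  </a></li>
</ul>
"""

response = HtmlResponse(url="http://www.example.com", body=body)
selector = HtmlXPathSelector(response)

# the "list" XPath selects each deal's <li> element...
for deal in selector.select('//ul[@class="unstyled cities-items"]/li[@dealid]'):
    # ...and the relative (".//") XPaths are evaluated inside that <li>
    print deal.select('.//a/@href').extract()
    print deal.select('.//p[@class="location"]/text()').extract()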

Now, since there are multiple deals within one page,

for deal in selector.select(self.deals_list_xpath):

we’ll iterate over each deal we find from the deals_list_xpath, and then we load them so we can process the data:

    loader = XPathItemLoader(LivingSocialDeal(), selector=deal)

    # define processors
    loader.default_input_processor = MapCompose(unicode.strip)
    loader.default_output_processor = Join()

Here we grab the deal and pass it into XPathItemLoader through the selector parameter, along with an instance of our LivingSocialDeal item, and assign the result to the loader variable. We then set up the processing of the deal data: first strip the whitespace from the unicode strings we extract, then join the pieces together. Since we did not define any separator within Join(), the data items are joined by a single space, which is helpful when we have multi-line data.

We then iterate over each key and value of item_fields and add each data piece’s XPath to the loader.

Finally, for each deal, we call load_item(), which grabs each item field (‘title’, ‘link’, etc.), evaluates its XPath, and processes its data with the input and output processors. We then yield each item and move on to the next deal that we find:

    # iterate over fields and add xpaths to the loader
    for field, xpath in self.item_fields.iteritems():
        loader.add_xpath(field, xpath)
    yield loader.load_item()

For the Curious
In our for-loop, we are using a handy method on our item_fields dictionary – iteritems(). This method returns an iterator object and allows you to iterate over the (key, value) pairs of a dictionary. If we just wanted to loop through the keys of our dictionary, we would write: for field in self.item_fields.iterkeys(), and similarly for values with .itervalues().

This is different from using self.item_fields.items(). The items() method returns a list of (key, value) tuples, rather than the iterator object that iteritems() returns. A list is an iterable, and a for-loop calls iter() on a list (or string, dictionary, tuple, etc.).
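
For a concrete, standalone illustration (the dictionary below is just made up):

# a made-up dictionary, standing in for item_fields
fields = {
    'title': './/span/text()',
    'link': './/a/@href',
}

pairs_list = fields.items()       # a list of (key, value) tuples, built all at once
pairs_iter = fields.iteritems()   # an iterator; pairs are produced lazily, one per loop step

for field, xpath in fields.iteritems():   # loop over (key, value) pairs
    print field, xpath

for field in fields.iterkeys():           # loop over keys only
    print field

for xpath in fields.itervalues():         # loop over values only
    print xpath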

For the Curious
The yield keyword is similar to return. With the parse() function, specifically the for deal in selector.select(...) loop, we’ve essentially built a generator (it will generate data on the fly). StackOverflow has a good explanation of what’s happening in our function: the first time the function runs, it runs from the beginning until it hits yield, then returns the first value of the loop. Each subsequent call runs the loop one more time and returns the next value, until there is no value left to return. The generator is considered empty once the function runs without hitting yield anymore – either because the loop has come to an end, or because an if/else condition is no longer satisfied.
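
Here is a tiny, self-contained generator that has nothing to do with Scrapy but follows the same pattern as parse():

def deal_messages(deals):
    """Yield a message for each deal, one at a time."""
    for deal in deals:
        yield "processing %s" % deal

gen = deal_messages(["sushi dinner", "bike tour", "spa day"])

print gen.next()   # runs the body until the first yield: "processing sushi dinner"
print gen.next()   # resumes the loop: "processing bike tour"
print gen.next()   # "processing spa day"
# one more gen.next() would raise StopIteration -- the generator is exhausted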

We’ve now implemented our spider based on the items that we are seeking.

Let’s continue with how we set up our data model for eventual saving into the database!