A walkthrough of grabbing raw data from publicly available sources.
Let’s first think about how we want this script to be organized. We’ll have a main function again, like we did in the previous tutorial, with helper functions and classes defined outside of it. The actual logic of grabbing CPI and game platform data, parsing, validating, plotting, and saving to a file will live in our main function.
Within the api.py file you created in Part 0, let’s first build some scaffolding:
def main():
    """This function handles the actual logic of this script."""

    # Grab CPI/Inflation data.

    # Grab API/game platform data.

    # Figure out the current price of each platform.
    # This will require looping through each game platform we received, and
    # calculating the adjusted price based on the CPI data we also received.
    # During this point, we should also validate our data so we do not skew
    # our results.

    # Generate a plot/bar graph for the adjusted price data.

    # Generate a CSV file to save the adjusted price data.
Doesn’t seem too bad; we’ve laid out what we want our script to do. Now let’s tackle each comment/process one at a time.
Before we start off with CPI data, let’s look at our first import statement:
from __future__ import print_function
You might be curious as to why we’re importing print_function, and why it’s from __future__. This is a gentle introduction to the differences between Python 2.x and Python 3.x. In Python 3, print() is a function, while in Python 2, print is a statement. For now, the difference is just that using print now requires parentheses around what you are printing.
First, we’ll grab the CPI data from FRED. This is where we’ll use the requests library:

import requests
And we’ll be grabbing data from a specific URL, so let’s create a global variable first:
CPI_DATA_URL = 'http://research.stlouisfed.org/fred2/data/CPIAUCSL.txt'
Next, we should create a CPI class to initialize the CPI data, load data from the URL, load data from a file, and get the adjusted price.
In Python, a class is just another object. It gives us a blueprint for creating other objects: instances. It also allows us to group like things together. For example,
class Human(object):

    def __init__(self, name, birthday):
        self.name = name
        self.birthday = birthday

    def get_sleep_time(self):
        return "8 hours"
So every new human that we make from Human has a name and a birthday, and a method that returns hours of sleep:
>>> bob = Human(name="bob", birthday="Jan 15th, 1967")
>>> bob.name
'bob'
>>> bob.birthday
'Jan 15th, 1967'
>>> bob.get_sleep_time()
'8 hours'
It wouldn’t make sense if we included a method that returned how many eggs we laid (that should probably go in a Fowl class).
Classes also give us the ability to inherit from other classes, like so:
class Superwoman(Human):

    def get_sleep_time(self):
        return None
Superwoman still has ‘access’ to the constructor that we defined in Human, __init__(), but we redefined the get_sleep_time() function:
>>> jill = Superwoman("Jill", "Oct 8th, 1972")
>>> jill.name
'Jill'
>>> jill.birthday
'Oct 8th, 1972'
>>> jill.get_sleep_time()
>>>
We explore inheritance a bit more in our next tutorial, Web Scraping.
The scaffolding for our class CPIData will include a constructor, the __init__ method, as well as methods to load data from a URL, load data from a file, and return adjusted prices for when we want to compare platform prices between different years:
class CPIData(object):
    """Abstraction of the CPI data provided by FRED.

    This stores internally only one value per year.
    """

    def __init__(self):
        self.year_cpi = {}
        self.last_year = None
        self.first_year = None

    def load_from_url(self, url, save_as_file=None):
        """Loads data from a given url.

        The downloaded file can also be saved into a location for later
        re-use with the "save_as_file" parameter specifying a filename.

        After fetching the file this implementation uses load_from_file
        internally.
        """

    def load_from_file(self, fp):
        """Loads CPI data from a given file-like object."""

    def get_adjusted_price(self, price, year, current_year=None):
        """Returns the adjusted price from a given year compared to what
        current year has been specified.
        """
We first initialize our CPIData class with year_cpi, last_year, and first_year, as these are all common attributes for a piece of CPI data.
def __init__(self):
    # Each year available to the dataset will end up as a simple key-value
    # pair within this dict. We don't really need any order here so going
    # with a plain old dictionary is the best approach.
    self.year_cpi = {}

    # Later on we will also remember the first and the last year we
    # have found in the dataset to handle years prior or after the
    # documented time span.
    self.last_year = None
    self.first_year = None
You may have noticed the double leading and trailing underscores in __init__(). Methods named this way are called magic methods, or dunders, but there is nothing magical about them.
When we write a class, MyClass, and later instantiate that class with x = MyClass(), what Python is doing under the hood is calling x.__init__() to initialize that instance. If you want the informal string representation of an object, you would call str(x), and Python does x.__str__().
There are many magic methods that a class simply gets for free: __init__, __str__, __repr__, __dir__, etc., and you can override them, which is what we did above with our def __init__(self) method. We want to set up additional attributes every time we instantiate a new CPIData object.
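To see this dispatch in action, here is a small sketch (the Person class and its attributes are made up for illustration) showing that str(x) simply calls x.__str__() under the hood:

```python
class Person(object):

    def __init__(self, name):
        # Called automatically when we write Person("bob").
        self.name = name

    def __str__(self):
        # Called automatically by str() and print().
        return "Person named %s" % self.name


bob = Person("bob")
print(str(bob))       # Person named bob
print(bob.__str__())  # the same call, made explicitly
```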
Dive into Python has a handy little tool to learn more about these methods; Rafe Kettler wrote up a nice series of blogs about what each one does.
Next, we define a method that will take in a URL, along with where/what to save our output file as. Comments are inline to walk you through:
def load_from_url(self, url, save_as_file=None):
    """Loads data from a given url.

    The downloaded file can also be saved into a location for later
    re-use with the "save_as_file" parameter specifying a filename.

    After fetching the file this implementation uses load_from_file
    internally.
    """
    # We don't really know how much data we are going to get here, so
    # it is recommended to just keep as little data as possible in memory
    # at all times. Since python-requests supports gzip-compression by
    # default and decoding these chunks on their own isn't that easy,
    # we just disable gzip with the empty "Accept-Encoding" header.
    fp = requests.get(url, stream=True,
                      headers={'Accept-Encoding': None}).raw

    # If we did not pass in a save_as_file parameter, we just return the
    # raw data we got from the previous line.
    if save_as_file is None:
        return self.load_from_file(fp)

    # Else, we write to the desired file.
    else:
        with open(save_as_file, 'wb+') as out:
            while True:
                buffer = fp.read(81920)
                if not buffer:
                    break
                out.write(buffer)
        with open(save_as_file) as fp:
            return self.load_from_file(fp)
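The while loop in the middle is the part doing the memory-friendly work: it copies the response in fixed-size chunks rather than reading everything at once. Here is a standalone sketch of the same pattern, using in-memory buffers as stand-ins for a live download:

```python
import io


def copy_in_chunks(src, dst, chunk_size=81920):
    # Read fixed-size chunks until read() returns an empty bytestring
    # (end of stream), so we never hold the full payload in memory.
    while True:
        buffer = src.read(chunk_size)
        if not buffer:
            break
        dst.write(buffer)


src = io.BytesIO(b"x" * 200000)  # stand-in for the raw response
dst = io.BytesIO()               # stand-in for the output file
copy_in_chunks(src, dst, chunk_size=1024)
print(len(dst.getvalue()))       # 200000
```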
After we’ve grabbed the data from the URL, we then pass it to our function, load_from_file(). Comments inline:
def load_from_file(self, fp):
    """Loads CPI data from a given file-like object."""
    # When iterating over the data file we will need a handful of temporary
    # variables:
    reached_dataset = False
    current_year = None
    year_cpi = []
    for line in fp:
        # The actual content of the file starts with a header line
        # starting with the string "DATE ". Until we reach this line
        # we can skip ahead.
        if not reached_dataset:
            if line.startswith("DATE "):
                reached_dataset = True
            continue

        # Each line ends with a new-line character which we strip here
        # to make the data easier to use.
        data = line.rstrip().split()

        # While we are dealing with calendar data the format is simple
        # enough that we don't really need a full date-parser. All we
        # want is the year which can be extracted by simple string
        # splitting:
        year = int(data[0].split("-")[0])
        cpi = float(data[1])

        if self.first_year is None:
            self.first_year = year
        self.last_year = year

        # The moment we reach a new year, we have to reset the CPI data
        # and calculate the average CPI of the current_year.
        if current_year != year:
            if current_year is not None:
                self.year_cpi[current_year] = sum(year_cpi) / len(year_cpi)
            year_cpi = []
            current_year = year
        year_cpi.append(cpi)

    # We have to do the calculation once again for the last year in the
    # dataset.
    if current_year is not None and current_year not in self.year_cpi:
        self.year_cpi[current_year] = sum(year_cpi) / len(year_cpi)
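To make the year-averaging concrete, here is a condensed, standalone sketch of the same parsing strategy. The sample text below is made up to mimic the shape of the FRED file, not real data:

```python
import io


def average_cpi_per_year(fp):
    # Condensed version of the load_from_file strategy: skip header
    # lines until the "DATE" marker, then average all monthly CPI
    # values that share the same year.
    values_by_year = {}
    reached_dataset = False
    for line in fp:
        if not reached_dataset:
            if line.startswith("DATE"):
                reached_dataset = True
            continue
        data = line.rstrip().split()
        if len(data) < 2:
            continue
        year = int(data[0].split("-")[0])
        values_by_year.setdefault(year, []).append(float(data[1]))
    return dict((year, sum(vals) / len(vals))
                for year, vals in values_by_year.items())


# Made-up sample in the same shape as the FRED text file.
sample_text = (
    "Title: Sample CPI data\n"
    "\n"
    "DATE          VALUE\n"
    "2000-01-01    100.0\n"
    "2000-07-01    102.0\n"
    "2001-01-01    104.0\n"
)

print(average_cpi_per_year(io.StringIO(sample_text)))
# {2000: 101.0, 2001: 104.0}
```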
For the last portion of our class CPIData, we need to define a method to return the CPI-adjusted price from a specific year when needed.
def get_adjusted_price(self, price, year, current_year=None):
    """Returns the price of a purchased item from a given year compared to
    what current year has been specified.

    This essentially is the calculated inflation for an item.
    """
    # Currently there is no CPI data for 2014.
    if current_year is None or current_year > 2013:
        current_year = 2013

    # If our data range doesn't provide a CPI for the given year, use
    # the edge data.
    if year < self.first_year:
        year = self.first_year
    elif year > self.last_year:
        year = self.last_year

    year_cpi = self.year_cpi[year]
    current_cpi = self.year_cpi[current_year]

    return float(price) / year_cpi * current_cpi
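The arithmetic at the end is just a ratio of CPIs. For example, with illustrative CPI values of roughly 156.9 for 1996 and 233.0 for 2013 (approximate annual averages used here for the sake of the sketch; the real script computes them from the FRED data), a $199 console from 1996 works out to about $295.52 in 2013 dollars:

```python
def adjusted_price(price, year_cpi, current_cpi):
    # Scale the original price by the ratio of the two CPI values.
    return float(price) / year_cpi * current_cpi


# Illustrative (approximate) annual-average CPI values.
print(round(adjusted_price(199, 156.9, 233.0), 2))  # 295.52
```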
In review, we’ve essentially defined the container, our CPIData class, to handle the processing of our CPI data. We initialize each field for a piece of CPI data in __init__; we define how to load data from a given URL (which we defined as a global variable, CPI_DATA_URL, before our class); we define how to load and parse the data that we just grabbed from the URL and saved; and lastly, we define a method to grab the price for a given year (adjusted if we didn’t grab that specific year from FRED earlier).