Parse our sample SF crime data.
Open up parse.py
, found: new-coder/dataviz/tutorial_source/parse.py
The beginning of the module is an introduction as well as any copyright and/or license information.
When you start using third-party libraries, pay attention to the license and/or copyright information that is written out. Generally, if the library/code has no license, it means all rights reserved for the author (do not use the code). If it’s [GPL](http://www.gnu.org/licenses/gpl.html), your application/script must also be licensed under the GPL. Although, technically any license that is GPL-compatible is fine to use too: GPL Compatible Licenses or GPL compatibility.
If it’s MIT/2-clause BSD you can do whatever you want (no need to use the same license, or even have a license), if it’s 3-clause BSD you can do whatever you want but have to credit the original author.
Code that is up on GitHub does _not_ mean that it is free to use. If you want to use a library, ask the developer if s/he has plans to include a LICENSE
file or in the headers of the files if it’s not there already.
If you want to open source your code (yay, go you!), include your desired license either as a separate file or within the preamble/beginning of your code. Licensing your code is simply copying & pasting the required language of a license of your choice into your codebase.
CAUTION! Double check with your employer agreement. Sometimes, especially if you are in any tech-related role, there are statements in your employment contract that stipulates what and when code is actually the employers. It may be only code that is written on their equipment, and/or during work hours. Or it may be any code written during the time of employment. The stipulations can even change across states and countries within a single employer.
FYI: For reference, this tutorial is licensed under the Creative Commons license, specifically, Creative Commons Attribution 3.0 Unported license, with the code under zlib/libpng simply because it’s short.
In order to read a CSV/Excel file, we have to import the csv
module from Python’s standard library.
1 | import csv
|
MY_FILE
is defining a global - notice how it‘s all caps, a convention for variables we won’t be changing. Included in this repo is a sample file to which this variable is assigned.
1 | MY_FILE = "../data/sample_sfpd_incident_all.csv"
|
In defining the function, we know that we want to give it the CSV file, as well as the delimiter in which the CSV file uses to delimit each element/column.
1 | def parse(raw_file, delimiter):
|
We also know that we want to return a JSON-like object. A JSON file/object is just a collection of dictionaries, much like Python’s dictionary.
1 2 3 | def parse(raw_file, delimiter):
return parsed_data
|
Let’s be good coders and write a documentation-string (doc-string) for future folks that may read our code. Notice the triple-quotes:
1 2 3 4 | def parse(raw_file, delimiter):
"""Parses a raw CSV file to a JSON-line object."""
return parsed_data
|
"""docstrings"""
and # comments
have to do with who the reader will be. Within the a Python shell, if you call help
on a particular function or class, it will return the """docstring"""
that the developer has written.
There are also documentation programs that look specifically for """docstrings"""
to help the developer automatically produce documentation separated out of the code. Within docstrings, it’s helpful to say imperatively what the function/method or class is supposed to do. Examples of how the documented code should work can also be written in the docstrings (and, subsequently, tested). # comments
, on the otherhand, are for those reading through the code — the comments are to simply say what a specific piece/line of code is meant to do. Inline # comments
are always appreciated by those reading through your code. Many developers also litter # TODO
or # FIXME
statements for combing through later.
What we have now is a pretty good skeleton - we know what parameters the function will take (raw_file
and delimiter
), what it is supposed to do (our """doc-string"""
), and what it will return, parsed_data
. Notice how the parameters and the return value is descriptive in itself.
Let’s sketch out, with comments, how we want this function to take a raw file and give us the format that we want. First, let’s open the file, and the read the file, then build the parsed_data element.
1 2 3 4 5 6 7 8 9 10 11 12 | def parse(raw_file, delimiter):
"""Parses a raw CSV file to a JSON-line object"""
# Open CSV file
# Read CSV file
# Close CSV file
# Build a data structure to return parsed_data
return parsed_data
|
Thankfully, there are a lot of built-in methods that Python has that we can use to do all the steps that we’ve outlined with our comments. The first one we’ll use is open
and pass raw_file
to it, which we got from defining our own parameters in the parse
function:
1 | opened_file = open(raw_file)
|
So we’ve told Python to open the file, now we have to read the file. We have to use the CSV module that we imported earlier:
1 | csv_data = csv.reader(opened_file, delimiter=delimiter)
|
Here, csv.reader
is a function of the CSV module. We gave it two parameters: opened_file, and delimiter. It’s easy to get confused when parameters and variables share names. In delimiter=delimiter
, the first delimiter
is referring to the name of the parameter that csv.reader
needs; the second delimiter
refers to the argument that our parse
function takes in.
Just to quickly put these two lines in our parse
function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | def parse(raw_file, delimiter):
"""Parses a raw CSV file to a JSON-line object"""
# Open CSV file
opened_file = open(raw_file)
# Read the CSV data
csv_data = csv.reader(opened_file, delimiter=delimiter)
# Build a data structure to return parsed_data
# Close the CSV file
return parsed_data
|
csv_data object
, in Python terms, is now an iterator. In very simple terms, this means we can get each element in csv_data
one at a time.
Alright — the building of the data structure might seem tricky. The best way to start off is to set up an empty Python list to our parsed_data
variable so we can add every row of data that we will parse through.
1 | parsed_data = []
|
Good — we have a good data structure to add to. Now let’s first address our column headers that came with the CSV file. They will be the first row, and we’ll assign them to the variable fields
:
1 | fields = csv_data.next()
|
.next
method on csv_data
because it is a generator. We just call .next
once, since headers are in the 1st and only row of our CSV file.
Let’s loop over each row now that we have the headers properly taken care of. With each loop, we will add a dictionary that maps a field (those column headers) to the value in the CSV cell.
1 2 | for row in csv_data:
parsed_data.append(dict(zip(fields, row)))
|
Here, we iterated over each row in the csv_data
item. With each loop, we appended a dictionary (dict()
) to our list,parsed_data
. We use Python’s built-in zip()
function to zip together header → value to make our dictionary of every row.
Now let’s put the function together:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | def parse(raw_file, delimiter):
"""Parses a raw CSV file to a JSON-like object"""
# Open CSV file
opened_file = open(raw_file)
# Read the CSV data
csv_data = csv.reader(opened_file, delimiter=delimiter)
# Setup an empty list
parsed_data = []
# Skip over the first line of the file for the headers
fields = csv_data.next()
# Iterate over each row of the csv file, zip together field -> value
for row in csv_data:
parsed_data.append(dict(zip(fields, row)))
# Close the CSV file
opened_file.close()
return parsed_data
|
Let’s define a main()
function to act as the starting point for our script,
and use our new parse()
function:
1 2 3 4 5 6 | def main():
# Call our parse function and give it the needed parameters
new_data = parse(MY_FILE, ",")
# Let's see what the data looks like!
print new_data
|
We called our function parse()
and gave it the MY_FILE
global variable that we defined at the beginning, as well as the delimiter ","
.
We assign the function to the variable new_data
since the parse()
function will return a parsed_data
object. Last, we print new_data
to see our list of dictionaries!
One final bit — when running a Python file from the command line, Python will execute all of the code found on it. Since the following bit is True
,
1 2 | if __name__ == "__main__":
main()
|
it will call the main()
function. By doing the name == __main__
check, you can have that code only execute when you want to run the module as a program (via the command line) and not have it execute when someone just wants to import the parse()
function itself into another Python file. This is referred to as “boilerplate code” — code doesn’t really do anything and yet is necessary.
So you’ve written the parse function and your parse.py
file looks like mine. Now what? Let’s run it and parse some d*mn files!
Be sure to have your virtualenv activated that you created earlier in setup. Your terminal prompt should look something like this:
1 | (DataVizProj) $
|
Within the new-coder/dataviz/
directory, let’s make a directory for the python files you are writing with the bash command mkdir [Directory_Name]
:
1 2 3 4 5 6 | (DataVizProj) $ mkdir MySourceFiles
(DataVizProj) $ ls # list available files and directories where we are
README.me requirements.txt data full_source MySourceFiles tutorial_source
(DataVizProj) $ pwd # current location in our directory structure
Users/lynnroot/MyProjects/new-coder/dataviz/
(DataVizProj) $ cd MySourceFiles # change directories into our new directory
|
Go ahead and save your copy of parse.py
into MySourceFiles
(through “Save As” within your text editor). You should see the file in the directory if you return to your terminal and type ls
.
To run the python code, you have to tell the terminal to execute the parse.py file with python:
1 | (DataVizProj) $ python parse.py
|
If you got a traceback, or an error message, compare your parse.py
file with new-coder/dataviz/tutorial_source/parse.py
. Perhaps a typo, or you don’t have your virtualenv setup properly.
The output from the (DataVizProj) $ python parse.py
should look like a bunch of dictionaries in one list. For reference, the last bit of output you should see in your terminal should look like (doesn’t have to be exact data, but the structure of {“key”: “value”} should look familiar):
1 2 3 4 | 'ARRESTED, BOOKED'},{'Category': 'OTHER OFFENSES', 'IncidntNum': '030204238',
'DayOfWeek': 'Tuesday', 'Descript': 'OBSCENE PHONE CALLS(S)', 'PdDistrict':
'PARK', 'Y': '37.7773636900243', 'Location': '800 Block of CENTRAL AV', 'Time':
'18:59', 'Date': '02/18/2003', 'X': '-122.445006858202', 'Resolution': 'NONE'}]
|
You see this output because in the def main()
function, and you explicitly say print new_data
which feeds to the output of the terminal. You could, for instance, not print the new_data
variable, and just pass the new_data
variable to another function. Coincidently, that’s what Part II and Part III are about!
Play around with parse.py
within the Python interpreter itself. Make sure you’re in your MySourceFiles
directory, then start the Python interpreter from there:
1 2 3 4 5 | (DataVizProj) $ python
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
|
To exit out of the Python shell, press CTRL-D
.
Next, import your parse.py
file into the interpreter. Notice there is no need to include the .py
portion when importing:
1 2 | >>> import parse
>>>
|
If all things go well with import parse
you should just see the >>>
prompt. If there’s an error, perhaps you are not in the correct directory from two steps ago.
Play with the following commands. Notice to access any object defined in parse.py (object meaning a variable, function, etc), you must preface it with parse
:
1 2 3 4 5 6 7 8 9 | >>> parse.MY_FILE
'../data/sample_sfpd_incident_all.csv'
>>> type(parse.MY_FILE)
<type: 'str'>
>>> copy_my_file = parse.MY_FILE
>>> copy_my_file
'../data/sample_sfpd_incident_all.csv'
>>> type(copy_my_file)
<type: 'str'>
|
So we made what seems like a copy. Not so! check it out:
1 2 3 4 5 | >>> id(copy_my_file)
4404350288
>>> id(parse.MY_FILE)
4404350288
>>>
|
Those numbers from calling the id
function reflect where the variable is saved in the computer’s memory. Since they are the same number, Python has set up a reference from copy_my_file
to the same location that parse.MY_FILE
was saved. No need to allocate new space in memory for what is essentially the same variable with a different name.
Let’s play with the parser function a bit:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | >>> new_data = parse.parse(copy_my_file, ",")
>>> type(new_data)
<type: 'list'>
>>> type(new_data[0])
<type: 'dict'>
>>> type(new_data[0]["DayOfWeek"])
<type: 'str'>
>>> new_data[0].keys()
['Category', 'IncidntNum', 'DayOfWeek', 'Descript', 'PdDistrict', 'Y', 'Location', 'Time', 'Date', 'X', 'Resolution']
>>> new_data[0].values()
['FRAUD', '030203898', 'Tuesday', 'FORGERY, CREDIT CARD', 'NORTHERN', '37.8014488257836', '2800 Block of VAN NESS AV', '16:30', '02/18/2003', '-122.424612993055', 'NONE']
>>> for dict_item in new_data:
... print dict_item["Descript"]
...
DRIVERS LICENSE, SUSPENDED OR REVOKED
LOST PROPERTY
POSS OF LOADED FIREARM
<--snip-->
BATTERY
OBSCENE PHONE CALLS(S)
>>>
|
Here we checked ot the type of data that gets returned back to use from the parse function, as well as ways to simply check out what is the contents of the parsed data.
You can continue to play around; try >>> help(parse.parse)
to see our docstring, see what happens if you feed the parse function a different file, delimiter, or just a different variable. Challenge yourself to see if you can create a new file to save the parsed data, rather than just a variable. The example in the python docs may help.