Author: Spencer Lyon
Many research tasks require you to use and analyze data. Most of the time, before you can actually get to the analysis, you have to collect, organize, clean, munge, and otherwise prepare the data. As most data scientists will tell you, the preparation phase takes the majority of the user's time. As with most things, there are difficult ways and easier ways to work with data on a computer. Domain-specific platforms like Stata, SQL, SAS, and R are very capable data analysis and handling packages, but they are all lacking in one way or another. Python is a very flexible and powerful language that overcomes most, if not all, of the major pitfalls inherent in the domain-specific languages, but historically it has not been as good at data handling. Enter pandas. pandas is an extremely efficient, powerful, and flexible data package for Python. The following is a quote taken from the pandas website:
[pandas] has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.
The purpose of this lab is to introduce you to the basic data structures in pandas, teach you some helpful pandas tricks, and get you some experience handling and analyzing real data. The text of this lab is somewhat detailed and should be thought of as more of a reference to look back on when you need to look something up, rather than a text you need to fully understand on the first read.
The remainder of this lab will be structured a bit differently than the other labs you have seen so far. I will present a lot of information at the beginning and save most assignments until the end of the lab. It is structured this way because there are a number of foundational things you need to know about pandas before being able to do meaningful analysis with actual data. As such, I recommend that you spend the first hour or so reading through the material and the last 3 hours (probably a bit more) working through the exercises. Don’t worry about mastering all of the material presented on your initial reading. It is meant to introduce you to the topics and serve as a reference for you to look back on while completing the assignments.
Note
A lot of the content of this lab was drawn from two main sources: the pandas documentation and the excellent book by Wes McKinney (the main author of pandas), Python for Data Analysis.
Complete the introduction found on the pandas objects page.
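The example sessions throughout this lab assume that the standard imports below have already been run (the transcripts themselves do not show them, so take this as an assumption about the session setup):

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from numpy.random import randn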
For use in the examples below, I define the following objects:
In [1]: ser = Series(np.arange(10, 16))
In [2]: df = DataFrame(np.arange(30).reshape(6, 5), columns=list('abcde'))
In [3]: mul_df = DataFrame(np.arange(12).reshape((4, 3)),
...: index=[['a', 'a', 'b', 'b'],
...: [1, 2, 1, 2]],
...: columns=[['Ohio', 'Ohio', 'Colorado'],
...: ['Green', 'Red', 'Green']])
...:
In [4]: mul_df.index.names = ['let', 'num']
In [5]: mul_df.columns.names = ['state', 'color']
In [6]: ts_df = pd.util.testing.makeTimeDataFrame(nper=500)
In [7]: big_df = DataFrame(np.random.randint(500, size=(500, 10)),
...: columns=list('abcdefghij'))
...:
Now that we understand how to construct pandas objects, I will go over some basic functionality.
pandas has an intelligent printing system that won’t take over your screen by trying to print something that is too large.
In [8]: big_df
Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 499
Data columns (total 10 columns):
a 500 non-null values
b 500 non-null values
c 500 non-null values
d 500 non-null values
e 500 non-null values
f 500 non-null values
g 500 non-null values
h 500 non-null values
i 500 non-null values
j 500 non-null values
dtypes: int64(10)
There are many ways to force pandas to show you at least parts of your data.
# Slicing
In [9]: big_df.ix[:5]
Out[9]:
a b c d e f g h i j
0 413 478 357 74 354 71 258 397 187 460
1 172 479 289 44 60 82 122 357 15 65
2 423 144 374 161 233 379 143 357 187 321
3 449 473 12 183 110 414 289 294 450 391
4 214 461 119 381 54 26 117 220 230 52
5 294 320 400 232 157 436 268 20 402 52
In [10]: big_df.ix[:5, ::5]
Out[10]:
a f
0 413 71
1 172 82
2 423 379
3 449 414
4 214 26
5 294 436
# Using the head() method -- shows first n rows
In [11]: big_df.head()
Out[11]:
a b c d e f g h i j
0 413 478 357 74 354 71 258 397 187 460
1 172 479 289 44 60 82 122 357 15 65
2 423 144 374 161 233 379 143 357 187 321
3 449 473 12 183 110 414 289 294 450 391
4 214 461 119 381 54 26 117 220 230 52
In [12]: big_df.head(5) # default is 5
Out[12]:
a b c d e f g h i j
0 413 478 357 74 354 71 258 397 187 460
1 172 479 289 44 60 82 122 357 15 65
2 423 144 374 161 233 379 143 357 187 321
3 449 473 12 183 110 414 289 294 450 391
4 214 461 119 381 54 26 117 220 230 52
In [13]: big_df.head(7)
Out[13]:
a b c d e f g h i j
0 413 478 357 74 354 71 258 397 187 460
1 172 479 289 44 60 82 122 357 15 65
2 423 144 374 161 233 379 143 357 187 321
3 449 473 12 183 110 414 289 294 450 391
4 214 461 119 381 54 26 117 220 230 52
5 294 320 400 232 157 436 268 20 402 52
6 274 312 280 213 30 336 182 22 31 445
# Using the tail method -- shows last n rows
In [14]: big_df.tail()
Out[14]:
a b c d e f g h i j
495 219 312 243 351 459 303 297 282 73 109
496 179 425 31 233 281 24 330 278 225 56
497 179 231 381 342 476 286 63 454 231 288
498 212 257 17 174 416 228 225 83 290 434
499 369 245 401 109 253 87 71 380 29 171
In [15]: big_df.tail(5) # default is 5
Out[15]:
a b c d e f g h i j
495 219 312 243 351 459 303 297 282 73 109
496 179 425 31 233 281 24 330 278 225 56
497 179 231 381 342 476 286 63 454 231 288
498 212 257 17 174 416 228 225 83 290 434
499 369 245 401 109 253 87 71 380 29 171
In [16]: big_df.tail(7)
Out[16]:
a b c d e f g h i j
493 50 241 293 407 50 265 119 199 404 247
494 69 340 56 381 38 471 288 442 17 439
495 219 312 243 351 459 303 297 282 73 109
496 179 425 31 233 281 24 330 278 225 56
497 179 231 381 342 476 286 63 454 231 288
498 212 257 17 174 416 228 225 83 290 434
499 369 245 401 109 253 87 71 380 29 171
A very important method for pandas objects is reindex, which creates a new object with the data conformed to a new index.
In [17]: ser
Out[17]:
0 10
1 11
2 12
3 13
4 14
5 15
dtype: int64
In [18]: ser.reindex([1, 2, 4, 6])
Out[18]:
1 11
2 12
4 14
6 NaN
dtype: float64
Notice that the call to reindex used the data contained in ser where possible and put placeholders for missing values where the data didn’t exist in ser. The same holds for DataFrame objects:
In [19]: df
Out[19]:
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
5 25 26 27 28 29
In [20]: df.reindex([1, 3, 7])
Out[20]:
a b c d e
1 5 6 7 8 9
3 15 16 17 18 19
7 NaN NaN NaN NaN NaN
If you didn’t want missing data to appear in the reindexed object, you could apply interpolation logic by passing a fill rule to the method argument. The two main values that can be passed are ffill (fill values forward) and bfill (fill values backward).
# Read docstrings for more information
In [21]: df.reindex([1, 3, 5, 7], method='ffill')
Out[21]:
a b c d e
1 5 6 7 8 9
3 15 16 17 18 19
5 25 26 27 28 29
7 25 26 27 28 29
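A parallel sketch (not part of the original transcript) using bfill would fill each missing row from the next existing row instead; index 7 would still be NaN because there is no later row to pull from:

df.reindex([1, 3, 5, 7], method='bfill')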
With a DataFrame you can reindex the rows like we have been doing, but you can also reindex the rows and take only a few columns or just reindex the columns themselves.
In [22]: df.reindex([1, 3], columns=['a', 'b', 'e'])
Out[22]:
a b e
1 5 6 9
3 15 16 19
In [23]: df.reindex(columns=['a', 'b', 'e'])
Out[23]:
a b e
0 0 1 4
1 5 6 9
2 10 11 14
3 15 16 19
4 20 21 24
5 25 26 29
Note
As will be seen later, pandas has very flexible data alignment and handling routines. A lot of the work done behind the scenes to align data sets uses the reindex method.
Also recall the use of the ix indexing field that was introduced with the data structures on the other page. It is a very robust and powerful way to accomplish many re-indexing tasks.
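For instance, a selection like the one below (a quick sketch, assuming the requested labels exist in df) returns the same data as the df.reindex([1, 3], columns=['a', 'b', 'e']) call shown above:

df.ix[[1, 3], ['a', 'b', 'e']]  # rows labeled 1 and 3, columns 'a', 'b', and 'e'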
We have already seen that when we do basic arithmetic with pandas objects, the intrinsic link between data and labels is not broken. This results in a new pandas object whose row/column index is the union of the indexes of the original objects, with NaN propagated where the indexes didn’t align. This is very useful for doing computation with data sets that need to be aligned, but sometimes it isn’t what we want. All pandas objects have add, sub, mul, and div methods that give you more flexibility over this behavior.
In [24]: df1 = DataFrame(np.arange(9).reshape((3, 3)), columns=list('bcd'),
....: index=['AZ', 'TX', 'NY'])
....:
In [25]: df2 = DataFrame(np.arange(12).reshape((4, 3)), columns=list('bcd'),
....: index=['UT', 'AZ', 'TX', 'NY'])
....:
# Produces NaN values
In [26]: df1 + df2
Out[26]:
b c d
AZ 3 5 7
NY 15 17 19
TX 9 11 13
UT NaN NaN NaN
# Same as above
In [27]: df1.add(df2)
Out[27]:
b c d
AZ 3 5 7
NY 15 17 19
TX 9 11 13
UT NaN NaN NaN
# No NaN's
In [28]: df1.add(df2, fill_value=0)
Out[28]:
b c d
AZ 3 5 7
NY 15 17 19
TX 9 11 13
UT 0 1 2
Arithmetic between DataFrame and Series is well defined and is analogous to operations between 1- and 2-dimensional numpy arrays, with the added benefit of data alignment.
In [29]: series = df2.ix[0]
In [30]: df2 - series
Out[30]:
b c d
UT 0 0 0
AZ 3 3 3
TX 6 6 6
NY 9 9 9
In [31]: ser2 = Series(range(4), index=list('bcde'))
# Note the NaNs for column 'e', which doesn't appear in df2
In [32]: df2 + ser2
Out[32]:
b c d e
UT 0 2 4 NaN
AZ 3 5 7 NaN
TX 6 8 10 NaN
NY 9 11 13 NaN
# We might also want to align on rows; naive '+' matches the Series index against the columns
In [33]: ser3 = df2['c']
In [34]: ser3 + df2
Out[34]:
AZ NY TX UT b c d
UT NaN NaN NaN NaN NaN NaN NaN
AZ NaN NaN NaN NaN NaN NaN NaN
TX NaN NaN NaN NaN NaN NaN NaN
NY NaN NaN NaN NaN NaN NaN NaN
# But to do that we need the arithmetic methods (.add, .sub, etc.) with axis=0
In [35]: df2.sub(ser3, axis=0)
Out[35]:
b c d
UT -1 0 1
AZ -1 0 1
TX -1 0 1
NY -1 0 1
As we saw on the other page, pandas objects are a subclass of numpy.ndarray and as such can be passed as arguments to all numpy ufuncs.
In [36]: np.sin(df)
Out[36]:
a b c d e
0 0.000000 0.841471 0.909297 0.141120 -0.756802
1 -0.958924 -0.279415 0.656987 0.989358 0.412118
2 -0.544021 -0.999990 -0.536573 0.420167 0.990607
3 0.650288 -0.287903 -0.961397 -0.750987 0.149877
4 0.912945 0.836656 -0.008851 -0.846220 -0.905578
5 -0.132352 0.762558 0.956376 0.270906 -0.663634
In [37]: np.abs(np.sin(df))
Out[37]:
a b c d e
0 0.000000 0.841471 0.909297 0.141120 0.756802
1 0.958924 0.279415 0.656987 0.989358 0.412118
2 0.544021 0.999990 0.536573 0.420167 0.990607
3 0.650288 0.287903 0.961397 0.750987 0.149877
4 0.912945 0.836656 0.008851 0.846220 0.905578
5 0.132352 0.762558 0.956376 0.270906 0.663634
Another very common operation is to apply a function to 1d arrays made up of the columns or rows of a pandas object. For example, if we wanted to find the total distance spanned by the values in a column (or row) of a data frame, we could do something like this:
In [38]: f = lambda x: x.max() - x.min()
In [39]: rand_df = DataFrame(randn(5, 5), columns=list('abcde'))
In [40]: rand_df
Out[40]:
a b c d e
0 -0.503157 -1.411566 0.707046 2.748645 0.145893
1 -1.826045 0.457516 0.860263 -0.917852 -0.645512
2 -0.747378 -0.531364 -0.172163 -0.860901 -1.235838
3 0.174375 -0.182662 -0.436242 0.083190 -1.520176
4 0.263203 -0.415688 0.239489 0.399649 1.131151
In [41]: rand_df.apply(f)
Out[41]:
a 2.089248
b 1.869083
c 1.296505
d 3.666497
e 2.651327
dtype: float64
In [42]: rand_df.apply(f, axis=1)
Out[42]:
0 4.160211
1 2.686308
2 1.063675
3 1.694551
4 1.546839
dtype: float64
The function sent to apply doesn’t need to return just scalars. In fact, it can return any Python object you want.
In [43]: f2 = lambda x: 'Max: %.3f, Min: %.3f' % (x.max(), x.min())
In [44]: f3 = lambda x: (x.max(), x.min())
In [45]: f4 = lambda x: Series([x.max(), x.min()], index=['max', 'min'])
In [46]: rand_df.apply(f2)
Out[46]:
a Max: 0.263, Min: -1.826
b Max: 0.458, Min: -1.412
c Max: 0.860, Min: -0.436
d Max: 2.749, Min: -0.918
e Max: 1.131, Min: -1.520
dtype: object
In [47]: rand_df.apply(f3)
Out[47]:
a (0.263202922997, -1.82604541073)
b (0.457516490127, -1.41156615518)
c (0.860262971401, -0.436241770117)
d (2.74864493448, -0.917852431896)
e (1.13115127829, -1.52017570556)
dtype: object
In [48]: rand_df.apply(f4, axis=1)
Out[48]:
max min
0 2.748645 -1.411566
1 0.860263 -1.826045
2 -0.172163 -1.235838
3 0.174375 -1.520176
4 1.131151 -0.415688
You can also apply functions element-wise to every item in a dataframe using the applymap method.
In [49]: formatter = lambda x: '%.2f' % x
In [50]: rand_df.applymap(formatter)
Out[50]:
a b c d e
0 -0.50 -1.41 0.71 2.75 0.15
1 -1.83 0.46 0.86 -0.92 -0.65
2 -0.75 -0.53 -0.17 -0.86 -1.24
3 0.17 -0.18 -0.44 0.08 -1.52
4 0.26 -0.42 0.24 0.40 1.13
As I am sure you are beginning to see, you can be very creative in your use of apply and applymap to accomplish some very sophisticated and powerful tasks.
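As one more illustration (a sketch, not part of the original transcript), apply can be combined with a small helper function to standardize every column of rand_df in a single call:

standardize = lambda col: (col - col.mean()) / col.std()
rand_df.apply(standardize)  # each column now has mean 0 and standard deviation 1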
Another common task is to sort the data based on its values or keys. This is very easy and efficient in pandas (I make the note about efficiency because all of the operations below are easy enough to compute on your own, but pandas has highly optimized code for doing them).
In [51]: frame = DataFrame([[4, 0], [7, 1], [-3, 0], [2, 1]],
....: columns=['three', 'one'], index=[20, 40, 10, 30])
....:
In [52]: frame
Out[52]:
three one
20 4 0
40 7 1
10 -3 0
30 2 1
# Sort rows numerically
In [53]: frame.sort_index()
Out[53]:
three one
10 -3 0
20 4 0
30 2 1
40 7 1
# Sort column labels alphabetically
In [54]: frame.sort_index(axis=1)
Out[54]:
one three
20 0 4
40 1 7
10 0 -3
30 1 2
# Sort by column 'one'
In [55]: frame.sort_index(by='one')
Out[55]:
three one
20 4 0
10 -3 0
40 7 1
30 2 1
# Sort by 'one', then 'three'
In [56]: frame.sort_index(by=['one', 'three'])
Out[56]:
three one
10 -3 0
20 4 0
30 2 1
40 7 1
# Sort by 'three', then 'one'
In [57]: frame.sort_index(by=['three', 'one'])
Out[57]:
three one
10 -3 0
30 2 1
20 4 0
40 7 1
There is so much more that pandas can do that we do not have time to look at individually right now. I will simply list useful DataFrame or Series methods and group them into logical categories.
Summarizing Data
Membership, unique values, and value counts
Missing Data
Manipulating Data
Plotting
For an overview of other pandas features, see the documentation.
Time series analysis is a huge topic that could fill books, so I will only briefly highlight some of the key features. For more information see the time series docs.
One common task in economic or financial data analysis is to generate a range of equally spaced dates. An example might be that you know you have quarterly data starting in Q1 2000 and ending in Q4 2012 and you would like to tell python (pandas) about those dates. To do this you will use the pandas.date_range function. I will illustrate with some examples below:
# Example from above. Note robust string parsing
In [58]: pd.date_range(start='1-1-2000', end='12/31/2012', freq='Q')
Out[58]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-03-31 00:00:00, ..., 2012-12-31 00:00:00]
Length: 52, Freq: Q-DEC, Timezone: None
# Maybe we wanted start-of-period instead of end-of-period data
In [59]: pd.date_range(start='1-1-2000', end='12/31/2012', freq='QS')
Out[59]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01 00:00:00, ..., 2012-10-01 00:00:00]
Length: 52, Freq: QS-JAN, Timezone: None
# We might have frequency, starting period, and number of periods
In [60]: pd.date_range(start='January 1 2000', freq='QS', periods=52)
Out[60]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01 00:00:00, ..., 2012-10-01 00:00:00]
Length: 52, Freq: QS-JAN, Timezone: None
In the above examples I only tried quarterly data, but there are many different frequencies I could have used. Below I list many of them.
D           calendar daily
B           business daily
W           weekly
M           calendar month end
MS          calendar month start
Q           quarter end
QS          quarter start
A           year end
BA          business year end
H           hourly
T (or min)  minutely
S           secondly
Note
You can combine any of the above frequencies with a logical anchor point. For example, you can combine weekly with a three letter abbreviation for which day you want repeated each week (e.g. W-TUE is every Tuesday). With annual and quarterly frequencies you can give a three letter abbreviation for the month (BQS-JUN is the start of business quarters, with the quarters anchored to June).
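A couple of quick sketches of anchored frequencies (illustrative calls, not part of the original transcript):

pd.date_range(start='1-1-2000', periods=4, freq='W-TUE')    # four consecutive Tuesdays
pd.date_range(start='1-1-2000', periods=4, freq='BQS-JUN')  # business quarter starts anchored to June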
I could use any of these frequencies, or any logical combination of them.
# Business annual data
In [61]: pd.date_range(end='1-1-1990', freq='BA', periods=20)
Out[61]:
<class 'pandas.tseries.index.DatetimeIndex'>
[1970-12-31 00:00:00, ..., 1989-12-29 00:00:00]
Length: 20, Freq: BA-DEC, Timezone: None
# Every 5 hours 10 minutes
In [62]: pd.date_range(start='April 25th 1989 3:40 PM', periods=100, freq='5h10min')
Out[62]:
<class 'pandas.tseries.index.DatetimeIndex'>
[1989-04-25 15:40:00, ..., 1989-05-16 23:10:00]
Length: 100, Freq: 310T, Timezone: None
Notice that the return value of the date_range function is an object of type DatetimeIndex. This means that the result could be used as the index for a Series or DataFrame. When you do this, you open up a whole set of time-series specific functionality.
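For example (a sketch with made-up data), attaching a DatetimeIndex to a Series enables label-based selection by date:

dates = pd.date_range(start='1-1-2000', end='12/31/2012', freq='Q')
quarterly = Series(np.random.randn(len(dates)), index=dates)
quarterly['2005']  # all four quarters of 2005, selected by partial string indexing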
One of the most common and most useful tasks is to resample data from one frequency to another.
# Notice the frequency is B, or business daily
In [63]: ts_df
Out[63]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 500 entries, 2000-01-03 00:00:00 to 2001-11-30 00:00:00
Freq: B
Data columns (total 4 columns):
A 500 non-null values
B 500 non-null values
C 500 non-null values
D 500 non-null values
dtypes: float64(4)
# Get weekly data
In [64]: ts_df.resample('W')
Out[64]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100 entries, 2000-01-09 00:00:00 to 2001-12-02 00:00:00
Freq: W-SUN
Data columns (total 4 columns):
A 100 non-null values
B 100 non-null values
C 100 non-null values
D 100 non-null values
dtypes: float64(4)
# Use how argument
In [65]: ts_df.resample('2M', how='first')
Out[65]:
A B C D
2000-01-31 0.496714 0.926178 1.399355 0.778361
2000-03-31 -0.225776 0.543360 -0.573700 0.582098
2000-05-31 1.356240 -0.408075 -1.119670 -0.281328
2000-07-31 0.257550 -1.778720 -0.623769 0.323079
2000-09-30 0.346448 1.126565 -2.153390 0.325796
2000-11-30 0.385317 -0.309546 0.440475 1.938929
2001-01-31 0.852433 1.014370 1.628397 1.710613
2001-03-31 -1.237815 0.610586 0.956702 0.760056
2001-05-31 -1.214189 1.453384 -1.975467 0.441941
2001-07-31 -0.773010 2.010205 -0.884803 -0.410814
2001-09-30 -1.534114 0.249384 0.455888 0.638051
2001-11-30 -0.792873 1.117296 -1.760763 1.374438
# Different how
In [66]: ts_df.resample('8M', how='max')
Out[66]:
A B C D
2000-01-31 1.579213 1.909417 2.075261 1.854093
2000-09-30 2.720169 2.573360 2.439752 3.193108
2001-05-31 3.852731 2.632382 2.579709 2.170943
2002-01-31 3.078881 2.526932 2.601683 3.137749
# ohlc: open-high-low-close. only works on series objects
In [67]: ts_df['A'].resample('8M', how='ohlc')
Out[67]:
open high low close
2000-01-31 0.496714 1.579213 -1.913280 1.465649
2000-09-30 -0.225776 2.720169 -2.619745 0.173181
2001-05-31 0.385317 3.852731 -3.241267 0.097676
2002-01-31 -0.773010 3.078881 -2.301921 -1.382800
# use lambda function
In [68]: ts_df.resample('8M', how=lambda x: np.sqrt(np.abs(x.min() - x.max())))
Out[68]:
A B C D
2000-01-31 1.868821 1.891777 1.662663 1.928893
2000-09-30 2.310825 2.295702 2.309980 2.472743
2001-05-31 2.663456 2.171247 2.273929 2.207753
2002-01-31 2.319656 2.225042 2.137297 2.465388
There is a lot more available in the resample method, but I refer the reader to the time series docs for more information.
Other topics that you will want to look at when you have time are given below:
One of the most flexible and powerful constructs in pandas is accessed via the groupby method for a pandas data structure. The process of groupby can be summarized in three stages: split the data into groups based on one or more keys, apply a function to each group, and combine the results into a new data structure.
As the steps indicate, when using groupby we will be deciding how to split up our data, what to do with the categorized chunks, and then how to combine it again (really pandas usually takes care of this last step for us). This topic can feel a bit esoteric, but once understood should be appealing to anyone who needs to aggregate and act on data (pretty much everyone who ever works with data!). I will not have time to cover all the possible use-cases of groupby in this material, but the groupby docs are a good place to look for more examples and inspiration.
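A minimal sketch of the three stages (not taken from the original transcript):

s = Series([1, 2, 3, 4], index=['x', 'x', 'y', 'y'])
s.groupby(level=0).sum()  # split on the index labels, apply sum to each group, combine into a new Series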
Note
The tips data below originally came from Bryant & Smith (1995) and has the total bill and tip amount for the tips given at a restaurant, along with the time of day, day of the week, size of the party, sex of the payer, and whether or not the payer was a smoker.
We will need some objects to play with:
In [69]: df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
....: 'key2' : ['one', 'two', 'one', 'two', 'one'],
....: 'data1' : np.random.randn(5),
....: 'data2' : np.random.randn(5)})
....:
In [70]: colors = ['red', 'blue', 'red', 'blue', 'red']
In [71]: ranks = [1, 2, 1, 1, 2]
# Some data I loaded in before
In [72]: tips
Out[72]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 243
Data columns (total 8 columns):
total_bill 244 non-null values
tip 244 non-null values
sex 244 non-null values
smoker 244 non-null values
day 244 non-null values
time 244 non-null values
size 244 non-null values
tip_pct 244 non-null values
dtypes: float64(3), int64(1), object(4)
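The tips DataFrame could have been built with something like the following (the file name here is hypothetical, and tip_pct is derived from the other two columns):

tips = pd.read_csv('tips.csv')                       # hypothetical file name
tips['tip_pct'] = tips['tip'] / tips['total_bill']   # tip as a fraction of the total bill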
We can use groupby on our DataFrame and pass it Python iterable(s) of the correct size, or we can use the DataFrame’s columns themselves as the keys to aggregate on.
# Group the series of just 'data1' column
In [73]: df['data1'].groupby(colors)
Out[73]: <pandas.core.groupby.SeriesGroupBy at 0x110bdf750>
# Apply it to the whole frame
In [74]: df.groupby(ranks)
Out[74]: <pandas.core.groupby.DataFrameGroupBy at 0x110bdf810>
# Group on more than one thing (also works for series)
In [75]: df.groupby([colors, ranks])
Out[75]: <pandas.core.groupby.DataFrameGroupBy at 0x110bdf210>
# Group by the columns of df
In [76]: by_keys = df.groupby(['key1', 'key2'])
In [77]: by_keys
Out[77]: <pandas.core.groupby.DataFrameGroupBy at 0x110bdf350>
As you have seen, the GroupBy object returned by groupby isn’t very useful on its own, but we can do some very cool things with it.
# Compute mean of each group
In [78]: by_keys.mean()
Out[78]:
data1 data2
key1 key2
a one 0.006013 0.693592
two 1.066719 0.630959
b one -0.321913 0.135072
two -0.274178 -1.227735
# Compute max of just the 'data2' column
In [79]: by_keys['data2'].max()
Out[79]:
key1 key2
a one 0.774125
two 0.630959
b one 0.135072
two -1.227735
Name: data2, dtype: float64
# The above was a Series. We can unstack it into a DataFrame
In [80]: by_keys['data2'].prod().unstack()
Out[80]:
key2 one two
key1
a 0.474584 0.630959
b 0.135072 -1.227735
There are a number of highly optimized groupby methods. I will list some of the most common below
count        number of non-NA values in the group
sum          sum of non-NA values
mean         mean of non-NA values
median       arithmetic median of non-NA values
std, var     unbiased standard deviation and variance
min, max     minimum and maximum of non-NA values
prod         product of non-NA values
first, last  first and last non-NA values
You can also use the agg method of a GroupBy object to apply more complicated functions, or to pass a list of functions to be applied at once.
In [81]: ppl_grp = tips.groupby(['sex', 'smoker'])
In [82]: ppl_pct = ppl_grp['tip_pct']
In [83]: ppl_pct.agg('mean')
Out[83]:
sex smoker
Female No 0.156921
Yes 0.182150
Male No 0.160669
Yes 0.152771
Name: tip_pct, dtype: float64
# Notice function names become column labels
In [84]: ppl_pct.agg(['mean', 'sum', 'std', lambda x: x.max() - x.min()])
Out[84]:
mean sum std <lambda>
sex smoker
Female No 0.156921 8.473732 0.036421 0.195876
Yes 0.182150 6.010962 0.071595 0.360233
Male No 0.160669 15.584865 0.041849 0.220186
Yes 0.152771 9.166271 0.090588 0.674707
# Aggregate multiple columns with a list of functions
In [85]: funcs = ['count', 'mean', 'max']
In [86]: ppl_grp['tip_pct', 'total_bill'].agg(funcs)
Out[86]:
tip_pct total_bill
count mean max count mean max
sex smoker
Female No 54 0.156921 0.252672 54 18.105185 35.83
Yes 33 0.182150 0.416667 33 17.977879 44.30
Male No 97 0.160669 0.291990 97 19.791237 48.33
Yes 60 0.152771 0.710345 60 22.284500 50.81
# Compute different aggregations for different columns
In [87]: mapping = {'tip_pct' : ['min', 'max', 'mean', 'std'], 'size' : ['sum', 'mean']}
In [88]: ppl_grp.agg(mapping)
Out[88]:
tip_pct size
min max mean std sum mean
sex smoker
Female No 0.056797 0.252672 0.156921 0.036421 140 2.592593
Yes 0.056433 0.416667 0.182150 0.071595 74 2.242424
Male No 0.071804 0.291990 0.160669 0.041849 263 2.711340
Yes 0.035638 0.710345 0.152771 0.090588 150 2.500000
As with other topics, there is so much more that you can do with groupby than what I can cover here. See the documentation for more information.
pandas is also an excellent tool for getting data in and out of Python. This section will illustrate the differences between the pandas I/O functions and those in other Python packages.
Depending on your version of pandas, there are between 4 and 7 utility functions that can be used to get data in and out of pandas.
Spend about 10 minutes reading through the data IO documentation and familiarize yourself with the read_table and read_csv functions. There is a lot of information there, so don’t worry about capturing all of it. The function arguments you should focus on are the following:
To make sure that you have understood the material on the pandas website, you should be able to follow the examples below:
# Show contents of file
In [89]: print open('./source/resources/Data/flat.csv').read()
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
In [90]: pd.read_csv('./source/resources/Data/flat.csv')
Out[90]:
1 2 3 4 hello
0 5 6 7 8 world
1 9 10 11 12 foo
# We really didn't want first row to be header
In [91]: pd.read_csv('./source/resources/Data/flat.csv', header=None)
Out[91]:
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
# Can give column names
In [92]: names = ['a', 'b', 'c', 'd', 'message']
In [93]: pd.read_csv('./source/resources/Data/flat.csv', header=None, names=names)
Out[93]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
# Can set named column as index
In [94]: pd.read_csv('./source/resources/Data/flat.csv', header=None,
....: names=names, index_col='message')
....:
Out[94]:
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12
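Getting data back out is just as easy. As a sketch (the output path here is hypothetical):

frame = pd.read_csv('./source/resources/Data/flat.csv', header=None, names=names)
frame.to_csv('./source/resources/Data/flat_out.csv', index=False)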
Imagine you wanted to do some basic macroeconomic analysis with common data sets you could obtain from FRED. As you have already seen, there are various functions within numpy and scipy for loading data from text or binary files into numpy arrays. Two of the functions you may or may not have met up to this point are:
While these functions are capable, they are somewhat low-level compared to the functions in pandas. The exercises below will demonstrate this idea.
Todo
Disclaimer: Spend only 10-15 minutes trying to complete this exercise. I don’t actually care if you finish it - I DO however care that you spend at least a little bit of time on it.
Collect the following data sets from the FRED website and read them in to python using tools from numpy/scipy (Note the items listed below are the actual data set names on FRED). Make sure that you collect the data starting January 1 1985 and going out as far as you can:
Use the date information given in the data files to manually align the data. Notice that there are 4 different frequencies for the data sets listed above. To get all of them on the same frequency, compute a period-average. In other words, if you have daily data and you want weekly data, compute the mean of every 7 items and call that the weekly value. (Note: This probably won’t be easy because you will end up needing quarterly data and trying to align data across months of different lengths is difficult.)
Finally, use your resultant data set and rudimentary econometric techniques to solve this equation:
You should get the following as answers:
Todo
Disclaimer: I DO care that you finish this one.
Repeat the above analysis using pandas. Read the data into python in two ways:
Hint: When using read_xxx you will want to make sure to parse the dates and set which column should be the index (the one with the dates is probably a good idea)
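For example, a FRED CSV could be read with something like the following (a sketch; the file name and column name are assumptions about what you download):

gdp = pd.read_csv('GDPC1.csv', parse_dates=['DATE'], index_col='DATE')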
The last main topic I will cover here deals with combining data sets from different sources. We have seen many examples of how the intrinsic link between data and labels has aided computation and aggregation. It is also crucial to data alignment and the merging and joining of data sets.
The most general combining function in pandas is the pandas.concat function. It does most of the heavy lifting for the more specialized data merging routines in pandas. The concat function will feel quite familiar to those who are used to using the numpy.concatenate function (or any of its siblings like numpy.row_stack, etc.). I will demonstrate some of this function's capabilities with a few examples (note that this function applies equally well to DataFrame and Series objects):
In [95]: s1 = Series([0, 1], index=['a', 'b'])
In [96]: s2 = Series([2, 3, 4], index=['c', 'd', 'e'])
In [97]: s3 = Series([5, 6], index=['f', 'g'])
In [98]: s4 = pd.concat([s1 + 10, s3])
# Concatenate along rows (default: axis=0)
In [99]: pd.concat([s1, s2, s3])
Out[99]:
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
# concat along columns
In [100]: pd.concat([s1, s2, s3], axis=1)
Out[100]:
0 1 2
a 0 NaN NaN
b 1 NaN NaN
c NaN 2 NaN
d NaN 3 NaN
e NaN 4 NaN
f NaN NaN 5
g NaN NaN 6
# Use intersection of index to avoid NaN's
In [101]: pd.concat([s1, s4], axis=1, join='inner')
Out[101]:
0 1
a 0 10
b 1 11
# specify index labels to be used
In [102]: pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
Out[102]:
0 1
a 0 10
c NaN NaN
b 1 11
e NaN NaN
# Give names to concatenated objects (they become the outer level of a MultiIndex instead of columns)
In [103]: pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])
Out[103]:
one a 0
b 1
two a 0
b 1
three f 5
g 6
dtype: int64
Now for some more complicated examples using DataFrame objects
In [104]: df1 = DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
.....: columns=['one', 'two'])
.....:
In [105]: df2 = DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
.....: columns=['three', 'four'])
.....:
# Notice that the keys become top-level column index
In [106]: pd.concat([df1, df2], axis=1, keys=['l1', 'l2'])
Out[106]:
l1 l2
one two three four
a 0 1 5 6
b 2 3 NaN NaN
c 4 5 7 8
# Could have passed a dict
In [107]: pd.concat({'l1': df1, 'l2': df2}, axis=1)
Out[107]:
l1 l2
one two three four
a 0 1 5 6
b 2 3 NaN NaN
c 4 5 7 8
# Create hierarchical index with keys and names args
In [108]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
.....: names=['upper', 'lower'])
.....:
Out[108]:
upper level1 level2
lower one two three four
a 0 1 5 6
b 2 3 NaN NaN
c 4 5 7 8
# We can tell it to ignore the index of the original objects
In [109]: pd.concat([df1, df2], ignore_index=1)
Out[109]:
four one three two
0 NaN 0 NaN 1
1 NaN 2 NaN 3
2 NaN 4 NaN 5
3 6 NaN 5 NaN
4 8 NaN 7 NaN
The merge and join functions will feel very familiar to those who have used relational databases like SQL. They combine datasets by linking rows using one or more keys. The best way to understand these operations is by example, so here we go!
Merge examples:
In [110]: df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
.....: 'data1': range(7)})
.....:
In [111]: df2 = DataFrame({'key': ['a', 'b', 'd'],
.....: 'data2': range(3)})
.....:
In [112]: df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
.....: 'data1': range(7)})
.....:
In [113]: df4 = DataFrame({'rkey': ['a', 'b', 'd'],
.....: 'data2': range(3)})
.....:
In [114]: left1 = DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
.....: 'value': range(6)})
.....:
In [115]: right1 = DataFrame({'group_val': [3.5, 7]},
.....: index=['a', 'b'])
.....:
In [116]: lefth = DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
.....: 'key2': [2000, 2001, 2002, 2001, 2002],
.....: 'data': np.arange(5.)})
.....:
In [117]: righth = DataFrame(np.arange(12).reshape((6, 2)),
.....: index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
.....: [2001, 2000, 2000, 2000, 2001, 2002]],
.....: columns=['event1', 'event2'])
.....:
In [118]: left2 = DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'],
.....: columns=['Ohio', 'Nevada'])
.....:
In [119]: right2 = DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])
# merge defaults to combining on overlapping column names
In [120]: pd.merge(df1, df2)
Out[120]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
# But it is a good idea to specify the column explicitly
In [121]: pd.merge(df1, df2, on='key')
Out[121]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
# If column names are different, we can specify separate ones for each frame
In [122]: pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[122]:
data1 lkey data2 rkey
0 0 b 1 b
1 1 b 1 b
2 6 b 1 b
3 2 a 0 a
4 4 a 0 a
5 5 a 0 a
# Keep all (union) of the row labels by using 'how=' param
In [123]: pd.merge(df1, df2, how='outer')
Out[123]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
6 3 c NaN
7 NaN d 2
# Use the index of the right frame for the merge
In [124]: pd.merge(left1, right1, left_on='key', right_index=True)
Out[124]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
# Hierarchical indexes work too, but are more complicated
In [125]: lefth
Out[125]:
data key1 key2
0 0 Ohio 2000
1 1 Ohio 2001
2 2 Ohio 2002
3 3 Nevada 2001
4 4 Nevada 2002
In [126]: righth
Out[126]:
event1 event2
Nevada 2001 0 1
2000 2 3
Ohio 2000 4 5
2000 6 7
2001 8 9
2002 10 11
In [127]: pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
Out[127]:
data key1 key2 event1 event2
0 0 Ohio 2000 4 5
0 0 Ohio 2000 6 7
1 1 Ohio 2001 8 9
2 2 Ohio 2002 10 11
3 3 Nevada 2001 0 1
In [128]: pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True,
.....: how='outer')
.....:
Out[128]:
data key1 key2 event1 event2
0 0 Ohio 2000 4 5
0 0 Ohio 2000 6 7
1 1 Ohio 2001 8 9
2 2 Ohio 2002 10 11
3 3 Nevada 2001 0 1
4 4 Nevada 2002 NaN NaN
4 NaN Nevada 2000 2 3
# Merging on index
In [129]: pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Out[129]:
Ohio Nevada Missouri Alabama
a 1 2 NaN NaN
b NaN NaN 7 8
c 3 4 9 10
d NaN NaN 11 12
e 5 6 13 14
The DataFrame also has a convenient join method for merging on the index. This is used when you have objects with similar row labels, but different columns.
# Should be the same as the last example
In [130]: left2.join(right2, how='outer')
Out[130]:
Ohio Nevada Missouri Alabama
a 1 2 NaN NaN
b NaN NaN 7 8
c 3 4 9 10
d NaN NaN 11 12
e 5 6 13 14
Before you jump into the exercises you will need to download a number of data sets. Below I will give you the URLs for the data:
Todo
Using the movie-lens 1M data and pandas.read_table, read all three files (users, ratings, movies) into pandas DataFrames. I recommend giving column names directly to read_table in each case.
HINT: you will need to open the files in a text editor to see what the separator character is. Also, the columns for each file correspond to the following:
Use the data combining tools discussed above to combine these three objects into a single object named movie_data
Todo
Use the movie_data object from the previous exercise and compute the following things:
The 5 movies with the greatest number of ratings
A new object called active_titles that is made up of movies each having at least 250 ratings
HINT: For some of these operations it might be helpful to compute a pivot table using the pivot_table method of a DataFrame.
NOTE: You might also want to do some analysis on the movie genre. To do this you would have to be comfortable with using pandas vectorized string methods and probably some regular expressions. Those topics are a bit bigger than what we can cover here, but if you are looking for more practice, they are a good place to start.
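To give a flavor of pivot_table outside of the exercise, here is a sketch using the tips data from earlier (note that older pandas versions spell the keyword arguments rows/cols, while newer ones use index/columns):

tips.pivot_table('tip_pct', rows='sex', cols='smoker', aggfunc='mean')  # mean tip_pct by sex and smoker status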
Todo
Using the baby names data set, define a pandas object called names_data that includes all the years.
HINT: You should probably do this in a for loop. Also, to get a list of all the text files in a directory (so you can loop over them), use the following code:
from glob import glob
files = glob('*.TXT')
Todo
Using the names_data object you just created, generate the following items:
Todo
Use the built-in Python json module to load the food data into a Python list. Your code should look like this:
import json
db = json.load(open('foods-2011-10-03.json'))
db will be a list of 6636 python dictionaries, each containing nutritional information for a different food item. Each dictionary will have the following keys:
Create a DataFrame of meta_data using the description, group, id, and manufacturer items in each dictionary.
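One way to do this (a sketch; DataFrame accepts a list of dicts directly, and the columns argument selects which keys to keep):

meta_data = DataFrame(db, columns=['description', 'group', 'id', 'manufacturer'])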
Todo
Loop over db and construct a list of DataFrames containing the nutritional information for each record in db. Make sure to add a column to each of these DataFrames that contains the unique food id (the id key in the dictionary). HINT: This food id will be constant within each of these DataFrames.
Finally, use the pandas combining techniques to create a nutrients DataFrame. After you have done this, drop duplicate entries in this DataFrame. For example, if you had named the object nuts you would do
nuts = nuts.drop_duplicates()
Todo
Use the rename method to make sure that the description and group columns are un-ambiguous for both the meta_data DataFrame and the nutrients DataFrame (These column names are duplicated because every food has a description and group and each nutrient also has those identifiers). HINT: you will need to use the columns and copy arguments of the rename method.
Finally, use the data combining routines to come up with a foods DataFrame containing all the meta_data and nutritional information. Make sure to do an outer style merge on the correct columns.
Todo
Using the foods DataFrame you have been building, compute the following things: